lucenenet-dev mailing list archives

From "George Aroush" <geo...@aroush.net>
Subject RE: noobie question
Date Thu, 25 May 2006 01:46:24 GMT
For my solution, the only thing I store in the Lucene index is the primary
key.  This kind of solution allows me to keep the Lucene index as small as
possible, which means searching and updating the index is fast.

Anything that happens post-search -- extracting hit snippets, highlighting,
etc. -- is done by another process which I can easily host on another server.

If you design your system along those lines, you can provide a scalable
solution.  Also, I would suggest that you design your solution for fast
searching first, and take care of indexing, highlighting, etc later.
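
To make the split concrete, here is a minimal sketch of the idea, written in
Python rather than C# and using no Lucene API at all.  A toy in-memory
inverted index stands in for Lucene and sqlite3 stands in for the real
database.  The table, the sample rows, and every function name are
assumptions made up for illustration.

```python
import sqlite3

# Hypothetical sample rows standing in for the real database.
ROWS = [
    (1, "Lucene index tuning", "Keep the index small so searches stay fast."),
    (2, "Disk I/O basics", "A fast hard drive helps once the index grows."),
    (3, "Highlighting hits", "Snippets are built after the search finishes."),
]

def build_database():
    """Create an in-memory database holding the full documents."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    db.executemany("INSERT INTO docs VALUES (?, ?, ?)", ROWS)
    return db

def build_index(db):
    """Toy inverted index: term -> set of primary keys. No text is stored."""
    index = {}
    for doc_id, title, body in db.execute("SELECT id, title, body FROM docs"):
        for term in (title + " " + body).lower().replace(".", " ").split():
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, term):
    """Search step: returns primary keys only."""
    return sorted(index.get(term.lower(), set()))

def fetch_hits(db, doc_ids):
    """Post-search step: load display data from the database by key."""
    if not doc_ids:
        return []
    placeholders = ",".join("?" * len(doc_ids))
    query = f"SELECT id, title FROM docs WHERE id IN ({placeholders}) ORDER BY id"
    return db.execute(query, doc_ids).fetchall()

db = build_database()
index = build_index(db)
hit_ids = search(index, "index")   # the index only knows primary keys
hits = fetch_hits(db, hit_ids)     # titles come from the database afterwards
```

The search step never touches document text, so the post-search step
(fetch_hits here) could run in a separate process or on another server.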

-- George Aroush

-----Original Message-----
From: Pamela Foxcroft [mailto:pamelafoxcroft@gmail.com] 
Sent: Wednesday, May 24, 2006 12:11 PM
To: lucene-net-dev@incubator.apache.org
Subject: Re: noobie question

Hi Jeff & George

OK, I guess we are storing a lot of data in our index. Basically we are
storing 10 metatags and their values. The only ones which are always populated
are our primary key value and our date value (we are indexing a database).
The rest are almost always empty.

Pam


On 5/23/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
>
> Hi Pam -
>
> > I am confused, what do you mean by storing data in my index?
> (George, correct me if I'm wrong here.)
>
> What George is referring to is the different manners in which data can 
> be included in an index.  Take a look at the Field class and you'll 
> notice a series of static methods that store data in a number of ways.  
> The static methods define four different ways to include data in an 
> index -- Keyword, UnIndexed, UnStored, and Text.  These are just
> wrapper definitions for indexing, storing and tokenizing index
> information.
>
> "Indexing" means including data with a field that would be searchable.
> "Storing" means including data with a field for presentation.
> "Tokenizing" means using analyzed data with a field that's been 
> designated as indexed (searchable).
>
> For the four static methods:
> Keyword - values are indexed (searchable) and stored but not tokenized
> UnIndexed - values are stored but not indexed or tokenized
> UnStored - values are indexed and tokenized (searchable) but not stored
> Text - values are indexed, tokenized and stored
>
> In making decisions about index composition, choose the field storage 
> method that best matches the need for your particular data field.  The 
> fewer data fields you need, the smaller the index, the better the 
> performance.
>
>
> > Thanks to you and Jeff for all of your help! I really appreciate it!
> That's why the list is here.  :-)
>
> -- j
>
>
>
> On 5/23/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> >
> > Hi George
> >
> > I am confused, what do you mean by storing data in my index?
> >
> > Thanks to you and Jeff for all of your help! I really appreciate it!
> >
> > Pam
> >
> >
> > On 5/22/06, George Aroush <george@aroush.net> wrote:
> > >
> > > Hi Pam and Jeff,
> > >
> > > > You can't load 7 GB of index into memory.  A typical Windows
> > > > application can't access more than 2 GB of RAM -- so if a machine
> > > > has 8 GB and only Lucene is running, chances are that you still
> > > > have a lot of real memory not being used.
> > >
> > > > You need to investigate and find out why your index grew to 7 GB
> > > > and reduce its size.  For example, are you storing any data in
> > > > Lucene's index?  If so, consider not doing so.
> > >
> > > > Monitor your CPU and see whether it is being maxed out.  Chances
> > > > are that it is, and if queries are still taking long to run then
> > > > your focus should be on disk I/O.
> > >
> > > Regards,
> > >
> > > -- George Aroush
> > >
> > >
> > > -----Original Message-----
> > > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > > Sent: Saturday, May 20, 2006 11:18 AM
> > > To: lucene-net-dev@incubator.apache.org
> > > Subject: Re: noobie question
> > >
> > > > - Our index is currently 7 Gigs. I take it we should have more
> > > > than 7 Gigs of RAM on our machine? Can we get any other hardware
> > > > specs? IE 2, 4 procs?
> > >
> > > > You can go with big RAM, but I haven't found that to be a huge
> > > > boost in search perf.  We run dual-proc Xeons for our search
> > > > servers, as CPU has been the bottleneck.  Sorts are particularly
> > > > egregious when it comes to CPU load as well.  Bang for the buck,
> > > > the new dual-core Opterons are *amazingly* strong performers.
> > >
> > > > - Each html doc we have has 10 metatags which we store. Other than
> > > > date, and a 10 byte string for one of the metatags, the metatags
> > > > are almost always empty. Will this degrade performance?
> > >
> > > I would not expect this to degrade your performance.
> > >
> > > > - Also when you suggest we distribute our index, on what criteria
> > > > do we partition? It looks like we need to optimize our IO for
> > > > reads, which means raid 5 or a solid state ram drive to me. Is
> > > > this correct? Could we perhaps cache it in ram (file system cache)
> > > > by issuing warm up queries?
> > >
> > > > The faster your disk, the better.  And yes, warm-up queries are a
> > > > big help.  In our instance, warm-up queries need to be logically
> > > > distributed to hit all the searchers.
> > >
> > >
> > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > >
> > > > Hi George
> > > >
> > > > > Our index is currently 7 Gigs. I take it we should have more
> > > > > than 7 Gigs of RAM on our machine? Can we get any other hardware
> > > > > specs? IE 2, 4 procs?
> > > >
> > > > > Each html doc we have has 10 metatags which we store. Other than
> > > > > date, and a 10 byte string for one of the metatags, the metatags
> > > > > are almost always empty. Will this degrade performance?
> > > >
> > > > > Also when you suggest we distribute our index, on what criteria
> > > > > do we partition? It looks like we need to optimize our IO for
> > > > > reads, which means raid 5 or a solid state ram drive to me. Is
> > > > > this correct? Could we perhaps cache it in ram (file system
> > > > > cache) by issuing warm up queries?
> > > >
> > > > BTW - we will be running on the wintel platform using c#.
> > > >
> > > > TIA
> > > >
> > > > Pam
> > > >
> > > >
> > > > On 5/19/06, George Aroush <george@aroush.net> wrote:
> > > > >
> > > > > Hi Pam,
> > > > >
> > > > > > You also need to investigate your hardware configuration.
> > > > > > Besides the usual of having a fast CPU and maxing out your
> > > > > > memory, make sure you have a fast hard drive.
> > > > >
> > > > > > As a Lucene index grows, anything you do with Lucene becomes
> > > > > > I/O bound, thus a fast hard drive is critical.  Simply moving
> > > > > > from 5400rpm to 7200rpm will give you a noticeable difference --
> > > > > > switch to a fast SCSI/RAID hard drive and you will see even
> > > > > > better results.  And better still if you distribute your index
> > > > > > across multiple hard-drives/portions.
> > > > >
> > > > > > One other thing to look for: are you storing any data in your
> > > > > > Lucene index?  If so, consider not doing it.  The goal is to
> > > > > > keep the index size as small as possible to reduce I/O.
> > > > >
> > > > > Good luck.
> > > > >
> > > > > -- George Aroush
> > > > >
> > > > > -----Original Message-----
> > > > > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > > > > Sent: Friday, May 19, 2006 4:28 PM
> > > > > To: lucene-net-dev@incubator.apache.org
> > > > > Subject: Re: noobie question
> > > > >
> > > > > > Yes, the merge parameters do affect indexing performance,
> > > > > > but compactness also affects search performance as your index
> > > > > > gets larger.  As you incrementally update the index, the
> > > > > > fragmentation effect (which the merge properties will dictate)
> > > > > > causes performance degradation at search time.
> > > > >
> > > > > > As for index size, I don't know about any hard and fast rules.
> > > > > > We have about 7-8GB of indexes of varying structure, and those
> > > > > > are spread out over about 40 indexes.  We try to keep individual
> > > > > > indexes below 300MB, as the operational hassles after that size
> > > > > > seem to be more burdensome.  We also use distributed searching
> > > > > > so our indexes are allocated across multiple machines (no
> > > > > > duplication).  As a rule, we also try to stay below 2.5GB of
> > > > > > aggregate indexes on one machine.  Our indexes are a full corpus
> > > > > > and we must search across all indexes all the time.  You can
> > > > > > structure your indexes more effectively if you don't need to
> > > > > > search the full corpus all the time.
> > > > >
> > > > > > With multiple indexes being searched collectively, you'll soon
> > > > > > be using the MultiSearcher class.  Be sure to look at
> > > > > > MultiReader, as it makes a difference in search performance
> > > > > > (nice caching).
> > > > >
> > > > > -- j
> > > > >
> > > > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > > > >
> > > > > > Hi Jeff
> > > > > >
> > > > > > > A couple more questions. Don't the merge parameters
> > > > > > > determine how aggressively the index is compacted? And if so,
> > > > > > > doesn't this affect only indexing performance and not search
> > > > > > > performance?
> > > > > >
> > > > > > > Secondly how large should each index be? Should I be
> > > > > > > partitioning the indexes, ie by date range? So one index for
> > > > > > > December 2005, one for January, etc? Or is it done by size?
> > > > > >
> > > > > > TIA
> > > > > >
> > > > > > Pam
> > > > > >
> > > > > > > On 5/19/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Pamela -
> > > > > > >
> > > > > > > > Performance certainly changes as your index grows, and it's
> > > > > > > > not even necessarily a linear progression.  How you indexed
> > > > > > > > your data, compression factors, compound vs. loose file
> > > > > > > > format, number of indexes, etc. all play a part in affecting
> > > > > > > > search performance at runtime.
> > > > > > >
> > > > > > > > There are a lot of places to look for improvements.  I
> > > > > > > > would suggest looking at your specific indexes and seeing if
> > > > > > > > you can break those up into smaller indexes -- this will
> > > > > > > > lead you to the MultiSearcher (and, if you have
> > > > > > > > multi-processor hardware, ParallelMultiSearcher).
> > > > > > >
> > > > > > > > Leave your index updating operation out of the picture for
> > > > > > > > the moment.  Indexing can have a big impact on search
> > > > > > > > performance, so take that out of the equation.  After you're
> > > > > > > > able to get to better runtime search performance, go back
> > > > > > > > and add indexing to the mix.  I can tell you from experience
> > > > > > > > that most search systems with indexes of substantial size
> > > > > > > > are executing indexing operations on separate systems to
> > > > > > > > avoid performance impacts.
> > > > > > >
> > > > > > > Hope this helps.
> > > > > > >
> > > > > > > -- j
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I have been developing a C# search solution for an
> > > > > > > > > application which has tens of millions of web pages. Most
> > > > > > > > > of these web pages are under 1k.
> > > > > > > >
> > > > > > > > > While our initial pilot was very encouraging on our
> > > > > > > > > tests of 1,000,000 docs, when we scaled up to 10 million,
> > > > > > > > > subsecond searches are now taking 8-10 seconds.
> > > > > > > >
> > > > > > > > > Where should I focus my efforts to increase search speed?
> > > > > > > > > Should I be using the RAMDirectory? MultiSearcher?
> > > > > > > >
> > > > > > > > > We only have one machine right now which serves indexing
> > > > > > > > > and searching.
> > > > > > > >
> > > > > > > > TIA
> > > > > > > >
> > > > > > > > Pam
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>

