lucenenet-dev mailing list archives

From "Pamela Foxcroft" <pamelafoxcr...@gmail.com>
Subject Re: noobie question
Date Tue, 23 May 2006 17:51:42 GMT
Hi George

I am confused, what do you mean by storing data in my index?

Thanks to you and Jeff for all of your help! I really appreciate it!

Pam


On 5/22/06, George Aroush <george@aroush.net> wrote:
>
> Hi Pam and Jeff,
>
> You can't load 7 GB of index into memory.  A typical Windows application
> can't access more than 2 GB of RAM -- so if a machine has 8 GB and only
> Lucene is running, chances are that you still have a lot of real memory
> not being used.
>
> You need to investigate and find out why your index grew to 7 GB and
> reduce its size.  For example, are you storing any data in Lucene's
> index?  If so, consider not doing so.
>
> Monitor your CPU and see whether it is being maxed out.  Chances are
> that it is, and if queries are still taking long to run then your focus
> should be on disk I/O.
>
> Regards,
>
> -- George Aroush
>
>
> -----Original Message-----
> From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> Sent: Saturday, May 20, 2006 11:18 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: noobie question
>
> - Our index is currently 7 Gigs. I take it we should have more than 7 Gigs
> of RAM on our machine? Can we get any other hardware specs? I.e., 2 or 4
> procs?
>
> You can go with big RAM, but I haven't found that to be a huge boost in
> search perf.  We run dual-proc Xeons for our search servers, as CPU has
> been the bottleneck.  Sorts are particularly egregious when it comes to
> CPU load as well.  Bang for the buck, the new dual-core Opterons are
> *amazingly* strong performers.
>
> - Each HTML doc we have has 10 metatags which we store. Other than date,
> and a 10 byte string for one of the metatags, the metatags are almost
> always empty. Will this degrade performance?
>
> I would not expect this to degrade your performance.
>
> - Also when you suggest we distribute our index, on what criteria do we
> partition? It looks like we need to optimize our IO for reads which means
> RAID 5 or a solid state RAM drive to me. Is this correct? Could we perhaps
> cache it in RAM (file system cache) by issuing warm up queries?
>
> The faster your disk, the better.  And yes, warm-up queries are a big
> help.  In our instance, warm-up queries need to be logically distributed
> to hit all the searchers.
>
>
> On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> >
> > Hi George
> >
> > Our index is currently 7 Gigs. I take it we should have more than 7
> > Gigs of RAM on our machine? Can we get any other hardware specs? I.e.,
> > 2 or 4 procs?
> >
> > Each HTML doc we have has 10 metatags which we store. Other than date,
> > and a 10 byte string for one of the metatags, the metatags are almost
> > always empty. Will this degrade performance?
> >
> > Also when you suggest we distribute our index, on what criteria do we
> > partition? It looks like we need to optimize our IO for reads which
> > means RAID 5 or a solid state RAM drive to me. Is this correct? Could
> > we perhaps cache it in RAM (file system cache) by issuing warm up
> > queries?
> >
> > BTW - we will be running on the Wintel platform using C#.
> >
> > TIA
> >
> > Pam
> >
> >
> > On 5/19/06, George Aroush <george@aroush.net> wrote:
> > >
> > > Hi Pam,
> > >
> > > You also need to investigate your hardware configuration.  Besides
> > > the usual of having a fast CPU and maxing out your memory, make sure
> > > you have a fast hard drive.
> > >
> > > As a Lucene index grows, anything you do with Lucene becomes I/O
> > > bound, thus a fast hard drive is critical.  Simply moving from
> > > 5400rpm to 7200rpm will give you a noticeable difference -- switch to
> > > a fast SCSI/RAID hard drive and you will see even better results.
> > > Better yet, distribute your index across multiple
> > > hard-drives/portions.
> > >
> > > One other thing to look for: are you storing any data in your Lucene
> > > index?  If so, consider not doing it.  The goal is to keep the index
> > > size as small as possible to reduce I/O.
> > >
> > > Good luck.
> > >
> > > -- George Aroush
> > >
> > > -----Original Message-----
> > > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > > Sent: Friday, May 19, 2006 4:28 PM
> > > To: lucene-net-dev@incubator.apache.org
> > > Subject: Re: noobie question
> > >
> > > Yes, the merge parameters do affect indexing performance, but
> > > compactness also affects search performance as your index gets
> > > larger.  As you incrementally update the index, the fragmentation
> > > effect (which the merge properties will dictate) causes performance
> > > degradation at search time.
> > >
> > > As for index size, I don't know about any hard and fast rules.  We
> > > have about 7-8GB of indexes of varying structure, and those are
> > > spread out over about 40 indexes.  We try to keep individual indexes
> > > below 300MB, as the operational hassles after that size seem to be
> > > more burdensome.  We also use distributed searching so our indexes
> > > are allocated across multiple machines (no duplication).  As a rule,
> > > we also try to stay below 2.5GB of aggregate indexes on one machine.
> > > Our indexes are a full corpus and we must search across all indexes
> > > all the time.  You can structure your indexes more effectively if you
> > > don't need to search the full corpus all the time.
> > >
> > > With multiple indexes being searched collectively, you'll soon be
> > > using the MultiSearcher class.  Be sure to look at MultiReader, as
> > > it makes a difference in search performance (nice caching).
> > >
> > > -- j
> > >
> > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > >
> > > > Hi Jeff
> > > >
> > > > A couple more questions. Don't the merge parameters determine how
> > > > aggressively the index is compacted? And if so, doesn't this
> > > > affect only indexing performance and not search performance?
> > > >
> > > > Secondly how large should each index be? Should I be partitioning
> > > > the indexes, i.e. by date range? So one index for December 2005,
> > > > one for January, etc? Or is it done by size?
> > > >
> > > > TIA
> > > >
> > > > Pam
> > > >
> > > > On 5/19/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
> > > > >
> > > > > Hi Pamela -
> > > > >
> > > > > Performance certainly changes as your index grows, and it's not
> > > > > even necessarily a linear progression.  How you indexed your
> > > > > data, compression factors, compound vs. loose file format, number
> > > > > of indexes, etc. all play a part in affecting search performance
> > > > > at runtime.
> > > > >
> > > > > There are a lot of places to look for improvements.  I would
> > > > > suggest looking at your specific indexes and seeing if you can
> > > > > break those up into smaller indexes -- this will lead you to the
> > > > > MultiSearcher (and, if you have multi-processor hardware,
> > > > > ParallelMultiSearcher).
> > > > >
> > > > > Leave your index updating operation out of the picture for the
> > > > > moment.  Indexing can have a big impact on search performance, so
> > > > > take that out of the equation.  After you're able to get to better
> > > > > runtime search performance, go back and add indexing to the mix.
> > > > > I can tell you from experience that most search systems with
> > > > > indexes of substantial size are executing indexing operations on
> > > > > separate systems to avoid performance impacts.
> > > > >
> > > > > Hope this helps.
> > > > >
> > > > > -- j
> > > > >
> > > > >
> > > > >
> > > > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > > > >
> > > > > > I have been developing a C# search solution for an application
> > > > > > which has tens of millions of web pages. Most of these web
> > > > > > pages are under 1k.
> > > > > >
> > > > > > While our initial pilot was very encouraging on our tests of
> > > > > > 1,000,000 docs, when we scaled up to 10 million, searches that
> > > > > > were subsecond are now taking 8-10 seconds.
> > > > > >
> > > > > > Where should I focus my efforts to increase search speed?
> > > > > > Should I be using the RAMDirectory? MultiSearcher?
> > > > > >
> > > > > > We only have one machine right now which serves indexing and
> > > > > > searching.
> > > > > >
> > > > > > TIA
> > > > > >
> > > > > > Pam
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
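
On the question at the top of this message ("what do you mean by storing
data in my index?"): in Lucene a field can be indexed so that it is
searchable and, independently, stored so that its original text is kept
inside the index and can be returned with a hit. The following is only a toy
sketch of that distinction, not Lucene or Lucene.Net code; the class and
method names (ToyIndex, add, search, stored_bytes) are hypothetical and exist
purely for illustration.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index (hypothetical; this is NOT the Lucene API)."""

    def __init__(self, store_text):
        self.store_text = store_text      # keep the original text in the index?
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.stored = {}                  # doc id -> original text, if stored

    def add(self, doc_id, text):
        # Indexing: every term points back to the document id.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        # Storing: optionally keep a full copy of the text as well.
        if self.store_text:
            self.stored[doc_id] = text

    def search(self, term):
        # Only the postings are consulted; stored text plays no part.
        return sorted(self.postings.get(term.lower(), set()))

    def stored_bytes(self):
        # Extra space taken up by the stored copies.
        return sum(len(t.encode("utf-8")) for t in self.stored.values())


docs = {1: "fast disk helps search", 2: "search is often disk bound"}

stored_index = ToyIndex(store_text=True)
lean_index = ToyIndex(store_text=False)
for doc_id, text in docs.items():
    stored_index.add(doc_id, text)
    lean_index.add(doc_id, text)

# Both indexes answer the same query with the same document ids...
print(stored_index.search("disk"), lean_index.search("disk"))
# ...but only the first one carries copies of the original text.
print(stored_index.stored_bytes(), lean_index.stored_bytes())
```

With the lean variant, the application would use the returned id to fetch
the page from wherever it already lives (file system, database), which is
presumably what "consider not doing so" is aiming at: a smaller index means
less to read from disk per search.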
