lucenenet-dev mailing list archives

From "Jeff Rodenburg" <jeff.rodenb...@gmail.com>
Subject Re: noobie question
Date Mon, 22 May 2006 21:43:48 GMT
You could certainly load a 7GB index into memory, given sufficient hardware
running 64-bit Windows.  That said, I wouldn't suggest trying to carry a
single 7GB index in a single server's memory.
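If you do go that route, here's a rough sketch of what it looks like in
Lucene.Net -- this assumes the 1.4-era API we're on, and the index path is
made up:

```csharp
using Lucene.Net.Search;
using Lucene.Net.Store;

public class RamSearchExample
{
    public static IndexSearcher OpenInRam(string indexPath)
    {
        // RAMDirectory copies the on-disk index into process memory,
        // so the whole index must fit in your address space -- feasible
        // on 64-bit Windows with enough RAM, a non-starter at 7GB on 32-bit.
        RAMDirectory ramDir = new RAMDirectory(indexPath);
        return new IndexSearcher(ramDir);
    }
}
```

Keep in mind the initial copy is itself one big sequential read, so expect a
long startup before the first query comes back.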

Keeping an index below a 2GB threshold only treats a symptom and isn't
really sustainable if your index is already in the 7GB range.  The issue at
hand is dealing with the indexed data as efficiently as possible.  Following
George's suggestion for stripping the index down, i.e. just indexing the
searchable entities, is one possible approach.  In our situation, we have
quite a few fields of data that would be performance hits elsewhere on our
system to retrieve at search run-time, so the lesser evil is to include them
in our index.  It just depends on your requirements to determine what's best.
Likewise, monitoring your hardware statistics for bottlenecks isn't
invalid, but I doubt you'll be able to make the modifications necessary to
achieve the results you'd like to see through hardware config changes alone.
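To make the "searchable entities only" idea concrete, here's a rough sketch
using the 1.4-era Field helpers (the field names are invented):
Field.UnStored indexes the text without storing it, which is what keeps the
index files small.

```csharp
using Lucene.Net.Documents;

public class LeanDocExample
{
    public static Document Build(string id, string bodyText)
    {
        Document doc = new Document();
        // Keyword: stored and indexed untokenized -- fine for a small
        // identifier you need back at search time.
        doc.Add(Field.Keyword("id", id));
        // UnStored: tokenized and indexed but NOT stored, so the large
        // body contributes postings without bloating the stored fields.
        doc.Add(Field.UnStored("body", bodyText));
        return doc;
    }
}
```

You'd then retrieve the full body from your primary store by id; whether
that round-trip is cheaper than carrying a bigger index is exactly the
trade-off I described above.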

Based on the conversation we've had thus far and a few assumptions on my
part, I doubt you'll be able to keep your search times anywhere near the
thresholds you'd like to see.  You can help yourself with reduced index
size, tweaked hardware configurations, and indexing strategies, but there is
no silver bullet here.  If my experiences hold true for you, you'll end up
addressing each of these areas as you look for efficiencies of scale.
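For the index-splitting strategy specifically, the shape of it is roughly
this (paths invented; again assuming the 1.4-era API):

```csharp
using Lucene.Net.Search;

public class SplitIndexExample
{
    public static Searcher OpenAll(string[] indexPaths)
    {
        Searchable[] searchers = new Searchable[indexPaths.Length];
        for (int i = 0; i < indexPaths.Length; i++)
        {
            searchers[i] = new IndexSearcher(indexPaths[i]);
        }
        // MultiSearcher merges hits across all the sub-indexes; on
        // multi-proc hardware, ParallelMultiSearcher searches them
        // concurrently with the same semantics.
        return new MultiSearcher(searchers);
    }
}
```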

-- j

On 5/22/06, George Aroush <george@aroush.net> wrote:
>
> Hi Pam and Jeff,
>
> You can't load 7Gb of index into memory.  A typical Windows application
> can't access more then 2Gb of RAM -- so if a machine has 8Gg and only
> Lucene
> is running chance are that you still have a lot of real memory not being
> used.
>
> You need to investigate and find out why your index grew to 7GB and reduce
> its size.  For example, are you storing any data in Lucene's index?  If so,
> consider not doing so.
>
> Monitor your CPU and see whether or not it is being maxed out.  Chances are
> that it is, and if queries are still taking long to run, then your focus
> should be on disk I/O.
>
> Regards,
>
> -- George Aroush
>
>
> -----Original Message-----
> From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> Sent: Saturday, May 20, 2006 11:18 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: noobie question
>
> - Our index is currently 7 Gigs. I take it we should have more than 7 Gigs
> of RAM on our machine? Can we get any other hardware specs? IE 2, 4 procs?
>
> You can go with big RAM, but I haven't found that to be a huge boost in
> search perf.  We run dual-proc Xeons for our search servers, as CPU has
> been the bottleneck.  Sorts are particularly egregious when it comes to
> CPU load as well.  Bang for the buck, the new dual-core Opterons are
> *amazingly* strong performers.
>
> - Each html doc we have has 10 metatags which we store. Other than date,
> and
> a 10 byte string for one of the metatags, the metatags are almost always
> empty. Will this degrade performance?
>
> I would not expect this to degrade your performance.
>
> - Also when you suggest we distribute our index, on what criteria do we
> partition? It looks like we need to optimize our I/O for reads, which means
> RAID 5 or a solid-state RAM drive to me. Is this correct? Could we perhaps
> cache it in RAM (file system cache) by issuing warm-up queries?
>
> The faster your disk, the better.  And yes, warm-up queries are a big help.
> In our instance, warm-up queries need to be logically distributed to hit
> all the searchers.
>
>
> On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> >
> > Hi George
> >
> > Our index is currently 7 Gigs. I take it we should have more than 7
> > Gigs of RAM on our machine? Can we get any other hardware specs? IE 2,
> > 4 procs?
> >
> > Each html doc we have has 10 metatags which we store. Other than date,
> > and a 10 byte string for one of the metatags, the metatags are almost
> > always empty. Will this degrade performance?
> >
> > Also when you suggest we distribute our index, on what criteria do we
> > partition? It looks like we need to optimize our I/O for reads, which
> > means RAID 5 or a solid-state RAM drive to me. Is this correct? Could
> > we perhaps cache it in RAM (file system cache) by issuing warm-up
> > queries?
> >
> > BTW - we will be running on the wintel platform using c#.
> >
> > TIA
> >
> > Pam
> >
> >
> > On 5/19/06, George Aroush <george@aroush.net> wrote:
> > >
> > > Hi Pam,
> > >
> > > You also need to investigate your hardware configuration.  Besides
> > > the usual advice of having a fast CPU and maxing out your memory,
> > > make sure you have a fast hard drive.
> > >
> > > As a Lucene index grows, anything you do with Lucene becomes I/O
> > > bound, thus a fast hard drive is critical.  Simply moving from
> > > 5400rpm to 7200rpm will give you a noticeable difference -- switch to
> > > a fast SCSI/RAID hard drive and you will see even better results.
> > > Better yet, distribute your index across multiple hard
> > > drives/partitions.
> > >
> > > One other thing to look for: are you storing any data in your Lucene
> > > index?  If so, consider not doing it.  The goal is to keep the index
> > > size as small as possible to reduce I/O.
> > >
> > > Good luck.
> > >
> > > -- George Aroush
> > >
> > > -----Original Message-----
> > > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > > Sent: Friday, May 19, 2006 4:28 PM
> > > To: lucene-net-dev@incubator.apache.org
> > > Subject: Re: noobie question
> > >
> > > Yes, the merge parameters do affect indexing performance, but
> > > compactness also affects search performance as your index gets
> > > larger.  As you incrementally update the index, the fragmentation
> > > effect (which the merge properties dictate) causes performance
> > > degradation at search time.
> > >
> > > As for index size, I don't know about any hard and fast rules.  We
> > > have about 7-8GB of indexes of varying structure, and those are
> > > spread out over about 40 indexes.  We try to keep individual indexes
> > > below 300MB, as the operational hassles after that size seem to be
> > > more burdensome.  We also use distributed searching so our indexes
> > > are allocated across multiple machines (no duplication).  As a rule,
> > > we also try to stay below 2.5GB of aggregate indexes on one machine.
> > > Our indexes are a full corpus and we must search across all indexes
> > > all the time.  You can structure your indexes more effectively if you
> > > don't need to search the full corpus all the time.
> > >
> > > With multiple indexes being searched collectively, you'll soon be
> > > using the MultiSearcher class.  Be sure to look at MultiReader, as
> > > it makes a difference in search performance (nice caching).
> > >
> > > -- j
> > >
> > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > >
> > > > Hi Jeff
> > > >
> > > > A couple more questions. Don't the merge parameters determine how
> > > > aggressively the index is compacted? And if so, doesn't this
> > > > affect only indexing performance and not search performance?
> > > >
> > > > Secondly, how large should each index be? Should I be partitioning
> > > > the indexes, i.e. by date range? So one index for December 2005,
> > > > one for January, etc.? Or is it done by size?
> > > >
> > > > TIA
> > > >
> > > > Pam
> > > >
> > > > On 5/19/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
> > > > >
> > > > > Hi Pamela -
> > > > >
> > > > > Performance certainly changes as your index grows, and it's not
> > > > > even necessarily a linear progression.  How you indexed your
> > > > > data, compression factors, compound vs. loose file format, number
> > > > > of indexes, etc. all play a part in affecting search performance
> > > > > at runtime.
> > > > >
> > > > > There are a lot of places to look for improvements.  I would
> > > > > suggest looking at your specific indexes and seeing if you can
> > > > > break those up into smaller indexes -- this will lead you to the
> > > > > MultiSearcher (and, if you have multi-processor hardware,
> > > > > ParallelMultiSearcher).
> > > > >
> > > > > Leave your index updating operation out of the picture for the
> > > > > moment.  Indexing can have a big impact on search performance, so
> > > > > take that out of the equation.  After you're able to get better
> > > > > runtime search performance, go back and add indexing to the mix.
> > > > > I can tell you from experience that most search systems with
> > > > > indexes of substantial size run their indexing operations on
> > > > > separate systems to avoid performance impacts.
> > > > >
> > > > > Hope this helps.
> > > > >
> > > > > -- j
> > > > >
> > > > >
> > > > >
> > > > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > > > >
> > > > > > I have been developing a C# search solution for an application
> > > > > > which has tens of millions of web pages. Most of these web
> > > > > > pages are under 1k.
> > > > > >
> > > > > > While our initial pilot was very encouraging on our tests of
> > > > > > 1,000,000 docs, since we scaled up to 10 million docs,
> > > > > > previously subsecond searches are now taking 8-10 seconds.
> > > > > >
> > > > > > Where should I focus my efforts to increase search speed?
> > > > > > Should I be using the RAMDirectory? MultiSearcher?
> > > > > >
> > > > > > We only have one machine right now which serves indexing and
> > > > > > searching.
> > > > > >
> > > > > > TIA
> > > > > >
> > > > > > Pam
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
