lucenenet-dev mailing list archives

From "George Aroush" <geo...@aroush.net>
Subject RE: noobie question
Date Thu, 25 May 2006 01:33:34 GMT
Ahh, I wasn't thinking of a 64-bit OS.  Speaking of which, have you or has
anyone compiled Lucene.Net, or Java Lucene for that matter, as a 64-bit
application and got it running?

-- George Aroush

-----Original Message-----
From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com] 
Sent: Monday, May 22, 2006 5:44 PM
To: lucene-net-dev@incubator.apache.org
Subject: Re: noobie question

You could certainly load a 7gb index into memory, given sufficient hardware
running 64-bit Windows.  That said, I wouldn't suggest trying to carry a
single 7gb index in a single server's memory.
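
To put rough numbers on the 2Gb versus 64-bit point, here is a minimal
Python sketch (not Lucene.Net code).  It assumes a 32-bit Windows process
gets the default 2 GiB of user-mode address space; the function name and
the 16 GiB server are made up for illustration, and OS and application
overhead are ignored.

```python
GIB = 1024 ** 3

# Default user-mode address space of a 32-bit Windows process.
LIMIT_32BIT_PROCESS = 2 * GIB


def can_hold_in_memory(index_bytes, physical_ram_bytes, process_limit_bytes=None):
    """Return True if the whole index fits in what one process can address.

    process_limit_bytes=None means the per-process limit is not the binding
    constraint (the 64-bit case); overhead is deliberately ignored.
    """
    usable = physical_ram_bytes
    if process_limit_bytes is not None:
        usable = min(usable, process_limit_bytes)
    return index_bytes <= usable


index_size = 7 * GIB

# 32-bit process on an 8 GiB machine: the per-process limit is what bites.
on_32bit = can_hold_in_memory(index_size, 8 * GIB, LIMIT_32BIT_PROCESS)

# 64-bit process on a hypothetical 16 GiB machine.
on_64bit = can_hold_in_memory(index_size, 16 * GIB)

print("32-bit process, 8 GiB RAM:", on_32bit)
print("64-bit process, 16 GiB RAM:", on_64bit)
```

Fitting is only the first hurdle; whether carrying the whole index in one
server's memory is a good idea is a separate question.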

Keeping an index below a 2Gb threshold is only treating a symptom and isn't
really sustainable if your index is already in the 7Gb range.  The issue at
hand is dealing with the indexed data as efficiently as possible.  Following
George's suggestion for stripping the index down, i.e. just using searchable
entities, is one possible approach.  In our situation, we have quite a few
fields of data that would be performance hits elsewhere on our system to
retrieve at search run-time, so the lesser evil is to include them in our
index.  Just depends on your requirements to determine what's best.
Likewise, monitoring your hardware statistics for bottlenecks is worthwhile,
but I doubt hardware config changes alone will get you the results you'd
like to see.

Based on the conversation we've had thus far and a few assumptions on my
part, I doubt you'll be able to keep your search times anywhere near the
thresholds you'd like to see.  You can help yourself with reduced index
size, tweaked hardware configurations, and indexing strategies, but there is
no silver bullet here.  If my experiences hold true for you, you'll end up
addressing each of these areas as you look for efficiencies of scale.

-- j

On 5/22/06, George Aroush <george@aroush.net> wrote:
>
> Hi Pam and Jeff,
>
> You can't load 7Gb of index into memory.  A typical Windows 
> application can't access more than 2Gb of RAM -- so if a machine has 
> 8Gb and only Lucene is running, chances are that you still have a lot 
> of real memory not being used.
>
> You need to investigate and find out why your index grew to 7Gb and 
> reduce its size.  For example, are you storing any data in Lucene's 
> index?  If so, consider not doing so.
>
> Monitor your CPU and see whether it is being maxed out.  Chances 
> are that it is, and if queries are still taking long to run then your 
> focus should be on disk I/O.
>
> Regards,
>
> -- George Aroush
>
>
> -----Original Message-----
> From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> Sent: Saturday, May 20, 2006 11:18 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: noobie question
>
> - Our index is currently 7 Gigs. I take it we should have more than 7 
> Gigs of RAM on our machine? Can we get any other hardware specs? I.e. 
> 2 or 4 procs?
>
> You can go with big RAM, but I haven't found that to be a huge boost 
> in search perf.  We run dual-proc Xeons for our search servers, as CPU 
> has been the bottleneck.  Sorts are particularly egregious when it 
> comes to CPU load as well.  Bang for the buck, the new dual-core 
> Opterons are *amazingly* strong performers.
>
> - Each html doc we have has 10 metatags which we store. Other than 
> date, and a 10 byte string for one of the metatags, the metatags are 
> almost always empty. Will this degrade performance?
>
> I would not expect this to degrade your performance.
>
> - Also when you suggest we distribute our index, on what criteria do 
> we partition? It looks like we need to optimize our IO for reads which 
> means raid 5 or a solid state ram drive to me. Is this correct? Could 
> we perhaps cache it in ram (file system cache) by issuing warm up queries?
>
> The faster your disk, the better.  And yes, warm-up queries are a big 
> help.
> In our instance, warm up queries need to be logically distributed to 
> hit all the searchers.
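
As a rough illustration of distributing warm-up queries so that every
searcher gets hit, here is a minimal Python sketch.  It is not Lucene.Net
code: the searchers are stub functions, and the index names and query
strings are hypothetical.

```python
calls = []  # records (searcher name, query) so the coverage can be inspected


def make_stub_searcher(name):
    """Stand-in for a real searcher; a real one would read index files."""
    def search(query):
        calls.append((name, query))
        return []  # results are irrelevant for warming
    return search


def warm_up(searchers, queries):
    """Run every warm-up query against every searcher.

    Returns the number of queries issued per searcher.
    """
    issued = {}
    for name, search in searchers.items():
        for query in queries:
            search(query)  # result discarded; the point is to touch the index
        issued[name] = len(queries)
    return issued


searchers = {name: make_stub_searcher(name)
             for name in ("index_a", "index_b", "index_c")}
issued = warm_up(searchers, ["title:lucene", "body:search"])
print(issued)
```

With real searchers, the warm-up queries should resemble production
traffic (including any sorts), since that determines which parts of each
index end up in the file system cache.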
>
>
> On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> >
> > Hi George
> >
> > Our index is currently 7 Gigs. I take it we should have more than 7 
> > Gigs of RAM on our machine? Can we get any other hardware specs? I.e. 
> > 2 or 4 procs?
> >
> > Each html doc we have has 10 metatags which we store. Other than 
> > date, and a 10 byte string for one of the metatags, the metatags are 
> > almost always empty. Will this degrade performance?
> >
> > Also when you suggest we distribute our index, on what criteria do 
> > we partition? It looks like we need to optimize our IO for reads 
> > which means raid 5 or a solid state ram drive to me. Is this 
> > correct? Could we perhaps cache it in ram (file system cache) by 
> > issuing warm up queries?
> >
> > BTW - we will be running on the wintel platform using c#.
> >
> > TIA
> >
> > Pam
> >
> >
> > On 5/19/06, George Aroush <george@aroush.net> wrote:
> > >
> > > Hi Pam,
> > >
> > > You also need to investigate your hardware configuration.  Besides 
> > > the usual of having a fast CPU and maxing out your memory, make 
> > > sure you have a fast hard drive.
> > >
> > > As a Lucene index grows, anything you do with Lucene becomes I/O 
> > > bound, thus a fast hard drive is critical.  Simply moving from 
> > > 5400rpm to 7200rpm will give you a noticeable difference -- switch 
> > > to a fast SCSI/RAID hard drive and you will see even better 
> > > results.  Better yet, distribute your index across multiple 
> > > hard-drives/portions.
> > >
> > > One other thing to look for: are you storing any data in your 
> > > Lucene index?  If so, consider not doing it.  The goal is to keep 
> > > the index size as small as possible to reduce I/O.
> > >
> > > Good luck.
> > >
> > > -- George Aroush
> > >
> > > -----Original Message-----
> > > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > > Sent: Friday, May 19, 2006 4:28 PM
> > > To: lucene-net-dev@incubator.apache.org
> > > Subject: Re: noobie question
> > >
> > > Yes, the merge parameters do affect indexing performance, but 
> > > compactness also affects search performance as your index gets 
> > > larger.  As you incrementally update the index, the fragmentation 
> > > effect (which the merge properties will dictate) causes performance 
> > > degradation at search time.
> > >
> > > As for index size, I don't know about any hard and fast rules.  We 
> > > have about 7-8GB of indexes of varying structure, and those are 
> > > spread out over about 40 indexes.  We try to keep individual 
> > > indexes below 300MB, as the operational hassles after that size 
> > > seem to be more burdensome.  We also use distributed searching so 
> > > our indexes are allocated across multiple machines (no 
> > > duplication).  As a rule, we also try to stay below 2.5GB of 
> > > aggregate indexes on one machine.  Our indexes are a full corpus 
> > > and we must search across all indexes all the time.  You can 
> > > structure your indexes more effectively if you don't need to 
> > > search the full corpus all the time.
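
The sizing rules above (individual indexes under 300MB, no more than
2.5GB of aggregate indexes per machine, no duplication) can be sketched
as a small placement problem.  The Python below uses first-fit decreasing,
which is an assumption for illustration and not necessarily how the
allocation described above was done; the 40 indexes of 200MB each are
made-up figures.

```python
def allocate(index_sizes_mb, machine_cap_mb):
    """First-fit decreasing: put each index on the first machine with room.

    Every index lands on exactly one machine (no duplication).
    """
    machines = []  # each entry: {"used": MB in use, "indexes": [names]}
    by_size = sorted(index_sizes_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        if size > machine_cap_mb:
            raise ValueError(f"{name} ({size}MB) exceeds the per-machine cap")
        for machine in machines:
            if machine["used"] + size <= machine_cap_mb:
                machine["used"] += size
                machine["indexes"].append(name)
                break
        else:
            # No existing machine has room, so bring up another one.
            machines.append({"used": size, "indexes": [name]})
    return machines


# Hypothetical corpus: 40 indexes of 200MB each, 2.5GB (2560MB) cap.
sizes = {f"index_{i:02d}": 200 for i in range(40)}
machines = allocate(sizes, machine_cap_mb=2560)

for number, machine in enumerate(machines, start=1):
    print(number, machine["used"], len(machine["indexes"]))
```

With these made-up numbers each machine takes at most 12 indexes
(2400MB), so the 40 indexes need four machines.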
> > >
> > > With multiple indexes being searched collectively, you'll soon be 
> > > using the MultiSearcher class.  Be sure to look at MultiReader, as 
> > > it makes a difference in search performance (nice caching).
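
Conceptually, searching several indexes collectively means querying each
one and merging the per-index hits into a single ranked list.  The Python
sketch below illustrates only that merge step; it is not the MultiSearcher
implementation, the index names and scores are invented, and it assumes
the scores are already comparable across indexes.

```python
import heapq


def merge_top_hits(per_index_hits, n):
    """Merge per-index hit lists into the n best hits overall.

    per_index_hits maps an index name to a list of (score, doc_id) pairs.
    Returns (score, index_name, doc_id) tuples, best score first.
    """
    tagged = (
        (score, index_name, doc_id)
        for index_name, hits in per_index_hits.items()
        for score, doc_id in hits
    )
    return heapq.nlargest(n, tagged)


# Hypothetical hits returned by three separate indexes for one query.
hits = {
    "index_a": [(0.9, 12), (0.4, 7)],
    "index_b": [(0.7, 3)],
    "index_c": [(0.8, 44), (0.1, 5)],
}
top = merge_top_hits(hits, 3)
print(top)
```

Each result keeps the name of the index it came from, because document
ids are only meaningful within their own index.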
> > >
> > > -- j
> > >
> > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > >
> > > > Hi Jeff
> > > >
> > > > A couple more questions. Don't the merge parameters determine 
> > > > how aggressively the index is compacted? And if so, doesn't this 
> > > > affect only indexing performance and not search performance?
> > > >
> > > > > Secondly, how large should each index be? Should I be 
> > > > > partitioning the indexes, i.e. by date range? So one index for 
> > > > > December 2005, one for January, etc? Or is it done by size?
> > > >
> > > > TIA
> > > >
> > > > Pam
> > > >
> > > > On 5/19/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
> > > > >
> > > > > Hi Pamela -
> > > > >
> > > > > > Performance certainly changes as your index grows, and it's 
> > > > > > not even necessarily a linear progression.  How you indexed 
> > > > > > your data, compression factors, compound vs. loose file 
> > > > > > format, number of indexes, etc. all play a part in affecting 
> > > > > > search performance at runtime.
> > > > >
> > > > > > There are a lot of places to look for improvements.  I would 
> > > > > > suggest looking at your specific indexes to see if you can 
> > > > > > break those up into smaller indexes -- this will lead you to 
> > > > > > the MultiSearcher (and, if you have multi-processor hardware, 
> > > > > > ParallelMultiSearcher).
> > > > >
> > > > > > Leave your index updating operation out of the picture for the 
> > > > > > moment.  Indexing can have a big impact on search performance, 
> > > > > > so take that out of the equation.  After you're able to get to 
> > > > > > better runtime search performance, go back and add indexing to 
> > > > > > the mix.  I can tell you from experience that most search 
> > > > > > systems with indexes of substantial size are executing 
> > > > > > indexing operations on separate systems to avoid performance 
> > > > > > impacts.
> > > > >
> > > > > Hope this helps.
> > > > >
> > > > > -- j
> > > > >
> > > > >
> > > > >
> > > > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > > > >
> > > > > > > I have been developing a C# search solution for an 
> > > > > > > application which has tens of millions of web pages. Most of 
> > > > > > > these web pages are under 1k.
> > > > > > >
> > > > > > > While our initial pilot was very encouraging on our tests of 
> > > > > > > 1,000,000 docs, when we scaled up to 10 million, subsecond 
> > > > > > > searches are now taking 8-10 seconds.
> > > > > >
> > > > > > Where should I focus my efforts to increase search speed?
> > > > > > Should I be using the RAMDirectory? MultiSearcher?
> > > > > >
> > > > > > > We only have one machine right now which serves indexing and 
> > > > > > > searching.
> > > > > >
> > > > > > TIA
> > > > > >
> > > > > > Pam
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>

