lucenenet-dev mailing list archives

From "Pamela Foxcroft" <pamelafoxcr...@gmail.com>
Subject Re: noobie question
Date Sat, 20 May 2006 01:31:57 GMT
Hi George

Our index is currently 7 GB. I take it we should have more than 7 GB of
RAM on our machine? Can you suggest any other hardware specs, e.g. 2 or 4
processors?

Each html doc we have has 10 metatags which we store. Other than date, and a
10 byte string for one of the metatags, the metatags are almost always
empty. Will this degrade performance?

Also, when you suggest we distribute our index, on what criteria do we
partition? It looks like we need to optimize our I/O for reads, which to me
means RAID 5 or a solid-state RAM drive. Is this correct? Could we perhaps
cache it in RAM (file system cache) by issuing warm-up queries?

BTW - we will be running on the Wintel platform using C#.

TIA

Pam
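
[Editorial sketch, not part of the original message.] One possible reading of the partitioning question, following the date-range idea raised earlier in this thread (one index per month), is to route each document to a monthly index and have a date-restricted search open only the months it needs. The sketch below is plain Python and deliberately library-neutral. The function names and the `index-YYYY-MM` naming scheme are assumptions made for illustration and are not part of Lucene.Net.

```python
from datetime import date
from typing import List


def partition_name(doc_date: date) -> str:
    """Return the (hypothetical) monthly index name for a document date."""
    return f"index-{doc_date.year:04d}-{doc_date.month:02d}"


def partitions_for_range(start: date, end: date) -> List[str]:
    """List the monthly index names covering start..end, inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        names.append(f"index-{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

A query limited to 20 December 2005 through 3 February 2006 would then touch three monthly indexes rather than the full corpus. This only pays off if most queries carry a date restriction; if every query must cover everything, partitioning by size may be the simpler criterion.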


On 5/19/06, George Aroush <george@aroush.net> wrote:
>
> Hi Pam,
>
> You also need to investigate your hardware configuration.  Besides the
> usual advice of having a fast CPU and maxing out your memory, make sure
> you have a fast hard drive.
>
> As a Lucene index grows, anything you do with Lucene becomes I/O bound,
> so a fast hard drive is critical.  Simply moving from 5400rpm to 7200rpm
> will give you a noticeable difference -- switch to a fast SCSI/RAID hard
> drive and you will see even better results.  Better still, distribute
> your index across multiple hard drives/partitions.
>
> One other thing to look for: are you storing any data in your Lucene
> index?  If so, consider not doing it.  The goal is to keep the index
> size as small as possible to reduce I/O.
>
> Good luck.
>
> -- George Aroush
>
> -----Original Message-----
> From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> Sent: Friday, May 19, 2006 4:28 PM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: noobie question
>
> Yes, the merge parameters do affect indexing performance, but
> compactness also affects search performance as your index gets larger.
> As you incrementally update the index, the fragmentation effect (which
> the merge properties will dictate) causes performance degradation at
> search time.
>
> As for index size, I don't know about any hard and fast rules.  We have
> about 7-8GB of indexes of varying structure, and those are spread out over
> about 40 indexes.  We try to keep individual indexes below 300MB, as the
> operational hassles after that size seem to be more burdensome.  We also
> use distributed searching so our indexes are allocated across multiple
> machines (no duplication).  As a rule, we also try to stay below 2.5GB of
> aggregate indexes on one machine.  Our indexes are a full corpus and we
> must search across all indexes all the time.  You can structure your
> indexes more effectively if you don't need to search the full corpus all
> the time.
>
> With multiple indexes being searched collectively, you'll soon be using
> the MultiSearcher class.  Be sure to look at MultiReader, as it makes a
> difference in search performance (nice caching).
>
> -- j
>
> On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> >
> > Hi Jeff
> >
> > A couple more questions. Don't the merge parameters determine how
> > aggressively the index is compacted? And if so, doesn't this affect
> > only indexing performance and not search performance?
> >
> > Secondly, how large should each index be? Should I be partitioning the
> > indexes, e.g. by date range? So one index for December 2005, one for
> > January, etc.? Or is it done by size?
> >
> > TIA
> >
> > Pam
> >
> > On 5/19/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
> > >
> > > Hi Pamela -
> > >
> > > Performance certainly changes as your index grows, and it's not even
> > > necessarily a linear progression.  How you indexed your data,
> > > compression factors, compound vs. loose file format, number of
> > > indexes, etc. all play a part in affecting search performance at
> > > runtime.
> > >
> > > There are a lot of places to look for improvements.  I would suggest
> > > looking at your specific indexes and seeing if you can break those up
> > > into smaller indexes -- this will lead you to the MultiSearcher
> > > (and, if you have multi-processor hardware, ParallelMultiSearcher).
> > >
> > > Leave your index updating operation out of the picture for the moment.
> > > Indexing can have a big impact on search performance, so take that
> > > out of the equation.  After you're able to get to better runtime search
> > > performance, go back and add indexing to the mix.  I can tell you
> > > from experience that most search systems with indexes of substantial
> > > size are executing indexing operations on separate systems to avoid
> > > performance impacts.
> > >
> > > Hope this helps.
> > >
> > > -- j
> > >
> > >
> > >
> > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > >
> > > > I have been developing a C# search solution for an application
> > > > which has tens of millions of web pages. Most of these web pages
> > > > are under 1 KB.
> > > >
> > > > While our initial pilot was very encouraging in our tests of
> > > > 1,000,000 docs, when we scaled up to 10 million, searches that
> > > > had been subsecond started taking 8-10 seconds.
> > > >
> > > > Where should I focus my efforts to increase search speed? Should I
> > > > be using the RAMDirectory? MultiSearcher?
> > > >
> > > > We only have one machine right now, which serves both indexing and
> > > > searching.
> > > >
> > > > TIA
> > > >
> > > > Pam
> > > >
> > > >
> > >
> > >
> >
> >
>
>
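
[Editorial sketch, not part of the original thread.] Jeff's rules of thumb quoted above (individual indexes below about 300MB, aggregate indexes below about 2.5GB per machine) can be turned into a back-of-the-envelope estimate for the 7 GB index Pam mentions. The function name `plan` and its parameters are hypothetical. The thresholds are one team's operational preferences as reported in the thread, not Lucene limits, and 1 GB is assumed to be 1024 MB.

```python
import math


def plan(total_mb, max_index_mb=300, max_machine_mb=2560):
    """Estimate how many indexes and machines a corpus needs.

    Defaults are the thread's rules of thumb: ~300 MB per index and
    ~2.5 GB (2560 MB) of aggregate index per machine.
    """
    # Smallest number of indexes that keeps each one at or under the cap.
    indexes = math.ceil(total_mb / max_index_mb)
    # Whole indexes that fit on one machine under the aggregate cap.
    per_machine = max_machine_mb // max_index_mb
    machines = math.ceil(indexes / per_machine)
    return indexes, machines


indexes, machines = plan(7 * 1024)  # a 7 GB index, expressed in MB
print(indexes, machines)            # prints "24 3"
```

Under these assumptions a 7 GB corpus works out to roughly 24 indexes spread over 3 machines. That is a starting point for measurement rather than a recommendation, since document size, stored fields, and query mix all shift the numbers.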
