lucenenet-dev mailing list archives

From "Jeff Rodenburg" <jeff.rodenb...@gmail.com>
Subject Re: noobie question
Date Sat, 20 May 2006 15:11:57 GMT
Correct on our configuration, give or take a few hundred MB.  :-)
And we have three servers accessed simultaneously for each search.

For our index, we're dealing with information that's geographically defined,
so our indexes are broken up along those lines.  We still monitor each index
for size, but the geographic data drives our index maintenance logic.  We've
indexed approximately 20 million rows of information.
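
In code terms the routing itself is trivial; here's a rough sketch in Python
(the region names and index-naming scheme are made up for illustration, not
our actual implementation):

```python
# Sketch: route a document or a region-scoped query to a per-region index.
# Region names and the "listings-<region>" naming are illustrative only.

REGIONS = {"us-west", "us-east", "europe"}

def index_name_for(region: str) -> str:
    """Map a record's region to the index it belongs in."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return f"listings-{region}"

# A region-scoped query only has to touch one small index.
print(index_name_for("us-west"))  # listings-us-west
```

The payoff is that a query constrained to one region never pays the cost of
opening the other regions' indexes at all.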

Our partitioning criteria serve two purposes: query efficiency and index
maintainability.  Depending on how your index is structured (the Lucene
settings plus your own document structure), these two can compete with each
other to the point of being polar opposites.  Generally you'll want to find a
happy medium between the two.  While we have many rows of data and our index
documents contain quite a few fields, many of them are simple data fields
that aren't large (a database is the data source).  By contrast, if we were
indexing full text documents, I'm sure our index would be substantially
larger and we'd likely take a different approach.

I did a lot of research prior to constructing our index, and even with as
much feedback and data as I could glean, trial and error proved to be the
most effective way to determine what to do and how to do it.

-- j


On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
>
> OK, I'm very confused here Jeff. It sounds like what you are suggesting is
> that you have multiple indexes per machine, each around 300 MB, which
> means about 2.5/0.3 ≈ 8 indexes per machine, and you have 7.5/2.5 = 3
> machines in the mix. Is this correct?
>
> On what criteria do you partition your index? Date, some other attribute,
> or merely size?
>
> I think we have indexed 1 million rows and our index is 7 Gigs.
>
> Pam
>
>
> On 5/19/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
> >
> > Yes, the merge parameters do affect indexing performance, but compactness
> > also affects search performance as your index gets larger.  As you
> > incrementally update the index, the fragmentation effect (which the merge
> > properties will dictate) causes performance degradation at search time.
> >
> > As for index size, I don't know about any hard and fast rules.  We have
> > about 7-8GB of indexes of varying structure, and those are spread out over
> > about 40 indexes.  We try to keep individual indexes below 300MB, as the
> > operational hassles after that size seem to be more burdensome.  We also
> > use distributed searching, so our indexes are allocated across multiple
> > machines (no duplication).  As a rule, we also try to stay below 2.5GB of
> > aggregate indexes on one machine.  Our indexes are a full corpus and we
> > must search across all indexes all the time.  You can structure your
> > indexes more effectively if you don't need to search the full corpus all
> > the time.
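
If it helps, here's a toy sketch of that allocation rule in Python (the
greedy first-fit placement and the sizes are illustrative only, not our
actual tooling):

```python
# Greedy first-fit allocation of indexes to machines, keeping each machine
# under an aggregate cap (2.5 GB here).  Sizes are in MB and illustrative,
# not our real deployment.

CAP_MB = 2500

def allocate(index_sizes_mb, cap_mb=CAP_MB):
    """Place each index on the first machine with room, adding machines as needed."""
    machines = []  # each entry: [used_mb, [index sizes]]
    for size in sorted(index_sizes_mb, reverse=True):
        for m in machines:
            if m[0] + size <= cap_mb:
                m[0] += size
                m[1].append(size)
                break
        else:
            machines.append([size, [size]])
    return machines

# Ten 300MB indexes won't fit under one 2.5GB cap, so a second machine appears.
for used, sizes in allocate([300] * 10):
    print(used, len(sizes))
# 2400 8
# 600 2
```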
> >
> > With multiple indexes being searched collectively, you'll soon be using
> > the MultiSearcher class.  Be sure to look at MultiReader, as it makes a
> > difference in search performance (nice caching).
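
Conceptually, the result-time work MultiSearcher does is merge per-index hit
lists into one globally ranked list.  A minimal Python sketch of that merge
step (scores and ids are made up, and the real class also remaps document ids
across sub-indexes, which this ignores):

```python
import heapq

def merge_top_k(per_index_hits, k):
    """Merge (score, doc_id) hit lists from several indexes into one
    global top-k, highest score first.  Sketch only."""
    all_hits = (hit for hits in per_index_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

hits_a = [(0.91, "a1"), (0.40, "a2")]  # hits from index A
hits_b = [(0.73, "b1"), (0.66, "b2")]  # hits from index B
print(merge_top_k([hits_a, hits_b], 3))
# [(0.91, 'a1'), (0.73, 'b1'), (0.66, 'b2')]
```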
> >
> > -- j
> >
> > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > >
> > > Hi Jeff
> > >
> > > A couple more questions. Don't the merge parameters determine how
> > > aggressively the index is compacted? And if so, doesn't this affect
> > > only indexing performance and not search performance?
> > >
> > > Secondly, how large should each index be? Should I be partitioning the
> > > indexes, i.e. by date range? So one index for December 2005, one for
> > > January, etc.? Or is it done by size?
> > >
> > > TIA
> > >
> > > Pam
> > >
> > > On 5/19/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
> > > >
> > > > Hi Pamela -
> > > >
> > > > Performance certainly changes as your index grows, and it's not even
> > > > necessarily a linear progression.  How you indexed your data,
> > > > compression factors, compound vs. loose file format, number of
> > > > indexes, etc. all play a part in affecting search performance at
> > > > runtime.
> > > >
> > > > There are a lot of places to look for improvements.  I would suggest
> > > > looking at your specific indexes and seeing if you can break those up
> > > > into smaller indexes -- this will lead you to the MultiSearcher (and,
> > > > if you have multi-processor hardware, ParallelMultiSearcher).
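
The parallel variant just fans the same query out to every sub-index
concurrently and combines the hits afterward.  A toy Python sketch of that
pattern (stand-in data and functions, not the actual ParallelMultiSearcher
internals):

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query):
    """Stand-in for a per-index search; returns matching (score, doc) hits."""
    return [(score, doc) for score, doc in shard if query in doc]

def parallel_search(shards, query):
    """Query every shard concurrently, then combine and rank the hit lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = pool.map(lambda s: search_shard(s, query), shards)
    return sorted((h for r in results for h in r), reverse=True)

shards = [[(0.9, "apache lucene"), (0.2, "other")],
          [(0.7, "lucene.net port")]]
print(parallel_search(shards, "lucene"))
# [(0.9, 'apache lucene'), (0.7, 'lucene.net port')]
```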
> > > >
> > > > Leave your index updating operation out of the picture for the
> > > > moment.  Indexing can have a big impact on search performance, so
> > > > take that out of the equation.  After you're able to get better
> > > > runtime search performance, go back and add indexing to the mix.  I
> > > > can tell you from experience that most search systems with indexes of
> > > > substantial size are executing indexing operations on separate
> > > > systems to avoid performance impacts.
> > > >
> > > > Hope this helps.
> > > >
> > > > -- j
> > > >
> > > >
> > > >
> > > > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > > > >
> > > > > I have been developing a C# search solution for an application
> > > > > which has tens of millions of web pages. Most of these web pages
> > > > > are under 1 KB.
> > > > >
> > > > > While our initial pilot was very encouraging in our tests of
> > > > > 1,000,000 docs, when we scaled up to 10 million docs, searches that
> > > > > were subsecond are now taking 8-10 seconds.
> > > > >
> > > > > Where should I focus my efforts to increase search speed? Should I
> > > > > be using the RAMDirectory? MultiSearcher?
> > > > >
> > > > > We only have one machine right now which serves indexing and
> > > > > searching.
> > > > >
> > > > > TIA
> > > > >
> > > > > Pam
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
