lucenenet-dev mailing list archives

From "Jeff Rodenburg" <jeff.rodenb...@gmail.com>
Subject Re: noobie question
Date Fri, 19 May 2006 17:04:02 GMT
The compound file format is the default file format for the index you
create (at least in v1.4.x).  When creating an index, you can pass
true/false to IndexWriter's SetUseCompoundFile method to indicate whether
you want the index files compacted into a single compound file.  Check out
http://lucene.apache.org/java/docs/fileformats.html to understand this
better.
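
For example, something along these lines (a sketch against the 1.4-style
API; the path is a placeholder, and member casing may differ slightly in
the Lucene.Net port):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;

    // true as the third argument means "create a new index"
    IndexWriter writer = new IndexWriter(@"c:\myindex",
                                         new StandardAnalyzer(), true);
    // false = loose multi-file format; true (the default) = compound
    writer.SetUseCompoundFile(false);
    // ... add documents ...
    writer.Close();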

When your index gets to be of significant size, the file format can become
very important.  Using the default compound format, searching will tend to
be faster (all other things being equal) but index updates will be slower;
with the loose multi-file format it's the reverse: searching may be slower
but index updates can be faster.  Three other settings can affect the mix
as well: mergeFactor, minMergeDocs, and maxMergeDocs.  Tuning these in
conjunction with the file format setting grows in importance as your index
size increases.  Check out the thread at
http://www.gossamer-threads.com/lists/lucene/java-user/11999?search_string=minmergedocs;#11999
.
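
Roughly like this, on the same IndexWriter as above (again a sketch: in
the Java 1.4 API these are public fields on IndexWriter, and the
Lucene.Net port may expose them with different casing, so verify against
your version):

    // Higher mergeFactor = faster indexing but more segments to search.
    writer.mergeFactor = 10;
    // Docs buffered in memory before a new segment is written to disk.
    writer.minMergeDocs = 100;
    // Upper bound on the number of docs merged into a single segment.
    writer.maxMergeDocs = 100000;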

-- j



On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
>
> Thanks Jeff, I am a little confused by the compound vs. loose file format
> you speak of.
>
> We are indexing HTML docs along with 10 meta tags.  By indexing I mean we
> index the body, but we also query on the properties.  I am not sure what
> the correct definition is.
>
> Are you saying that if we were merely indexing the document bodies we
> would be further ahead?  We need to restrict our searches by date and a
> few other properties, so it's really important that we be able to apply
> those restrictions.
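>
> Something like this is what we have in mind (a sketch against the
> 1.4-style API; the field names, values, and variables are placeholders):
>
>     using Lucene.Net.Documents;
>     using Lucene.Net.Index;
>     using Lucene.Net.Search;
>
>     // Index the body as tokenized text, the meta tags as keyword fields.
>     Document doc = new Document();
>     doc.Add(Field.Text("body", htmlBodyText));       // placeholder variable
>     doc.Add(Field.Keyword("author", authorMetaTag)); // one of the 10 tags
>     doc.Add(Field.Keyword("date", "20060519"));      // sortable yyyyMMdd
>
>     // At search time, restrict hits to a date window (inclusive).
>     Query dateRange = new RangeQuery(new Term("date", "20060101"),
>                                      new Term("date", "20061231"), true);
>     BooleanQuery q = new BooleanQuery();
>     q.Add(bodyQuery, true, false);  // required; bodyQuery is a placeholder
>     q.Add(dateRange, true, false);  // required date restriction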
>
> TIA
>
> Pam
>
>
> On 5/19/06, Jeff Rodenburg <jeff.rodenburg@gmail.com> wrote:
> >
> > Hi Pamela -
> >
> > Performance certainly changes as your index grows, and it's not even
> > necessarily a linear progression.  How you indexed your data,
> > compression factors, compound vs. loose file format, number of indexes,
> > etc. all play a part in affecting search performance at runtime.
> >
> > There are a lot of places to look for improvements.  I would suggest
> > looking at your specific indexes to see if you can break them up into
> > smaller indexes -- this will lead you to the MultiSearcher (and, if you
> > have multi-processor hardware, ParallelMultiSearcher).
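> >
> > For example (a sketch against the 1.4-style API; the paths and the
> > query variable are placeholders):
> >
> >     using Lucene.Net.Search;
> >
> >     // Two smaller indexes searched as one.
> >     IndexSearcher s1 = new IndexSearcher(@"c:\index1");
> >     IndexSearcher s2 = new IndexSearcher(@"c:\index2");
> >     Searchable[] shards = new Searchable[] { s1, s2 };
> >
> >     MultiSearcher searcher = new MultiSearcher(shards);
> >     // or, on multi-processor hardware:
> >     // ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
> >     Hits hits = searcher.Search(query);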
> >
> > Leave your index updating operation out of the picture for the moment.
> > Indexing can have a big impact on search performance, so take that out
> > of the equation.  After you're able to get better runtime search
> > performance, go back and add indexing to the mix.  I can tell you from
> > experience that most search systems with indexes of substantial size
> > run their indexing operations on separate systems to avoid performance
> > impacts.
> >
> > Hope this helps.
> >
> > -- j
> >
> >
> >
> > On 5/19/06, Pamela Foxcroft <pamelafoxcroft@gmail.com> wrote:
> > >
> > > I have been developing a C# search solution for an application which
> > > has tens of millions of web pages.  Most of these web pages are under
> > > 1 KB.
> > >
> > > While our initial pilot was very encouraging in our tests of 1,000,000
> > > docs, when we scaled up to 10 million docs, searches that had been
> > > subsecond are now taking 8-10 seconds.
> > >
> > > Where should I focus my efforts to increase search speed?  Should I
> > > be using the RAMDirectory?  MultiSearcher?
> > >
> > > We only have one machine right now which serves indexing and
> > > searching.
> > >
> > > TIA
> > >
> > > Pam
> > >
> > >
> >
> >
>
>
