lucenenet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jokin Cuadrado" <>
Subject Re: Performance improvements - BitArrays
Date Wed, 17 Dec 2008 09:29:19 GMT
┬┐Could you share this implementation? As lucene 2.4 has incorporated
also his own openbitset implementation it would be necesary anyway for
the next version.


On Wed, Dec 17, 2008 at 1:31 AM, Michael Garski <> wrote:
> In 2.3, the document id is checked in the filter after it is scored and
> before it is passed to the hit collector, which can result in a poor
> performing search executed with a common term and a sparsely populated
> filter.  I created my own filter implementation based off of the
> DocSet/OpenBitSet classes that are in Solr, where the implementation of
> getting the next set bit is very efficient, and does not use unsafe
> code.  With my own filter implementation I was also able to work around
> the memory leak issue with the cached BitArrays that Digy has noted
> earlier.
> Filter implementation in Lucene 2.4 is overhauled to allow you to create
> your own filter implementation, defaulting to the OpenBitSet.
> Additionally, I believe the filter is enumerated along with the
> termdocs, leading to faster searches with sparsely populated filters.
> Michael
> -----Original Message-----
> From: Andrei Alecu []
> Sent: Tuesday, December 16, 2008 3:33 PM
> To:
> Subject: Re: Performance improvements - BitArrays
> Hi Digy,
> Unfortunately I don't exactly have any performance tests with the
> raw/trunk Lucene.
> Like I said, I'm using my own filters which depend on being able to go
> through the BitArray fast, a search for a term in my index can return
> even as much as 500,000 results. I get the BitArray from Lucene and then
> go through all 500,000 set bits so I can find out which document id it
> is, and then look up the coordinates of that document id and compare
> with my boundary box.
> This requires a fast GetNextSetBit/Get implementation which is only
> possible through unsafe code.
> I can safely say however, that with the default .NET BitArray class, and
> default GetNextSetBit implementation, the search was completely
> unusable, sporting times of several seconds.
> So, with the trunk Lucene, using the default filters, the net gain is
> probably in the 10% range at most, I've seen that not many bit pushing
> is involved with regular searches, bit maps being retrieved directly
> from the Lucene index. But the moment you want to do something more
> complicated with filters, then the limitations of the .NET BitArray
> class become apparent.
> Right now I can do very complex searches such as: "term AND field:(1111
> OR 2222 OR 3333 OR 4444 OR 5555)" in under 200 ms for the initial
> search, and then about 50 ms for any subsequent searches. I also
> implemented some WeakReference based caching.
> Regarding licensing issues, I would suggest you code a BitArray class
> from scratch and name it something else, it's not really hard to make a
> bit manipulation class, and that would solve licensing issues. I only
> attached mine for reference (and mine is not fully optimized anyway).
> Regards,
> Andrei
> Digy wrote:
>> Hi Andrei,
>> Thanks for sharing your experience, but I have two questions regarding
> to
>> your code.
>> - Do you have some performance tests using BitArray & BitArrayEx? What
> is
>> the gain?
>> - I am not good about licensing issues. Do you think that it can be
> used in
>> an apache project(since I see an MS copyright)?
>> DIGY.
>> -----Original Message-----
>> From: Andrei Alecu []
>> Sent: Saturday, December 13, 2008 3:05 PM
>> To:
>> Subject: Performance improvements - BitArrays
>> Hello all,
>> This is my first message here. I've been using Lucene.NET for a while
> for a
>> pretty big local search engine website and thought I'd share some of
> the
>> performance improvements I came across.
>> I made lots of changes to the Lucene code base in the process, and had
> to
>> write my own geographical based filters and sort classes.
>> I got it to a point where it can do a search through 17,000,000
> records in
>> under 200 ms, all this while also filtering results so they are within
>> specified coordinates (each listing having its own coordinates), and
> also
>> sorting results by distance to a certain point of origin - all on one
>> machine. And I'm pretty sure it can be made even faster.
>> I'd like to share one of the easy and pretty noticeable performance
>> improvements I did.
>> Lucene.NET, in its current implementation, makes heavy use of the .NET
>> BitArray class. This class is not really performance optimized at all
> (you
>> can look through the sources from Microsoft).
>> The problem is that Microsoft sealed the BitArray class so you can't
> derive
>> from it, so I got the sources from Microsoft and then heavily
> optimized them
>> for performance using pointers (unsafe code), and some improved
> algorithms
>> for finding the Next Set Bit (which I use extensively in my own
>> Geo-filters).
>> Also, if you can remove some of the valid argument checks from the
> BitArray
>> class, then that's all the better for performance as well, because
> methods
>> that throw explicit errors are not inlined by the .NET JIT.
>> The other big improvement I did was to keep a renewable pool of
> BitArray
>> classes in memory in order to minimize garbage collection. What this
> means
>> is, whenever I need a new BitArray, instead of creating a completely
> new
>> instance that the GC has to manage, I get one from a pool of
> 'released'
>> BitArrays. When I'm done with a BitArray, instead of letting it go to
> the
>> GC, I just put it back in the pool so it can be reused. This reduced
> GCs to
>> a minimum in my application, and the memory footprint of my
> application is
>> also much more stable.
>> I realize however that this second improvement would break the Java
> code
>> compatibility so it's just something more advanced users would want to
> do
>> for themselves.
>> I'm not sure if attaching files here works, but I'll attach the
> performance
>> BitArray class I have.
>> Thanks for all your hard work everyone!
>> Kind regards,
>> Andrei Alecu

View raw message