lucenenet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jokin Cuadrado" <joki...@gmail.com>
Subject Re: Performance improvements - BitArrays
Date Wed, 17 Dec 2008 09:29:19 GMT
┬┐Could you share this implementation? As lucene 2.4 has incorporated
also his own openbitset implementation it would be necesary anyway for
the next lucene.net version.

Jokin

On Wed, Dec 17, 2008 at 1:31 AM, Michael Garski <mgarski@myspace.com> wrote:
> In 2.3, the document id is checked in the filter after it is scored and
> before it is passed to the hit collector, which can result in a poor
> performing search executed with a common term and a sparsely populated
> filter.  I created my own filter implementation based off of the
> DocSet/OpenBitSet classes that are in Solr, where the implementation of
> getting the next set bit is very efficient, and does not use unsafe
> code.  With my own filter implementation I was also able to work around
> the memory leak issue with the cached BitArrays that Digy has noted
> earlier.
>
> Filter implementation in Lucene 2.4 is overhauled to allow you to create
> your own filter implementation, defaulting to the OpenBitSet.
> Additionally, I believe the filter is enumerated along with the
> termdocs, leading to faster searches with sparsely populated filters.
>
>
> Michael
>
>
>
> -----Original Message-----
> From: Andrei Alecu [mailto:andrei@tachyon-labs.com]
> Sent: Tuesday, December 16, 2008 3:33 PM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: Performance improvements - BitArrays
>
> Hi Digy,
>
> Unfortunately I don't exactly have any performance tests with the
> raw/trunk Lucene.
>
> Like I said, I'm using my own filters which depend on being able to go
> through the BitArray fast, a search for a term in my index can return
> even as much as 500,000 results. I get the BitArray from Lucene and then
>
> go through all 500,000 set bits so I can find out which document id it
> is, and then look up the coordinates of that document id and compare
> with my boundary box.
>
> This requires a fast GetNextSetBit/Get implementation which is only
> possible through unsafe code.
>
> I can safely say however, that with the default .NET BitArray class, and
>
> default GetNextSetBit implementation, the search was completely
> unusable, sporting times of several seconds.
>
> So, with the trunk Lucene, using the default filters, the net gain is
> probably in the 10% range at most, I've seen that not many bit pushing
> is involved with regular searches, bit maps being retrieved directly
> from the Lucene index. But the moment you want to do something more
> complicated with filters, then the limitations of the .NET BitArray
> class become apparent.
>
> Right now I can do very complex searches such as: "term AND field:(1111
> OR 2222 OR 3333 OR 4444 OR 5555)" in under 200 ms for the initial
> search, and then about 50 ms for any subsequent searches. I also
> implemented some WeakReference based caching.
>
> Regarding licensing issues, I would suggest you code a BitArray class
> from scratch and name it something else, it's not really hard to make a
> bit manipulation class, and that would solve licensing issues. I only
> attached mine for reference (and mine is not fully optimized anyway).
>
> Regards,
> Andrei
>
>
>
> Digy wrote:
>> Hi Andrei,
>>
>> Thanks for sharing your experience, but I have two questions regarding
> to
>> your code.
>> - Do you have some performance tests using BitArray & BitArrayEx? What
> is
>> the gain?
>> - I am not good about licensing issues. Do you think that it can be
> used in
>> an apache project(since I see an MS copyright)?
>>
>> DIGY.
>>
>> -----Original Message-----
>> From: Andrei Alecu [mailto:andrei@tachyon-labs.com]
>> Sent: Saturday, December 13, 2008 3:05 PM
>> To: lucene-net-dev@incubator.apache.org
>> Subject: Performance improvements - BitArrays
>>
>> Hello all,
>>
>> This is my first message here. I've been using Lucene.NET for a while
> for a
>> pretty big local search engine website and thought I'd share some of
> the
>> performance improvements I came across.
>>
>> I made lots of changes to the Lucene code base in the process, and had
> to
>> write my own geographical based filters and sort classes.
>>
>> I got it to a point where it can do a search through 17,000,000
> records in
>> under 200 ms, all this while also filtering results so they are within
>> specified coordinates (each listing having its own coordinates), and
> also
>> sorting results by distance to a certain point of origin - all on one
>> machine. And I'm pretty sure it can be made even faster.
>>
>> I'd like to share one of the easy and pretty noticeable performance
>> improvements I did.
>>
>> Lucene.NET, in its current implementation, makes heavy use of the .NET
>> BitArray class. This class is not really performance optimized at all
> (you
>> can look through the sources from Microsoft).
>>
>> The problem is that Microsoft sealed the BitArray class so you can't
> derive
>> from it, so I got the sources from Microsoft and then heavily
> optimized them
>> for performance using pointers (unsafe code), and some improved
> algorithms
>> for finding the Next Set Bit (which I use extensively in my own
>> Geo-filters).
>>
>> Also, if you can remove some of the valid argument checks from the
> BitArray
>> class, then that's all the better for performance as well, because
> methods
>> that throw explicit errors are not inlined by the .NET JIT.
>>
>> The other big improvement I did was to keep a renewable pool of
> BitArray
>> classes in memory in order to minimize garbage collection. What this
> means
>> is, whenever I need a new BitArray, instead of creating a completely
> new
>> instance that the GC has to manage, I get one from a pool of
> 'released'
>> BitArrays. When I'm done with a BitArray, instead of letting it go to
> the
>> GC, I just put it back in the pool so it can be reused. This reduced
> GCs to
>> a minimum in my application, and the memory footprint of my
> application is
>> also much more stable.
>>
>> I realize however that this second improvement would break the Java
> code
>> compatibility so it's just something more advanced users would want to
> do
>> for themselves.
>>
>> I'm not sure if attaching files here works, but I'll attach the
> performance
>> BitArray class I have.
>>
>> Thanks for all your hard work everyone!
>>
>> Kind regards,
>> Andrei Alecu
>>
>>
>>
>>
>>
>
>
>
>

Mime
View raw message