lucenenet-dev mailing list archives

From Andy Pook <andy.p...@gmail.com>
Subject SparseFacetedSearch
Date Thu, 22 Nov 2012 17:28:47 GMT
This is a version of SimpleFacetedSearch that is usable with indexes that
have a large number of facet values.

SimpleFS calculates a bitmap for each value with a bit for each document. A
1 in any given position states that that value appears in that document.

The problem is that if you have thousands of values for a facet then you have
a potentially large bitmap for each one. The function is something like
(values * documents / 8) bytes. This memory requirement can grow very
quickly.
In our usage we have indexes with 3-4 million documents and facets with up
to 100,000 values which would need something in the order of 48GB.
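
As a rough check of the estimate above, here is a minimal sketch of the
(values * documents / 8) formula. The function name bitmap_bytes is made up
for illustration and is not part of SimpleFS. For 100,000 values it gives
between 37.5 and 50 GB (decimal) across 3-4 million documents, which is
consistent with the figure quoted.

```python
def bitmap_bytes(values: int, documents: int) -> int:
    """Bytes needed to hold one bit per (value, document) pair."""
    return values * documents // 8


# Hypothetical index sizes taken from the ranges mentioned in the message.
for docs in (3_000_000, 4_000_000):
    gb = bitmap_bytes(100_000, docs) / 1e9
    print(f"{docs} documents, 100000 values: {gb:.1f} GB")
```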

SparseFacetSearcher works by recording the doc IDs for each value. So a
higher price per hit is paid (4 bytes). But only for hits. There is no cost
for misses. It works out that if you have more than 32 values per field
then SparseFS is more memory efficient.
I'll be adding a file that describes this in more detail.
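
Until that file exists, one way to see where the figure of 32 comes from is
the sketch below. It assumes each document has exactly one value in the facet
field (my assumption, not stated above), so the sparse form stores one 4-byte
doc ID per document while the bitmap form stores one bit per value per
document. The function names are hypothetical.

```python
DOC_ID_BYTES = 4  # price per hit quoted in the message


def bitmap_bytes(values: int, documents: int) -> int:
    """One bit per (value, document) pair."""
    return values * documents // 8


def sparse_bytes(documents: int, values_per_doc: int = 1) -> int:
    """One doc ID per hit; values_per_doc=1 is an assumption."""
    return documents * values_per_doc * DOC_ID_BYTES


def sparse_is_smaller(values: int, documents: int, values_per_doc: int = 1) -> bool:
    return sparse_bytes(documents, values_per_doc) < bitmap_bytes(values, documents)
```

Under that assumption the two costs are equal at exactly 32 values (4 bytes
is 32 bits) and the sparse form wins from 33 values upwards. If documents
carry several values per field, the break-even point moves up in proportion.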

Also, SimpleFS creates a new set of bitmaps on each search. Again,
with a large number of values and documents this puts the GC under pressure.
SparseFS uses synchronised enumerables (my term) to walk the docIDs on the
FS against the issued query (see WalkingIntersect) to find the docIDs in
common. So there's very little memory allocation.
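
The general idea of walking two ascending docID sequences in step can be
sketched as follows. This is an illustration of the technique in Python, not
the WalkingIntersect code from the repository, and it assumes both inputs
yield doc IDs in strictly ascending order.

```python
from typing import Iterable, Iterator


def walking_intersect(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Yield the doc IDs present in both ascending sequences."""
    it_a, it_b = iter(a), iter(b)
    try:
        x, y = next(it_a), next(it_b)
        while True:
            if x == y:
                yield x  # doc ID in common
                x, y = next(it_a), next(it_b)
            elif x < y:
                x = next(it_a)  # advance whichever side is behind
            else:
                y = next(it_b)
    except StopIteration:
        return  # either side running out ends the walk


# Counting hits for one facet value without building an intermediate set.
facet_docs = [1, 3, 5, 7, 9]
query_docs = [3, 4, 5, 9, 12]
hit_count = sum(1 for _ in walking_intersect(facet_docs, query_docs))
```

Only the two current doc IDs are held at any time, which is why a walk like
this allocates so little per search.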

Lastly, there is a bug in SimpleFS.FieldValuesBitSets.GetFieldValues which
means it doesn't stop iterating the TermEnum once it has finished with the
groupByField. So, if you have other fields with lots of terms, it can take
quite a while to exit the while loop.
I can create an issue/patch for this fix in SimpleFS separately.
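
To show the shape of the fix, here is a small in-memory model of the loop. It
is not the SimpleFS code: it assumes terms are enumerated in (field, text)
order, as a Lucene term enumeration does, and stands in a sorted list of
tuples for the TermEnum. The function name and sample data are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def field_values(sorted_terms: List[Tuple[str, str]], group_by_field: str) -> List[str]:
    """Collect the terms of one field, stopping once the field changes."""
    # Seek to the first term of the field, like positioning a term enumeration.
    i = bisect_left(sorted_terms, (group_by_field, ""))
    values = []
    while i < len(sorted_terms):
        field, text = sorted_terms[i]
        if field != group_by_field:
            break  # the early exit: later fields are never visited
        values.append(text)
        i += 1
    return values


terms = sorted([
    ("category", "books"), ("category", "music"),
    ("title", "a"), ("title", "b"), ("title", "c"),
])
category_values = field_values(terms, "category")
```

Without the break, the loop would have to skip every remaining "title" term
before finishing, which is the slowdown described above.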

I have put SparseFS up on
https://github.com/Artesian/SparseFacetedSearch
I've also ported the tests that come with SimpleFS.

If anyone would like to have a look and critique (be gentle with me :) ) I
would appreciate any feedback you have.
I hope that some version of this can be added to contrib.

Cheers,
  Andy
