lucenenet-dev mailing list archives

From "Digy" <digyd...@gmail.com>
Subject RE: [Lucene.Net] Adding proper System.IO.Stream support to Lucene.Net
Date Tue, 24 May 2011 21:47:33 GMT
Hi Christopher,

In my experience, this kind of email either gets no response or sets off a
lot of back-and-forth with no result.
What about preparing a *small* proof-of-concept that passes all the unit
tests?

DIGY

-----Original Message-----
From: Christopher Currens [mailto:currens.chris@gmail.com] 
Sent: Wednesday, May 25, 2011 12:08 AM
To: lucene-net-dev@lucene.apache.org
Subject: [Lucene.Net] Adding proper System.IO.Stream support to Lucene.Net

All,

I've spent the past few days looking at what it would take to implement
proper streaming of data into and out of an index.  Fortunately, it hasn't
proven very difficult at all, leaving me with a solution that works very
nicely.  Now that I know it's possible, I wanted to discuss with the
community the best way to add this to the API.

Currently, it's set up so that a field can have a Stream value if it's
binary (System.IO.Stream StreamValue()).  I plan to replace byte[] with
streaming functions internally, wherever Lucene uses it.  I think it's a
good idea to keep byte[] BinaryValue() as it is, but essentially replace its
implementation, by default, with a kind of lazy loading.  In the current
version of Lucene, if a user opens a document with a binary field, that
entire field is loaded into memory.

The idea behind changing the internals of FieldsReader.cs to pass a stream
along instead of a byte[] is that people using the API to stream the data
out will load no more into memory than they have to.  People using the
byte[] BinaryValue() function to get the binary data will actually see
improved performance as well, as the byte array will be loaded when the
method is called rather than when the document is created.
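
A rough sketch of the lazy-loading idea, to make the discussion concrete.
LazyBinaryField and its constructor argument are hypothetical names for
illustration only; this is not the existing Lucene.Net Field or FieldsReader
API, just one way the deferred load could look (Stream.CopyTo assumes .NET 4).

```csharp
using System;
using System.IO;

// Hypothetical sketch; not the actual Lucene.Net Field or FieldsReader API.
public sealed class LazyBinaryField
{
    private readonly Func<Stream> openStream; // opens the stored bytes on demand
    private byte[] cached;                    // filled on the first BinaryValue() call

    public LazyBinaryField(Func<Stream> openStream)
    {
        if (openStream == null) throw new ArgumentNullException("openStream");
        this.openStream = openStream;
    }

    // Streaming access: the caller reads only as much as it needs.
    public Stream StreamValue()
    {
        return openStream();
    }

    // Array access: the bytes are loaded on the first call, not at construction,
    // and the same array is returned on later calls.
    public byte[] BinaryValue()
    {
        if (cached == null)
        {
            using (Stream source = openStream())
            using (MemoryStream buffer = new MemoryStream())
            {
                source.CopyTo(buffer);
                cached = buffer.ToArray();
            }
        }
        return cached;
    }
}
```

With something like this, a document that is opened but whose binary field
is never read would not pay for loading the field at all.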

As a final note on binary data streaming: by streaming the data in, we
obviously can't support compression of those fields.  The compression in
Lucene is poor anyway.  It can't be done in blocks, so it needs all the data
in memory at once, and the compression itself is done in a separate byte
array, which takes large amounts of memory.  However, one thing I briefly
talked to Troy about in person was the ability to add StreamFilters, so that
data passed in is first filtered by a compression algorithm or the like
before it's stored in the index.  That doesn't really belong in the Lucene
domain itself, but streaming data into Lucene.Net at least gives the user
the opportunity to do it.
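
To illustrate the kind of user-side filtering meant here, a minimal sketch
using GZipStream from System.IO.Compression.  StreamFilterSketch and its two
methods are hypothetical names; nothing like a StreamFilter API exists in
Lucene.Net today, so this only shows data being compressed in blocks on its
way to some destination stream.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper; illustrates caller-side filtering, not a Lucene.Net API.
public static class StreamFilterSketch
{
    // Copies source into destination through a gzip filter, block by block,
    // so the whole payload never has to be held in a single byte array.
    public static void CopyCompressed(Stream source, Stream destination)
    {
        // The final 'true' leaves destination open after the filter is closed.
        using (GZipStream gzip = new GZipStream(destination, CompressionMode.Compress, true))
        {
            source.CopyTo(gzip);
        }
    }

    // Reverses CopyCompressed when the data is streamed back out.
    public static void CopyDecompressed(Stream source, Stream destination)
    {
        using (GZipStream gzip = new GZipStream(source, CompressionMode.Decompress, true))
        {
            gzip.CopyTo(destination);
        }
    }
}
```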

I also want to add proper TextReader support to Lucene.Net.  A large
difference between the Java and .NET versions of Lucene is that the Java
version supports setting a field's value to a TextReader, and both analyzes
and stores the data.  Because the TextReader in .NET doesn't support
resetting or seeking the underlying stream, we can only analyze the text in
Lucene; we can't store the field.

A solution that comes to mind would be to create a util class, something
like SeekableTextReader, that inherits from TextReader and can be passed to
the field, with special behavior that allows it to be reset, and thus both
analyzed and stored.  Perhaps the largest downside to that solution is that,
in order to keep the API the same while allowing the field to be stored, it
would require fairly ugly checks like "if (reader is SeekableTextReader)
//do this".
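
A minimal sketch of what such a class might look like, assuming it wraps a
seekable Stream plus an Encoding.  The class name comes from the paragraph
above; the constructor and the Reset() method are assumptions for
illustration, not an existing Lucene.Net type.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sketch of the proposed util class; not part of Lucene.Net.
public sealed class SeekableTextReader : TextReader
{
    private readonly Stream stream;
    private readonly Encoding encoding;
    private StreamReader inner;

    public SeekableTextReader(Stream stream, Encoding encoding)
    {
        if (stream == null) throw new ArgumentNullException("stream");
        if (encoding == null) throw new ArgumentNullException("encoding");
        if (!stream.CanSeek) throw new ArgumentException("Stream must be seekable.", "stream");
        this.stream = stream;
        this.encoding = encoding;
        this.inner = new StreamReader(stream, encoding);
    }

    // Rewinds to the start so the same text can be read a second time,
    // e.g. once for analysis and once for storing.
    public void Reset()
    {
        stream.Seek(0, SeekOrigin.Begin);
        // A fresh reader drops any buffered characters and decoder state.
        inner = new StreamReader(stream, encoding);
    }

    public override int Peek() { return inner.Peek(); }

    public override int Read() { return inner.Read(); }

    public override int Read(char[] buffer, int index, int count)
    {
        return inner.Read(buffer, index, count);
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) inner.Dispose(); // also closes the underlying stream
        base.Dispose(disposing);
    }
}
```

The indexing code could then read the text once for analysis, call Reset(),
and read it again to store it, which is where the "reader is
SeekableTextReader" check would come in.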

Perhaps a cleaner solution would be to add yet another value to the Field
class that allows a SeekableTextReader to be passed.  This approach has its
own downsides: there would now be two methods that expect TextReaders, one
that stores and one that doesn't, which seems rather confusing.  But I
suppose this is why I was looking for the community's opinion in the first
place.


The more comments about this the better.  I think adding this could bring
some much-needed functionality to Lucene.Net, and start setting its
performance apart from the Java version.


Thanks,
Christopher

