lucenenet-dev mailing list archives

From Christopher Currens <currens.ch...@gmail.com>
Subject Re: [Lucene.Net] Adding proper System.IO.Stream support to Lucene.Net
Date Tue, 24 May 2011 22:04:57 GMT
Digy,

That's probably a good idea.  I need to clean up the code as it stands now
and make sure the unit tests pass.  I'm going to shoot for getting a patch
out in the next couple hours.

Thanks,
Christopher

On Tue, May 24, 2011 at 2:47 PM, Digy <digydigy@gmail.com> wrote:

> Hi Christopher,
>
> In my experience, this kind of email either gets no response or too much
> back-and-forth with no result.
> What about preparing a *small* "Proof of Concept" that passes all unit
> tests?
>
> DIGY
>
> -----Original Message-----
> From: Christopher Currens [mailto:currens.chris@gmail.com]
> Sent: Wednesday, May 25, 2011 12:08 AM
> To: lucene-net-dev@lucene.apache.org
> Subject: [Lucene.Net] Adding proper System.IO.Stream support to Lucene.Net
>
> All,
>
> I've spent the past few days looking at what it would take to implement
> proper streaming of data into and out of an index.  Fortunately, it hasn't
> proven very difficult at all, leaving me with a solution that works very
> nicely.  Now that I know it's possible, I wanted to discuss with the
> community the best way to add this to the API.
>
> Currently, it's set up so that a field can have a Stream value if it's
> binary (System.IO.Stream StreamValue()).  I plan to replace byte[] with
> streaming functions internally, wherever Lucene uses it.  I think it's a
> good idea to keep the byte[] BinaryValue() as it is, but essentially
> replace it, by default, with a kind of lazy loading.  In the current
> version of Lucene, if a user opens a document with a binary field, that
> entire field is loaded into memory.
>
> The idea behind replacing the internals of FieldsReader.cs by passing a
> stream along instead of a byte[] is that people using the API to stream
> the data out will load no more into memory than they have to.  People
> using the byte[] BinaryValue() function to get the binary data will
> actually see improved performance as well, as the byte array will be
> loaded when the method is called, instead of when the document is
> created.
>
> As a final note on binary data streaming, by streaming the data in, we
> obviously can't support compression of those fields.  The compression in
> Lucene is poor anyway: it can't be done in blocks, and it requires large
> amounts of memory, as it needs all the data in memory to do the
> compression, which is also done in a separate byte array.  However, one
> capability I had briefly discussed with Troy in person was adding
> StreamFilters, so that data passed in is first filtered by a compression
> algorithm or the like before it's stored in the index.  That doesn't
> really apply directly to the Lucene domain, but it does at least give the
> user the opportunity to do that when streaming data into Lucene.Net.
>
> I also want to add proper TextReader support to Lucene.Net.  A large
> difference between the Java and .NET versions of Lucene is that the Java
> version supports setting a field's value to a TextReader, that both
> analyzes and stores the data.  Because the TextReader in .NET doesn't
> support resetting or seeking of the underlying stream, we can only analyze
> the text in Lucene; we can't store the field.
>
> A solution that comes to mind would be creating a util class, something
> like SeekableTextReader, that inherits from TextReader and can be passed
> to the field, with special behavior that allows it to be reset, and thus
> both analyzed and stored.  Perhaps the largest downside to that solution
> is that, in order to keep the API the same while allowing it to be stored,
> it would require fairly ugly checks like "if (reader is
> SeekableTextReader) // do this".
>
> Perhaps a cleaner solution would be to add yet another value to the Field
> class that allows a SeekableTextReader to be passed.  This approach has
> its own downsides: there would now be two methods that expect TextReaders,
> one that stores and one that doesn't, which seems rather confusing.  But I
> suppose this is why I was looking for the community's opinion in the first
> place.
>
>
> The more comments about this the better.  I think adding this could add
> some much-needed functionality to Lucene, and start setting apart its
> performance from the Java version.
>
>
> Thanks,
> Christopher
>
>
