lucenenet-dev mailing list archives

From "Christopher Currens (JIRA)" <j...@apache.org>
Subject [Lucene.Net] [jira] [Commented] (LUCENENET-417) implement streams as field values
Date Wed, 15 Jun 2011 16:43:47 GMT

    [ https://issues.apache.org/jira/browse/LUCENENET-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049856#comment-13049856 ]

Christopher Currens commented on LUCENENET-417:
-----------------------------------------------

That's a valid question.  I think it comes up most often (though not only) when Lucene is
used to index file systems.  As an example, the text extracted from some xls files can be
*shudder* in the hundreds of MB.  When accuracy is needed in a search, MaxFieldLength.Unlimited
becomes important, as we don't want the indexed terms silently truncated.  The idea of
streaming it, as I said before, was more about managing _program memory_, especially when
multiple indexes are read/written at the same time, than about the ability to index a large
file at all.  Granted, there are other ways to solve the problem, such as the one you more or
less suggested: breaking up a larger file into smaller chunks.  However, not all data is
divisible the way a book is, so it's not an ideal solution, especially if you're storing file
metadata along with the full text.
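
To make the memory argument concrete, here is a minimal sketch in plain Java (no Lucene
involved; the class and method names are made up for illustration).  It assumes whitespace
tokenization, which is far simpler than a real analyzer.  The point is only that a consumer
reading from a stream needs one fixed-size buffer no matter how large the input is, and that
token state carries across buffer boundaries, which is exactly what gets lost when a file is
split into independent chunks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Lucene or Lucene.Net.
final class StreamingTokenCounter {

    /**
     * Counts whitespace-separated tokens in the stream.
     * Only bufferChars characters are held in memory at any time.
     */
    static long countTokens(InputStream in, Charset charset, int bufferChars)
            throws IOException {
        long tokens = 0;
        boolean inToken = false; // survives across reads, so split tokens count once
        char[] buffer = new char[bufferChars];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int read;
            while ((read = reader.read(buffer, 0, buffer.length)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (Character.isWhitespace(buffer[i])) {
                        inToken = false;
                    } else if (!inToken) {
                        inToken = true;
                        tokens++;
                    }
                }
            }
        }
        return tokens;
    }
}
```

Splitting the same input into separately indexed chunks would need extra bookkeeping to
avoid cutting a token (or a metadata record) in half at each chunk boundary.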

> implement streams as field values
> ---------------------------------
>
>                 Key: LUCENENET-417
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-417
>             Project: Lucene.Net
>          Issue Type: New Feature
>          Components: Lucene.Net Core
>            Reporter: Christopher Currens
>         Attachments: StreamValues.patch
>
>
> Adding binary values to a field is an expensive operation, as the whole binary data must
be loaded into memory and then written to the index.  Adding the ability to use a stream instead
of a byte array could not only speed up the indexing process, but also reduce the memory
footprint.
> -Java Lucene has the ability to use a TextReader to both analyze and store text in the
index.-  Lucene.NET lacks the ability to store string data in the index via streams. This
should be a feature added to Lucene.NET as well.  My thoughts are to add another Field
constructor, Field(string name, System.IO.Stream stream, System.Text.Encoding encoding),
that will allow the text to be analyzed and stored in the index.
> Comments about this approach are greatly appreciated.
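
As a rough illustration of the constructor shape proposed above, here is a Java analogue
(InputStream and Charset standing in for System.IO.Stream and System.Text.Encoding).  The
StreamField class below is hypothetical and is not the contents of StreamValues.patch or any
existing Lucene or Lucene.Net API; it only shows how a field could hold a stream plus an
encoding and decode lazily when the indexer asks for text.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Objects;

// Hypothetical stand-in for the proposed field type; an assumption, not a real API.
final class StreamField {
    private final String name;
    private final InputStream stream;
    private final Charset encoding;

    StreamField(String name, InputStream stream, Charset encoding) {
        this.name = Objects.requireNonNull(name, "name");
        this.stream = Objects.requireNonNull(stream, "stream");
        this.encoding = Objects.requireNonNull(encoding, "encoding");
    }

    String name() {
        return name;
    }

    /**
     * Wraps the stream in a decoding reader.  No bytes are read here;
     * they are pulled from the stream only as the caller reads characters.
     */
    Reader readerValue() {
        return new InputStreamReader(stream, encoding);
    }
}
```

A stream can only be consumed once, so a real implementation would have to decide whether
analysis and storage share a single pass over the data or require a re-openable source.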

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
