lucenenet-dev mailing list archives

From Prescott Nasser <geobmx...@hotmail.com>
Subject RE: Lucene.net file I/O inefficiency and a question
Date Tue, 31 Jan 2017 08:12:24 GMT
Hey Vincent - 

We love any and all help. 

As for contributions, you can create issues in JIRA (https://issues.apache.org/jira/browse/LUCENENET/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel),
though admittedly we aren't great at keeping this up to date or on track. You can also submit a
pull request and state that the code is your original work and that you license it under the
Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0).

Best,
~Prescott

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Monday, January 30, 2017 11:46 PM
To: dev@lucenenet.apache.org
Subject: Lucene.net file I/O inefficiency and a question

Hello everyone,

This message covers two subjects, but since the second one is more of a question, I'll use
the first subject as a "hook", hoping to get an answer to the second one.
(start of the first subject)
There is an inefficient implementation of file I/O in Lucene.net, most notably in FSDirectory.FSIndexOutput.
The number of write calls can be reduced by a factor of 2.
First we see this, which seems to be a copy-paste from the Java code:

            /// <summary>
            /// The maximum chunk size is 8192 bytes, because <seealso cref="RandomAccessFile"/> mallocs
            /// a native buffer outside of stack if the write buffer size is larger.
            /// </summary>
            internal const int CHUNK_SIZE = 8192;

And then further on:

            protected internal override void FlushBuffer(byte[] b, int offset, int size)
            {
                //Debug.Assert(IsOpen);
                while (size > 0)
                {
                    int toWrite = Math.Min(CHUNK_SIZE, size);
                    File.Write(b, offset, toWrite);
                    offset += toWrite;
                    size -= toWrite;
                }
                //Debug.Assert(size == 0);
            }


This is not needed: in .NET, FileStream.Write delegates to the native Win32 file implementation
and allocates nothing, regardless of the size of the buffer.
Wouldn't it be better to write:

            protected internal override void FlushBuffer(byte[] b, int offset, int size)
            {
                //Debug.Assert(IsOpen);
                File.Write(b, offset, size);
            }

... and get rid of the CHUNK_SIZE?
The default buffer size (from the BufferedIndexOutput class) is 16384 bytes, so flushing a full
buffer currently takes two write calls of 8192 bytes each; a single write would halve the
number of I/O calls.
A similar modification can be made to SimpleFSIndexInput.ReadInternal.
There may be other places where similar code is used, but I couldn't conclusively prove a
similar modification would help.
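
To make the factor of 2 concrete, here is a minimal sketch that counts Write calls for both
versions of the flush loop. It makes several assumptions: CountingStream and FlushDemo are
hypothetical names used only for illustration (they are not Lucene.net types), and a
MemoryStream stands in for the real file stream, so it shows the call count only, not actual
I/O cost.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of Lucene.net): counts calls to Write(byte[], int, int).
sealed class CountingStream : MemoryStream
{
    public int WriteCalls { get; private set; }

    public override void Write(byte[] buffer, int offset, int count)
    {
        WriteCalls++;
        base.Write(buffer, offset, count);
    }
}

static class FlushDemo
{
    const int ChunkSize = 8192;    // CHUNK_SIZE from the quoted code
    const int BufferSize = 16384;  // default buffer size cited above

    // Mirrors the existing loop: at most ChunkSize bytes per Write call.
    public static void FlushChunked(Stream file, byte[] b, int offset, int size)
    {
        while (size > 0)
        {
            int toWrite = Math.Min(ChunkSize, size);
            file.Write(b, offset, toWrite);
            offset += toWrite;
            size -= toWrite;
        }
    }

    // Mirrors the proposed version: one Write call for the whole buffer.
    public static void FlushDirect(Stream file, byte[] b, int offset, int size)
    {
        file.Write(b, offset, size);
    }

    public static void Main()
    {
        var data = new byte[BufferSize];
        new Random(42).NextBytes(data);

        using (var chunked = new CountingStream())
        using (var direct = new CountingStream())
        {
            FlushChunked(chunked, data, 0, data.Length);
            FlushDirect(direct, data, 0, data.Length);

            // Both variants must produce identical output; only the call count differs.
            bool same = chunked.ToArray().SequenceEqual(direct.ToArray());

            Console.WriteLine("chunked: {0} Write call(s)", chunked.WriteCalls);
            Console.WriteLine("direct:  {0} Write call(s)", direct.WriteCalls);
            Console.WriteLine("identical bytes: {0}", same);
        }
    }
}
```

For a full 16384-byte buffer the chunked loop issues two Write calls and the direct version
issues one, with identical bytes written in both cases.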
(end of the first subject)

Here's my question: This is the third suggestion I'm making, based on real-world usage of
Lucene.net:

-          Proposal to speed up implementation of LowercaseFilter/charUtils.toLower

-          CreateTempFile is not thread-safe (no, it really is not)

-          Lucene.net file I/O inefficiency
I'd like to make contributions to the Lucene.net project, but several personal and external
factors are preventing me from being a contributor (in the Apache sense). I also may not have
anything else significant to contribute after this: there is no way to know.
How can I make sure that these suggestions are actually considered for ending up in the code?
I've seen contributors doing modifications on behalf of other people. I care about problems
being solved, and do not care about whose name is on them. What's the best way to proceed?
Would it be better to post these things on GitHub somewhere?


Vincent


