lucenenet-dev mailing list archives

From Elad Margalit <eladm...@gmail.com>
Subject Re: Proposal to speed up implementation of LowercaseFilter/charUtils.ToLower
Date Wed, 28 Dec 2016 09:15:37 GMT
Great catch!
This is similar to the char sequence, which can sometimes be replaced by a StringBuilder.
The .NET platform does many things better natively, so whenever it's possible
to take advantage of that, it would be great.

Sent from my iPhone

> On 28 Dec 2016, at 10:46, Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com> wrote:
> 
> Hi,
> 
> I've been doing performance measurements using the latest Lucene.net, and profiling with
> the standard English analyzer (and all analyzers with a lower case filter) indicates that
> a LOT of time is spent in the LowerCaseFilter.IncrementToken() method, doing this:
> 
> charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);
> 
> In my test cases, this dominates the execution time.
> The performance is horrible since inside charUtils.ToLower, for every code point in the
> buffer, a one-element int array and a new string containing the string representation of
> that code point are created; the string is subsequently lowercased and converted back:
> 
> public static int ToLowerCase(int codePoint)
>    {
>      return Character.CodePointAt(UnicodeUtil.NewString(new int[1]
>      {
>        codePoint
>      }, 0, 1).ToLowerInvariant(), 0);
>    }
> 
> This creates heap pressure (due to the huge number of temporary int[1] and string objects
> that fill up Gen0) and is highly inefficient because of the inner loops, for which the JIT
> compiler isn't able to eliminate the bounds checks.
> Yes, this is indeed what the Java code does, but in .NET the ToLowerInvariant method
> already takes care of handling Unicode code points correctly, so I think we can replace
> the charUtils.ToLower method with the following implementation:
> 
>        public void ToLower(char[] buffer, int offset, int limit)
>        {
>            Debug.Assert(buffer.Length >= limit);
>            Debug.Assert(offset >= 0 && offset <= buffer.Length);
>            new string(buffer, offset, limit).ToLowerInvariant().CopyTo(0, buffer, offset, limit);
>        }
> 
> This appears to do exactly the same thing, but much more efficiently. Internally,
> ToLowerInvariant ultimately delegates to a native call (COMNlsInfo::InternalChangeCaseString)
> which uses the Windows LCMapStringEx Win32 API and is orders of magnitude faster than anything
> we can write in managed code, even taking the P/Invoke overhead and call setup costs into
> account.
> After this change, the path through charUtils.ToLower no longer dominates the execution
> time.
> 
> Just sayin' <g>
> 
> 
> Vincent Van Den Berghe
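
The whole-buffer approach from the quoted message can be sketched as a small self-contained class. This is only an illustration under stated assumptions, not the Lucene.NET implementation: the class name LowerCaseSketch is hypothetical, and the sketch assumes that limit is an exclusive end index (as in the Java loop this method mirrors), so the number of chars to process is limit - offset. The quoted version passes limit directly as the count, which gives the same result when offset is 0, as in the LowerCaseFilter call shown above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper class for illustration; not part of Lucene.NET.
public static class LowerCaseSketch
{
    // Lowercases buffer[offset .. limit) in place.
    // Assumption: 'limit' is an exclusive end index, not a count.
    public static void ToLower(char[] buffer, int offset, int limit)
    {
        Debug.Assert(offset >= 0 && offset <= limit);
        Debug.Assert(limit <= buffer.Length);

        int length = limit - offset;

        // One temporary string for the whole range, instead of an int[1]
        // and a string per code point.
        string lowered = new string(buffer, offset, length).ToLowerInvariant();

        // Guard the assumption that invariant lowercasing keeps the
        // string length unchanged, so the copy below cannot overrun.
        Debug.Assert(lowered.Length == length);

        lowered.CopyTo(0, buffer, offset, length);
    }
}
```

For example, calling LowerCaseSketch.ToLower(buf, 0, buf.Length) on the chars of "Hello WORLD" leaves "hello world" in buf. The sketch still allocates two short-lived strings per call, but that is per token rather than per code point.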
