lucenenet-dev mailing list archives

From "Van Den Berghe, Vincent" <Vincent.VanDenBer...@bvdinfo.com>
Subject Proposal to speed up implementation of LowercaseFilter/charUtils.ToLower
Date Wed, 28 Dec 2016 08:46:20 GMT
Hi,

I've been doing performance measurements using the latest Lucene.Net, and profiling with the
standard English analyzer (and all analyzers with a lower case filter) indicates that a LOT
of time is spent in the LowerCaseFilter.IncrementToken() method, doing this:

charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);

In my test cases, this dominates the execution time.
The performance is horrible because, for every code point in the buffer, charUtils.ToLower
creates a one-element int array and a new string holding that code point. That string is then
lowercased and converted back to a code point:

        public static int ToLowerCase(int codePoint)
        {
            return Character.CodePointAt(
                UnicodeUtil.NewString(new int[1] { codePoint }, 0, 1).ToLowerInvariant(), 0);
        }

This creates heap pressure (due to the huge number of temporary int[1] and string objects
that fill up Gen0) and is highly inefficient because of the inner loops, for which the JIT
compiler isn't able to eliminate the bounds checks.
Yes, this is indeed what the Java code does, but in .NET the ToLowerInvariant method already
takes care of handling Unicode code points correctly, so I think we can replace the
charUtils.ToLower method with the following implementation:

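        public void ToLower(char[] buffer, int offset, int limit)
        {
            // limit is used as a character count here, which matches the call site (offset 0).
            Debug.Assert(buffer.Length >= limit);
            // offset must be a non-negative index into the buffer.
            Debug.Assert(offset >= 0 && offset <= buffer.Length);
            // Lowercase the whole range in one call, then copy the result back in place.
            new string(buffer, offset, limit).ToLowerInvariant().CopyTo(0, buffer, offset, limit);
        }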

This appears to do exactly the same thing, but much more efficiently. Internally, ToLowerInvariant
ultimately delegates to a native call (COMNlsInfo::InternalChangeCaseString), which uses the Windows
LCMapStringEx Win32 API and is orders of magnitude faster than anything we can write in managed
code, even taking the P/Invoke overhead and call setup costs into account.
After this change, the path through charUtils.ToLower no longer dominates the execution time.
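
For anyone who wants to check the "same thing, but faster" claim on their own machine, here is a minimal, self-contained sketch. It does not use the Lucene.Net classes: LowerCaseSketch, ToLowerPerCodePoint and ToLowerBulk are hypothetical names, and the per-code-point method is only a stand-in that mimics the allocation pattern of the current code using plain BCL calls. Timings will vary by runtime and input, so no numbers are implied.

```csharp
using System;
using System.Diagnostics;

public static class LowerCaseSketch
{
    // Stand-in for the current approach: one temporary string per code point.
    public static void ToLowerPerCodePoint(char[] buffer, int offset, int length)
    {
        int end = offset + length;
        for (int i = offset; i < end; )
        {
            // A code point occupies two chars when it is a valid surrogate pair.
            bool pair = char.IsHighSurrogate(buffer[i]) && i + 1 < end
                        && char.IsLowSurrogate(buffer[i + 1]);
            int count = pair ? 2 : 1;
            string lowered = new string(buffer, i, count).ToLowerInvariant();
            lowered.CopyTo(0, buffer, i, count);
            i += count;
        }
    }

    // Proposed approach: lowercase the whole range with a single call.
    public static void ToLowerBulk(char[] buffer, int offset, int length)
    {
        new string(buffer, offset, length).ToLowerInvariant().CopyTo(0, buffer, offset, length);
    }

    // Runs the given action repeatedly and returns the elapsed milliseconds.
    public static long Time(Action action, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++) action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        char[] source = "The QUICK Brown FOX \u00C0\u00C9\u00CE".ToCharArray();
        char[] a = (char[])source.Clone();
        char[] b = (char[])source.Clone();

        ToLowerPerCodePoint(a, 0, a.Length);
        ToLowerBulk(b, 0, b.Length);
        Console.WriteLine("Same result: " + (new string(a) == new string(b)));

        const int iterations = 1000000;
        long perCodePoint = Time(() => ToLowerPerCodePoint((char[])source.Clone(), 0, source.Length), iterations);
        long bulk = Time(() => ToLowerBulk((char[])source.Clone(), 0, source.Length), iterations);
        Console.WriteLine("Per code point: " + perCodePoint + " ms, bulk: " + bulk + " ms");
    }
}
```

The equality check only covers the sample text, so it is worth feeding it tokens from a real index before drawing conclusions. A proper profiler or benchmark harness would give more trustworthy numbers than the Stopwatch loop.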

Just sayin' <g>


Vincent Van Den Berghe
