lucenenet-dev mailing list archives

From "Oren Eini (Ayende Rahien)" <aye...@ayende.com>
Subject Re: Proposal to speed up implementation of LowercaseFilter/charUtils.ToLower
Date Wed, 28 Dec 2016 09:27:06 GMT
Why not?

public void ToLower(char[] buffer, int offset, int limit)
        {
            Debug.Assert(buffer.Length >= limit);
            Debug.Assert(offset >= 0 && offset <= buffer.Length);
            for (int i = offset; i < limit; i++)
            {
                buffer[i] = char.ToLowerInvariant(buffer[i]);
            }
        }

This does zero allocations altogether.

Better yet, see:
https://github.com/ravendb/ravendb/blob/v4.0/src/Raven.Server/Documents/DocumentKeyWorker.cs#L35

for (var i = 0; i < key.Length; i++)
{
    var ch = key[i];
    if (ch > 127) // not ASCII, use slower mode
        goto UnlikelyUnicode;
    if ((ch >= 65) && (ch <= 90)) // 'A'..'Z'
        buffer[i] = (byte) (ch | 0x20); // setting bit 5 yields the lowercase letter
    else
        buffer[i] = (byte) ch;
}

For the common case of ASCII chars, this is much more efficient.
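Putting the two snippets together, a self-contained char[] version might look like the sketch below. Note the UnlikelyUnicode fallback here is my own stand-in (it simply defers to char.ToLowerInvariant), not the actual RavenDB code, and the method name is illustrative:

```csharp
using System;

// Sketch: ASCII fast path for in-place lowercasing of a char buffer,
// falling back to char.ToLowerInvariant once a non-ASCII char is seen.
static void ToLowerAsciiFast(char[] buffer, int offset, int limit)
{
    for (int i = offset; i < limit; i++)
    {
        char ch = buffer[i];
        if (ch > 127) // not ASCII, use slower mode for the remainder
        {
            for (; i < limit; i++)
                buffer[i] = char.ToLowerInvariant(buffer[i]);
            return;
        }
        if (ch >= 'A' && ch <= 'Z')
            buffer[i] = (char)(ch | 0x20); // bit 5 flips A-Z to a-z
    }
}

var buf = "Hello, WORLD!".ToCharArray();
ToLowerAsciiFast(buf, 0, buf.Length);
Console.WriteLine(new string(buf)); // hello, world!
```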


Hibernating Rhinos Ltd

Oren Eini | CEO | Mobile: +972-52-548-6969

Office: +972-4-622-7811 | Fax: +972-153-4-622-7811



On Wed, Dec 28, 2016 at 10:46 AM, Van Den Berghe, Vincent <
Vincent.VanDenBerghe@bvdinfo.com> wrote:

> Hi,
>
> I've been doing performance measurements using the latest Lucene.net, and
> profiling with the standard English analyzer (and any analyzer with a
> lowercase filter) shows that a LOT of time is spent in the
> LowerCaseFilter.IncrementToken() method, doing this:
>
> charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);
>
> In my test cases, this dominates the execution time.
> The performance is horrible: inside charUtils.ToLower, for every code
> point in the buffer a one-element int array and a new string containing the
> string representation of that code point are created; the string is then
> lowercased and converted back:
>
> public static int ToLowerCase(int codePoint)
> {
>     return Character.CodePointAt(
>         UnicodeUtil.NewString(new int[1] { codePoint }, 0, 1)
>             .ToLowerInvariant(), 0);
> }
>
> This creates heap pressure (due to the huge amount of temporary int[1] and
> string objects that fill up Gen0) and is highly inefficient because of the
> inner loops for which the C# compiler isn't able to eliminate the bounds
> checks.
> Yes, this is indeed what the Java code does, but in .NET the
> ToLowerInvariant method already handles Unicode code points correctly,
> so I think we can replace the charUtils.ToLower method with the
> following implementation:
>
>         public void ToLower(char[] buffer, int offset, int limit)
>         {
>             Debug.Assert(buffer.Length >= limit);
>             Debug.Assert(offset >= 0 && offset <= buffer.Length);
>             new string(buffer, offset, limit).ToLowerInvariant().CopyTo(0,
> buffer, offset, limit);
>         }
>
> This appears to do exactly the same thing, but much more efficiently.
> Internally, ToLowerInvariant ultimately delegates to a native call
> (COMNlsInfo::InternalChangeCaseString) which uses Windows's LCMapStringEx
> Win32 API and is orders of magnitude faster than anything we can write in
> managed code, even taking the P/Invoke overhead and call setup costs into
> account.
> After this change, the path through charUtils.ToLower no longer dominates
> the execution time.
>
> Just sayin' <g>
>
>
> Vincent Van Den Berghe
>
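Spelled out as a runnable sketch (the harness and variable names below are mine, not from the original message), the replacement Vincent proposes makes one temporary string per call instead of an int[1] plus a string per code point. I treat limit as an end index, so the copied length is limit - offset; this is identical to the version in the message when offset == 0, which is how LowerCaseFilter calls it:

```csharp
using System;
using System.Diagnostics;

// One-shot lowercasing: build a string over the range, lowercase it once,
// copy the result back into the buffer in place.
static void ToLower(char[] buffer, int offset, int limit)
{
    Debug.Assert(buffer.Length >= limit);
    Debug.Assert(offset >= 0 && offset <= buffer.Length);
    new string(buffer, offset, limit - offset)
        .ToLowerInvariant()
        .CopyTo(0, buffer, offset, limit - offset);
}

var term = "QUICK Brown FOX".ToCharArray();
ToLower(term, 0, term.Length);
Console.WriteLine(new string(term)); // quick brown fox
```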
