lucenenet-dev mailing list archives

From "Van Den Berghe, Vincent" <Vincent.VanDenBer...@bvdinfo.com>
Subject RE: Proposal to speed up implementation of LowercaseFilter/charUtils.ToLower
Date Wed, 28 Dec 2016 09:53:17 GMT
I don't think a per-character call to char.ToLowerInvariant is equivalent to a call to
string.ToLowerInvariant() on a string built from the character array. The two only agree
if every char in the array represents a complete code point.

A single char corresponds to exactly one code point only for characters from the
BMP (Basic Multilingual Plane). Outside the BMP, per-character code is invalid: a character
may be the first half of a surrogate pair, where the character following it belongs to the
same code point. If you lowercase character by character, such cases may be converted incorrectly.
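
To make the difference concrete, here is a quick illustration (not taken from Lucene.NET;
it assumes the platform's casing tables cover the supplementary planes, as recent Windows
versions and ICU do). U+10400 DESERET CAPITAL LETTER LONG I lowercases to U+10428:

string s = char.ConvertFromUtf32(0x10400);       // surrogate pair "\uD801\uDC00"
string whole = s.ToLowerInvariant();             // "\uD801\uDC28", i.e. U+10428: correct

char[] chars = s.ToCharArray();
for (int i = 0; i < chars.Length; i++)
    chars[i] = char.ToLowerInvariant(chars[i]);  // surrogate halves have no case mapping, so nothing changes
string perChar = new string(chars);              // still U+10400: the conversion was missed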

See https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx and https://msdn.microsoft.com/en-us/library/system.char.convertfromutf32(v=vs.110).aspx
for some background on the matter. .NET calls two distinct APIs for converting strings and
characters: one is not implemented in terms of the other.

You indirectly bring up an excellent point: my code assumes that the string constructed from
the char[] is in canonical composition. I think this is the case, but I'm not familiar enough
with Lucene's handling of characters to be sure. Otherwise, one needs to insert a call to
Normalize(NormalizationForm.FormC) before the call to ToLowerInvariant(). But that may be overkill.
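
If it were needed, it would look something like this (untested sketch; note that NFC
normalization can change the string length, so the result could no longer be blindly
copied back into the original buffer):

using System.Text; // NormalizationForm lives here

string s = new string(buffer, offset, limit - offset);
string lowered = s.Normalize(NormalizationForm.FormC).ToLowerInvariant();
// lowered.Length may differ from (limit - offset) after normalization,
// so the caller would have to grow the term buffer instead of doing an in-place CopyTo.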


Vincent


-----Original Message-----
From: Oren Eini (Ayende Rahien) [mailto:ayende@ayende.com] 
Sent: Wednesday, December 28, 2016 10:27 AM
To: dev@lucenenet.apache.org
Subject: Re: Proposal to speed up implementation of LowercaseFilter/charUtils.ToLower

Why not?

public void ToLower(char[] buffer, int offset, int limit)
{
    Debug.Assert(buffer.Length >= limit);
    Debug.Assert(offset >= 0 && offset <= buffer.Length);
    for (int i = offset; i < limit; i++)
    {
        buffer[i] = char.ToLowerInvariant(buffer[i]);
    }
}

This does zero allocations altogether.
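
Usage would be something like (illustrative only):

var term = "HeLLo WORLD".ToCharArray();
ToLower(term, 0, term.Length);
// term now contains "hello world", with no intermediate string allocated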

Better yet, see:
https://github.com/ravendb/ravendb/blob/v4.0/src/Raven.Server/Documents/DocumentKeyWorker.cs#L35

for (var i = 0; i < key.Length; i++)
{
    var ch = key[i];
    if (ch > 127) // not ASCII, use slower mode
        goto UnlikelyUnicode; // label defined further down in the linked file
    if ((ch >= 65) && (ch <= 90)) // 'A'..'Z'
        buffer[i] = (byte)(ch | 0x20); // setting bit 5 maps 'A'..'Z' to 'a'..'z'
    else
        buffer[i] = (byte)ch;
}

For the common case of ASCII chars, this is much more efficient.
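
If you want to convince yourself about the | 0x20 trick: in ASCII, the lowercase
letters are exactly the uppercase ones with bit 5 set.

Console.WriteLine((char)('A' | 0x20)); // 'A' == 0x41, 0x41 | 0x20 == 0x61 == 'a'
Console.WriteLine((char)('Z' | 0x20)); // 'Z' == 0x5A, 0x5A | 0x20 == 0x7A == 'z'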


Hibernating Rhinos Ltd

Oren Eini | CEO | Mobile: +972-52-548-6969

Office: +972-4-622-7811 | Fax: +972-153-4-622-7811



On Wed, Dec 28, 2016 at 10:46 AM, Van Den Berghe, Vincent <Vincent.VanDenBerghe@bvdinfo.com>
wrote:

> Hi,
>
> I've been doing performance measurements using the latest Lucene.Net,
> and profiling with the standard English analyzer (and every analyzer
> with a lowercase filter) indicates that a LOT of time is spent in the
> LowerCaseFilter.IncrementToken() method, doing this:
>
> charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);
>
> In my test cases, this dominates the execution time.
> The performance is horrible: inside charUtils.ToLower, for every code
> point in the buffer a one-element int array and a new string containing
> the string representation of that code point are created; the string is
> then lowercased and converted back:
>
> public static int ToLowerCase(int codePoint)
> {
>     return Character.CodePointAt(
>         UnicodeUtil.NewString(new int[] { codePoint }, 0, 1).ToLowerInvariant(), 0);
> }
>
> This creates heap pressure (the huge number of temporary int[1] and
> string objects fills up Gen0) and is highly inefficient because of the
> inner loops whose bounds checks the C# compiler cannot eliminate.
> Yes, this is indeed what the Java code does, but in .NET the
> ToLowerInvariant method already handles Unicode code point parsing
> correctly, so I think we can replace the charUtils.ToLower method
> with the following implementation:
>
>         public void ToLower(char[] buffer, int offset, int limit)
>         {
>             Debug.Assert(buffer.Length >= limit);
>             Debug.Assert(offset >= 0 && offset <= buffer.Length);
>             new string(buffer, offset, limit - offset)
>                 .ToLowerInvariant()
>                 .CopyTo(0, buffer, offset, limit - offset);
>         }
>
> This appears to do exactly the same thing, but much more efficiently.
> Internally, ToLowerInvariant ultimately delegates to a native call
> (COMNlsInfo::InternalChangeCaseString) which uses Windows's
> LCMapStringEx Win32 API and is orders of magnitude faster than
> anything we can write in managed code, even taking the P/Invoke
> overhead and call setup costs into account.
> After this change, the path through charUtils.ToLower no longer
> dominates the execution time.
>
> Just sayin' <g>
>
>
> Vincent Van Den Berghe
>