lucenenet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From laimis <...@git.apache.org>
Subject [GitHub] lucenenet pull request: different approach to recasing test string...
Date Mon, 29 Dec 2014 02:30:01 GMT
GitHub user laimis opened a pull request:

    https://github.com/apache/lucenenet/pull/31

    different approach to recasing test strings

    This fixes a bunch of test failures with tests in Lucene.Net.Analysis that look something
like this:
    
    Lengths are not the same: mixedUp = 䧢ൄ, length = 3, sb = 񤧢ൄ, length = 4
      Expected: True
      But was:  False
    
    Lengths are not the same: mixedUp = ۀ迥ȟ, length = 6, sb = ۀ񘿥Ȟ,
length = 7
      Expected: True
      But was:  False
    
    The issue is with a code path that is invoked only in certain conditions from here: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.TestFramework/Util/TestUtil.cs#L1415
    
    When debugging the tests, it became apparent that the chars were "dropped" in the re-casing
function (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.TestFramework/Util/TestUtil.cs#L707).
If the randomly generated string contained Unicode surrogate pairs, the resulting string was
shorter and did not correctly match the source string (besides the casing part).
    
    The drop happens because casts of the code point that represents a surrogate pairs to
char (eg. (char)codePoint) is inaccurate, the value is too large for 16 bits. The conversion
ends up changing the re-cased char with a completely different character. And then only one
character gets appended to the destination string builder when the source used two chars (high
and low surrogate), thus the resulting string is shorter by one or more chars (depending on
how many surrogate pairs were present in the source string).
    
    I looked at the Java implementation (https://github.com/apache/lucene-solr/blob/lucene_solr_4_8_0/lucene/test-framework/src/java/org/apache/lucene/util/TestUtil.java#L543)
and it makes sense why it works there. Java's Character toUpper / toLower handle code points
appropriately, without casting them to chars. It then also uses StringBuilder's appendCodePoint
and not append to make sure that code point for surrogate pairs results in two characters
being appended.
    
    I considered two options for the implementation. One which contained extension method
for StringBuilder for AppendCodePoint which would add two appropriate chars for surrogate
pair code point but then still had issues with how to appropriately deal with to Lower / Upper
methods for all the possible code points. Java's Character version has implementation for
properly selecting unicode planes that the code point belongs to and each plane has an appropriate
implementation of those methods. Such functionality does not appear to be exposed in .NET,
at least publicly.
    
    So I settled on the implementation that you can see in this branch. Basically it detects
if the next char pair is a surrogate pair and then takes them both for possible casing function
and appending to the string builder. All tests in Analysis pass after this change.
    
    I renamed the function to appropriately reflect that code points are not used anymore.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/laimis/lucenenet recase_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucenenet/pull/31.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #31
    
----
commit 6cfc4fe83ef1a1958e311b34fb86a159782ef2e5
Author: Laimonas Simutis <laimis@gmail.com>
Date:   2014-12-29T02:24:12Z

    different approach to recasing test strings

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message