lucenenet-dev mailing list archives

From conniey <...@git.apache.org>
Subject [GitHub] lucenenet issue #191: Migrating Lucene.Net to .NET Core
Date Tue, 13 Dec 2016 07:24:40 GMT
Github user conniey commented on the issue:

    https://github.com/apache/lucenenet/pull/191
  
    1. Sentence breaking not working when first word of sentence is lower case.
        * According to the [sentence boundary rules](http://www.unicode.org/reports/tr29/#Sentence_Boundary_Rules)
that ICU follows, it is returning the correct sentence breaks. (This is defined in the section "Do
not break after full stop in certain contexts. [See note below.]").
    2. The response to 1 also applies here, where it is breaking prematurely on new lines.
    3. Word breaking is happening on hyphenated words instead of treating them as a single
word, for example, "high-performance" should be considered a single word, not 2 words.
        * According to the [word break rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules),
we are returning the expected behaviour. The visible hyphens in the text are breaking hyphens;
if a soft hyphen had been used instead, the word would not have been broken.
    4. "The ThaiWordBreaker class was added to work-around another BreakIterator difference
from Java - namely that in Java Thai characters were broken into separate "words" if adjacent
to non-Thai characters."
        * Unfortunately, this is due to the word breaking rules in ICU: it treats the adjacent
Thai and non-Thai characters as part of the same word.
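
To compare the word-breaking results above with the Java side, here is a minimal sketch using the JDK's `java.text.BreakIterator`. The class name `WordBoundaryDemo` and its helper methods are made up for illustration. The JDK's built-in rules are not identical to ICU's, so the boundaries it prints may differ from what icu-dotnet returns.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of Lucene.NET or icu-dotnet.
class WordBoundaryDemo {

    // Collects every boundary offset reported by the JDK word iterator.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    // Cuts the text at consecutive boundaries so the segments can be inspected.
    static List<String> segments(String text) {
        List<Integer> b = boundaries(text);
        List<String> result = new ArrayList<>();
        for (int i = 1; i < b.size(); i++) {
            result.add(text.substring(b.get(i - 1), b.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hard hyphen versus soft hyphen (U+00AD), as discussed in point 3.
        String hard = "high-performance search";
        String soft = "high\u00ADperformance search";
        System.out.println(segments(hard));
        System.out.println(segments(soft));
    }
}
```

Running the same inputs through both the JDK iterator and the ICU iterator should show exactly where the two rule sets disagree.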
    
    One way to address the points above is to use a RuleBasedBreakIterator created from a modified
copy of the default rules.  Would that work for Lucene.NET? I would have to add
a native method to icu-dotnet that calls [ubrk_openRules](http://icu-project.org/apiref/icu4c/ubrk_8h.html#a11826cb21213916c2d91579b673d8949)
to let you create a BreakIterator.  The default rules are here:
    
    * [Sentence rules](http://source.icu-project.org/repos/icu/tags/release-54-1/icu4c/source/data/brkitr/sent.txt)
    * [Word rules](http://source.icu-project.org/repos/icu/tags/release-54-1/icu4c/source/data/brkitr/word.txt)
    * [Blog post on creating custom rules](http://sujitpal.blogspot.com/2008/05/tokenizing-text-with-icu4js.html)
    
    5. I updated ThaiTokenizer with your code snippet and tested it against TestNumeralBreakages.
    
    RE: BreakIterator Dependencies
    
    * I agree that it should be an abstract class and have more functionality (i.e. moving
backwards and forwards), similar to its Java counterpart.  I'll see about writing a PR and
submitting it to [sillsdev/icu-dotnet](https://github.com/sillsdev/icu-dotnet) to see if they
will accept this feature.
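
For reference, this is roughly what moving forwards and backwards looks like on the Java counterpart, `java.text.BreakIterator`. It is only a sketch of the navigation an icu-dotnet abstract class could mirror. The class name `BidirectionalDemo` and its methods are hypothetical and do not describe any existing icu-dotnet API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; only java.text.BreakIterator is a real API here.
class BidirectionalDemo {

    // Walks from the start of the text to the end using first()/next().
    static List<Integer> forward(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    // Walks from the end of the text back to the start using last()/previous().
    static List<Integer> backward(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.last(); pos != BreakIterator.DONE; pos = it.previous()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This is one sentence. Here is another one.";
        System.out.println("forward:  " + forward(text));
        System.out.println("backward: " + backward(text));
    }
}
```

Both directions visit the same boundaries, just in opposite order, which is the behaviour a ported abstract class would need to preserve.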


