lucenenet-dev mailing list archives

From NightOwl888 <...@git.apache.org>
Subject [GitHub] lucenenet issue #191: Migrating Lucene.Net to .NET Core
Date Mon, 12 Dec 2016 18:27:08 GMT
Github user NightOwl888 commented on the issue:

    https://github.com/apache/lucenenet/pull/191
  
    > In your next steps section, is there anything required of me?
    
    Just to get this updated with master if you get the chance before I do, but it looks like
you beat me to it.
    
    I am working on fixing the BreakIterator-related Highlighter tests now. Still have 7 that
are failing. I am trying to identify all of the issues, so this list may not be complete,
but this is what I have found so far.
    
    icu-dotnet Issues
    ----
    1. Sentence breaking does not work when the first word of a sentence is lower case. In Java,
both of the following inputs return the same result (the three sentences shown for the second call):
    
    ```
    ?Icu.BreakIterator.GetBoundaries(Icu.BreakIterator.UBreakIteratorType.SENTENCE, new Icu.Locale("en-US"), "test this is.  another sentence this test has.  far away is that planet.")
    Count = 1
        [0]: {Start: [0], End: [72]}
    ?Icu.BreakIterator.GetBoundaries(Icu.BreakIterator.UBreakIteratorType.SENTENCE, new Icu.Locale("en-US"), "Test this is.  Another sentence this test has.  Far away is that planet.")
    Count = 3
        [0]: {Start: [0], End: [15]}
        [1]: {Start: [15], End: [48]}
        [2]: {Start: [48], End: [72]}
    ```
    2. Sentence breaking is incorrect when there is a `\n` in the string. In this string:
`"any application that requires\nfull-text search, especially cross-platform. \nApache Lucene is an open source project available for free download."`
we expect the first sentence to end at offset 76 (just after the second `\n`), but icu-dotnet
ends it at offset 30 (just after the first `\n`).
    3. Word breaking splits hyphenated words instead of treating them as a single word.
For example, "high-performance" should be considered one word, not two.
    
    4. The ThaiWordBreaker class was added to work around another BreakIterator difference
from Java: in Java, Thai characters are broken into separate "words" when they are adjacent
to non-Thai characters. For example, "สวัสดีkrap" should break into "สวัสดี"
and "krap". Ideally, icu-dotnet would handle this, but the ThaiWordBreaker solution is
acceptable if that is unreasonable to do.
    5. If we are keeping the ThaiWordBreaker: I just ran the tests dealing with Thai numerals,
and my assumption that they should be broken just like Thai characters was incorrect. So
[these lines](https://github.com/conniey/lucenenet/blob/08453c16290465842866affa6f2fdd35517608b6/src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs#L235-L236)
should be changed to:
    ```
    isThai = char.IsLetter(c) && thaiPattern.IsMatch(c.ToString());
    isNonThai = char.IsLetter(c) && !isThai;
    ```
    You may also wish to change the variable names to `isThaiLetter`, `isNonThaiLetter`, etc.
to make this clearer in the code. You can use the following test to verify the results.
    
    ```
    [Test, LuceneNetSpecific]
    public void TestNumeralBreaking()
    {
        ThaiAnalyzer analyzer = new ThaiAnalyzer(TEST_VERSION_CURRENT, CharArraySet.EMPTY_SET);
        // Thai numerals followed by ASCII digits should come back as a single token.
        AssertAnalyzesTo(analyzer, "๑๒๓456", new string[] { "๑๒๓456" });
    }
    ```
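    
    As a rough illustration of the letter check in item 5, here is a minimal, self-contained
sketch. The `ThaiCharClassifier` class and its method names are hypothetical, and the
`\p{IsThai}` regex is only an assumption standing in for the tokenizer's `thaiPattern`;
this is not the actual ThaiTokenizer code.
    
    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical helper, for illustration only.
    public static class ThaiCharClassifier
    {
        // Assumption: stands in for the tokenizer's thaiPattern.
        // \p{IsThai} matches any character in the Thai Unicode block (U+0E00-U+0E7F).
        private static readonly Regex ThaiPattern = new Regex(@"\p{IsThai}");

        // True only for letters in the Thai block; Thai digits such as '๑' are excluded
        // because char.IsLetter returns false for decimal digits.
        public static bool IsThaiLetter(char c)
        {
            return char.IsLetter(c) && ThaiPattern.IsMatch(c.ToString());
        }

        // True only for letters outside the Thai block; ASCII digits are excluded too.
        public static bool IsNonThaiLetter(char c)
        {
            return char.IsLetter(c) && !IsThaiLetter(c);
        }
    }
    ```
    
    Under this classification, `ThaiCharClassifier.IsThaiLetter('ส')` and
`ThaiCharClassifier.IsNonThaiLetter('k')` are true, while both methods return false for
`'๑'` and `'4'`, so digits fall into neither category.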
    
    BreakIterator Dependencies
    ---
    
    Also, it seems like an easier path to set up the BreakIterator the way it was
in Java (as an abstract class), since it is passed as a method and constructor parameter
and is meant to be an extension point where you can design your own word breaking
if you need to customize the default ICU behavior. So, I ported the abstract `BreakIterator`
class and created a concrete `IcuBreakIterator` that wraps the icu-dotnet `BreakIterator`
static functions (with the ability to pass locale and "type" to the constructor). I am still
working on creating tests to verify the behavior against Java. Basically, the same "iterator"
logic that exists in Java to move forward, backward, or arbitrarily through the break points
is in this class.
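    
    To make the shape of this concrete, here is a minimal sketch of the idea, not the ported
code. The member names and the `BoundaryListBreakIterator` class are hypothetical. To stay
self-contained it takes a precomputed, sorted list of boundary offsets, where a real
`IcuBreakIterator` would presumably obtain them from icu-dotnet. The edge-case behavior
(for example, what `Following` does to the position when nothing is found) is a simplification
and has not been checked against Java.
    
    ```csharp
    using System;
    using System.Collections.Generic;

    // Sketch of an abstract extension point, loosely modeled on java.text.BreakIterator.
    public abstract class BreakIterator
    {
        public const int Done = -1;            // returned when there is no further boundary

        public abstract int Current { get; }   // boundary at the current position
        public abstract int First();           // move to the first boundary
        public abstract int Last();            // move to the last boundary
        public abstract int Next();            // move forward one boundary, or Done
        public abstract int Previous();        // move backward one boundary, or Done
        public abstract int Following(int offset); // first boundary after offset, or Done
    }

    // Hypothetical concrete iterator over precomputed boundaries (assumed sorted ascending).
    public sealed class BoundaryListBreakIterator : BreakIterator
    {
        private readonly int[] boundaries;
        private int index;

        public BoundaryListBreakIterator(IEnumerable<int> sortedBoundaries)
        {
            boundaries = new List<int>(sortedBoundaries).ToArray();
            if (boundaries.Length == 0)
                throw new ArgumentException("At least one boundary is required.");
        }

        public override int Current { get { return boundaries[index]; } }

        public override int First() { index = 0; return Current; }

        public override int Last() { index = boundaries.Length - 1; return Current; }

        public override int Next()
        {
            if (index >= boundaries.Length - 1) return Done;
            return boundaries[++index];
        }

        public override int Previous()
        {
            if (index == 0) return Done;
            return boundaries[--index];
        }

        public override int Following(int offset)
        {
            for (int i = 0; i < boundaries.Length; i++)
            {
                if (boundaries[i] > offset) { index = i; return boundaries[i]; }
            }
            return Done; // simplification: position is left unchanged when nothing follows
        }
    }
    ```
    
    Using the sentence boundaries from the second example above (0, 15, 48, 72),
`new BoundaryListBreakIterator(new[] { 0, 15, 48, 72 })` yields 0 from `First()`, then 15, 48, 72
from successive `Next()` calls, then `Done`; `Following(20)` returns 48.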
    
    However, since this class depends directly on icu-dotnet, we can't just put it into our
Support namespace without adding an icu-dotnet dependency to Lucene.Net.Core. And since other
parts of Lucene (SimpleCN, ICU, Analysis.Common, etc) depend on BreakIterator functionality,
it would be simpler to share this behavior if it were part of a common library. 
    
    While we could build our own, it would be an extra dependency that doesn't exist in Lucene.
Ideally, it should go in icu-dotnet (since in Java it was part of the JDK, which icu-dotnet
is emulating). If this functionality were in icu-dotnet, it would not just benefit the Lucene.Net
project, but could potentially make other projects easier to port from Java. WDYT?



