lucenenet-dev mailing list archives

From "Shad Storhaug (JIRA)" <>
Subject [jira] [Closed] (LUCENENET-573) Make IcuBreakIterator more like the JDK's BreakIterator.getInstance()
Date Tue, 13 Aug 2019 05:34:00 GMT


Shad Storhaug closed LUCENENET-573.
    Resolution: Won't Fix

Rather than trying to patch the ICU {{BreakIterator}} to match the JDK, the more logical
choice is to embrace the default behavior supplied by ICU. ICU provides the means for the end
user to supply custom rules, so we shouldn't worry that not all of Lucene's tests pass with
this behavior. We only need to confirm that the default can be overridden and that our ICU4N
{{BreakIterator}} matches the behavior of ICU4J.

Java tests were created based on the ICU {{BreakIterator}}'s default behavior, and then ported
back to C# to confirm the two match. A mock {{JdkBreakIterator}} with custom rules was also
created to stand in for the ICU4N {{BreakIterator}}, confirming that we can change ICU4N to
match the JDK.
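
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how boundary offsets can be recorded from the JDK's {{BreakIterator}} so they can be compared against another implementation. It is not the test code from this issue: the class name {{JdkBoundaryDump}}, the helper {{boundaries}}, and the sample sentence are made up for illustration, and it only uses the standard {{java.text.BreakIterator}} API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.Net, ICU4N, or ICU4J.
public class JdkBoundaryDump {

    // Collects every boundary offset the iterator reports for the text,
    // starting at the first boundary and stopping when the iterator is done.
    static List<Integer> boundaries(BreakIterator iterator, String text) {
        iterator.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int pos = iterator.first(); pos != BreakIterator.DONE; pos = iterator.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        String text = "Hello world. How are you?";
        BreakIterator jdk = BreakIterator.getSentenceInstance(Locale.ROOT);
        // Print the offsets so they can be compared with another implementation's output.
        System.out.println(boundaries(jdk, text));
    }
}
```

The same loop should work unchanged for word or line instances, since it relies only on {{first()}}, {{next()}}, and {{BreakIterator.DONE}}. Running it over identical inputs on each side gives two offset lists that can be compared directly.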

> Make IcuBreakIterator more like the JDK's BreakIterator.getInstance()
> ---------------------------------------------------------------------
>                 Key: LUCENENET-573
>                 URL:
>             Project: Lucene.Net
>          Issue Type: Improvement
>          Components: Lucene.Net.ICU
>    Affects Versions: Lucene.Net 4.8.0
>            Reporter: Shad Storhaug
>            Priority: Major
> The IcuBreakIterator is a wrapper around the icu-dotnet library. It implements the JDK
> BreakIterator business logic that was previously missing there, but has since been added in
> the form of a RuleBasedBreakIterator. IcuBreakIterator is utilized by Lucene.Net.Analysis.Common.Th.ThaiAnalyzer,
> Lucene.Net.Highlighter.PostingsHighlight, and Lucene.Net.Highlighter.VectorHighlight. While
> all of the tests are passing for these components, it is primarily because of hacks that were
> added as workarounds. In reality, the functionality of IcuBreakIterator has many rule-based
> differences that make its breaking text behavior act quite differently than the JDK.
> We need to investigate whether the RuleBasedBreakIterator in icu-dotnet can be utilized
> as is, or if it can be improved to more closely emulate the BreakIterator functionality in
> the JDK.

This message was sent by Atlassian JIRA
