lucenenet-dev mailing list archives

From Shad Storhaug <s...@shadstorhaug.com>
Subject RE: Collation
Date Tue, 06 Sep 2016 18:05:50 GMT
Yes, I got this email, but not the original reply from Christopher - thanks for keeping me
in the loop. I added the dev@lucenenet.apache.org email to my safe senders list a couple of
days ago but apparently that wasn't sufficient, so I have added both of your email addresses
on this email and Christopher's as well - hopefully that will suffice.

If you have replied to any of my other emails, please forward them to me once again (I didn't
get any replies).

And yes, another snag I realized about our current incarnation of ICU is that there is no
way to strong name Analysis.Common because ICU4NET is not strong named. If memory serves correctly,
it is not possible to strong name assemblies that depend on unmanaged resources - if that
is true, it is a strong case to eliminate ICU4NET sooner rather than later. Strong named assemblies
cannot depend on assemblies that are not strong named, so it is pretty critical for an open
source project to support strong naming for those projects that require it.

I took a look at icu-dotnet and it looks like it has all of the pieces we need for Collation
(at first glance anyway). However, the "BreakIterator" in there is a static class (one that does
not even implement IEnumerator) and it doesn't do what we need. What we need is an enumerator
that can work out where one word ends and the next begins in Thai - a language that doesn't use
spaces to delineate words. Their "BreakIterator" uses spaces and/or punctuation to determine
word breaks. I haven't looked under the hood much on ICU4NET, but I know separating Thai words
programmatically is a very complex problem. If there are no other working BreakIterator implementations
available for .NET, it seems like the next best option to get a working, .NET Core-compatible
implementation would be to port it from Java.
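
For reference, this is roughly the Java behavior a port would need to reproduce. It is only a
minimal sketch using the JDK's java.text.BreakIterator; the class and method names
(WordBreakSketch, words) are made up for illustration and are not Lucene or icu-dotnet APIs. Whether
a Thai locale yields dictionary-based word breaks depends on the break data shipped with the
runtime, so the sample below only exercises English text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class WordBreakSketch {

    /** Collects the word-like segments that BreakIterator reports for the given locale. */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; keep only segments
            // that contain at least one letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("hello big world", Locale.ENGLISH)); // [hello, big, world]
    }
}
```

The same call with a Thai locale is where the hard part lives: the iterator has to find boundaries
without any spaces to rely on.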

That being said, in terms of importance Collation is much higher because it is a cross-culture
feature. BreakIterator is used in Lucene for the Thai analyzer, and for the text highlighter
for all other languages. In a pinch we could get away with breaking on spaces and punctuation
for cross-culture support in the highlighter and simply not supporting Thai (either in the
highlighter or the analyzer) on .NET Core. This seems like the most reasonable tradeoff given
how many man-hours it would take to support Thai on .NET Core, as well as how low on the totem
pole Thai is in terms of world languages (and I am sitting in Thailand as I write this). Perhaps
it might even make sense to move the Analysis.Th namespace, and the other parts that need a
Thai-capable BreakIterator (such as the text highlighter), into a separate NuGet package for
.NET 4.6 so the BreakIterator dependency is isolated to just that package. The rest can then
compile and deploy on both .NET 4.6 and .NET Core with a stripped-down version of BreakIterator
that works in most languages other than Thai.
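
To make the "stripped down" fallback concrete, here is a minimal sketch of breaking on whitespace
and punctuation only. It is written in Java to mirror the code being ported; SimpleBreakSketch and
split are hypothetical names, not anything that exists in Lucene. It also shows why this approach
cannot support Thai: text without spaces comes back as a single token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical fallback, for illustration only.
class SimpleBreakSketch {

    // One or more whitespace or punctuation characters act as a delimiter.
    // UNICODE_CHARACTER_CLASS makes \s cover Unicode whitespace, not just ASCII.
    private static final Pattern DELIMITERS =
            Pattern.compile("[\\s\\p{P}]+", Pattern.UNICODE_CHARACTER_CLASS);

    /** Splits text into tokens at runs of whitespace and punctuation. */
    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        for (String token : DELIMITERS.split(text)) {
            // A leading delimiter produces an empty first token; skip it.
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello, world! Foo-bar")); // [Hello, world, Foo, bar]

        // A Thai phrase (written with Unicode escapes) contains no spaces,
        // so the whole phrase is returned as one token.
        String thai = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println(split(thai).size()); // 1
    }
}
```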

Frankly, I would personally like to see Lucene.Net 4.8 released before Lucene 7.0 is released
rather than having everyone bend over backwards to try to fit Thai language support into .NET
Core/Azure.

> Is this something that we should wait for so that the migration of the 
> Collation namespace is a more direct port, or should we go ahead with 
> trying to use the .NET classes? I just want to make sure that we are 
> not changing the internal workings of these classes so much that they 
> don't work the same as their Java counterparts. The piece that I kept 
> getting hung up on was the RuleBasedCollator which icu-dotnet has a 
> direct port of (along with Collator and Locale).

I'd say first set a reference to icu-dotnet and see if you can get all of the collator tests
to pass. If so, then bring the classes over into our Support namespace. If not, then continue
down the path I started - I think there are about 5 or 6 more dependencies that will be required
to get all of the Collation pieces from Java (that is, if the Enumerator I mentioned before
doesn't work), and there should be some way to plug in your own implementation of RuleBasedCollator
on a per-locale basis. My thought was to port it over as a mostly Java style implementation,
get the tests passing, and then start swapping out the pieces like the SortKey and (possibly)
subclassing CompareInfo.
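
As a rough picture of the Java side being ported, the sketch below uses the JDK's
java.text.Collator, RuleBasedCollator and CollationKey (the key plays a role similar to what SortKey
does in .NET). The per-locale registry is purely an assumption about how a plug-in point could look;
CollatorRegistrySketch, register, forLocale and fromRules are hypothetical names, and the rule
string is a toy example rather than real locale data.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-locale plug-in point, for illustration only.
class CollatorRegistrySketch {

    private static final Map<Locale, Collator> OVERRIDES = new ConcurrentHashMap<>();

    /** Registers a custom collator to use for one locale. */
    static void register(Locale locale, Collator collator) {
        OVERRIDES.put(locale, collator);
    }

    /** Returns the registered collator for the locale, or the JDK default if none is registered. */
    static Collator forLocale(Locale locale) {
        Collator custom = OVERRIDES.get(locale);
        return custom != null ? custom : Collator.getInstance(locale);
    }

    /** Builds a RuleBasedCollator, converting the checked ParseException to an unchecked one. */
    static Collator fromRules(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // Toy rules: sort "c" before "b" before "a".
        register(Locale.ROOT, fromRules("< c < b < a"));
        Collator collator = forLocale(Locale.ROOT);

        System.out.println(collator.compare("c", "a") < 0); // true

        // Collation keys give the same ordering and can be precomputed for sorting.
        CollationKey first = collator.getCollationKey("c");
        CollationKey second = collator.getCollationKey("a");
        System.out.println(first.compareTo(second) < 0); // true
    }
}
```

A port that starts in this Java style could later swap the collation key for SortKey without
changing how callers look up a collator.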

Either way, I think that bringing/porting the code over into Lucene.Net is a better option
than setting a reference to a library, so that we have better control over how .NET Core-compatible
the code is and don't take on another dependency.

Thanks,
Shad Storhaug/NightOwl888

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On Behalf Of Itamar
Syn-Hershko
Sent: Tuesday, September 6, 2016 9:10 PM
To: dev@lucenenet.apache.org
Subject: Re: Collation

Just a heads up - I tried reaching out to Shad privately and mails to him bounce.. hopefully
he can see this :)

Collation and ICU both sound quite painful - would love to see us reducing our dependencies
on that front. I already got reports of our current ICU deps not playing along with Azure.

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Sun, Sep 4, 2016 at 10:30 PM, Christopher Haws <cribs2@gmail.com> wrote:

> @NightOwl888
>
> No problem. I had a pretty busy week at work so I wasn't able to work 
> on it during the week. I came to the same conclusions as you regarding 
> CompareInfo, SortKey, and CultureInfo being .NET's closest equivalent 
> to Java's Collation and Locale classes.
>
> Something that I did find while looking through the dev mailing list 
> is that Connie Yau, from Microsoft, has replaced ICU4NET with 
> icu-dotnet in their port to .NET Core.
>
> http://mail-archives.apache.org/mod_mbox/lucenenet-dev/201605.mbox/%3CCY1PR0301MB0761AE82FE1401AD03CB36E4B84B0%40CY1PR0301MB0761.namprd03.prod.outlook.com%3E
>
> https://github.com/conniey/lucenenet
>
> Is this something that we should wait for so that the migration of the 
> Collation namespace is a more direct port, or should we go ahead with 
> trying to use the .NET classes? I just want to make sure that we are 
> not changing the internal workings of these classes so much that they 
> don't work the same as their Java counterparts. The piece that I kept 
> getting hung up on was the RuleBasedCollator which icu-dotnet has a 
> direct port of (along with Collator and Locale).
>
> icu-dotnet: https://github.com/sillsdev/icu-dotnet
>
> Let me know what you think.
>
> Thanks!
> Christopher Haws
>