lucenenet-dev mailing list archives

From Shad Storhaug <s...@shadstorhaug.com>
Subject Collation
Date Sun, 04 Sep 2016 12:13:40 GMT
@ChristopherHaws

It seems we both missed each other's reply. I made some progress on this over the last couple
of days, but since your reply was 8 days ago you could be much further along than I am.
I have ported a dozen dependencies of Collator over. I got to the point where I noticed that
CollationElementIterator depends on NormalizerBase for most of its functionality, and that
NormalizerBase in turn supports a lot of the ICU functionality. I was about to contact you
to find out whether you have already ported NormalizerBase for ICU when I saw your reply.

Here is the latest (incomplete and non-compiling): https://github.com/NightOwl888/lucenenet/commit/1196ca4d038ae89a8fa25b04609a7b0b768ee833
(also this commit is needed: https://github.com/NightOwl888/lucenenet/commit/33678e89111152220769adee12c08e1cc86bead7).

Some info you might find helpful:


1. The rough equivalent of Collator in .NET is the System.Globalization.CompareInfo
class. An instance of CompareInfo is created when you new up a CultureInfo object and
is available through the CultureInfo.CompareInfo property.

2. The equivalent of CollationKey in .NET is the System.Globalization.SortKey class,
which is what is returned from CompareInfo.GetSortKey().

3. System.Globalization.CompareInfo won't completely replace the Collator class because
it has some state (strength and decomposition) that CompareInfo doesn't have. Collator can
(and probably should) subclass CompareInfo to add this state.

4. There doesn't appear to be a .NET equivalent of RuleBasedCollator, and it appears
that the only reason why we need a Collator extension point is to support custom rules over
and above what is already available in System.Globalization.CultureInfo.CompareInfo.

5. To support custom Collators, we need an ICollatorProvider interface. However, the
LocaleProviderAdapter appears to be roughly equivalent to what happens when you new up CultureInfo,
so we probably don't need it. There needs to be a mapping of CultureInfo to ICollatorProvider.
It might be possible to extend .NET to do so, but failing that there could just be a static
dictionary to map these. If the dictionary doesn't have the specific mapping, there could
be a default provider that creates a default RuleBasedCollator with a rule string matching
the culture (I haven't figured out where these rule strings are in Java - probably in resource
files). Basically, leave the dictionary empty unless someone wants to add custom collator providers
for specific cultures, and fall back on our defaults. There should probably be a way to set
the custom collator providers in code (statically at application startup) as well as in configuration
files.

6. The System.Globalization.TextElementEnumerator might be usable for functionality
similar to the CollationElementIterator. See https://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator(v=vs.110).aspx.
That said, if you want to port over ICU now you might be better off using the TextElementEnumerator
with the NormalizerBase class.

7. .NET supports normalization in the form "somestring".Normalize(NormalizationForm).

8. It will probably make sense to have an extension method on CultureInfo named GetCollator()
to make it handy to get the collator that is associated with the CultureInfo instance. Replacing
the CultureInfo.CompareInfo property with an instance of Collator with custom rules feels
wrong, since it could break functionality of applications that are currently using that property.
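
As a rough illustration of items 1, 2, 6 and 7, here is a minimal sketch that only calls
APIs already in System.Globalization and System.Text. The class and method names
(BuiltInCollationSketch, CompareBySortKey, and so on) are made up for this example and are
not part of the port.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper class; only the System.* calls are real framework APIs.
public static class BuiltInCollationSketch
{
    // Item 1: CompareInfo plays the role of Collator. Note that the options
    // are passed per call rather than stored as state (see item 3).
    public static int Compare(CultureInfo culture, string a, string b, CompareOptions options)
    {
        return culture.CompareInfo.Compare(a, b, options);
    }

    // Item 2: SortKey plays the role of CollationKey.
    public static int CompareBySortKey(CultureInfo culture, string a, string b)
    {
        SortKey keyA = culture.CompareInfo.GetSortKey(a);
        SortKey keyB = culture.CompareInfo.GetSortKey(b);
        return SortKey.Compare(keyA, keyB);
    }

    // Item 7: canonical decomposition through string.Normalize.
    public static string Decompose(string s)
    {
        return s.Normalize(NormalizationForm.FormD);
    }

    // Item 6: walk a string one text element (base character plus any
    // combining marks) at a time.
    public static List<string> TextElements(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            result.Add(e.GetTextElement());
        }
        return result;
    }
}
```

For example, Compare(CultureInfo.InvariantCulture, "abc", "ABC", CompareOptions.IgnoreCase)
should report the two strings as equal. The exact ordering returned by CompareInfo and
SortKey can differ between cultures and platforms, so treat any specific result as something
to verify rather than assume.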

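Items 3, 5 and 8 could fit together roughly as follows. This is only a sketch under several
assumptions: Collator here is a stripped-down stand-in for the ported class;
CompareInfoCollator, CollatorStrength, DefaultCollatorProvider and CollatorProviders are
hypothetical names; the mapping from strength to CompareOptions is a crude approximation;
and the default provider wraps CompareInfo rather than building a RuleBasedCollator from a
rule string, since that class has not been ported yet.

```csharp
using System;
using System.Collections.Concurrent;
using System.Globalization;

// Hypothetical stand-in for the ported Collator; only Compare is sketched.
public abstract class Collator
{
    public abstract int Compare(string a, string b);
}

public enum CollatorStrength { Primary, Secondary, Tertiary }

// Assumption: keeps the strength as state (item 3) and delegates to CompareInfo.
public sealed class CompareInfoCollator : Collator
{
    private readonly CompareInfo compareInfo;

    public CollatorStrength Strength { get; set; } = CollatorStrength.Tertiary;

    public CompareInfoCollator(CultureInfo culture)
    {
        compareInfo = culture.CompareInfo;
    }

    public override int Compare(string a, string b)
    {
        return compareInfo.Compare(a, b, ToOptions(Strength));
    }

    // Crude approximation of the Java strength levels, not an exact match.
    private static CompareOptions ToOptions(CollatorStrength strength)
    {
        switch (strength)
        {
            case CollatorStrength.Primary:
                return CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;
            case CollatorStrength.Secondary:
                return CompareOptions.IgnoreCase;
            default:
                return CompareOptions.None;
        }
    }
}

// Item 5: extension point for custom collators.
public interface ICollatorProvider
{
    Collator GetCollator(CultureInfo culture);
}

public sealed class DefaultCollatorProvider : ICollatorProvider
{
    public Collator GetCollator(CultureInfo culture)
    {
        return new CompareInfoCollator(culture);
    }
}

public static class CollatorProviders
{
    // Empty unless someone registers a provider for a specific culture.
    private static readonly ConcurrentDictionary<string, ICollatorProvider> providers =
        new ConcurrentDictionary<string, ICollatorProvider>(StringComparer.OrdinalIgnoreCase);

    private static readonly ICollatorProvider fallback = new DefaultCollatorProvider();

    // Intended to be called statically at application startup.
    public static void Register(CultureInfo culture, ICollatorProvider provider)
    {
        providers[culture.Name] = provider;
    }

    // Item 8: extension method, leaving CultureInfo.CompareInfo untouched.
    public static Collator GetCollator(this CultureInfo culture)
    {
        ICollatorProvider provider;
        if (!providers.TryGetValue(culture.Name, out provider))
        {
            provider = fallback;
        }
        return provider.GetCollator(culture);
    }
}
```

With this in place, new CultureInfo("en-US").GetCollator() would return the default collator
until a custom provider is registered for that culture. The sketch wraps CompareInfo instead
of subclassing it, since CompareInfo has no public constructor and subclassing it from outside
the framework may not be practical.
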
For now, I will quit working on Collation. Hopefully, you can use some of this to finish the
implementation.

Thanks,
Shad Storhaug/NightOwl888


@NightOwl888<https://github.com/NightOwl888> I didn't see your message until just now.
I was working on converting from Locale to CultureInfo. I made a stub class for Collator,
but this should probably be changed to an abstract class with a default implementation or
removed entirely.
Anyway, 9 out of 14 of the unit tests are succeeding. I am off to bed for now.
ChristopherHaws@c11205d<https://github.com/ChristopherHaws/lucenenet/commit/c11205d75f070788940464ac1b8c3c6d915ecb92>
BTW, are you on the lucenenet-dev mailing list? We should probably move this conversation
there.
Thanks!








