lucenenet-dev mailing list archives

From "Dean Harding" <dean.hard...@dload.com.au>
Subject RE: Sort differences between .NET and Java in Lucene.Net 2.0
Date Thu, 14 Dec 2006 02:39:15 GMT
You certainly don't want CompareOrdinal!
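
To illustrate why, here is a rough Java sketch (the class and method names
are made up for this example). Java's String.compareTo plays the role of an
ordinal compare: it looks only at UTF-16 code unit values, so every
uppercase ASCII letter sorts before every lowercase one. A Collator applies
the locale's alphabetic rules instead.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class, not part of Lucene or Lucene.Net.
public class OrdinalVsCollator {
    // Ordinal comparison: compares UTF-16 code unit values only.
    static int ordinal(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Locale-aware comparison using the collation rules of the given locale.
    static int collated(Locale locale, String a, String b) {
        return Integer.signum(Collator.getInstance(locale).compare(a, b));
    }

    public static void main(String[] args) {
        // 'Z' is U+005A and 'a' is U+0061, so ordinal order puts "Z" first,
        // while alphabetic collation puts "a" first.
        System.out.println("ordinal  Z vs a: " + ordinal("Z", "a"));
        System.out.println("collated Z vs a: " + collated(Locale.US, "Z", "a"));
    }
}
```

The two calls disagree on something as basic as "Z" versus "a", which is
the kind of result a user-facing sort should not produce.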

.NET is doing the right thing in this case, but so is Java. 

The problem is that Ø is not a character that is used in US English (or any
English, for that matter), so the actual order that would be returned when
doing a compare in a locale like en-us is not really important.

What IS important is if you do the comparison in the context of a locale
that DOES use the Ø character. If you change your .NET culture name or your
Java locale to (for example) "da" (that is, Danish) then the results are the
same.
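
A minimal Java sketch of that check, assuming the JDK in use ships Danish
collation data (the class and method names are hypothetical). It runs the
same comparison under the US locale and under "da" so the two results can
be put side by side. On the .NET side the counterpart would presumably be
new CultureInfo("da").CompareInfo.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class, not part of Lucene or Lucene.Net.
public class LocaleCompareDemo {
    // Returns -1, 0 or 1 for a compared with b under the given locale's rules.
    static int compareIn(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        String s1 = "H\u00D8T";
        String s2 = "HUT";
        Locale danish = Locale.forLanguageTag("da");

        // Danish places the letter Ø near the end of its alphabet, after Z.
        System.out.println("en-US: " + compareIn(Locale.US, s1, s2));
        System.out.println("da:    " + compareIn(danish, s1, s2));
    }
}
```

A test that pins the locale like this checks an order the language
actually defines, rather than whatever en-US happens to do with Ø.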

So the bug, I believe, is in the test case, which is relying on behaviour
that is, in my opinion, undefined.

Dean.


> -----Original Message-----
> From: George Aroush [mailto:george@aroush.net]
> Sent: Thursday, 14 December 2006 1:07 pm
> To: lucene-net-dev@incubator.apache.org
> Cc: lucene-net-user@incubator.apache.org
> Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi Joe and all,
> 
> I don't think we can use CompareOrdinal() as it doesn't take locale into
> consideration.
> 
> The issue is with the following function in
> Lucene.Net.Search.FieldSortedHitQueue.cs:
> 
>     public int Compare(ScoreDoc i, ScoreDoc j)
>     {
>         return collator.Compare(index[i.doc].ToString(),
>                                 index[j.doc].ToString());
>     }
> 
> To demonstrate how Java and C# differ in the way they compare, here is
> some sample code:
> 
>     // C# code: you get back -1 for 'res'
>     string s1 = "H\u00D8T";
>     string s2 = "HUT";
>     System.Globalization.CultureInfo locale =
>         new System.Globalization.CultureInfo("en-US");
>     System.Globalization.CompareInfo collator = locale.CompareInfo;
>     int res = collator.Compare(s1, s2);
> 
>     // Java code: you get back 1 for 'diff'
>     String s1 = "H\u00D8T";
>     String s2 = "HUT";
>     Collator collator = Collator.getInstance (Locale.US);
>     int diff = collator.compare(s1, s2);
> 
> Who is doing the right thing?  Or am I missing additional calls before I
> can compare?
> 
> My goal is to understand why the difference exists so that we can judge
> how serious this is and either fix it or accept it as a language difference.
> 
> Btw, I am going to post this question on the Java Lucene mailing list to
> see what folks in Java land have to say.
> 
> Regards,
> 
> -- George Aroush
> 
> 
> -----Original Message-----
> From: Joe Shaw [mailto:joeshaw@novell.com]
> Sent: Wednesday, December 13, 2006 1:35 PM
> To: lucene-net-dev@incubator.apache.org
> Cc: lucene-net-user@incubator.apache.org
> Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi,
> 
> On Wed, 2006-12-13 at 11:35 -0500, George Aroush wrote:
> > This is why those two tests are failing and I wonder if this is a
> > defect in .NET or in the way the culture info is used in those two
> > languages or if there is more culture setting I have to do in .NET.
> >
> > My thinking is, in .NET during compare, "\u00D8", is being treated as
> > ASCII "O" and not the Unicode character that it really is.
> 
> This isn't the case, because if so "HOT" would be equal to "H\u00D8T".
> 
> I think that the sort order is just different between .NET and Java -- i.e.,
> the order is "O", "\u00D8", "U" in .NET but "O", "U", "\u00D8" in Java --
> at least in the culture you're using.
> 
> If you're looking for the actual numerical values of the characters for
> comparison (in which "\u00D8" would be quite a bit higher than both "O"
> and "U"), you probably want to use String.CompareOrdinal().
> 
> BTW, doing culture insensitive string comparisons might be a good thing to
> do anyway.  From the MSDN docs for String.Compare(string, string):
> 
>         The comparison uses the current culture to obtain
>         culture-specific information such as casing rules and the
>         alphabetic order of individual characters. For example, a
>         culture could specify that certain combinations of characters be
>         treated as a single character, or uppercase and lowercase
>         characters be compared in a particular way, or that the sorting
>         order of a character depends on the characters that precede or
>         follow it.
> 
> For more info, see the String.Compare() docs:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemStringclassComparetopic.asp
> 
> Joe



