lucenenet-dev mailing list archives

From "George Aroush" <geo...@aroush.net>
Subject RE: Sort differences between .NET and Java in Lucene.Net 2.0
Date Fri, 15 Dec 2006 01:56:46 GMT
Hi Dean,

No, I do not intend to use CompareOrdinal -- that would break Lucene.Net.
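
To illustrate why an ordinal compare is not a drop-in replacement for a collator, here is a minimal sketch in Java. It assumes that Java's String.compareTo (which compares UTF-16 code units) is a fair stand-in for .NET's String.CompareOrdinal; the class and method names are made up for this example and are not part of Lucene or Lucene.Net.

```java
import java.text.Collator;
import java.util.Locale;

public class OrdinalVsCollation {
    // Ordinal comparison: compares UTF-16 code unit values and ignores locale.
    // Assumed here to be the Java analogue of .NET's String.CompareOrdinal.
    static int ordinal(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Locale-aware comparison using the collation rules of the given locale.
    static int collated(Locale locale, String a, String b) {
        return Integer.signum(Collator.getInstance(locale).compare(a, b));
    }

    public static void main(String[] args) {
        // 'a' is 0x61 and 'Z' is 0x5A, so ordinal order puts "Z" first.
        System.out.println(ordinal("a", "Z"));             // prints 1
        // A collator for en-US orders alphabetically, so "a" comes first.
        System.out.println(collated(Locale.US, "a", "Z")); // prints -1
        // U+00D8 is numerically far above 'U' (0x55).
        System.out.println(ordinal("H\u00D8T", "HUT"));    // prints 1
    }
}
```

The ordinal result depends only on code point values, so mixed-case and accented field values would sort in an order that no locale expects.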

I have posted this question on Java Lucene mailing list; I got one response
suggesting that Java is doing it wrong.  I am certain about this.

I have done some more research, and so far I agree with your
analysis.  For example, like you said, using the Danish locale gave me the
same result with Java and .NET.

Does everyone agree that this is not an issue with Lucene.Net 2.0 such that
I should release 2.0 as "final"?  I should point out that this same problem
also existed in 1.9.1, 1.9, 1.4.3, 1.4 and earlier releases.  In those
releases, this test didn't exist to expose it.

Regards,

-- George Aroush


-----Original Message-----
From: Dean Harding [mailto:dean.harding@dload.com.au] 
Sent: Wednesday, December 13, 2006 9:39 PM
To: lucene-net-user@incubator.apache.org;
lucene-net-dev@incubator.apache.org
Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0

You certainly don't want CompareOrdinal!

.NET is doing the right thing in this case, but so is Java. 

The problem is that Ø is not a character that is used in US English (or any
English, for that matter), so the actual order that would be returned when
doing a compare in a locale like en-us is not really important.

What IS important is if you do the comparison in the context of a locale
that DOES use the Ø character. If you change your .NET culture name or your
Java locale to (for example) "da" (that is, Danish) then the results are the
same.
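
A minimal sketch of that experiment on the Java side, for anyone who wants to try it. The class and helper names are hypothetical, and the "da" language tag is simply the Danish example mentioned above. The values printed depend on the collation data shipped with the runtime, so no expected output is given here.

```java
import java.text.Collator;
import java.util.Locale;

public class LocaleCompare {
    // Compare two strings under the collation rules of an explicit locale,
    // rather than whatever the default locale happens to be.
    static int compareIn(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        String s1 = "H\u00D8T";
        String s2 = "HUT";
        // Same pair of strings, two different locales.
        System.out.println("en-US: " + compareIn(Locale.US, s1, s2));
        System.out.println("da:    " + compareIn(Locale.forLanguageTag("da"), s1, s2));
    }
}
```

A test that pins the locale this way no longer depends on how each platform happens to order a character that the locale does not use.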

So the bug, I believe, is in the test case, which is relying on behavior that
is, in my opinion, undefined.

Dean.


> -----Original Message-----
> From: George Aroush [mailto:george@aroush.net]
> Sent: Thursday, 14 December 2006 1:07 pm
> To: lucene-net-dev@incubator.apache.org
> Cc: lucene-net-user@incubator.apache.org
> Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi Joe and all,
> 
> I don't think we can use CompareOrdinal() as it doesn't take locale 
> into consideration.
> 
> The issue is with the following function in
> Lucene.Net.Search.FieldSortedHitQueue.cs:
> 
>     public int Compare(ScoreDoc i, ScoreDoc j)
>     {
>         return collator.Compare(index[i.doc].ToString(), index[j.doc].ToString());
>     }
> 
> To demonstrate how Java and C# differ in the way they compare, here 
> is some sample code:
> 
>     // C# code: you get back -1 for 'res'
>     string s1 = "H\u00D8T";
>     string s2 = "HUT";
>     System.Globalization.CultureInfo locale = new System.Globalization.CultureInfo("en-US");
>     System.Globalization.CompareInfo collator = locale.CompareInfo;
>     int res = collator.Compare(s1, s2);
> 
>     // Java code: you get back 1 for 'diff'
>     String s1 = "H\u00D8T";
>     String s2 = "HUT";
>     Collator collator = Collator.getInstance(Locale.US);
>     int diff = collator.compare(s1, s2);
> 
> Who is doing the right thing?  Or am I missing additional calls before 
> I can compare?
> 
> My goal is to understand why the difference exists, so that we can 
> judge how serious this is and either fix it or accept it as a language 
> difference.
> 
> Btw, I am going to post this question on the Java Lucene mailing list 
> to see what folks on the Java land have to say.
> 
> Regards,
> 
> -- George Aroush
> 
> 
> -----Original Message-----
> From: Joe Shaw [mailto:joeshaw@novell.com]
> Sent: Wednesday, December 13, 2006 1:35 PM
> To: lucene-net-dev@incubator.apache.org
> Cc: lucene-net-user@incubator.apache.org
> Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
> 
> Hi,
> 
> On Wed, 2006-12-13 at 11:35 -0500, George Aroush wrote:
> > This is why those two tests are failing and I wonder if this is a 
> > defect in .NET or in the way the culture info is used in those two 
> > languages or if there is more culture setting I have to do in .NET.
> >
> > My thinking is, in .NET during compare, "\u00D8", is being treated 
> > as ASCII "O" and not the Unicode character that it really is.
> 
> This isn't the case, because if so "HOT" would be equal to "H\u00D8T".
> 
> I think that the sort order is just different between .NET and Java -- 
> ie, the order is "O", "\u00D8", "U" in .NET but "O", "U", "\u00D8" in 
> Java -- at least in the culture you're using.
> 
> If you're looking for the actual numerical values of the characters 
> for comparison (in which "\u00D8" would be quite a bit higher than both
> "O" and "U"), you probably want to use String.CompareOrdinal().
> 
> BTW, doing culture-insensitive string comparisons might be a good 
> thing to do anyway.  From the MSDN docs for
> String.Compare(string, string):
> 
>         The comparison uses the current culture to obtain
>         culture-specific information such as casing rules and the
>         alphabetic order of individual characters. For example, a
>         culture could specify that certain combinations of characters be
>         treated as a single character, or uppercase and lowercase
>         characters be compared in a particular way, or that the sorting
>         order of a character depends on the characters that precede or
>         follow it.
> 
> For more info, see the String.Compare() docs:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemStringclassComparetopic.asp
> 
> Joe



