lucenenet-dev mailing list archives

From "Digy (JIRA)" <>
Subject [jira] Updated: (LUCENENET-188) Index/TestIndexInput/TestRead fails - (invalid UTF8 sequence).
Date Wed, 12 Aug 2009 20:15:14 GMT


Digy updated LUCENENET-188:

    Attachment: TestIndexInput.patch

{quote} {color:red} 
The Java programming language, which uses UTF-16 for its internal text representation, supports
a non-standard modification of UTF-8 for string serialization. This encoding is called modified
UTF-8. There are two differences between modified and standard UTF-8. The first difference
is that the null character (U+0000) is encoded with two bytes instead of one, specifically
as 11000000 10000000. This ensures that there are no embedded nulls in the encoded string,
presumably so that the encoded string can be processed safely in a language such
as C, where a null byte signifies the end of a string.
{color} {quote}

This explains the difference. Java decodes the byte sequence C0 80 as the null character (U+0000), whereas .NET treats it as an invalid UTF-8 sequence.
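To make the byte-level behaviour concrete, here is a minimal Java sketch. It is not the attached patch. The class name ModifiedUtf8Demo and the helpers encodeModified and decodeLenient are hypothetical names chosen for illustration. The sketch relies on DataOutputStream.writeUTF, which is documented to write modified UTF-8 after a two-byte length prefix, and on a hand-written decoder that accepts the two-byte form C0 80. The decoder only handles one- to three-byte sequences, which is all that modified UTF-8 produces; a standard four-byte UTF-8 sequence would come out as replacement characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class; none of these names come from Lucene or Lucene.Net.
class ModifiedUtf8Demo {

    // Encode a string as Java's modified UTF-8. writeUTF emits a 2-byte
    // length prefix first, so that prefix is stripped from the result.
    static byte[] encodeModified(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Only expected when the encoded form exceeds 65535 bytes.
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    private static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Lenient decoder sketch: accepts 1-, 2- and 3-byte sequences without
    // rejecting overlong forms, so C0 80 decodes to U+0000. Anything else
    // becomes the replacement character, one per offending byte.
    static String decodeLenient(byte[] bytes, int offset, int length) {
        StringBuilder sb = new StringBuilder(length);
        int i = offset;
        int end = offset + length;
        while (i < end) {
            int b0 = bytes[i] & 0xFF;
            if (b0 < 0x80) {
                sb.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0 && i + 1 < end
                    && isContinuation(bytes[i + 1])) {
                sb.append((char) (((b0 & 0x1F) << 6) | (bytes[i + 1] & 0x3F)));
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0 && i + 2 < end
                    && isContinuation(bytes[i + 1])
                    && isContinuation(bytes[i + 2])) {
                sb.append((char) (((b0 & 0x0F) << 12)
                        | ((bytes[i + 1] & 0x3F) << 6)
                        | (bytes[i + 2] & 0x3F)));
                i += 3;
            } else {
                sb.append('\uFFFD');
                i += 1;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeModified("a\0b");
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02x ", b & 0xFF));
        }
        // Prints the encoded bytes in hex, then whether the round trip
        // through the lenient decoder restores the original string.
        System.out.println(hex.toString().trim());
        System.out.println(decodeLenient(encoded, 0, encoded.length).equals("a\0b"));
    }
}
```

A decoder along these lines trades strictness for compatibility: it reads what Java wrote, but it also accepts other overlong two- and three-byte forms that a strict UTF-8 decoder would reject, and it says nothing about performance.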


> Index/TestIndexInput/TestRead fails -  (invalid UTF8 sequence).
> ---------------------------------------------------------------
>                 Key: LUCENENET-188
>                 URL:
>             Project: Lucene.Net
>          Issue Type: Bug
>         Environment: Lucene.Net 2.4.0
>            Reporter: Digy
>            Priority: Trivial
>         Attachments: IndexInput.patch, TestIndexInput.patch
> This test fails since "System.Text.Encoding.UTF8.GetString(bytes, 0, length)" emits
> the \ufffd char for invalid UTF-8 sequences and Java's "String(bytes, 0, length, "UTF-8")" outputs
> I will attach a very badly implemented patch to show the problem but won't commit it unless
> a clever (and performant) solution is found.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
