lucenenet-dev mailing list archives

From "Digy (JIRA)" <j...@apache.org>
Subject [jira] Closed: (LUCENENET-5) CJK Tokenizer in NLS fails to stop at end of input buffer.
Date Sat, 15 Nov 2008 01:16:46 GMT

     [ https://issues.apache.org/jira/browse/LUCENENET-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Digy closed LUCENENET-5.
------------------------

    Resolution: Fixed
      Assignee: Digy

Not supported version.

> CJK Tokenizer in NLS fails to stop at end of input buffer.
> ----------------------------------------------------------
>
>                 Key: LUCENENET-5
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-5
>             Project: Lucene.Net
>          Issue Type: Bug
>         Environment: lucene.net.nls.1.3.2.2 on .NET 1.1 SP1
>            Reporter: Ben Tregenna
>            Assignee: Digy
>            Priority: Minor
>
> When using the CJKTokenizer from the National Language Support Pack to tokenize simple Japanese text, the tokenizer fails to signal end of stream (EOS) correctly.
> Example code snippet (suitable for use as an NUnit test):
> public void SimpleTokenization()
> {
> 	TextReader tr = new StringReader("???");
> 	CJKTokenizer tokenizer = new CJKTokenizer(tr);
> 	Assert.AreEqual("??", tokenizer.Next().TermText(), "First Token is correct");
> 	Assert.AreEqual("??", tokenizer.Next().TermText(), "Second Token is correct");
> 	Assert.AreEqual(string.Empty, tokenizer.Next().TermText(), "Returns empty string as final token");
> 	Assert.IsNull(tokenizer.Next(), "Returns null after end of string");
> }
> The current code treats the final buffer as circular, so it returns "??" as a third token and then keeps returning these three tokens cyclically. The problem comes from the condition that checks for EOS on the TextReader input. In Java, the buffer-filling Reader.read overloads return -1 at EOS, but in .NET TextReader.Read(char[], int, int) returns 0 at EOS (only the parameterless Read() returns -1), so the terminating condition needs altering.
> The diff to fix is pretty trivial:
> CJKTokenizer.cs: 162c162
> <                               if (dataLen == -1)
> ---
> >                               if (dataLen == 0)
> As a final note to the unwary: the comment at the start of CJKTokenizer.Next() seems to indicate that null will be returned immediately at EOS ("Returns the next token in the stream, or null at EOS."). However, I always get an empty token and then null, as indicated in the snippet above. The logic now seems to reflect the lucene-java logic exactly, so whether this is a bug, a feature or a poor method summary remains unclear to me.
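
The end-of-stream difference described in the issue can be checked in isolation. The following is a minimal sketch, not code from CJKTokenizer.cs: the class name EosCheck and the method CountReads are hypothetical, and a StringReader stands in for the tokenizer's input. It loops on the buffer overload of TextReader.Read and stops on a return value of 0, which is the condition the diff above switches to.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not part of Lucene.Net.
public static class EosCheck
{
    // Counts how many buffer reads return data before
    // TextReader.Read(char[], int, int) reports end of stream.
    public static int CountReads(TextReader reader, int bufferSize)
    {
        char[] buffer = new char[bufferSize];
        int reads = 0;
        while (true)
        {
            int dataLen = reader.Read(buffer, 0, buffer.Length);
            // .NET signals EOS from this overload with 0.
            // A check for -1 (the Java convention) would never fire here.
            if (dataLen == 0)
            {
                break;
            }
            reads++;
        }
        return reads;
    }
}
```

With a three-character input and a two-character buffer, CountReads returns 2: one full read, one partial read, then a read of 0 that ends the loop. Had the loop tested for -1 instead, it would not terminate, which is consistent with the repeating tokens reported above.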

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.