lucenenet-dev mailing list archives

From "Matt Dufrasne (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENENET-354) The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the original string
Date Thu, 01 Apr 2010 18:45:27 GMT
The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the
original string
-------------------------------------------------------------------------------------------------------------

                 Key: LUCENENET-354
                 URL: https://issues.apache.org/jira/browse/LUCENENET-354
             Project: Lucene.Net
          Issue Type: Bug
         Environment: Lucene.Net 2.9.1
            Reporter: Matt Dufrasne


The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in
the original string.

I think there is a bug in the tokenizer in Lucene.Net 2.9.1, and it was probably present in earlier versions too.
When indexing "BB_HHH_FFFF5_SSSS", which contains a digit, the following tokens are returned:

"bb hhh_ffff5_ssss"

After some testing, I've found that this is because of the number. If I input

"BB_HHH_FFFF_SSSS", I get

"bb hhh ffff ssss"

At this point, I'm leaning towards a tokenizer bug, unless a digit is supposed to change how
the tokenizer splits on underscores, in which case I fail to see why.
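
The report does not identify a cause. One possible explanation, stated here purely as an assumption and not confirmed by the report, is that the tokenizer grammar deliberately keeps segments joined by punctuation such as "_" together when a segment contains a digit, for example to preserve serial or product numbers. The sketch below is a standalone Python model of that hypothesis, not Lucene.Net code; the rule set and the names `ALNUM`, `HAS_DIGIT`, `P`, `RULES` and `tokenize` are hypothetical.

```python
import re

# Hypothetical building blocks for a grammar-style tokenizer (assumed, not Lucene.Net's).
ALNUM = r"[A-Za-z0-9]+"                       # a run of letters and/or digits
HAS_DIGIT = r"[A-Za-z0-9]*[0-9][A-Za-z0-9]*"  # a run containing at least one digit
P = r"[_\-/.,]"                               # joining punctuation (assumed set)

# Assumed rules: punctuation only glues segments together when a digit is involved.
RULES = [
    re.compile(rf"{ALNUM}(?:{P}{HAS_DIGIT}{P}{ALNUM})+"),  # word _ digit-run _ word ...
    re.compile(rf"{ALNUM}{P}{HAS_DIGIT}"),                 # word _ digit-run
    re.compile(rf"{HAS_DIGIT}{P}{ALNUM}"),                 # digit-run _ word
    re.compile(ALNUM),                                     # plain word
]


def tokenize(text):
    """Emit lowercased tokens, taking the longest rule match at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        matches = [m for m in (rule.match(text, pos) for rule in RULES) if m]
        if not matches:
            pos += 1  # no rule starts here (e.g. a bare "_"), so skip the character
            continue
        best = max(matches, key=lambda m: m.end())
        tokens.append(best.group().lower())
        pos = best.end()
    return tokens


print(tokenize("BB_HHH_FFFF5_SSSS"))  # ['bb', 'hhh_ffff5_ssss']
print(tokenize("BB_HHH_FFFF_SSSS"))   # ['bb', 'hhh', 'ffff', 'ssss']
```

Under these assumed rules the model reproduces both outputs from the report: "BB" is followed by a segment without a digit, so it is split off, while "HHH_FFFF5_SSSS" has a digit in its middle segment and stays in one piece. If the real grammar behaves this way, the result would be intended behaviour and not a bug.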

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

