lucenenet-dev mailing list archives

From "Digy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENENET-354) The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the original string
Date Thu, 01 Apr 2010 22:30:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENENET-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852582#action_12852582 ]

Digy commented on LUCENENET-354:
--------------------------------

Hi Matt,
I compared Lucene.Net 2.9.2 and Lucene.Java 2.9.2; they both output the same tokens
for your input.
So it is not a bug; StandardAnalyzer works this way.
Even if it were a bug, changing StandardAnalyzer would cause compatibility problems between
Lucene.Net and Lucene.Java versions.
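
For reference, the output can be reproduced with a simplified model of the tokenizer grammar. The sketch below is Python, not Lucene code. It assumes the "NUM" rule of the 2.9 StandardTokenizer grammar: alphanumeric segments joined by one of _ - / . , stay together as one token when every other segment contains a digit. The names tokenize, ALNUM, DIGITY and P are illustrative, only ASCII letters and digits are handled, and the other grammar rules (e-mail, host, acronym and so on) are left out.

```python
import re

# Assumption: simplified ASCII-only versions of the grammar's building blocks.
ALNUM = r"[A-Za-z0-9]+"                      # any run of letters/digits
DIGITY = r"(?=[A-Za-z]*[0-9])[A-Za-z0-9]+"   # a run that contains a digit
P = r"[_\-/.,]"                              # joining punctuation

# Assumption: alternatives of the "NUM" rule (every other segment has a digit).
NUM_ALTERNATIVES = [
    f"{ALNUM}{P}{DIGITY}",
    f"{DIGITY}{P}{ALNUM}",
    f"{ALNUM}(?:{P}{DIGITY}{P}{ALNUM})+",
    f"{DIGITY}(?:{P}{ALNUM}{P}{DIGITY})+",
    f"{ALNUM}{P}{DIGITY}(?:{P}{ALNUM}{P}{DIGITY})+",
    f"{DIGITY}{P}{ALNUM}(?:{P}{DIGITY}{P}{ALNUM})+",
]
RULES = [re.compile(p) for p in [ALNUM] + NUM_ALTERNATIVES]


def tokenize(text):
    """Longest-match scan over RULES; unmatched characters are skipped."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_end = pos
        for rule in RULES:
            match = rule.match(text, pos)
            if match and match.end() > best_end:
                best_end = match.end()
        if best_end == pos:
            pos += 1  # separator such as "_" between tokens
            continue
        # Lowercasing stands in for the analyzer's lowercase filter.
        tokens.append(text[pos:best_end].lower())
        pos = best_end
    return tokens


print(tokenize("BB_HHH_FFFF5_SSSS"))
print(tokenize("BB_HHH_FFFF_SSSS"))
```

In this model "BB" stands alone because "HHH" has no digit, while "HHH_FFFF5_SSSS" is kept whole because its middle segment contains one. Without the digit, no NUM alternative applies and each segment becomes its own token.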

So, if it is not suitable for your needs, you may want to use a different analyzer or write
a custom analyzer that works the way you want.
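
As a rough idea of what such a custom tokenizer would do, here is a minimal Python sketch (not the Lucene.Net API) that breaks on every character that is not a letter or digit and lowercases the rest. The names is_token_char and tokenize_custom are hypothetical; the predicate is modelled on the kind of per-character test a character-based tokenizer uses.

```python
def is_token_char(ch):
    # Assumption: only letters and digits belong to a token; "_" does not.
    return ch.isalnum()


def tokenize_custom(text):
    """Split text into lowercase tokens at every non-token character."""
    tokens = []
    current = []
    for ch in text:
        if is_token_char(ch):
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize_custom("BB_HHH_FFFF5_SSSS"))
```

With this rule the underscore always separates tokens, whether or not a segment contains a digit.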

DIGY


> The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the original string
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENENET-354
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-354
>             Project: Lucene.Net
>          Issue Type: Bug
>         Environment: Lucene.Net 2.9.1
>            Reporter: Matt Dufrasne
>
> The StandardAnalyzer tokenizer doesn't tokenize on all tokens when numbers are present in the original string.
> I think there is a bug in the tokenizer for Lucene 2.9.1 and it was probably there before. When indexing "BB_HHH_FFFF5_SSSS", when there is a number, the following tokens are returned:
> "bb hhh_ffff5_ssss"
> After some testing, I've found that this is because of the number. If I input
> "BB_HHH_FFFF_SSSS", I get
> "bb hhh ffff ssss"
> At this point, I'm leaning towards a tokenizer bug unless the presence of the number is supposed to have this behavior but I fail to see why.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

