lucenenet-dev mailing list archives

From "Alex Simatov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENENET-559) Search word request on Chinese is not working properly
Date Thu, 14 Jul 2016 11:43:20 GMT
Alex Simatov created LUCENENET-559:
--------------------------------------

             Summary: Search word request on Chinese is not working properly
                 Key: LUCENENET-559
                 URL: https://issues.apache.org/jira/browse/LUCENENET-559
             Project: Lucene.Net
          Issue Type: Bug
          Components: Lucene.Net Core
    Affects Versions: Lucene.Net 5.0 PCL
            Reporter: Alex Simatov


We used Lucene 2.3 in the project for years.
Some time ago we updated to version 5.0.0 of Lucene.
After that, Chinese analysis stopped working properly (I did not test Japanese or Korean).

We have the following code to process the search request:

1. analyzer = new ClassicAnalyzer();
2. logger.Write2Log(queryString);
3. QueryParser qp = new QueryParser(fieldName, analyzer);
4. Query query = qp.parse(queryString);
5. logger.Write2Log(query.toString(fieldName));
6. int hits = searcher.search(query, 1).totalHits;

The analyzer on line 1 can be changed via config.
Line 2 logs what we pass to Lucene.
Line 5 logs how Lucene modified the query.

Normally we use the string 打不开~0.7 to match with 70% or higher similarity, and 打不开 to
find exactly this word.
The ~0.7 syntax has been deprecated since version 4.0; however, it still works,
at least for English.

How it was before (on Lucene 2.3):
Line 2: 打不开~0.7 
Line 5: 打不开~0.7
If we provide the correct string for analysis, line 6 returns the correct result.

The same holds for 打不开 without the similarity suffix (without ~0.7).

How it is now (on Lucene 5.0):
Line 2: 打不开~0.7 
Line 5: 打不开~0
As I understand it, the deprecated parameter is converted to the newly supported one, which
has a slightly different meaning (at least it works as I described for English).
The string being analyzed contains 打不开, yet line 6 reports that nothing is found.
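
One possible explanation for the rewrite from ~0.7 to ~0, sketched under an assumption: that
the port follows Java Lucene 4.x and later, where a fractional similarity is converted into a
whole number of edits based on the term length and capped at 2. The class and method names
below are illustrative only and are not the library's API; only the arithmetic is meant to
mirror that conversion.

```java
public class FuzzyEditsSketch {
    // Assumed cap on the edit distance (Java Lucene supports at most 2 edits).
    static final int MAX_EDITS = 2;

    // Convert a legacy similarity (0..1) or an explicit edit count (>= 1)
    // into a whole number of edits for a term of the given length.
    static int floatToEdits(float minimumSimilarity, int termLen) {
        if (minimumSimilarity >= 1f) {
            // Values of 1 or more are already an edit count.
            return (int) Math.min(minimumSimilarity, MAX_EDITS);
        } else if (minimumSimilarity == 0.0f) {
            return 0; // zero edits, i.e. an exact match
        } else {
            // Fractional similarity: allowed edits scale with term length,
            // and the cast truncates toward zero.
            return Math.min((int) ((1D - minimumSimilarity) * termLen), MAX_EDITS);
        }
    }

    // Term length is counted in Unicode code points, not UTF-16 units.
    static int editsFor(String term, float minimumSimilarity) {
        return floatToEdits(minimumSimilarity, term.codePointCount(0, term.length()));
    }

    public static void main(String[] args) {
        // 3 characters: (1 - 0.7) * 3 = 0.9, truncated to 0 edits
        System.out.println(editsFor("打不开", 0.7f));
        // 8 characters: (1 - 0.7) * 8 = 2.4, truncated to 2 edits
        System.out.println(editsFor("keyboard", 0.7f));
    }
}
```

If that assumption holds, a three-character term never receives a nonzero edit budget from
~0.7, so the query becomes an exact term match, while longer English words still get one or
two edits. This would be consistent with the difference seen between Chinese and English.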

Line 2: 打不开 
Line 5: 打 不 开
Lucene added spaces, which are interpreted as the OR operator. As a result, line 6 reports the
keyword as found even if the string being analyzed contains only the single character 不.
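
The difference between the two interpretations can be illustrated without Lucene at all. The
following is a minimal sketch, assuming the analyzer emits one token per Chinese character;
the class and helper names are hypothetical, and the matching logic is a simplification, not
Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UnigramMatchSketch {
    // Split text into one token per Unicode code point
    // (an assumption about what the analyzer produces for Chinese).
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // OR semantics: the document matches if it contains any one query token.
    static boolean matchesAny(String query, String doc) {
        List<String> docTokens = unigrams(doc);
        for (String token : unigrams(query)) {
            if (docTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Phrase semantics: all query tokens must appear adjacent and in order.
    static boolean matchesPhrase(String query, String doc) {
        return Collections.indexOfSubList(unigrams(doc), unigrams(query)) >= 0;
    }

    public static void main(String[] args) {
        String query = "打不开";
        String onlyOneChar = "我不知道";  // contains 不 but not the whole word
        String wholeWord = "文件打不开";  // contains the whole word

        System.out.println(matchesAny(query, onlyOneChar));    // true
        System.out.println(matchesPhrase(query, onlyOneChar)); // false
        System.out.println(matchesAny(query, wholeWord));      // true
        System.out.println(matchesPhrase(query, wholeWord));   // true
    }
}
```

The phrase-style check corresponds to the behavior expected from Lucene 2.3 in this report.
Java Lucene's classic QueryParser has a setAutoGeneratePhraseQueries option intended for this
situation; whether the version in use here exposes it has not been verified.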

The same scenario was tested with CJKAnalyzer, ClassicAnalyzer, and SmartChineseAnalyzer.
The results are the same: none of them behaves like the analyzer in Lucene 2.3.

Is this a known problem in the product? Could you please explain, or point to any docs on, how
search should work for Chinese in the cases mentioned?
Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
