lucenenet-dev mailing list archives

From "Shad Storhaug (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENENET-595) Wildcard search with special characters "#" not working
Date Thu, 31 Aug 2017 12:31:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENENET-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148916#comment-16148916 ]

Shad Storhaug commented on LUCENENET-595:
-----------------------------------------

I suspect this is due to the use of {{StandardAnalyzer}} when you write your index. Per the
{{StandardTokenizer}} docs (https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/standard/StandardTokenizer.html):

{quote}
This should be a good tokenizer for most European-language documents:

* Splits words at punctuation characters, removing punctuation. However, a dot that's not
followed by whitespace is considered part of a token.
* Splits words at hyphens, unless there's a number in the token, in which case the whole token
is interpreted as a product number and is not split.
* Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application,
please consider copying this source code directory to your project and maintaining your own
grammar-based tokenizer.{quote}

Although it doesn't specifically state it, I suspect that the # (or for that matter any other
special character) is being removed from the analyzed data, and does not match your query
because it does not exist in the index. I suggest trying another analyzer that doesn't meddle
with special characters, such as {{WhitespaceAnalyzer}} (https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/WhitespaceAnalyzer.html),
or build a custom analyzer to meet your exact needs.
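
One way to check this suspicion is to print the tokens each analyzer emits for the course names. The following is a minimal sketch, assuming Lucene.Net 3.0.3; the class name {{AnalyzerComparison}}, the helper {{DumpTokens}}, and the sample text are made up for illustration, and no output is shown because it should be confirmed by running it against your own data.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

// Hypothetical helper class, not part of Lucene.Net.
static class AnalyzerComparison
{
    // Prints every token the given analyzer produces for the text.
    static void DumpTokens(string label, Analyzer analyzer, string text)
    {
        // The field name only matters for analyzers that vary per field.
        TokenStream stream = analyzer.TokenStream("Title", new StringReader(text));
        ITermAttribute term = stream.AddAttribute<ITermAttribute>();

        Console.Write(label + ":");
        while (stream.IncrementToken())
        {
            Console.Write(" [" + term.Term + "]");
        }
        Console.WriteLine();
        stream.Dispose();
    }

    static void Main()
    {
        const string text = "C# C#.Net"; // sample course names from the report

        DumpTokens("StandardAnalyzer", new StandardAnalyzer(Version.LUCENE_30), text);
        DumpTokens("WhitespaceAnalyzer", new WhitespaceAnalyzer(), text);
    }
}
```

If the {{StandardAnalyzer}} line shows tokens without the #, that confirms the cause. Note that {{WhitespaceAnalyzer}} does not lowercase, so searches become case-sensitive, and "C#.Net" stays a single token, so a query for "C#" would need a trailing wildcard to match it. Whichever analyzer you choose, use the same one for indexing and for the query parser.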

Keep in mind that the data that is stored isn't always the same as the analyzed data, and
it is the analyzed data that is used during the search.
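
The difference between stored and analyzed data can be observed directly. This is a rough sketch, assuming Lucene.Net 3.0.3 and an in-memory index; the class name {{StoredVersusAnalyzed}} is hypothetical. It indexes one document the same way the report does, then prints the stored value and the number of hits for the exact indexed term "C#".

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Hypothetical demo class, not part of Lucene.Net.
static class StoredVersusAnalyzed
{
    static void Main()
    {
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Index a single document; disposing the writer commits it.
        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("Title", "C#", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }

        using (var searcher = new IndexSearcher(directory, true))
        {
            // The stored value is returned exactly as it was supplied.
            TopDocs all = searcher.Search(new MatchAllDocsQuery(), 1);
            string stored = searcher.Doc(all.ScoreDocs[0].Doc).Get("Title");
            Console.WriteLine("Stored value: " + stored);

            // A TermQuery bypasses analysis and looks up the raw indexed term.
            TopDocs exact = searcher.Search(new TermQuery(new Term("Title", "C#")), 1);
            Console.WriteLine("Hits for indexed term \"C#\": " + exact.TotalHits);
        }
    }
}
```

If the stored value prints as "C#" while the term lookup reports no hits, the # was dropped at analysis time even though the document appears to be "stored correctly".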

> Wildcard search with special characters "#" not working
> -------------------------------------------------------
>
>                 Key: LUCENENET-595
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-595
>             Project: Lucene.Net
>          Issue Type: Bug
>          Components: Lucene.Net Core
>    Affects Versions: Lucene.Net 3.0.3
>            Reporter: Singaravelu
>            Priority: Blocker
>
> I'm using Lucene.Net version 3.0.3.0 in my website to search a list of courses.
> I have a few courses whose names contain the special character "#", such as C#, C#.Net, etc.
> But when I search for the term "C#", it shows 0 results.
> I'm using StandardAnalyzer and MultiFieldQueryParser, and I also allow wildcard search (AllowLeadingWildcard = true).
> Here is my code:
> var analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
> {
> BooleanQuery query = new BooleanQuery();
> var nameParser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Column1", " Column2", " Column3" }, analyzer);
> if (!string.IsNullOrEmpty(searchCriteria.CourseName))
> {
> query.Add(parseQuery(GetTerms(searchCriteria.CourseName.ReplaceDiacritics()), nameParser), Occur.MUST);
> }
> ScoreDoc[] hits = searcher.Search(query, null, hits_limit, Sort.RELEVANCE).ScoreDocs;
> var results = _mapLuceneToDataList(hits, searcher);
> analyzer.Close();
> searcher.Dispose();
> return results;
> }
> For indexing: 
> The word "C#" is indexed and stored correctly.
> doc.Add(new Field("Title", sampleData.CourseName, Field.Store.YES, Field.Index.ANALYZED));
> Kindly let me know what I have to do to retrieve the result when I search with the term "C#".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
