lucenenet-dev mailing list archives

From DI Edgar Piskernik <off...@piskernik.com>
Subject Re: Lucene.NET 4.8. Overriding Special Characters
Date Fri, 05 May 2017 07:16:36 GMT
Hi Steve,

this is what I came up with (I left some other options in the example that you didn’t
explicitly ask for, but that might be handy for your project).

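using System.IO;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CharFilters;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.De;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Pattern;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

    public class DemoAnalyzer : Analyzer
    {
        private readonly CharArraySet _stopWords;

        public DemoAnalyzer(CharArraySet stopWords)
        {
            // Copy the caller's stop words, then merge in the built-in sets.
            _stopWords = new CharArraySet(LuceneVersion.LUCENE_48, stopWords, true);
            _stopWords.UnionWith(StandardAnalyzer.STOP_WORDS_SET); //English stop words
            _stopWords.UnionWith(GermanAnalyzer.DefaultStopSet); //German stop words
        }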

        protected override TextReader InitReader(string fieldName, TextReader reader)
        {
            // Map unwanted characters to a space before the tokenizer sees the text.
            NormalizeCharMap.Builder normalizer = new NormalizeCharMap.Builder();
            normalizer.Add(":", " ");
            normalizer.Add("*", " "); //add all the other characters you would like to filter out...
            return new MappingCharFilter(normalizer.Build(), reader);
        }

        protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
        {
            var source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
            TokenStream result = new StandardFilter(LuceneVersion.LUCENE_48, source);
            result = new WordDelimiterFilter(LuceneVersion.LUCENE_48, result, 0, null);
            //ignore tokens that are too short or too long
            result = new LengthFilter(LuceneVersion.LUCENE_48, result, 3, 20);
            //remove all numbers from the query
            Regex regex = new Regex(@"\d+", RegexOptions.Compiled);
            result = new PatternReplaceFilter(result, regex, string.Empty, true);
            //everything in lowercase
            result = new LowerCaseFilter(LuceneVersion.LUCENE_48, result);
            //remove all German and English stop words
            result = new StopFilter(LuceneVersion.LUCENE_48, result, _stopWords);
            return new TokenStreamComponents(source, result);
        }
    }
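
To check which tokens actually survive the chain, you can run some text through the analyzer and print each term. The following is only a rough sketch: the class name DemoAnalyzerCheck, the field name "content" and the sample text are made up for illustration, and it assumes a Lucene.NET 4.8 build in which the analyzer exposes GetTokenStream (other builds may name this method differently).

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public static class DemoAnalyzerCheck
{
    public static void Main()
    {
        // Hypothetical field name and sample text, chosen only for illustration.
        const string fieldName = "content";
        const string text = "Price: 12345 *special* offer für alle";

        // No extra stop words beyond the ones the analyzer adds itself.
        var extraStopWords = new CharArraySet(LuceneVersion.LUCENE_48, 0, true);

        using (var analyzer = new DemoAnalyzer(extraStopWords))
        using (TokenStream stream = analyzer.GetTokenStream(fieldName, new StringReader(text)))
        {
            // The term attribute is updated in place on every IncrementToken() call.
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }
            stream.End();
        }
    }
}
```

If a character you mapped in InitReader still shows up in the printed terms, or a word you expected is missing, that tells you which filter in the chain to adjust.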

Edgar


> Am 05.05.2017 um 06:51 schrieb Oren Eini (Ayende Rahien) <ayende@ayende.com>:
> 
> That is typically done by changing the analyzer in question.
> 
> *Hibernating Rhinos Ltd  *
> 
> Oren Eini* l CEO l *Mobile: + 972-52-548-6969
> 
> Office: +972-4-622-7811 *l *Fax: +972-153-4-622-7811
> 
> 
> 
> On Fri, May 5, 2017 at 5:10 AM, Steve Mannina <steve_mannina@outlook.com>
> wrote:
> 
>> How do I change the list of special characters that Lucene filters out? I
>> found how to change the list of stop words, but can't find out how to
>> change the list of special characters. Thanks.
>> 
>> 

