lucenenet-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [lucenenet] NightOwl888 commented on issue #246: Custom StopWord Analyzer - Exception Cannot read from a closed TextReader.
Date Tue, 28 Apr 2020 19:28:49 GMT

NightOwl888 commented on issue #246:
URL: https://github.com/apache/lucenenet/issues/246#issuecomment-620808822


   Because `CreateComponents()` is a factory method (a creational operation), it should dispose only
short-lived dependencies. You are disposing the stream before returning it, so the caller of
`CreateComponents()` receives a stream that is already closed and cannot be used.
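   
   The failure mode can be sketched without Lucene, using only `System.IO`. The class and method
names here (`ReaderFactoryDemo`, `CreateDisposed`, `Create`) are hypothetical and chosen purely for
illustration: the first factory disposes the reader it returns, while the second leaves disposal to
the caller.
   
   ```c#
       using System;
       using System.IO;
   
       public static class ReaderFactoryDemo
       {
           // Anti-pattern: the using block disposes the reader as control leaves it,
           // so the caller receives an already-closed TextReader.
           public static TextReader CreateDisposed(string text)
           {
               using (var reader = new StringReader(text))
               {
                   return reader;
               }
           }
   
           // Factory pattern: create the object and hand ownership to the caller.
           public static TextReader Create(string text)
           {
               return new StringReader(text);
           }
   
           public static void Main()
           {
               try
               {
                   CreateDisposed("some text").ReadToEnd();
               }
               catch (ObjectDisposedException ex)
               {
                   Console.WriteLine("Disposed too early: " + ex.Message);
               }
   
               // The caller owns the reader and disposes it after use.
               using (TextReader reader = Create("some text"))
               {
                   Console.WriteLine(reader.ReadToEnd());
               }
           }
       }
   ```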
   
   To make a customized standard analyzer, the best approach would be to model your new class
after the [existing StandardAnalyzer class](https://github.com/apache/lucenenet/blob/8cf15f7fd0bb7b22bb2e865895998583d049ab92/src/Lucene.Net.Analysis.Common/Analysis/Standard/StandardAnalyzer.cs).
   
   ```c#
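       // Namespaces as of Lucene.NET 4.8 (Lucene.Net and Lucene.Net.Analysis.Common packages)
       using System.Collections.Generic;
       using System.IO;
       using Lucene.Net.Analysis;
       using Lucene.Net.Analysis.Core;
       using Lucene.Net.Analysis.Standard;
       using Lucene.Net.Analysis.Util;
       using Lucene.Net.Util;
   
       public sealed class MyStopwordAnalyzer : StopwordAnalyzerBase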
       {
           /// <summary>
           /// An unmodifiable set containing some common English words that are usually not
           /// useful for searching. 
           /// </summary>
           public static readonly CharArraySet STOP_WORDS_SET = LoadEnglishStopWordsSet();
   
           // LUCENENET: Avoid static constructors
           // (see https://github.com/apache/lucenenet/pull/224#issuecomment-469284006)
           private static CharArraySet LoadEnglishStopWordsSet()
           {
               IList<string> stopWords = new string[] {
                   "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
                   "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
                   "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
               };
   #pragma warning disable 612, 618
               var stopSet = new CharArraySet(LuceneVersion.LUCENE_CURRENT, stopWords, false);
   #pragma warning restore 612, 618
               return CharArraySet.UnmodifiableSet(stopSet);
           }
   
           /// <summary>
           /// Builds an analyzer with the given stop words. </summary>
           /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
           /// <param name="stopWords"> stop words  </param>
           public MyStopwordAnalyzer(LuceneVersion matchVersion, CharArraySet stopWords)
               : base(matchVersion, stopWords)
           {
           }
   
           /// <summary>
           /// Builds an analyzer with the default stop words (<see cref="STOP_WORDS_SET"/>). </summary>
           /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
           public MyStopwordAnalyzer(LuceneVersion matchVersion)
               : this(matchVersion, STOP_WORDS_SET)
           {
           }
   
           /// <summary>
           /// Builds an analyzer with the stop words from the given reader. </summary>
           /// <seealso cref="WordlistLoader.GetWordSet(TextReader, LuceneVersion)"/>
           /// <param name="matchVersion"> Lucene compatibility version - See <see cref="MyStopwordAnalyzer"/> </param>
           /// <param name="stopwords"> <see cref="TextReader"/> to read stop words from  </param>
           public MyStopwordAnalyzer(LuceneVersion matchVersion, TextReader stopwords)
               : this(matchVersion, LoadStopwordSet(stopwords, matchVersion))
           {
           }
   
           protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
           {
               var src = new StandardTokenizer(m_matchVersion, reader);
               TokenStream tok = new StandardFilter(m_matchVersion, src);
               // tok = new LowerCaseFilter(m_matchVersion, tok); // optional
               tok = new StopFilter(m_matchVersion, tok, m_stopwords);
               return new TokenStreamComponents(src, tok);
           }
       }
   ```
   
   Note that the existing `StandardAnalyzer` class also accepts a `CharArraySet` of stopwords,
which may meet your needs if you are happy for it to normalize your text with the `LowerCaseFilter`.
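   
   As a rough sketch of that option, the snippet below passes a custom stop set to `StandardAnalyzer`
and reads the resulting token stream. It assumes the Lucene.NET 4.8 packages (`Lucene.Net` and
`Lucene.Net.Analysis.Common`) are referenced; the `StopwordDemo` class, the `Tokenize` helper, and
the field name `"body"` are hypothetical. The caller, not `CreateComponents()`, disposes the stream
once it has finished reading.
   
   ```c#
       using System;
       using System.Collections.Generic;
       using System.IO;
       using Lucene.Net.Analysis;
       using Lucene.Net.Analysis.Standard;
       using Lucene.Net.Analysis.TokenAttributes;
       using Lucene.Net.Analysis.Util;
       using Lucene.Net.Util;
   
       public static class StopwordDemo
       {
           // Hypothetical helper: returns the terms left after analysis.
           public static List<string> Tokenize(string text, string[] stopWords)
           {
               var terms = new List<string>();
               var stopSet = new CharArraySet(LuceneVersion.LUCENE_48, stopWords, false);
   
               using (Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48, stopSet))
               using (TokenStream stream = analyzer.GetTokenStream("body", new StringReader(text)))
               {
                   ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
                   stream.Reset();                 // required before the first IncrementToken()
                   while (stream.IncrementToken())
                   {
                       terms.Add(term.ToString());
                   }
                   stream.End();
               } // the stream is disposed here, by its consumer
   
               return terms;
           }
   
           public static void Main()
           {
               // StandardAnalyzer lowercases before applying the stop filter.
               List<string> terms = Tokenize("The quick brown Fox", new[] { "the" });
               Console.WriteLine(string.Join(" ", terms)); // quick brown fox
           }
       }
   ```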
   
   

