lucenenet-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [lucenenet] willson556 commented on issue #296: IndexOutOfRangeException when searching
Date Wed, 14 Oct 2020 21:43:06 GMT

willson556 commented on issue #296:
URL: https://github.com/apache/lucenenet/issues/296#issuecomment-708676417


   I can reliably reproduce this with one of my datasets, but I'm not sure I could write a failing test for it. I'm running on .NET Core/x64.
   
   Similar stack trace to everyone after OP:
   ```
      at Lucene.Net.Util.Automaton.UTF32ToUTF8.Convert(Automaton utf32)
      at Lucene.Net.Util.Automaton.CompiledAutomaton..ctor(Automaton automaton, Nullable`1 finite, Boolean simplify)
      at Lucene.Net.Search.FuzzyTermsEnum.InitAutomata(Int32 maxDistance)
      at Lucene.Net.Search.FuzzyTermsEnum.GetAutomatonEnum(Int32 editDistance, BytesRef lastTerm)
      at Lucene.Net.Search.FuzzyTermsEnum.MaxEditDistanceChanged(BytesRef lastTerm, Int32 maxEdits, Boolean init)
      at Lucene.Net.Search.FuzzyTermsEnum..ctor(Terms terms, AttributeSource atts, Term term, Single minSimilarity, Int32 prefixLength, Boolean transpositions)
      at Lucene.Net.Search.FuzzyQuery.GetTermsEnum(Terms terms, AttributeSource atts)
      at Lucene.Net.Search.MultiTermQuery.RewriteMethod.GetTermsEnum(MultiTermQuery query, Terms terms, AttributeSource atts)
      at Lucene.Net.Search.TermCollectingRewrite`1.CollectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector)
      at Lucene.Net.Search.TopTermsRewrite`1.Rewrite(IndexReader reader, MultiTermQuery query)
      at Lucene.Net.Search.MultiTermQuery.Rewrite(IndexReader reader)
      at Lucene.Net.Search.BooleanQuery.Rewrite(IndexReader reader)
      at Lucene.Net.Search.IndexSearcher.Rewrite(Query original)
      at Lucene.Net.Search.IndexSearcher.CreateNormalizedWeight(Query query)
      at Lucene.Net.Search.IndexSearcher.Search(Query query, Filter filter, Int32 n)
      at Lucene.Net.Search.IndexSearcher.Search(Query query, Int32 n)
   ```
   I'm using this analyzer (I'm just coming up to speed with Lucene, so I'm not sure the arrangement of filters actually makes sense):
   ```c#
   public class NGramAnalyzer : Analyzer
   {
       private readonly LuceneVersion version;
       private readonly int minGram;
       private readonly int maxGram;
   
       public NGramAnalyzer(LuceneVersion version, int minGram = 2, int maxGram = 8)
       {
           this.version = version;
           this.minGram = minGram;
           this.maxGram = maxGram;
       }
   
       /// <inheritdoc />
       protected override TextReader InitReader(string fieldName, TextReader reader)
       {
           var charMap = new NormalizeCharMap.Builder();
           charMap.Add("_", " ");
           return new MappingCharFilter(charMap.Build(), reader);
       }
   
       /// <inheritdoc />
       protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
       {
           // Splits words at punctuation characters, removing punctuation.
           // Splits words at hyphens, unless there's a number in the token...
           // Recognizes email addresses and internet hostnames as one token.
           var tokenizer = new StandardTokenizer(version, reader);
   
           TokenStream filter = new StandardFilter(version, tokenizer);
   
           // Normalizes token text to lower case.
           filter = new LowerCaseFilter(version, filter);
   
           // Removes stop words from a token stream.
           filter = new StopFilter(version, filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
   
           filter = new EnglishMinimalStemFilter(filter);
   
           filter = new NGramTokenFilter(version, filter, minGram, maxGram);
           return new TokenStreamComponents(tokenizer, filter);
       }
   }
   ```
   
   Setup is then:
   
   ```c#
   var indexStore = new RAMDirectory();
   var indexConfig = new IndexWriterConfig(Version, Analyzer);
   indexWriter = new IndexWriter(indexStore, indexConfig);
   initialIndexingTask = Task.Run(() =>
                                                 {
                                                     var stopwatch = Stopwatch.StartNew();
                                                     indexWriter.AddDocuments(collection.Select(GetAndSubscribeToDocument));
                                                     indexWriter.Commit();
                                                     Debug.WriteLine(@$"{typeof(TDocument)} Indexing: {stopwatch.ElapsedMilliseconds}ms");
                                                 });
   ```
   
   Once initial indexing is complete, searching is done with:
   
   ```c#
   using var reader = DirectoryReader.Open(indexWriter.Directory);
   var searcher = new IndexSearcher(reader);
   
   Query? parsedQuery;
   try
   {
       var queryParser = new MultiFieldQueryParser(Version, DefaultSearchFields, Analyzer);
       var terms = new HashSet<Term>();
       queryParser.Parse(query).Rewrite(reader).ExtractTerms(terms);
   
       var boolQuery = new BooleanQuery();
       terms.ForEach(t =>
                       {
                           boolQuery.Add(new FuzzyQuery(t), Occur.SHOULD);
                           boolQuery.Add(new WildcardQuery(t), Occur.SHOULD);
                       });
   
       parsedQuery = boolQuery;
   }
   catch (Exception)
   {
       // TODO: User feedback
       return new (TDocument doc, float score)[0];
   }
   
   var hits = searcher.Search(parsedQuery, resultLimit);
   ```
   
   I've archived the dataset and code so that I can go back and gather more data to help troubleshoot. Note that in my current repro case, I have 4 separate instances of this setup (each with its own RAMDirectory, IndexWriter, and Reader+Searcher) running at the same time, with _nearly_ identical datasets. A quick look through the code up and down the stack trace didn't reveal anything in Lucene that is obviously shared between those instances and could be the culprit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org