lucenenet-dev mailing list archives

From "Vincent Van Den Berghe (JIRA)" <>
Subject [Lucene.Net] [jira] [Created] (LUCENENET-414) The definition of CharArraySet is dangerously confusing and leads to bugs when used.
Date Fri, 13 May 2011 06:54:47 GMT
The definition of CharArraySet is dangerously confusing and leads to bugs when used.

                 Key: LUCENENET-414
             Project: Lucene.Net
          Issue Type: Bug
          Components: Lucene.Net Core
    Affects Versions: Lucene.Net 2.9.2
         Environment: Irrelevant
            Reporter: Vincent Van Den Berghe
            Priority: Minor
             Fix For: Lucene.Net 2.9.2

Right now, CharArraySet derives from System.Collections.Hashtable, but doesn't actually use
this base type for storing elements: it keeps them in its own internal storage.
However, StandardAnalyzer.STOP_WORDS_SET is exposed as a System.Collections.Hashtable.
The obvious code for building your own stopword set from StandardAnalyzer.STOP_WORDS_SET
plus your own domain-specific stopwords looks like this:

CharArraySet myStopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET, ignoreCase: false);
foreach (string domainSpecificStopWord in DomainSpecificStopWords)
    myStopWords.Add(domainSpecificStopWord);

... will silently fail because the CharArraySet constructor accepts an ICollection, and STOP_WORDS_SET
is passed to it through its Hashtable type. Enumerating it that way reads the empty Hashtable base
storage, so the resulting myStopWords contains only the DomainSpecificStopWords and none of the
words from STOP_WORDS_SET.
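
The mechanism can be reproduced without Lucene.Net. The following is a minimal sketch under stated assumptions, not the library's code: HidingSet and HidingDemo are hypothetical names, and the private list stands in for whatever storage CharArraySet really uses. It only illustrates how a type that derives from Hashtable, stores its elements elsewhere, and hides GetEnumerator() looks empty when enumerated through ICollection.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in for CharArraySet: derives from Hashtable but keeps
// its elements in a private list, so the inherited Hashtable storage stays empty.
public class HidingSet : Hashtable
{
    private readonly List<string> entries = new List<string>();

    public bool Add(string text)
    {
        if (entries.Contains(text)) return false;
        entries.Add(text);
        return true;
    }

    // Hides (does not override) Hashtable.GetEnumerator().
    public new IEnumerator GetEnumerator()
    {
        return entries.GetEnumerator();
    }
}

public static class HidingDemo
{
    // Enumerates the way a constructor taking ICollection would:
    // the call resolves to Hashtable's own enumerator.
    public static int CountViaICollection(ICollection items)
    {
        int n = 0;
        foreach (object item in items) n++;
        return n;
    }

    // Enumerates through the derived static type, which binds
    // to the "new" GetEnumerator() declared above.
    public static int CountViaDerivedType(HidingSet items)
    {
        int n = 0;
        foreach (object item in items) n++;
        return n;
    }
}
```

With two words added to a HidingSet, CountViaICollection returns 0 while CountViaDerivedType returns 2, which mirrors the missing stopwords described above.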

One workaround would be to replace the first line with this:

CharArraySet myStopWords = new CharArraySet(
    StandardAnalyzer.STOP_WORDS_SET.Count + DomainSpecificStopWords.Length, ignoreCase: false);
foreach (string stopWord in (CharArraySet)StandardAnalyzer.STOP_WORDS_SET)
    myStopWords.Add(stopWord);

... but this relies on an implementation detail (STOP_WORDS_SET is really an UnmodifiableCharArraySet,
which is itself a CharArraySet). It works because the cast forces the foreach to use the correct
CharArraySet.GetEnumerator(), which is declared as a "new" method that hides the base one (this has
a bad code smell to it).

At least 2 possibilities exist to solve this problem:
- Make CharArraySet store its elements in the inherited Hashtable instance with a custom comparator, instead of in its own storage.
- Make CharArraySet use HashSet<char[]> (HashSet<T> has been available since .NET 3.5).
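
As a rough illustration of the second possibility, here is a minimal sketch, assuming a generic set is acceptable. CharArrayComparer and StopWordSets are hypothetical names and are not part of Lucene.Net. Note that a HashSet<char[]> still needs a custom comparer, because arrays compare by reference by default; the case folding via char.ToLowerInvariant is also an assumption about how ignoreCase should behave.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: compares char[] keys by content instead of by reference.
public sealed class CharArrayComparer : IEqualityComparer<char[]>
{
    private readonly bool ignoreCase;

    public CharArrayComparer(bool ignoreCase)
    {
        this.ignoreCase = ignoreCase;
    }

    // Assumed folding rule for the ignoreCase option.
    private char Fold(char c)
    {
        return ignoreCase ? char.ToLowerInvariant(c) : c;
    }

    public bool Equals(char[] x, char[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
        {
            if (Fold(x[i]) != Fold(y[i])) return false;
        }
        return true;
    }

    public int GetHashCode(char[] chars)
    {
        if (chars == null) return 0;
        int hash = 17;
        unchecked
        {
            // Same folding as Equals, so equal keys hash equally.
            foreach (char c in chars) hash = hash * 31 + Fold(c);
        }
        return hash;
    }
}

public static class StopWordSets
{
    // Merges any number of word lists into one content-compared set.
    public static HashSet<char[]> Build(bool ignoreCase, params IEnumerable<string>[] sources)
    {
        var set = new HashSet<char[]>(new CharArrayComparer(ignoreCase));
        foreach (IEnumerable<string> source in sources)
        {
            foreach (string word in source)
            {
                set.Add(word.ToCharArray());
            }
        }
        return set;
    }
}
```

For example, StopWordSets.Build(false, new[] { "a", "the" }, new[] { "lucene", "the" }) yields a set of three entries, and the set has a single enumerator regardless of the static type it is accessed through.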

This message is automatically generated by JIRA.