lucenenet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Van Den Berghe, Vincent (JIRA)" <j...@apache.org>
Subject [Lucene.Net] [jira] [Commented] (LUCENENET-414) The definition of CharArraySet is dangerously confusing and leads to bugs when used.
Date Fri, 13 May 2011 13:22:47 GMT

    [ https://issues.apache.org/jira/browse/LUCENENET-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033031#comment-13033031
] 

Van Den Berghe, Vincent commented on LUCENENET-414:
---------------------------------------------------

Hello Digy,

Thanks for your response.
I don't want to sound overly pedantic (but please tell me if I do), but this changed implementation
solves only part of the problem.
Now, CharArraySet derives from Set<T>, which itself derives from List<T>. Items
are now stored both in this base class, as in the private HashSet<string> _Set.
However, because List<T> doesn't define its modifiers Add(T), Clear() and Remove(T)
as virtual, the derived implementation defines them as "new". 
This violates a variant of the Liskov substitution principle: an operation on the derived
type has not the same effect as the same operation on the base type.
In this case, it means that the following code will cause the items in the List<T> base
type and in the _Set to be desynchronized:

	CharArraySet set=...
	List<string> same=set;
	same.Add("whatever");
	// at this point, same.Contains("whatever")==true but set.Contains("whatever")==false even
though it's the same instance.

You might rightfully retort that this never happens and I should mind my own business, but
I know at least one poor soul who did just that: me :-(.

On a completely unrelated matter, the new implementation has 2 methods:

			public void Add(System.Collections.Generic.IList<T> items)
			public void Add(Support.Set<T> items)

.. which can be collapsed into one, since the only thing used in both cases is the enumerator:

			public void Add(IEnumerable<T> items)

I don't recall the design rule, but it's something like "to increase reuse, make your function
parameters are general as possible, but their return value as specific as possible".
I am unable to get 2.9.4g to investigate further, but if you are moving towards the Generic
collections in Lucene, the following implementation should be a drop-in replacement, without
suffering from the aforementioned quirks:

		[Serializable]
		public class Set<T> : ICollection<T>
		{
			private readonly System.Collections.Generic.HashSet<T> _Set = new System.Collections.Generic.HashSet<T>();
			bool _ReadOnly = false;

			public Set()
			{
			}

			public Set(bool readOnly)
			{
				this._ReadOnly = readOnly;
			}

			public bool ReadOnly
			{
				set
				{
					_ReadOnly = value;
				}
				get
				{
					return _ReadOnly;
				}
			}

			public virtual void Add(T item)
			{
				if (_ReadOnly) throw new NotSupportedException();
				if (_Set.Contains(item)) return;
				_Set.Add(item);
			}

			public void Add(IEnumerable<T> items)
			{
				if (_ReadOnly) throw new NotSupportedException();
				foreach (T item in items)
				{
					if (_Set.Contains(item)) continue;
					_Set.Add(item);
				}
			}


			public void Clear()
			{
				if (_ReadOnly) throw new NotSupportedException();
				_Set.Clear();
			}

			public bool Contains(T item)
			{
				return _Set.Contains(item);
			}

			public void CopyTo(T[] array, int arrayIndex)
			{
				_Set.CopyTo(array, arrayIndex);
			}

			public int Count
			{
				get { return _Set.Count; }
			}

			public bool IsReadOnly
			{
				get { return _ReadOnly; }
			}

			public bool Remove(T item)
			{
				if (_ReadOnly) throw new NotSupportedException();
				return _Set.Remove(item);
			}

			public IEnumerator<T> GetEnumerator()
			{
				return _Set.GetEnumerator();
			}

			System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
			{
				return _Set.GetEnumerator();
			}


			public void RemoveAll(System.Collections.Generic.IList<T> list)
			{
				if (_ReadOnly) throw new NotSupportedException();
			}

			public void RetainAll(System.Collections.Generic.IList<T> list)
			{
				if (_ReadOnly) throw new NotSupportedException();
			}

			public void RemoveAll(System.Collections.ArrayList list)
			{
				if (_ReadOnly) throw new NotSupportedException();
			}

			public void RetainAll(System.Collections.ArrayList list)
			{
				if (_ReadOnly) throw new NotSupportedException();
			}
		}

I've copied RemoveAll/RetainAll verbatim, since I don't know what they are for. I also left
them nonvirtual.
Note that it implements ICollection<T> and not IList<T>: from what I can determine,
IList<T> is not necessary. It's also weird to have an indexed access on a type that
by definition cannot have one.
My medicine won't work if non-generic collection interfaces are still lurking in other parts
of the code, because List<T>, for historical reasons, also implements the non-generic
IList and ICollection interfaces.
Alas, this opens another can of worms (IList<T> doesn't implement IList, ICollection<T>
doesn't implement ICollection, but IEnumerable<T> implements IEnumerable), but the margin
of this e-mail is too small to write down the cure.
I'll stop rambling now and will try my own medicine as soon as I'm able to get the 2.9.4g
from the ASF SVN.

In the meantime, have a nice week-end and thank you for being the cornerstone of .NET's premier
search engine.


Vincent 



> The definition of CharArraySet is dangerously confusing and leads to bugs when used.
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENENET-414
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-414
>             Project: Lucene.Net
>          Issue Type: Bug
>          Components: Lucene.Net Core
>    Affects Versions: Lucene.Net 2.9.2
>         Environment: Irrelevant
>            Reporter: Vincent Van Den Berghe
>            Priority: Minor
>             Fix For: Lucene.Net 2.9.2
>
>
> Right now, CharArraySet derives from System.Collections.Hashtable, but doesn't actually
use this base type for storing elements.
> However, the StandardAnalyzer.STOP_WORDS_SET is exposed as a System.Collections.Hashtable.
The trivial code to build your own stopword set using the StandardAnalyzer.STOP_WORDS_SET
and adding your own set of stopwords like this:
> CharArraySet myStopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET, ignoreCase:
false);
> foreach (string domainSpecificStopWord in DomainSpecificStopWords)
>     stopWords.Add(domainSpecificStopWord);
> ... will fail because the CharArraySet accepts an ICollection, which will be passed the
Hashtable instance of STOP_WORDS_SET: the resulting myStopWords will only contain the DomainSpecificStopWords,
and not those from STOP_WORDS_SET.
> One workaround would be to replace the first line with this:
> CharArraySet stopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET.Count + DomainSpecificStopWords.Length,
ignoreCase: false);
> foreach (string domainSpecificStopWord in (CharArraySet)StandardAnalyzer.STOP_WORDS_SET)
>     stopWords.Add(domainSpecificStopWord);
> ... but this makes use of the implementation detail (the STOP_WORDS_SET is really an
UnmodifiableCharArraySet which is itself a CharArraySet). It works because it forces the foreach()
to use the correct CharArraySet.GetEnumerator(), which is defined as a "new" method (this
has a bad code smell to it)
> At least 2 possibilities exist to solve this problem:
> - Make CharArraySet use the Hashtable instance and a custom comparator, instead of its
own implementation.
> - Make CharArraySet use HashSet<char[]>, defined in .NET 4.0.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message