lucenenet-dev mailing list archives

From "Shad Storhaug (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENENET-612) SERIOUS issues with PerFieldAnalyzerWrapper in 4.8
Date Mon, 12 Aug 2019 00:37:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENENET-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shad Storhaug updated LUCENENET-612:
------------------------------------
    Description: 
This came in on the user mailing list on 15-July-2019 and was originally reported by Bryan
Rojo (BryanRojo@elliotelectric.com)

 
{quote}Not necessarily a bug, but for some people who use PerFieldAnalyzerWrapper like I do
this might be worth noting.

PerFieldAnalyzerWrapper has been "improved" in 4.8 and now uses a PER_FIELD_REUSE_STRATEGY
which means that the tokenized fields will be stored in a dictionary, so if you have multiple
fields with the same name in your document, then you will only be able to index the very first
one that makes it into that dictionary.

So the problem with this is that you can potentially lose thousands of terms in your index,
which could cause your searches to be of very low quality.

BEWARE.
{quote}
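
The mechanism described in the quote can be sketched roughly as follows. This only illustrates a cache keyed by field name, assuming the behavior is as the reporter describes. {{Components}}, {{GetOrAdd}}, and the field name "tag" are hypothetical and are not Lucene.Net types or APIs.

```csharp
using System;
using System.Collections.Generic;

public static class PerFieldCacheSketch
{
    // Hypothetical stand-in for the per-field state a reuse strategy might cache.
    // This is NOT a Lucene.Net type.
    public sealed class Components
    {
        public Components(string source) { Source = source; }
        public string Source { get; }
    }

    // Returns the cached entry for fieldName, creating it only on the first request.
    public static Components GetOrAdd(
        IDictionary<string, Components> cache, string fieldName, string source)
    {
        if (!cache.TryGetValue(fieldName, out Components existing))
        {
            existing = new Components(source);
            cache[fieldName] = existing;
        }
        return existing;
    }

    public static void Main()
    {
        var cache = new Dictionary<string, Components>();

        // Two fields with the same name share one dictionary slot.
        Components first = GetOrAdd(cache, "tag", "first value");
        Components second = GetOrAdd(cache, "tag", "second value");

        Console.WriteLine(object.ReferenceEquals(first, second));
        Console.WriteLine(second.Source);
        Console.WriteLine(cache.Count);
    }
}
```

Whether sharing an entry like this actually drops terms depends on how the cached entry is reused between fields, which this sketch does not model.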
 
There are 2 issues that need to be resolved to address this:

1. The documentation for {{PerFieldAnalyzerWrapper}} should be updated to inform users that
if they need to use multiple dictionary keys with the same name, they should use
{{TreeDictionary<K, V>}}.
2. {{TreeDictionary<K, V>}} does not currently implement
{{System.Collections.Generic.IDictionary<TKey, TValue>}}, as it was brought over from C5 as-is.

Another thing of note is that C5 has added support for .NET Standard 1.0 since this was brought
over.

However, there still seem to be a few problems that make the C5 types incompatible with Lucene.Net,
most notably the lack of support for {{System.Collections.Generic.IDictionary<TKey, TValue>}}
in {{TreeDictionary}} and {{System.Collections.Generic.ISet<T>}} in {{TreeSet}} (the
latter of which has already been patched in {{Lucene.Net.Support.TreeSet}}).

I [reported|https://github.com/sestoft/C5/issues/53] the lack of support for {{ISet<T>}}
on 6-Nov-2016, but although the maintainers agree this should be done, it still hasn't been.
Perhaps a PR to the C5 project is the way to get this done, which would allow us to finally
remove these collection copies from Lucene.Net.Support and add a package dependency on C5.

Another option is to shop around to see if there are any other generic TreeSet/TreeDictionary
implementations that have popped up since late 2016 that we can check for compatibility.
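
One possible way to screen candidate collections for the interfaces mentioned above is a small reflection check. This is a minimal sketch, not an agreed test plan: the helper name {{Implements}} is hypothetical, and the BCL types {{SortedDictionary<TKey, TValue>}}, {{SortedSet<T>}}, and {{List<T>}} are used only as stand-ins for whatever third-party types would actually be evaluated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CollectionCompatibilityCheck
{
    // True if 'candidate' implements some constructed form of the given
    // open generic interface, for example IDictionary<,> or ISet<>.
    public static bool Implements(Type candidate, Type openGenericInterface)
    {
        return candidate.GetInterfaces().Any(i =>
            i.IsGenericType && i.GetGenericTypeDefinition() == openGenericInterface);
    }

    public static void Main()
    {
        // BCL stand-ins; replace with the candidate tree-based types under review.
        Console.WriteLine(Implements(typeof(SortedDictionary<string, int>), typeof(IDictionary<,>)));
        Console.WriteLine(Implements(typeof(SortedSet<string>), typeof(ISet<>)));
        Console.WriteLine(Implements(typeof(List<string>), typeof(ISet<>)));
    }
}
```

A check like this only confirms that the interface is declared. It says nothing about ordering semantics or behavioral parity with the Java collections, which would still need separate tests.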

  was:
This came in on the user mailing list on 15-July-2019 and was originally reported by Bryan
Rojo (BryanRojo@elliotelectric.com)

 
{quote}Not necessarily a bug, but for some people who use PerFieldAnalyzerWrapper like I do
this might be worth noting.

PerFieldAnalyzerWrapper has been "improved" in 4.8 and now uses a PER_FIELD_REUSE_STRATEGY
which means that the tokenized fields will be stored in a dictionary, so if you have multiple
fields with the same name in your document, then you will only be able to index the very first
one that makes it into that dictionary.

So the problem with this is that you can potentially lose thousands of terms in your index,
which could cause your searches to be of very low quality.

BEWARE.
{quote}
 
There are 2 issues that need to be resolved to address this:

1. The documentation for {{PerFieldAnalyzerWrapper}} should be updated to inform users that
if they need to use multiple dictionary keys, they should use {{TreeDictionary<K, V>}}.
2. {{TreeDictionary<K, V>}} does not currently implement
{{System.Collections.Generic.IDictionary<TKey, TValue>}}, as it was brought over from C5 as-is.

Another thing of note is that C5 has added support for .NET Standard 1.0 since this was brought
over.

However, there still seem to be a few problems that make the C5 types incompatible with Lucene.Net,
most notably the lack of support for {{System.Collections.Generic.IDictionary<TKey, TValue>}}
in {{TreeDictionary}} and {{System.Collections.Generic.ISet<T>}} in {{TreeSet}} (the
latter of which has already been patched in {{Lucene.Net.Support.TreeSet}}).

I [reported|https://github.com/sestoft/C5/issues/53] the lack of support for {{ISet<T>}}
on 6-Nov-2016, but although the maintainers agree this should be done, it still hasn't been.
Perhaps a PR to the C5 project is the way to get this done, which would allow us to finally
remove these collection copies from Lucene.Net.Support and add a package dependency on C5.

Another option is to shop around to see if there are any other generic TreeSet/TreeDictionary
implementations that have popped up since late 2016 that we can check for compatibility.


> SERIOUS issues with PerFieldAnalyzerWrapper in 4.8
> --------------------------------------------------
>
>                 Key: LUCENENET-612
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-612
>             Project: Lucene.Net
>          Issue Type: Bug
>          Components: Lucene.Net.Analysis.Common
>    Affects Versions: Lucene.Net 4.8.0
>            Reporter: Shad Storhaug
>            Priority: Major
>   Original Estimate: 16h
>  Remaining Estimate: 16h



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
