lucenenet-dev mailing list archives

From "Shad Storhaug (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (LUCENENET-612) SERIOUS issues with PerFieldAnalyzerWrapper in 4.8
Date Sun, 29 Dec 2019 07:57:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENENET-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shad Storhaug resolved LUCENENET-612.
-------------------------------------
    Fix Version/s: Lucene.Net 4.8.0
       Resolution: Fixed

This has now been resolved in Lucene.NET 4.8.0-beta00007.

> SERIOUS issues with PerFieldAnalyzerWrapper in 4.8
> --------------------------------------------------
>
>                 Key: LUCENENET-612
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-612
>             Project: Lucene.Net
>          Issue Type: Bug
>          Components: Lucene.Net.Analysis.Common
>    Affects Versions: Lucene.Net 4.8.0
>            Reporter: Shad Storhaug
>            Priority: Major
>             Fix For: Lucene.Net 4.8.0
>
>   Original Estimate: 16h
>  Remaining Estimate: 16h
>
> This came in on the user mailing list on 15-July-2019 and was originally reported by
Bryan Rojo (BryanRojo@elliotelectric.com)
>  
> {quote}Not necessarily a bug, but for some people who use PerFieldAnalyzerWrapper like
I do this might be worth noting.
> PerFieldAnalyzerWrapper has been "improved" in 4.8 and now uses a PER_FIELD_REUSE_STRATEGY
which means that the tokenized fields will be stored in a dictionary, so if you have multiple
fields with the same name in your document, then you will only be able to index the very first
one that makes it into that dictionary.
> So the problem with this is that you can potentially lose thousands of terms in your
index, which could cause your searches to be of very low quality.
> BEWARE.
> {quote}
>  
> There are 2 issues that need to be resolved to address this:
> 1. The documentation for {{PerFieldAnalyzerWrapper}} should be updated to inform users
that if they need to use multiple dictionary keys with the same name, they should use
{{TreeDictionary<K, V>}}.
> 2. {{TreeDictionary<K, V>}} does not currently implement
{{System.Collections.Generic.IDictionary<TKey, TValue>}}, as it was brought over from C5 as-is.
> Another thing of note is that C5 has added support for .NET Standard 1.0 since this was
brought over.
> However, there still seem to be a few problems that make the C5 types incompatible with
Lucene.Net, most notably the lack of support for
{{System.Collections.Generic.IDictionary<TKey, TValue>}} in {{TreeDictionary}} and
{{System.Collections.Generic.ISet<T>}} in {{TreeSet}}
(the latter of which has already been patched in {{Lucene.Net.Support.TreeSet}}).
> I [reported|https://github.com/sestoft/C5/issues/53] the lack of support for {{ISet<T>}}
on 6-Nov-2016, but although the maintainers agree this should be done, it still hasn't been.
Perhaps a PR to the C5 project is the way to get this done, which would allow us to finally
remove these collection copies from Lucene.Net.Support and add a package dependency on C5.
> Another option is to shop around to see if there are any other generic TreeSet/TreeDictionary
implementations that have popped up since late 2016 that we can check for compatibility.
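
The reporter's description above, in which a dictionary keyed by field name retains only the first entry for each name, can be illustrated with a small stand-alone sketch. This is not Lucene.NET's code and does not reproduce PER_FIELD_REUSE_STRATEGY; it only shows the general first-wins behaviour of a name-keyed map as the report describes it. The class name FieldNameKeyDemo, the helper firstWinsByName, and the sample field names and values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldNameKeyDemo {
    /**
     * Hypothetical helper (not a Lucene API): stores each {name, value} pair
     * in a map keyed by field name, keeping only the first value per name.
     */
    static Map<String, String> firstWinsByName(List<String[]> fields) {
        Map<String, String> byName = new LinkedHashMap<>();
        for (String[] field : fields) {
            // putIfAbsent ignores later pairs whose name is already present.
            byName.putIfAbsent(field[0], field[1]);
        }
        return byName;
    }

    public static void main(String[] args) {
        // Assumed sample document: two fields share the name "tag".
        List<String[]> fields = new ArrayList<>();
        fields.add(new String[] {"tag", "lucene"});
        fields.add(new String[] {"tag", "search"});
        fields.add(new String[] {"title", "hello"});

        Map<String, String> kept = firstWinsByName(fields);
        // Prints: 2 of 3 field values kept: {tag=lucene, title=hello}
        System.out.println(kept.size() + " of " + fields.size()
                + " field values kept: " + kept);
    }
}
```

The second "tag" value never reaches the map, which is the kind of silent loss the report warns about when several fields in one document share a name.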



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
