lucenenet-dev mailing list archives

From NightOwl888 <...@git.apache.org>
Subject [GitHub] lucenenet issue #182: Analysis Missing Tests and Bug Fixes
Date Sat, 27 Aug 2016 09:21:24 GMT
Github user NightOwl888 commented on the issue:

    https://github.com/apache/lucenenet/pull/182
  
    Ok, this is now down to 23 failing tests.
    
    The 17 failing tests in Synonym are still really no closer to being solved. I went over
the SynonymMap and SynonymFilter classes line by line 3x. Wherever the problem is, it is hidden
well.
    
    After spending a whole day stepping through code, I finally found a clue - all of the
failing tests are failing when the expected synonym input has a space in it. For example,
TestMatching doesn't fail until [this line](https://github.com/NightOwl888/lucenenet/blob/analysis-bugz/src/Lucene.Net.Tests.Analysis.Common/Analysis/Synonym/TestSynonymMapFilter.cs#L875)
when the first expected input is "z x c v". It is unclear how that is supposed to happen,
though, since the tokenizer makes "z" a separate token, which causes the logic to exit at
that point without comparing "z x", "z x c", and "z x c v". I went online hunting for a clue,
but only found [this question on SO](http://stackoverflow.com/questions/17283100/lucene-synonym-filter-behavior)
in which the poster is just as confused about it as I am.
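
    For reference, a minimal sketch of how a multi-word input such as "z x c v" can be matched
even though the tokenizer emits "z" on its own: the filter looks ahead across several tokens
and joins them with a separator before consulting the rules. This is a simplified illustration
in Python, not the Lucene.Net code. The names `build_rules`, `longest_match` and `apply_synonyms`
and the output "zxcv" are hypothetical, the separator value is assumed to mirror the one Java
Lucene's SynonymMap uses between words, and details such as keeping the original tokens are left out.

```python
# Assumption: mirrors the word separator Java Lucene's SynonymMap places between words.
WORD_SEPARATOR = "\u0000"


def build_rules(pairs):
    """Key each (possibly multi-word) input by its words joined with WORD_SEPARATOR."""
    return {WORD_SEPARATOR.join(inp.split()): out for inp, out in pairs}


def longest_match(tokens, start, rules):
    """Return (tokens_consumed, output) for the longest rule beginning at `start`.

    Returns (0, None) when no rule matches. A single token that is not a
    complete rule does NOT end the search; longer candidates are still tried.
    """
    max_len = max((key.count(WORD_SEPARATOR) + 1 for key in rules), default=0)
    best = (0, None)
    key = ""
    for n in range(1, max_len + 1):
        if start + n > len(tokens):
            break
        next_token = tokens[start + n - 1]
        key = next_token if n == 1 else key + WORD_SEPARATOR + next_token
        if key in rules:
            best = (n, rules[key])
    return best


def apply_synonyms(tokens, rules):
    """Replace every longest multi-token match with its output (simplified)."""
    out = []
    i = 0
    while i < len(tokens):
        consumed, replacement = longest_match(tokens, i, rules)
        if consumed:
            out.append(replacement)
            i += consumed
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical rule: the four-token input "z x c v" maps to a single output.
rules = build_rules([("z x c v", "zxcv")])
result = apply_synonyms("a z x c v b".split(), rules)
```

    The point of the sketch is only that the lookup has to keep accumulating "z", "z x", "z x c",
"z x c v" before giving up; if the port stops after the first token, inputs containing a space
can never match.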
    
    I also tried again at the 5 failing tests in the Compound namespace. I went over everything
line by line. Then I tried stepping through the code. However, I don't have a clue what the
code is supposed to do, only what the expected output is. In [this test](https://github.com/NightOwl888/lucenenet/blob/analysis-bugz/src/Lucene.Net.Tests.Analysis.Common/Analysis/Compound/TestCompoundWordTokenFilter.cs#L84),
the first output succeeds. The second output is expected to be "ba". The first token [comes
back as "b"](https://github.com/NightOwl888/lucenenet/blob/analysis-bugz/src/Lucene.Net.Analysis.Common/Analysis/Compound/hyphenation/HyphenationTree.cs#L414)
(is that right?), it then looks up [TernaryTree.Find()](https://github.com/NightOwl888/lucenenet/blob/analysis-bugz/src/Lucene.Net.Analysis.Common/Analysis/Compound/hyphenation/HyphenationTree.cs#L415)
and it maps to "a" (is that right?), it then puts it as the second letter of the word array
(that seems right..?). The next letter is "a"; it looks it up and comes back as "z" (is that right?), and it adds it as the 3rd element
in the array (now that can't be right, can it?). The next letters it looks up are "r" and
"j". The documentation is scarce. I really don't see any hope of solving this without running
side-by-side with the Java Lucene to see where the paths diverge. That said, the most likely
cause is the replacement of the SAX parser with XmlReader, which may be leaving the HyphenationTree
populated incorrectly. But it is difficult to know what "right" is, since there are
no tests on the HyphenationTree itself.
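
    As a starting point for such a test, here is a hedged sketch of what the character lookup
appears to do in the upstream Java code: each character group from the hyphenation file (for
example "aA") maps every member to the group's first character, so the lookup normalizes a word
to its class representatives. This reading is an assumption that should be checked against the
Java source. The Python below is an illustration, not the Lucene.Net API; `build_class_map` and
`normalize` are hypothetical names, and the handling of non-letter characters is simplified.

```python
def build_class_map(char_groups):
    """Map every character in a group to the group's first character.

    Assumption: this mirrors how the hyphenation class map is populated
    from groups such as "aA" or "bB".
    """
    class_map = {}
    for group in char_groups:
        if not group:
            continue
        representative = group[0]
        for ch in group:
            class_map[ch] = representative
    return class_map


def normalize(word, class_map):
    """Rewrite `word` using class representatives.

    Returns None if a character has no class (a simplification of how
    unknown characters are treated).
    """
    out = []
    for ch in word:
        representative = class_map.get(ch)
        if representative is None:
            return None
        out.append(representative)
    return "".join(out)


# Hypothetical groups covering the letters mentioned above.
class_map = build_class_map(["aA", "bB", "jJ", "rR", "zZ"])
normalized = normalize("Bar", class_map)
```

    If that reading holds, a lookup of "b" would be expected to return "b" rather than "a", which
would point at the class map being populated incorrectly during parsing. A small test that loads
the same hyphenation file in both the Java and .NET versions and compares these lookups character
by character would narrow that down.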


