lucenenet-dev mailing list archives

From "Shad Storhaug (Jira)" <j...@apache.org>
Subject [jira] [Commented] (LUCENENET-599) Fine-grained segmentation tools with vectorHighlight will cause bug
Date Thu, 30 Jan 2020 08:58:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENENET-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026520#comment-17026520 ]

Shad Storhaug commented on LUCENENET-599:
-----------------------------------------

Sorry for the late reply.

There are a couple of things that might be happening here:

# When using `VectorHighlight` with Chinese, you should probably use the `BreakIteratorBoundaryScanner`
from `Lucene.Net.ICU` and pass it an instance of `BreakIterator` created for the Chinese culture
(for example, {{BreakIterator.GetSentenceInstance(new CultureInfo("zh-Hans"))}}).
# 4.8.0-beta00005 was using culture-sensitive method calls in `Lucene.Net.Highlighter`. Many
of these calls have been changed to use the invariant culture in 4.8.0-beta00007.
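
To give a rough idea of what a sentence-level `BreakIterator` for a Chinese locale produces, here is a minimal sketch. It uses the JDK's `java.text.BreakIterator` as a stand-in, which is an assumption made only so the example is self-contained: Lucene.NET takes its `BreakIterator` from ICU, whose rules may differ. The class name `SentenceBoundaries` and the method `boundaries` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceBoundaries {

    /**
     * Returns every boundary offset that a sentence BreakIterator reports
     * for the given text, starting with 0 and ending with text.length().
     */
    public static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text from the issue; the locale stands in for "zh-Hans".
        String text = "主体内容来自并且自己加了点基本数据结构数组链表,双向链表";
        System.out.println(boundaries(text, Locale.SIMPLIFIED_CHINESE));
    }
}
```

A boundary scanner built on such an iterator would snap fragment edges to these offsets rather than to arbitrary character positions.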

I suggest trying both of the above with the latest version. Let us know whether this is still
a problem.

Alternatively, provide us with the code you were using so we can check it on our end.

> Fine-grained segmentation tools with vectorHighlight will cause bug
> -------------------------------------------------------------------
>
>                 Key: LUCENENET-599
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-599
>             Project: Lucene.Net
>          Issue Type: Improvement
>          Components: Lucene.Net Core, Lucene.Net.Highlighter
>    Affects Versions: Lucene.Net 4.8.0
>         Environment: System:
> Linux version 4.4.0-62-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) )
>  
> Lucene version: Lucene 4.8.0-beta00005
> Segmentation tool: Jieba
>            Reporter: ChenYongkang
>            Priority: Minor
>              Labels: HightLighter
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> The text to analyze:
> "主体内容来自并且自己加了点基本数据结构数组链表,双向链表"
> When I used fine-grained segmentation, it was tokenized into:
> "
> 主体/ 内容/ 来自/ 并且/ 自己/ 加/ 了/ 点/ 基本/ 数据/ 结构/ 数据结构/ 数组/ 链表/ ,/ 双向/ 链表
> "
> I searched with the query “数据,基本数据结构” and got the following error:
>  
> System.ArgumentOutOfRangeException: Index and length must refer to a location within the string.
> Parameter name: length
>    at System.String.Substring(Int32 startIndex, Int32 length)
>    at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.MakeFragment(StringBuilder buffer, Int32[] index, Field[] values, WeightedFragInfo fragInfo, String[] preTags, String[] postTags, IEncoder encoder) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 195
>    at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.CreateFragments(IndexReader reader, Int32 docId, String fieldName, FieldFragList fieldFragList, Int32 maxNumFragments, String[] preTags, String[] postTags, IEncoder encoder) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 146
>    at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.CreateFragments(IndexReader reader, Int32 docId, String fieldName, FieldFragList fieldFragList, Int32 maxNumFragments) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 99
>  
> The reason is the code in vectorHighlighter:
>  
> protected String makeFragment( StringBuilder buffer, int[] index, Field[] values, WeightedFragInfo fragInfo,
>     String[] preTags, String[] postTags, Encoder encoder ){
>   StringBuilder fragment = new StringBuilder();
>   final int s = fragInfo.getStartOffset();
>   int[] modifiedStartOffset = { s };
>   String src = getFragmentSourceMSO( buffer, index, values, s, fragInfo.getEndOffset(), modifiedStartOffset );
>   int srcIndex = 0;
>   for( SubInfo subInfo : fragInfo.getSubInfos() ){
>     for( Toffs to : subInfo.getTermsOffsets() ){
>       fragment
>         .append( encoder.encodeText( src.substring( srcIndex, to.getStartOffset() - modifiedStartOffset[0] ) ) )
>         .append( getPreTag( preTags, subInfo.getSeqnum() ) )
>         .append( encoder.encodeText( src.substring( to.getStartOffset() - modifiedStartOffset[0], to.getEndOffset() - modifiedStartOffset[0] ) ) )
>         .append( getPostTag( postTags, subInfo.getSeqnum() ) );
>       srcIndex = to.getEndOffset() - modifiedStartOffset[0];
>     }
>   }
>   fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
>   return fragment.toString();
> }
>  
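> When I searched with "基本数据结构" alone, it worked. The reason for the failure is as follows.
> Fine-grained segmentation splits "基本数据结构" into further sub-tokens. When we search for
> "数据,基本数据结构", the token "数据" is highlighted first because, in the segmentation above,
> "数据" comes before "基本数据结构". The position of "数据" in the text is (15,16). After "数据" is
> highlighted, srcIndex becomes the end position of "数据", which is 16, and the search for the next
> highlighted token starts from 16. The next token, "基本数据结构", is at position (13,18), so the
> fragment before that highlight is computed as src.substring(16,13), which is clearly invalid.
> The fast highlighter therefore relies on matched tokens appearing in the source text in
> sequential, non-overlapping order. Fine-grained segmentation breaks that assumption and causes
> the error. Search engines often use fine-grained segmentation, and fast highlighting at search
> time also improves speed, but the two do not combine well.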
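
The failure described above can be reproduced outside Lucene with a few lines of plain Java. The sketch below is not the Lucene or Lucene.NET implementation: `OverlapDemo`, `naiveHighlight`, and `mergeOverlaps` are hypothetical names, and the `<b>` tags are placeholders. It assumes end-exclusive offsets, so "数据" is (15,17) and "基本数据结构" is (13,19) rather than the inclusive positions quoted in the report. `naiveHighlight` follows the same loop shape as `makeFragment`, and `mergeOverlaps` shows one possible workaround, which is to sort the spans and merge any that overlap before building the fragment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OverlapDemo {

    /** Same loop shape as makeFragment: assumes spans are ascending and non-overlapping. */
    public static String naiveHighlight(String src, int[][] spans) {
        StringBuilder out = new StringBuilder();
        int srcIndex = 0;
        for (int[] span : spans) {
            // Throws StringIndexOutOfBoundsException when span[0] < srcIndex.
            out.append(src.substring(srcIndex, span[0]))
               .append("<b>").append(src.substring(span[0], span[1])).append("</b>");
            srcIndex = span[1];
        }
        return out.append(src.substring(srcIndex)).toString();
    }

    /** Sorts spans by start offset and merges overlapping or touching ones. */
    public static int[][] mergeOverlaps(int[][] spans) {
        int[][] sorted = spans.clone();
        Arrays.sort(sorted, (a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(a[1], b[1]));
        List<int[]> merged = new ArrayList<>();
        for (int[] span : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && span[0] <= last[1]) {
                last[1] = Math.max(last[1], span[1]); // extend the previous span
            } else {
                merged.add(new int[] { span[0], span[1] }); // copy, so input is untouched
            }
        }
        return merged.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        String text = "主体内容来自并且自己加了点基本数据结构数组链表,双向链表";
        int a = text.indexOf("数据");          // 15
        int b = text.indexOf("基本数据结构");  // 13
        int[][] spans = { { a, a + 2 }, { b, b + 6 } }; // order used in the report

        try {
            naiveHighlight(text, spans);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive loop failed: " + e.getMessage());
        }
        // Prints: 主体内容来自并且自己加了点<b>基本数据结构</b>数组链表,双向链表
        System.out.println(naiveHighlight(text, mergeOverlaps(spans)));
    }
}
```

Merging loses the distinction between the two query terms inside the overlapping region, so it trades per-term tags for a fragment that can always be built.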



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
