lucenenet-dev mailing list archives

From "Shad Storhaug (Jira)" <j...@apache.org>
Subject [jira] [Closed] (LUCENENET-599) Fine-grained segmentation tools with vectorHighlight will cause bug
Date Sat, 14 Mar 2020 11:59:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENENET-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shad Storhaug closed LUCENENET-599.
-----------------------------------
    Resolution: Invalid

This issue is being closed for lack of activity and the likelihood that this issue was due
to ICU being unfinished at the time of this report. However, feel free to reopen it if you
are still experiencing issues.

> Fine-grained segmentation tools with vectorHighlight will cause bug
> -------------------------------------------------------------------
>
>                 Key: LUCENENET-599
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-599
>             Project: Lucene.Net
>          Issue Type: Improvement
>          Components: Lucene.Net Core, Lucene.Net.Highlighter
>    Affects Versions: Lucene.Net 4.8.0
>         Environment: System:
> Linux version 4.4.0-62-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) )
>  
> Lucene Version :Lucene4.8.0-beta00005
> Participle tool:JIEba
>            Reporter: ChenYongkang
>            Priority: Minor
>              Labels: HightLighter
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> The text to analyze:
> "主体内容来自并且自己加了点基本数据结构数组链表,双向链表"
> When I used the fine-grained segmentation mode, it was tokenized into:
> "
> 主体/ 内容/ 来自/ 并且/ 自己/ 加/ 了/ 点/ 基本/ 数据/ 结构/ 数据结构/ 数组/ 链表/ ,/ 双向/ 链表
> "
> I searched with the query “数据,基本数据结构” and got the following error:
>  
> System.ArgumentOutOfRangeException: Index and length must refer to a location within the string.
> Parameter name: length
>    at System.String.Substring(Int32 startIndex, Int32 length)
>    at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.MakeFragment(StringBuilder buffer, Int32[] index, Field[] values, WeightedFragInfo fragInfo, String[] preTags, String[] postTags, IEncoder encoder) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 195
>    at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.CreateFragments(IndexReader reader, Int32 docId, String fieldName, FieldFragList fieldFragList, Int32 maxNumFragments, String[] preTags, String[] postTags, IEncoder encoder) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 146
>    at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.CreateFragments(IndexReader reader, Int32 docId, String fieldName, FieldFragList fieldFragList, Int32 maxNumFragments) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 99
>  
> The cause is the following code in the vector highlighter (shown here in its Java form):
>  
> protected String makeFragment( StringBuilder buffer, int[] index, Field[] values, WeightedFragInfo fragInfo,
>     String[] preTags, String[] postTags, Encoder encoder ){
>   StringBuilder fragment = new StringBuilder();
>   final int s = fragInfo.getStartOffset();
>   int[] modifiedStartOffset = { s };
>   String src = getFragmentSourceMSO( buffer, index, values, s, fragInfo.getEndOffset(), modifiedStartOffset );
>   // srcIndex tracks how much of src has already been copied into the fragment.
>   int srcIndex = 0;
>   for( SubInfo subInfo : fragInfo.getSubInfos() ){
>     for( Toffs to : subInfo.getTermsOffsets() ){
>       // Assumes each term starts at or after srcIndex; a term that starts
>       // earlier makes the first substring call below fail.
>       fragment
>         .append( encoder.encodeText( src.substring( srcIndex, to.getStartOffset() - modifiedStartOffset[0] ) ) )
>         .append( getPreTag( preTags, subInfo.getSeqnum() ) )
>         .append( encoder.encodeText( src.substring( to.getStartOffset() - modifiedStartOffset[0], to.getEndOffset() - modifiedStartOffset[0] ) ) )
>         .append( getPostTag( postTags, subInfo.getSeqnum() ) );
>       srcIndex = to.getEndOffset() - modifiedStartOffset[0];
>     }
>   }
>   // Append whatever text remains after the last highlighted term.
>   fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
>   return fragment.toString();
> }
>  
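> When I searched for "基本数据结构" alone, it worked fine. The explanation below is translated
> from the reporter's original Chinese.
>
> Fine-grained segmentation splits "基本数据结构" again into smaller tokens. When we search for
> "数据,基本数据结构", the token "数据" is highlighted first because, in the token sequence
> above, "数据" comes before "基本数据结构". "数据" occupies positions (15,16) in the text, so
> after it is highlighted srcIndex becomes the end position of "数据", that is, 16, and the
> search for the next highlighted token starts from 16. The next token, "基本数据结构", is at
> (13,18). The fragment before that highlight is therefore src.substring(16,13), which is
> clearly invalid.
>
> So the fast highlighter relies on tokens following one another in the original text without
> overlapping. Fine-grained segmentation breaks that assumption and causes the error. A search
> engine often needs fine-grained segmentation, and fast highlighting at search time would
> also improve speed, but the two do not combine well.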
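
The overlap described in the report can be reproduced outside Lucene with a small, self-contained Java sketch. Everything in it is hypothetical and for illustration only: the class OverlapHighlightSketch, its methods, and the literal <b> tags are not part of Lucene or Lucene.Net. The sketch uses end-exclusive offsets, so "数据" is [15,17) and "基本数据结构" is [13,19), which correspond to the inclusive positions (15,16) and (13,18) given in the report. Merging overlapping ranges before building the fragment is one possible workaround, not the fix adopted by either project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or Lucene.Net.
class OverlapHighlightSketch {

    // Mirrors the loop in the quoted makeFragment: it assumes every term
    // starts at or after the end of the previously highlighted term.
    static String naiveHighlight(String src, int[][] offsets) {
        StringBuilder out = new StringBuilder();
        int srcIndex = 0;
        for (int[] o : offsets) {
            out.append(src.substring(srcIndex, o[0])) // throws if o[0] < srcIndex
               .append("<b>").append(src.substring(o[0], o[1])).append("</b>");
            srcIndex = o[1];
        }
        return out.append(src.substring(srcIndex)).toString();
    }

    // Sorts the ranges by start offset and merges any that overlap or touch.
    static List<int[]> mergeOverlaps(int[][] offsets) {
        int[][] sorted = offsets.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[0], b[0]));
        List<int[]> merged = new ArrayList<>();
        for (int[] o : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && o[0] <= last[1]) {
                last[1] = Math.max(last[1], o[1]); // extend the previous range
            } else {
                merged.add(new int[] { o[0], o[1] }); // copy; inputs stay untouched
            }
        }
        return merged;
    }

    // Workaround sketch: merge first, then run the same sequential loop.
    static String mergedHighlight(String src, int[][] offsets) {
        return naiveHighlight(src, mergeOverlaps(offsets).toArray(new int[0][]));
    }

    public static void main(String[] args) {
        String text = "主体内容来自并且自己加了点基本数据结构数组链表,双向链表";
        int[][] offsets = { { 15, 17 }, { 13, 19 } }; // "数据", then "基本数据结构"
        try {
            naiveHighlight(text, offsets);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive loop failed: " + e.getMessage());
        }
        System.out.println(mergedHighlight(text, offsets));
    }
}
```

With the offsets from the report, the naive loop attempts substring(17, 13) on its second iteration and throws, which matches the failure in the stack trace. The merged variant collapses both ranges into [13,19) and highlights "基本数据结构" once.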



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
