Research on Text Snippet Mechanism in Document Retrieval (文档检索中文本片段化机制的研究) Cited by: 4
Abstract: Document retrieval is an active research topic in natural language processing. Compared with short texts, documents are information-rich and lengthy. In long-text retrieval, a query is usually relevant to only some of the sentences in a document, and a few highly similar segments can strongly interfere with matching, so the relevance score between a query and a document cannot simply be computed from word-level or string-level similarity. This paper proposes a text snippet mechanism (TSM) for document retrieval. TSM first divides each candidate document into snippets and then computes the relevance between the query and each snippet, using a matching scheme that takes semantic and term-frequency factors into account. It then selects the key snippets, derives the relevant snippet ratio, and combines this snippet information into a query-document relevance score, from which the Top-K document set is obtained. Experiments on the Glasgow IR test collection show that text matching with TSM improves information retrieval performance.
Authors: LI Yu (李宇), LIU Bo (刘波) (College of Information Science and Technology, Jinan University, Guangzhou 510632, China)
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), CSCD, Peking University Core Journal, 2020, No. 4, pp. 578-589 (12 pages)
Funding: Guangzhou Science and Technology Program, No. 201604010037.
Keywords: text snippet mechanism; document retrieval; correlation calculation; relevant snippet ratio; snippet integration score
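
The abstract describes a pipeline of splitting each document into snippets, scoring each snippet against the query, keeping the key snippets, deriving a relevant snippet ratio, and integrating these into a document score for Top-K ranking. The following is a minimal Python sketch of that flow, not the authors' implementation. The fixed-size word windows, the bag-of-words cosine similarity (a stand-in for the paper's semantic and term-frequency matching, which the abstract does not specify), the `threshold` value, and the choice to multiply the mean key-snippet score by the ratio are all illustrative assumptions, as are the function names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_snippets(tokens, size=8):
    """Assumption: snippets are non-overlapping windows of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine(a, b):
    """Term-frequency cosine similarity (stand-in for the paper's matcher)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_document(query, document, size=8, threshold=0.2):
    """Score a document from its key snippets and the relevant snippet ratio."""
    query_tokens = tokenize(query)
    snippets = split_into_snippets(tokenize(document), size)
    if not snippets:
        return 0.0
    scores = [cosine(query_tokens, s) for s in snippets]
    # Key snippets: those whose score reaches the (hypothetical) threshold.
    key_scores = [s for s in scores if s >= threshold]
    if not key_scores:
        return 0.0
    ratio = len(key_scores) / len(snippets)  # relevant snippet ratio
    # Hypothetical integration rule: mean key-snippet score weighted by ratio.
    return (sum(key_scores) / len(key_scores)) * ratio


def top_k(query, documents, k=2):
    """Return the names of the k highest-scoring documents."""
    ranked = sorted(
        documents,
        key=lambda name: score_document(query, documents[name]),
        reverse=True,
    )
    return ranked[:k]
```

For example, `top_k("text snippet retrieval", {"a": "...", "b": "..."}, k=1)` returns the name of the single best-matching document. Weighting by the ratio means a document with one strongly matching snippet among many unrelated ones scores lower than a document that matches throughout, which is one way to dampen the interference from isolated high-similarity segments that the abstract mentions.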

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部