一种两阶段的中文专利语义检索方法

Two-stage Semantic Retrieval Method for Chinese Patents
Abstract: Patent retrieval systems mainly rely on traditional term matching, whose limited semantic expansion leads to a low Top_N recall rate for semantically similar patents. To improve the Top_N recall rate of similar patents, this paper proposes a two-stage semantic retrieval method for Chinese patents. In the first stage, patents are semantically encoded with Sentence-BERT and then matched with an approximate nearest neighbor algorithm, which quickly finds semantically similar patents in a massive patent literature database. The second stage uses BERT as the base model in a Cross-Encoder to capture finer-grained semantic correlations between patent texts, and re-ranks the candidate patent set produced by the first stage. In addition, the paper proposes two simple and effective training optimization strategies, hard negative mining and whitening, which let the model transition gradually from simple to complex training data and improve its ability to distinguish similar patents. Experiments show that the proposed method improves the recall rate over mainstream methods and also has advantages over existing retrieval systems on the market.
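The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several parts are assumptions: embeddings are taken as given arrays rather than produced by Sentence-BERT, exact cosine search stands in for the approximate nearest neighbor index, `score_pair` is a hypothetical callable standing in for the BERT Cross-Encoder, whitening uses the common mean-and-covariance SVD formulation (the abstract does not give the exact variant), and hard negatives are assumed to be the top-ranked stage-1 candidates that are not known positives.

```python
import numpy as np


def fit_whitening(embeddings):
    """Estimate (mu, w) so that (x - mu) @ w has identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)          # d x d covariance matrix
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + 1e-12))  # small epsilon avoids division by zero
    return mu, w


def whiten(embeddings, mu, w):
    """Apply the whitening transform, then scale each row to unit length."""
    x = (embeddings - mu) @ w
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1 (assumption: exact search in place of ANN).

    With unit-length vectors the dot product equals cosine similarity.
    """
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:top_k].tolist()


def rerank(query, candidates, score_pair):
    """Stage 2: reorder candidates by a pairwise relevance score.

    score_pair(query, doc_id) is a hypothetical stand-in for a Cross-Encoder.
    """
    return sorted(candidates, key=lambda doc_id: score_pair(query, doc_id), reverse=True)


def mine_hard_negatives(candidates, positives, n):
    """Assumed rule: highest-ranked stage-1 candidates that are not positives."""
    return [c for c in candidates if c not in positives][:n]
```

In use, one would fit the whitening parameters on the document embeddings, call `retrieve` to get a candidate list, pass that list to `rerank` with a real pairwise scorer, and feed the output of `mine_hard_negatives` back into training as difficult negative examples.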
Authors: LU Xueqiang (吕学强); LIANG Hu (梁虎); ZHAO Ying (赵颖); YOU Xindong (游新冬) (Beijing Key Laboratory of Internet Culture Digital Dissemination, Beijing Information Science and Technology University, Beijing 100101, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2024, Issue 10, pp. 2378-2383 (6 pages)
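Funding: Supported by the National Natural Science Foundation of China (62171043), the Beijing Natural Science Foundation (4212020), the State Language Commission project (ZDI145-10, YB145-3), and the Scientific Research Program of the Beijing Municipal Education Commission (KM202111232001).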
Keywords: patent retrieval; semantic retrieval; hard negative mining; whitening