Abstract
[Objective] This paper proposes STNLTP, a Chinese patent abstract generation model based on an ensemble strategy, to mitigate the repetition and long-range dependency problems that existing automatic text summarization techniques exhibit on long documents. [Methods] We introduce a patent term dictionary and represent patent texts on Chinese medicinal materials ("中药材") with sememe-based word vectors built on the SAT model. Following the ensemble strategy, three extractive methods (TextRank, Lead4, and NMF) extract key sentences from the patent specification; the candidates are clustered and deduplicated to select the optimal key sentences. Finally, a pointer-generator network based on Transformer character vectors turns these optimal key sentences into the final abstract. [Results] STNLTP combines extractive and abstractive methods. Compared with the baseline RLCPAR, it improves ROUGE-1, ROUGE-2, and ROUGE-L by 2.00, 9.73, and 2.35 percentage points, respectively. [Limitations] Some generated abstracts contain common-sense errors. [Conclusions] STNLTP outperforms the baseline and can improve Chinese patent abstract generation.
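The extraction stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record gives no code, so every function name and parameter below is hypothetical. The sketch assumes character-bigram Jaccard overlap as the sentence similarity, pools candidates from only two of the three extractors (a Lead-k rule and a simplified TextRank), leaves NMF out to stay within the standard library, and replaces the clustering step with a greedy similarity threshold for deduplication.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Split text into sentences on Chinese sentence-ending punctuation."""
    parts = re.findall(r"[^。！？]+[。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def bigrams(sentence):
    """Set of character bigrams, ignoring whitespace."""
    chars = [c for c in sentence if not c.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}


def similarity(a, b):
    """Jaccard overlap of character bigrams (an assumed similarity measure)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def lead_k(sentences, k=4):
    """Lead-k extractor: indices of the first k sentences."""
    return list(range(min(k, len(sentences))))


def textrank(sentences, k=4, damping=0.85, iterations=50):
    """Simplified TextRank: weighted PageRank over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def ensemble_key_sentences(text, k=4, dedup_threshold=0.6):
    """Pool candidates from both extractors, then drop near-duplicates."""
    sentences = split_sentences(text)
    candidates = sorted(set(lead_k(sentences, k)) | set(textrank(sentences, k)))
    selected = []
    for idx in candidates:
        # Greedy stand-in for the clustering step: keep a sentence only if
        # it is not too similar to any sentence already kept.
        if all(
            similarity(sentences[idx], sentences[s]) < dedup_threshold
            for s in selected
        ):
            selected.append(idx)
    return [sentences[i] for i in selected]
```

Calling `ensemble_key_sentences(text)` on a patent specification returns the retained sentences in document order. In the pipeline the abstract describes, these sentences would then be passed to the pointer-generator network; that stage is not sketched here.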
Authors
Zhang Le (张乐)
Du Yifan (杜一凡)
Lü Xueqiang (吕学强)
Dong Zhian (董志安)
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
CSSCI
CSCD
PKU Core (北大核心)
2022, Issue 7, pp. 107-117 (11 pages)
Funding
This work is one of the research outcomes of a National Natural Science Foundation of China project (Grant No. 62171043).
Keywords
Patent Abstract (专利摘要)
Sememe (义原)
Word Vector (词向量)
Character Vector (字向量)
Pointer-Generator Network (指针生成网络)