A BLSTM-CRF-based Bootstrapping Terminology Recognition Approach Research

Abstract [Objective/Significance] Automatically identifying high-quality terms is a long-standing concern across many domains. A prominent obstacle is the lack of domain-annotated corpora, which limits the application of neural network models to domain term extraction. To address this, the paper proposes a BLSTM-CRF-based bootstrapping approach to domain term recognition. [Methods/Processes] First, a small number of seed terms are used to annotate the corpus, and a BLSTM-CRF model is trained to identify candidate terms. Next, selection criteria built on term quality features pick out the candidates that are both high-quality and new, and these are added to the annotation vocabulary for the next round of training. The corpus is then relabeled and the model retrained iteratively until the number of new terms falls below a threshold or a set number of iterations is reached. The paper also examines the efficiency of iterative training and how well the model generalizes to other domains, applying a model trained on a computer science corpus to recognize technical terms in the emerging domain of fusion publishing. [Limitations] The quantification of term quality features still needs to be optimized by combining multiple indicators, the model's learning mechanism does not introduce negative examples, and the iteration does not converge easily. [Results/Conclusions] Experiments on annotation quantity and on the contextual richness of annotations show that iterating with newly added annotation data is effective. Taking the model obtained after 50 rounds of iterative training as an example, the F1 scores for recognized terms and for all of their annotation sequences on the computer science test corpus are 0.43 and 0.59, and the new-term rate is 0.79, all better than the baseline BLSTM-CRF and BERT-BLSTM-CRF models. This confirms that the new method has a low start-up cost and good domain adaptability, and that it can effectively address the shortage of training corpora in term recognition. In the evaluation of model transfer, the average accuracy of term recognition judged on samples is 87.7%, which indicates the potential of the transfer learning approach.
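The iterative procedure described in the abstract can be sketched as a loop. The sketch below is illustrative only and is not the authors' implementation: `bio_label`, `bootstrap`, `toy_train_and_extract`, the length-based quality filter, and the defaults `min_new` and `max_rounds` are all hypothetical names and values, and a trivial left-context matcher stands in for the BLSTM-CRF model so that the example runs without a deep learning library.

```python
from typing import Callable, List, Set, Tuple

Sentence = List[str]
Term = Tuple[str, ...]
Labeled = List[Tuple[Sentence, List[str]]]


def bio_label(sentence: Sentence, terms: Set[Term]) -> List[str]:
    """Tag a sentence with B/I/O labels by longest-match lookup of known terms."""
    tags = ["O"] * len(sentence)
    max_len = max((len(t) for t in terms), default=0)
    i = 0
    while i < len(sentence):
        matched = 0
        # Try the longest possible span first.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if tuple(sentence[i:i + n]) in terms:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def bootstrap(
    corpus: List[Sentence],
    seeds: Set[Term],
    train_and_extract: Callable[[Labeled], Set[Term]],
    is_high_quality: Callable[[Term], bool],
    min_new: int = 1,      # assumed stopping threshold on new terms
    max_rounds: int = 50,  # assumed cap on iteration rounds
) -> Set[Term]:
    """Relabel, retrain, filter, and grow the term set until a stop condition holds."""
    terms = set(seeds)
    for _ in range(max_rounds):
        labeled = [(s, bio_label(s, terms)) for s in corpus]
        candidates = train_and_extract(labeled)
        new_terms = {c for c in candidates if c not in terms and is_high_quality(c)}
        if len(new_terms) < min_new:
            break
        terms |= new_terms
    return terms


def toy_train_and_extract(labeled: Labeled) -> Set[Term]:
    """Stand-in for the BLSTM-CRF step: learn the words that precede labeled
    terms, then propose every single token that follows one of those words."""
    left_contexts = {
        sent[i - 1]
        for sent, tags in labeled
        for i, tag in enumerate(tags)
        if tag == "B" and i > 0
    }
    return {
        (sent[i],)
        for sent, _ in labeled
        for i in range(1, len(sent))
        if sent[i - 1] in left_contexts
    }


corpus = [
    "we use the compiler daily".split(),
    "we use the debugger often".split(),
    "they call the debugger and the profiler".split(),
    "this is the end".split(),
]
result = bootstrap(
    corpus,
    seeds={("compiler",)},
    train_and_extract=toy_train_and_extract,
    # Hypothetical quality criterion: every word has at least four characters.
    is_high_quality=lambda term: all(len(word) >= 4 for word in term),
)
```

On this toy corpus the loop starts from the single seed `compiler`, picks up `debugger` and `profiler` in the first round, rejects `end` through the quality filter, and stops in the second round because no new terms appear. In a real setting, `train_and_extract` would train a sequence labeler on the relabeled corpus and decode candidate spans, and `is_high_quality` would encode the term quality features the paper refers to.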

Authors CHEN Chong; GAO Xinyan; HUANG Hong (School of Government, Beijing Normal University, Beijing 100087, China; Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038, China)

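Affiliations School of Government, Beijing Normal University; Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content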

Source Technology Intelligence Engineering (《情报工程》), 2023, No. 5, pp. 97-111 (15 pages)

Funding 2022 Open Fund Project of the Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content.

Keywords term extraction; bootstrapping; BLSTM-CRF model; performance evaluation of term recognition; term quality criteria

Classification G35 [Culture and Science: Information Science]; TP391 [Automation and Computer Technology: Computer Application Technology]