Abstract
[Objective] This paper studies the automatic recognition of four entity types in the produce records of ancient local chronicles (produce aliases, persons, places of origin, and cited books), in support of building a knowledge base of local chronicle produce. [Methods] We took the Yunnan volume of Local Chronicle: Produce (《方志物产》), held in an institutional special collection, as the base corpus. After text preprocessing and corpus annotation, we ran experiments with four deep learning models (Bi-RNN, Bi-LSTM, Bi-LSTM-CRF, and BERT) and compared their results. [Results] Compared with the Bi-LSTM model, the Bi-LSTM-CRF model improved precision (P) by 5.54% and the F-value by 3.51%. The BERT model reached a recall (R) of 83.36%, the best among all models. The Bi-LSTM-CRF model performed best on cited-book entities (F-value = 89.71%), and the BERT model performed best on person entities (F-value = 87.90%). [Limitations] Because of the linguistic characteristics of ancient local chronicle texts, and because identifying the relevant entities requires domain knowledge, the manual annotation may contain missed or incorrect labels, which kept the models from reaching their best performance. [Conclusions] The study shows that deep learning methods are feasible and advantageous for entity recognition in ancient local chronicle texts.
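The P, R, and F values reported in the abstract are the usual precision, recall, and F-measure for entity recognition. The record does not say how the authors computed them, so the following is only a minimal sketch under stated assumptions: a BIO tagging scheme, exact-span matching, and hypothetical label names (`BOOK`, `PER`) chosen for illustration.

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence (assumed scheme)."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not inside:
            spans.add((start, i, kind))  # end index is exclusive
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 with exact span and type matching."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical labels: BOOK = cited book, PER = person.
gold = ["B-BOOK", "I-BOOK", "O", "B-PER", "I-PER", "O"]
pred = ["B-BOOK", "I-BOOK", "O", "B-PER", "O", "O"]
print(prf(gold, pred))
```

In this toy pair the predicted person span is one character too short, so it does not count as correct under exact matching; only the cited-book entity matches.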
Authors
Xu Chenfei (徐晨飞)
Ye Haiying (叶海影)
Bao Ping (包平)
Affiliations: Institution of Chinese Agricultural Civilization, Nanjing Agricultural University, Nanjing 210095, China; Economics and Management School, Nantong University, Nantong 226019, China
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
CSSCI
CSCD
PKU Core Journals (北大核心)
2020, Issue 8, pp. 86-97 (12 pages)
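Funding
This work is one of the research outcomes of the Major Program of the National Social Science Fund of China, "Research on the Construction and In-Depth Utilization of a Local Chronicle Produce Knowledge Base" (Grant No. 18ZDA327),
and of the Youth Fund Project of Humanities and Social Sciences Research, Ministry of Education, "Empirical Research on Semantics-Based Knowledge Organization and Knowledge Aggregation of Local Chronicle Produce Materials" (Grant No. 19YJC870027).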
Keywords
Deep Learning
Local Chronicle: Produce
Named Entity Recognition
Model Construction
Digital Humanities