Abstract
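Objective: Named entity recognition is one of the most basic tasks in natural language processing. This paper applies deep representation methods to automatically annotate clinical history-of-present-illness data. Methods: We randomly selected 10,426 history-of-present-illness sentences as the main text corpus. Two vector-construction methods, word embedding (word2vec) and network-structure features (node2vec), were used to generate different word-vector features. Ten-fold cross-validation was then performed with methods based on the conditional random field (CRF) and the structured support vector machine (SSVM), and the performance of deep representation-based named entity extraction for symptom phenotypes was computed and compared. Results: The three evaluation metrics (precision, recall, F-score) of the traditional CRF algorithm were (0.8889, 0.7869, 0.8348). Under the WENER method, CRF and SSVM scored (0.9750, 0.9849, 0.9798) and (0.9928, 0.9889, 0.9908). Under the GENER method, word-based CRF and SSVM scored (0.9728, 0.9768, 0.9752) and (0.9833, 0.9745, 0.9788), while character-based CRF and SSVM scored (0.9278, 0.8628, 0.8879) and (0.9437, 0.9468, 0.9413). Conclusion: Deep representation-based named entity extraction algorithms outperform the traditional named entity annotation algorithm that does not use deep representation. In addition, a comparison of the two algorithms under deep representation shows that SSVM outperforms CRF whether the word vectors are generated by word2vec or by node2vec.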
Named entity recognition is one of the most basic tasks in natural language processing. In this paper, a deep representation-based method is applied to the automatic annotation of clinical data. First, 10,426 sentences from histories of present illness were randomly selected as the text training data. Then word2vec-based and node2vec-based deep representation methods were used to construct low-dimensional word embeddings. Based on the word vectors of symptoms, we applied the conditional random field (CRF) and the structured support vector machine (SSVM) to extract symptom named entities. Finally, the performance of the different named entity extraction algorithms for TCM symptom phenotypes was compared using 10-fold cross-validation. Three evaluation metrics were considered: precision (P), recall (R) and F1-score (F1). Compared with the classic CRF algorithm (P: 0.8889; R: 0.7869; F1: 0.8348), every deep representation-based variant scored higher on all three metrics: WENER-based CRF (P: 0.9750; R: 0.9849; F1: 0.9798), WENER-based SSVM (P: 0.9928; R: 0.9889; F1: 0.9908), word-based CRF under GENER (P: 0.9728; R: 0.9768; F1: 0.9752), word-based SSVM under GENER (P: 0.9833; R: 0.9745; F1: 0.9788), character-based CRF under GENER (P: 0.9278; R: 0.8628; F1: 0.8879), and character-based SSVM under GENER (P: 0.9437; R: 0.9468; F1: 0.9413). In conclusion, the deep representation-based named entity extraction method for symptom phenotypes performs better than the classic CRF algorithm. For both word2vec-based and node2vec-based vector representations, the SSVM algorithm performs better than the CRF algorithm.
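The three metrics named in the abstract (precision, recall and F1) can be illustrated with a minimal sketch. The abstract does not say whether entities were matched by exact span or by token, so exact-span matching is assumed here. The `Span` representation, the `prf1` helper and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start offset, end offset, label).
Span = Tuple[int, int, str]


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-span matching (an assumption)."""
    true_positives = len(gold & pred)  # predicted spans that exactly match a gold span
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: 3 gold symptom entities, 4 predicted, 2 of them correct.
gold = {(0, 2, "SYMPTOM"), (5, 7, "SYMPTOM"), (10, 12, "SYMPTOM")}
pred = {(0, 2, "SYMPTOM"), (5, 7, "SYMPTOM"), (8, 9, "SYMPTOM"), (13, 14, "SYMPTOM")}

p, r, f = prf1(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")
```

In a 10-fold cross-validation setting such as the one described, these values would typically be computed per fold and then averaged, in which case the reported F1 need not equal the harmonic mean of the averaged P and R exactly.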
Authors
原旎
卢克治
袁玉虎
舒梓心
杨扩
张润顺
李晓东
周雪忠
Yuan Ni; Lu Kezhi; Yuan Yuhu; Shu Zixin; Yang Kuo; Zhang Runshun; Li Xiaodong; Zhou Xuezhong (College of Computer Science and Information Technology, Beijing Jiaotong University, Beijing 100044, China; Hubei Hospital of Traditional Chinese Medicine, Wuhan 430061, China; Guang'anmen Hospital, Chinese Academy of Chinese Medical Sciences, Beijing 100053, China)
Source
《世界科学技术-中医药现代化》
CSCD
Peking University Core Journal
2018, Issue 3, pp. 355-362 (8 pages)
Modernization of Traditional Chinese Medicine and Materia Medica-World Science and Technology
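Funding
State Administration of Traditional Chinese Medicine, 2015 National TCM Clinical Research Base Operational Construction, Second Batch of Special Research Projects (JDZX2015171): Research on methods for extracting and analyzing phenotype information from retrospective liver disease cases
Principal investigator: Zhou Xuezhong
National Key R&D Program of China, Ministry of Science and Technology (2017YFC1703506): Research and innovative application of big data mining for traditional Chinese medicine
Principal investigator: Yu Jian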
Keywords
Conditional random field
Structured support vector machine
Named entity recognition (named entity extraction)
Deep representation
Traditional Chinese medical records