Abstract
Named entity recognition (NER) for traditional Chinese medicine (TCM) classics is the basis for constructing TCM knowledge graphs and is important for extracting and intelligently presenting TCM knowledge. However, the TCM knowledge system is large, and publicly available corpora are scarce and semantically complex. Most current research focuses on character vector representations and does not fully consider the rich semantics carried by the structural features of special Chinese characters. Moreover, because Chinese characters are semantically rich, latent features are insufficiently represented and polysemy remains a problem. Combining the corpus characteristics of TCM classics with the structural information of ancient Chinese characters, this paper proposes an NER method based on SiKuBERT and multivariate data embedding. The method uses SiKuBERT to create character feature information and, on this basis, embeds word features and radical features to capture the semantic information of Chinese characters, so that characters with similar radical sequences lie close to each other in the vector space. The method is used to recognize the names of people, Chinese herbal medicines, diseases, pathologies, and meridians in the Materia Medica dataset. The experimental results show that the proposed method effectively extracts the five entity types from text, with an F1 score of 86.66%, a precision of 86.95%, and a recall of 86.37%. Compared with the character-feature-based SiKuBERT-CRF model, the proposed method integrates character and word information with the structural information of traditional Chinese characters, which improves entity recognition and raises the overall F1 score by 2.83 percentage points. In addition, the method performs best on Chinese herbal medicine names and disease names, which have pronounced radical features; relative to the character-feature-based SiKuBERT-CRF model, their F1 scores improve by 3.48 and 0.97 percentage points, respectively. Overall, the performance metrics of the proposed method are higher than those of other mainstream deep learning models, and the method generalizes well.
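The abstract does not say how the character, word, and radical features are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the three feature views are simply concatenated per token, and it uses a tiny hand-written component table (`COMPONENTS`, hypothetical) in place of a real radical dictionary, with plain lists standing in for SiKuBERT and word embeddings. It illustrates why characters that share a radical end up closer in the radical feature space, and it also checks that the reported F1 score is consistent with the reported precision and recall.

```python
import math

# Hypothetical, hand-written component table for illustration only;
# a real system would use a full radical/decomposition dictionary.
COMPONENTS = {
    "痛": ["疒", "甬"],  # disease-related radical 疒
    "疾": ["疒", "矢"],
    "草": ["艹", "早"],  # plant-related radical 艹
    "芩": ["艹", "今"],
}

# Fixed ordering of all components seen in the table.
VOCAB = sorted({c for parts in COMPONENTS.values() for c in parts})


def radical_vector(ch):
    """Bag-of-components vector over VOCAB for a single character."""
    parts = COMPONENTS.get(ch, [])
    return [float(parts.count(c)) for c in VOCAB]


def fuse(char_vec, word_vec, radical_vec):
    """Assumed fusion: concatenate the three feature views for one token."""
    return list(char_vec) + list(word_vec) + list(radical_vec)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


same_radical = cosine(radical_vector("痛"), radical_vector("疾"))
diff_radical = cosine(radical_vector("痛"), radical_vector("草"))
reported_f1 = round(f1_score(86.95, 86.37), 2)
```

In this toy setup, 痛 and 疾 share the radical 疒 and have a similarity of about 0.5, while 痛 and 草 share no component and have a similarity of 0. The harmonic mean of the reported precision (86.95%) and recall (86.37%) rounds to 86.66%, matching the F1 score given in the abstract.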
Authors
张文东
吴子炜
宋国昌
霍庆澳
王博
ZHANG Wendong; WU Ziwei; SONG Guochang; HUO Qingao; WANG Bo (College of Software, Xinjiang University, Urumqi 830008, Xinjiang, China)
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2024, No. 6, pp. 128-137 (10 pages)
Journal of South China University of Technology (Natural Science Edition)
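Funding
Natural Science Foundation of Xinjiang Uygur Autonomous Region (2020D01C33)
Key Research and Development Task Special Project of Xinjiang Uygur Autonomous Region (2021B01002)
Doctoral Research Start-up Fund of Xinjiang University (202112120001)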
Keywords
traditional Chinese medicine classics
named entity recognition
Compendium of Materia Medica
SiKuBERT
multivariate data embedding