Abstract
Knowledge graph technology has advanced new drug research and development, but research in China started late and domain knowledge is mostly stored as text, so knowledge graphs are rarely reused. Based on multi-source, heterogeneous medical texts, this study designed a Chinese named entity recognition model that builds on the Bert-wwm-ext pre-trained model and incorporates a cascade strategy, which reduces the complexity of traditional single-pass classification and further improves the efficiency of text recognition. Experimental results showed that the model achieved an F1-score of 0.903, a precision of 89.2%, and a recall of 91.5% on the self-built training corpus. The model was also applied to the public dataset CCKS2019, where the results showed that it performed better at recognizing medical entities in Chinese text. Finally, the model was used to construct a Chinese medical knowledge graph containing 13530 entities, 10939 attributes, and 39247 relationships. The Chinese medical entity recognition and graph construction method proposed in this study is expected to help researchers accelerate the discovery of new medical knowledge and thereby shorten the new drug development process.
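The three reported scores can be cross-checked against one another. The following is a minimal sketch, assuming the F1-score is the standard harmonic mean of the reported precision and recall; the abstract does not state which averaging scheme was used, and the function name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


precision = 0.892  # precision reported in the abstract
recall = 0.915     # recall reported in the abstract

f1 = f1_score(precision, recall)
print(round(f1, 3))  # prints 0.903
```

Under that assumption, the computed value agrees with the reported F1-score of 0.903.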
Authors
杨晔
裴雷
侯凤贞
YANG Ye; PEI Lei; HOU Fengzhen (Institute of Medical Big Data and Artificial Intelligence, School of Science, China Pharmaceutical University, Nanjing 211198, China)
Source
《中国药科大学学报》
CAS
CSCD
PKU Core (Peking University Core Journals)
2023, Issue 3, pp. 363-371 (9 pages)
Journal of China Pharmaceutical University