Abstract
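A knowledge graph of landscape plants can provide knowledge support for selecting greening tree species with regard to factors such as regional adaptability, ornamental value, and ecological function. Entity recognition and relation extraction from plant description text are key steps in building such a knowledge graph. Because no publicly available annotated dataset exists for the plant domain, this paper describes the workflow for constructing a landscape plant dataset, defines a conceptual schema for landscape plants, and builds a landscape plant corpus. Existing language models such as Word2vec, ELMo, and BERT have shortcomings including an inability to handle polysemous words, weak use of context, and slow running speed. To address these, the paper proposes entity recognition and relation extraction models that embed the ALBERT (A Lite BERT) pre-trained language model. The dynamic word vectors pre-trained by ALBERT represent text features effectively; they are fed into a BiGRU-CRF named entity recognition model and a BiGRU-Attention relation extraction model for training, which further improves entity recognition and relation extraction. The method was validated on the landscape plant corpus: the ALBERT-BiGRU-CRF named entity recognition model achieved an F1 score of 0.9517 and the ALBERT-BiGRU-Attention relation extraction model an F1 score of 0.9161, a fairly significant improvement over classic language models such as Word2vec, ELMo, and BERT. Entity and relation extraction based on ALBERT can therefore effectively improve recognition and classification performance, can be applied to entity and relation extraction from plant description text, and offers a method for automatically constructing a landscape plant knowledge graph.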
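The F1 scores reported for the named entity recognition model are the harmonic mean of precision and recall. The record does not state the tagging scheme or the matching criterion, so the following is only a minimal sketch, assuming BIO tags and exact-match, entity-level scoring. The function names `bio_to_spans` and `entity_f1` and the labels `PLANT` and `COLOR` are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); an assumed representation
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes an open span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Exact-match entity-level F1 over parallel lists of tag sequences."""
    gold_spans, pred_spans = set(), set()
    for k, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_spans |= {(k, *s) for s in bio_to_spans(g)}
        pred_spans |= {(k, *s) for s in bio_to_spans(p)}
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two gold entities, one of them predicted correctly.
gold = [["B-PLANT", "I-PLANT", "O", "B-COLOR"]]
pred = [["B-PLANT", "I-PLANT", "O", "O"]]
score = entity_f1(gold, pred)
```

In this toy case precision is 1.0 and recall is 0.5. Token-level or partial-match scoring would give different numbers, so the choice of criterion matters when comparing against published figures.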
A knowledge graph of landscape plants can support the selection of greening tree species with regard to regional adaptability, ornamental value, and ecological factors. Entity and relation extraction from plant description text is a key step in constructing such a knowledge graph. Until now, there has been no publicly available annotated dataset for the plant domain. In this paper, a conceptual architecture of landscape plants was defined and briefly described, and a landscape plant corpus was constructed. Existing language models such as Word2vec, ELMo, and BERT have various disadvantages, e.g., they cannot resolve polysemous words, fuse context poorly, and are computationally inefficient. We propose a named entity recognition model, ALBERT-BiGRU-CRF, and a relation extraction model, ALBERT-BiGRU-Attention, both of which embed the ALBERT (A Lite Bidirectional Encoder Representation from Transformers) pre-trained language model. In the ALBERT-BiGRU-CRF model, ALBERT extracts text features, the BiGRU learns deep semantic features between sentences, and the CRF computes the probability distribution over annotation sequences to determine the entities contained in the description text. The ALBERT-BiGRU-Attention model builds on the results of the named entity recognition model. Similarly, the attention mechanism increases the weight of keywords to determine the relationship between entities. The proposed models have the following advantages: (1) the method can effectively identify and extract entities and relationships from landscape plant knowledge; (2) the models can represent the semantic and sentence characteristics of characters with good accuracy. The validity of the method was verified on the landscape plant corpus constructed in this paper and compared with other models. Our quantitative evaluation shows that: (1) the F1 score of the ALBERT-BiGRU-CRF model was 0.9517, indicating that
Authors
CHEN Xiaoling (陈晓玲)
TANG Liyu (唐丽玉)
HU Ying (胡颖)
JIANG Feng (江锋)
PENG Wei (彭巍)
FENG Xianchao (冯先超)
Key Laboratory of Spatial Data Mining & Information Sharing of Ministry of Education, Fuzhou University, Fuzhou 350108, China; National Engineering Research Center of Geospatial Information Technology, Fuzhou University, Fuzhou 350108, China
Source
Journal of Geo-information Science (《地球信息科学学报》)
CSCD
Peking University Core Journals (北大核心)
2021, Issue 7, pp. 1208-1220 (13 pages)
Funding
National Natural Science Foundation of China (41971344).
Keywords
knowledge graph
information extraction
landscape plant corpus
landscape plant
ALBERT
word vectors
entity recognition
relation extraction