
A Study on the Stability of Semantic Representation of Entities in the Technology Domain: Comparison of Multiple Word Embedding Models
Abstract  Lexical semantic analysis is crucial in the field of science and technology literature intelligence analysis. Distributed word embedding techniques (e.g., fastText, GloVe, and Word2Vec), which can effectively represent lexical semantics and conveniently characterize the semantic similarity of words, have recently become the mainstream foundation for semantic analysis of scientific and technical vocabulary. The use of word embedding techniques for lexical semantic analysis depends heavily on computing the nearest semantic neighbors of words from their word vectors. However, because word embedding models are randomly initialized, the nearest semantic neighbors produced by repeated training on exactly the same data are not identical, and these randomly perturbed neighbors introduce spurious information that undermines the reliability and reproducibility of downstream semantic analysis tasks. To minimize the impact of random initialization, enhance reproducibility, and obtain more reliable semantic analysis results, this study comprehensively examined the influence of domain dataset size, model type, training algorithm, keyword frequency, vector dimension, and context window size, and designed a stability evaluation metric based on semantic-field overlap together with a corresponding experimental scheme. Using Microsoft Academic Graph (MAG) paper corpora from four distinct fields (artificial intelligence, immunology, monetary policy, and quantum entanglement), we trained multiple word embedding models (Word2Vec, GloVe, and fastText), generated word-vector representations for the papers' keywords, and computed the evaluation metric to quantify the stability of the semantic representations. The results in all four domains show that, within a certain range, the larger the dataset, the more stable the semantic representation, with GloVe as the exception; the models differ in stability depending on corpus size, the frequency of the keywords being analyzed, and word-form (character-level) similarity; and a vector dimension of 300 with a context window of 5 is a suitable choice. Finally, the paper recommends word embedding models and techniques for various combinations of these factors, providing quantitative evidence and guidance for subsequent research on semantic analysis of scientific and technical vocabulary.
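To make the stability measurement described in the abstract concrete, the sketch below trains the same Word2Vec model twice on an identical tokenized corpus with different random seeds and computes the average overlap of each keyword's top-k nearest neighbors (its "semantic field") between the two runs. This is an illustrative approximation only: the paper does not publish code, and the use of gensim, the choice of k, the seed values, and the helper name semantic_field_overlap are assumptions, not the authors' implementation.

from gensim.models import Word2Vec


def semantic_field_overlap(corpus, keywords, k=10, vector_size=300, window=5):
    """Average top-k nearest-neighbor overlap of keywords across two training runs.

    corpus: list of tokenized sentences (list[list[str]]); keywords: iterable of str.
    """
    models = [
        Word2Vec(
            sentences=corpus,
            vector_size=vector_size,  # the abstract reports 300 as a suitable dimension
            window=window,            # and 5 as a suitable context window size
            min_count=5,
            sg=1,                     # skip-gram; CBOW (sg=0) is the alternative training algorithm
            seed=seed,                # only the random seed differs between the two runs
            workers=1,                # a single worker thread reduces extra nondeterminism
        )
        for seed in (1, 2)
    ]
    overlaps = []
    for word in keywords:
        if word not in models[0].wv or word not in models[1].wv:
            continue  # keyword too rare (below min_count) in at least one run
        fields = [{w for w, _ in m.wv.most_similar(word, topn=k)} for m in models]
        overlaps.append(len(fields[0] & fields[1]) / k)
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

Repeating this measurement while varying the corpus size, model family (GloVe, fastText), training algorithm, keyword frequency band, vector dimension, or window size would reproduce the kind of factor-by-factor comparison the abstract reports.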
Authors  Chen Guo; Xu Zan; Hong Siqi; Wu Jiahuan; Xiao Lu (School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094; School of Journalism, Nanjing University of Finance & Economics, Nanjing 210023)
Source  Journal of the China Society for Scientific and Technical Information (《情报学报》; CSSCI, CSCD, Peking University Core Journal), 2024, No. 12, pp. 1440-1452 (13 pages)
Funding  Young Scientists Fund of the National Natural Science Foundation of China, "Research on Identifying and Collaboratively Correcting Distortion in Self-Media Policy Information Dissemination Based on Semantic Analysis" (72404121); Jiangsu Provincial Social Science Fund project, "Constructing a Methodological System for Science and Technology Intelligence Analysis on Incomplete Literature Resources" (24TQB001).
Keywords  science and technology intelligence analysis; domain knowledge analysis; lexical semantics; semantic representation stability; word embedding models