Abstract
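[Objective] To resolve author name ambiguity for Chinese papers in literature resource management systems. [Methods] Starting from bibliographic data, we construct scholar entities identified by "author name + institution name" and use their attributes to build six similarity features covering three aspects. We fuse the features by principal component analysis (PCA), by direct weight assignment, and by a combination of the two, then examine the disambiguation ability of each fusion method and the effect of each feature. [Results] Two fusion methods effectively reduce time overhead: PCA combined with weights assigned per individual feature, and weights assigned per aspect. Their F1 scores reach 70.74% and 70.42%, respectively, on the LIS test set and 81.90% and 80.93% on the economics test set. [Limitations] The features used are limited and all come from the papers' metadata; no external information is used and the text content is not mined. [Conclusions] The proposed feature fusion methods effectively address the problem of setting weights when fusing multiple features.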
[Objective] This paper addresses the problems that Chinese authors with identical names cause for document management systems. [Methods] First, we built author entities identified by "author name + institution name" from bibliographic data. Second, we used the attributes of these entities to construct six similarity features covering three aspects. Third, we merged the features by principal component analysis, direct weight assignment, or a combination of the two. Finally, we evaluated the disambiguation performance of each fusion method and each feature. [Results] Our methods significantly reduced processing time. Their F1 values were 70.74% and 70.42% on the LIS dataset and 81.90% and 80.93% on the economics dataset. [Limitations] The features used in this research were drawn only from the metadata of the papers. [Conclusions] The proposed method can effectively address weight setting when fusing multiple features.
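The two basic fusion strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature values, the equal weights, the 0.5 threshold, and the choice of deriving weights from the absolute loadings of the first principal component are all assumptions made for illustration. This record does not specify how the paper combines PCA with assigned weights, so that variant is left out.

```python
import numpy as np

def pca_weights(features):
    """Derive one weight per feature from the first principal component.

    Assumption: absolute loadings of the first component serve as weights;
    the paper's actual PCA-based fusion may differ.
    """
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = np.abs(vt[0])  # absolute value removes the arbitrary sign
    return loadings / loadings.sum()

def fuse(features, weights):
    """Fuse similarity features into one score by a normalized weighted sum."""
    X = np.asarray(features, dtype=float)
    w = np.asarray(weights, dtype=float)
    return X @ (w / w.sum())

# Hypothetical data: 5 pairs of scholar entities x 6 similarity features in [0, 1].
pairs = np.array([
    [0.9, 0.8, 0.7, 0.9, 0.6, 0.8],
    [0.1, 0.2, 0.0, 0.1, 0.3, 0.2],
    [0.8, 0.9, 0.6, 0.7, 0.7, 0.9],
    [0.2, 0.1, 0.1, 0.0, 0.2, 0.1],
    [0.5, 0.4, 0.6, 0.5, 0.5, 0.4],
])

pca_w = pca_weights(pairs)
pca_score = fuse(pairs, pca_w)              # PCA-derived weights
direct_score = fuse(pairs, np.ones(6))      # direct assignment: equal weights
same_author = direct_score >= 0.5           # hypothetical decision threshold
```

Because both weight vectors are normalized to sum to 1, each fused score stays in the same [0, 1] range as the input similarities, which keeps a single threshold meaningful across fusion methods.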
Authors
林克柔
王昊
龚丽娟
张宝隆
Lin Kerou; Wang Hao; Gong Lijuan; Zhang Baolong (School of Information Management, Nanjing University, Nanjing 210023, China; Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
PKU Core (Peking University Core Journals)
2021, Issue 4, pp. 90-102 (13 pages)
Data Analysis and Knowledge Discovery
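Funding
Jiangsu Province "Six Talent Peaks" High-Level Talent Project (Grant No. JY-001)
One of the research outputs of the Jiangsu Young Social Science Talents program and the Nanjing University Zhongying Young Scholars program.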
Keywords
Feature Fusion (特征融合)
Author Name Disambiguation (同名消歧)
PCA (主成分分析)
Chinese Papers (中文论文)