Abstract
[Purpose/Significance] To find the dataset size best suited to co-authorship prediction and to compare co-authorship prediction indicators fairly, this paper compares and analyzes how overall prediction accuracy and the optimal indicator change across datasets of different sizes. [Method/Process] Twelve indicators, Common Neighbors (CN) and its improved variants, are selected as representative co-authorship prediction indicators. Link prediction theory and methods are used to compute the prediction accuracy of each indicator on co-authorship network datasets of different sizes and to identify the optimal indicator at each size, revealing how data size influences co-authorship prediction and why. [Result/Conclusion] In the field of Library and Information Science, co-authorship network datasets of different sizes are formed by filtering on author occurrence frequency. The experiments show that the larger the dataset, the higher the overall accuracy of co-authorship prediction, with a large gain in accuracy on the full dataset; the complete co-authorship network without any filtering is therefore the best dataset for co-authorship prediction. The optimal indicator also changes from one dataset to another, which confirms that indicators have a data-size preference and that a fair, sound comparison of indicators must be carried out on several datasets of different sizes. The reason is that as data size grows, the co-authorship network dataset comes closer to the real situation, so the advantages of the improved indicators are fully realized. The method can be extended to other fields to validate these conclusions.
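The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy paper lists, and the `min_freq` threshold are hypothetical, only the basic Common Neighbors (CN) indicator is shown (the abstract does not name the eleven improved variants), and accuracy is assumed to be measured as precision over the top L ranked author pairs, a common choice in link prediction.

```python
from collections import Counter
from itertools import combinations


def build_network(papers, min_freq=1):
    """Return co-authorship edges among authors who appear in at least
    min_freq papers. Each edge is a sorted (author, author) tuple.
    min_freq=1 keeps the full, unfiltered network."""
    freq = Counter(a for authors in papers for a in set(authors))
    edges = set()
    for authors in papers:
        kept = sorted(a for a in set(authors) if freq[a] >= min_freq)
        edges.update(combinations(kept, 2))
    return edges


def adjacency(edges):
    """Map each author to the set of their co-authors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def cn_scores(train_edges):
    """Common Neighbors score for every author pair not yet linked."""
    adj = adjacency(train_edges)
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores


def precision_at_l(scores, test_edges, L):
    """Fraction of the L highest-scored pairs that appear in test_edges.
    Ties are broken alphabetically so the result is deterministic."""
    ranked = sorted(scores, key=lambda p: (-scores[p], p))[:L]
    hits = sum(1 for p in ranked if p in test_edges)
    return hits / L


# Toy data (hypothetical): earlier papers for training, later links for testing.
train_papers = [["A", "B"], ["A", "C"], ["B", "D"], ["C", "D"], ["D", "E"]]
test_edges = {("A", "D"), ("B", "E")}

scores = cn_scores(build_network(train_papers))
print(precision_at_l(scores, test_edges, L=2))  # 0.5
```

Rerunning the same evaluation with `build_network(train_papers, min_freq=2)`, `min_freq=3`, and so on would give progressively smaller networks, which is one way to compare indicator accuracy across data sizes under these assumptions.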
Source
《情报杂志》
CSSCI
PKU Core Journal (北大核心)
2016, No. 9, pp. 80-85 (6 pages)
Journal of Intelligence
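Funding
National Natural Science Foundation of China, Young Scientists Fund: "Dynamic Identification of Breakthrough Innovation Based on Mutations in Cited Scientific Knowledge and Its Formation Mechanism" (No. 71503125)
Ministry of Education Humanities and Social Sciences Research Youth Fund: "Dynamic Identification of Topic Mutations in Heterogeneous Knowledge Networks" (No. 14YJC870025)
Fundamental Research Funds for the Central Universities: "Methods for Dynamic Identification of Breakthrough Innovation Based on Mutations in Scientific Knowledge Cited by Patents and Their Formation Mechanism" (No. 30915013101). This paper is one of the research outputs of these projects.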
Keywords
data size
co-authorship prediction
Library and Information Science
precision
optimal indicator