Abstract
To address inaccuracies in the academic paper data that teachers enter into the comprehensive information system of the University of Electronic Science and Technology of China, a scheme is proposed that identifies standard journal or conference names by computing cosine similarity. First, the filled-in names are preprocessed and the names crawled from the Internet are cleaned, yielding the test names. Using the classic TF-IDF method, all test names and standard journal names are tokenized, stripped of stop words, and reduced to their constituent words. Once the TF-IDF value of every word has been computed, each test name and each standard journal name is converted into a multidimensional vector made up of the TF-IDF values of all words. The correct standard journal name is then identified by computing the cosine similarity between the test names and the standard journal names. The identification results obtained in practice show that the cosine similarity calculation greatly improves the quality of the filled-in academic paper data.
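The pipeline described in the abstract (tokenize, remove stop words, build TF-IDF vectors, compare by cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the smoothed IDF formula, and the helper names `tokenize`, `build_idf`, `tfidf_vector`, `cosine_similarity`, and `best_match` are all assumptions made for illustration, and the preprocessing and crawling steps are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the paper's actual list is not given.
STOP_WORDS = {"of", "the", "and", "on", "in", "for", "a", "an"}


def tokenize(name):
    """Lowercase a name, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_idf(documents):
    """Smoothed inverse document frequency per word (an assumed variant)."""
    n = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict mapping word -> weight."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (c / total) * idf[w] for w, c in counts.items() if w in idf}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(test_name, standard_names):
    """Return (standard_name, score) with the highest cosine similarity."""
    standard_docs = [tokenize(n) for n in standard_names]
    test_doc = tokenize(test_name)
    # IDF is computed over the test name together with all standard names.
    idf = build_idf(standard_docs + [test_doc])
    test_vec = tfidf_vector(test_doc, idf)
    scored = [
        (cosine_similarity(test_vec, tfidf_vector(doc, idf)), name)
        for doc, name in zip(standard_docs, standard_names)
    ]
    score, name = max(scored)
    return name, score


if __name__ == "__main__":
    standards = [
        "Journal of Southeast University (Natural Science Edition)",
        "IEEE Transactions on Information Theory",
        "Pattern Recognition Letters",
    ]
    print(best_match("J. of southeast university natural science ed.", standards))
```

In this sketch a misspelled or abbreviated test name still matches the standard name with which it shares the most distinctive words, because shared rare words carry higher TF-IDF weight. A production system would likely also need a similarity threshold below which no match is accepted.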
Source
Journal of Southeast University (Natural Science Edition)
EI
CAS
CSCD
PKU Core Journal
2017, Issue A01, pp. 123-128 (6 pages)
Funding
Special Construction Funding Project of the University of Electronic Science and Technology of China (Y03093036001089)
Keywords
big data analysis
comprehensive information system
cosine similarity
multidimensional vector transformation
data governance