Abstract
[Purpose/significance] This study seeks suitable methods for identifying and extracting mentions of academic papers from texts of different sources, to provide a reference for subsequent altmetrics research in China. [Method/process] Based on extensive data collection, the study summarizes the categories of altmetrics data sources in China, proposes an approach to identifying academic papers in those sources, explores the feasibility of applying named entity recognition to the identification and extraction of academic papers, and conducts an empirical analysis with a regular expression-based recognition method. [Result/conclusion] Chinese altmetrics data sources are varied and include knowledge sharing platforms, academic social platforms, general social platforms, news platforms, disciplinary communication platforms, and video websites. Mentions of academic papers, as a new type of named entity, can be identified and extracted with reference to traditional named entity recognition methods. The empirical study shows that the regular expression-based method can be used to identify academic papers, achieving an F1 value of 80% on a dataset from the "机器学习" (Machine Learning) topic on Zhihu, and that how well the regular expression templates match the text plays a key role in recognition performance.
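As a rough illustration of the regular expression-based idea described in the abstract, the sketch below extracts candidate paper mentions from a post and scores them against a hand-labelled set. The three patterns (book-title marks, DOI, arXiv identifier), the function names, and the sample text are assumptions made for illustration only; they are not the templates or data used in the study.

```python
import re

# Hypothetical templates, NOT the ones used in the study.
PATTERNS = [
    re.compile(r"《([^》]+)》"),                         # title inside Chinese book-title marks
    re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)"),          # DOI-like string
    re.compile(r"arxiv\.org/abs/(\d{4}\.\d{4,5})", re.I),  # arXiv identifier in a URL
]


def extract_mentions(text):
    """Return the set of candidate paper mentions matched by any template."""
    found = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.add(match.group(1))
    return found


def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 over two sets of mentions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up sample post and gold labels, for demonstration only.
    post = "推荐《Attention Is All You Need》 和 arxiv.org/abs/1706.03762 以及 doi 10.1000/xyz123 ."
    gold = {"Attention Is All You Need", "1706.03762", "10.1000/xyz123", "BERT"}
    predicted = extract_mentions(post)
    print(predicted, precision_recall_f1(predicted, gold))
```

In this toy case the mention "BERT" has no surface cue that any template covers, so it is missed and recall drops. This is consistent with the abstract's point that how well the templates match the text drives recognition performance.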
Source
Information Studies: Theory & Application (《情报理论与实践》)
CSSCI
PKU Core Journal (北大核心)
2022, Issue 12, pp. 111-118 (8 pages)
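Funding
One of the outcomes of the National Natural Science Foundation of China General Program project "Research on the Data Identification Mechanism and Key Analysis Methods of Chinese Altmetrics" (Grant No. 72274227)
and the Ministry of Education Humanities and Social Sciences Research Planning Fund project "Research on Evaluating the Social Impact of University Research by Integrating Altmetrics Analysis" (Grant No. 22YJA870016).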