Abstract
[Objective] This paper examines how different deep active learning (DeepAL) methods perform in identifying the structural function of scientific literature abstracts, and what labeling costs they incur. [Methods] We propose a method for identifying the structural function of abstracts based on active learning and sequence labeling. First, we constructed a SciBERT-BiLSTM-CRF model for abstracts (SBCA) that uses the contextual sequence information between sentences. Then, we developed uncertainty-based active learning strategies at two levels: single sentences and whole abstracts. Finally, we conducted experiments on the PubMed 20K dataset. [Results] The SBCA model achieved the best recognition performance, raising the F1 value by 11.93 percentage points over a SciBERT model that ignores sequence information. With the Least Confidence strategy based on whole abstracts, SBCA reached its optimal F1 value using only 60% of the data; with the Least Confidence strategy based on single sentences, it needed 65% of the data. [Limitations] This study built only uncertainty-based active learning query strategies and did not consider other categories of query strategies. Future work needs to examine different active learning strategies in more fields or on multilingual datasets. [Conclusions] Methods based on deep active learning help identify the structural function of abstracts at a lower annotation cost.
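The abstract does not state how Least Confidence is computed at each level, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each sentence comes with a probability distribution over structural-function labels, scores a sentence as one minus its top label probability, and scores a whole abstract as one minus the product of its sentences' top probabilities (an independence assumption made for illustration, not the paper's CRF-based formulation). The function names and the toy pool are hypothetical.

```python
from math import prod
from typing import Dict, List, Sequence


def sentence_least_confidence(probs: Sequence[float]) -> float:
    """Least Confidence for one sentence: 1 - probability of the top label."""
    return 1.0 - max(probs)


def abstract_least_confidence(abstract: Sequence[Sequence[float]]) -> float:
    """Least Confidence for a whole abstract.

    Assumption for illustration: sentences are treated as independent, so the
    confidence of the best label sequence is the product of per-sentence top
    probabilities. A CRF would supply the sequence probability directly.
    """
    return 1.0 - prod(max(p) for p in abstract)


def select_abstracts(pool: Dict[str, List[List[float]]], k: int) -> List[str]:
    """Return the k abstract ids the model is least confident about."""
    ranked = sorted(
        pool,
        key=lambda abstract_id: abstract_least_confidence(pool[abstract_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical unlabeled pool: abstract id -> per-sentence label distributions.
pool = {
    "abs_1": [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
    "abs_2": [[0.4, 0.35, 0.25], [0.5, 0.3, 0.2]],
    "abs_3": [[0.99, 0.005, 0.005], [0.97, 0.02, 0.01]],
}

# Abstracts chosen for annotation in the next active learning round.
to_label = select_abstracts(pool, k=2)
```

In a sentence-level variant, the same ranking would be applied to individual sentences with `sentence_least_confidence` instead of to whole abstracts.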
Authors
毛进
陈子洋
Mao Jin; Chen Ziyang (Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China; School of Information Management, Wuhan University, Wuhan 430072, China)
Source
《数据分析与知识发现》
EI
CSCD
PKU Core Journals (北大核心)
2024, Issue 6, pp. 44-55 (12 pages)
Data Analysis and Knowledge Discovery
Funding
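This work is one of the research outcomes of the National Natural Science Foundation of China (Grant No. 72174154)
and the Major Project of Key Research Bases of Humanities and Social Sciences in Universities (Grant No. 22JJD870005).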
Keywords
Deep Learning
Document Structural Function Identification
Move
Active Learning
Knowledge Organization