Abstract
[Objective] This paper proposes a sub-field news classification scheme that combines semi-supervised learning with active learning, for open-source technical intelligence monitoring based on news text mining. [Methods] First, we ran K-Means clustering on learned news text representations and selected a small number of representative samples from each cluster for manual category judgment; the resulting categories were merged and adjusted to form the sub-field categories. Second, we used the representative samples as the training set and trained an initial classifier by ensembling multiple classification algorithms. Finally, we applied active learning guided by perplexity and the confusion matrix to iteratively optimize the initial classifier in a targeted way. [Results] On a news dataset covering tanks and armored vehicles, the classifier after active learning reached a precision, recall and F1 value of 83.68%, 83.35% and 83.17%, which are 2.71, 2.52 and 2.81 percentage points higher, respectively, than before active learning. [Limitations] To limit the manual labeling workload, only two active learning iterations were conducted. [Conclusions] Starting from raw data with no corpus annotation and no predefined sub-field categories, the proposed scheme yields an effective sub-field news classifier in an integrated way with only a small amount of manual effort. It is cost-effective in practice and generalizes well to other domains.
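The three steps in the abstract can be illustrated with a minimal Python sketch using scikit-learn. This is not the authors' implementation. Several things here are assumptions: the news texts are taken to be already converted to a numeric feature matrix `X`; the three base classifiers are arbitrary choices; "perplexity" is read as the exponential of the entropy of the predicted class distribution, since the abstract gives no formula; the confusion-matrix criterion is omitted; and the human annotator is simulated by a label array `y_oracle`. All function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB


def pick_representatives(X, n_clusters, per_cluster, seed=0):
    """Cluster X with K-Means; return indices of the points nearest each centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    dist = km.transform(X)  # (n_samples, n_clusters) distances to every centroid
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        nearest = members[np.argsort(dist[members, c])[:per_cluster]]
        chosen.extend(nearest.tolist())
    return np.array(chosen)


def build_ensemble(seed=0):
    """Soft-voting ensemble; the choice of base learners is an assumption."""
    return VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", GaussianNB()),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
        ],
        voting="soft",
    )


def perplexity(proba):
    """exp(entropy) of each row; 1.0 means fully confident (assumed definition)."""
    p = np.clip(proba, 1e-12, 1.0)
    return np.exp(-(p * np.log(p)).sum(axis=1))


def active_learning_round(clf, X, y_oracle, labeled, batch_size):
    """Label the most uncertain unlabeled samples, then retrain on the larger set."""
    pool = np.setdiff1d(np.arange(len(X)), labeled)
    scores = perplexity(clf.predict_proba(X[pool]))
    query = pool[np.argsort(-scores)[:batch_size]]  # highest perplexity first
    labeled = np.concatenate([labeled, query])
    clf.fit(X[labeled], y_oracle[labeled])  # y_oracle stands in for a human annotator
    return clf, labeled


def demo():
    """Run the pipeline on synthetic, well-separated data (not news text)."""
    X, y = make_blobs(
        n_samples=150,
        centers=[[0, 0], [10, 10], [-10, 10]],
        cluster_std=1.0,
        random_state=0,
    )
    labeled = pick_representatives(X, n_clusters=3, per_cluster=5)
    clf = build_ensemble().fit(X[labeled], y[labeled])
    for _ in range(2):  # two iterations, matching the count stated in the abstract
        clf, labeled = active_learning_round(clf, X, y, labeled, batch_size=10)
    return clf, labeled, X, y
```

Calling `demo()` returns the retrained ensemble together with the indices of every sample that was labeled. In a real setting, the lookup into `y_oracle` would be replaced by a manual labeling step, and the merging and adjusting of categories described in the abstract would happen between clustering and training.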
Authors
陈果
叶潮
Chen Guo; Ye Chao (School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China; Jiangsu Science and Technology Collaborative Innovation Center of Social Public Safety, Nanjing 210094, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journals
2022, Issue 4, pp. 28-38 (11 pages)
Data Analysis and Knowledge Discovery
Funding
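This work is one of the research outcomes of the following projects:
Ministry of Education Humanities and Social Sciences Research Youth Project (Grant No. 21YJC870003)
Jiangsu Province Social Science Fund Youth Project (Grant No. 21TQC002)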
Keywords
Semi-Supervised Learning
Active Learning
Text Classification
Ensemble Learning