Imbalanced Text Categorization Method with SLDA Topic Model (cited by: 3)

Abstract Supervised categorization algorithms usually perform well on datasets whose labels are evenly distributed and whose labeled samples are sufficiently numerous. In practical applications, however, label distributions are usually imbalanced, and categorization performance degrades. To address this, a new imbalanced text categorization algorithm, ITC-SLDA (Imbalanced Text Categorization based on Supervised LDA), is proposed that builds on the supervised topic model SLDA (Supervised LDA). Based on the SLDA topic model, a precise mapping between topics and rare classes is established to improve the categorization accuracy of minority classes. The SLDA model is used to label unlabeled samples, and a new method for computing the confidence of unlabeled samples is proposed together with a class-constrained sampling strategy, with the aim of sampling unlabeled samples effectively, ultimately reducing the skewness of imbalanced text and improving its categorization performance. Experimental results show that the proposed method significantly improves Macro-F1 and G-mean values in imbalanced text categorization tasks.
Authors TANG Huanling (唐焕玲); LIU Yanhong (刘艳红); ZHENG Han (郑涵); DOU Quansheng (窦全胜); LU Mingyu (鲁明羽). Affiliations: School of Computer Science and Technology, Shandong Technology and Business University, Yantai, Shandong 264005, China; Co-innovation Center of Shandong Colleges and Universities, Yantai, Shandong 264005, China; Key Laboratory of Intelligent Information Processing in Universities of Shandong (Shandong Technology and Business University), Yantai, Shandong 264005, China; Information Science and Technology College, Dalian Maritime University, Dalian, Liaoning 116026, China
Source Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2021, No. 12, pp. 144-154 (11 pages)
Funding National Natural Science Foundation of China (61976124, 61976125, 61772319, 61773244, 61972235).
Keywords supervised topic model; semi-supervised learning; imbalanced text; categorization
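
The abstract reports its results as Macro-F1 and G-mean. This record does not give the paper's exact formulas, so the following is only a minimal sketch under the common definitions: Macro-F1 as the unweighted mean of per-class F1 scores, and G-mean as the geometric mean of per-class recalls. The function names and the toy labels ("maj", "min") are illustrative assumptions, not taken from the paper.

```python
from math import prod


def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)}, counted one-vs-rest for each label."""
    counts = {}
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    f1_scores = []
    for tp, fp, fn in per_class_counts(y_true, y_pred, labels).values():
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here
        # (an assumed convention).
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def g_mean(y_true, y_pred, labels):
    """Geometric mean of per-class recalls (assumed multi-class definition)."""
    recalls = []
    for tp, _fp, fn in per_class_counts(y_true, y_pred, labels).values():
        support = tp + fn
        recalls.append(tp / support if support else 0.0)
    return prod(recalls) ** (1.0 / len(recalls))


# Toy imbalanced example: 8 majority samples, 2 minority samples,
# and a classifier that always predicts the majority class.
toy_true = ["maj"] * 8 + ["min"] * 2
toy_pred = ["maj"] * 10
toy_labels = ["maj", "min"]
toy_macro_f1 = macro_f1(toy_true, toy_pred, toy_labels)
toy_g_mean = g_mean(toy_true, toy_pred, toy_labels)
```

In this toy case accuracy is 0.8, yet the minority class has zero recall, so G-mean collapses to 0 and Macro-F1 falls well below accuracy. This is why these two measures, rather than accuracy, are typically used to evaluate imbalanced categorization.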