Abstract
With the development of the Internet and information technology, large volumes of multi-label text data are being generated rapidly. In text classification, determining an appropriate number of categories and identifying document labels more accurately are pressing problems. The proposed HL_LDA model determines the number of categories automatically through the hierarchical Dirichlet process and improves classification quality by exploiting the hierarchical information among the labels of multi-label documents. Experimental results on different types of data sets show that HL_LDA clearly outperforms classical methods such as LDA and SVM on evaluation metrics including precision and F1-score.
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal
2017, Issue 23, pp. 18-23, 46 (7 pages)
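Funding
Young Scientists Fund (No. 60903035)
National Natural Science Foundation of China (No. 61572373)
National Key Research and Development Program of China (No. 2017YFC0803808)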
Keywords
multi-label
text classification
label dependence
hierarchical Dirichlet process