Abstract
Traditional text classification methods do not take sufficient account of the links among documents. To address this problem, this paper proposes TC-iTM, a link-aware text classification algorithm based on iTopicModel. TC-iTM uses the probabilities with which documents of known class are assigned to each topic to determine the class that each topic represents. It then classifies unlabeled documents using the probabilities with which they are assigned to each topic, together with their text information. Experimental results show that TC-iTM outperforms traditional text classification methods when the links among documents in a document network strongly influence the documents' classes.
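The two steps named in the abstract (deciding which class each topic represents, then classifying unlabeled documents from their topic probabilities) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the exact decision rule, so the sketch assumes a simple "largest summed probability" rule, takes the document-topic probabilities as already computed (the iTopicModel inference is not shown), and leaves out the text-information component entirely. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def topic_to_class(theta_labeled, labels):
    """Assumed rule: a topic represents the class whose labeled
    documents contribute the most probability mass to that topic."""
    n_topics = len(theta_labeled[0])
    mass = [defaultdict(float) for _ in range(n_topics)]
    for probs, label in zip(theta_labeled, labels):
        for k, p in enumerate(probs):
            mass[k][label] += p
    return [max(m, key=m.get) for m in mass]


def classify(theta_doc, topic_class):
    """Assumed rule: score each class by the total probability of the
    topics mapped to it, then pick the highest-scoring class.
    The text-information term mentioned in the abstract is omitted."""
    scores = defaultdict(float)
    for p, c in zip(theta_doc, topic_class):
        scores[c] += p
    return max(scores, key=scores.get)


# Hypothetical document-topic probabilities for three labeled documents.
theta_labeled = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]
labels = ["sports", "sports", "politics"]

topic_class = topic_to_class(theta_labeled, labels)
prediction = classify([0.3, 0.7], topic_class)
print(topic_class, prediction)
```

In this toy setup each topic ends up mapped to exactly one class; a fuller version would presumably also need to handle topics shared by several classes and combine the topic score with a text-based classifier, neither of which the abstract specifies.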
Source
《计算机工程》
CAS
CSCD
PKU Core Journals (北大核心)
2011, No. 21, pp. 124-125, 130 (3 pages)
Computer Engineering
Funding
National Natural Science Foundation of China (Grant No. 60970083)
Keywords
text classification
document network
topic model
EM algorithm