Abstract
With the rise of social networks, the volume of text data keeps growing, which has made automated text classification an active research topic. A single text may carry several category labels at once, and this property directly causes conventional binary or multi-class classification techniques to perform poorly on multi-label text data. To address this shortcoming, we propose SCA (subspace clustering analysis), a subspace clustering algorithm based on semi-supervised impurity, which analyses the latent correlation between each classification-label pair in a multi-label setting. We also design a multi-label classifier that is more effective on classified text data. Finally, experiments on two multi-label text datasets show that the algorithm outperforms the other text classification methods currently in use.
Source
Computer Applications and Software
CSCD
PKU Core Journal
2014, Issue 8, pp. 288-291, 303 (5 pages)
Keywords
Text data
Multi-label
Classifier
Subspace clustering
Impurity