Abstract
This paper proposes SemiBoost-CR, a classification model based on confidence-based resampling that combines semi-supervised learning with ensemble learning. A formula is given for computing the confidence score of an unlabeled example from its labeled and unlabeled nearest neighbors. Resampling by confidence selects not only a proportion of the unlabeled examples with higher confidence scores but also a proportion of those with lower confidence scores, and the two groups are added to the labeled training set under different strategies. The unlabeled examples with higher confidence are introduced to improve the accuracy of the base classifiers, while those with lower confidence are introduced to further increase the diversity among the base classifiers. Comparative experiments show that SemiBoost-CR effectively improves the performance of the Naive Bayesian text classifier.
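The selection step described in the abstract can be illustrated with a minimal sketch. It assumes the confidence scores have already been computed, since the abstract does not state the formula; the function name `resample_by_confidence` and the fractions `high_frac` and `low_frac` are hypothetical and not taken from the paper. How each group is then added to the labeled set is also not covered here.

```python
def resample_by_confidence(confidences, high_frac=0.2, low_frac=0.1):
    """Pick indices of the most and least confident unlabeled examples.

    confidences: one precomputed score per unlabeled example (assumed input).
    high_frac, low_frac: hypothetical proportions, not values from the paper.
    """
    n = len(confidences)
    # Indices ordered from highest to lowest confidence.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    n_high = int(n * high_frac)
    # Cap the low group so the two groups never overlap.
    n_low = min(int(n * low_frac), n - n_high)
    high_idx = order[:n_high]
    low_idx = order[n - n_low:] if n_low > 0 else []
    return high_idx, low_idx


# Toy scores for ten unlabeled examples (illustrative values only).
scores = [0.95, 0.10, 0.60, 0.85, 0.30, 0.05, 0.70, 0.50, 0.99, 0.40]
high, low = resample_by_confidence(scores, high_frac=0.2, low_frac=0.2)
print(high, low)
```

In this sketch the high-confidence group would be the candidates for improving base-classifier accuracy and the low-confidence group the candidates for increasing diversity, matching the two roles named in the abstract.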
Source
Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》)
CSCD
2011, No. 11, pp. 1048-1056 (9 pages)
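Funding
National Natural Science Foundation of China, Nos. 61073133 and 61175053
Specialized Research Fund for the Doctoral Program of Higher Education, No. 20070151009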