Abstract
At present, labeled data are rarely applied to the classification of textual ads. To address the cross-category problem in text classification, the author proposes NLSA (new latent semantic analysis), an algorithm based on traditional latent semantic analysis, for classifying web page text. The method integrates labeled and unlabeled data from different but related categories into a single probabilistic model and, by studying the topics that two categories have in common, transfers knowledge between categories to help classify the target text. This allows it to make maximal use of the original labeled data when classifying new text. Experiments show that the algorithm significantly improves cross-category text classification performance and outperforms traditional text classifiers.
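The general idea of placing labeled and unlabeled documents in one shared latent space can be illustrated with a minimal sketch. This is not the NLSA algorithm itself: the abstract gives no details of its probabilistic model, so the sketch substitutes plain SVD-based latent semantic analysis followed by nearest-centroid classification. The function names, the toy term-document counts, and the labels "sports" and "politics" are all hypothetical.

```python
import numpy as np

def fit_latent_space(term_doc, k):
    """Return the top-k left singular vectors of a terms x documents matrix."""
    U, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k]

def classify_targets(source, source_labels, target, k=2):
    """Label target documents by cosine similarity to class centroids
    computed from labeled source documents in a shared latent space."""
    # Fit the latent space on labeled and unlabeled documents together.
    basis = fit_latent_space(np.hstack([source, target]), k)
    src_latent = basis.T @ source   # shape: k x n_source
    tgt_latent = basis.T @ target   # shape: k x n_target

    classes = sorted(set(source_labels))
    labels = np.array(source_labels)
    # One centroid per class, averaged over that class's labeled documents.
    centroids = np.stack(
        [src_latent[:, labels == c].mean(axis=1) for c in classes]
    )
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    tgt_unit = tgt_latent / np.linalg.norm(tgt_latent, axis=0, keepdims=True)

    scores = centroids @ tgt_unit   # shape: n_classes x n_target
    return [classes[i] for i in scores.argmax(axis=0)]

# Hypothetical toy data: rows are 6 terms, columns are documents.
source = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
    [0, 0, 0, 1],
], dtype=float)
source_labels = ["sports", "sports", "politics", "politics"]
target = np.array([
    [0, 0],
    [1, 0],
    [2, 0],
    [0, 0],
    [0, 1],
    [0, 3],
], dtype=float)

print(classify_targets(source, source_labels, target))
```

Fitting the decomposition on the stacked source and target matrices is what lets unlabeled documents influence the latent topics. A probabilistic topic model, as the abstract describes, would replace the SVD step.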
Source
Journal of Yangzhou University (Natural Science Edition)
CAS
CSCD
PKU Core Journals
2011, Issue 4, pp. 43-46 (4 pages)
Funding
National High Technology Research and Development Program of China (863 Program), grant 2007AA01Z448
Keywords
contextual advertising (textual ads)
text classification
latent semantic analysis