Abstract
Supervised classification is widely used for text classification, but it requires manually labeled samples for training, and labeling those samples is tedious. Unsupervised classification skips the labeling step, but its results are often unsatisfactory. To balance the strengths and weaknesses of the two approaches, this paper uses a text classification method that combines SVM and K-means. The texts are first clustered with K-means; the texts closest to each cluster center are then selected as the training samples for that class and used to train an SVM classifier; finally, the trained SVM classifies the texts. The method avoids the poor accuracy of purely unsupervised classification while eliminating the manual labeling that SVM training normally requires. Experiments on disaster-related texts also indicate that the approach is feasible.
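The three steps in the abstract (cluster, keep the samples nearest each center, train an SVM on them) can be illustrated with a small dependency-free sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names are hypothetical, 2-D toy vectors stand in for document feature vectors, only two classes are handled, K-means starts from evenly spaced points for reproducibility, and the SVM is a linear one without a bias term (so the features are assumed to be centered on the origin), fitted by plain sub-gradient descent on the regularized hinge loss.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(p, centers):
    return min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))

def kmeans(points, k, iters=20):
    """Lloyd's k-means; initial centers are evenly spaced points (assumption)."""
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, [nearest(p, centers) for p in points]

def select_near_center(points, centers, labels, per_cluster):
    """Keep the per_cluster points closest to each center as pseudo-labeled samples."""
    samples = []
    for c, center in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == c]
        members.sort(key=lambda p: sq_dist(p, center))
        samples += [(p, c) for p in members[:per_cluster]]
    return samples

def train_linear_svm(samples, lam=0.01, lr=0.1, steps=200):
    """Two-class linear SVM without bias: sub-gradient descent on hinge loss + L2."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    data = [(x, 1 if y == 1 else -1) for x, y in samples]  # cluster 1 -> +1, else -1
    for _ in range(steps):
        grad = [lam * wi for wi in w]
        for x, y in data:
            if y * dot(w, x) < 1:  # margin violated, so the hinge term is active
                for j in range(dim):
                    grad[j] -= y * x[j] / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def predict(w, x):
    """Return the cluster index (0 or 1) on whose side of the hyperplane x falls."""
    return 1 if dot(w, x) > 0 else 0

# Toy "document vectors": two well-separated groups centered on the origin.
DOCS = [(2.0, 2.1), (1.8, 2.2), (2.2, 1.9), (2.1, 2.3), (1.9, 1.8), (2.4, 2.0),
        (-2.0, -2.1), (-1.8, -2.2), (-2.2, -1.9), (-2.1, -2.3), (-1.9, -1.8), (-2.4, -2.0)]

centers, cluster_labels = kmeans(DOCS, k=2)
train_set = select_near_center(DOCS, centers, cluster_labels, per_cluster=3)
weights = train_linear_svm(train_set)
predicted = [predict(weights, doc) for doc in DOCS]
```

Only the three samples nearest each center are used for training, yet `predicted` covers all twelve vectors, which mirrors the idea of letting the clustering supply labels for a small, reliable subset. A practical version would replace the toy vectors with real text features and the hand-written pieces with a tested library.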
Source
《计算机技术与发展》
2009, Issue 11, pp. 35-37, 44 (4 pages)
Computer Technology and Development
Funding
National Key Technology R&D Program of China (2006BAD20B02)