Abstract
Text categorization is an important technique in text data mining. This survey presents practical solutions currently in use for each step of the categorization process: document modeling, feature selection, dimensionality reduction, and choice of classification scheme. The algorithms discussed are grouped into categories and their performance is compared both qualitatively and quantitatively. The paper closes with a discussion of open problems in text categorization research in China and of future developments in the field.
Source
《广西师范大学学报(自然科学版)》
CAS
2003, Issue A01, pp. 173-179 (7 pages)
Journal of Guangxi Normal University: Natural Science Edition
Funding
Research and Design of Railway Data Center Architecture (2002X039)
Keywords
text categorization
feature selection
dimensionality reduction
vector space model (VSM)
similarity
combination model