Abstract
In the absence of a well-established Chinese stop-word list, this paper presents a text preprocessing method that uses program flow control to remove, from the output of a Chinese word segmenter, single isolated characters, English characters, digits, mathematical symbols, and Chinese words containing any of these. As a result, pure Chinese words of two or more characters become the feature terms that represent the text. This not only markedly reduces the dimensionality of the initial text vector but also greatly increases the amount of feature information the vector carries.
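The filtering rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_tokens`, the sample token list, and the choice of the Unicode range U+4E00 to U+9FFF as the definition of a "Chinese character" are all assumptions made here. The input is assumed to be a list of tokens already produced by some Chinese word segmenter.

```python
import re

# Assumption: a "pure Chinese word" is two or more characters, all drawn from
# the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
PURE_CHINESE = re.compile(r"[\u4e00-\u9fff]{2,}")


def filter_tokens(tokens):
    """Keep only tokens that consist entirely of two or more Chinese characters.

    Single characters, English words, numbers, symbols, and mixed tokens
    (Chinese combined with letters, digits, or symbols) are all dropped.
    """
    return [t for t in tokens if PURE_CHINESE.fullmatch(t)]


# Hypothetical segmenter output, for illustration only.
tokens = ["文本", "的", "分类", "ABC", "2005", "+", "3G网络", "预处理", "x轴"]
kept = filter_tokens(tokens)
print(kept)  # ['文本', '分类', '预处理']
```

Because a single full-match test covers every removal case, no stop-word list is consulted. The trade-off is that every multi-character Chinese word is kept, including common function words that a stop-word list would normally remove.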
Source
《计算机应用研究》
CSCD
PKU Core Journals (北大核心)
2005, No. 2, pp. 85-86 (2 pages)
Application Research of Computers
Keywords
Text classification
Text preprocessing
Stop words
Chinese word segmentation