Abstract
Existing text clustering algorithms based on the vector space model suffer from two shortcomings: high dimensionality and neglect of the semantic relations between words. To address these problems, a text clustering algorithm based on word similarity (TCWS) is proposed. The algorithm first uses word similarity to group words into word clusters, which captures the semantic relations between words. It then uses these word clusters as the terms of the vector space to represent texts, which reduces the dimensionality of the vector space. Finally, it clusters the texts with a partition-based clustering method. Experimental results show that, compared with traditional clustering algorithms based on the vector space model, TCWS achieves better clustering quality.
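The three steps described in the abstract (grouping words by similarity, representing texts over word clusters, and partition-based clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify the similarity measure, the word-grouping procedure, or the partition method. The hand-made `toy_sim` function, the greedy threshold grouping in `cluster_words`, and the use of k-means with farthest-point initialisation are all hypothetical stand-ins.

```python
def cluster_words(vocab, sim, threshold):
    """Greedy grouping (assumed): a word joins the first class whose
    representative (first member) is at least `threshold` similar."""
    classes = []
    for word in vocab:
        for members in classes:
            if sim(word, members[0]) >= threshold:
                members.append(word)
                break
        else:
            classes.append([word])
    return classes


def class_vector(tokens, word_to_class, n_classes):
    """Represent a text by how often each word class occurs in it."""
    vec = [0.0] * n_classes
    for tok in tokens:
        idx = word_to_class.get(tok)
        if idx is not None:
            vec[idx] += 1.0
    return vec


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means (assumed partition method), deterministic
    farthest-point initialisation starting from the first vector."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical similarity: 1.0 if two words share a hand-made group, else 0.0.
GROUPS = [{"car", "automobile", "vehicle"}, {"apple", "banana", "fruit"}]


def toy_sim(a, b):
    return 1.0 if any(a in g and b in g for g in GROUPS) else 0.0


docs = ["car automobile vehicle", "automobile car",
        "apple banana fruit", "banana fruit"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

classes = cluster_words(vocab, toy_sim, threshold=0.5)
word_to_class = {w: i for i, members in enumerate(classes) for w in members}
vectors = [class_vector(doc, word_to_class, len(classes)) for doc in tokenized]
labels = kmeans(vectors, k=2)

print(len(vocab), "words reduced to", len(classes), "word classes")
print(labels)
```

In this toy run the six-word vocabulary collapses into two word classes, so each text is a 2-dimensional vector instead of a 6-dimensional one. This is the dimensionality reduction the abstract refers to, shown here only at toy scale.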
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journal (北大核心)
2009, Issue 8, pp. 1966-1968 (3 pages)
Funding
National Torch Program of China (2004EB33006[0])
Guiding Program for Natural Science Research of Jiangsu Provincial Universities (05JKD520050)
Keywords
text clustering
word similarity
vector space model
word cluster vector space
text representation