Abstract
To address the high-dimensional sparsity of text representations in text clustering, a short text clustering algorithm that combines semantic and statistical features is proposed. The algorithm performs a first dimensionality reduction by analyzing the semantic relatedness of words with a semantic dictionary, performs a second reduction through statistical feature selection, and then clusters short texts on the fused features from the two reductions. Experimental results show that the algorithm achieves good clustering quality and efficiency on short texts.
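The two-stage reduction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionary `SEMANTIC_DICT`, the use of document frequency as the statistical criterion, the single-pass cosine clustering, and all function names and parameters (`k`, `threshold`) are assumptions made here for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical toy semantic dictionary (word -> concept label); the paper's
# actual dictionary and relatedness measure are not given in the abstract.
SEMANTIC_DICT = {
    "car": "vehicle", "auto": "vehicle", "truck": "vehicle",
    "cheap": "low_price", "discount": "low_price",
    "soccer": "sport", "football": "sport", "match": "sport",
}

def semantic_reduce(tokens):
    """First reduction: merge semantically related words into one concept."""
    return [SEMANTIC_DICT.get(t, t) for t in tokens]

def select_features(docs, k):
    """Second reduction (assumed criterion): keep the k features with the
    highest document frequency, breaking ties alphabetically."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    ranked = sorted(df, key=lambda f: (-df[f], f))
    return ranked[:k]

def vectorize(doc, features):
    """Represent a document as term counts over the selected features."""
    counts = Counter(doc)
    return [counts[f] for f in features]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(vectors, threshold):
    """Single-pass clustering: a vector joins the first cluster whose seed
    vector is at least `threshold` similar, otherwise it starts a new one."""
    seeds, labels = [], []
    for v in vectors:
        for i, seed in enumerate(seeds):
            if cosine(v, seed) >= threshold:
                labels.append(i)
                break
        else:
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return labels

texts = ["cheap car", "discount auto", "soccer match", "football match today"]
docs = [semantic_reduce(t.split()) for t in texts]
features = select_features(docs, k=3)
labels = cluster([vectorize(d, features) for d in docs], threshold=0.5)
print(features, labels)
```

In this toy run the dictionary collapses eight surface words into three concepts before any statistics are computed, which is the point of doing the semantic step first: short texts that share no literal words can still end up with identical vectors.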
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
2012, No. 22, pp. 171-175 (5 pages)
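Funding
National High Technology Research and Development Program of China (863 Program) (2011AA010704, 2012AA011004)
Tsinghua University Initiative Scientific Research Program, project "Key Technologies for Cross-Media Distributed Vertical Search and Public Opinion Analysis" (20111081023)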
Keywords
feature selection
clustering
short text
Vector Space Model (VSM)
semantic
dimension reduction