Abstract
This paper proposes a text clustering algorithm based on density peak discovery, in which the distance and density computations for documents are recast as similarity computations on document vectors. Documents are represented with the vector space model (VSM), and their similarity is computed with the cosine formula; from these similarities the algorithm obtains, for each document, its local density and its distance from documents of higher density. After noise points are removed, the cluster centers are selected, and each remaining non-center document is assigned to the cluster of its nearest cluster center. Several groups of comparative experiments verify the reliability and robustness of the method.
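The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the term weighting, the density definition, the noise rule, or the center-selection rule, so the choices below are assumptions made only for illustration. These are raw term frequencies, a neighbour count within a cutoff `d_c` as the density, "no neighbour within `d_c`" as the noise rule, and the `k` largest density-times-distance products as centers. The names `cosine_distance`, `density_peak_cluster`, `d_c`, and `k` are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def density_peak_cluster(docs, d_c=0.7, k=2):
    """Return one label per document: a cluster index, or -1 for noise."""
    # Vector space model with raw term frequencies (assumption).
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    dist = [[cosine_distance(vecs[i], vecs[j]) for j in range(n)]
            for i in range(n)]

    # Local density (assumption): number of other documents closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit documents from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    for pos, i in enumerate(order):
        if pos == 0:
            # Densest document: use its largest distance to any document.
            delta[i] = max(dist[i])
        else:
            # Distance to the nearest document that ranks as denser.
            delta[i] = min(dist[i][j] for j in order[:pos])

    # Noise rule (assumption): a document with no neighbour inside d_c.
    noise = {i for i in range(n) if rho[i] == 0}

    # Centers (assumption): the k non-noise documents with largest rho * delta.
    candidates = [i for i in order if i not in noise]
    centres = sorted(candidates, key=lambda i: -rho[i] * delta[i])[:k]

    # Assign every non-noise document to its nearest center, as in the abstract.
    labels = [-1] * n
    for i in candidates:
        labels[i] = min(range(len(centres)), key=lambda c: dist[i][centres[c]])
    return labels


docs = [
    "cat dog pet animal",
    "dog pet animal food",
    "cat pet animal toy",
    "stock market price trade",
    "market price trade bank",
    "stock price trade fund",
    "quantum entanglement photon",
]
print(density_peak_cluster(docs))  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy corpus the last document shares no term with any other, so the assumed noise rule labels it -1, while the two topical groups each gather around their own center. A realistic setup would likely need a weighting such as TF-IDF and a data-driven choice of `d_c` and `k`.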
Source
Journal of Shandong University (Natural Science) (《山东大学学报(理学版)》)
CAS
CSCD
PKU Core (Peking University Core Journals)
2016, Issue 1, pp. 65-70 (6 pages)
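Funding
National Natural Science Foundation of China (61373148)
National Social Science Foundation of China (12BXW040)
Natural Science Foundation of Shandong Province (ZR2012FM038)
Shandong Province Outstanding Young and Middle-aged Scientists Research Award Fund (BS2013DX033)
Ministry of Education Humanities and Social Sciences Foundation (14YJC860042)
Shandong Province Social Science Planning Project (12BXWJ01)
Shandong Province Higher Education Science and Technology Program (J12LN21)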
Keywords
density
document clustering
feature term
vector distance