Novel text clustering approach based on R-Grams

Cited by: 1
Abstract: Traditional text clustering methods struggle to balance clustering accuracy against recall. To address this, a text clustering method based on the R-Grams text similarity algorithm is proposed. The method first sorts the documents to be clustered in descending order. It then computes the similarity between documents with the R-Grams algorithm, uses those similarities to select a marker document for each cluster, and forms the initial clusters. Finally, it merges the initial clusters to produce the final clustering. Experimental results show that the clustering result can be tuned flexibly through the clustering threshold to suit different requirements, and that the best threshold is about 15. As the threshold increases, the accuracy of each cluster rises, while recall first rises and then falls. The method also avoids time-consuming preprocessing such as word segmentation and feature extraction, and is simple to implement.
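Source: Journal of Computer Applications (indexed in CSCD and the Peking University Core Journals list), 2015, No. 11, pp. 3130-3134 (5 pages)
Funding: Zhejiang Provincial Natural Science Foundation (LY13F010005); Ministry of Education Humanities and Social Sciences Research Project (15YJAZH015); Hubei Province Science and Technology Support Program, Soft Science Project (2015BDH109); Wenzhou Science and Technology Plan Project (R20130021)
Keywords: text; clustering; random; R-Grams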
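
The three stages named in the abstract (sort, threshold-based initial clustering around marker documents, merge) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not define R-Grams, the sort key, or the merge rule, so the sketch assumes sorting by document length, substitutes a count of shared character bigrams for the R-Grams similarity, and merges clusters when any cross-cluster pair reaches the threshold. All function names are hypothetical, and threshold values here are not comparable to the value of about 15 reported in the abstract.

```python
def char_grams(text, n=2):
    """Set of character n-grams in text (stand-in features, not R-Grams)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n=2):
    """Stand-in similarity: number of distinct n-grams the two texts share."""
    return len(char_grams(a, n) & char_grams(b, n))


def cluster_documents(docs, threshold):
    """Return clusters as sorted lists of indices into docs."""
    # Stage 1: sort in descending order (by length; the sort key is an assumption).
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]), reverse=True)

    # Stage 2: initial clustering. The first element of each cluster is its
    # marker document; a document joins the cluster whose marker is most
    # similar to it, provided that similarity reaches the threshold.
    clusters = []
    for i in order:
        best, best_sim = None, threshold
        for c in clusters:
            s = similarity(docs[i], docs[c[0]])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([i])  # document i becomes a new marker
        else:
            best.append(i)

    # Stage 3: merge clusters while any cross-cluster pair reaches the
    # threshold (single-link merge rule, assumed for illustration).
    merged = True
    while merged:
        merged = False
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                if any(similarity(docs[p], docs[q]) >= threshold
                       for p in clusters[x] for q in clusters[y]):
                    clusters[x].extend(clusters[y])
                    del clusters[y]
                    merged = True
                    break
            if merged:
                break
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    sample = [
        "text clustering with grams",
        "text clustering with n-grams",
        "stock market prices fell",
        "stock market prices rose sharply",
    ]
    print(cluster_documents(sample, threshold=10))
```

Raising the threshold in this sketch makes clusters smaller and purer, and a high enough value leaves every document in its own cluster, which is the kind of accuracy and recall trade-off the abstract describes.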
