Chinese Text Clustering Based on COSA Algorithm (cited by: 9)
Abstract: Traditional clustering algorithms let every attribute contribute equally when computing the distance between two objects. The COSA (Clustering On Subsets of Attributes) algorithm [1] instead assumes that an attribute's contribution to the distance may differ from group to group, because objects in different groups may cluster on different subsets of attributes. On this basis, [1] defines a new distance and proposes two COSA algorithms: COSA1, a partitioning clustering algorithm, and COSA2, a hierarchical clustering algorithm. To compare the COSA distance with the traditional Euclidean distance in text clustering, this paper runs partitioning and hierarchical clustering experiments on Chinese documents. The results show that the COSA algorithms outperform clustering based on Euclidean distance and are more stable when the number of attributes changes.
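The idea of group-specific attribute weights can be illustrated with a minimal sketch. It is not the COSA1 or COSA2 procedure from [1]: the function names, the exponential weighting rule, and the scale parameter `lam` are assumptions chosen only to show how attributes on which a group is tightly concentrated can count more in the distance than attributes on which it is spread out.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted sum of per-attribute absolute differences between x and y."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


def group_attribute_weights(group, lam=0.2):
    """Illustrative per-group weights (an assumption, not the exact rule in [1]).

    For each attribute k, compute the mean absolute difference s_k over all
    ordered pairs of objects in the group, then set w_k proportional to
    exp(-s_k / lam) and normalize the weights so they sum to 1.
    """
    n = len(group)
    n_attrs = len(group[0])
    spreads = []
    for k in range(n_attrs):
        total = sum(abs(group[i][k] - group[j][k])
                    for i in range(n) for j in range(n))
        spreads.append(total / (n * n))
    raw = [math.exp(-s / lam) for s in spreads]
    norm = sum(raw)
    return [r / norm for r in raw]


# Hypothetical group: tight on attribute 0, spread out on attribute 1.
group = [[0.10, 0.9], [0.12, 0.1], [0.09, 0.5]]
weights = group_attribute_weights(group)
uniform = [0.5, 0.5]  # every attribute contributes equally

d_uniform = weighted_distance(group[0], group[1], uniform)
d_weighted = weighted_distance(group[0], group[1], weights)
```

With uniform weights the large gap on attribute 1 dominates the distance between the first two objects; with the group-specific weights that attribute is down-weighted, so the two objects appear closer within this group.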
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2007, No. 6, pp. 65-70 (6 pages)
Funding: National 863 Program of China (2006AA01Z142)
Keywords: computer application; Chinese information processing; text clustering; COSA algorithm; K-means algorithm

References (9)

  • 1 Jerome H Friedman, Jacqueline J Meulman. Clustering objects on subsets of attributes [J]. J R Statist Soc B, 2004, 66(4): 1-25.
  • 2 Liu Yuanchao, Wang Xiaolong, Xu Zhiming, Guan Yi. A survey of document clustering (文档聚类综述) [J]. Journal of Chinese Information Processing, 2006, 20(3): 55-62.
  • 3 Manning D C. Foundations of Statistical Natural Language Processing (统计自然语言处理基础) [M]. Translated by Yuan Chunfa, Li Qingzhong, Wang Yun, et al. 1st ed. Beijing: Publishing House of Electronics Industry, 2005.
  • 4 Sun Jixiang, et al. Modern Pattern Recognition (现代模式识别) [M]. Changsha: National University of Defense Technology Press, 2002: 460.
  • 5 Fan Ming, et al. Data Mining: Concepts and Techniques (数据挖掘概念与技术) [M]. Beijing: China Machine Press, 2001.
  • 6 Lance Parsons, Ehtesham Haque, Huan Liu. Evaluating subspace clustering algorithms [A]. In: Workshop on Clustering High Dimensional Data and its Applications, SIAM Int. Conf. on Data Mining [C]. 2004: 48-56.
  • 7 John Allen. Bridging Microarray Platforms To Extend the Utility of Gene Expression Profile [D]. Science School of Informatics, University of Edinburgh, 2004.
  • 8 Steinbach M, Karypis G, Kumar V. A Comparison of Document Clustering Techniques [A]. Technical Report #00-034, Department of Computer Science and Engineering, University of Minnesota, 2000.
  • 9 Zhao Y, Karypis G. Criterion Functions for Document Clustering: Experiments and Analysis [A]. Technical Report #01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001.
