期刊文献+

基于快速地标采样的大规模谱聚类算法 被引量:10

Large Scale Spectral Clustering Based on Fast Landmark Sampling
下载PDF
导出
摘要 为避免传统谱聚类算法高复杂度的应用局限,基于地标表示的谱聚类算法利用地标点与数据集各点间的相似度矩阵,有效降低了谱嵌入的计算复杂度。在大数据集情况下,现有的随机抽取地标点的方法会影响聚类结果的稳定性,k均值中心点方法面临收敛时间未知、反复读取数据的问题。该文将近似奇异值分解应用于基于地标点的谱聚类,设计了一种快速地标点采样算法。该算法利用由近似奇异向量矩阵行向量的长度计算的抽样概率来进行抽样,同随机抽样策略相比,保证了聚类结果的稳定性和精度,同k均值中心点策略相比降低了算法复杂度。同时从理论上分析了抽样结果对原始数据的信息保持性,并对算法的性能进行了实验验证。 The applicability of traditional spectral clustering is limited by its high complexity in large-scale data sets. Through construction of affinity matrix between landmark points and data points, the Landmark-based Spectral Clustering (LSC) algorithm can significantly reduce the computational complexity of spectral embedding. It is vital for clustering results to apply the suitable strategies of the generation of landmark points. While considering big data problems, the existing generation strategies of landmark points face some deficiencies: the unstable results of random sampling, along with the unknown convergence time and the repeatability of data reading in k-means centers method. In this paper, a rapid landmark-sampling spectral clustering algorithm based on the approximate singular value decomposition is designed, which makes the sampling probability of each landmark point decided by the row norm of the approximate singular vector matrix. Compared with LSC algorithm based on random sampling, the clustering result of new algorithm is more stable and accurate; compared with LSC algorithm based on k-means centers, the new algorithm reduces the computational complexity. Moreover, the preservation of information in original data is analyzed for the landmark-sampling results theoretically. At the same time, the performance of new approach is verified by the experiments in some public data sets.
作者 叶茂 刘文芬 YE Mao LIU Wenfen(PLA Information Engineering University, Zhengzhou 450002, China State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China)
出处 《电子与信息学报》 EI CSCD 北大核心 2017年第2期278-284,共7页 Journal of Electronics & Information Technology
基金 国家973计划(2012CB315905) 国家自然科学基金(61502527 61379150)~~
关键词 地标点采样 大数据 谱聚类 近似奇异值分解 Landmark sampling Big data Spectral clustering Approximate singular value decomposition
  • 相关文献

参考文献2

二级参考文献102

  • 1Labrinidis A, Jagadish H V. Challenges and Opportunities with Big Data. Proc of the VLDB Endowment, 2012, 5(12) : 2032-2033. 被引量:1
  • 2Bizer C, Boncz P, Brodie M L, et al. The Meaningful Use of Big Data : Four Perspectives-Four Challenges. ACM SIGMOD Record, 2012, 40(4) : 56-60. 被引量:1
  • 3Wang F Y. A Big-Data Perspective on AI: Newton, Merton, and An- alytics Intelligence. IEEE Intelligent Systems, 2012, 27 (5) : 2-4. 被引量:1
  • 4Simon H A. Why Should Machines Learn?//Michalski R S, Car- bonell J G, Mitchell T M, et al. , eds. Machine Learning: An Arti- ficial Intelligence Approach. Berlin, Germany: Springer, 1983: 25 -37. 被引量:1
  • 5Hart P. The Condensed Nearest Neighbor Rule. IEEE Trans on In- formation Theory, 1968, 14(3) : 515-516. 被引量:1
  • 6Gates G. The Reduced Nearest Neighbor Rule. IEEE Trans on In- formation Theory, 1972, 18(3) : 431-433. 被引量:1
  • 7Brighton H, Mellish C. Advances in Instance Selection for Instance- Based Learning Algorithms. Data Mining and Knowledge Discovery, 2002, 6(2) : 153-172. 被引量:1
  • 8Li Y H, Maguire L. Selecting Critical Patterns Based on Local Geo- metrical and Statistical Information. IEEE Trans on Pattern Analysis and Machine Intelligence, 2011, 33(6) : 1189-1201. 被引量:1
  • 9Angiulli F. Fast Nearest Neighbor Condensation for Large Data Sets Classification. IEEE Trans on Knowledge and Data Engineering, 2007, 19(11): 1450-1464. 被引量:1
  • 10Angiulli F, Folino G. Distributed Nearest Neighbor-Based Conden- sation of Very Large Data Sets. IEEE Trans on Knowledge and Da- ta Engineering, 2007, 19(12): 1593-1606. 被引量:1

共引文献354

同被引文献46

引证文献10

二级引证文献9

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部