期刊文献+

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets 被引量:1

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets
原文传递
导出
摘要 The distance-based outlier is a widely used definition of outlier. A point is distinguished as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computing in distributed environments, DBOZ, a distributed algorithm for distance-based outlier detection using Z-curve hierarchical tree (ZH-tree) is proposed. First, we propose a new index, ZH-tree, to effectively manage the data in a distributed environment. ZH-tree has two desirable advantages, including clustering property to help search the neighbors of a point, and hierarchical structure to support space pruning. We also design a bottom-up approach to build ZH-tree in parallel, whose time complexity is linear to the number of dimensions and the size of dataset. Second, DBOZ is proposed to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all the points, we design a greedy method and a new ZH-tree based k-nearest neighbor searching algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW, and then outputs the final outliers through refining the remaining points. At last, the efficiency and the effectiveness of ZH-tree and DBOZ are testified through a series of experiments. The distance-based outlier is a widely used definition of outlier. A point is distinguished as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computing in distributed environments, DBOZ, a distributed algorithm for distance-based outlier detection using Z-curve hierarchical tree (ZH-tree) is proposed. First, we propose a new index, ZH-tree, to effectively manage the data in a distributed environment. ZH-tree has two desirable advantages, including clustering property to help search the neighbors of a point, and hierarchical structure to support space pruning. We also design a bottom-up approach to build ZH-tree in parallel, whose time complexity is linear to the number of dimensions and the size of dataset. Second, DBOZ is proposed to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all the points, we design a greedy method and a new ZH-tree based k-nearest neighbor searching algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW, and then outputs the final outliers through refining the remaining points. At last, the efficiency and the effectiveness of ZH-tree and DBOZ are testified through a series of experiments.
出处 《Journal of Computer Science & Technology》 SCIE EI CSCD 2015年第6期1233-1248,共16页 计算机科学技术学报(英文版)
基金 This work was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N120816001.
关键词 outlier detection MULTI-DIMENSIONAL DISTRIBUTED large dataset outlier detection, multi-dimensional, distributed, large dataset
  • 相关文献

参考文献27

  • 1Hawkins D M. Identification of Outliers. Springer, 1980. 被引量:1
  • 2Barnett V, Lewis T. Outliers in Statistical Data. Wiley New York, 1994. 被引量:1
  • 3Rousseeuw P J, Leroy A M. Robust Regression and Outlier Detection. John Wiley & Sons, 2003. 被引量:1
  • 4Knorr E M, Ng R T. Algorithms for mining distancebased outliers in large datasets. In Proc. the 24th International Conference on Very Large Data Bases, August 1998, pp.392-403. 被引量:1
  • 5Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record, 2000, 29(2): 427-438. 被引量:1
  • 6Angiulli F, Pizzuti C. Outlier mining in large highdimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(2): 203-215. 被引量:1
  • 7Angiulli F, Pizzuti C. Fast outlier detection in high dimensional spaces. In Proc. the 6th European Conference on Principles of Data Mining and Knowledge Discovery, August 2002, pp.15-27. 被引量:1
  • 8Schubert E, Zimek A, Kriegel H P. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery, 2014, 28(1): 190-237. 被引量:1
  • 9Shiokawa H, Fujiwara Y, Onizuka M. Scan+-]: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proceedings of the VLDB Endowment, 2015, 8(11): 1178-1189. 被引量:1
  • 10Aggarwal C C, Yu P S. Outlier detection for high dimensional data. ACM SIGMOD Record, 2001, 30(2): 37-46. 被引量:1

同被引文献4

引证文献1

二级引证文献2

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部