An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Abstract: The distance-based outlier is a widely used definition of an outlier: a point is identified as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier detection in distributed environments, we propose DBOZ, a distributed algorithm for distance-based outlier detection that uses a Z-curve hierarchical tree (ZH-tree). First, we propose a new index, the ZH-tree, to manage data effectively in a distributed environment. The ZH-tree has two desirable properties: a clustering property that helps search for the neighbors of a point, and a hierarchical structure that supports space pruning. We also design a bottom-up approach to build the ZH-tree in parallel, whose time complexity is linear in the number of dimensions and the size of the dataset. Second, we propose DBOZ to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all points, we design a greedy method and a new ZH-tree-based k-nearest-neighbor search algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out unpromising points using LW and then outputs the final outliers by refining the remaining points. Finally, the efficiency and effectiveness of the ZH-tree and DBOZ are verified through a series of experiments.
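The two ideas named in the abstract, ordering points along a Z-curve and filtering with a threshold before refining, can be illustrated with a small single-machine sketch. This is not DBOZ or the ZH-tree itself. It assumes (these details are not given in the abstract) that a point's outlier score is the sum of distances to its k nearest neighbors, that the n highest-scoring points are wanted, and that coordinates are non-negative integers. The names `z_value`, `knn_weight`, `top_n_outliers` and the `window` parameter are hypothetical, and the threshold here only plays a role similar to LW.

```python
import heapq
import math


def z_value(point, bits=8):
    """Morton (Z-order) key: interleave the low `bits` bits of each coordinate."""
    key = 0
    dims = len(point)
    for b in range(bits):
        for i, c in enumerate(point):
            key |= ((c >> b) & 1) << (b * dims + i)
    return key


def knn_weight(i, points, k):
    """Exact score (assumed definition): sum of distances to the k nearest other points."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return sum(dists[:k])


def top_n_outliers(points, k, n, window=4, bits=8):
    """Return (indices of the n highest-scoring points, threshold used for filtering)."""
    if window < k or len(points) <= window:
        raise ValueError("need k <= window < len(points)")
    # Sort indices along the Z-curve; neighbours in this order tend to be close in space.
    order = sorted(range(len(points)), key=lambda i: z_value(points[i], bits))
    # Cheap upper bound on each score, using only nearby points in Z-order.
    upper = {}
    for pos, i in enumerate(order):
        near = order[max(0, pos - window):pos] + order[pos + 1:pos + 1 + window]
        d = sorted(math.dist(points[i], points[j]) for j in near)
        upper[i] = sum(d[:k])
    # Greedy step: score the n most promising points exactly; the smallest of
    # these exact scores is a lower bound on the n-th largest score overall.
    seeds = heapq.nlargest(n, upper, key=upper.get)
    exact = {i: knn_weight(i, points, k) for i in seeds}
    threshold = min(exact.values())
    # Filter: a point whose upper bound is below the threshold cannot be in the top n.
    candidates = [i for i in upper if upper[i] >= threshold]
    # Refine: compute exact scores only for the surviving candidates.
    for i in candidates:
        if i not in exact:
            exact[i] = knn_weight(i, points, k)
    return heapq.nlargest(n, exact, key=exact.get), threshold


points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (2, 3), (200, 200)]
outliers, threshold = top_n_outliers(points, k=2, n=1)
print(outliers, threshold)
```

The upper bound is valid because the k smallest distances within any subset of at least k other points can only be larger than or equal to the true k nearest-neighbor distances. A tighter bound prunes more points in the filter step, which is where an index with good clustering behavior would help.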

Authors: Xite Wang (王习特), Derong Shen (申德荣), Mei Bai (白梅), Tiezheng Nie (聂铁铮), Yue Kou (寇月), Ge Yu (于戈)

Affiliation: College of Information Science and Engineering

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, CSCD; 2015, No. 6, pp. 1233-1248 (16 pages)

Funding: This work was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N120816001.

Keywords: outlier detection, multi-dimensional, distributed, large dataset

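Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]; TP393.4 [Automation and Computer Technology: Computer Science and Technology]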
