期刊文献+

基于数据流的k-近邻连接算法 被引量:3

Algorithm for k-Nearest Neighbor Join Based on Data Stream
下载PDF
导出
摘要 k-近邻连接查询是空间数据库中一种常用的操作,该查询处理过程涉及连接和最近邻查询两个复杂操作。传统的集中式k-近邻连接查询算法已不能适应当前呈爆炸式增长的数据规模,设计分布式k-近邻连接查询算法成为了目前亟需解决的问题。现有的分布式k-近邻连接查询算法都包括了多轮串行的MapReduce任务,而每个MapReduce任务均需要读写分布式文件系统,导致MapReduce不能有效表达多个任务之间的依赖关系,因此算法效率低下。首先提出了一种基于数据流的计算框架,该框架建立在MapReduce之上,将数据处理过程按照数据流图建模。在该框架基础上,提出了一种高效的k-近邻连接算法,它利用空间填充曲线将多维数据映射为一维数据,从而将k-近邻连接查询转化为一维范围查询。实验结果表明,该算法的可扩展性较高,且效率比现有算法更优。 kNN join is a frequently used operation in spatial database. It involves both the join and the NN search. Data scale is exploding,and traditional centralized algorithm can not meet the requirements. It is an urgent problem to design distributed kNN join algorithm currently, Existing distributed algorithms include several rounds of serial MapReduce tasks, but each MapReduce task reads and writes data from distributed file system. It is inefficient when expressing dependencies between jobs, and these algorithms are inefficient. Firstly, we proposed a framework based on data stream on MapReduce. This framework models data handle process according to the data flow diagram, and we proposed an efficient kNN join algorithm on the framework. The algorithm maps multi-dimensional data sets into one dimension using space-filling curves (z-values), and transforms kNN joins into a sequence of one-dimensional range searches. Experimental results demonstrate that the algorithm can efficiently resolve the large scale kNN join spatial query.
出处 《计算机科学》 CSCD 北大核心 2015年第5期204-210,共7页 Computer Science
基金 国家自然科学基金项目(61373015 61300052) 国家教育部高等学校博士学科点专项科研基金资助项目(20103218110017) 江苏高校优势学科建设工程资助项目(PAPD) 中央高校基本科研业务费专项项目(NP2013307) 云计算-南航-大数据处理引擎技术研究项目资助
关键词 k-近邻连接 数据流 MAPREDUCE 计算框架 kNN join, Data stream, MapReduce, Framework
  • 相关文献

参考文献18

  • 1Bohm C,Krebs F.The k-nearest neighbour join:Turbo charging the KDD process[J].Knowledge and Information Systems,2004,6(6):728-749. 被引量:1
  • 2Kavraki L E,Plaku E.Distributed Computation of the knnGraph for Large High-Dimensional Point Sets[J].Parallel Distributed Computation,2007,7(3):346-359. 被引量:1
  • 3Sardana D,Bhatnagar R.Graph Clustering Using Mutual K-Nearest Neighbors[M]∥Active Media Technology.Springer International Publishing,2014:35-48. 被引量:1
  • 4Xia C,Lu H,Ooi B C,et al.Gorder:an efficient method for KNN join processing[C]∥Proceedings of the Thirtieth international conference on VLDB Endowment.2004:756-767. 被引量:1
  • 5Yu C,Cui B,Wang S,et al.Efficient index-based KNN join processing for high-dimensional data[J].Information and Software Technology,2007,49(4):332-344. 被引量:1
  • 6Cheung K L,Fu A W C.Enhanced nearest neighbour search on the R-tree[J].ACM SIGMOD Record,1998,27(3):16-21. 被引量:1
  • 7Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[J].Communications of the ACM,2008,51(1):107-113. 被引量:1
  • 8Zhang C,Li F,Jestes J.Efficient parallel kNN joins for large data in MapReduce[C]∥Proceedings of the 15th International Conference on Extending Database Technology.ACM,2012:38-49. 被引量:1
  • 9Lu W,Shen Y,Chen S,et al.Efficient processing of k nearest neighbor joins using mapreduce[J].Proceedings of the VLDB Endowment,2012,5(10):1016-1027. 被引量:1
  • 10刘义,景宁,陈荦,熊伟.MapReduce框架下基于R-树的k-近邻连接算法[J].软件学报,2013,24(8):1836-1851. 被引量:60

二级参考文献10

  • 1Bohm C, Krebs F. The k-nearest neighbor join: Turbo charging the KDD process. Knowledge Information System, 2004,6(6): 728-749. [doi: 10.1007/s10115-003-0122-9]. 被引量:1
  • 2Xia CY, Lu HJ, Coi BC, Hu J. Gorder: An efficient method for KDD joins processing. In: Proc. of the 30th Int'l Conf. on Very Large Data Bases (VLDB). 2004. 756-767. 被引量:1
  • 3Yao B, Li FF, Kumar P. K nearest neighbor queries and KNN-joins in large relational databases (almost) for free. In: Proc. of the 26th Int'l Conf. on Data Engineering (ICDE). 2010.4-15. [doi: 10.1109/ICDE.2010.5447837]. 被引量:1
  • 4Yu C, Cui B, Wang SG, Su JW. Efficient index-based KNN join processing for high-dimensional data. Information and Software Technology, 2007,49(4):332-344. [doi: 10.1016/j.infsof.2006.05.006]. 被引量:1
  • 5Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008,51(1):107-113 [doi: 10.1145/1327452.1327492]. 被引量:1
  • 6White T. Hadoop: The Definitive Guide. Sebastopol: Yahoo! Press, 2009. 被引量:1
  • 7Zhang C, Li FF, Jestes J. Efficient parallel kNN joins for large data in MapReduce. In: Proc. of the 15th Int'l Conf. on Extending Database Technology (EDBT). 2012.38-49. [doi: 10.1145/2247596.2247602]. 被引量:1
  • 8Lu W, Shen YY, Chen S, Col BC. Efficient processing of k nearest neighbor joins using MapReduce. In: Proc. of the 38th lnt'l Conf. on Very Large Data Bases (VLDB). 2012. 1016-1027. 被引量:1
  • 9Liu Y, Jing N, Chen L, Chen HZ. Parallel bulk-loading of spatial data with MapReduce: An R4ree case. Wuhan University Journal of Natural Sciences, 2011,16(6):513-519. [doi: 10.1007/s11859-011-0790-3]. 被引量:1
  • 10Tao YF, Papadias D. Range aggregate processing in spatial databases. IEEE Trans. on Knowledge and Data Engineering, 2004, 16(12):1555-1570. [doi: 10.1109/TKDE.2004.93]. 被引量:1

共引文献59

同被引文献15

引证文献3

二级引证文献8

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部