摘要
k-近邻连接查询是空间数据库中一种常用的操作,该查询处理过程涉及连接和最近邻查询两个复杂操作。传统的集中式k-近邻连接查询算法已不能适应当前呈爆炸式增长的数据规模,设计分布式k-近邻连接查询算法成为了目前亟需解决的问题。现有的分布式k-近邻连接查询算法都包括了多轮串行的MapReduce任务,而每个MapReduce任务均需要读写分布式文件系统,导致MapReduce不能有效表达多个任务之间的依赖关系,因此算法效率低下。首先提出了一种基于数据流的计算框架,该框架建立在MapReduce之上,将数据处理过程按照数据流图建模。在该框架基础上,提出了一种高效的k-近邻连接算法,它利用空间填充曲线将多维数据映射为一维数据,从而将k-近邻连接查询转化为一维范围查询。实验结果表明,该算法的可扩展性较高,且效率比现有算法更优。
kNN join is a frequently used operation in spatial database. It involves both the join and the NN search. Data scale is exploding,and traditional centralized algorithm can not meet the requirements. It is an urgent problem to design distributed kNN join algorithm currently, Existing distributed algorithms include several rounds of serial MapReduce tasks, but each MapReduce task reads and writes data from distributed file system. It is inefficient when expressing dependencies between jobs, and these algorithms are inefficient. Firstly, we proposed a framework based on data stream on MapReduce. This framework models data handle process according to the data flow diagram, and we proposed an efficient kNN join algorithm on the framework. The algorithm maps multi-dimensional data sets into one dimension using space-filling curves (z-values), and transforms kNN joins into a sequence of one-dimensional range searches. Experimental results demonstrate that the algorithm can efficiently resolve the large scale kNN join spatial query.
出处
《计算机科学》
CSCD
北大核心
2015年第5期204-210,共7页
Computer Science
基金
国家自然科学基金项目(61373015
61300052)
国家教育部高等学校博士学科点专项科研基金资助项目(20103218110017)
江苏高校优势学科建设工程资助项目(PAPD)
中央高校基本科研业务费专项项目(NP2013307)
云计算-南航-大数据处理引擎技术研究项目资助