摘要
提出一种在网格环境下的k近邻查询方法——GkNN.到目前为止,尚未有文献提出数据网格环境下的k近邻查询算法.当用户在查询节点提交一个查询向量和k,首先以一个较小的查询半径,在数据节点进行基于双重距离尺度的向量缩减,然后将缩减后的向量按照向量“打包”传输的方式发送到执行节点,在执行节点并行地对这些候选向量进行距离(求精)运算.最终将结果向量返回到查询节点.当返回的向量个数小于k时,扩大半径值,继续循环直到得到k个最近邻向量为止.理论分析和实验证明该方法在减少网络通信开销、增加I/O和CPU并行、降低响应时间方面具有较好的性能,非常适合海量高维数据的查询.
Proposed in this paper is a novel k-nearest neighbor query algorithm based on data grid, called the GkNN. Three steps are made in the GkNN. First, when user submits a query vector and k, the vector reduction is performed using DDM index. Then the candidate vectors are transferred to the execution nodes by using vector package technique. Furthermore, the refinement process is conducted in parallelism to get the answer set of the candidate vectors. Finally, the answer set is transferred to the query node. The proposed algorithm uses vector reduction algorithm, vector package technique and pipelined parallelism to solve the problem of heterogeneity of network bandwidth between nodes on the data grid. The analysis and experimental results show that the performance of the algorithm is good in minimizing the response time by decreasing network transmission cost and increasing parallelism of I/O and CPU.
出处
《计算机研究与发展》
EI
CSCD
北大核心
2006年第11期1876-1885,共10页
Journal of Computer Research and Development
基金
国家杰出青年基金项目(60525108)
国家自然科学基金重点项目(60533090)
国家"九七三"重点基础研究发展规划基金项目(2002CB312101)
浙江省科技计划项目重大科技基金项目(2005C13032)
浙江省科技计划项目重大科技攻关基金项目(2005C11001-05)
高等学校中英文图书数字化国际合作计划基金(http://www.cadal.zju.edu.cn)
关键词
K近邻查询
类超球
超球交
数据网格
k-nearest neighbor query
cluster hypersphere
hypersphere intersection
data grid