摘要
取样是一种通用有效的近似技术,利用取样技术进行近似聚集查询处理是决策支持系统和数据挖掘实现技术中的常用方法.如何正确有效地给出近似查询结果并最小化近似查询误差是近似查询处理的关键和目标.在深入研究近似聚集查询取样方法的基础上,本文提出了一个有误差确界且只需单遍扫描数据集的离群分治取样Outlier-DivideConquer算法,该算法在聚集属性内部存在高方差分布时能克服随机均匀取样局限,可显著降低近似查询误差,且执行效率优于同类算法.最后通过与传统均匀取样算法的实验比较验证了Outlier-DivideConquer算法的有效性和正确性.
Sampling is an efficient and most widely-used approximation technique.The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support system and data mining tools.We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is high skewed.The distribution of high skewed data or the large variance in the aggregate column is primarily due to the presence of certain outliers or deviants in the data.To address this issue,we introduce an optimized uniform sampling technique called Outlier-DivideConquer that tries to overcome limitations of mere uniform sampling.The method presented by us belong to precomputed query processing scheme and it is worked out based on deep and extensive studies for sampling techniques applied to approximate aggregation queries.The central idea of Outlier-DivideConquer is adopting "divide and conquer" approach to deal with outliers separately.For this purpose,we identify the tuples with outlier values and store them in a separate sub-relation,and random uniform sample from the rest of the relation.In details,the scheme of Outlier-DivideConquer consists of two steps as follows: Separating outliers and Query processing.The Outlier-Divide algorithm included in separating outliers step is the very core of our scheme.The main characteristics of Outlier-Divide algorithm are briefly described as follows:(1) Outlier-Divide algorithm is a one pass and error guarantees algorithm,and(2) unlike outlier-indexes,which is a comparable algorithm with our Outlier-DivideConquer technique,in our algorithm,there is only a single scan of the data and is not necessary to sorting the aggregated column of entire tuples of the relation,so it has lower time complexity than outlier-indexes and can be naturally extended to approximate query processing of streaming data.Moreover,query processing step consists of three steps as follows: Aggregate outliers,Aggregate non-outliers,and Combine aggregates.A more
出处
《南京大学学报(自然科学版)》
CAS
CSCD
北大核心
2011年第5期524-531,共8页
Journal of Nanjing University(Natural Science)
基金
国家自然科学基金(60873176
61073059)
福建省教育厅科技项目(JA08161)
关键词
数据挖掘
决策支持
近似聚集查询
均匀取样
离群分治
data mining
decision support
approximate aggregation queries
uniform sampling
outliers divide-and-conquer