期刊文献+

Outlier-DivideConquer:近似聚集查询中离群分治取样算法 被引量:1

An outliers divide-and-conquers sampling algorithm for approximate aggregation queries
下载PDF
导出
摘要 取样是一种通用有效的近似技术,利用取样技术进行近似聚集查询处理是决策支持系统和数据挖掘实现技术中的常用方法.如何正确有效地给出近似查询结果并最小化近似查询误差是近似查询处理的关键和目标.在深入研究近似聚集查询取样方法的基础上,本文提出了一个有误差确界且只需单遍扫描数据集的离群分治取样Outlier-DivideConquer算法,该算法在聚集属性内部存在高方差分布时能克服随机均匀取样局限,可显著降低近似查询误差,且执行效率优于同类算法.最后通过与传统均匀取样算法的实验比较验证了Outlier-DivideConquer算法的有效性和正确性. Sampling is an efficient and most widely-used approximation technique.The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support system and data mining tools.We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is high skewed.The distribution of high skewed data or the large variance in the aggregate column is primarily due to the presence of certain outliers or deviants in the data.To address this issue,we introduce an optimized uniform sampling technique called Outlier-DivideConquer that tries to overcome limitations of mere uniform sampling.The method presented by us belong to precomputed query processing scheme and it is worked out based on deep and extensive studies for sampling techniques applied to approximate aggregation queries.The central idea of Outlier-DivideConquer is adopting "divide and conquer" approach to deal with outliers separately.For this purpose,we identify the tuples with outlier values and store them in a separate sub-relation,and random uniform sample from the rest of the relation.In details,the scheme of Outlier-DivideConquer consists of two steps as follows: Separating outliers and Query processing.The Outlier-Divide algorithm included in separating outliers step is the very core of our scheme.The main characteristics of Outlier-Divide algorithm are briefly described as follows:(1) Outlier-Divide algorithm is a one pass and error guarantees algorithm,and(2) unlike outlier-indexes,which is a comparable algorithm with our Outlier-DivideConquer technique,in our algorithm,there is only a single scan of the data and is not necessary to sorting the aggregated column of entire tuples of the relation,so it has lower time complexity than outlier-indexes and can be naturally extended to approximate query processing of streaming data.Moreover,query processing step consists of three steps as follows: Aggregate outliers,Aggregate non-outliers,and Combine aggregates.A more
出处 《南京大学学报(自然科学版)》 CAS CSCD 北大核心 2011年第5期524-531,共8页 Journal of Nanjing University(Natural Science)
基金 国家自然科学基金(60873176 61073059) 福建省教育厅科技项目(JA08161)
关键词 数据挖掘 决策支持 近似聚集查询 均匀取样 离群分治 data mining decision support approximate aggregation queries uniform sampling outliers divide-and-conquer
  • 相关文献

参考文献24

  • 1Vitter J S. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 1985, 11 (1) :37-57. 被引量:1
  • 2Brown P G, Haas P J. Techniques for warehousing of sample data. Proceedings of the 22^rd ICDE: IEEE Computer Society, Washington DC, USA, 2006, 6. 被引量:1
  • 3胡文瑜,孙志挥,吴英杰.数据挖掘取样方法研究[J].计算机研究与发展,2011,48(1):45-54. 被引量:54
  • 4Chaudhuri S, Das G, Narasayy A V. Optimized stratified sampling for approximate query processing. ACM Transactions on Datatmse Systems. New York: ACM, 2007, 32, 2(9): 50. 被引量:1
  • 5Hellerstein J M, Haas P J, Wang H J. Online aggregation. Proceedings of the ACM SIGMOD Conference, 1997, 26(2): 171- 182. 被引量:1
  • 6Acharya S, Gibbons P B, Poosala V, el al. Join Synopses for approximate query answering. Proceedings of the ACM SIGMOD Conference, New York: ACM,1999, 275-286. 被引量:1
  • 7Olken F. Random sampling from database. Ph. D thesis. U.C. Berkeley of California, 1993. 被引量:1
  • 8Hawkins D M. Identification of outliers. London: Chapman and Hall, 1980,188. 被引量:1
  • 9Chaudhuri S, Das G, Datar M, et al. Overcoruing limitations of sampling for aggregation queries. Proceedings of ICDE, Heidelberg, Germany, 2001, 534-542. 被引量:1
  • 10Haas P J, Hellerstein J M. Ripple joins for online aggregation. Proceedings of the ACM SIGMOD Conference, New York: ACM,1999, 287 -298. 被引量:1

二级参考文献61

共引文献61

同被引文献14

  • 1Acharya S, Gibbons P B, V Poosala. Congressional samples for approximate answering of group-by queries[C] //Proc of the ACM SIGMOD on Management of Data, 2000: 487-498. 被引量:1
  • 2Cochran W G. Sampling Techniques [M]. Third edition. New York: John Wiley 8z Sons, 1977. 被引量:1
  • 3Kun-T Chuang, Hung-L Chen, Ming-S Chen. Feature-preserved sampling over streaming data[J]. ACM TKDD, 2009, 2(4): 1-45. 被引量:1
  • 4Braverman V, Ostrovsky R, Zaniolo C. Optimal sampling from sliding windows[C]//Proc of the 28th ACM SIGMOD-SIGACT-SIGART Symp on Principles of database systems, 2009, 147-156. 被引量:1
  • 5Mark Last. Improving data mining utility with projective sampling [C]//Proc of the 15th ACM SIGKDD intl conf on KDD, Paris, France, 2009, 487=496. 被引量:1
  • 6Ko|lios G, Gunopoulos D, Koudas N, et al. Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets [J]. IEEE Trans on Knowledge and Data Engineering, 2003; 15(5): 1170-1187. 被引量:1
  • 7Chaudhuri S, Das G, NARASAYY A V. Optimized stratified sampling for approximate query pro- cessing [J]. ACM Trans on Database Systems, 2007, 2(9): 32. 被引量:1
  • 8Chaudhuri S, Das G, Narasayya V, A Robust, Optimization-based approach for approximate an- swering of aggregate queries[C]//Proc of ACM SIGMOD, 2001, 295-306. 被引量:1
  • 9Acharya S, Gibbons P B, Poosala V, et al. Join synopses for approximate query answering[C]//In Procs of the ACM SIGMOD Conference, 1999: 275-286. 被引量:1
  • 10Hawkins D M. Identification of Outliers [M]. London: Chapman and Hall, 1980-188. 被引量:1

引证文献1

二级引证文献3

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部