Flexisample: A Personalized Approximate Aggregate Query System

Abstract: Interactive query analysis over big data places strict demands on query latency. Approximate computing services based on sampling trade a degree of accuracy for lower latency, which makes them broadly applicable to approximate query analysis over big data. Flexisample, the system described in this paper, is a sampling-based personalized approximate aggregate query system. It parses and rewrites incoming query requests and applies a logical sample combination strategy so that it can serve personalized multidimensional aggregate queries. To support diverse personalized aggregate queries while preserving a degree of accuracy, Flexisample maintains a set of stratified samples with an optimized design. To extend the samples' coverage along the time dimension, the system maintains and updates the stratified samples from online data streams. The system was applied to aggregate queries over power quality data. The results show that, for multiple personalized aggregate query requests under query latency constraints, the system meets business users' personalized query needs while effectively reducing query latency: with time consumption below 7% of that of a full-data query, query accuracy exceeds 88% in every stratum, and sample storage space is reduced by 87.5% compared with direct storage.
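The abstract does not give Flexisample's algorithms, so the following is only a minimal sketch of the general idea it names: stratified samples that are kept up to date from a data stream and then used to answer aggregate queries approximately. It assumes per-stratum reservoir sampling (Algorithm R) for maintenance and a scaled sample mean for estimation. Neither choice is confirmed by the source, and the class name `StratifiedSample`, its methods, and the station keys are hypothetical.

```python
import random
from collections import defaultdict


class StratifiedSample:
    """Per-stratum reservoir samples maintained over a stream.

    Illustrative only: this is not Flexisample's actual design.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity          # maximum rows kept per stratum
        self.rng = random.Random(seed)    # seeded for reproducibility
        self.rows = defaultdict(list)     # stratum key -> sampled values
        self.seen = defaultdict(int)      # stratum key -> rows seen so far

    def add(self, stratum, value):
        """Reservoir sampling (Algorithm R): every row seen so far in a
        stratum is retained with equal probability."""
        self.seen[stratum] += 1
        bucket = self.rows[stratum]
        if len(bucket) < self.capacity:
            bucket.append(value)
        else:
            j = self.rng.randrange(self.seen[stratum])
            if j < self.capacity:
                bucket[j] = value

    def estimate_sum(self, strata=None):
        """Estimate SUM(value) over the chosen strata by scaling each
        stratum's sample mean by the number of rows seen in that stratum."""
        keys = list(self.rows.keys()) if strata is None else list(strata)
        total = 0.0
        for key in keys:
            bucket = self.rows.get(key)
            if bucket:
                total += self.seen[key] * (sum(bucket) / len(bucket))
        return total

    def estimate_avg(self, strata=None):
        """Estimate AVG(value) over the chosen strata."""
        keys = list(self.rows.keys()) if strata is None else list(strata)
        n = sum(self.seen[key] for key in keys if key in self.rows)
        return self.estimate_sum(keys) / n if n else float("nan")


if __name__ == "__main__":
    sample = StratifiedSample(capacity=100, seed=42)
    # Hypothetical stream: two monitoring stations reporting voltage values.
    for _ in range(10_000):
        sample.add("station_A", 220.0)
        sample.add("station_B", 110.0)
    print(sample.estimate_avg(["station_A"]))
    print(sample.estimate_sum())
```

Combining strata at query time, as `estimate_sum` does when given a subset of keys, is one simple way to serve differently filtered aggregate queries from the same stored samples. With non-constant data the estimates carry sampling error that shrinks as `capacity` grows.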

Authors: ZHAO Bo; ZUO Changqi; FANG Jun (School of Information, North China University of Technology, Beijing 100144; Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data, Beijing 100144)

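Affiliations: School of Information, North China University of Technology; Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data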

Source: Computer & Digital Engineering, 2021, No. 12, pp. 2431-2436 (6 pages)

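Funding: Supported by the International (Regional) Cooperation and Exchange Program of the National Natural Science Foundation of China (No. 62061136006) and the National Key R&D Program of China (No. 2018YFB1402500).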

Keywords: approximate computing; aggregate query; stratified sampling; sample maintenance

Classification code: C931.6 [Economics and Management: Management Science]
