
Jump Filter: Dynamic Sketch Design for Big Data Governance
Cited by: 2
Abstract With the rapid development of information technology, data volume keeps growing exponentially while the value in that data remains hard to mine. This poses significant challenges to the efficient management and control of every stage of the data life cycle, such as collection, cleaning, storage, and sharing. Sketches use hash tables, matrices, or bit vectors to track core characteristics of data such as frequency, cardinality, and membership. A sketch thereby becomes metadata in its own right, and sketches are widely used in sharing, transmission, update, and other scenarios. The fast-flowing nature of big data has in turn given rise to dynamic sketches. Existing dynamic sketches dynamically maintain a list of probabilistic data structures organized as a chain or tree, which lets their capacity expand or shrink with the size of the data stream. However, they suffer from excessive space overhead and from time overhead that grows with the cardinality of the dataset. Based on jump consistent hash, this study designs a dynamic sketch for big data governance. The method simultaneously achieves space overhead that grows linearly with the dataset cardinality and constant time overhead for data processing and analysis, effectively supporting a variety of demanding big data processing and analysis tasks. Comparative experiments against traditional methods on multiple synthetic and real-world datasets verify the effectiveness and efficiency of the proposed method.
Authors FU Peng-Tao (符鹏涛); LUO Lai-Long (罗来龙); GUO De-Ke (郭得科); ZHAO Xiang (赵翔); LI Shang-Sen (李尚森); WANG Huai-Min (王怀民) (College of Systems Engineering, National University of Defense Technology, Changsha 410073, China; College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
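Source Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2023, No. 3, pp. 1193-1212 (20 pages)
Funding National Natural Science Foundation of China (U19B2024, 62002378, 61772544); Research Fund of the National University of Defense Technology (ZK20-30).
Keywords big data; big data governance; metadata; dynamic sketch; probabilistic data structure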
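
The abstract builds on jump consistent hash, so a brief illustration of that primitive may help. The sketch below is a Python port of the publicly known jump consistent hash routine; it is not the Jump Filter design itself, whose internals the abstract does not describe. The helper `moved_fraction` and the idea of treating each bucket as one sub-structure of a growing sketch are assumptions made here purely for illustration.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound in Python


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket index in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # Linear congruential step used as a per-key pseudo-random generator.
        key = (key * 2862933555777941757 + 1) & MASK64
        # Jump forward to the next bucket count at which this key would move.
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def moved_fraction(num_keys: int, old_buckets: int, new_buckets: int) -> float:
    """Hypothetical helper: share of keys 0..num_keys-1 that change bucket."""
    moved = sum(
        1
        for k in range(num_keys)
        if jump_consistent_hash(k, old_buckets) != jump_consistent_hash(k, new_buckets)
    )
    return moved / num_keys


if __name__ == "__main__":
    # Growing from 10 to 11 buckets should relocate only a small share of keys,
    # and every relocated key lands in the newly added bucket (index 10).
    print(moved_fraction(10_000, 10, 11))
```

The property that matters for a dynamic structure is that a key either keeps its bucket or moves to the newly added one when the bucket count grows by one, and that the lookup needs no stored table. How the paper turns this into linear space and constant-time operations is left to the full text.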