摘要
数据立方体和频繁项集挖掘分别是数据仓库和数据挖掘领域的重要技术,已开展了大量的相关研究工作,取得了较好的进展.数据立方体和频繁项集挖掘依据各自的数据单元和项集构造了类似的代数格(Lattice)结构;数据立方体的等价类上界单元与频繁项集挖掘的闭项集也是相对应的.如果能够论证二者的统一性,则可以为彼此提供更广泛的研究思路,有利于两种技术的相互促进,如:在数据库中利用冰山立方体计算实现频繁项集挖掘来避免数据迁移、利用频繁项集挖掘算法优化数据立方体计算等.之前的工作没有将二者系统地结合起来研究,也没有建立二者之间较为完整的联系.本文在深入研究数据立方体的计算和频繁项集挖掘的过程后,将二者有效地结合在一起,提出了统一的计算框架,给出了二者众多计算性质和方法之间的映射关系,进行了相关概念泛化,具体地建立了冰山立方体、浓缩立方体和商立方体等主要数据立方体计算与相应频繁项集挖掘方法的对应关系.通过算法和实验进一步论证统一计算的有效性:(1)将频繁项集挖掘事务集导入关系数据库,用冰山立方体计算方式进行频繁项集挖掘,从而在数据库中用标准的或扩展的SQL可以实现对关系表进行频繁项集挖掘;(2)验证了浓缩立方体与频繁项集挖掘的统一性并对比了计算效率;(3)将基本表转换为频繁项集挖掘事务集,引入高效的频繁项集挖掘算法LCM计算商立方体,以提升数据立方体计算效率.在公开的真实数据集和人工合成的数据集上验证二者结合、统一计算的正确性,通过改变元组数、维数和倾斜度进行对比验证有效性.实验发现,在大数据集上可令时间效率提升高达92%.
Data cube and frequent itemset mining are essential technologies in the field of data warehouse and data mining respectively.A lot of relevant research work has been conducted,and impressive progress has been achieved.Data cube and frequent itemset mining construct similar algebraic lattice structures according to their data cell and itemset.Simultaneously,the upper bounds of the equivalent class of the data cube correspond to the closed itemset of frequent itemset mining.If the unity of the two lattice structures can be argued,they can provide broader research ideas for each other and facilitate the mutual promotion of the two techniques.For example,we can use iceberg cube computing to implement frequent itemset mining to avoid data migration in databases,and use frequent itemset mining algorithms to optimize data cube computing.Previous studies have not studied the two concepts with systematic combination,nor have they established a complete connection between them.After intensely studying the computation of data cube and the process of frequent itemset mining,this paper combines data cube and frequent itemset mining effectively.The paper proposes a unified computation framework,and gives the mapping relationship between multitudes of computational properties and methods of two lattice structure.Therefore,the high-performance lattice structure computation methods and application algorithms in the two fields can be integrated,thus improving the performance of lattice structure usage and enhancing the efficiency and accuracy of data analysis.On the basis of these results,related concept generalization was performed.Specifically,the corresponding relationship between the computation method of classic data cubee such as the iceberg cube,the condensed cube,and the quotient cube and the corresponding frequent itemset mining method is established.The effectiveness of the unified computation framework is further demonstrated by algorithms and experiments.First,the transaction datasets of frequent itemset mining are imp
作者
徐静文
游进国
王全鹍
黄星瑞
贾连印
XU Jing-Wen;YOU Jin-Guo;WANG Quan-Kun;HUANG Xing-Rui;JIA Lian-Yin(Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650500;Yunnan Key Laboratory of Artificial Intelligence,Kunming 650500)
出处
《计算机学报》
EI
CAS
CSCD
北大核心
2023年第4期780-802,共23页
Chinese Journal of Computers
基金
国家自然科学基金项目(No.62062046,No.61462050)资助.
关键词
数据立方体
频繁项集挖掘
格结构
统一计算方法
计算效率
data cube
frequent itemset mining
lattice structure
unified computation method
computation efficiency