Abstract
Existing vertical-format algorithms for mining frequent itemsets compare Tid sets pairwise by intersection, which is time-consuming, and the resulting Tid sets can be large, which wastes storage. To address these problems, we propose a frequent itemset mining algorithm for the vertical data format that is based on a triangular matrix and diffsets. The algorithm uses diffsets to avoid the large Tid sets that can arise when mining dense data sets. It also applies a precondition check to decide whether a join is needed to generate candidate frequent (k+1)-itemsets, which reduces running time. In addition, storing data in a triangular matrix further saves storage space. Experimental results show that the algorithm substantially reduces both the time and the memory required to mine frequent itemsets.
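The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm, whose details the abstract does not give. It assumes the standard diffset recurrence, d(PXY) = d(PY) − d(PX) with support(PXY) = support(PX) − |d(PXY)|, and a generic flat layout for a strict upper-triangular matrix of 2-itemset counts. The function names `tri_index`, `pair_supports`, and `diffset_join` and the toy transactions are hypothetical.

```python
from itertools import combinations


def tri_index(i, j, n):
    """Flat position of the pair (i, j), i != j, in a strict upper-triangular layout."""
    if i > j:
        i, j = j, i
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


def pair_supports(transactions, items):
    """Count the support of every 2-itemset in a flat array of n(n-1)/2 cells."""
    n = len(items)
    pos = {item: k for k, item in enumerate(items)}
    counts = [0] * (n * (n - 1) // 2)
    for t in transactions:
        present = sorted(pos[x] for x in set(t) if x in pos)
        for i, j in combinations(present, 2):
            counts[tri_index(i, j, n)] += 1
    return counts


def diffset_join(diff_px, sup_px, diff_py):
    """Diffset and support of PXY from siblings PX and PY that share prefix P."""
    diff_pxy = diff_py - diff_px
    return diff_pxy, sup_px - len(diff_pxy)


# Toy data (assumed for illustration): Tid -> items.
db = {1: "abc", 2: "ab", 3: "ac", 4: "abc", 5: "bc"}
items = ["a", "b", "c"]
counts = pair_supports(db.values(), items)

# Tid sets of single items, then diffsets of 2-itemsets: d(XY) = t(X) - t(Y).
tids = {x: {tid for tid, t in db.items() if x in t} for x in items}
d_ab = tids["a"] - tids["b"]
d_ac = tids["a"] - tids["c"]
sup_ab = len(tids["a"]) - len(d_ab)

# Join ab and ac (shared prefix a) to get abc without touching full Tid sets.
d_abc, sup_abc = diffset_join(d_ab, sup_ab, d_ac)
```

On dense data the diffsets stay small even when the Tid sets are large, which is the storage argument the abstract makes. One plausible use of the triangular pair counts is to skip a join when the pair formed by the two last items is already known to be infrequent; whether this matches the paper's precondition check is not stated in the abstract.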
Source
《计算机工程与科学》
CSCD
Peking University Core Journal
2017, No. 7, pp. 1365-1370 (6 pages)
Computer Engineering & Science
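Funding
National Natural Science Foundation of China (61402212)
Liaoning Provincial Program for Outstanding Young Scholars in Higher Education Institutions (LJQ2015045)
Natural Science Foundation of Liaoning Province (2015020098)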
Keywords
frequent itemsets
triangular matrix
diffset
vertical data format