摘要
对于传统的FP-Growth算法而言,当事务数据库D很大时,构造基于内存的FP树可能是不现实的。针对此问题,提出了一种基于样本事务数据库的SFP算法。该方法对事务数据库D进行随机抽样,得到样本数据库S,此时以比指定的支持度min_sup小的支持度(min_sup')在S中挖掘频繁项集L',根据求得的频繁项集L',在剩余的数据库D-S中求得L'中各事务的支持数,这在大多数情况下就可以求得所有的频繁项集,但是有时可能会漏掉一些。这时可以对D进行二次扫描以发现漏掉的频繁项集。该算法大多数情况下只需要对数据库进行一次扫描,最坏情况下也只需要对数据库进行二次扫描。当把效率放在首位时,比如计算密集事务数据库的频繁项集时,SFP算法尤其合适。
For the traditional algorithm of FP growth, when database is very big, the structure of FP tree based on memory, sometimes is not realistic. According to this problem, put forward a SFP algorithm based on the sample of the transaction database. This method gets sample transaction database S from database D through random sampling, f'trstly in less than a specified min_sup support, digging fre- quent itemsets L "from S, then calculating L "each element of the support in the rest of the data sets D-S, finally getting L by min_sup. In most cases it can be obtained all the frequent itemsets, but sometimes no. In this time omit the frequent itemsets a second scan. The algorithm needs only one scan of database in most cases, the worst case scenario also needs only second scan to database. When the efficiency is the most important, such as intensive application must be frequently calculated. SFP algorithm is the first selecting.
出处
《计算机技术与发展》
2011年第5期79-82,共4页
Computer Technology and Development
基金
安徽省自然科学基金项目(090412054)
关键词
关联规则
频繁项集
FP树
样本事务数据库
association rule
frequent itemsets
FP tree
sample transaction database