摘要
针对大规模文本数据库中频繁项集挖掘的特殊要求,本文提出了一种新的并行挖掘算法parFIM。parFIM以一种简单的数据结构H-Struct为基础,对数据进行纵向划分从而实现并行挖掘。算法同时考虑了去除短模式和减少重复模式。实验结果表明,parFIM能够很好地适用于大规模文本数据库中的频繁项集挖掘任务。
Frequent itemset mining is a common and useful task in data mining. It is also important in text mining. But most of the current mining algorithms can not be used in very large text databases. In order to solve the special problems in frequent itemsets mining in very large text databases,we propose a new parallel mining algorithm parFIM. Based on a simple data structure H-Struct, parFIM mines in parallel by partitioning data vertically. Removing short patterns and reducing duplicated patterns are also considered. Our experiment shows parFIM can suit the frequent itemset mining task well in very large text databases.
出处
《计算机工程与科学》
CSCD
2007年第9期110-113,119,共5页
Computer Engineering & Science
基金
国家863计划资助项目(2004AA112020
2003AA115210
2003AA111020)
关键词
文本挖掘
海量文本数据库
频繁项集
并行
text mining
very large text database
frequent itemset
parallel