An Efficient Method for the Parallel Mining of Frequent Itemsets in Very Large Text Databases

Original title: 海量文本数据库中的高效并行频繁项集挖掘方法. Cited by: 2

Abstract: Frequent itemset mining is a common and useful data mining task, and it matters in text mining as well. Most existing mining algorithms, however, cannot be applied to very large text databases. To address the particular requirements of frequent itemset mining in such databases, we propose a new parallel mining algorithm, parFIM. Built on a simple data structure, H-Struct, parFIM partitions the data vertically so that mining can proceed in parallel. The algorithm also removes short patterns and reduces duplicate patterns. Experimental results show that parFIM is well suited to frequent itemset mining in very large text databases.
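The abstract gives no details of H-Struct or of how parFIM divides its work, so the following is only a minimal sketch of the general idea of mining over a vertical data layout in parallel, not the authors' algorithm. It assumes a term-to-document-id index in place of H-Struct, assigns each worker the itemsets that start with its share of the frequent terms, and filters short patterns with a hypothetical `min_length` parameter. All function names are illustrative.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_tidsets(docs):
    """Vertical layout: map each term to the set of document ids containing it."""
    tidsets = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            tidsets[term].add(doc_id)
    return tidsets


def mine_prefix(prefix, prefix_tids, candidates, tidsets, min_support, out):
    """Extend `prefix` depth-first with candidate items that sort after it."""
    for i, item in enumerate(candidates):
        tids = prefix_tids & tidsets[item]
        if len(tids) >= min_support:
            itemset = prefix + (item,)
            out[itemset] = len(tids)
            mine_prefix(itemset, tids, candidates[i + 1:], tidsets, min_support, out)


def mine_partition(first_items, frequent, tidsets, min_support):
    """Mine every frequent itemset whose smallest item is in `first_items`."""
    out = {}
    for item in first_items:
        out[(item,)] = len(tidsets[item])
        later = [x for x in frequent if x > item]
        mine_prefix((item,), tidsets[item], later, tidsets, min_support, out)
    return out


def parallel_frequent_itemsets(docs, min_support, min_length=1, workers=2):
    """Return {itemset tuple: support} for itemsets of at least `min_length` items."""
    tidsets = build_tidsets(docs)
    frequent = sorted(t for t, ids in tidsets.items() if len(ids) >= min_support)
    # Assumption: a round-robin split of the starting items across workers.
    # Each itemset is produced by exactly one worker, so merging adds no duplicates.
    parts = [frequent[w::workers] for w in range(workers)]
    result = {}
    # Threads keep the sketch self-contained; CPU-bound work in CPython would
    # need processes or separate machines for a real speed-up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(mine_partition, part, frequent, tidsets, min_support)
            for part in parts
        ]
        for future in futures:
            result.update(future.result())
    # Drop short patterns, which carry little information in text data.
    return {k: v for k, v in result.items() if len(k) >= min_length}
```

Calling `parallel_frequent_itemsets(docs, min_support=2, min_length=2)` on a list of tokenized documents returns a dictionary from sorted term tuples to their document counts. Splitting by starting item keeps the workers independent, at the cost of possibly uneven load when a few terms dominate the collection.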

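Authors: 王永恒 (Wang Yongheng), 杨树强 (Yang Shuqiang), 贾焰 (Jia Yan)

Affiliation: School of Computer Science, National University of Defense Technology (国防科技大学计算机学院)

Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, 2007, No. 9, pp. 110-113, 119 (5 pages)

Funding: National 863 Program of China (2004AA112020, 2003AA115210, 2003AA111020)

Keywords: text mining; very large text database; frequent itemset; parallel

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]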

Citation network

References (9)

1. Antonie M L, Zaïane O R. Text Document Categorization by Term Association [A]. Proc of the IEEE 2002 Int'l Conf on Data Mining [C]. 2002. 19-26. Cited by: 1
2. Beil F, Ester M, Xu X. Frequent Term-Based Text Clustering [A]. Proc of the Int'l Conf on Knowledge Discovery and Data Mining [C]. 2002. 436-442. Cited by: 1
3. Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules [A]. Proc of the 20th Int'l Conf Very Large Data Bases [C]. 1994. 487-499. Cited by: 1
4. Zaki M J, Hsiao C J. CHARM: An Efficient Algorithm for Closed Itemset Mining [A]. Proc of the 2nd SIAM Int'l Conf on Data Mining [C]. 2002. 12-28. Cited by: 1
5. Han J, Pei J, Yin Y. Mining Frequent Patterns Without Candidate Generation [A]. Proc of the Special Interest Group on Management of Data [C]. 2000. 1-12. Cited by: 1
6. Pei J, Han J, Lu H, et al. H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases [A]. Proc of the 2001 IEEE ICDM Conf [C]. 2001. Cited by: 1
7. Agrawal R, Shafer J. Parallel Mining of Association Rules [J]. IEEE Trans on Knowledge and Data Engineering, 1996, 8(6): 962-969. Cited by: 1
8. Zheng Z, Kohavi R, Mason L. Real World Performance of Association Rule Algorithms [A]. Proc of KDD'01 [C]. 2001. Cited by: 1
9. Oracle Text 10g Technical Overview [EB/OL]. http://www.oracle.com/technology/products/text/x/10g_tech_overview.html, 2005-10. Cited by: 1

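Co-cited literature (15)

1. Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases [C]//Proceedings of ACM SIGMOD record. [S.l.]: ACM, 1993: 207-216. Cited by: 1
2. Agrawal R, Shafer J C. Parallel mining of association rules [J]. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 962-969. Cited by: 1
3. Kotsiantis S, Kanellopoulos D. Association rules mining: a recent overview [J]. GESTS International Transactions on Computer Science and Engineering, 2006, 32(1): 71-82. Cited by: 1
4. Han E H, Karypis G, Kumar V. Scalable parallel data mining for association rules [J]. IEEE Transactions on Knowledge and Data Engineering, 2000, 12(3): 337-352. Cited by: 1
5. Dean J, Ghemawat S. MapReduce: a flexible data processing tool [J]. Communications of the ACM, 2010, 53(1): 72-77. Cited by: 1
6. Zhang Zheng, Wang Huiwen. An efficient parallel frequent itemset mining algorithm (一种高效的并行频繁集挖掘算法) [J]. Computer Engineering, 2008, 34(11): 55-57. Cited by: 7
7. Wang Danyang, Tian Weidong, Hu Xuegang. An effective parallel frequent itemset mining algorithm (一种有效的并行频繁项集挖掘算法) [J]. Application Research of Computers, 2008, 25(11): 3332-3334. Cited by: 2
8. Jin Tao, He Yanshan, Song Weiguo, Yue Min. A simple and effective parallelized frequent itemset mining algorithm (一种简单有效的并行化频繁项集挖掘算法) [J]. Microcomputer Information, 2010, 26(18): 147-149. Cited by: 2
9. Zhang Dawei, Huang Dan, Ji Min, Xie Fuding. A parallel frequent itemset mining method using a pattern guide tree (利用模式指导树的并行频繁项集挖掘方法) [J]. Computer Engineering and Applications, 2010, 46(22): 147-150. Cited by: 3
10. Li Chenghua, Zhang Xinfang, Jin Hai, Xiang Wen. MapReduce: a new programming model for distributed parallel computing (MapReduce:新型的分布式并行计算编程模型) [J]. Computer Engineering & Science, 2011, 33(3): 129-135. Cited by: 111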

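Citing literature (2)

1. Song Wei, Ji Honglei, Li Jinhong. A parallel algorithm for mining high utility itemsets (一种高效用项集并行挖掘算法) [J]. Computer Engineering & Science, 2015, 37(3): 422-428. Cited by: 3
2. Chen Jing, Zheng Yan. A parallel frequent itemset mining algorithm based on binary trees (基于二叉树的并行频繁项集挖掘算法) [J]. Computer Technology and Development, 2015, 25(10): 80-83.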

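Second-level citing literature (3)

1. Sun Liang. Research on efficient data mining algorithms for large-scale datasets (对大规模数据集高效数据挖掘算法的研究) [J]. Automation & Instrumentation, 2016(3): 192-193. Cited by: 10
2. Qin Dongxia, Qi Yingchun, Wang Wei. A parallel frequent closed itemset mining algorithm based on equivalence class partitioning (基于等价类划分的并行频繁闭项集挖掘算法) [J]. Journal of Xinyang Normal University (Natural Science Edition), 2017, 30(3): 454-459. Cited by: 1
3. Pu Rong, Shao Jianfei, Hu Changli, Qu Kun. A vertical mining algorithm for high average-utility itemsets based on an optimized upper bound (基于优化上界的高平均效用项集垂直挖掘算法) [J]. Computer Engineering & Science, 2020, 42(5): 931-937. Cited by: 1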

