Abstract
Mining global sequential patterns in a distributed environment often generates too many candidate sequences, which increases network communication cost. To address this problem, a fast algorithm for mining global sequential patterns in distributed systems, Fast Mining of Global Sequential Pattern (FMGSP), was proposed. The algorithm compresses the local sequential patterns obtained at each site into a lexicographic sequence tree, which avoids transmitting repeated sequence prefixes. Because the node sequences in the merged tree are regular and simple, a pruning strategy named Item Extension and Sequence Extension (I/S-E) pruning was presented to reduce the candidate sequences effectively. As a result, network traffic was reduced and global sequential patterns were generated quickly. Theoretical analysis and experiments showed that FMGSP performs well on large data sets and mines global sequential patterns effectively.
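The prefix-sharing idea in the abstract can be illustrated with a minimal sketch. It is not the FMGSP implementation: the abstract does not specify the tree layout or the I/S-E pruning rules, so pruning is left out, and every name here (`Node`, `steps`, `build`, `merge`, `lookup`, the sample sites and their support counts) is a hypothetical choice made for illustration. A pattern is assumed to be a tuple of itemsets, and each tree edge is assumed to be either a sequence extension (`"S"`, start a new itemset) or an item extension (`"I"`, add an item to the last itemset).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One pattern in the tree; the path from the root spells the pattern."""
    support: int = 0
    children: dict = field(default_factory=dict)


def steps(pattern):
    """Flatten a pattern (tuple of itemsets) into ('S' | 'I', item) edges."""
    out = []
    for itemset in pattern:
        for i, item in enumerate(sorted(itemset)):
            # First item of an itemset opens a new element (sequence extension);
            # later items extend the current element (item extension).
            out.append(("S" if i == 0 else "I", item))
    return out


def build(local_patterns):
    """Compress one site's {pattern: support} map into a prefix-sharing tree."""
    root = Node()
    for pattern, support in local_patterns.items():
        node = root
        for step in steps(pattern):
            node = node.children.setdefault(step, Node())
        node.support += support
    return root


def merge(dst, src):
    """Add the supports of tree `src` into tree `dst`, node by node."""
    dst.support += src.support
    for step, child in src.children.items():
        merge(dst.children.setdefault(step, Node()), child)


def count_nodes(node):
    """Number of non-root nodes, i.e. edges that would have to be sent."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def lookup(root, pattern):
    """Support stored for `pattern`, or 0 if the tree does not contain it."""
    node = root
    for step in steps(pattern):
        node = node.children.get(step)
        if node is None:
            return 0
    return node.support


# Hypothetical local results from two sites (made-up support counts).
site_a = {(("a",),): 5, (("a",), ("b",)): 3, (("a", "c"),): 4}
site_b = {(("a",),): 6, (("a",), ("b",)): 2}

tree_a = build(site_a)
flat_items = sum(len(steps(p)) for p in site_a)  # items sent without sharing

global_tree = Node()
for tree in (tree_a, build(site_b)):
    merge(global_tree, tree)

print(flat_items, count_nodes(tree_a))
print(lookup(global_tree, (("a",), ("b",))))
```

In this toy input the three patterns from site A all start with `a`, so the tree stores that prefix once instead of three times. The merged tree then holds summed supports, which a coordinator could compare against a global minimum support threshold.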
Source
Computer Integrated Manufacturing Systems (《计算机集成制造系统》)
EI
CSCD
Peking University Core Journal
2007, No. 11, pp. 2229-2235 (7 pages)
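Funding
National Natural Science Foundation of China (60773103, 70472033, 60673060)
National Science and Technology Infrastructure Platform Program (2004DKA20310)
Natural Science Foundation of Jiangsu Province (BK2005047)
Jiangsu Province "Qing Lan Project" Fund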
Keywords
data mining
global sequential pattern
lexicographic sequence tree
item extension and sequence extension pruning