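Abstract: The back propagation (BP) algorithm is a widely used neural network learning algorithm. The BP algorithm built on the MapReduce programming model of Hadoop clusters (MapReduce back propagation, MRBP) performs well on big-data problems and has therefore been widely adopted. However, because MRBP lacks fine-grained structure parallelism among neural nodes, its performance falls short when the data dimensionality is high and the network has many nodes. In addition, communication between compute nodes in a Hadoop cluster cannot be controlled directly by the user, so existing structure-parallel strategies designed for cluster systems cannot be applied directly to MRBP. To address this, a structure-parallel MRBP algorithm suited to Hadoop clusters (structure parallelism based MapReduce back propagation, SP-MRBP) is proposed. The algorithm divides each layer of the neural network into multiple structures and achieves structure parallelism for MRBP through layer-wise parallelism and layer-wise ensemble (LPLE). Analytical expressions for the computation time of SP-MRBP and MRBP are also derived and used to analyze the time difference between the two algorithms and the optimal parallel scale of SP-MRBP. To the authors' knowledge, this is the first time a structure-parallel strategy has been introduced into MRBP. Experiments show that SP-MRBP outperforms the original algorithm when the neural network is large.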
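The layer-wise parallelism and layer-wise ensemble idea can be illustrated with a minimal single-process sketch. It is not the authors' SP-MRBP implementation and uses no Hadoop or MapReduce API. It only shows that a layer's neurons can be split into independent structures whose partial outputs are concatenated before the next layer runs. All names (`split_rows`, `partial_forward`, `layer_forward`, `network_forward`), the sigmoid activation, and the contiguous partitioning scheme are assumptions made for illustration, and only the forward pass is covered.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def split_rows(rows, parts):
    """Split a list (one entry per neuron) into at most `parts` contiguous chunks."""
    size = max(1, math.ceil(len(rows) / parts))
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_forward(weights, biases, inputs):
    """One structure: compute activations for its own neurons only."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def layer_forward(weights, biases, inputs, parts):
    """Layer-wise parallelism, then layer-wise ensemble (hypothetical sketch)."""
    outputs = []
    # Each iteration is independent of the others, so on a cluster each
    # structure could be handled by a separate task (assumption).
    for w_part, b_part in zip(split_rows(weights, parts), split_rows(biases, parts)):
        # Ensemble step: concatenate partial outputs in neuron order.
        outputs.extend(partial_forward(w_part, b_part, inputs))
    return outputs


def network_forward(layers, inputs, parts):
    """Run the layers in sequence; each layer waits for the previous ensemble."""
    for weights, biases in layers:
        inputs = layer_forward(weights, biases, inputs, parts)
    return inputs
```

Because every neuron's activation depends only on the full output of the previous layer, splitting a layer into structures does not change the result; it only changes how much work each task performs and how much has to be gathered between layers.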
Funding: This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1C1B5084424) and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2019R1A6A1A03032119).
Abstract: Advancements in next-generation sequencer (NGS) platforms have improved NGS sequence data production and reduced the cost involved, which has resulted in the production of a large amount of genome data. The downstream analysis of multiple associated sequences has become a bottleneck for the growing genomic data due to storage and space utilization issues in the domain of bioinformatics. Traditional string-matching algorithms are efficient for small data sequences but cannot process large amounts of data for downstream analysis. This study proposes a novel bit-parallelism algorithm called BitmapAligner to overcome the issues caused by a large number of sequences and to improve the speed and quality of multiple sequence alignment (MSA). The input files (sequences) tested with BitmapAligner can be easily managed and organized using the Hadoop distributed file system. The proposed aligner converts the test file (the whole genome sequence) into binaries of equal length, line by line, before sequence alignment processing. The Hadoop distributed file system splits larger files into blocks based on a defined block size, which is 128 MB by default. BitmapAligner can accurately perform sequence alignment using the bitmask approach on large-scale sequences after sorting the data. The experimental results indicate that BitmapAligner operates in real time with a large number of sequences. Moreover, BitmapAligner finds the exact start and end positions of the pattern sequence to test the MSA application in the whole genome query sequence. The MSA's accuracy is verified by the bitmask indexing property of the bit-parallelism extended shifts (BXS) algorithm. The dynamic and exact approach of the BXS algorithm is implemented through the MapReduce function of Apache Hadoop. Conversely, the traditional seeds-and-extend approach faces the risk of errors while identifying the pattern sequences' positions. Moreover, the proposed model resolves the large-scale data challenges that are covered through MapReduce in th...
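To make the bitmask idea concrete, here is a minimal sketch of bit-parallel exact matching that reports the start and end position of every occurrence. It uses the classic Shift-And technique as a simpler stand-in; it is not the BXS algorithm and not the BitmapAligner implementation, and it involves no Hadoop component. The function name `shift_and_search` and the 0-based inclusive position convention are assumptions chosen for illustration.

```python
def shift_and_search(text, pattern):
    """Return (start, end) index pairs, 0-based and inclusive, of exact matches.

    Classic Shift-And: bit i of `state` is set when pattern[:i + 1] matches
    the text ending at the current character.
    """
    m = len(pattern)
    if m == 0:
        return []

    # One bitmask per symbol: bit i is set where pattern[i] equals that symbol.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (m - 1)  # this bit is set when the whole pattern has matched
    state = 0
    hits = []
    for end, ch in enumerate(text):
        # Extend every partial match by one character, start a new one at
        # bit 0, and keep only the prefixes consistent with `ch`.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append((end - m + 1, end))
    return hits


print(shift_and_search("ACGTACGTAC", "ACGT"))  # [(0, 3), (4, 7)]
```

Each text character costs one shift, one OR, and one AND regardless of pattern length, which is what makes bitmask matching attractive for long sequences. In a real implementation the pattern is usually limited to the machine word size or handled with multi-word masks; Python integers hide that detail here.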