Parallel matrix multiplication is one of the most important basic operations in linear algebra and a cornerstone of many scientific applications. As high-performance computing (HPC) advances toward exascale, communication accounts for an ever larger share of the cost of parallel matrix multiplication. Reducing this communication overhead and improving the scalability of parallel matrix multiplication is therefore an active research topic. This paper proposes a new distributed parallel dense matrix multiplication algorithm, a 2.5D version of PUMMA (Parallel Universal Matrix Multiplication Algorithm). The algorithm splits the initial processes into c groups and exploits the extra memory of the compute nodes: each process group stores a copy of the matrices A and B and executes 1/c of the PUMMA computation, and a final reduction produces the matrix product. Based on the BLACS (Basic Linear Algebra Communication Subprograms) communication library, we implement a new data-redistribution algorithm from the 2D to the 2.5D layout; combined with PUMMA, this yields the 2.5D PUMMA algorithm, which can directly replace PDGEMM (Parallel Double-precision General Matrix-matrix Multiplication) and is therefore highly portable. Compared with classical 2D algorithms such as PDGEMM in the standard library ScaLAPACK (Scalable Linear Algebra PACKage), the proposed algorithm reduces the number of communication steps, improves data locality, and scales better. At large process counts, e.g. 4096 processes, system tests show speedups of 2.20 to 2.93 over PDGEMM. Furthermore, applying the 2.5D PUMMA algorithm to accelerate the eigenvalue decomposition of symmetric tridiagonal matrices yields speedups above 1.2. Extensive numerical experiments analyze the performance of the 2.5D PUMMA algorithm, and practical recommendations and directions for future work are given.
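To make the grouping concrete, the following is a minimal sketch of the replicate-compute-reduce pattern the abstract describes, not the paper's implementation: processes are split into c groups, each group holds full copies of A and B and computes its 1/c slice of the inner dimension, and a reduction assembles the product. For simplicity it assumes one process per group (run with exactly C = 2 ranks), and a plain triple loop stands in for the per-group PUMMA run.

```c
/* Hypothetical sketch of the 2.5D replicate-compute-reduce pattern.
 * Assumes C = 2 groups with one MPI rank per group (run with 2 ranks);
 * the toy sizes and the local triple-loop multiply are illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define N 4   /* matrix order (toy size) */
#define C 2   /* number of replication groups */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Color ranks into C groups; a real 2.5D run would execute PUMMA
     * on each group communicator. */
    MPI_Comm group;
    int color = rank % C;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

    /* Each group stores full copies of A and B (the "extra memory"). */
    double A[N*N], B[N*N], Cpart[N*N], Cfull[N*N];
    for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 1.0; Cpart[i] = 0.0; }

    /* Each group multiplies only its 1/C slice of the inner (k) dimension. */
    int kchunk = N / C;
    int k0 = color * kchunk, k1 = k0 + kchunk;
    for (int i = 0; i < N; i++)
        for (int k = k0; k < k1; k++)
            for (int j = 0; j < N; j++)
                Cpart[i*N + j] += A[i*N + k] * B[k*N + j];

    /* Reduction across groups assembles the final product. */
    MPI_Reduce(Cpart, Cfull, N*N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("C[0][0] = %g (expect %d)\n", Cfull[0], N);

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}
```

With all-ones inputs, every entry of the product is N, so rank 0 prints 4; the point of the sketch is that the c-fold replication trades memory for fewer, larger communication steps.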
Many applications in computational science and engineering require the computation of eigenvalues and eigenvectors of dense symmetric or Hermitian matrices. For example, in DFT (density functional theory) calculations on modern supercomputers, 10% to 30% of the eigenvalues and eigenvectors of huge dense matrices have to be calculated. Therefore, the performance and parallel scaling of the eigensolvers used is of utmost interest. In this article, different routines of the linear algebra packages ScaLAPACK and Elemental for the parallel solution of the symmetric eigenvalue problem are compared with respect to their performance on the BlueGene/P supercomputer. Parameters for performance optimization are adjusted for the different data distribution methods used in the two libraries. It is found that for all test cases the new library Elemental, which uses a two-dimensional element-by-element distribution of the matrices to the processors, shows better performance than the older ScaLAPACK library, which uses a block-cyclic distribution.
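The two layouts being compared differ only in how a global matrix index is mapped to an owning process and a local index. The sketch below illustrates this along one dimension using the standard cyclic-distribution index formulas; the process count and block size are arbitrary examples, not the article's test configuration.

```c
/* Illustrative comparison of the two layouts from the abstract:
 * ScaLAPACK's 1-D block-cyclic mapping (block size NB) versus
 * Elemental's element-wise mapping (the NB = 1 special case). */
#include <stdio.h>

/* Owner process and local index of global index g,
 * block-cyclic with block size nb over p processes. */
static void block_cyclic(int g, int nb, int p, int *owner, int *local) {
    *owner = (g / nb) % p;
    *local = (g / (nb * p)) * nb + g % nb;
}

int main(void) {
    const int n = 12, p = 3, nb = 2;  /* example sizes only */
    printf(" g : block-cyclic (NB=%d) | elemental (NB=1)\n", nb);
    for (int g = 0; g < n; g++) {
        int bo, bl, eo, el;
        block_cyclic(g, nb, p, &bo, &bl);  /* ScaLAPACK-style layout */
        block_cyclic(g, 1,  p, &eo, &el);  /* Elemental-style layout  */
        printf("%2d : proc %d, local %2d    | proc %d, local %2d\n",
               g, bo, bl, eo, el);
    }
    return 0;
}
```

Applying the same map independently to rows and columns gives the full two-dimensional distributions; the element-wise case spreads consecutive entries across all processes, which is the property the abstract credits for Elemental's behavior.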
As the parallel version of LAPACK, ScaLAPACK provides a convenient and powerful parallel programming platform for the parallel computing community. In this paper, we propose a PVM-based ScaLAPACK application programming structure. Based on this general programming structure, we give three initialization methods for BLACS that can be used in different situations. The data distribution scheme used in ScaLAPACK is also introduced to help the user understand these methods better. Those who are familiar with PVM parallel programming can call ScaLAPACK subroutines directly using our methods, without concerning themselves too much with ScaLAPACK's details. Finally, several shortcomings of the current version of the ScaLAPACK package are discussed and an API is proposed to improve its usability.
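For context, here is one common BLACS initialization sequence of the general kind the abstract's three methods build on, written against the C interface of the modern MPI-backed BLACS; the paper itself targets the PVM-era BLACS, whose startup details differ, and the near-square grid choice here is just an illustrative convention.

```c
/* Sketch of a typical BLACS grid setup (C interface of MPI-backed BLACS).
 * The PVM-era BLACS the paper discusses initializes differently; this is
 * only a representative sequence. Link with the ScaLAPACK/BLACS libraries
 * and -lm. */
#include <math.h>
#include <stdio.h>

/* C BLACS prototypes, normally supplied by the ScaLAPACK installation. */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ctxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int ctxt);
extern void Cblacs_exit(int notdone);

int main(void) {
    int me, nprocs, ctxt, nprow, npcol, myrow, mycol;

    Cblacs_pinfo(&me, &nprocs);        /* my rank and the process count */

    /* Pick a near-square nprow x npcol grid using all processes. */
    for (nprow = (int)sqrt((double)nprocs); nprocs % nprow != 0; nprow--)
        ;
    npcol = nprocs / nprow;

    Cblacs_get(-1, 0, &ctxt);                    /* default system context */
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol); /* row-major process grid */
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

    printf("process %d of %d at (%d,%d) in a %dx%d grid\n",
           me, nprocs, myrow, mycol, nprow, npcol);

    /* ... ScaLAPACK/PBLAS calls on this context would go here ... */

    Cblacs_gridexit(ctxt);             /* release the process grid */
    Cblacs_exit(0);                    /* shut down BLACS */
    return 0;
}
```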