Abstract
Matrix multiplication plays an important role in scientific computing, and the choice of architectural model affects the performance of parallel matrix multiplication. In the existing synchronous MPI+CUDA model, the host must wait until the device finishes its task before it can continue working, which wastes time. To address this problem, a parallel matrix multiplication method based on an asynchronous MPI+CUDA model is proposed. The model keeps the host from entering a waiting state and uses CUDA streams to handle data sets that exceed GPU memory. Analysis of the speedup and efficiency of the asynchronous model, together with experimental results, shows that the method significantly improves parallel efficiency and the speed of large-scale matrix multiplication. It exploits the advantages of distributed memory across nodes and shared memory within a node, and is an effective and feasible parallel strategy.
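The asynchronous idea in the abstract can be illustrated with a small CPU-only analogy. This is a minimal sketch, not the paper's MPI+CUDA implementation: a worker thread stands in for the GPU, row chunks stand in for the pieces sent through CUDA streams, and the names `matmul_rows`, `async_chunked_matmul`, `chunk_rows`, and `device_share` are hypothetical. It also assumes the host spends the freed time computing part of the product itself, which the abstract does not spell out.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B (lists of lists)."""
    cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def async_chunked_matmul(a, b, chunk_rows=2, device_share=0.5):
    """Split the rows of A between a stand-in 'device' thread and the host.

    Assumption: the 'device' is only a worker thread, used here to show
    the non-blocking control flow, not real GPU execution.
    """
    split = int(len(a) * device_share)
    device_part, host_part = a[:split], a[split:]
    with ThreadPoolExecutor(max_workers=1) as device:
        # submit() returns immediately, so the host is not forced to wait.
        # Work is queued in small chunks, loosely mirroring how streams
        # process data too large to hold on the device at once.
        futures = [
            device.submit(matmul_rows, device_part[i:i + chunk_rows], b)
            for i in range(0, len(device_part), chunk_rows)
        ]
        # The host computes its own share while the 'device' is busy.
        host_result = matmul_rows(host_part, b)
        # Only now does the host synchronize and collect the chunks in order.
        device_result = [row for f in futures for row in f.result()]
    return device_result + host_result
```

In a synchronous variant the host would call `f.result()` right after each `submit()`, leaving it idle for the whole device computation; deferring that call is what removes the wait.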
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2011, No. 12, pp. 3327-3330 (4 pages)
Funding
National Natural Science Foundation of China (21133005, 20703022, 21011120087)
Keywords
matrix multiplication
parallel computing
hybrid programming
Message Passing Interface (MPI)
Compute Unified Device Architecture (CUDA)