Open Computing Language Accelerated Parallel Algorithm for Segmented Prefix Sum
Abstract To address the large data volumes and long running times of prefix sum operations in numerical computing, a segmented prefix sum parallel algorithm based on the open computing language (OpenCL) is proposed. First, the parallelism of the segmented prefix sum algorithm was analyzed, the processing tasks were hierarchically decomposed and recombined, and a two-level parallel segmented prefix sum algorithm was designed. Then, the parallel algorithm was mapped onto a CPU+GPU platform through OpenCL programming to implement hierarchical parallel prefix sum processing. Finally, according to the resources available in each compute unit (CU), more local memory was allocated per CU, and the access pattern of the work-items was improved to reduce bank conflicts and increase memory access speed. Experimental results show that, compared with a serial algorithm on an AMD Opteron 2439 SE CPU, a parallel algorithm based on OpenMP (open multi-processing), and a parallel algorithm based on the compute unified device architecture (CUDA), the prefix sum parallel algorithm under the OpenCL architecture on the NVIDIA Tesla C2075 computing platform achieved speedups of 33.51, 6.26, and 2.41 times, respectively. These results verify the effectiveness and performance portability of the proposed parallel optimization method.
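The two-level decomposition described in the abstract can be illustrated with a small sequential sketch. The abstract does not give the kernel details, so the following assumes the common scan-then-propagate scheme: the input is split into fixed-size segments, each segment is scanned independently (the level that would map to one OpenCL work-group), the per-segment totals are scanned (the second level), and the resulting offsets are added back. The function names and the `segment_size` parameter are hypothetical, and the loops run serially here; this is not the authors' OpenCL implementation.

```python
from itertools import accumulate


def inclusive_scan(values):
    """Serial inclusive prefix sum of a list."""
    return list(accumulate(values))


def two_level_prefix_sum(data, segment_size=4):
    """Inclusive prefix sum via a two-level scan (hypothetical sketch).

    Level 1 scans each segment on its own; level 2 scans the segment
    totals to obtain the offset that each segment must add.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")

    # Level 1: independent scans, one per segment. On a GPU each of
    # these would be handled by a separate work-group in local memory.
    segments = [data[i:i + segment_size]
                for i in range(0, len(data), segment_size)]
    local_scans = [inclusive_scan(seg) for seg in segments]

    # Level 2: an exclusive scan of the segment totals gives the sum of
    # everything that precedes each segment.
    totals = [scan[-1] for scan in local_scans]
    offsets = [0] + inclusive_scan(totals)[:-1]

    # Combine: add each segment's offset to its local results.
    return [x + off
            for scan, off in zip(local_scans, offsets)
            for x in scan]
```

For any input list, `two_level_prefix_sum(data, k)` should agree with a plain serial scan of `data`, whatever segment size `k` is chosen; only the level-1 scans and the final offset addition are independent across segments, which is where the parallelism comes from.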
Authors XIAO Han (肖汉); LI Cai-lin (李彩林); GUO Bao-yun (郭宝云); ZHOU Qing-lei (周清雷) (School of Information Science and Technology, Zhengzhou Normal University, Zhengzhou 450044, China; School of Civil and Architectural Engineering, Shandong University of Technology, Zibo 255000, China; School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China)
Source Science Technology and Engineering (《科学技术与工程》), Peking University Core Journal, 2019, No. 31, pp. 215-221 (7 pages)
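Funding Supported by the National Natural Science Foundation of China (61572444, 41601496, 41701525), the Natural Science Foundation of Shandong Province (ZR2017LD002), and the Key Research and Development Program of Shandong Province (2018GGX106002)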
Keywords segmented prefix sum; graphics processing unit; open computing language; parallel algorithm; performance optimization