Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations

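Translated Chinese title: Using an analysis-based code transformation method to improve the performance portability of GPU-specific OpenCL kernels on multi-core/many-core CPUs (article in English)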


Abstract: OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, the locality optimizations written into GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may degrade rather than improve performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that performs this transformation of GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.
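The effect of removing a local-memory array and its barrier during work-item coalescing can be illustrated with a small emulation. This is a minimal sketch, not the authors' tool chain: the 3-point stencil, the function names (`gpu_style_group`, `cpu_style_group`, `run`), and the padded input layout are all assumptions made for illustration, and plain Python loops stand in for work-items.

```python
# Illustrative emulation only: the stencil and all names are assumptions,
# not taken from the paper. `src` is padded with one halo element per side.

def gpu_style_group(src, dst, group_id, local_size):
    """One work-group of a GPU-style kernel that stages data in a local tile."""
    base = group_id * local_size
    tile = [0.0] * (local_size + 2)  # stands in for a __local array with halo
    # Phase 1 (before the barrier): each work-item loads into the tile.
    for lid in range(local_size):
        tile[lid + 1] = src[base + lid + 1]
        if lid == 0:
            tile[0] = src[base]
        if lid == local_size - 1:
            tile[local_size + 1] = src[base + local_size + 1]
    # On a GPU, barrier(CLK_LOCAL_MEM_FENCE) would separate the two phases.
    # Phase 2 (after the barrier): each work-item reads its neighbours.
    for lid in range(local_size):
        dst[base + lid] = tile[lid] + tile[lid + 1] + tile[lid + 2]


def cpu_style_group(src, dst, group_id, local_size):
    """Coalesced form: one loop over work-items, no tile and no barrier."""
    base = group_id * local_size
    for lid in range(local_size):
        g = base + lid
        # tile[k] always held src[base + k], so read global memory directly.
        dst[g] = src[g] + src[g + 1] + src[g + 2]


def run(kernel, src, local_size):
    """Launch `kernel` over every work-group and return the output array."""
    n = len(src) - 2
    dst = [0.0] * n
    for group_id in range(n // local_size):
        kernel(src, dst, group_id, local_size)
    return dst
```

Both variants compute the same output; the second simply avoids the staging copy and the two-phase loop structure that a barrier forces on a CPU implementation.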
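Objective: OpenCL kernels designed for GPUs show poor performance portability on CPUs. To address this, a code transformation method based on memory-access pattern analysis is designed to improve performance portability. Innovation: By analyzing the memory access patterns in OpenCL kernels, the method removes unnecessary local-memory arrays together with the synchronization statements they require, and further optimizes the code with vectorization and locality re-exploitation, yielding a significant performance improvement. Method: First, a precise linearized access descriptor is designed for the array accesses in OpenCL kernel code (Fig. 2). Then, using this descriptor, GPU-specific OpenCL kernel code is transformed in two steps to improve its performance on CPUs (Fig. 7). The first step is analysis-based work-item coalescing: the access descriptors are analyzed to find and remove unnecessary local-memory arrays and their synchronization statements, after which the work-items are coalesced. The second step is architecture-aware code optimization: the coalesced code is further optimized for the characteristics of the CPU architecture using vectorization and locality re-exploitation. Finally, the transformation process is integrated into a tool chain which, together with a scheduler, is embedded into an open-source OpenCL runtime system (Fig. 11). Experimental results show that this transformation method can significantly improve the performance of GPU-specific OpenCL kernels on an Intel Sandy Bridge CPU and an Intel Knights Corner coprocessor. Conclusion: Accurately analyzing the memory access patterns in OpenCL kernel code not only helps determine whether a local-memory array suits the CPU architecture, but can also guide the subsequent code optimization, and is therefore an important step in improving performance portability.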
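How a linearized access descriptor might drive the removal of a local-memory array can be sketched as follows. The record does not define the paper's descriptor, so everything here is a hypothetical simplification: `LinearAccess` models an index as `const + group_coeff * group_id + local_coeff * local_id`, and `redirect_to_global` handles only a tile filled at unit stride, assuming every tile element that is read was written by that single store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinearAccess:
    """Hypothetical descriptor: index = const + group_coeff*gid + local_coeff*lid."""
    const: int
    group_coeff: int
    local_coeff: int

    def index(self, group_id: int, local_id: int) -> int:
        return (self.const
                + self.group_coeff * group_id
                + self.local_coeff * local_id)


def redirect_to_global(global_load: LinearAccess,
                       local_store: LinearAccess,
                       local_read: LinearAccess) -> Optional[LinearAccess]:
    """Given `tile[local_store] = src[global_load]`, return the global access
    equivalent to reading `tile[local_read]`, or None when this simplified
    sketch cannot derive it. Bounds (halo elements) are NOT checked here."""
    # Only tiles written at unit stride, independent of the group id.
    if local_store.local_coeff != 1 or local_store.group_coeff != 0:
        return None
    if local_read.group_coeff != 0:
        return None
    # tile[j] was written by the work-item with local id j - local_store.const,
    # so substitute that writer id into the global load's index expression.
    shift = local_read.const - local_store.const
    return LinearAccess(
        const=global_load.const + global_load.local_coeff * shift,
        group_coeff=global_load.group_coeff,
        local_coeff=global_load.local_coeff * local_read.local_coeff,
    )
```

For example, with a load `src[4*gid + lid]`, a store `tile[lid]`, and a read `tile[3 - lid]`, the function returns the descriptor for `src[4*gid + 3 - lid]`, so the tile and its barrier are no longer needed for that read.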

Authors: Mei WEN, Da-fei HUANG, Chang-qing XUN, Dong CHEN

Affiliations: School of Computer; National Key Laboratory of Parallel and Distributed Processing

Source: Frontiers of Information Technology & Electronic Engineering (信息与电子工程前沿（英文版）), indexed in SCIE, EI, CSCD; 2015, Issue 11, pp. 899-916 (18 pages)

Funding: Project supported by the National Natural Science Foundation of China (No. 61272145) and the National High-Tech R&D Program (863) of China (No. 2012AA012706)

Keywords: OpenCL; Performance portability; Multi-core/many-core CPU; Analysis-based transformation

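Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]; TP311.54 [Automation and Computer Technology: Computer Science and Technology]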
