
CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications (cited by: 3)

Abstract: Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations among threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
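The core idea in the abstract, launching extra threads up front and using control flow to switch between sequential and parallel code sections, can be illustrated with a minimal hand-written CUDA kernel. This is only a sketch of the pattern: the identifiers (`N_SLAVES`, `task_id`, `slave_id`) and the naive reduction are illustrative assumptions, not the code the CUDA-NP compiler actually generates, and the paper's pragma syntax is not reproduced here.

```cuda
#include <cstdio>

#define N_SLAVES 32  // slave threads per logical task (illustrative choice)

// One logical task per block; threadIdx.x selects among its slave threads.
__global__ void kernel_np(const float *in, float *out, int loop_count)
{
    int task_id  = blockIdx.x;
    int slave_id = threadIdx.x;            // 0 .. N_SLAVES-1
    __shared__ float partial[N_SLAVES];    // on-chip buffer, unlike dynamic
                                           // parallelism's global-memory-only
                                           // parent/child communication

    // Parallel-loop section: all slave threads are active, each taking a
    // strided share of the loop iterations.
    float local = 0.0f;
    for (int i = slave_id; i < loop_count; i += N_SLAVES)
        local += in[task_id * loop_count + i];
    partial[slave_id] = local;
    __syncthreads();

    // Sequential section: control flow deactivates all but the master
    // thread, which combines the partial results (naive reduction, kept
    // simple for clarity).
    if (slave_id == 0) {
        float sum = 0.0f;
        for (int s = 0; s < N_SLAVES; ++s)
            sum += partial[s];
        out[task_id] = sum;
    }
}
```

In the framework described by the abstract, a developer would instead annotate the parallel loop with an OpenMP-like pragma and the compiler would emit a transformation along these lines, additionally choosing the iteration-distribution scheme and handling reduction and scan primitives automatically.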
Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2015, No. 1, pp. 3-19 (17 pages).
Funding: This work was supported by the National Science Foundation of the USA under Grant No. CCF-1216569 and by an NSF CAREER award under Grant No. CCF-0968667.
Keywords: GPGPU, nested parallelism, compiler, local memory

