
CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications (cited by: 3)

Abstract: Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations among threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
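The core idea in the abstract, launching extra threads up front and using control flow to switch between sequential and parallel code sections, can be illustrated with a minimal hand-written CUDA kernel. This is only a sketch of the pattern: the identifiers (`N_SLAVES`, `task_id`, `slave_id`) and the naive reduction are illustrative assumptions, not the code the CUDA-NP compiler actually generates, and the paper's pragma syntax is not reproduced here.

```cuda
#include <cstdio>

#define N_SLAVES 32  // slave threads per logical task (illustrative choice)

// One logical task per block; threadIdx.x selects among its slave threads.
__global__ void kernel_np(const float *in, float *out, int loop_count)
{
    int task_id  = blockIdx.x;
    int slave_id = threadIdx.x;            // 0 .. N_SLAVES-1
    __shared__ float partial[N_SLAVES];    // on-chip buffer, unlike dynamic
                                           // parallelism's global-memory-only
                                           // parent/child communication

    // Parallel-loop section: all slave threads are active, each taking a
    // strided share of the loop iterations.
    float local = 0.0f;
    for (int i = slave_id; i < loop_count; i += N_SLAVES)
        local += in[task_id * loop_count + i];
    partial[slave_id] = local;
    __syncthreads();

    // Sequential section: control flow deactivates all but the master
    // thread, which combines the partial results (naive reduction, kept
    // simple for clarity).
    if (slave_id == 0) {
        float sum = 0.0f;
        for (int s = 0; s < N_SLAVES; ++s)
            sum += partial[s];
        out[task_id] = sum;
    }
}
```

In the framework described by the abstract, a developer would instead annotate the parallel loop with an OpenMP-like pragma and the compiler would emit a transformation along these lines, additionally choosing the iteration-distribution scheme and handling reduction and scan primitives automatically.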
Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2015, No. 1, pp. 3-19 (17 pages).
Funding: This work was supported by the National Science Foundation of the USA under Grant No. CCF-1216569 and by an NSF CAREER award under Grant No. CCF-0968667.
Keywords: GPGPU, nested parallelism, compiler, local memory

