Research on Multi-Level Parallel Algorithm of GPU Based QR Decomposition (基于GPU的多层次并行QR分解算法研究)

Cited by: 4

Abstract: QR decomposition is a fundamental computational module widely used in image processing, signal processing, communication engineering, and many other fields. Traditional parallel QR decomposition algorithms exploit only the data-level parallelism in the computation. Based on an analysis of the characteristics of the fast Givens rotation decomposition, this paper proposes a multi-level parallel algorithm that exploits task-level and data-level parallelism at the same time, which makes it well suited to massively parallel processors such as graphics processing units (GPUs). The GPU-based parallel QR decomposition can also serve as a basic building block that many applications on the GPU platform call directly. Experimental results show that, compared with an OpenMP implementation on a CPU, the GPU-based multi-level parallel algorithm achieves a speedup of more than 5x, and a singular value decomposition (SVD) application that calls the QR decomposition module achieves a speedup of more than 3x.
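The abstract does not spell out the algorithm, so the following is only a minimal sketch of the two kinds of parallelism it mentions, under stated assumptions. It uses standard Givens rotations rather than the square-root-free "fast" variant the paper analyzes, and it groups rotations with the classic wavefront ordering for adjacent-row Givens eliminations, which is not necessarily the paper's schedule. The function names (`givens`, `apply_rotation`, `wavefront_schedule`, `givens_qr_r`) are hypothetical, and the code runs sequentially in plain Python; it only marks where independent work exists.

```python
import math


def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def apply_rotation(R, i, k, j):
    """Rotate rows i and k of R in place so that R[k][j] becomes zero."""
    c, s = givens(R[i][j], R[k][j])
    # Data-level parallelism: every column of the row pair is updated
    # independently of the other columns.
    for col in range(len(R[0])):
        top, bottom = R[i][col], R[k][col]
        R[i][col] = c * top + s * bottom
        R[k][col] = -s * top + c * bottom


def wavefront_schedule(m, n):
    """Group eliminations (k, j) into steps whose rotations use disjoint row pairs.

    Entry (k, j) is zeroed by rotating rows k-1 and k at step (m-1-k) + 2*j.
    """
    steps = {}
    for j in range(min(m - 1, n)):
        for k in range(j + 1, m):
            steps.setdefault((m - 1 - k) + 2 * j, []).append((k, j))
    return [steps[t] for t in sorted(steps)]


def givens_qr_r(A):
    """Return the upper-triangular factor R of A (Q is not accumulated)."""
    R = [[float(x) for x in row] for row in A]
    m, n = len(R), len(R[0])
    for step in wavefront_schedule(m, n):
        # Task-level parallelism: rotations in the same step touch disjoint
        # row pairs, so they could run concurrently. Here they run in a loop.
        for k, j in step:
            apply_rotation(R, k - 1, k, j)
    return R
```

For a 4x4 matrix, `wavefront_schedule(4, 4)` places the eliminations of entries (1, 0) and (3, 1) in the same step, since they rotate rows (0, 1) and rows (2, 3) respectively. On a GPU one might map each rotation in a step to its own thread block and the column loop to the threads inside it, but that mapping is an assumption and is not taken from the paper.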

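Authors: Mu Shuai (穆帅), Wang Chenxi (王晨曦), Deng Yangdong (邓仰东)

Affiliation: Institute of Microelectronics, Tsinghua University

Source: Computer Simulation (《计算机仿真》), CSCD, Peking University Core Journal (北大核心), 2013, No. 9, pp. 234-238 (5 pages)

Funding: National Natural Science Foundation of China (61272085)

Keywords: QR decomposition; graphics processing unit (GPU); multi-level parallelism; fast Givens rotation

Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]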
