
Design of Matrix Multiplication Accelerator for Deep Learning Inference (cited by: 2)
Abstract: To meet the demand for matrix multiplications of different sizes in deep learning inference, this paper proposes an integer matrix multiplication accelerator based on the Zynq SoC platform. A parallel architecture built on bus broadcasting makes full use of on-chip data reuse and minimizes the distance that intermediate accumulation results must move, which reduces accesses to external DRAM. By dynamically adjusting the matrix block size, the accelerator maintains high efficiency when multiplying irregularly shaped matrices. Experimental results show that, on the DeepBench benchmark, the accelerator achieves an 8.4x speedup over a dual-core ARM Cortex-A9 CPU for matrix multiplication.
Authors: RAN Decheng (冉德成); WU Dong (吴东); QIAN Lei (钱磊) (State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China)
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2019, No. 10, pp. 40-45 (6 pages)
Funding: National Natural Science Foundation of China (61732010)
Keywords: integer matrix multiplication; accelerator; programmable System on Chip (SoC); deep learning inference; blocking scheme; DeepBench test
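
The blocking idea summarized in the abstract can be illustrated in software. The following is a minimal sketch only, not the authors' hardware design: the function names `choose_block` and `blocked_matmul`, the `max_block` parameter, and the clamping policy are all hypothetical. It shows two points from the abstract: the block size shrinks to fit irregularly shaped matrices, and each output tile is accumulated locally and written back once.

```python
def choose_block(dim, max_block):
    """Hypothetical policy: never use a block larger than the dimension itself."""
    return max(1, min(dim, max_block))


def blocked_matmul(a, b, max_block=4):
    """Multiply integer matrices a (m x k) and b (k x n) tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    # Block sizes adapt to the matrix shape (assumed policy, see choose_block).
    bm, bk, bn = (choose_block(d, max_block) for d in (m, k, n))
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, bm):
        for j0 in range(0, n, bn):
            # One output tile is accumulated locally across all k-blocks.
            tile = [[0] * min(bn, n - j0) for _ in range(min(bm, m - i0))]
            for k0 in range(0, k, bk):
                for i in range(len(tile)):
                    for j in range(len(tile[0])):
                        acc = 0
                        for kk in range(k0, min(k0 + bk, k)):
                            acc += a[i0 + i][kk] * b[kk][j0 + j]
                        tile[i][j] += acc
            # The finished tile is written back to the result exactly once.
            for i, row in enumerate(tile):
                c[i0 + i][j0:j0 + len(row)] = row
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[1, 0], [0, 1], [2, 2]]
    print(blocked_matmul(a, b, max_block=2))  # [[7, 8], [16, 17]]
```

Keeping the partial sums in a local tile is a software analogue of keeping intermediate accumulation results close to where they are produced; how the actual accelerator maps this onto its bus-broadcast structure is not described in this record.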