Abstract
An integer matrix multiplication accelerator based on the Zynq SoC platform is proposed to meet the computational demands of matrix multiplications of varying sizes in deep learning inference. Its parallel architecture, built on bus broadcasting, fully exploits the reuse of on-chip data and minimizes how far intermediate accumulation results must move, thereby reducing accesses to external DRAM. By dynamically adjusting the matrix block (tile) size, the accelerator maintains high efficiency on irregularly shaped matrix multiplications. Experimental results on the DeepBench benchmark show that the accelerator achieves an 8.4-fold speedup in matrix multiplication over a dual-core ARM Cortex-A9 CPU.
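The abstract names the key ideas (block-wise integer matrix multiplication, keeping partial sums local, broadcasting operands to parallel units, and adapting block size to irregular shapes) without giving the dataflow. The following is a minimal Python sketch under stated assumptions, not the authors' design: the function names `choose_tile` and `tiled_int_matmul`, the tile limits `max_tm`, `max_tn`, `max_tk`, and the "fewest, most even blocks" tile-size rule are all hypothetical. The hardware bus broadcast is modeled only as reusing one element of `a` across an entire row of the output tile, and the per-tile accumulator stands in for on-chip storage that holds partial sums until the whole K dimension is reduced.

```python
import math
from typing import List

Matrix = List[List[int]]


def choose_tile(dim: int, max_tile: int) -> int:
    """Hypothetical heuristic: split `dim` into the fewest blocks allowed by
    `max_tile`, then make those blocks as even as possible so an irregular
    dimension does not leave a mostly empty final block."""
    if dim <= 0 or max_tile <= 0:
        raise ValueError("dimensions and tile limits must be positive")
    num_tiles = math.ceil(dim / max_tile)
    return math.ceil(dim / num_tiles)


def tiled_int_matmul(a: Matrix, b: Matrix,
                     max_tm: int = 32, max_tn: int = 32, max_tk: int = 64) -> Matrix:
    """Blocked integer C = A x B with an output-stationary accumulator per tile."""
    if not a or not b or len(b) != len(a[0]):
        raise ValueError("matrices are empty or inner dimensions do not match")
    m, k, n = len(a), len(a[0]), len(b[0])
    # Tile sizes are chosen per call from the actual shape (dynamic blocking).
    tm, tn, tk = choose_tile(m, max_tm), choose_tile(n, max_tn), choose_tile(k, max_tk)
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tm):
        i1 = min(i0 + tm, m)
        for j0 in range(0, n, tn):
            j1 = min(j0 + tn, n)
            # Partial sums for this C tile stay in one local buffer for the
            # whole K loop, so intermediate results never leave the "chip".
            acc = [[0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, k, tk):
                k1 = min(k0 + tk, k)
                for i in range(i0, i1):
                    row = acc[i - i0]
                    for kk in range(k0, k1):
                        a_ik = a[i][kk]  # one A element reused by every column of the tile
                        b_row = b[kk]
                        for j in range(j0, j1):
                            row[j - j0] += a_ik * b_row[j]
            # Write the finished tile back once, after all K blocks are summed.
            for i in range(i0, i1):
                c[i][j0:j1] = acc[i - i0]
    return c
```

For example, with `max_tile=32` a dimension of 100 is split into four blocks of 25 rather than 32, 32, 32 and 4, so the last pass over that dimension is not mostly idle. Python integers are unbounded; a hardware version would typically fix input and accumulator bit widths, which the abstract does not specify.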
Authors
RAN Decheng (冉德成), WU Dong (吴东), QIAN Lei (钱磊)
State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China
Source
Computer Engineering (《计算机工程》), 2019, No. 10, pp. 40-45 (6 pages)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
Funding
National Natural Science Foundation of China (61732010)