Abstract
To eliminate the invalid operations caused by the sparsity of model parameters in the forward computation of Convolutional Neural Networks (CNN), a dataflow and parallel accelerator for sparse neural network models is designed on a Field Programmable Gate Array (FPGA). A dedicated logic module selects the non-zero elements of the feature-map matrices and the convolution-filter matrices along the input-channel direction, and the valid data are passed to an array of Digital Signal Processor (DSP) units for multiply-accumulate operations. On this basis, all relevant intermediate results are combined through an adder tree to produce the final output feature-map points, while coarse-grained parallelism is exploited along the feature-map width, height, and output-channel directions and the optimal design parameters are searched for. Experiments on Xilinx devices show that the design achieves an overall performance of 678.2 GOPS on the VGG16 convolutional layers with an energy efficiency of 69.45 GOPS/W, a considerable improvement over FPGA-based accelerators for both dense and sparse networks.
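The abstract describes the compute pattern in prose only. Below is a minimal behavioral sketch in Python (not the paper's hardware design or RTL) of the idea it outlines: along the input-channel direction, only positions where both the feature-map value and the filter weight are non-zero reach the multiply-accumulate stage, and the per-channel partial sums are then reduced as an adder tree would reduce them. The function names (sparse_mac_channel, output_point) and the 3x3x64 example shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sparse_mac_channel(fmap_vec: np.ndarray, filt_vec: np.ndarray) -> float:
    """Multiply-accumulate over one input-channel fiber, skipping zero operand pairs."""
    nz = np.nonzero(fmap_vec * filt_vec)[0]        # selection logic: keep only valid (non-zero) pairs
    return float(np.dot(fmap_vec[nz], filt_vec[nz]))

def output_point(fmap_patch: np.ndarray, filt: np.ndarray) -> float:
    """One output feature-map point: per-position partial sums reduced like an adder tree."""
    k, _, _ = filt.shape
    partial = [sparse_mac_channel(fmap_patch[i, j, :], filt[i, j, :])
               for i in range(k) for j in range(k)]
    return float(np.sum(partial))                  # stands in for the hardware adder tree

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.random((3, 3, 64)) * (rng.random((3, 3, 64)) > 0.5)   # sparse activations
    filt = rng.random((3, 3, 64)) * (rng.random((3, 3, 64)) > 0.7)    # pruned weights
    print(output_point(patch, filt))
```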
Authors
DI Xinkai; YANG Haigang (Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China; University of Chinese Academy of Sciences, Beijing 100049, China)
Source
Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals (北大核心) list
2021, No. 7, pp. 189-195, 204 (8 pages in total)
Fund
National Natural Science Foundation of China (61876172)
Major Research Program of the Beijing Municipal Science and Technology Commission (Z171100000117019).
Keywords
Convolutional Neural Network (CNN)
sparsity
Field Programmable Gate Array (FPGA)
parallel accelerator
Digital Signal Processor (DSP)
adder tree