Abstract
To improve the speed and energy efficiency of deep convolutional network algorithms running on embedded platforms with limited resources and power budgets, a convolutional parallel acceleration scheme based on a field-programmable gate array (FPGA) is proposed. Fusing the convolutional layer with the batch normalization (BN) layer reduces computational complexity; data tiling reduces on-chip storage consumption; data reuse and parallel computation increase operation speed and reduce system hardware overhead; design space exploration finds the computational parallelism that best fits the hardware resource constraints. Experimental results show that at a working frequency of 100 MHz, the accelerator's peak computing performance reaches 52.56 GFLOPS, 4.1 times that of a CPU, while consuming only 9.9% of the energy of a GPU; compared with other FPGA solutions, the overall performance shows a measurable improvement.
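The conv-BN fusion mentioned in the abstract is a standard transformation: the per-channel BN scale and shift are folded into the convolution's weights and bias, so the fused layer computes the same output with no separate BN pass. A minimal NumPy sketch (the layout `(out_ch, in_ch, kh, kw)` and `eps` value are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN parameters into conv weights/bias.

    W: conv weights, shape (out_ch, in_ch, kh, kw)
    b: conv bias, shape (out_ch,)
    gamma, beta, mean, var: per-channel BN parameters, shape (out_ch,)
    """
    # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta,
    # so each output channel is scaled by gamma / sqrt(var + eps).
    scale = gamma / np.sqrt(var + eps)
    W_fused = W * scale[:, None, None, None]
    b_fused = (b - mean) * scale + beta
    return W_fused, b_fused
```

Equivalence can be checked on a single receptive field: applying BN after the original convolution and applying the fused convolution directly produce the same per-channel outputs, which is why the fusion removes BN's runtime cost on the FPGA datapath.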
Authors
GONG Hao-jie
ZHOU Hai
FENG Shui-chun
GONG Hao-jie; ZHOU Hai; FENG Shui-chun (Key Laboratory of Electronic Information Technology for Complex Aerospace Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 101499, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China)
Source
《计算机工程与设计》
Peking University Core Journal (北大核心)
2022, No. 7, pp. 1872-1878 (7 pages)
Computer Engineering and Design
Funding
Youth Innovation Promotion Association of the Chinese Academy of Sciences, grant E0293401.