Abstract
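Huawei Ascend is a new type of neural network accelerator. Compared with GPUs, Ascend accelerators target neural network computation specifically: they are built around dedicated computing units, their core computing power is concentrated in low precision, and the Ascend software stack differs from that of GPUs. Most existing research focuses on analyzing and optimizing the performance of deep learning workloads on GPUs; because the Ascend platform was released only recently and has new architectural features, its actual performance remains to be explored. To investigate Ascend's performance and optimization methods in depth, this paper presents a systematic evaluation and analysis, including: (1) a comparison of the performance and power consumption of Ascend and GPUs on four end-to-end neural networks (ResNet, Transformer, DeepFM, and LSTM) using standard datasets; (2) a study of optimization strategies on Ascend for deep learning frameworks, operators, and mixed-precision training; (3) measurements of the floating-point computing capability, hardware utilization, and memory access performance of three compute-intensive operators (fully connected, convolution, and RNN). The evaluation results show that the Huawei Ascend accelerator is well suited to dense neural network workloads and consumes less power than the GPU, and that training models on Ascend requires quantizing the neural network model from 32-bit to 16-bit precision. Based on the characteristics of Ascend's architecture and compiler software stack, this paper proposes the following optimization strategies: deep learning frameworks should compile and build the whole graph and fuse operators; operator implementations should set tile sizes appropriately and use low precision wherever possible; and mixed-precision parameters should be set appropriately during model training.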
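The abstract does not say which mixed-precision parameters the authors tune. One parameter commonly involved in 16-bit training in general is the loss scale, and the sketch below illustrates why such a setting matters. It is a minimal illustration using only the Python standard library, not the paper's method or any Ascend API; the helper names `to_fp16` and `scaled_fp16_gradient`, the gradient value, and the scale of 1024 are all hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scaled_fp16_gradient(grad: float, loss_scale: float) -> float:
    """Store grad * loss_scale in fp16, then unscale in higher precision."""
    return to_fp16(grad * loss_scale) / loss_scale


# Hypothetical values chosen for illustration only.
true_grad = 1e-8      # a small gradient, below the smallest fp16 subnormal
loss_scale = 1024.0   # a power-of-two loss scale (assumed setting)

naive = to_fp16(true_grad)                               # underflows to 0.0
recovered = scaled_fp16_gradient(true_grad, loss_scale)  # stays close to 1e-8

print("fp16 without scaling:", naive)
print("fp16 with loss scaling:", recovered)
```

Without scaling, the gradient is lost entirely when stored in 16 bits; with scaling it survives with a small rounding error. A scale that is too large would instead overflow the fp16 range for large gradients, which is why the value has to be chosen with care.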
The great success of deep neural networks (DNNs) relies mainly on the computing power provided by modern chips. Nvidia's high-performance, general-purpose Graphics Processing Units (GPUs) are widely used to build deep learning tools and software. There is an industry-wide trend towards domain-specific neural network accelerators to extend deep learning performance. For example, Google has released the Tensor Processing Unit (TPU) and deployed TPUs in its data centers; MIT proposed an energy-efficient reconfigurable accelerator for deep convolutional neural networks. In addition to these accelerators, Huawei has developed the Ascend accelerator, including Ascend 910 for training and Ascend 310 for inference. Ascend accelerators feature very high computing power, high integration, and fast network bandwidth. Ascend 910, for example, delivers 256 TFLOPS at half precision, 32 GB of memory with 1200 GB/s bandwidth, and a 100G RoCE v2 network adapter. Compared with GPUs, Ascend is designed mainly for neural networks. The differences between Ascend and GPUs are: (1) Ascend uses task-specific processing units that are designed mainly for neural networks; (2) its computing power is concentrated in lower precision; (3) the compiler software stack on Ascend differs from that of GPUs. The main goal of deep learning is to train a statistical model on a training dataset such that the fitted model makes high-quality predictions on unseen data, which is referred to as generalization. From the perspective of hardware design, task-specific processing units can greatly speed up particular workloads, and lower precision makes each training iteration faster. However, task-specific processing units may not meet the needs of a wide variety of deep learning models, and lower-precision hardware requires special software-level optimization methods. Previous benchmarks and analyses focused on deep learning on GPU platforms. Ascend has special and novel features, and its potential remains unknown. To thoroughly understand its performance and optimization methods, we conduct a systematic
Authors
鲁蔚征
张峰
贺寅烜
陈跃国
翟季冬
杜小勇
LU Wei-Zheng; ZHANG Feng; HE Yin-Xuan; CHEN Yue-Guo; ZHAI Ji-Dong; DU Xiao-Yong (Office of Research Infrastructure, Renmin University of China, Beijing 100872; Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872; School of Information, Renmin University of China, Beijing 100872; Department of Computer Science and Technology, Tsinghua University, Beijing 100084)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core Journals
2022, Issue 8, pp. 1618-1637 (20 pages)
Chinese Journal of Computers
Funding
National Key Research and Development Program of China (2018YFB1004401)
National Natural Science Foundation of China (U1711261, 62172419)
Ministry of Education Industry-University Cooperative Education Program (Huawei Ascend).