Abstract
This paper presents the first design, implementation, and optimization of quadruple-precision general matrix-matrix multiplication (QGEMM) based on OpenBLAS for ARMv8 64-bit multi-core processors. Because floating-point computation inevitably introduces round-off errors, double-precision matrix multiplication (DGEMM) sometimes fails to deliver satisfactory numerical results, so higher or multiple precision is required. Double-double arithmetic is an efficient and widely used way to achieve quadruple precision. Each element of the proposed QGEMM is stored as a structure of two double-precision floating-point numbers that together represent one double-double number. Building on the blocked GEMM algorithm for dense matrices in OpenBLAS, we implement QGEMM by adding header and source files for the quadruple-precision data format and by writing the inner kernel in assembly. Using error-free transformations, we restructure the algorithm flow in the inner kernel to avoid renormalization steps, which are not always necessary and which introduce forced data dependencies. By analyzing the data dependencies of the algorithm, we design a register allocation and rotation strategy and optimize instruction scheduling to exploit instruction-level parallelism. Because algorithms differ in how much they can use fused multiply-add (FMA) instructions, we adopt the concept of an algorithm's theoretical peak performance, as distinct from the machine's theoretical peak performance, to evaluate the efficiency of QGEMM more accurately. Experimental results show that our QGEMM reaches up to 19.7 Gflops, which is 82.1% of the algorithm's theoretical peak performance on the ARMv8 64-bit multi-core platform. At similar accuracy, it runs about 5.8 times faster than both the unoptimized QGEMM based on OpenBLAS and the QGEMM in MBLAS, each of which implements double-double arithmetic in C, and 24 times faster than a QGEMM that uses the long double format as implemented by the GCC compiler. The numerical tests also show that our QGEMM scales well with the number of threads across a range of matrix sizes.
Source
《计算机学报》
EI
CSCD
PKU Core (北大核心)
2017, No. 9, pp. 2018-2029 (12 pages)
Chinese Journal of Computers
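Funding
Supported by the National High Technology Research and Development Program of China (863 Program) (2012AA01A301) and the National Natural Science Foundation of China (61402495, 61303189, 61602166, 61170049, 61402496).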