Abstract
The domestic DCU (Deep Computer Unit) adopts the Single Instruction Multiple Thread (SIMT) parallel execution model. During program execution, non-uniform control flow arises within kernel functions, so that some of the threads in a warp can only execute serially; this is known as warp divergence. To address the severe performance limits that warp divergence imposes on kernel functions, a compilation optimization method that reduces warp divergence time, Partial-Control-Flow-Merging (PCFM), was proposed. First, divergence analysis was performed to find fusible divergent regions that are isomorphic and contain a large number of identical and similar instructions. Second, the fusion profit of each fusible divergent region was evaluated by calculating the percentage of instruction cycles saved after merging. Finally, an alignment sequence was searched for, and the profitable fusible divergent regions were merged. PCFM was tested on DCU with test cases selected from the Graphics Processing Unit (GPU) benchmark suite Rodinia and from classic sorting algorithms. Experimental results show that PCFM achieves an average speedup of 1.146 on the test cases and that, compared with the branch fusion + tail merging method, the speedup with PCFM is 5.72% higher on average. The proposed method is therefore more effective at reducing warp divergence.
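The profit-evaluation and alignment steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract only states that profit is the percentage of instruction cycles saved after merging, so the latency table `LATENCY`, the functions `align` and `merge_profit`, and the use of a latency-weighted longest-common-subsequence alignment are all assumptions made for illustration. The sketch matches identical opcodes only and ignores the "similar instructions" that PCFM also merges.

```python
# Hypothetical per-opcode latencies in cycles; these values are NOT from the paper.
LATENCY = {"load": 4, "store": 4, "mul": 2, "add": 1, "sub": 1}


def align(a, b):
    """Return index pairs (i, j) with a[i] == b[j], chosen by a
    latency-weighted longest-common-subsequence alignment (an assumed
    stand-in for the alignment search mentioned in the abstract)."""
    n, m = len(a), len(b)
    # dp[i][j] = largest total latency that can be matched between a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = max(dp[i + 1][j], dp[i][j + 1])
            if a[i] == b[j]:
                best = max(best, dp[i + 1][j + 1] + LATENCY[a[i]])
            dp[i][j] = best
    # Walk the table forward to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j] and dp[i][j] == dp[i + 1][j + 1] + LATENCY[a[i]]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def cycles(seq):
    """Total cycles of an instruction sequence under the assumed latencies."""
    return sum(LATENCY[op] for op in seq)


def merge_profit(then_path, else_path):
    """Fraction of serialized cycles saved if every aligned pair of
    instructions is issued once instead of once per divergent path."""
    saved = sum(LATENCY[then_path[i]] for i, _ in align(then_path, else_path))
    total = cycles(then_path) + cycles(else_path)
    return saved / total if total else 0.0
```

With the two hypothetical paths `["load", "mul", "add", "store"]` and `["load", "sub", "add", "store"]`, the sketch aligns `load`, `add`, and `store`, so 9 of the 21 serialized cycles are saved. A merging pass would compare such a fraction against a threshold before merging a region; the abstract does not state a threshold value.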
Authors
YANG Xiaoyi (杨小艺)
ZHAO Rongcai (赵荣彩)
WANG Hongsheng (王洪生)
HAN Lin (韩林)
XU Kunkun (徐坤坤)
Affiliations: School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, Henan 450001, China; National Supercomputing Center in Zhengzhou, Zhengzhou, Henan 450001, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2023, Issue 10, pp. 3170-3177 (8 pages)
Funding
Major Science and Technology Project of Henan Province (221100210600).
Keywords
Deep Computer Unit (DCU)
Single Instruction Multiple Thread (SIMT)
warp divergence
complex control flow
compilation optimization