Abstract
In recent years, the use of multi-layer attention networks in the Transformer model has effectively improved the translation quality of neural machine translation systems, but the large number of attention operations also makes the overall inference efficiency of the model relatively low. To address this problem, we propose a coarse-to-fine (CTF) method, which compresses the information representations at a fine granularity according to the differences in the amount of information carried by the attention weights, thereby accelerating inference. Experiments show that on the NIST Chinese-English and WMT English-German translation tasks, the method improves inference speed by 13.9% and 12.8% respectively while preserving model performance. We further analyze the differences in the amount of information captured by attention operations under different representation granularities, which supports the rationale of the coarse-to-fine method.
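The abstract does not detail the exact compression criterion, so the following is only a minimal sketch under assumptions: it uses the entropy of each token's attention distribution as a stand-in for the "amount of information" in the attention weights, keeps high-entropy tokens at fine granularity, and average-pools runs of low-entropy tokens into a single coarse representation. The function names, the entropy criterion, and the threshold are illustrative, not the paper's method.

```python
# Hedged sketch of a coarse-to-fine compression driven by attention weights.
# Assumptions: per-token attention entropy approximates "information content";
# low-information neighbours are pooled into one coarse vector.
import numpy as np

def attention_entropy(attn):
    """Entropy of each query token's attention distribution.
    attn: (seq_len, src_len), rows sum to 1."""
    return -np.sum(attn * np.log(attn + 1e-9), axis=-1)

def coarse_to_fine_compress(hidden, attn, threshold=1.0):
    """Keep high-entropy tokens as-is; average-pool runs of low-entropy tokens.
    hidden: (seq_len, d_model); attn: (seq_len, src_len)."""
    ent = attention_entropy(attn)
    compressed, buffer = [], []
    for h, e in zip(hidden, ent):
        if e >= threshold:                      # informative token: flush pooled buffer, keep fine-grained
            if buffer:
                compressed.append(np.mean(buffer, axis=0))
                buffer = []
            compressed.append(h)
        else:                                   # low-information token: queue for coarse pooling
            buffer.append(h)
    if buffer:
        compressed.append(np.mean(buffer, axis=0))
    return np.stack(compressed)

# Toy usage: 6 tokens, model dim 4, random attention over 8 source positions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))
attn = rng.dirichlet(np.ones(8), size=6)
print(coarse_to_fine_compress(hidden, attn, threshold=1.9).shape)  # fewer than 6 rows once tokens are pooled
```

Any speed-up comes from the shorter sequence fed to subsequent attention layers; how aggressively to pool is a trade-off against translation quality that the paper evaluates empirically.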
Authors
张裕浩
许诺
李垠桥
肖桐
朱靖波
ZHANG Yuhao; XU Nuo; LI Yinqiao; XIAO Tong; ZHU Jingbo (Natural Language Processing Laboratory, Northeastern University, Shenyang 110819, China; Shenyang Yatrans Network Technology Co., Ltd., Shenyang 110004, China)
Source
《厦门大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journal
2020, No. 2, pp. 175-184 (10 pages)
Journal of Xiamen University (Natural Science)
Funding
Key Program of the National Natural Science Foundation of China (61732005, 61432013)
National Key R&D Program of China (2019QY1801)
National Natural Science Foundation of China (61876035)
Open Project of the Beijing Key Laboratory of Internet Culture and Digital Dissemination Research