Abstract
Multilingual neural machine translation (MNMT) with a single model has attracted wide attention in recent years. However, most existing methods simply mix the corpora of all languages into one training set and do not exploit the relatedness and similarity among languages, which has been shown to benefit multilingual translation. In addition, training a multilingual model involves many languages and a large amount of data, which makes training difficult and time-consuming. To address these two problems, this paper proposes a curriculum learning method based on language similarity to improve the overall performance and convergence speed of MNMT. Specifically, it proposes two hierarchical criteria for measuring similarity: one ranks different languages (inter-language) using singular vector canonical correlation analysis, and the other ranks different sentences within a particular language (intra-language) using cosine similarity. The paper further proposes a curriculum learning strategy that uses the validation-set loss as the criterion for switching curricula, so that training on the whole corpus becomes training on a sequence of curricula, which reduces training difficulty. The method fills a gap in applying curriculum learning strategies to multilingual neural machine translation. Experiments on balanced and unbalanced IWSLT multilingual datasets and on the Europarl corpus show that the proposed method outperforms strong multilingual baseline translation systems and reduces training time by up to 64%.
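The intra-language ranking and the validation-loss switching criterion described above can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything concrete here is an assumption made for illustration: the function names `rank_sentences` and `should_advance`, the use of a single reference vector, the most-similar-first ordering, and the plateau rule with its `patience` and `min_delta` parameters are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(sentence_vectors, reference_vector):
    """Return sentence indices ordered from most to least similar to the reference.

    Assumption: sentences are already embedded as vectors, and the ordering
    direction (most similar first) is chosen only for illustration.
    """
    scores = [cosine_similarity(v, reference_vector) for v in sentence_vectors]
    return sorted(range(len(sentence_vectors)), key=lambda i: scores[i], reverse=True)


def should_advance(val_losses, patience=2, min_delta=1e-3):
    """Hypothetical switching rule: move to the next curriculum once the
    validation loss has stopped improving by at least `min_delta` over the
    last `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return best_before - recent_best < min_delta


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(rank_sentences(vectors, [1.0, 0.0]))
    print(should_advance([3.0, 2.0, 2.0, 2.0]))
```

In a training loop under these assumptions, `should_advance` would be checked after each validation pass, and the next batch of ranked data would be added once it returns `True`.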
Authors
于东
谢婉莹
谷舒豪
冯洋
YU Dong; XIE Wan-ying; GU Shu-hao; FENG Yang (College of Information Sciences, Beijing Language and Culture University, Beijing 100083, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China)
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2022, Issue 1, pp. 24-30 (7 pages)
Computer Science
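Funding
Humanities and Social Sciences Youth Foundation of the Ministry of Education of China (19YJCZH230)
Graduate Innovation Fund of Beijing Language and Culture University (20YCX138)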
Keywords
Machine translation
Multilingual
Curriculum learning
Similarity evaluation
Language ranking
Sentence ranking