Abstract
Neural machine translation has achieved great success in high-resource settings, but translation quality in low-resource settings still needs improvement. Uyghur-Chinese and Mongolian-Chinese translation are both currently low-resource tasks. This paper proposes dividing Chinese monolingual data into multiple subsets according to domain similarity and using back-translation to train the translation model in stages on the different subsets, in a pre-training and fine-tuning fashion. Model averaging and model ensembling are then applied to further improve Uyghur-Chinese and Mongolian-Chinese translation quality. Experiments on the evaluation data of the 16th China Conference on Machine Translation (CCMT 2020) show that the method effectively improves translation quality for both language pairs.
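The two data-side and model-side steps named in the abstract, partitioning monolingual data by domain similarity and averaging models, can be illustrated with a minimal sketch. The abstract does not say how similarity is measured or how averaging is implemented, so everything here is an assumption: similarity is taken to be bag-of-words cosine similarity to a small in-domain sample, averaging is an element-wise mean over parameter dictionaries, and the function names (`partition_by_similarity`, `average_checkpoints`) are hypothetical.

```python
import math
from collections import Counter


def bow(sentences):
    """Bag-of-words counts over whitespace-tokenized sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(s.split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def partition_by_similarity(mono, in_domain, num_parts):
    """Rank monolingual sentences by similarity to the in-domain text
    (most similar first) and cut them into roughly equal shards.
    Assumption: the paper's similarity measure may differ."""
    ref = bow(in_domain)
    ranked = sorted(mono, key=lambda s: cosine(bow([s]), ref), reverse=True)
    size = max(1, math.ceil(len(ranked) / num_parts))
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts (name -> list of floats).
    Assumption: all checkpoints share the same names and shapes."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }
```

In a staged setup, each shard would be back-translated and used for one training stage, and checkpoints from the final stage would then be averaged; the order in which shards are consumed is a design choice the abstract does not specify.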
Authors
张文博
张新路
杨雅婷
董瑞
李晓
ZHANG Wenbo; ZHANG Xinlu; YANG Yating; DONG Rui; LI Xiao (The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; School of Computer and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China)
Source
《厦门大学学报(自然科学版)》
CAS
CSCD
PKU Core Journal
2021, No. 4, pp. 675-679 (5 pages)
Journal of Xiamen University: Natural Science
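Funding
National Natural Science Foundation of China (U1703133)
Xinjiang Autonomous Region High-Level Talent Introduction Project (Y839031201)
Open Project of the Key Laboratory of Xinjiang Uygur Autonomous Region (2018D04018)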
Keywords
neural machine translation
low-resource language
back translation
domain similarity
pre-training