Abstract
Statistical machine translation (SMT) is currently the mainstream machine translation technique, and its good performance on Uyghur-Chinese translation is widely recognized. As with other language pairs, the final quality of Uyghur-Chinese SMT depends on the translation model, the language model, and the quality and size of the training corpus. This paper aims to improve translation performance by filtering the Uyghur-Chinese bilingual training corpus. Building on prior work, it proposes three methods: a modified IBM Model 1 for evaluating sentence alignment quality, corpus filtering based on bilingual language model perplexity, and taking the intersection of the sentence pairs selected by multiple filtering criteria. These methods do not depend on language-specific features and therefore support Uyghur-Chinese corpus filtering. Experiments show that the proposed methods yield better Uyghur-Chinese machine translation results.
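The intersection-based filtering named in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the modified IBM Model 1 or of the bilingual perplexity criterion, so the code below uses the standard, length-normalized IBM Model 1 score as a stand-in, and treats every name (`ibm1_score`, `filter_by_intersection`, `t_table`, `keep_ratio`, `floor`) as a hypothetical choice made for illustration only.

```python
import math

def ibm1_score(src, tgt, t_table, floor=1e-6):
    """Length-normalized log-probability of tgt given src under standard IBM Model 1.

    t_table maps (target_word, source_word) -> translation probability.
    This is the textbook model, NOT the modified version proposed in the paper.
    """
    if not tgt:
        return float("-inf")
    src_with_null = ["<NULL>"] + list(src)
    total = 0.0
    for f in tgt:
        # Average the lexical probability over all source positions (incl. NULL).
        p = sum(t_table.get((f, e), floor) for e in src_with_null) / len(src_with_null)
        total += math.log(p)
    return total / len(tgt)

def filter_by_intersection(pairs, scorers, keep_ratio):
    """Keep only the pairs ranked in the top keep_ratio by every scorer.

    Each scorer maps a pair to a number where higher means better, so a
    perplexity-based criterion would be passed in negated.
    """
    k = max(1, int(len(pairs) * keep_ratio))
    kept = set(range(len(pairs)))
    for scorer in scorers:
        ranked = sorted(range(len(pairs)), key=lambda i: scorer(pairs[i]), reverse=True)
        kept &= set(ranked[:k])
    # Preserve the original corpus order in the output.
    return [pairs[i] for i in sorted(kept)]
```

Ranking by each criterion separately and intersecting the top fractions means a sentence pair survives only if no single criterion flags it as poor, which is one plausible reading of "intersection of multiple filtering criteria".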
Authors
Kong Jinying (孔金英)
Wen Zhengyang (温政阳)
Yang Yating (杨雅婷)
Wang Lei (王磊)
Li Xiao (李晓)
Affiliations: Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; Xinjiang Laboratory of Minority Speech & Language Information Processing, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China; Experimental Center for Electronic Data Identification of Urumqi Municipal Public Security Bureau, Urumqi 830000, China; Institute of Acoustics of Chinese Academy of Sciences, Beijing 100190, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journals (北大核心)
2016, No. 12, pp. 3654-3657 (4 pages)
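Funding
West Light Program of the Chinese Academy of Sciences (XBBS201216, LHXZ201301)
Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030400)
Youth Natural Science Foundation of Xinjiang Uygur Autonomous Region (2015211B034)
Open Project of the Key Laboratory of Xinjiang Uygur Autonomous Region (2015KL031)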
Keywords
Uyghur-Chinese machine translation
corpus filtering
language model