Abstract
Chinese-English parallel corpora carry an inherently skewed language model because of translationese. Natural language processing systems trained on such corpora, including machine translation and cross-language information retrieval, inherit this skew, which seriously degrades application performance. To overcome this inherent defect of parallel corpora, this paper proposes techniques for building and profiling a Chinese-English 3-tuple comparable corpus. The study uses comparable corpora and automatic language profiling, combining statistical and rule-based methods, to analyze the native English and Chinglish components of a 3-tuple comparable corpus made up of native English, Chinglish, and standard Chinese. On this basis, automatic extraction techniques such as n-grams and keyword clusters are used to mine bilingual resources grounded in a native-language model, with the aim of improving machine translation and other natural language processing applications.
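As a rough illustration of the kind of profiling the abstract describes, the sketch below extracts n-grams from two small English samples and ranks them by a log-likelihood keyness score. Everything here is an assumption made for illustration: the abstract does not say which statistic, tokenizer, or data the paper uses, and the function names (`ngram_counts`, `log_likelihood`, `key_ngrams`) and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n):
    """Count n-grams over whitespace-tokenized, lowercased sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def log_likelihood(a, b, total_a, total_b):
    """Two-term log-likelihood keyness score (assumed statistic).

    a, b are the item's frequencies in corpus A and corpus B;
    total_a, total_b are the corpus sizes in n-grams.
    """
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score


def key_ngrams(native, chinglish, n=2, top=5):
    """Rank n-grams by how strongly their frequency differs between corpora."""
    counts_a = ngram_counts(native, n)
    counts_b = ngram_counts(chinglish, n)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    scored = []
    for gram in set(counts_a) | set(counts_b):
        a, b = counts_a[gram], counts_b[gram]
        scored.append((log_likelihood(a, b, total_a, total_b), gram, a, b))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top]


if __name__ == "__main__":
    # Toy data, invented for this sketch only.
    native = ["turn on the light", "turn on the radio"]
    chinglish = ["open the light", "open the radio"]
    for score, gram, a, b in key_ngrams(native, chinglish, n=2, top=3):
        print(" ".join(gram), a, b, round(score, 3))
```

Each returned tuple holds the score, the n-gram, and its frequency in the native and Chinglish samples, so an n-gram that appears in only one sample rises to the top. A real pipeline would need proper tokenization and much larger corpora before such scores mean anything.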
Source
《计算机工程与应用》
CSCD
2014, No. 13, pp. 153-157, 186 (6 pages in total)
Computer Engineering and Applications
Funding
National Natural Science Foundation of China (No. 61172101, No. 61172102)