Abstract
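Sentence alignment of bilingual corpora has become a crucial problem in research on the new generation of machine translation. The main alignment methods are length-based and lexicon-based, each with its own characteristics: the former is simple to implement and efficient but has low precision; the latter is precise but complex to implement. This paper proposes a new alignment method. It first performs a rough alignment of the texts with a length-based method, then identifies anchors in the bilingual parallel texts and automatically extracts corresponding key lexical items in the two languages, which lowers the complexity of the alignment problem and limits the propagation of errors. Finally, the lexical correspondences obtained in this way are used to align the sentences. The method combines the advantages of the length-based and lexicon-based approaches, and experiments show that it substantially improves alignment precision.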
Parallel corpora alignment is a key issue in research on the new generation of MT. There are two main methods of sentence alignment, i.e., length-based and lexicon-based methods. These two methods have different characteristics. The former is efficient and easy to implement, but its precision is not satisfactory; the latter is more precise but more complex to implement. This paper proposes a novel method to align sentences in Chinese-English parallel corpora. First, a rough result is obtained using the length-based method. Then anchors are identified in the texts to reduce the complexity. Some lexical correspondences are also extracted. Finally, the extracted lexical correspondence information is applied in fine alignment using the lexicon-based method. The experimental results show that this new method can greatly reduce alignment errors.
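The first stage described above, rough alignment by sentence length, can be illustrated with a small dynamic-programming sketch. This is not the paper's implementation: the cost function, the bead penalties in `BEADS`, the `ratio` parameter, and the function names are all assumptions chosen for illustration, loosely in the spirit of length-based aligners. The anchor identification and lexicon-based refinement stages are not shown.

```python
import math

# Allowed bead types (source sentences, target sentences).
# The penalties are hypothetical values, not taken from the paper.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 1.5, (1, 2): 1.5}


def length_cost(src_len, tgt_len, ratio=1.0):
    """Penalty that grows as the target length departs from ratio * source length."""
    expected = src_len * ratio
    return abs(tgt_len - expected) / math.sqrt(expected + tgt_len + 1.0)


def align_by_length(src, tgt, ratio=1.0):
    """Return a list of (source_indices, target_indices) beads found by DP."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Lengths are measured in characters here (an assumption).
                s = sum(len(x) for x in src[i:ni])
                t = sum(len(x) for x in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s, t, ratio)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Trace back from the end of both texts to recover the beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

In a pipeline like the one the abstract outlines, the beads returned here would only be a rough first pass; anchors and extracted lexical correspondences would then be used to correct them.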
Source
《计算机学报》
EI
CSCD
PKU Core (Peking University Core Journals)
1998, Issue S1, pp. 151-158 (8 pages)
Chinese Journal of Computers
Funding
National Natural Science Foundation of China
Aerospace Pre-Research Fund
Keywords
Bilingual corpus
Sentence alignment
Machine translation
Parallel corpora, sentence alignment, machine translation