Abstract
Bilingual parallel corpora are an important resource for multilingual natural language processing and are widely used in machine translation, machine-aided human translation, translation knowledge extraction, and cross-language information retrieval. This paper addresses the automatic alignment of Chinese-Indonesian parallel corpora and the automatic extraction of comparable corpora. A paragraph alignment method that combines anchor points with a dictionary is proposed, and on this basis a length model based on confidence intervals is used to align sentences. In addition, to speed up the construction of a Chinese-Indonesian parallel corpus, a comparable corpus extraction method based on cross-language document similarity is proposed. Experimental results show that the proposed parallel corpus alignment method and comparable corpus extraction method achieve significantly higher accuracy than traditional methods, indicating that the proposed methods are effective and feasible.
Authors
郑铿涛
林楠铠
付颖雯
王连喜
蒋盛益
ZHENG Kengtao; LIN Nankai; FU Yingwen; WANG Lianxi; JIANG Shengyi (School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510420, China; Eastern Language Processing Center, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510420, China)
Source
《广西师范大学学报(自然科学版)》
CAS
Peking University Core Journals
2019, Issue 1, pp. 89-97 (9 pages)
Journal of Guangxi Normal University: Natural Science Edition
Funding
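National Natural Science Foundation of China (61572145)
National Social Science Fund of China, Youth Project (17CTQ045)
Major Basic Research and Major Applied Research Projects of the Department of Education of Guangdong Province (2017KZDXM031)
2018 Guangdong Special Fund for Cultivating College Students' Scientific and Technological Innovation (pdjhb0177)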
Keywords
parallel corpus
corpus construction
comparable corpus
paragraph alignment
sentence alignment