Study on the Automatic Alignment of Mandarin-Indonesian Bilingual Texts (汉语-印尼语平行语料自动对齐方法研究)
Cited by: 6
Abstract: Bilingual parallel corpora are an important resource for multilingual natural language processing and have been widely used in machine translation, machine-aided human translation, translation knowledge extraction, and cross-language information retrieval. This paper addresses the automatic alignment of Chinese-Indonesian parallel corpora and the automatic extraction of comparable corpora. It first proposes a paragraph alignment method that combines anchor points with a dictionary and, on that basis, uses a length model based on confidence intervals to achieve sentence alignment. To speed up the construction of a Chinese-Indonesian parallel corpus, it also proposes a comparable corpus extraction method based on cross-language document similarity. Experimental results show that the proposed parallel corpus alignment and comparable corpus extraction methods achieve significantly higher accuracy than traditional methods, indicating that the proposed methods are effective and feasible.
Authors: ZHENG Kengtao (郑铿涛); LIN Nankai (林楠铠); FU Yingwen (付颖雯); WANG Lianxi (王连喜); JIANG Shengyi (蒋盛益). Affiliations: School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510420, China; Eastern Language Processing Center, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510420, China.
Source: Journal of Guangxi Normal University: Natural Science Edition (《广西师范大学学报(自然科学版)》), indexed in CAS and the Peking University Core Journals list (北大核心), 2019, Issue 1, pp. 89-97 (9 pages).
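Funding: National Natural Science Foundation of China (61572145); National Social Science Fund of China, Youth Project (17CTQ045); Major Basic Research and Major Applied Research Project of the Guangdong Provincial Department of Education (2017KZDXM031); 2018 Guangdong Special Fund for Cultivating University Students' Science and Technology Innovation (pdjhb0177).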
Keywords: parallel corpus; corpus construction; comparable corpus; paragraph alignment; sentence alignment
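
The abstract mentions a length model based on confidence intervals for sentence alignment, but this record does not give the model's details. The following is a minimal illustrative sketch of the general idea only, not the authors' implementation: it assumes sentence length is measured in characters, that the target/source length ratio is estimated from a small set of already aligned seed pairs, and that a candidate pair is accepted when its ratio falls within the mean plus or minus `z` standard deviations. The function names `fit_length_ratio` and `within_interval` and the default `z=1.96` are hypothetical choices made for this sketch.

```python
from statistics import mean, stdev


def fit_length_ratio(seed_pairs):
    """Estimate mean and sample standard deviation of target/source length ratios.

    seed_pairs: iterable of (source_sentence, target_sentence) known to be aligned.
    Lengths are counted in characters (an assumption of this sketch).
    Needs at least two pairs with a non-empty source sentence.
    """
    ratios = [len(tgt) / len(src) for src, tgt in seed_pairs if src]
    return mean(ratios), stdev(ratios)


def within_interval(src, tgt, mu, sigma, z=1.96):
    """Return True if the length ratio of (src, tgt) lies in [mu - z*sigma, mu + z*sigma]."""
    if not src:
        return False  # an empty source sentence has no defined length ratio
    ratio = len(tgt) / len(src)
    return mu - z * sigma <= ratio <= mu + z * sigma
```

In such a scheme, a candidate sentence pair that fails the interval check would be rejected or passed to a more expensive check (for example, a dictionary lookup), while pairs inside the interval are kept as alignment candidates.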