Building and profiling Chinese-English 3-tuple comparable corpora (构建和剖析中英三元组可比语料库) Cited by: 5
Abstract: Chinese-English parallel corpora carry an inherently skewed language model because of translationese. Natural language processing systems trained on such corpora, including machine translation and cross-language information retrieval, inherit the skewed model, which seriously degrades application performance. To overcome this inherent defect of parallel corpora, the paper proposes techniques for building and profiling Chinese-English 3-tuple comparable corpora. The study uses comparable corpora and automatic language profiling, combining statistical and rule-based methods, to analyze statistically the native English and Chinglish components of a 3-tuple comparable corpus made up of native English, Chinglish, and standard Chinese. On this basis, automatic extraction techniques such as n-grams and key clusters are used to mine bilingual resources grounded in native-language models, with the aim of improving natural language processing applications such as machine translation.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2014, No. 13, pp. 153-157, 186 (6 pages)
Funding: National Natural Science Foundation of China (No. 61172101, No. 61172102)
Keywords: 3-tuple comparable corpora; language transfer; automatic language profiling; n-grams
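
The abstract mentions n-gram extraction and a statistical comparison of native English against Chinglish but gives no implementation details. The sketch below is only an illustration of that general idea, not the authors' method. It counts word n-grams in two small corpora and ranks those overused in one corpus by a log-likelihood keyness score. The choice of log-likelihood, the whitespace tokenizer, and all function names (`ngrams`, `ngram_counts`, `log_likelihood`, `key_ngrams`) are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n):
    """Count n-grams over lowercased, whitespace-tokenized sentences.

    Whitespace tokenization is an assumption; a real pipeline would
    use a proper tokenizer.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood keyness for an item seen a times in corpus A
    (size total_a) and b times in corpus B (size total_b)."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2.0 * score


def key_ngrams(study, reference, n=2, top=5):
    """Rank n-grams that are relatively more frequent in `study`
    than in `reference`, highest keyness first."""
    study_counts = ngram_counts(study, n)
    ref_counts = ngram_counts(reference, n)
    study_total = sum(study_counts.values())
    ref_total = sum(ref_counts.values())
    if study_total == 0 or ref_total == 0:
        return []
    scored = []
    for gram, a in study_counts.items():
        b = ref_counts.get(gram, 0)
        # Keep only n-grams overused in the study corpus.
        if a / study_total > b / ref_total:
            score = log_likelihood(a, b, study_total, ref_total)
            scored.append((score, " ".join(gram)))
    scored.sort(reverse=True)
    return scored[:top]
```

Called with a list of Chinglish sentences as `study` and native English sentences as `reference`, `key_ngrams` returns `(score, n-gram)` pairs that could serve as candidates for manual or rule-based review. A corpus of realistic size would also need frequency thresholds, which are omitted here.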