Abstract
This paper analyzes three steps in building a bilingual parallel corpus: plain-text conversion (text-formatting), word segmentation, and sentence alignment. It treats the three together as the preprocessing stage for bilingual parallel corpora. It examines how the steps relate to one another, describes the current state of each and the problems that remain, and surveys the Chinese-English bilingual corpora already built in China.
Source
Foreign Language Education (《外语教育》)
2007, No. 1, pp. 145-149 (5 pages)
Keywords
parallel corpora
pre-processing
text-formatting
word segmentation
sentence alignment