Abstract
Extracting parallel corpora from the web is important for machine translation and other multilingual language processing tasks. This paper proposes a flexible and efficient method for incrementally extracting parallel corpora from the web. The method continuously downloads, scans, and analyzes Common Crawl's web crawl archives and incrementally updates per-domain statistics on the length of text in each language. For any given language pair of interest, the entry sites to crawl are determined from these per-domain language text length statistics, and crawling is directed by the target languages; links outside the multilingual domain or the target languages are ignored. The paper also proposes a new intermediate sentence alignment method, which globally aligns sentences within a multilingual domain based on semantic similarity. Experiments show that: 1) incremental extraction continuously yields new parallel corpora, and extracting by specified language pair flexibly yields parallel corpora for the language pair of interest; 2) the proposed intermediate method is significantly more efficient than the global method and can complete alignments that local methods cannot; 3) across 6 language directions, the extracted parallel corpora are of higher quality than existing open-source web parallel corpora in the 4 medium- and low-resource directions, and close to the best existing open-source web parallel corpora in the 2 high-resource directions.
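The statistics-driven site selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `update_stats` and `candidate_domains`, the `(domain, language, text_length)` record shape, the `min_chars` threshold, and the ranking by the smaller of the two language totals are all assumptions made for illustration, since the abstract does not state the actual selection rule.

```python
from typing import Dict, Iterable, List, Tuple

# domain -> language code -> accumulated text length in characters
DomainStats = Dict[str, Dict[str, int]]


def update_stats(stats: DomainStats,
                 records: Iterable[Tuple[str, str, int]]) -> None:
    """Fold one batch of (domain, language, text_length) records into stats.

    Calling this once per newly scanned archive keeps the statistics
    incremental: earlier totals are extended, never recomputed.
    """
    for domain, lang, length in records:
        by_lang = stats.setdefault(domain, {})
        by_lang[lang] = by_lang.get(lang, 0) + length


def candidate_domains(stats: DomainStats, src: str, tgt: str,
                      min_chars: int = 1000) -> List[str]:
    """Return domains with at least min_chars of text in both languages.

    Ranking by the smaller of the two totals is a hypothetical heuristic:
    a domain can only yield as many sentence pairs as its scarcer language.
    """
    scored = []
    for domain, by_lang in stats.items():
        smaller = min(by_lang.get(src, 0), by_lang.get(tgt, 0))
        if smaller >= min_chars:
            scored.append((smaller, domain))
    # Largest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [domain for _, domain in scored]


if __name__ == "__main__":
    stats: DomainStats = {}
    # First archive batch (made-up domains and lengths).
    update_stats(stats, [("a.example", "en", 5000),
                         ("a.example", "kk", 1200),
                         ("b.example", "en", 9000)])
    # A later batch extends the same totals.
    update_stats(stats, [("b.example", "kk", 3000),
                         ("a.example", "kk", 500)])
    print(candidate_domains(stats, "en", "kk"))
```

Because each batch only adds to existing totals, a new language pair can be served from the accumulated statistics without rescanning earlier archives, which is the flexibility the abstract attributes to the incremental design.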
Authors
LIU Xiaofeng (刘小峰)
ZHENG Yucheng (郑禹铖)
LI Dongyang (李东阳)
School of Software Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2024, Issue 11, pp. 248-254 (7 pages)
Keywords
Parallel corpus extraction
Sentence alignment
Corpus construction
Machine translation
Web mining