Abstract
With the widespread adoption of the Internet and the rapid development of science and technology, the volume of information people receive has grown sharply, creating strong market demand for machine translation. Extracting a language model from existing text is a central component of a machine translation system and determines its translation performance. Translation systems for specific domains often suffer from a shortage of relevant text, so the language model cannot be trained on a large-scale corpus, which leads to a severe data sparsity problem. This paper experimentally studies the choice of smoothing algorithm for language models trained on a limited corpus. The experimental results indicate that, when the corpus is extremely limited, Good-Turing smoothing can exploit its strength in re-estimating low-frequency words and effectively addresses the data sparsity of the training corpus. This method can improve language model performance under limited-corpus conditions.
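To make the low-frequency re-estimation mentioned in the abstract concrete, the following is a minimal sketch of textbook Good-Turing count adjustment on unigrams, r* = (r + 1) · N_{r+1} / N_r, where N_r is the number of word types seen exactly r times. It is not the paper's implementation: the function name `good_turing`, the unigram setting, and the fallback to the raw count when N_{r+1} = 0 are assumptions made for illustration, and the adjusted counts are not renormalized.

```python
from collections import Counter


def good_turing(tokens):
    """Return (adjusted_counts, unseen_mass) for a list of tokens.

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter(tokens)
    total = sum(counts.values())

    # n_r[r] = number of distinct word types that occur exactly r times
    n_r = Counter(counts.values())

    adjusted = {}
    for word, r in counts.items():
        if n_r.get(r + 1, 0) > 0:
            # Good-Turing adjusted count: r* = (r + 1) * N_{r+1} / N_r
            adjusted[word] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            # Assumption: keep the raw count when N_{r+1} is zero.
            # Practical systems usually smooth the N_r values instead.
            adjusted[word] = float(r)

    # Probability mass reserved for unseen words: N_1 / N
    unseen_mass = n_r.get(1, 0) / total
    return adjusted, unseen_mass


if __name__ == "__main__":
    adjusted, unseen = good_turing("a a a b b c c d e f".split())
    print(adjusted, unseen)
```

On a small corpus many word types occur only once, so N_1 is large relative to the total token count and a noticeable share of probability mass is moved to unseen events, which is the property the abstract credits for the method's behavior under limited data.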
Source
Microcomputer Applications (《微型电脑应用》)
2010, No. 12, pp. 18-20, 1 (3 pages in total)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60574063)
Keywords
Natural Language Processing
Limited Corpus
Language Model
Data Sparsity