Abstract
In Tibetan information processing, both syntactic and semantic tasks take the word as their basic unit: parsing, sentence comprehension, automatic summarization, automatic classification, and machine translation are all carried out at the word level after segmentation. Tibetan word segmentation is therefore the foundation of Tibetan information processing. Through a study of abbreviated words in automatic Tibetan word segmentation, this paper proposes, for the first time, a scheme for identifying them, the restoration method, and gives a restoration algorithm. The basic idea is to use the attachment rules of Tibetan abbreviated words to restore the original text so that it can be segmented. The algorithm has been applied in a National Language Committee project undertaken by the author. In tests on an 850,000-byte Tibetan corpus, abbreviated words were identified with an accuracy of 99.83%.
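The restoration idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the particle inventory `CONTRACTED_PARTICLES`, the toy `LEXICON`, and the function names `restore_syllable` and `restore_text` are all assumptions made for illustration. The sketch assumes that an abbreviated word is a syllable in which a particle has been attached directly to the preceding word without the tsheg separator, and that splitting the particle off restores the original units.

```python
# Hypothetical inventory of particles that can attach to a preceding
# syllable without a tsheg. Illustrative only; not the paper's rule set.
CONTRACTED_PARTICLES = ("འི", "འམ", "འང", "འོ", "ར", "ས")

TSHEG = "\u0f0b"  # Tibetan syllable separator

# Toy lexicon, assumed for this example only.
LEXICON = {"ང", "ཁོ", "དེབ", "ལས"}


def restore_syllable(syllable, lexicon):
    """Split one syllable into [stem, particle] when it looks contracted."""
    # A syllable that is itself a known word is left intact, so a word
    # whose spelling merely ends in a particle-like letter is not split.
    if syllable in lexicon:
        return [syllable]
    for particle in CONTRACTED_PARTICLES:
        stem = syllable[: -len(particle)]
        if syllable.endswith(particle) and stem in lexicon:
            return [stem, particle]
    # No rule applies: return the syllable unchanged.
    return [syllable]


def restore_text(text, lexicon):
    """Split text on the tsheg and restore every contracted syllable."""
    units = []
    for syllable in text.split(TSHEG):
        if syllable:
            units.extend(restore_syllable(syllable, lexicon))
    return units


print(restore_text("ངའི་དེབ", LEXICON))
```

With the toy lexicon, the first syllable is split into a stem and the attached particle, while the second syllable passes through unchanged. Checking the whole syllable against the lexicon first is one simple way to handle syllables that only appear to end in a particle. How the paper resolves such ambiguity is not stated in the abstract.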
Source
《中文信息学报》
CSCD
Peking University Core Journal
2009, No. 1, pp. 35-37, 43 (4 pages)
Journal of Chinese Information Processing
Funding
Supported by a National Language Committee project (MZ05-118)
Keywords
computer application
Chinese information processing
abbreviated word
Tibetan word segmentation
restoration method
case-auxiliary word