Abstract
Tibetan sentence segmentation is one of the difficult problems in Tibetan information processing, and it is an important foundation for tasks such as Tibetan-Chinese machine translation and Tibetan text classification. To resolve the functional ambiguity of Tibetan punctuation marks, this paper proposes an automatic Tibetan sentence segmentation method that combines statistics and rules. Experiments show that the method performs well, with an F1 score above 98%. In the rule-based part, an empirical (heuristic) method first identifies uncertain Tibetan sentences as candidate sentences. Compound-sentence analysis based on conjunctive words then merges clauses to form further (second-round) candidate sentences. Finally, a maximum entropy model segments the further candidate sentences. The empirical method and the compound-sentence analysis effectively address the sparse-corpus and clause problems that the maximum entropy algorithm alone cannot handle.
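The three stages named in the abstract can be pictured with a minimal sketch. It is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the rule set, the conjunctive-word list, or the maximum entropy features, so the connective `AND`, the sentence-final marker `END`, the single `ends_with_final` feature, and all function names below are hypothetical placeholders. The only fact taken from Tibetan orthography is that the shad mark (U+0F0D) is the ambiguous punctuation mark being resolved. The maximum entropy step is written as a two-class logistic regression, which is the binary case of a maximum entropy classifier.

```python
import math

SHAD = "\u0f0d"  # TIBETAN MARK SHAD, the ambiguous clause/sentence mark


def split_candidates(text):
    """Stage 1 (assumed heuristic): every shad is a possible boundary."""
    return [c.strip() for c in text.split(SHAD) if c.strip()]


def merge_clauses(clauses, connectives):
    """Stage 2: join a clause to its successor when it ends with a connective."""
    merged, buffer = [], []
    for clause in clauses:
        buffer.append(clause)
        if not clause.endswith(tuple(connectives)):
            merged.append(SHAD.join(buffer))
            buffer = []
    if buffer:  # trailing clause that still ends with a connective
        merged.append(SHAD.join(buffer))
    return merged


def features(unit, final_markers):
    """Hypothetical features describing the right edge of a candidate unit."""
    last_token = unit.split()[-1]
    return {"bias": 1.0,
            "ends_with_final": 1.0 if last_token in final_markers else 0.0}


def prob(weights, feats):
    """P(boundary | features) under a binary maximum entropy model."""
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(samples, epochs=200, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in samples:
            error = label - prob(weights, feats)
            for name, value in feats.items():
                weights[name] = weights.get(name, 0.0) + lr * error * value
    return weights


def segment(text, connectives, final_markers, weights):
    """Stage 3: close a sentence where the model predicts a true boundary."""
    units = merge_clauses(split_candidates(text), connectives)
    sentences, current = [], []
    for unit in units:
        current.append(unit)
        if prob(weights, features(unit, final_markers)) >= 0.5:
            sentences.append(SHAD.join(current))
            current = []
    if current:
        sentences.append(SHAD.join(current))
    return sentences


# Toy data: Latin placeholders stand in for Tibetan words.
toy_training = [
    ({"bias": 1.0, "ends_with_final": 1.0}, 1),  # true sentence end
    ({"bias": 1.0, "ends_with_final": 0.0}, 0),  # clause-internal shad
]
toy_weights = train_maxent(toy_training)
toy_text = f"w1 AND {SHAD}w2 END {SHAD}w3 {SHAD}w4 END {SHAD}"
toy_sentences = segment(toy_text, ("AND",), {"END"}, toy_weights)
```

In this toy run the clause ending in `AND` is merged with its successor before classification, so the classifier never has to judge that boundary. This is one way to read the abstract's claim that the rule stages handle clause cases the statistical model cannot reach on its own.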
Source
《云南大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2012, No. 6, pp. 653-657, 663 (6 pages in total)
Journal of Yunnan University (Natural Sciences Edition)
Funding
National Natural Science Foundation of China (61032008, 60970071)
Natural Science Foundation of Gansu Province (1107RJZA157)
Keywords
automatic segmentation of Tibetan sentences
analysis of compound sentences
further candidate sentences
maximum entropy