Abstract
Traditional word segmentation splits a Uyghur semantic word (a multi-word association pattern) into several fragments whose meanings do not match that of the whole word. This causes many problems in Uyghur text analysis and processing and seriously reduces processing efficiency. This paper proposes a new concept, word grouping for Uyghur text, which uses mutual information to measure the degree of association between adjacent words. Two adaptive word grouping algorithms are implemented, one based on a segment-based strategy and the other on an incremental strategy, and the resulting vocabularies are compared with the vocabulary produced by traditional segmentation. Experimental results show that the word grouping algorithms extract semantic words from text very effectively: on a large-scale text collection, the two algorithms reach grouping accuracies of 84.31% and 88.24%, respectively.
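The abstract does not spell out the segment-based or incremental algorithms, so the following is only a minimal sketch of the underlying idea: score each pair of adjacent words by pointwise mutual information (PMI) and join neighbours whose score clears a threshold. The function names `adjacent_pmi` and `group_words`, the fixed `threshold` parameter, and the greedy left-to-right joining are all illustrative assumptions, not the paper's method.

```python
import math
from collections import Counter


def adjacent_pmi(sentences):
    """Return PMI (base 2) for every adjacent word pair in tokenized sentences.

    Hypothetical helper: probabilities are plain relative frequencies,
    with no smoothing.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def group_words(words, pmi, threshold):
    """Greedily join adjacent words whose PMI is at least `threshold`.

    Assumption: a single fixed threshold; the paper's adaptive
    strategies are not reproduced here.
    """
    if not words:
        return []
    groups = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if pmi.get((prev, cur), float("-inf")) >= threshold:
            groups[-1].append(cur)   # strong association: extend current group
        else:
            groups.append([cur])     # weak association: start a new group
    return [" ".join(g) for g in groups]
```

In practice the input would be Uyghur sentences already split on whitespace, and the threshold would need tuning on held-out data; pairs never seen in the corpus are treated as unassociated.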
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal
2013, Issue 2, pp. 429-431, 435 (4 pages)
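Funding
National Natural Science Foundation of China (61063022, 61262062, 61163033, 61163032)
Program for New Century Excellent Talents in University, Ministry of Education of China (NCET-10-0969)
High-Tech Research and Development Program of the Xinjiang Uyghur Autonomous Region (201212124)
Open Project of the Xinjiang Key Laboratory of Multilingual Information Technology (XJDX0905)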
Keywords
Uyghur text
traditional segmentation
semantic word
mutual information
word grouping