Abstract
Words are the smallest units of language that are meaningful and can stand on their own, so segmenting continuous Lao text into words is a necessary step. We propose a Lao word segmentation method based on a bidirectional long short-term memory (BLSTM) neural network. The model is trained on a manually segmented corpus of 913,487 words. Word segmentation is recast as a syllable-level sequence labeling task in which each Lao syllable receives one of four tags: word-initial (B), word-medial (M), word-final (E), or single-syllable word (S). Lao sentences are first split into syllables, and the syllables are trained into vectors. These vectors are fed to the BLSTM model, which estimates the tag of each syllable, and a sequence inference algorithm then determines the final tag sequence. Experiments on the manually annotated segmentation corpus show that the method reaches an accuracy of 87.48%, clearly better than earlier word segmentation methods.
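The tagging scheme and the inference step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the sequence inference algorithm, so constrained Viterbi decoding is assumed here, and the function names (`words_to_tags`, `tags_to_words`, `viterbi`), the placeholder syllables, and the per-syllable scores are all hypothetical. In the paper's setting the scores would come from the BLSTM output layer.

```python
import math

TAGS = ("B", "M", "E", "S")

# Which tag may follow which under the B/M/E/S scheme (assumed constraints).
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}
START = ("B", "S")  # a sentence can only start a word
END = ("E", "S")    # a sentence can only end a word


def words_to_tags(words):
    """Turn segmented words (each a list of syllables) into one tag per syllable."""
    tags = []
    for syllables in words:
        n = len(syllables)
        if n == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (n - 2) + ["E"])
    return tags


def tags_to_words(syllables, tags):
    """Rebuild words from syllables and their tags; a word closes on E or S."""
    words, current = [], []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


def viterbi(scores):
    """Pick the highest-scoring tag sequence that respects ALLOWED.

    scores: one dict per syllable mapping each tag to a log-score.
    """
    if not scores:
        return []
    best = [{t: scores[0][t] if t in START else -math.inf for t in TAGS}]
    back = []
    for i in range(1, len(scores)):
        row, ptr = {}, {}
        for t in TAGS:
            # Best predecessor among the tags that may precede t.
            prev = max((p for p in TAGS if t in ALLOWED[p]),
                       key=lambda p: best[-1][p])
            row[t] = best[-1][prev] + scores[i][t]
            ptr[t] = prev
        best.append(row)
        back.append(ptr)
    path = [max(END, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Placeholder syllables: one three-syllable word followed by a one-syllable word.
    print(words_to_tags([["s1", "s2", "s3"], ["s4"]]))  # ['B', 'M', 'E', 'S']
```

Decoding over the whole sequence matters because picking the best tag for each syllable independently can produce an impossible sequence such as B followed by B; the transition constraints rule those out.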
Authors
HE Li (何力)
ZHOU Lan-jiang (周兰江)
ZHOU Feng (周枫)
GUO Jian-yi (郭剑毅)
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Source
Computer Engineering & Science (《计算机工程与科学》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2019, No. 7, pp. 1312-1317 (6 pages)
Funding
National Natural Science Foundation of China (61662040, 61562049)
Keywords
neural network
syllable
bidirectional long short-term memory
Lao word segmentation