Abstract
The conventional approach to data augmentation for language models, based on maximum likelihood estimation (MLE), suffers from the exposure bias problem, which causes the generated text to lack long-term semantic information. We propose a novel data augmentation approach based on adversarial training, which uses a convolutional neural network as a discriminator to guide the training of a recurrent neural network based generative model. Data augmentation for language models is essentially a discrete sequence generation problem. When the outputs of the generative model are discrete, the backpropagation algorithm cannot propagate the discriminator's error gradient back to the generative model. To deal with this problem, we treat the generative model as a stochastic policy in reinforcement learning and optimize it using rewards from the discriminator. Since the discriminator can only evaluate complete sequences, we evaluate intermediate states by Monte Carlo search. Experiments on rescoring the n-best lists of speech recognition outputs show that, under limited-text conditions, the proposed approach achieves a lower character error rate (CER) as the training corpus grows, and always outperforms the MLE-based approach. When the training corpus reaches 6 million words, the proposed approach yields a relative CER reduction of 5.0% on the THCHS-30 dataset and 7.1% on the AISHELL dataset compared with the baseline.
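The training loop the abstract describes, scoring intermediate states of a partial sequence by Monte Carlo rollout under the current policy and using the discriminator's output as the reward, can be sketched minimally as follows. The toy vocabulary, the `discriminator` function, and the `random_policy` here are illustrative stand-ins, not the paper's actual CNN/RNN models:

```python
import random

# Illustrative toy setup (not the paper's models): a tiny vocabulary and a
# stand-in "discriminator" that scores how plausible a sequence looks.
VOCAB = ["a", "b"]
SEQ_LEN = 4

def discriminator(seq):
    """Stand-in for the CNN discriminator: returns a score in [0, 1].
    Here, "real" sequences are those that alternate tokens."""
    changes = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    return changes / max(len(seq) - 1, 1)

def random_policy(prefix):
    """Stand-in for the RNN generator's sampling step."""
    return random.choice(VOCAB)

def rollout_reward(prefix, policy, n_rollouts=16):
    """Monte Carlo search: complete the partial sequence `prefix` n times
    under the current policy, and average the discriminator's scores on the
    completed sequences. This average serves as the reward for the
    intermediate state, which a policy-gradient update would then use."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < SEQ_LEN:
            seq.append(policy(seq))
        total += discriminator(seq)
    return total / n_rollouts

random.seed(0)
# Estimate the reward for the intermediate state ["a"].
r = rollout_reward(["a"], random_policy)
```

In the full method, this per-state reward would weight the log-probability of each generated token in a REINFORCE-style gradient update of the generator.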
Authors
张一珂
张鹏远
颜永红
ZHANG Yi-Ke; ZHANG Peng-Yuan; YAN Yong-Hong (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049; Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumchi 830011)
Source
《自动化学报》
EI
CSCD
Peking University Core Journal Index (北大核心)
2018, Issue 5, pp. 891-900 (10 pages)
Acta Automatica Sinica
Funding
National Natural Science Foundation of China (11590770-4, U1536117, 11504406, 11461141004)
National Key Research and Development Program of China (2016YFB0801203, 2016YFB0801200)
Science and Technology Major Project of Xinjiang Uygur Autonomous Region (2016A03007-1)
Keywords
Data augmentation
language modeling
generative adversarial nets (GAN)
reinforcement learning
speech recognition