Abstract
To address the imbalanced distribution of samples in speech emotion recognition (SER) datasets, this study proposes an SER method that combines data balancing and an attention mechanism with a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) units. The method first extracts log-Mel spectrograms from the speech samples in an emotion dataset and divides them into segments according to the sample distribution, so that the data can be balanced. A pre-trained CNN model is then fine-tuned on the segmented Mel-spectrogram dataset to learn high-level segment-level speech features. Next, because different segments of an utterance contribute differently to emotion recognition, the learned segment-level CNN features are fed into an LSTM with an attention mechanism to learn discriminative features, and the LSTM is combined with a Softmax layer to classify speech emotions. Experimental results on the BAUM-1s and CHEAVD2.0 datasets show that the proposed method effectively improves speech emotion recognition performance.
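The abstract does not give the exact segmentation rule or attention formulation, so the following is only a minimal sketch under stated assumptions. It assumes that balancing is done by giving rarer emotion classes a smaller hop between segments (so they yield more overlapping segments), and that attention is a softmax-weighted sum of per-segment feature vectors scored against a query vector. The names `hop_for_class`, `split_into_segments`, `attention_pool`, and `query` are hypothetical and do not come from the paper.

```python
import math


def hop_for_class(class_count, max_count, segment_len):
    """Assumed balancing rule: the hop shrinks in proportion to how rare a
    class is, so minority classes produce more overlapping segments."""
    hop = int(segment_len * class_count / max_count)
    return max(1, min(segment_len, hop))


def split_into_segments(frames, segment_len, hop):
    """Slice a sequence of spectrogram frames into fixed-length segments.
    Returns an empty list when the utterance is shorter than one segment."""
    last_start = len(frames) - segment_len
    return [frames[s:s + segment_len] for s in range(0, last_start + 1, hop)]


def attention_pool(segment_features, query):
    """Assumed attention form: dot-product scores against a query vector,
    softmax over segments, then a weighted sum of the feature vectors."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in segment_features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(segment_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, segment_features))
        for d in range(dim)
    ]
    return pooled, weights
```

Under this assumed rule, with a segment length of 4 frames, a class holding half as many samples as the largest class gets a hop of 2, so a 10-frame utterance yields four segments instead of two. In the method described above, the per-segment vectors passed to the attention step would be LSTM outputs computed from the fine-tuned CNN features rather than raw values.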
Authors
陈港
张石清
赵小明
CHEN Gang; ZHANG Shi-Qing; ZHAO Xiao-Ming (Faculty of Mechanical Engineering & Automation, Zhejiang Sci-Tech University, Hangzhou 310018, China; Institute of Intelligent Information Processing, Taizhou University, Taizhou 318000, China)
Source
《计算机系统应用》
2021, No. 5, pp. 269-275 (7 pages)
Computer Systems & Applications
Funding
National Natural Science Foundation of China (61976149)
Natural Science Foundation of Zhejiang Province (LZ20F020002)
Keywords
Convolutional Neural Network (CNN)
Long Short-Term Memory (LSTM) unit
attention mechanism
speech emotion recognition