Abstract
Considering the tedium of manually extracting acoustic features in conventional speech emotion recognition, this paper proposes a Sinc-Transformer (SincNet Transformer) model that performs speech emotion recognition directly on raw speech signals. The model combines the advantages of a SincNet layer and Transformer encoders: the SincNet filters capture important narrow-band emotional cues from the raw waveform, giving the network a guided front end for shallow feature extraction, while two Transformer encoder layers perform a second stage of processing to extract deep feature vectors that carry global contextual information. On the four-class emotion classification task of the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, the proposed Sinc-Transformer achieves an accuracy of 64.14% and an unweighted average recall of 65.28%. Compared with the baseline models, the proposed model effectively improves speech emotion recognition performance.
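The abstract describes a two-stage architecture: a SincNet band-pass filter layer applied to the raw waveform, followed by two Transformer encoder layers and a four-class emotion classifier. The following is a minimal PyTorch sketch of that pipeline; the hyperparameters (80 filters, kernel size 251, stride 10, 16 kHz input, 4 attention heads, mean pooling) are illustrative assumptions, and the simplified SincConv below conveys the SincNet idea rather than reproducing the authors' exact implementation or training setup on IEMOCAP.

import math
import torch
import torch.nn as nn


class SincConv(nn.Module):
    """Learnable band-pass filters parameterized only by low/high cut-off
    frequencies (the idea behind the SincNet front end)."""

    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        # Cut-off frequencies are the only learnable filter parameters.
        low = torch.linspace(30, sample_rate / 2 - 200, out_channels)   # illustrative init
        band = torch.full((out_channels,), 100.0)
        self.low_hz = nn.Parameter(low.unsqueeze(1))
        self.band_hz = nn.Parameter(band.unsqueeze(1))
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("n", n / sample_rate)                      # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                      # x: (batch, 1, samples)
        low = torch.abs(self.low_hz)
        high = low + torch.abs(self.band_hz)
        t = self.n.unsqueeze(0)                # (1, kernel_size)
        # Band-pass filter = difference of two windowed sinc low-pass filters.
        sinc_high = torch.sinc(2 * high * t) * 2 * high
        sinc_low = torch.sinc(2 * low * t) * 2 * low
        filters = (sinc_high - sinc_low) * self.window
        filters = filters / (filters.abs().max(dim=1, keepdim=True).values + 1e-8)
        return torch.conv1d(x, filters.unsqueeze(1), stride=10)


class SincTransformer(nn.Module):
    """Shallow features from SincConv, deep contextual features from two
    Transformer encoder layers, then a 4-class emotion classifier."""

    def __init__(self, n_classes=4, d_model=80):
        super().__init__()
        self.sinc = SincConv(out_channels=d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # two encoder layers
        self.classifier = nn.Linear(d_model, n_classes)             # four emotions

    def forward(self, wav):                    # wav: (batch, 1, samples)
        feats = self.sinc(wav).transpose(1, 2) # (batch, frames, d_model)
        ctx = self.encoder(feats)              # global context via self-attention
        return self.classifier(ctx.mean(dim=1))


if __name__ == "__main__":
    model = SincTransformer()
    logits = model(torch.randn(2, 1, 16000))   # two 1-second 16 kHz clips
    print(logits.shape)                        # torch.Size([2, 4])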
Authors
YU Jiajia (俞佳佳), JIN Yun (金赟), MA Yong (马勇), JIANG Fangjiao (姜芳艽), DAI Yanyan (戴妍妍)
School of Physics and Electronic Engineering, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China; Kewen College, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China; School of Linguistic Sciences and Arts, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China
Source
Journal of Signal Processing (《信号处理》), 2021, No. 10, pp. 1880-1888 (9 pages)
Indexed in CSCD and the Peking University Core Journals list (北大核心)
Funding
Young Scientists Fund of the National Natural Science Foundation of China (52005267)
Natural Science Foundation of the Jiangsu Higher Education Institutions of China (18KJB510013, 17KJB510018).