Fund: the National Natural Science Foundation of China (U2003207, 61902064); the Jiangsu Frontier Technology Basic Research Project (BK20192004).
Abstract: Background One of the most critical issues in human-computer interaction applications is recognizing human emotions based on speech. In recent years, the challenging problem of cross-corpus speech emotion recognition (SER) has generated extensive research. Nevertheless, the domain discrepancy between training data and testing data remains a major obstacle to improved system performance. Methods This paper introduces a novel multi-scale discrepancy adversarial (MSDA) network that performs domain adaptation at multiple timescales for cross-corpus SER, i.e., it integrates domain discriminators at hierarchical levels into the emotion recognition framework to mitigate the gap between the source and target domains. Specifically, we extract two kinds of speech features, i.e., handcrafted features and deep features, at three timescales: global, local, and hybrid. At each timescale, the domain discriminator and the feature extractor compete against each other, so the extractor learns features that minimize the discrepancy between the two domains by fooling the discriminator. Results Extensive experiments on cross-corpus and cross-language SER were conducted on a combined dataset comprising one Chinese dataset and two English datasets commonly used in SER. The MSDA benefits from the strong discriminative power provided by the adversarial process, in which three discriminators work in tandem with an emotion classifier. Accordingly, the MSDA achieves the best performance among all baseline methods. Conclusions The proposed architecture was tested on a combination of one Chinese and two English datasets. The experimental results demonstrate the superiority of our discriminative model for solving cross-corpus SER.
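The abstract describes the adversarial component only in prose. As a minimal sketch of the mechanism it names (one domain discriminator per timescale, trained adversarially against the feature extractor), the PyTorch code below uses a gradient-reversal layer; the feature dimensions, hidden sizes, and loss weighting are illustrative assumptions, not the authors' published configuration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in the
    backward pass, so the upstream feature extractor is trained to fool the
    domain discriminator while the discriminator itself trains normally."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class DomainDiscriminator(nn.Module):
    """Binary classifier: source domain (label 0) vs. target domain (label 1)."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, feats, lambd=1.0):
        return self.net(grad_reverse(feats, lambd))

# One discriminator per timescale (global / local / hybrid); the feature
# dimensions here are assumed, not taken from the paper.
feat_dims = {"global": 256, "local": 256, "hybrid": 512}
discriminators = nn.ModuleDict(
    {scale: DomainDiscriminator(d) for scale, d in feat_dims.items()}
)

def adversarial_loss(features, domain_labels, lambd=1.0):
    """Sum of domain-classification losses over the three timescales.
    `features` maps each timescale name to a (batch, feat_dim) tensor."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(discriminators[s](f, lambd), domain_labels)
               for s, f in features.items())
```

In training, this adversarial loss would be added to the emotion classifier's cross-entropy on labeled source data, so the shared features become both emotion-discriminative and domain-invariant.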
Fund: the National Natural Science Foundation of China (No. 61273266, 61231002, 61301219, 61375028); the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20110092130004); the Natural Science Foundation of Shandong Province (No. ZR2014FQ016)
Abstract: To solve the problem of mismatched features across experimental databases, a key issue in cross-corpus speech emotion recognition, an auditory attention model based on the Chirplet is proposed for feature extraction. First, the auditory attention model is employed to detect variational emotion features in the spectrum. Then, a selective attention mechanism model is proposed to extract the salient gist features, which show their relation to the expected performance in cross-corpus testing. Furthermore, Chirplet time-frequency atoms are introduced into the model. By forming a complete atom database, the Chirplet enriches spectral feature extraction and increases the amount of information captured. Samples from multiple databases exhibit multi-component characteristics; the Chirplet accordingly expands the scale of the feature vector in the time-frequency domain. Experimental results show that, compared with the traditional feature model, the proposed feature extraction approach with a prototypical classifier yields a significant improvement in cross-corpus speech emotion recognition. In addition, the proposed method is more robust to inconsistent sources of the training and testing sets.
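For readers unfamiliar with Chirplet atoms, the sketch below builds a small dictionary of Gaussian-windowed linear-chirp atoms and projects a speech frame onto it to obtain time-frequency coefficients. The sampling rate and the grid of center times, frequencies, and chirp rates are illustrative assumptions, not the paper's complete atom database.

```python
import numpy as np

def chirplet_atom(n, fs, t_c, f_0, chirp_rate, sigma):
    """Gaussian-windowed linear-chirp atom of length n at sampling rate fs:
    envelope centered at t_c (s), start frequency f_0 (Hz),
    frequency sweep chirp_rate (Hz/s), envelope width sigma (s)."""
    t = np.arange(n) / fs
    envelope = np.exp(-0.5 * ((t - t_c) / sigma) ** 2)
    # Instantaneous frequency sweeps linearly: f(t) = f_0 + chirp_rate * (t - t_c)
    phase = 2 * np.pi * (f_0 * (t - t_c) + 0.5 * chirp_rate * (t - t_c) ** 2)
    atom = envelope * np.cos(phase)
    norm = np.linalg.norm(atom)
    return atom / norm if norm > 0 else atom

# A small dictionary over center times, start frequencies, and chirp rates
# (grid values are illustrative only).
fs, n = 16000, 1024
atoms = [chirplet_atom(n, fs, t_c, f_0, c, sigma=0.01)
         for t_c in (0.016, 0.032, 0.048)
         for f_0 in (250.0, 500.0, 1000.0, 2000.0)
         for c in (-5000.0, 0.0, 5000.0)]
dictionary = np.stack(atoms)          # shape: (num_atoms, n)

# Projecting a frame onto the dictionary yields chirplet coefficients that
# can extend a conventional spectral feature vector.
frame = np.random.randn(n)            # stand-in for a windowed speech frame
coeffs = dictionary @ frame           # shape: (num_atoms,)
```

Because each atom carries a chirp rate in addition to time and frequency, the coefficients capture frequency sweeps that a fixed-frequency (Gabor) dictionary would smear, which is the sense in which the Chirplet expands the feature vector's time-frequency scale.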
Abstract: To address the discrepancy in data distributions between corpora, a cross-corpus speech emotion recognition algorithm based on deep autoencoders with subdomain adaptation is proposed. First, the algorithm uses two deep autoencoders to obtain highly representative low-dimensional emotion features for the source and target domains, respectively. Then, a subdomain adaptation module based on LMMD (local maximum mean discrepancy) aligns the feature distributions of the source and target domains within each low-dimensional emotion-class subspace. Finally, the model is trained in a supervised manner on labeled source-domain data. In the cross-corpus scheme with the eNTERFACE corpus as the source domain and the Berlin corpus as the target domain, the proposed algorithm improves cross-corpus recognition accuracy by 5.26% to 19.73% over the other algorithms; with Berlin as the source domain and eNTERFACE as the target domain, it improves accuracy by 7.34% to 8.18%. The proposed method therefore effectively extracts emotion features shared across corpora and improves cross-corpus speech emotion recognition performance.
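As a rough sketch of the LMMD-based subdomain alignment (not the authors' implementation), the code below computes a class-weighted MMD between source and target features with a single-bandwidth Gaussian kernel; source class weights come from one-hot labels and target class weights from classifier pseudo-label probabilities, following the usual LMMD construction. Variable names and the kernel bandwidth are assumptions.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix between the rows of x and y."""
    dist_sq = torch.cdist(x, y) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))

def lmmd(source, target, src_probs, tgt_probs, sigma=1.0):
    """Local MMD: class-weighted MMD averaged over emotion classes.

    source, target : (n_s, d) and (n_t, d) low-dimensional features
    src_probs      : (n_s, C) one-hot labels for the source domain
    tgt_probs      : (n_t, C) softmax pseudo-label probabilities for the target
    """
    # Per-class weights, normalized so each class sums to 1 within its domain
    w_s = src_probs / (src_probs.sum(dim=0, keepdim=True) + 1e-8)  # (n_s, C)
    w_t = tgt_probs / (tgt_probs.sum(dim=0, keepdim=True) + 1e-8)  # (n_t, C)

    k_ss = gaussian_kernel(source, source, sigma)
    k_tt = gaussian_kernel(target, target, sigma)
    k_st = gaussian_kernel(source, target, sigma)

    # Per class c: w_s[:,c]^T K_ss w_s[:,c] + w_t[:,c]^T K_tt w_t[:,c]
    #              - 2 * w_s[:,c]^T K_st w_t[:,c]
    loss = (torch.einsum('ic,ij,jc->c', w_s, k_ss, w_s)
            + torch.einsum('ic,ij,jc->c', w_t, k_tt, w_t)
            - 2 * torch.einsum('ic,ij,jc->c', w_s, k_st, w_t))
    return loss.mean()
```

In training, this loss would be added to the supervised cross-entropy on the labeled source data with a trade-off coefficient, pulling the per-class (subdomain) feature distributions of the two corpora together rather than only their global means.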