摘要
针对传统身份认证矢量(i-vector)与概率线性判别分析(PLDA)结合的声纹识别模型步骤繁琐、泛化能力较弱等问题,构建了一个基于角度间隔嵌入特征的端到端模型。该模型特别设计了一个深度卷积神经网络,从语音数据的声学特征中提取深度说话人嵌入;选择基于角度改进的A-Softmax作为损失函数,在角度空间中使模型学习到的不同类别特征始终存在角度间隔并且同类特征间聚集更紧密。在公开数据集VoxCeleb2上进行的测试表明,与i-vector结合PLDA的方法相比,该模型在说话人辨认中的Top-1和Top-5上准确率分别提高了58.9%和30%;而在说话人确认中的最小检测代价和等错误率上分别减小了47.9%和45.3%。实验结果验证了所设计的端到端模型更适合在多信道、大规模的语音数据集上学习到有类别区分性的特征。
An end-to-end model with angular interval embedding was constructed to solve the problems of complicated multiple steps and weak generalization ability in the traditional voiceprint recognition model based on the combination of identity vector (i-vector) and Probabilistic Linear Discriminant Analysis (PLDA). A deep convolutional neural network was specially designed to extract deep speaker embedding from the acoustic features of voice data. The Angular Softmax (A-Softmax), which is based on angular improvement, was employed as the loss function to keep the angular interval between the different classes of features learned by the model and make the clustering of the similar features closer in the angle space. Compared with the method combining i-vector and PLDA, it shows that the proposed model has the identification accuracy of Top-1 and Top-5 increased by 58.9% and30% respectively and has the minimum detection cost and equal error rate reduced by 47.9% and 45.3% respectively for speaker verification on the public dataset VoxCeleb2. The results verify that the proposed end-to-end model is more suitable for learning class- discriminating features from multi-channel and large-scale datasets.
作者
王康
董元菲
WANG Kang;DONG Yuanfei(Nanjing Fiber Home World Communication Technology Company Limited, Nanjing Jiangsu 210019, China;Wuhan Research Institute of Posts and Telecommunications, Wuhan Hubei 430074, China)
出处
《计算机应用》
CSCD
北大核心
2019年第10期2937-2941,共5页
journal of Computer Applications
基金
国家重点研发计划项目(2017YFB1400704)~~
关键词
声纹识别
端到端模型
损失函数
卷积神经网络
深度说话人嵌入
voiceprint recognition
end-to-endmodel
loss function
convolutional neuralnetwork
deep speaker embedding