On the Complementary Role of DNN Multi-Level Enhancement for Noisy Robust Speaker Recognition in an I-Vector Framework

Xingyu ZHANG  Xia ZOU  Meng SUN  Penglong WU  Yimin WANG  Jun HE  

Publication
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E103-A   No.1   pp.356-360
Publication Date: 2020/01/01
Online ISSN: 1745-1337
DOI: 10.1587/transfun.2019EAL2104
Type of Manuscript: LETTER
Category: Speech and Hearing
Keyword: 
multi-level enhancement,  i-vector,  robust speaker recognition,  deep learning,  weighted Gaussian mixture models,  

Full Text: PDF(494.1KB)>>
Buy this Article




Summary: 
In order to improve the noise robustness of automatic speaker recognition, many techniques on speech/feature enhancement have been explored by using deep neural networks (DNN). In this work, a DNN multi-level enhancement (DNN-ME), which consists of the stages of signal enhancement, cepstrum enhancement and i-vector enhancement, is proposed for text-independent speaker recognition. Given the fact that these enhancement methods are applied in different stages of the speaker recognition pipeline, it is worth exploring the complementary role of these methods, which benefits the understanding of the pros and cons of the enhancements of different stages. In order to use the capabilities of DNN-ME as much as possible, two kinds of methods called Cascaded DNN-ME and joint input of DNNs are studied. Weighted Gaussian mixture models (WGMMs) proposed in our previous work is also applied to further improve the model's performance. Experiments conducted on the Speakers in the Wild (SITW) database have shown that DNN-ME demonstrated significant superiority over the systems with only a single enhancement for noise robust speaker recognition. Compared with the i-vector baseline, the equal error rate (EER) was reduced from 5.75 to 4.01.