Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions

Longbiao WANG  Kazue MINAMI  Kazumasa YAMAMOTO  Seiichi NAKAGAWA  

IEICE TRANSACTIONS on Information and Systems   Vol.E93-D   No.9   pp.2397-2406
Publication Date: 2010/09/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E93.D.2397
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Processing Natural Speech Variability for Improved Verbal Human-Computer Interaction)
Category: Speaker Recognition
speaker identification,  phase information,  MFCC,  noisy environment,  GMM,  

Full Text: PDF(856.9KB)>>
Buy this Article

In this paper, we investigate the effectiveness of phase for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost speaker recognition methods are based on MFCCs even in noisy conditions. For MFCCs which dominantly capture vocal tract information, only the magnitude of the Fourier Transform of time-domain speech frames is used and phase information has been ignored. High complement of the phase information and MFCCs is expected because the phase information includes rich voice source information. Furthermore, some researches have reported that phase based feature was robust to noise. In our previous study, a phase information extraction method that normalizes the change variation in the phase depending on the clipping position of the input speech was proposed, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method skipping frames with low energy/Signal-to-Noise (SN) and noisy speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database added with stationary/non-stationary noise were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. The individual result of the phase information was even better than that of MFCCs in many cases by clean speech training models. By deleting unreliable frames (frames having low energy/SN), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.