Speech Analysis Method Based on Source-Filter Model Using Multivariate Empirical Mode Decomposition


IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E99-A   No.10   pp.1762-1773
Publication Date: 2016/10/01
Online ISSN: 1745-1337
DOI: 10.1587/transfun.E99.A.1762
Type of Manuscript: PAPER
Category: Speech and Hearing
Keywords: multivariate empirical mode decomposition, speech analysis, fundamental frequency, formant frequency

We propose a speech analysis method based on the source-filter model using multivariate empirical mode decomposition (MEMD). The proposed method takes multiple adjacent frames of a speech signal into account by combining their log spectra into a multivariate signal. The multivariate signal is then decomposed into intrinsic mode functions (IMFs), which are divided into two groups using the peak of the autocorrelation function (ACF) of an IMF. The first group, characterized by a spectral fine structure, is used to estimate the fundamental frequency F0 by means of the ACF, whereas the second group, characterized by the frequency response of the vocal-tract filter, is used to estimate formant frequencies by means of a peak-picking technique. Using MEMD has two advantages: (i) the variation in the number of IMFs that occurs with single-frame empirical mode decomposition is eliminated, and (ii) the information common to the adjacent frames is aligned in IMFs of the same order because of the common mode alignment property of MEMD. These advantages make the analysis more accurate than with other methods. In contrast to the conventional linear prediction (LP) and cepstrum methods, which depend on the choice of LP order and cut-off frequency, respectively, the proposed method automatically separates the glottal source from the vocal-tract filter. The results showed that the proposed method exhibits the highest accuracy of F0 estimation among the methods compared and correctly estimates the formant frequencies of the vocal-tract filter.
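
The F0 branch of the pipeline described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. MEMD itself is not available in NumPy, so a moving-average envelope subtraction is used here as a stand-in for selecting the fine-structure IMF group, and summing the ACFs over frames is likewise an assumption. All function names, the synthetic test signal, and the parameter values (frame length, hop, smoothing width, F0 search range) are hypothetical choices made for illustration.

```python
import numpy as np

def synth_harmonics(fs, f0, dur, seed=0):
    """Synthetic voiced-like signal: harmonics of f0 with 1/k tilt plus weak noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * dur)) / fs
    n_harm = int(np.ceil(fs / 2 / f0)) - 1          # keep harmonics below Nyquist
    x = sum(np.cos(2 * np.pi * k * f0 * t) / k for k in range(1, n_harm + 1))
    return x + 0.01 * rng.standard_normal(t.size)

def stack_log_spectra(x, frame_len, hop, n_frames, n_fft):
    """Combine log spectra of adjacent frames into one multivariate signal.

    Returns an array of shape (n_frames, n_fft // 2 + 1): one channel per frame.
    """
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) + 1e-10)

def remove_envelope(log_spec, width):
    """Stand-in for the fine-structure IMF group (NOT MEMD): subtract a
    moving average taken along the frequency axis. `width` must be odd."""
    half = width // 2
    kernel = np.ones(width) / width
    padded = np.pad(log_spec, ((0, 0), (half, half)), mode="edge")
    smooth = np.stack([np.convolve(row, kernel, mode="valid") for row in padded])
    return log_spec - smooth

def estimate_f0(fine, fs, n_fft, f0_min=60.0, f0_max=400.0):
    """Estimate F0 from the ACF of the fine structure along the frequency axis.

    Harmonics are spaced F0 apart in the spectrum, so the ACF peaks at a lag
    of F0 / bin_hz bins. ACFs are summed over frames (an assumption).
    """
    bin_hz = fs / n_fft
    lo = int(np.ceil(f0_min / bin_hz))
    hi = int(np.floor(f0_max / bin_hz))
    acf = np.zeros(fine.shape[1])
    for row in fine:
        row = row - row.mean()
        acf += np.correlate(row, row, mode="full")[row.size - 1:]
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return lag * bin_hz

# Demo on a synthetic signal with a known F0 of 125 Hz.
fs, n_fft = 16000, 2048
x = synth_harmonics(fs, 125.0, 0.1)
multivariate = stack_log_spectra(x, frame_len=1024, hop=160, n_frames=3, n_fft=n_fft)
fine = remove_envelope(multivariate, width=65)
f0_est = estimate_f0(fine, fs, n_fft)
print(f"estimated F0: {f0_est:.1f} Hz")
```

On this synthetic input the estimate is expected to land within one FFT bin (about 7.8 Hz) of the 125 Hz used for synthesis. The smoothed component that `remove_envelope` subtracts plays the role of the second IMF group here; peak picking on it would be the analogous route to formant estimates, although this test signal has no formant structure to find.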