A Speech Intelligibility Estimation Method Using a Non-reference Feature Set

Toshihiro SAKANO  Yosuke KOBAYASHI  Kazuhiro KONDO  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E98-D   No.1   pp.21-28
Publication Date: 2015/01/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2014MUP0004
Type of Manuscript: Special Section PAPER (Special Section on Enriched Multimedia)
Keyword: speech intelligibility, non-reference estimation, support vector regression, P.563, diagnostic rhyme test

Summary: 
We proposed and evaluated a speech intelligibility estimation method that does not require a clean speech reference signal. The proposed method uses the features defined in the ITU-T standard P.563, which estimates the overall quality of speech without a reference signal. We selected two sets of features from the P.563 features: the basic 9-feature set, which includes basic features that characterize both speech and background noise, e.g., cepstrum skewness and LPC kurtosis, and the extended 31-feature set, which adds 22 features for a more accurate description of the degraded speech and noise, e.g., SNR, average pitch, and spectral clarity. Four hundred noise samples were added to speech, and about 70% of these samples were used to train a support vector regression (SVR) model. The trained models were used to estimate the intelligibility of speech degraded by added noise. The proposed method showed a root mean square error (RMSE) of about 10% and a correlation with subjective intelligibility of about 0.93 for speech distorted with known noise types, and an RMSE of about 16% and a correlation of about 0.84 for speech distorted with unknown noise types, with either the 9- or the 31-dimension feature set. This accuracy was higher than that of estimation using frequency-weighted SNR calculated in critical frequency bands, which requires the clean reference signal for its calculation. We believe this level of accuracy makes the proposed method applicable to real-time speech quality monitoring in the field.
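
The two figures of merit quoted in the summary, RMSE and correlation with subjective intelligibility, can be illustrated with a minimal Python sketch. It assumes the correlation is the Pearson coefficient and that intelligibility is scored in percent; the summary states neither explicitly. The function names `rmse` and `pearson` and the score values are hypothetical and are not data from the paper.

```python
import math


def rmse(estimated, subjective):
    """Root mean square error between two equal-length lists of scores."""
    if len(estimated) != len(subjective) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - s) ** 2 for e, s in zip(estimated, subjective)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(estimated, subjective):
    """Pearson correlation coefficient (assumed here; see lead-in)."""
    if len(estimated) != len(subjective) or len(estimated) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(estimated)
    mean_e = sum(estimated) / n
    mean_s = sum(subjective) / n
    cov = sum((e - mean_e) * (s - mean_s) for e, s in zip(estimated, subjective))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    if var_e == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_e * var_s)


# Hypothetical intelligibility scores in percent (illustration only).
subjective = [90.0, 75.0, 60.0, 40.0]  # e.g. listening-test results
estimated = [85.0, 80.0, 55.0, 45.0]   # e.g. regression model output

print(f"RMSE: {rmse(estimated, subjective):.2f} %")
print(f"Correlation: {pearson(estimated, subjective):.3f}")
```

A lower RMSE and a correlation closer to 1 both indicate that the estimates track the subjective scores more closely, which is the sense in which the two feature sets are compared with the frequency-weighted SNR baseline.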