Reducing Computation Time of the Rapid Unsupervised Speaker Adaptation Based on HMM-Sufficient Statistics

Randy GOMEZ  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E90-D   No.2   pp.554-561
Publication Date: 2007/02/01
Online ISSN: 1745-1361
DOI: 10.1093/ietisy/e90-d.2.554
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Speech and Hearing
Keywords: HMM-sufficient statistics, unsupervised, rapid adaptation, speech recognition

Summary: 
Real-time speech recognition applications need a fast and reliable adaptation algorithm. We propose a method that reduces the adaptation time of rapid unsupervised speaker adaptation based on HMM-Sufficient Statistics. Using only a single arbitrary utterance without transcription, we select the N-best speakers' Sufficient Statistics, which are created offline, to provide adaptation data for a target speaker. Reducing N shortens adaptation time, but it also degrades recognition performance because too little data remains to adapt the model robustly. Linear interpolation with the global HMM-Sufficient Statistics offsets this negative effect and achieves a 50% reduction in adaptation time without compromising recognition performance. We also compare our method with Vocal Tract Length Normalization (VTLN), Maximum A Posteriori (MAP) adaptation and Maximum Likelihood Linear Regression (MLLR), and we evaluate it in office, car, crowd and booth noise environments at SNRs of 10 dB, 15 dB, 20 dB and 25 dB.
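
The idea summarized above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: a single one-dimensional Gaussian stands in for an HMM state, the per-speaker scores are hypothetical likelihoods of the one adaptation utterance, and the names `GaussianStats`, `select_n_best`, `pool`, `interpolate`, `estimate` and the interpolation weight `weight` are invented. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class GaussianStats:
    """Sufficient statistics of one 1-D Gaussian (illustrative stand-in for an HMM state)."""
    count: float   # occupancy (zeroth-order statistic)
    first: float   # sum of observations (first-order statistic)
    second: float  # sum of squared observations (second-order statistic)


def select_n_best(scores: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n highest-scoring speakers.

    `scores` is assumed to hold one score per training speaker for the
    single untranscribed adaptation utterance.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def pool(stats: Sequence[GaussianStats]) -> GaussianStats:
    """Add up precomputed statistics; no training speech is revisited."""
    return GaussianStats(
        count=sum(s.count for s in stats),
        first=sum(s.first for s in stats),
        second=sum(s.second for s in stats),
    )


def interpolate(local: GaussianStats, global_: GaussianStats, weight: float) -> GaussianStats:
    """Linearly interpolate N-best statistics with global statistics.

    `weight` in [0, 1] is an assumed tuning parameter: 0 keeps only the
    N-best statistics, 1 keeps only the global ones.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    keep = 1.0 - weight
    return GaussianStats(
        count=keep * local.count + weight * global_.count,
        first=keep * local.first + weight * global_.first,
        second=keep * local.second + weight * global_.second,
    )


def estimate(stats: GaussianStats) -> Tuple[float, float]:
    """Re-estimate the Gaussian mean and variance from the statistics."""
    mean = stats.first / stats.count
    variance = stats.second / stats.count - mean * mean
    return mean, variance


# Toy data: four training speakers with statistics computed offline.
speaker_stats = [
    GaussianStats(2.0, 4.0, 10.0),
    GaussianStats(4.0, 4.0, 8.0),
    GaussianStats(1.0, 1.0, 1.0),
    GaussianStats(4.0, 12.0, 40.0),
]
scores = [-10.0, -3.0, -7.0, -1.0]  # hypothetical utterance scores per speaker

best = select_n_best(scores, n=2)
n_best_stats = pool([speaker_stats[i] for i in best])
global_stats = pool(speaker_stats)
adapted = interpolate(n_best_stats, global_stats, weight=0.5)
adapted_mean, adapted_variance = estimate(adapted)
```

In this sketch a smaller `n` means fewer statistics to sum at adaptation time, while the global statistics, which can also be computed offline, compensate for the reduced amount of speaker-specific data.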