Improved Reference Speaker Weighting Using Aspect Model

Seong-Jun HAHM, Yuichi OHKAWA, Masashi ITO, Motoyuki SUZUKI, Akinori ITO, Shozo MAKINO

IEICE TRANSACTIONS on Information and Systems, Vol.E93-D, No.7, pp.1927-1935
Publication Date: 2010/07/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E93.D.1927
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Speech and Hearing
Keywords: speaker adaptation, aspect model, reference speaker weighting, latent reference model

We propose an improved reference speaker weighting (RSW) and speaker cluster weighting (SCW) approach that uses an aspect model. The concept of the approach is that the adapted model is a linear combination of a few latent reference models obtained from a set of reference speakers. The latent space of the aspect model has characteristics that differ from those of the orthogonal basis vectors used in eigenvoice. The aspect model is a "mixture-of-mixture" model. We first calculate a small number of latent reference models as mixtures of the distributions of the reference speakers' models, and then mix the latent reference models to obtain the adapted distribution. The mixture weights are estimated with the expectation-maximization (EM) algorithm, and the resulting weights are used to interpolate the mean parameters of the distributions. Both training and adaptation are performed by likelihood maximization with respect to the training and adaptation data, respectively. We conducted a continuous speech recognition experiment using a Korean database (KAIST-TRADE) and compared the results with those of conventional MAP, MLLR, RSW, eigenvoice, and SCW adaptation. The proposed method achieved an absolute word accuracy improvement of 2.06 points, even though only 0.3 s of adaptation data was used.
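The adaptation step described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: each reference speaker is reduced to a single diagonal-covariance Gaussian (the paper works with HMM state distributions), the latent reference models P(r|k) are assumed to be already trained and held fixed, and only the EM update of the latent-model weights P(k) in p(x) = sum_k P(k) sum_r P(r|k) N(x; mu_r, var_r) and the subsequent mean interpolation are shown. The function names (adapt_latent_weights, adapted_mean) and the toy data are hypothetical.

```python
import math
from typing import List, Sequence

# Assumption for illustration: each reference speaker r is one diagonal
# Gaussian N(x; mu_r, var_r), not a full HMM as in the paper.


def log_gauss_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def safe_log(p: float) -> float:
    """log(p), returning -inf for zero weights instead of raising."""
    return math.log(p) if p > 0.0 else -math.inf


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))); needs at least one finite value."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def adapt_latent_weights(frames, ref_means, ref_vars, p_r_given_k, n_iter: int = 20) -> List[float]:
    """EM estimate of P(k); the latent reference models P(r|k) stay fixed."""
    n_latent = len(p_r_given_k)
    # log p(x_t | k) = log sum_r P(r|k) N(x_t; mu_r, var_r) does not depend on P(k).
    latent_ll = []
    for x in frames:
        ref_ll = [log_gauss_diag(x, m, v) for m, v in zip(ref_means, ref_vars)]
        latent_ll.append([
            logsumexp([safe_log(w) + ll for w, ll in zip(row, ref_ll)])
            for row in p_r_given_k
        ])
    p_k = [1.0 / n_latent] * n_latent  # uniform initialization
    for _ in range(n_iter):
        counts = [0.0] * n_latent
        for lat in latent_ll:
            # E-step: posterior P(k | x_t) under the current P(k).
            joint = [safe_log(pk) + ll for pk, ll in zip(p_k, lat)]
            norm = logsumexp(joint)
            for k in range(n_latent):
                counts[k] += math.exp(joint[k] - norm)
        # M-step: P(k) is the average posterior over the adaptation frames.
        p_k = [c / len(frames) for c in counts]
    return p_k


def adapted_mean(p_k, p_r_given_k, ref_means) -> List[float]:
    """Interpolate reference means with effective weights sum_k P(k) P(r|k)."""
    dim = len(ref_means[0])
    mean = [0.0] * dim
    for pk, row in zip(p_k, p_r_given_k):
        for w, mu in zip(row, ref_means):
            for d in range(dim):
                mean[d] += pk * w * mu[d]
    return mean


# Toy example: 3 reference speakers in 2-D, 2 latent reference models.
ref_means = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
ref_vars = [[1.0, 1.0]] * 3
p_r_given_k = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
frames = [[0.1, 3.9], [-0.2, 4.1], [0.0, 4.0]]

weights = adapt_latent_weights(frames, ref_means, ref_vars, p_r_given_k)
mean = adapted_mean(weights, p_r_given_k, ref_means)
print(weights, mean)
```

Because each row P(r|k) sums to one, the effective reference speaker weights sum_k P(k)P(r|k) also sum to one, so the adapted mean is a convex combination of reference speaker means, in the spirit of RSW. In the toy data the frames lie near the third speaker, so almost all weight moves to the second latent model and the mean lands close to (0.4, 3.6).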