Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training

Junichi YAMAGISHI, Takao KOBAYASHI

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E90-D   No.2   pp.533-543
Publication Date: 2007/02/01
Online ISSN: 1745-1361
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Speech and Hearing
Keyword: 
HMM-based speech synthesis, speaker adaptation, speaker adaptive training (SAT), hidden semi-Markov model (HSMM), maximum likelihood linear regression (MLLR), voice conversion

Summary: 
In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. To adapt spectrum, F0, and phone duration simultaneously within the HMM framework, we need to transform not only the state output distributions, which correspond to spectrum and F0, but also the duration distributions, which correspond to phone duration. However, adapting the state duration is not straightforward because the original HMM does not have explicit duration distributions. We therefore use the framework of the hidden semi-Markov model (HSMM), an HMM with explicit state duration distributions, and apply an HSMM-based model adaptation algorithm to transform the state output and state duration distributions simultaneously. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system and show their effectiveness through subjective and objective evaluation tests.
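
The idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, for illustration only, that each HSMM state holds one Gaussian output mean and one Gaussian duration mean, and that adaptation applies an MLLR-style affine transform to each mean. The names `HsmmState`, `geometric_duration_pmf`, and `adapt_state`, and all parameter values, are hypothetical. The first function shows why a standard HMM has no duration parameters to adapt: its state duration is only implied by the self-transition probability.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HsmmState:
    """Hypothetical HSMM state with explicit output and duration means."""
    output_mean: List[float]  # mean of the state output Gaussian (e.g. spectrum, log F0)
    duration_mean: float      # mean of the state duration Gaussian, in frames


def geometric_duration_pmf(self_loop_prob: float, d: int) -> float:
    """Implicit duration probability of a standard HMM state.

    Staying d frames means d - 1 self-transitions followed by one exit,
    so P(d) = a^(d-1) * (1 - a). There is no separate duration
    distribution here that an adaptation algorithm could transform.
    """
    return self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)


def adapt_state(state: HsmmState,
                A: List[List[float]], b: List[float],
                scale: float, shift: float) -> HsmmState:
    """Apply affine transforms to both means (assumed MLLR-style form).

    Output mean:   mu' = A @ mu + b
    Duration mean: m'  = scale * m + shift
    """
    new_output = [
        sum(a_ij * mu_j for a_ij, mu_j in zip(row, state.output_mean)) + b_i
        for row, b_i in zip(A, b)
    ]
    new_duration = scale * state.duration_mean + shift
    return HsmmState(new_output, new_duration)


# Toy average-voice state and toy transforms (made-up values).
average_state = HsmmState(output_mean=[1.0, 2.0], duration_mean=4.0)
adapted_state = adapt_state(
    average_state,
    A=[[2.0, 0.0], [0.0, 1.0]], b=[0.5, -1.0],
    scale=1.5, shift=1.0,
)
```

In this sketch, both the spectral/F0 mean and the duration mean of the average-voice state move toward the target speaker in a single adaptation step, which is the simultaneous transformation the summary describes. How the transforms are estimated from adaptation data is outside the scope of the sketch.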