Fundamental Frequency Modeling for Speech Synthesis Based on a Statistical Learning Technique

Shinsuke SAKAI  

IEICE TRANSACTIONS on Information and Systems   Vol.E88-D   No.3   pp.489-495
Publication Date: 2005/03/01
DOI: 10.1093/ietisy/e88-d.3.489
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Corpus-Based Speech Technologies)
Category: Speech Synthesis and Prosody
Keyword: speech synthesis, fundamental frequency, additive models, statistical learning

This paper proposes a novel multi-layer approach to fundamental frequency (F0) modeling for concatenative speech synthesis, based on a statistical learning technique called additive models. We define an additive F0 contour model consisting of a long-term, intonational phrase-level component and a short-term, accentual phrase-level component, along with a least-squares error criterion that includes a regularization term. A backfitting algorithm derived from this error criterion estimates both components simultaneously by iteratively applying cubic spline smoothers. When applied to a 7,000-utterance Japanese speech corpus, the method achieves F0 RMS errors of 28.9 Hz and 29.8 Hz on the training and test data, respectively, with corresponding correlation coefficients of 0.806 and 0.777. The automatically determined intonational and accentual phrase components behave smoothly, systematically, and intuitively under a variety of prosodic conditions.
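
The backfitting idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the predictors, the regularization weights, or the exact form of the smoother. The sketch assumes each component depends on a single hypothetical predictor (relative position within the intonational phrase and within the accentual phrase), uses a penalized cubic regression spline as a stand-in for the cubic spline smoother, and runs on synthetic data. All names (`spline_basis`, `smooth`, `backfit`, `lam_long`, `lam_short`) are invented for illustration.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, then (x - k)^3_+ for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def smooth(x, r, knots, lam):
    """Regularized least-squares cubic spline fit of residual r on x (fitted values)."""
    B = spline_basis(x, knots)
    # Penalize only the knot coefficients; the cubic polynomial part is unpenalized.
    D = np.zeros((len(knots), B.shape[1]))
    D[:, 4:] = np.eye(len(knots))
    A = np.vstack([B, np.sqrt(lam) * D])
    rhs = np.concatenate([r, np.zeros(len(knots))])
    coef = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return B @ coef

def backfit(x_long, x_short, y, knots, lam_long=1e-3, lam_short=1e-3, n_iter=20):
    """Alternately smooth the partial residuals of each component (backfitting)."""
    alpha = y.mean()
    f_long = np.zeros_like(y)
    f_short = np.zeros_like(y)
    for _ in range(n_iter):
        f_long = smooth(x_long, y - alpha - f_short, knots, lam_long)
        f_long -= f_long.mean()    # center so the intercept stays identifiable
        f_short = smooth(x_short, y - alpha - f_long, knots, lam_short)
        f_short -= f_short.mean()
    return alpha, f_long, f_short

# Synthetic data (assumption, not the paper's corpus): a declining long-term trend
# plus a hump-shaped short-term component, with Gaussian noise, in Hz.
rng = np.random.default_rng(0)
n = 500
x_long = rng.uniform(0.0, 1.0, n)    # hypothetical position in intonational phrase
x_short = rng.uniform(0.0, 1.0, n)   # hypothetical position in accentual phrase
true_long = -40.0 * x_long
true_short = 30.0 * np.sin(np.pi * x_short)
y = 200.0 + true_long + true_short + rng.normal(0.0, 5.0, n)

knots = np.linspace(0.1, 0.9, 9)
alpha, f_long, f_short = backfit(x_long, x_short, y, knots)
pred = alpha + f_long + f_short
rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
corr = float(np.corrcoef(y, pred)[0, 1])
print(f"RMSE = {rmse:.2f} Hz, correlation = {corr:.3f}")
```

The RMS error and correlation coefficient computed at the end are the same kinds of measures the abstract reports, here evaluated only on the synthetic training data. In this sketch the regularization weights control how strongly each component is smoothed; how the paper sets them is not stated in the abstract.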