Developments in Corpus-Based Speech Synthesis: Approaching Natural Conversational Speech


IEICE TRANSACTIONS on Information and Systems   Vol.E88-D   No.3   pp.376-383
Publication Date: 2005/03/01
DOI: 10.1093/ietisy/e88-d.3.376
Print ISSN: 0916-8532
Type of Manuscript: INVITED PAPER (Special Section on Corpus-Based Speech Technologies)
Keywords: speech synthesis, corpora, concatenation, paralinguistic information, communication, affect

This paper describes the special demands of conversational speech in the context of corpus-based speech synthesis. The author proposed the CHATR system of prosody-based unit selection for concatenative waveform synthesis seven years ago, and now extends this work to incorporate the results of an analysis of five years of recordings of spontaneous conversational speech in a wide range of actual daily-life situations. The paper proposes that the expression of affect (often translated as 'kansei' in Japanese) is the main factor differentiating laboratory speech from real-world conversational speech, and presents a framework for the specification of affect through differences in speaking style and voice quality. Having an enormous corpus of speech samples available for concatenation allows the selection of complete phrase-sized utterance segments, and shifts the focus of unit selection from segmental or phonetic continuity to prosodic and discoursal appropriateness. Samples of the resulting large-corpus-based synthesis can be heard at