A VoiceFont Creation Framework for Generating Personalized Voices

Takashi SAITO  Masaharu SAKAMOTO  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E88-D   No.3   pp.525-534
Publication Date: 2005/03/01
Online ISSN: 
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Corpus-Based Speech Technologies)
Category: Speech Synthesis and Prosody
Keyword: 
personalized voice,  voice font,  voice inventory generation,  automatic segmentation,  corpus-based speech synthesis,  speaker recognizability,  

Full Text: PDF(958.5KB)
>>Buy this Article


Summary: 
This paper presents a new framework for effectively creating VoiceFonts for speech synthesis. A VoiceFont in this paper represents a voice inventory aimed at generating personalized voices. Creating well-formed voice inventories is a time-consuming and laborious task. This has become a critical issue for speech synthesis systems that make an attempt to synthesize many high quality voice personalities. The framework we propose here aims to drastically reduce the burden with a twofold approach. First, in order to substantially enhance the accuracy and robustness of automatic speech segmentation, we introduce a multi-layered speech segmentation algorithm with a new measure of segmental reliability. Secondly, to minimize the amount of human intervention in the process of VoiceFont creation, we provide easy-to-use functions in a data viewer and compiler to facilitate checking and validation of the automatically extracted data. We conducted experiments to investigate the accuracy of the automatic speech segmentation, and its robustness to speaker and style variations. The results of the experiments on six speech corpora with a fairly large variation of speaking styles show that the speech segmentation algorithm is quite accurate and robust in extracting segments of both phonemes and accentual phrases. In addition, to subjectively evaluate VoiceFonts created by using the framework, we conducted a listening test for speaker recognizability. The results show that the voice personalities of synthesized speech generated by the VoiceFont-based speech synthesizer are fairly close to those of the donor speakers.