Pre-Training of DNN-Based Speech Synthesis Based on Bidirectional Conversion between Text and Speech

Kentaro SONE  Toru NAKASHIKA  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E102-D   No.8   pp.1546-1553
Publication Date: 2019/08/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2018EDP7344
Type of Manuscript: PAPER
Category: Speech and Hearing
Keyword: speech synthesis, generative models, Boltzmann distributions, pre-training methods, deep neural networks

Summary: 
Conventional approaches to statistical parametric speech synthesis use context-dependent hidden Markov models (HMMs), clustered using decision trees, to generate speech parameters from linguistic features. However, decision trees are not always well suited to modeling the complex context dependencies of linguistic features efficiently. An alternative scheme that replaces decision trees with deep neural networks (DNNs) has been proposed as a possible way to overcome this difficulty. By training the network to represent high-dimensional feedforward dependencies from linguistic features to acoustic features, DNN-based speech synthesis systems convert text into speech. To improve the naturalness of the synthesized speech, this paper presents a novel pre-training method for DNN-based statistical parametric speech synthesis systems. In our method, a deep relational model (DRM), which represents a joint probability of two visible variables, is applied to describe the joint distribution of acoustic and linguistic features. Like a DNN, a DRM consists of several hidden layers and two visible layers. Whereas DNNs represent feedforward dependencies from one set of visible variables (inputs) to another (outputs), a DRM can represent the bidirectional dependencies between two visible variables. During maximum-likelihood (ML)-based training, the model optimizes the parameters of its deep architecture (the connection weights between adjacent layers, and the biases) by considering the bidirectional conversion between 1) acoustic features given linguistic features, and 2) linguistic features given the acoustic features generated by the model itself. Because it takes into account whether the generated acoustic features are recognizable, our method can obtain reasonable parameters for speech synthesis. 
Experimental results in a speech synthesis task show that DNN-based systems pre-trained with our proposed method outperformed randomly initialized DNN-based systems, especially when the amount of training data is limited. Additionally, speaker-dependent speech recognition experiments show that our method outperformed DNN-based systems when the initial parameters of our method were set to the same values as in the synthesis experiments.
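
The bidirectional conversion described in the summary can be illustrated with a small sketch. This is not the paper's DRM or its ML-based training: it is a deterministic toy network, and every name in it (TiedBidirectionalNet, synthesize, recognize, bidirectional_losses), the layer sizes, the use of transposed (tied) weights for the reverse direction, and the squared-error measures are assumptions made purely for illustration. It shows only the two conversions the summary names, 1) acoustic features given linguistic features and 2) linguistic features given the generated acoustic features, and how the resulting weights might be copied out as initial values for a feedforward DNN.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TiedBidirectionalNet:
    """Toy stand-in for a model with two visible layers and hidden layers between.

    Assumption: both directions share the same weight matrices (the reverse
    pass uses their transposes). This is an illustrative choice, not the
    paper's formulation.
    """

    def __init__(self, layer_sizes, seed=0):
        # layer_sizes = [linguistic_dim, hidden_1, ..., hidden_k, acoustic_dim]
        rng = np.random.default_rng(seed)
        self.weights = [rng.normal(0.0, 0.1, size=(m, n))
                        for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
        self.biases = [np.zeros(n) for n in layer_sizes]  # one bias vector per layer

    def synthesize(self, linguistic):
        """Conversion 1: linguistic features -> acoustic features."""
        h = linguistic
        last = len(self.weights) - 1
        for i, w in enumerate(self.weights):
            pre = h @ w + self.biases[i + 1]
            h = pre if i == last else sigmoid(pre)  # linear output for real-valued acoustics
        return h

    def recognize(self, acoustic):
        """Conversion 2: acoustic features -> linguistic features (transposed weights)."""
        h = acoustic
        for i in range(len(self.weights) - 1, -1, -1):
            h = sigmoid(h @ self.weights[i].T + self.biases[i])
        return h


def bidirectional_losses(net, linguistic, acoustic):
    """Squared-error surrogates for the two conversions (hypothetical measures)."""
    generated = net.synthesize(linguistic)   # acoustic given linguistic
    recovered = net.recognize(generated)     # linguistic given generated acoustic
    synthesis_loss = float(np.mean((generated - acoustic) ** 2))
    recognition_loss = float(np.mean((recovered - linguistic) ** 2))
    return synthesis_loss, recognition_loss


# Random placeholder data: 5 frames, 8 binary linguistic features, 4 acoustic features.
net = TiedBidirectionalNet([8, 16, 16, 4])
data_rng = np.random.default_rng(1)
linguistic = data_rng.integers(0, 2, size=(5, 8)).astype(float)
acoustic = data_rng.normal(size=(5, 4))

syn_loss, rec_loss = bidirectional_losses(net, linguistic, acoustic)

# After pre-training, the weights could serve as the initial values of a feedforward DNN.
dnn_init = [w.copy() for w in net.weights]
```

In this sketch the second measure penalizes generated acoustic features from which the linguistic input cannot be recovered, which mirrors the summary's point about checking whether the generated features are recognizable. A training loop that minimizes both measures is left out.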