Loss Function Considering Multiple Attributes of a Temporal Sequence for Feed-Forward Neural Networks

Noriyuki MATSUNAGA  Yamato OHTANI  Tatsuya HIRAHARA  

IEICE TRANSACTIONS on Information and Systems   Vol.E103-D   No.12   pp.2659-2672
Publication Date: 2020/12/01
Publicized: 2020/08/31
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2020EDP7078
Type of Manuscript: PAPER
Category: Speech and Hearing
loss function, multiple attributes of temporal sequence, feed-forward neural networks, fundamental frequency, mel-cepstrum


Deep neural network (DNN)-based speech synthesis has become popular in recent years and is expected to soon be widely used in embedded devices and environments with limited computing resources. In such environments, the key goal is to reduce the computational cost of generating speech parameter sequences while maintaining voice quality. However, reducing computational cost is challenging for the two primary conventional DNN-based methods used for modeling speech parameter sequences. In feed-forward neural networks (FFNNs) with maximum likelihood parameter generation (MLPG), the MLPG reconstructs the temporal structure of the speech parameter sequences that FFNNs ignore, but it incurs additional computational cost proportional to the sequence length. In recurrent neural networks, the recursive structure allows speech parameter sequences to be generated with their temporal structure taken into account without the MLPG, but at a higher computational cost than FFNNs. We propose a new approach in which DNNs acquire parameters that capture the temporal structure by backpropagating the errors of multiple attributes of the temporal sequence through the loss function. This method enables FFNNs to generate speech parameter sequences that account for their temporal structure without the MLPG. We generated fundamental frequency and mel-cepstrum sequences with the proposed and conventional methods, then synthesized and subjectively evaluated speech from these sequences. The proposed method enables even FFNNs, which operate on a frame-by-frame basis, to generate speech parameter sequences that consider the temporal structure and that are perceptually superior to those produced by the conventional methods.
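The abstract's core idea is a loss function that backpropagates the errors of multiple attributes of the temporal sequence, so that a frame-by-frame FFNN learns temporal structure without MLPG. The abstract does not specify which attributes are used; the sketch below assumes, purely for illustration, that one such attribute is the first-order temporal difference (delta) of the sequence, and combines a static error with a delta error. The function names and weights (`multi_attribute_loss`, `w_static`, `w_delta`) are hypothetical, not from the paper.

```python
import numpy as np

def delta(seq):
    # First-order temporal difference along the time axis.
    # seq has shape (frames, dims).
    return seq[1:] - seq[:-1]

def multi_attribute_loss(pred, target, w_static=1.0, w_delta=1.0):
    """Illustrative sketch (not the paper's exact formulation):
    a weighted sum of the mean squared error on the static sequence
    and on its first-order temporal difference. In an autodiff
    framework, gradients of the delta term couple adjacent frames,
    which is how temporal structure reaches a frame-wise FFNN."""
    static_err = np.mean((pred - target) ** 2)
    delta_err = np.mean((delta(pred) - delta(target)) ** 2)
    return w_static * static_err + w_delta * delta_err
```

For example, a prediction that is a constant offset from the target has zero delta error but nonzero static error, whereas a prediction with the right frame values but wrong frame-to-frame dynamics is penalized through the delta term.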