Joint Adversarial Training of Speech Recognition and Synthesis Models for Many-to-One Voice Conversion Using Phonetic Posteriorgrams

Yuki SAITO
Kei AKUZAWA
Kentaro TACHIBANA

Publication
IEICE TRANSACTIONS on Information and Systems, Vol.E103-D, No.9, pp.1978-1987
Publication Date: 2020/09/01
Publicized: 2020/06/12
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2019EDP7297
Type of Manuscript: PAPER
Category: Speech and Hearing
Keywords: many-to-one voice conversion, phonetic posteriorgrams, deep neural networks, over-smoothing, domain-adversarial training, generative adversarial networks




Summary: 
This paper presents a method for many-to-one voice conversion (VC) using phonetic posteriorgrams (PPGs) based on adversarial training of deep neural networks (DNNs). A conventional method for many-to-one VC learns a mapping function from input acoustic features to target acoustic features through separately trained DNN-based speech recognition and synthesis models. However, 1) speaker-dependent differences observed in the PPGs and 2) an over-smoothing effect in the generated acoustic features degrade the converted speech quality. Our method performs domain-adversarial training of the recognition model to reduce the speaker-dependent differences in the PPGs. In addition, it incorporates a generative adversarial network into the training of the synthesis model to alleviate the over-smoothing effect. Unlike the conventional method, ours jointly trains the recognition and synthesis models so that they are optimized for many-to-one VC. Experimental evaluation demonstrates that the proposed method significantly improves converted speech quality compared with conventional VC methods.
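Domain-adversarial training of the kind summarized above is commonly realized with a gradient reversal layer: the speaker classifier's gradient is negated before it reaches the recognition model's encoder, pushing the encoder toward speaker-independent PPGs. The following is a minimal NumPy sketch of that mechanism only (the function name, toy tensors, and lambda value are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def grad_reversal_backward(grad, lam=1.0):
    # Gradient reversal layer: identity in the forward pass,
    # multiplies the incoming gradient by -lambda in the backward pass.
    return -lam * grad

# Toy setup (illustrative): encoder feature h and a linear
# speaker-classifier weight w.
rng = np.random.default_rng(0)
h = rng.normal(size=3)
w = rng.normal(size=3)

# Gradient of the speaker-classification loss w.r.t. h (illustrative value).
g_h = 0.5 * w

# The encoder receives the reversed gradient, so following it makes the
# encoder's features HARDER to classify by speaker.
g_enc = grad_reversal_backward(g_h, lam=1.0)
print(np.allclose(g_enc, -g_h))  # True
```

With lambda > 0, the classifier still minimizes its loss on its own parameters, while the encoder effectively maximizes it, which is the adversarial objective that suppresses speaker differences in the PPGs.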
