Voice Conversion Based on Speaker-Dependent Restricted Boltzmann Machines


IEICE TRANSACTIONS on Information and Systems   Vol.E97-D   No.6   pp.1403-1410
Publication Date: 2014/06/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E97.D.1403
Type of Manuscript: Special Section PAPER (Special Section on Advances in Modeling for Real-world Speech Information Processing and its Application)
Category: Voice Conversion and Speech Enhancement
voice conversion,  restricted Boltzmann machine,  deep learning,  speaker individuality,  

Full Text: PDF(642.2KB)>>
Buy this Article

This paper presents a voice conversion technique using speaker-dependent Restricted Boltzmann Machines (RBM) to build high-order eigen spaces of source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space. We build a deep conversion architecture that concatenates the two speaker-dependent RBMs with neural networks, expecting that they automatically discover abstractions to express the original input features. Under this concept, if we train the RBMs using only the speech of an individual speaker that includes various phonemes while keeping the speaker individuality unchanged, it can be considered that there are fewer phonemes and relatively more speaker individuality in the output features of the hidden layer than original acoustic features. Training the RBMs for a source speaker and a target speaker, we can then connect and convert the speaker individuality abstractions using Neural Networks (NN). The converted abstraction of the source speaker is then back-propagated into the acoustic space (e.g., MFCC) using the RBM of the target speaker. We conducted speaker-voice conversion experiments and confirmed the efficacy of our method with respect to subjective and objective criteria, comparing it with the conventional Gaussian Mixture Model-based method and an ordinary NN.