Similar Speaker Selection Technique Based on Distance Metric Learning Using Highly Correlated Acoustic Features with Perceptual Voice Quality Similarity

Yusuke IJIMA  Hideyuki MIZUNO  

IEICE TRANSACTIONS on Information and Systems   Vol.E98-D   No.1   pp.157-165
Publication Date: 2015/01/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2014EDP7183
Type of Manuscript: PAPER
Category: Speech and Hearing
voice quality,  perceptual similarity,  acoustic feature,  speaker selection,  distance metric learning,  

Full Text: PDF(742.8KB)
>>Buy this Article

This paper analyzes the correlation between various acoustic features and perceptual voice quality similarity, and proposes a perceptually similar speaker selection technique based on distance metric learning. To analyze the relationship between acoustic features and voice quality similarity, we first conduct a large-scale subjective experiment using the voices of 62 female speakers and perceptual voice quality similarity scores between all pairs of speakers are acquired. Next, multiple linear regression analysis is carried out; it shows that four acoustic features are highly correlated to voice quality similarity. The proposed speaker selection technique first trains a transform matrix based on distance metric learning using the perceptual voice quality similarity acquired in the subjective experiment. Given an input speech, acoustic features of the input speech are transformed using the trained transform matrix, after which speaker selection is performed based on the Euclidean distance on the transformed acoustic feature space. We perform speaker selection experiments and evaluate the performance of the proposed technique by comparing it to speaker selection without feature space transformation. The results indicate that transformation based on distance metric learning reduces the error rate by 53.9%.