Unsupervised Learning of Continuous Density HMM for Variable-Length Spoken Unit Discovery

Meng SUN  Hugo VAN HAMME  Yimin WANG  Xiongwei ZHANG  

IEICE TRANSACTIONS on Information and Systems   Vol.E99-D   No.1   pp.296-299
Publication Date: 2016/01/01
Publicized: 2015/10/21
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2015EDL8178
Type of Manuscript: LETTER
Category: Speech and Hearing
Keywords: spoken unit discovery, unsupervised HMM learning, nonnegative matrix factorization, language acquisition


Unsupervised spoken unit discovery, or zero-resource speech recognition, is an emerging research topic that is important for the spoken document analysis of languages and dialects with little human annotation. In this paper, we extend our earlier joint training framework for the unsupervised learning of discrete density HMMs to continuous density HMMs (CDHMMs) and apply it to spoken unit discovery. In the proposed recipe, we first cluster a group of Gaussians, which then serve as the initialization for the joint training framework of nonnegative matrix factorization and a semi-continuous density HMM (SCDHMM). In the SCDHMM, all hidden states share the same group of Gaussians but with different mixture weights. A CDHMM is subsequently constructed by tying the top-N activated Gaussians to each hidden state. Baum-Welch training is finally conducted to update the parameters of the Gaussians, the mixture weights, and the HMM transition probabilities. Experiments were conducted on word discovery from TIDIGITS and phone discovery from TIMIT. For TIDIGITS, units were modeled with 10 states and turned out to be strongly related to words; for TIMIT, units were modeled with 3 states and are likely to correspond to phonemes.
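The CDHMM construction step described above (keeping only each state's top-N activated Gaussians from the shared SCDHMM codebook) can be sketched as follows. This is not the authors' code: the weight matrix below is a toy stand-in for the mixture weights produced by the NMF/SCDHMM joint training, and `tie_top_n` is a hypothetical helper name.

```python
import numpy as np

def tie_top_n(mixture_weights, n):
    """Keep the N largest mixture weights per state and renormalize.

    mixture_weights: (num_states, num_gaussians) array of SCDHMM
    mixture weights over the shared Gaussian codebook.
    Returns a sparse weight matrix in which each state is tied to
    only its top-N activated Gaussians (the CDHMM construction step).
    """
    tied = np.zeros_like(mixture_weights, dtype=float)
    for s, row in enumerate(mixture_weights):
        top = np.argsort(row)[-n:]        # indices of the N largest weights
        tied[s, top] = row[top]
        tied[s] /= tied[s].sum()          # renormalize to a distribution
    return tied

# Toy SCDHMM weights: 2 hidden states sharing a codebook of 4 Gaussians.
W = np.array([[0.5, 0.3, 0.15, 0.05],
              [0.1, 0.1, 0.40, 0.40]])
W_tied = tie_top_n(W, n=2)   # each state now uses only 2 Gaussians
```

The tied weights would then initialize per-state Gaussian mixtures, whose means, covariances, and weights are re-estimated together with the transition probabilities by standard Baum-Welch training.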