Cost Reduction of Acoustic Modeling for Real-Environment Applications Using Unsupervised and Selective Training

Tobias CINCAREK  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

IEICE TRANSACTIONS on Information and Systems   Vol.E91-D   No.3   pp.499-507
Publication Date: 2008/03/01
Online ISSN: 1745-1361
DOI: 10.1093/ietisy/e91-d.3.499
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Robust Speech Processing in Realistic Environments)
Category: Acoustic Modeling
Keywords: ASR application, real-environment, task-dependency, unsupervised training, selective training

Developing an ASR application such as a speech-oriented guidance system for a real environment is expensive. Most of the cost arises from human labeling of newly collected speech data, which is needed to construct the acoustic model for speech recognition. Reusing existing models or sharing models across multiple applications is often difficult, because the characteristics of speech depend on factors such as the likely users, their speaking style, and the acoustic environment. This paper therefore proposes a combination of unsupervised learning and selective training to reduce development costs. Unsupervised learning alone is problematic because speech recognition is task-dependent and automatic transcription of speech is error-prone. A theoretically well-defined approach to the automatic selection of high-quality, task-specific speech data from an unlabeled data pool is presented: only those unlabeled data that increase the model likelihood given the labeled data are used for unsupervised training. The effectiveness of the proposed method is investigated in a simulation experiment that constructs adult and child acoustic models for a speech-oriented guidance system. A completely human-labeled database containing real-environment data collected over two years is available for the development simulation. The experiments show that selective training alleviates the problems of unsupervised learning, i.e., it is possible to select the speech utterances of a certain speaker group while discarding noise inputs and utterances with lower recognition accuracy. The simulation experiment is carried out for several selected combinations of data collection period and human transcription period. It is found empirically that the proposed method is especially effective when only a relatively small portion of the collected data can be labeled and transcribed by humans.
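
The selection criterion stated above (keep only the unlabeled data that increase the model likelihood given the labeled data) can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the search procedure or the model, so the sketch assumes a single 1-D Gaussian in place of an HMM acoustic model, treats each utterance as a short list of scalar feature values, and uses a simple leave-one-out test against a model trained on the whole pool. The names `fit_gaussian`, `log_likelihood`, `select_utterances`, and all data values are hypothetical.

```python
import math

def fit_gaussian(frames, var_floor=1e-3):
    """Maximum-likelihood mean and (floored) variance of 1-D feature values."""
    mean = sum(frames) / len(frames)
    var = sum((x - mean) ** 2 for x in frames) / len(frames)
    return mean, max(var, var_floor)

def log_likelihood(frames, mean, var):
    """Total log-likelihood of the frames under a 1-D Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )

def select_utterances(labeled, pool):
    """Keep a pool utterance only if including it raises the likelihood
    of the labeled data (assumed leave-one-out test; needs >= 2 pool entries)."""
    labeled_frames = [x for utt in labeled for x in utt]

    def score(utt_ids):
        # Train the toy model on the given pool utterances,
        # then evaluate it on the labeled, task-specific data.
        train = [x for uid in utt_ids for x in pool[uid]]
        return log_likelihood(labeled_frames, *fit_gaussian(train))

    all_ids = list(pool)
    baseline = score(all_ids)
    selected = []
    for uid in all_ids:
        without = [other for other in all_ids if other != uid]
        # Positive gain: the model is better with this utterance than without.
        if baseline > score(without):
            selected.append(uid)
    return selected

# Hypothetical data: labeled target-group speech near 0, one noise input near 5.
labeled = [[-0.5, 0.2, 0.4], [0.1, -0.3, 0.6]]
pool = {
    "u1": [0.0, 0.3, -0.2],
    "u2": [0.5, -0.4, 0.1],
    "noise": [5.0, 5.5, 4.8],
}
selected = select_utterances(labeled, pool)
print(selected)
```

In this toy data the `noise` entry lies far from the labeled data, so removing it improves the fit and it is rejected, while `u1` and `u2` are retained. A real system would score utterances with HMM likelihoods and automatic transcriptions rather than a single Gaussian.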