A Hybrid HMM/BN Acoustic Model Utilizing Pentaphone-Context Dependency

Sakriani SAKTI  Konstantin MARKOV  Satoshi NAKAMURA  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E89-D   No.3   pp.954-961
Publication Date: 2006/03/01
Online ISSN: 1745-1361
DOI: 10.1093/ietisy/e89-d.3.954
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Statistical Modeling for Speech Processing)
Category: Speech Recognition
Keyword: 
wide phonetic context model,  pentaphone,  HMM/BN acoustic model,  

Full Text: PDF(448.2KB)>>
Buy this Article




Summary: 
The most widely used acoustic unit in current automatic speech recognition systems is the triphone, which includes the immediate preceding and following phonetic contexts. Although triphones have proved to be an efficient choice, it is believed that they are insufficient in capturing all of the coarticulation effects. A wider phonetic context seems to be more appropriate, but often suffers from the data sparsity problem and memory constraints. Therefore, an efficient modeling of wider contexts needs to be addressed to achieve a realistic application for an automatic speech recognition system. This paper presents a new method of modeling pentaphone-context units using the hybrid HMM/BN acoustic modeling framework. Rather than modeling pentaphones explicitly, in this approach the probabilistic dependencies between the triphone context unit and the second preceding/following contexts are incorporated into the triphone state output distributions by means of the BN. The advantages of this approach are that we are able to extend the modeled phonetic context within the triphone framework, and we can use a standard decoding system by assuming the next preceding/following context variables hidden during the recognition. To handle the increased parameter number, tying using knowledge-based phoneme classes and a data-driven clustering method is applied. The evaluation experiments indicate that the proposed model outperforms the standard HMM based triphone model, achieving a 9-10% relative word error rate (WER) reduction.