A Study on Acoustic Modeling for Speech Recognition of Predominantly Monosyllabic Languages

Ekkarit MANEENOI  Visarut AHKUPUTRA  Sudaporn LUKSANEEYANAWIN  Somchai JITAPUNKUL  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E87-D   No.5   pp.1146-1163
Publication Date: 2004/05/01
Online ISSN: 
DOI: 
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Speech Dynamics by Ear, Eye, Mouth and Machine)
Category: 
Keyword: 
acoustic modeling,  continuous speech recognition,  onset-rhyme model,  predominantly monosyllabic languages,  Thai speech recognition,  

Full Text: PDF(3.7MB)>>
Buy this Article




Summary: 
This paper presents a study on acoustic modeling for speech recognition of predominantly monosyllabic languages. Various speech units used in speech recognition systems have been investigated. To evaluate the effectiveness of these acoustic models, the Thai language is selected, since it is a predominantly monosyllabic language and has a complex vowel system. Several experiments have been carried out to find the proper speech unit that can accurately create acoustic model and give a higher recognition rate. Results of recognition rates under different acoustic models are given and compared. In addition, this paper proposes a new speech unit for speech recognition, namely onset-rhyme unit. Two models are proposed-the Phonotactic Onset-Rhyme Model (PORM) and the Contextual Onset-Rhyme Model (CORM). The models comprise a pair of onset and rhyme units, which makes up a syllable. An onset comprises an initial consonant and its transition towards the following vowel. Together with the onset, the rhyme consists of a steady vowel segment and a final consonant. Experimental results show that the onset-rhyme model improves on the efficiency of other speech units. The onset-rhyme model improves on the accuracy of the inter-syllable triphone model by nearly 9.3% and of the context-dependent Initial-Final model by nearly 4.7% for the speaker-dependent systems using only an acoustic model, and 5.6% and 4.5% for the speaker-dependent systems using both acoustic and language model respectively. The results show that the onset-rhyme models attain a high recognition rate. Moreover, they also give more efficiency in terms of system complexity.