Face-to-Talk: Audio-Visual Speech Detection for Robust Speech Recognition in Noisy Environment

Kazumasa MURAI  Satoshi NAKAMURA  

IEICE TRANSACTIONS on Information and Systems   Vol.E86-D   No.3   pp.505-513
Publication Date: 2003/03/01
Online ISSN: 
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Issue on Speech Information Processing)
Category: Robust Speech Recognition and Enhancement
speech recognition,  speech section detection,  multi-modality,  face detection,  "face-to-talk",  

Full Text: PDF(682.8KB)>>
Buy this Article

This paper discusses "face-to-talk" audio-visual speech detection for robust speech recognition in noisy environment, which consists of facial orientation based switch and audio-visual speech section detection. Most of today's speech recognition systems must actually turned on and off by a switch e.g. "push-to-talk" to indicate which utterance should be recognized, and a specific speech section must be detected prior to any further analysis. To improve usability and performance, we have researched how to extract the useful information from visual modality. We implemented a facial orientation based switch, which activates the speech recognition during a speaker is facing to the camera. Then, the speech section is detected by analyzing the image of the face. Visual speech detection is robust to audio noise, but because the articulation starts prior to the speech and lasts longer than the speech, the detected section tends to be longer and ends up with insertion errors. Therefore, we have fused the audio-visual modality detected sections. Our experiment confirms that the proposed audio-visual speech detection method improves recognition performance in noisy environment.