In-Vehicle Voice Interface with Improved Utterance Classification Accuracy Using Off-the-Shelf Cloud Speech Recognizer

Takeshi HOMMA  Yasunari OBUCHI  Kazuaki SHIMA  Rintaro IKESHITA  Hiroaki KOKUBO  Takuya MATSUMOTO  
[Paper on system development]

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E101-D   No.12   pp.3123-3137
Publication Date: 2018/12/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2018EDK0001
Type of Manuscript: PAPER
Category: Speech and Hearing
Keyword: 
speech recognition errors,  natural language understanding,  car navigation,  noisy environment,  cloud speech recognition

Summary: 
For voice-enabled car navigation systems that use a multi-purpose cloud speech recognition service (cloud ASR), utterance classification that is robust against speech recognition errors is needed to realize a user-friendly voice interface. The purpose of this study is to improve the accuracy of utterance classification for such systems when the classifier's inputs are error-prone speech recognition results obtained from a cloud ASR. The role of utterance classification is to predict, from a spontaneous utterance, which car navigation function the user wants to execute. A cloud ASR makes speech recognition errors because of the noise that occurs while a car is traveling, and these errors degrade the accuracy of utterance classification. Many methods reduce the number of speech recognition errors by modifying the internals of a speech recognizer. However, application developers cannot apply these methods to cloud ASRs because they cannot customize the ASRs. In this paper, we propose a system that improves the accuracy of utterance classification by modifying both the speech-signal inputs to a cloud ASR and the recognized-sentence outputs from it. First, our system performs speech enhancement on a user's utterance and then sends both the enhanced and non-enhanced speech signals to a cloud ASR. The speech recognition results from both signals are merged to reduce the number of recognition errors. Second, to reduce the number of utterance classification errors, we propose a data augmentation method, which we call “optimal doping,” in which not only accurate transcriptions but also error-prone recognized sentences are added to the training data. An evaluation with real user utterances spoken to car navigation products showed that our system reduces the number of utterance classification errors by 54% relative to a baseline condition. Finally, we propose a semi-automatic approach for upgrading classifiers so that they benefit from improvements in the performance of cloud ASRs.
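
The data augmentation idea in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary says only that error-prone recognized sentences are added to the training data alongside accurate transcriptions. The function name `dope_training_data`, the `doping_ratio` parameter, and the example sentences and labels are all hypothetical, chosen only to show how a controllable share of ASR output might be mixed in.

```python
import random
from typing import List, Tuple

# One training example: (sentence text, car navigation function label).
Example = Tuple[str, str]


def dope_training_data(
    transcripts: List[Example],
    asr_outputs: List[Example],
    doping_ratio: float,
    seed: int = 0,
) -> List[Example]:
    """Combine accurate transcriptions with a sample of ASR outputs.

    doping_ratio is a hypothetical knob (not taken from the paper):
    the fraction of the available ASR outputs added to the training set.
    """
    if not 0.0 <= doping_ratio <= 1.0:
        raise ValueError("doping_ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_doped = int(doping_ratio * len(asr_outputs))
    doped = rng.sample(asr_outputs, n_doped)
    return list(transcripts) + doped


# Hypothetical data: clean transcriptions and error-prone ASR results.
clean = [
    ("find a gas station", "poi_search"),
    ("take me home", "route_home"),
]
noisy = [
    ("fine a gas station", "poi_search"),
    ("find a gust station", "poi_search"),
    ("take me hum", "route_home"),
    ("take my home", "route_home"),
]

training_set = dope_training_data(clean, noisy, doping_ratio=0.5)
```

In a setup like this, one would presumably train a classifier for several values of `doping_ratio` and keep the value that gives the best accuracy on held-out ASR output; that selection procedure is likewise an assumption here rather than something stated in the summary.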