Improve Multichannel Speech Recognition with Temporal and Spatial Information

Yu ZHANG  Pengyuan ZHANG  Qingwei ZHAO  

IEICE TRANSACTIONS on Information and Systems   Vol.E101-D   No.7   pp.1963-1967
Publication Date: 2018/07/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2017EDL8268
Type of Manuscript: LETTER
Category: Speech and Hearing
Keyword: multichannel speech recognition, long short-term memory, attention mechanism, generalized cross correlation

In this letter, we explore the use of spatio-temporal information in one unified framework to improve the performance of multichannel speech recognition. Generalized cross correlation (GCC) is used for spatial feature compensation, and an attention mechanism across time is embedded within long short-term memory (LSTM) neural networks. Experiments on the AMI meeting corpus show that the proposed method provides an 8.2% relative improvement in word error rate (WER) over the model trained directly on the concatenation of multiple microphone outputs.
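
To illustrate what a GCC-based spatial feature can look like, the following is a minimal sketch of generalized cross correlation between two microphone signals. It assumes the common PHAT (phase transform) weighting, which the abstract does not specify, and the function name `gcc_phat` and its parameters are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def gcc_phat(x, y, max_lag):
    """Sketch of GCC with PHAT weighting (an assumed weighting choice).

    Returns the correlation values for lags -max_lag..max_lag (max_lag > 0)
    and the lag of the peak in samples. A positive peak lag means x is a
    delayed copy of y.
    """
    n = len(x) + len(y)              # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12   # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)    # back to the lag domain
    # Reorder so that index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    peak_lag = int(np.argmax(cc)) - max_lag
    return cc, peak_lag

# Usage: y is a noise signal, x is the same signal delayed by 5 samples.
rng = np.random.default_rng(0)
y = rng.standard_normal(1024)
x = np.concatenate((np.zeros(5), y[:-5]))
features, lag = gcc_phat(x, y, max_lag=16)
```

In a setup like the one described, a vector such as `features` could be computed per frame for each microphone pair and appended to the acoustic features; how the letter actually combines them is not stated in the abstract.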