Integration of Spatial Cue-Based Noise Reduction and Speech Model-Based Source Restoration for Real Time Speech Enhancement

Tomoko KAWASE  Kenta NIWA  Masakiyo FUJIMOTO  Kazunori KOBAYASHI  Shoko ARAKI  Tomohiro NAKATANI  

Publication
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E100-A   No.5   pp.1127-1136
Publication Date: 2017/05/01
Online ISSN: 1745-1337
Type of Manuscript: PAPER
Category: Digital Signal Processing
Keyword: 
microphone array,  beamforming (BF),  power spectral density (PSD) estimation,  Gaussian mixture model,  Wiener filtering,  

Full Text: PDF(2.1MB)
>>Buy this Article


Summary: 
We propose a microphone array speech enhancement method that integrates spatial-cue-based source power spectral density (PSD) estimation and statistical speech model-based PSD estimation. The goal of this research was to clearly pick up target speech even in noisy environments such as crowded places, factories, and cars running at high speed. Beamforming with post-Wiener filtering is commonly used in many conventional studies on microphone-array noise reduction. For calculating a Wiener filter, speech/noise PSDs are essential, and they are estimated using spatial cues obtained from microphone observations. Assuming that the sound sources are sparse in the temporal-spatial domain, speech/noise PSDs may be estimated accurately. However, PSD estimation errors increase under circumstances beyond this assumption. In this study, we integrated speech models and PSD-estimation-in-beamspace method to correct speech/noise PSD estimation errors. The roughly estimated noise PSD was obtained frame-by-frame by analyzing spatial cues from array observations. By combining noise PSD with the statistical model of clean-speech, the relationships between the PSD of the observed signal and that of the target speech, hereafter called the observation model, could be described without pre-training. By exploiting Bayes' theorem, a Wiener filter is statistically generated from observation models. Experiments conducted to evaluate the proposed method showed that the signal-to-noise ratio and naturalness of the output speech signal were significantly better than that with conventional methods.