Spectra Restoration of Bone-Conducted Speech via Attention-Based Contextual Information and Spectro-Temporal Structure Constraint

Changyan ZHENG  Tieyong CAO  Jibin YANG  Xiongwei ZHANG  Meng SUN  

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E102-A   No.12   pp.2001-2007
Publication Date: 2019/12/01
Online ISSN: 1745-1337
DOI: 10.1587/transfun.E102.A.2001
Type of Manuscript: LETTER
Category: Digital Signal Processing
Keyword: bone-conducted microphone, speech enhancement, bidirectional long short-term memory, attention, Structural SIMilarity

Compared with acoustic microphone (AM) speech, bone-conducted microphone (BCM) speech is far more immune to background noise, but it suffers from severe loss of information due to the characteristics of the human-body transmission channel. In this letter, a new method for speaker-dependent BCM speech enhancement is proposed, which focuses on restoring the spectra of the distorted speech. To better infer the missing components, an attention-based bidirectional Long Short-Term Memory (AB-BLSTM) network is designed to make better use of contextual information when modeling the relationship between the spectra of BCM speech and those of the corresponding clean AM speech. In addition, a structural error metric originating from image processing, the Structural SIMilarity (SSIM) metric, is adopted as the loss function, which constrains the spectro-temporal structure of the recovered spectra. Experiments demonstrate that, compared with approaches based on conventional DNNs and the mean square error (MSE) loss, the proposed method better recovers the missing phonemes and yields spectra whose spectro-temporal structure is more similar to that of the target, leading to considerable improvement on objective metrics.
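
To illustrate how an SSIM-based loss differs from a point-wise error such as MSE, the following is a minimal sketch in plain Python, not the implementation used in the letter. It assumes spectrograms stored as nested lists indexed [time][frequency] and scaled to [0, 1], a uniform sliding window (the original SSIM formulation uses a Gaussian-weighted window), and the conventional stabilizing constants k1 = 0.01 and k2 = 0.03. The function names, window size, and defaults are hypothetical choices made for this example.

```python
def _patch(spec, t0, f0, win):
    """Flatten the win x win block of `spec` whose top-left corner is (t0, f0)."""
    return [spec[t][f] for t in range(t0, t0 + win) for f in range(f0, f0 + win)]


def _ssim_patch(x, y, c1, c2):
    """SSIM of two equally sized flattened patches."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Luminance term times combined contrast/structure term.
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def ssim_loss(pred, target, win=4, data_range=1.0, k1=0.01, k2=0.03):
    """Return 1 - mean local SSIM between two [time][frequency] spectrograms.

    Assumption for this sketch: values lie in [0, data_range] and every
    window is weighted uniformly.
    """
    n_t, n_f = len(pred), len(pred[0])
    if len(target) != n_t or any(len(row) != n_f for row in target):
        raise ValueError("pred and target must have the same shape")
    if win > min(n_t, n_f):
        raise ValueError("window is larger than the spectrogram")
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    scores = []
    for t0 in range(n_t - win + 1):
        for f0 in range(n_f - win + 1):
            x = _patch(pred, t0, f0, win)
            y = _patch(target, t0, f0, win)
            scores.append(_ssim_patch(x, y, c1, c2))
    return 1.0 - sum(scores) / len(scores)
```

Because each local score depends on the covariance between the two patches, an over-smoothed estimate that loses the harmonic and formant patterns of the target is penalized even when its average point-wise error is small. This is the sense in which the loss acts as a constraint on spectro-temporal structure.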