Siamese Attention-Based LSTM for Speech Emotion Recognition

Tashpolat NIZAMIDIN  Li ZHAO  Ruiyu LIANG  Yue XIE  Askar HAMDULLA  

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E103-A    No.7    pp.937-941
Publication Date: 2020/07/01
Online ISSN: 1745-1337
DOI: 10.1587/transfun.2019EAL2156
Type of Manuscript: LETTER
Category: Engineering Acoustics
Keywords: Siamese networks, pairwise training, attention-based long short-term memory, speech emotion recognition

As one of the popular topics in the field of human-computer interaction, Speech Emotion Recognition (SER) aims to classify the emotional tendency of speakers' utterances. With existing deep learning methods and a large amount of training data, highly accurate results can be achieved. Unfortunately, building an emotional speech database large enough to be universally applicable is a time-consuming and difficult job. However, the Siamese Neural Network (SNN), which we discuss in this paper, can yield extremely precise results with just a limited amount of training data through pairwise training, which mitigates the impact of sample deficiency and provides enough iterations. To obtain sufficient SER training from limited data, this study proposes a novel method that uses Siamese Attention-based Long Short-Term Memory Networks. In this framework, we design two Attention-based Long Short-Term Memory Networks that share the same weights, and we feed frame-level acoustic emotional features to the Siamese network rather than utterance-level emotional features. The proposed solution has been evaluated on the EMODB, ABC and UYGSEDB corpora and shows significant improvement in SER results compared to conventional deep learning methods.
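The pairwise idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the attention vector `w` and the Euclidean distance are all assumptions made for illustration, and the LSTM layer is omitted so that attention is applied directly to the frame-level features. The sketch shows how N labeled utterances expand into N(N-1)/2 training pairs, and how both inputs of a pair pass through one encoder with the same weights.

```python
import math
from itertools import combinations


def make_pairs(samples):
    """Build all unordered pairs from (frames, label) samples.

    Returns (frames_a, frames_b, same) tuples, where same is 1 when both
    utterances carry the same emotion label and 0 otherwise.
    """
    return [(xa, xb, int(ya == yb))
            for (xa, ya), (xb, yb) in combinations(samples, 2)]


def attention_pool(frames, w):
    """Collapse a variable-length sequence of frame vectors into one vector.

    Each frame is scored by a dot product with w; a softmax over the scores
    gives the attention weights of the weighted sum. In the paper's setting
    the inputs would be LSTM outputs; raw frames are used here (assumption).
    """
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in frames]
    top = max(scores)                       # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]


def siamese_distance(frames_a, frames_b, w):
    """Encode both inputs with the SAME weights w, then compare them.

    Euclidean distance is an assumption; the abstract does not name one.
    """
    ea = attention_pool(frames_a, w)
    eb = attention_pool(frames_b, w)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))


# Hypothetical toy data: 2-dimensional frame features, varying frame counts.
samples = [
    ([[0.1, 0.2], [0.3, 0.1]], "anger"),
    ([[0.2, 0.1], [0.4, 0.2], [0.1, 0.0]], "anger"),
    ([[0.9, 0.8], [0.7, 0.9]], "joy"),
    ([[0.8, 0.7]], "joy"),
]
pairs = make_pairs(samples)   # 4 utterances yield 6 pairs
w = [0.5, -0.5]               # shared attention weights (arbitrary values)

for frames_a, frames_b, same in pairs:
    print(same, round(siamese_distance(frames_a, frames_b, w), 3))
```

Because the number of pairs grows quadratically with the number of utterances, a small corpus still supplies many training examples, which is the sense in which pairwise training mitigates sample deficiency. In an actual model, the shared weights would be learned so that same-emotion pairs end up close together and different-emotion pairs far apart.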