Emotion Localization

In this demo, you can observe speech emotion classification over four classes: angry, happy, neutral, and sad. You can choose among several audio files, see where the model localizes the emotion within the utterance, and observe the output probabilities for each class. Better localization leads to a better prediction.

Explanation of the demo

LSTM-based recurrent neural networks have been used extensively in SER, mainly because of the traditional attention mechanisms built on top of them. RNNs have an internal state that tries to remember what happened in the past, but after roughly 100 steps they begin to forget. This is a problem for audio signals, since 5 seconds of speech can span 500 frames (steps) or more. With self-attention we not only address this memory problem: the model can also decide which frames are most important to focus on in order to predict an emotion. We call this process “emotion localization”.
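The idea of attention-based localization can be sketched as follows: each frame feature (e.g. an LSTM output) gets a relevance score, the scores are normalized with a softmax into attention weights, and the utterance embedding is the weighted sum of the frames. The peaks of the attention weights mark where the emotion is localized. This is a minimal NumPy sketch, not the demo's actual model; the scoring vector `w` and the feature dimension are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frames, w):
    """Self-attentive pooling over frame features.

    frames: (T, D) array of per-frame features (e.g. LSTM outputs).
    w:      (D,) scoring vector (stand-in for learned parameters).
    Returns the utterance embedding and the attention weights,
    whose peaks localize the frames the model focuses on.
    """
    scores = frames @ w            # (T,) one relevance score per frame
    alpha = softmax(scores)        # attention weights, non-negative, sum to 1
    return alpha @ frames, alpha   # weighted sum -> (D,) utterance vector

# toy example: 500 frames (~5 seconds of audio), 8-dim features
rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 8))
w = rng.normal(size=8)
embedding, alpha = attention_pool(frames, w)
# the largest alpha values mark where the model "looks" in the utterance
top_frames = np.argsort(alpha)[-5:]
```

The resulting `embedding` would then be fed to a classifier over the four emotion classes, while `alpha` provides the localization shown in the demo.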

Demo information

January 2019
Lorenzo Tarantino
Master's Student