Explanation of the demo
LSTM-based recurrent neural networks have been used extensively for SER, often with a traditional attention mechanism built on top of them. An RNN keeps an internal state that tries to remember what happened earlier in the sequence, but after roughly 100 steps it starts to forget. This matters for audio signals: 5 seconds of audio can easily yield 500 frames (steps) or more. With self-attention we not only avoid this memory bottleneck, but the model can also decide which frames are most important for predicting an emotion. We call this process “emotion localization”.
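The frame-weighting idea above can be sketched as a simple attention pooling layer: each frame's feature vector gets a scalar score, the scores are softmax-normalized into weights over the whole utterance, and the weighted sum becomes a fixed-size vector for emotion classification. The weights themselves show which frames the model focuses on, which is what makes emotion localization possible. This is a minimal NumPy sketch, not the demo's actual implementation; the parameter shapes, names (`W`, `v`), and frame count are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frames, W, v):
    # frames: (T, d) per-frame features, e.g. from an acoustic encoder.
    # W: (d, h) and v: (h,) are learned projection parameters (here random,
    # purely illustrative).
    scores = np.tanh(frames @ W) @ v      # (T,) one score per frame
    weights = softmax(scores)             # (T,) attention over frames
    context = weights @ frames            # (d,) weighted sum of frames
    return context, weights

rng = np.random.default_rng(0)
T, d, h = 500, 64, 32                     # ~500 frames for ~5 s of audio
frames = rng.standard_normal((T, d))
W = rng.standard_normal((d, h)) * 0.1
v = rng.standard_normal(h) * 0.1

context, weights = attention_pool(frames, W, v)
# `context` would feed an emotion classifier; inspecting `weights`
# reveals which frames the model attends to ("emotion localization").
print(context.shape, weights.shape)
```

The weights form a probability distribution over frames, so plotting them against time gives a direct view of where in the utterance the model "listens" when making its prediction.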