Emotion recognition is the task of identifying the feeling expressed in a person's writing, speech, or movements. It is particularly hard because it is subjective: two people can recognize two different emotions in the same observation, which makes reliable labeling difficult and leads to a shortage of good-quality training data. Previous work on emotion recognition has mainly focused on sentiment analysis from text and from facial expressions. These techniques have often been combined with audio data, but little work has been done on recognizing emotion from audio alone. Automatically recognizing emotions from speech could benefit many fields, including customer support, interviews, and market research. However, current technology does not yet produce results that are accurate enough to deploy. Implementing a performant and scalable model for this task would therefore enable the creation of valuable use cases.
When someone speaks, they do not express emotions uniformly throughout the utterance; usually the emotion is localized in a few key moments. As a consequence, recent research has focused on attention models for this task. Attention models (figure on the right) give more importance to certain steps of the sequence under study (such as words in a sentence) and less to others (for example, silences in the speech). Two years ago a new attention technique, called self-attention, was introduced. Unlike classical attention models, self-attention models learn the correlations between all pairs of elements in the sequence (the words, in the case of sentences) in order to predict the emotion associated with it.
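To make the idea concrete, here is a minimal numpy sketch of scaled dot-product self-attention, the core operation behind such models. The projection matrices and dimensions are illustrative, not taken from the project's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) sequence of T frame/word embeddings
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # context vectors + attention map

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
```

Each row of `w` shows how much one step of the sequence attends to every other step, which is exactly the pairwise correlation structure mentioned above.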
In this project, we focused on four emotions - angry, happy, neutral, and sad - in order to compare our results with the literature. When handling audio files there are common sets of features to extract, such as MFCCs and prosodic, spectral, and cepstral features. To demonstrate the robustness of our model, we tested it with different inputs: predefined feature sets, and features extracted directly from raw audio by convolutional layers (combining feature extraction and emotion classification in a single model).
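The raw-audio variant can be pictured as a learnable filterbank: a strided 1D convolution slides a set of filters over the waveform and a nonlinearity produces a feature map. The numpy sketch below illustrates only the mechanism; the filter width (25 ms), stride (10 ms hop), and filter count are assumed values, not the project's actual configuration:

```python
import numpy as np

def conv1d(signal, kernels, stride=1):
    # kernels: (n_filters, k); valid, strided 1D convolution
    k = kernels.shape[1]
    n_out = (len(signal) - k) // stride + 1
    frames = np.stack([signal[i * stride : i * stride + k] for i in range(n_out)])
    return frames @ kernels.T  # (n_out, n_filters)

rng = np.random.default_rng(1)
audio = rng.normal(size=16000)        # 1 s of synthetic 16 kHz audio
filters = rng.normal(size=(32, 400))  # 32 filters of 400 samples (25 ms)
feats = np.maximum(conv1d(audio, filters, stride=160), 0)  # ReLU, 160-sample hop
```

In a trained model the filters would be learned jointly with the classifier, so the network discovers its own alternative to hand-crafted features such as MFCCs.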
We use two metrics to evaluate our results: weighted accuracy (correctly classified samples divided by the total number of samples) and unweighted accuracy (the average of the per-class accuracies over the four classes). The first gives a general idea of how well the model behaves overall, while the second shows whether it behaves well on every class. Unweighted accuracy is particularly informative when classes are unbalanced, as in our case: the happy class has half as many samples as each of the other classes. Our approach improved on the state of the art with every feature set we used. Our best model achieves 68.1% weighted accuracy (62.5% previously) and 63.8% unweighted accuracy (59.6% previously).
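The two metrics can be computed in a few lines. This sketch uses hypothetical toy labels (with class 0 over-represented) to show how the two scores diverge on unbalanced data:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    # overall fraction of correctly classified samples
    return np.mean(y_true == y_pred)

def unweighted_accuracy(y_true, y_pred):
    # per-class accuracy, averaged so every class counts equally
    classes = np.unique(y_true)
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 3])  # imbalanced toy labels
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 3])
wa = weighted_accuracy(y_true, y_pred)    # 6/8 = 0.75
ua = unweighted_accuracy(y_true, y_pred)  # (2/3 + 1 + 1/2 + 1) / 4 ≈ 0.792
```

A model that ignores a rare class can still post a high weighted accuracy; the unweighted score penalizes that failure, which is why both are reported.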
This novel use of self-attention, a technique generally applied to language modeling and here applied to Speech Emotion Recognition (SER), shows promising results and could finally enable the creation of valuable use cases.