As part of my master thesis at EPFL, I am working on a well-known problem in the Speech Processing community: Speech Emotion Recognition (SER). Identifying the emotional state of a person from their speech is an essential task in the field of human-computer interaction (HCI) and has proven very useful in applications such as call-center bots or intelligent cars. If you are not familiar with SER, this article will hopefully give you an insight into what emotion recognition is, what it can offer and, more specifically, what I have been working on for the last few months.

With the explosion of interest in Machine Learning (ML), Artificial Intelligence (AI) and Big Data over the last few years, Neural Networks (NNs) have emerged as the top-performing technique for a large variety of tasks, from image classification to natural language processing, outperforming earlier approaches. However, the performance of SER systems has only improved slightly with the arrival of neural networks. Today’s state-of-the-art models yield accuracies of approximately 55-65%, while in image classification, for example, similarly complex models achieve up to 95% accuracy.

One reason behind this low performance lies in the nature of the data used to train SER systems, i.e. speech. Emotion is not uniformly distributed throughout a sentence: not all phones (i.e. the units of sound that make up each word) carry emotion. Additionally, silent parts between phones need to be disregarded by the model, since they do not carry any emotional content. One recent and innovative way to overcome these two issues and improve the performance of an SER model is to use an Attention Mechanism (AM).

Attention mechanisms were initially introduced for Natural Language Processing (NLP) tasks and were later applied to various other fields of research. The concept is simple: most of the data we work with carries some degree of noise or contains some irrelevant information. An attention mechanism gives a model the ability to learn where the relevant information is located in the data (e.g. in an image) and to alleviate the impact of the inherent noise. For example, it is easy for humans to see that a set of photographs all show cats, even though the pictures are noisy: the backgrounds are different, the cats are in different positions, and so on. However, if the human brain did not have the ability to differentiate between an object and the background and to focus on the object, it would be much harder to detect the cats.

The same principle applies to emotion recognition from speech. Just as not all pixels of a picture belong to the cat, not all phones of a spoken sentence convey emotional load. By building a model that can locate the emotionally relevant parts of speech and give them higher weight than the noisy, silent or irrelevant parts when predicting the emotional label, we can significantly improve the performance of state-of-the-art SER systems.
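To make the idea concrete, here is a minimal NumPy sketch of attention-weighted pooling over speech frames. The frame features and relevance scores below are made up purely for illustration; in a real system the scores are learned by the model rather than hand-picked.

```python
import numpy as np

# Toy example: 6 acoustic frames, each described by a 4-dim feature vector.
# The scores are illustrative: higher means "more emotionally relevant",
# lower means "silence or irrelevant content".
frames = np.random.randn(6, 4)                 # (num_frames, feature_dim)
relevance_scores = np.array([0.1, 2.0, 1.5, -1.0, 0.2, 1.8])

# Softmax turns the raw scores into attention weights that sum to 1.
weights = np.exp(relevance_scores) / np.sum(np.exp(relevance_scores))

# The utterance representation is a weighted average of the frames,
# so relevant frames contribute more than silent or noisy ones.
utterance_vector = weights @ frames            # (feature_dim,)
print(weights.round(3), utterance_vector.shape)
```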

This is what my master thesis is all about: localizing the emotionally salient parts of speech to enhance the performance of emotion recognition systems. To this end, we have built a novel attention mechanism based on Recurrent Neural Networks (RNNs). The system weights each frame by taking into account not only the importance of that single frame, but also its position in the context of the entire spoken utterance.
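For readers curious about what such a mechanism can look like in practice, below is a minimal PyTorch sketch of frame-level attention on top of a bidirectional GRU. The feature dimension, hidden size, layer choices and class name are illustrative assumptions, not the exact architecture developed in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveRNNClassifier(nn.Module):
    """Illustrative RNN + attention pooling for utterance-level emotion recognition."""

    def __init__(self, feat_dim=40, hidden_dim=128, num_emotions=4):
        super().__init__()
        # A bidirectional GRU places each frame in the context of the whole utterance.
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # One relevance score per frame, computed from its context-aware hidden state.
        self.attn_score = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim), e.g. log-Mel filterbank features
        hidden, _ = self.rnn(frames)                    # (batch, num_frames, 2*hidden_dim)
        scores = self.attn_score(hidden).squeeze(-1)    # (batch, num_frames)
        weights = F.softmax(scores, dim=-1)             # attention weights, sum to 1
        # Weighted sum of frame representations -> one vector per utterance.
        utterance = torch.bmm(weights.unsqueeze(1), hidden).squeeze(1)
        return self.classifier(utterance), weights

# Usage: a batch of 2 utterances, 120 frames each, 40-dim features per frame.
model = AttentiveRNNClassifier()
logits, attn = model(torch.randn(2, 120, 40))
print(logits.shape, attn.shape)   # torch.Size([2, 4]) torch.Size([2, 120])
```

The bidirectional recurrence is what allows the score of each frame to depend on the whole utterance rather than on that frame alone.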

The improvements brought by attention mechanisms, not only to SER but also to various other challenging tasks, are far from negligible. Given the amount of focus and interest currently directed at this approach across several research fields, there is no doubt that attention is one of the hottest topics in the AI world.

This project is supported by Swisscom, the leading telecommunications company in Switzerland, and the Swisscom Digital Lab, an innovation lab located at the EPFL Innovation Park and dedicated to applied research in machine learning, data analytics, security and communication systems.