Translating speech to text is a very useful capability today. It is everywhere, and every major company uses this technology one way or another: Amazon's Alexa, Google's Assistant, and Apple's Siri are examples of complex speech-to-text tools, while bots recognizing keywords are examples of simpler ones. Transcribing speech is also useful for storage: archiving a conversation in text format takes far less space than keeping the audio. The transcribed text can also be processed further, on its own or along with the speech, for churn detection for example.
Although these models seem to work well, in practice the domain they operate on is very narrowly defined. You can't talk to them in another language or when there is too much noise, and good luck getting yourself understood if you have a heavy accent.
We have an example of this behaviour below: a model obtaining a reasonable 25.6% Word Error Rate* on the TEDLIUM dataset (conference talks) gets a catastrophic 96.8% WER on the Switchboard dataset (phone conversations)!
Why is that the case? There are several key differences between the two datasets, and they are precisely why the model fails:
- The recording quality of the phone conversations is bad.
- We have two persons speaking instead of one, sometimes at the same time.
- The speakers stutter and hesitate a lot more (improvised vs rehearsed speech).
- The type of discourse and the vocabulary are quite different.
All that said, is the model then good for nothing but conference recordings?
A common way to "solve" this problem is to... train another model on the new dataset and keep the specialized model in its own domain. However, this feels like a waste: we already have a trained model capable of recognizing speech, so why retrain from scratch?
The process of adapting a model to a new domain is simply called Domain Adaptation, and that's where it comes to our rescue. Although adaptation is clearly defined, there is no clear definition of what a domain is: it can be a language, an accent, a recording condition, the absence or presence of noise, etc. There are thus many different domains, but adaptation techniques apply to any of them, which makes them very general and useful.
A very simple and fast adaptation method is to fine-tune the model on the new dataset. Indeed, the previous model already has some internal representation of speech, and a close-but-not-quite initialization is always better than a random one for training. We can combine this with layer freezing to speed up the process even further while still getting very reasonable, or even better, results: freezing reduces the network's freedom, so it has less capacity to overfit the data.
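To make the fine-tuning-with-freezing idea concrete, here is a minimal PyTorch sketch. The toy model and its layer sizes are entirely hypothetical (they stand in for a real pretrained acoustic model); the point is only the mechanics of freezing early layers and fine-tuning the rest.

```python
import torch
import torch.nn as nn

# Hypothetical toy "acoustic model": the sizes and layers are illustrative,
# not taken from any specific speech recognition system.
model = nn.Sequential(
    nn.Linear(40, 128),   # low-level feature layers (already well trained)
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 30),   # output layer, the part we adapt to the new domain
)

# Freeze everything, then unfreeze only the last layer before fine-tuning.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# Hand only the unfrozen parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Fewer trainable parameters means faster updates and less room to overfit a small in-domain dataset; how many layers to unfreeze is a hyperparameter worth tuning per domain.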
A more advanced technique is to use a Domain Adversarial network, whose job is to discriminate between the two domains. The main network then gets an additional objective: fooling this adversarial classifier so that it can no longer tell the two domains apart. This forces the network to learn generic features instead of domain-specific ones, and, as we all know, generalization is great for neural networks, which have a strong tendency to overfit the data we feed them.
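The standard trick for training against such an adversary is a gradient reversal layer: it is the identity in the forward pass, but flips the sign of the gradient flowing back from the domain classifier, so minimizing the classifier's loss pushes the feature extractor toward domain-invariant features. Below is a minimal PyTorch sketch; the function names are my own, not from any library.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient
    in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The feature extractor receives the *negated* gradient of the
        # domain classifier's loss: the classifier learns to separate
        # domains while the features learn to be indistinguishable.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Sanity check: forward is the identity, backward flips the sign.
x = torch.ones(3, requires_grad=True)
y = grad_reverse(x, 0.5)
y.sum().backward()
```

In a full setup, the domain classifier sits behind `grad_reverse` on top of the shared features, while the transcription head connects to them directly.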
Another important advantage of domain adaptation is that it lets us adapt a model to a low-resource domain (a language with few speakers, or with few transcriptions, for example) and obtain better results than a model trained from scratch on it.
Domain Adaptation therefore still has a lot of potential to uncover, and new techniques will surely arise in the future to further improve on these results.
*WER is the minimum number of word insertions, deletions, and substitutions needed to transform a phrase into a reference one, divided by the length of the reference phrase.
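That definition is just a word-level edit (Levenshtein) distance normalized by the reference length, which can be computed with a small dynamic program:

```python
def wer(hypothesis, reference):
    """Word Error Rate: minimum word-level edit distance between the
    hypothesis and the reference, divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()

    # d[i][j] = edit distance between the first i hypothesis words
    # and the first j reference words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(ref) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)

    return d[len(hyp)][len(ref)] / len(ref)
```

For example, `wer("the cat sat on mat", "the cat sat on the mat")` is 1/6: one insertion is needed, and the reference is six words long. Note that WER can exceed 100% when the hypothesis needs more edits than the reference has words.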