At some level, computers learn much the same way humans do: we learn by receiving some input, practising, adapting our understanding and then repeating the process. Take the example of a child learning to speak: the child eventually learns by listening to many sentences, trying to form its own, getting feedback on them, adapting its knowledge and then trying again.

Machine Learning mimics this process of learning and requires models to learn from a massive amount of data: the training samples. A model eventually learns because it is able to try something for each sample, get feedback, adapt its knowledge and then try again.

Simple learning.png

Towards better learning

The most common way to improve a model’s performance is to look at the model itself. This makes perfect sense: by improving the capacity of the brain (i.e. the model) to learn, we improve the result. However, for some models, the way training samples are provided can also have a significant impact on the learning process and is worth taking into account.

This observation is easy to understand with humans: we do not teach a child to speak by starting with complex sentences, but rather with very simple ones, gradually increasing the difficulty. We teach humans in a meaningful order (for example by exploiting easier, previously learned concepts to ease the learning of new, harder ones), and this observation can be translated and generalized to Machine Learning.

With the right ordering of the data provided during the learning process, one can expect two major effects: faster learning and/or a better final performance.

Language Modelling

Standard Language Modelling is the task of predicting the word that follows a sequence of words. It models a probability distribution over sequences of words and is a sub-component of other tasks such as Machine Translation, Text Summarization and Speech Recognition. During our research, we studied the effect of the order of the training samples on this particular task and devised a novel approach that improves the performance of state-of-the-art models.
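To make the task concrete, here is a minimal sketch of such a probability distribution using a toy bigram model. The tiny corpus and helper functions below are ours, purely for illustration, and have nothing to do with the state-of-the-art models mentioned above.

```python
from collections import defaultdict

# Toy corpus, used only to illustrate next-word prediction.
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows another (a tiny bigram model).
bigram_counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probability(prev, word):
    """P(word | prev), estimated from the toy counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

def sentence_probability(words):
    """Chain rule: P(w1..wn) = P(w2|w1) * P(w3|w2) * ... (bigram approximation)."""
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= next_word_probability(prev, nxt)
    return p

print(next_word_probability("the", "cat"))         # 2/3: "the" is followed by "cat" twice, "mat" once
print(sentence_probability("the cat sat".split())) # 2/3 * 1/2 = 1/3
```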

The order matters

In the case of Language Modelling, some models are very sensitive to the order in which the data samples are provided. Those models, like humans, use what has previously been learnt to learn new concepts. Based on this observation, we measured the impact of four different ordering strategies on the data samples: no order, local order, standard order and total order.

To fully understand the differences between those ordering strategies, let’s imagine that our model is a student who takes 5 courses during his semester. The 5 courses represent a batch size of 5, and our student has a particular mindset: once he starts learning a course on a particular day of the week, he must keep that day and cannot take another course on the same day.

A standard order is what any student would expect: each course is always given on the same day and the content of each course follows its chapters in order. This is what our student’s schedule would look like, where each row is a week:

Course standard.png

With no order, courses and their chapters do not follow any coherent order, which is a nightmare for our student:

Course no order.png

With a local order, the school director has decided to shuffle the weeks. Courses are kept on the right days, but the order of the chapters is changed, in the same way for every course:

Course partial order.png

The no order and local order variants both yield poor results, since they do not preserve the order the student needs to learn properly. With standard order, however, the student is able to learn in an efficient way.
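As a minimal sketch of the analogy, here is how the three schedules above could be generated. Representing a lesson as a (course, chapter) pair is our choice, purely for illustration, and is not how our training data is actually stored.

```python
import random

random.seed(0)

N_COURSES, N_CHAPTERS = 5, 4   # a batch of 5 "courses", each with 4 "chapters"

# A lesson is a (course, chapter) pair; each row of a schedule is one week.

# Standard order: in week w, every course teaches its chapter w, in course order.
standard = [[(course, week) for course in range(N_COURSES)]
            for week in range(N_CHAPTERS)]

# No order: the very same lessons, shuffled with no structure at all.
all_lessons = [lesson for week in standard for lesson in week]
shuffled = random.sample(all_lessons, k=len(all_lessons))
no_order = [shuffled[i:i + N_COURSES] for i in range(0, len(shuffled), N_COURSES)]

# Local order: courses keep their day, but the weeks (i.e. the chapter order)
# are shuffled, and in the same way for every course.
week_perm = random.sample(range(N_CHAPTERS), k=N_CHAPTERS)
local_order = [[(course, week_perm[week]) for course in range(N_COURSES)]
               for week in range(N_CHAPTERS)]

for name, schedule in [("standard", standard),
                       ("no order", no_order),
                       ("local order", local_order)]:
    print(name)
    for week in schedule:
        print("  ", week)
```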

With our method we further improve on the standard order schedule by introducing a total order variant. When our student sequentially learns the chapters of a course, he actually misses some parts of the teaching: the concepts that appear between the chapters. In other words, the student needs extra lessons that explain what happens between each pair of consecutive chapters. In the total order variant we tackle this problem by introducing those new lessons. Here is what our student’s schedule would look like:

Course total order.png

Those new lessons close the learning gap on the in-between concepts and further improve our student’s ability to learn.

We can observe a new property here: chapters must be strictly contiguous. That means we cannot introduce the in-between chapter 1½-2½ right after chapter 1. This is another particularity of our student, who is able to learn a new chapter only if the teacher starts exactly from where he stopped the last time.

Total order with Language Modelling

In the case of Language Modelling, each lesson represents the modelling of a sequence of tokens. If an entire course is a contiguous sequence of 12 tokens, then the first chapter (or lesson) of the course corresponds to the first 4 tokens (1 to 4). Chapters 2 and 3 correspond to tokens 5 to 8 and 9 to 12 respectively. In the total order method, we add two in-between chapters covering tokens 3 to 6 and 7 to 10. We call it total order because it models every possible ordinal relation between tokens, as opposed to the standard order, where the model cannot learn the relations between tokens 4 and 5 or between tokens 8 and 9.

Tokens to seq2.png
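As a minimal sketch, this is how the standard and in-between sequences of the 12-token example could be built. The helper functions and the flat list of token ids are ours, for illustration only, and assume that the in-between sequences are processed as their own contiguous stream; the exact batching in our experiments may differ.

```python
def standard_order_sequences(tokens, seq_len):
    """Non-overlapping chunks: tokens 1-4, 5-8, 9-12 for seq_len=4."""
    return [tokens[i:i + seq_len]
            for i in range(0, len(tokens) - seq_len + 1, seq_len)]

def in_between_sequences(tokens, seq_len):
    """Chunks offset by half a sequence: tokens 3-6, 7-10 for seq_len=4.
    They cover the chapter boundaries that the standard chunks never model."""
    offset = seq_len // 2
    return [tokens[i:i + seq_len]
            for i in range(offset, len(tokens) - seq_len + 1, seq_len)]

def total_order_sequences(tokens, seq_len):
    """Standard chunks plus the in-between chunks. Each list stays contiguous,
    so a model that carries state from chunk to chunk is never given a gap."""
    return standard_order_sequences(tokens, seq_len) + \
           in_between_sequences(tokens, seq_len)

tokens = list(range(1, 13))                 # a contiguous "course" of 12 tokens
print(standard_order_sequences(tokens, 4))  # [[1..4], [5..8], [9..12]]
print(in_between_sequences(tokens, 4))      # [[3..6], [7..10]]
```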

Results

Standard order is a common practice in Language Modelling while total order is a novel method that allows models to further improve their learning capabilities.

Using state-of-the-art Language Modelling models, we are able to push the state of the art forward and reduce the perplexity on 3 different datasets. We demonstrate the effectiveness of the total order strategy and show how the ordering of data points can impact the effectiveness of a model.

We reduced the perplexity on the following datasets: -2.18 ppl on the PTB, -1.36 ppl on Wikitext2 and -0.97 ppl on Wikitext103.
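For reference, perplexity is the exponential of the average negative log-likelihood per token, so a lower value means the model is less surprised by the test text. Here is a minimal sketch of the metric itself, not of our evaluation code:

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(average negative log-likelihood per token).
    `log_probs` are the natural-log probabilities the model assigned
    to each token of the evaluation text; lower is better."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 0.1 to every token has perplexity 10.
print(perplexity([math.log(0.1)] * 5))   # ~10.0
```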