Why do we need a Recommender System?
Have you ever struggled to find a movie that fits your preferences? If not, try replacing “movie” with another noun. When it does happen, I'm pretty sure the search can be very time-consuming and, sometimes, maddening. Netflix, Spotify, and other online entertainment platforms are well aware of this and do not want it to happen to their users, because unhappy users can leave their business in ruins. To take another example, Amazon runs an online marketplace where a great number of new products arrive every day. Beyond a shadow of a doubt, without a good system that can show new products to the right user at the right time, Amazon would quickly get into trouble. The tool that Netflix, Spotify, and Amazon all need here is called a Recommender System. Its underlying goal is to infer user preferences from billions of interactions between a set of users and a set of items.
Ingredients for Recommender System
Let’s now talk about the ingredients for a Recommender System: what do we need in order to build one? In a Recommender System, we have users, items, and the interactions between them. Interactions can be either implicit or explicit. Clicking, watching, or listening are implicit interactions, which do not express a clear preference, whereas explicit ones, like feedback or ratings, reveal a very strong preference for a specific item through the score the user gives. Besides the user-item interactions, we may also have information about the items (the description of an item, the actors in a movie, etc.) or about the users (gender, age, etc.).
With the data in hand, the next question is which techniques to use to build a Recommender System. Two of the most popular are Collaborative Filtering (CF) and Content-Based (CB) recommendation. In order to make a personalised recommendation for a specific user, CF relies on the interactions of all other users across all items, while CB relies only on the historical interactions of this user and the information about the items.
Collaborative Filtering recommendation
The main idea behind CF is to measure similarities between users (user-based) or between items (item-based), exploiting the fact that humans enjoy sharing opinions. The similarity in question is not literal or physical. A pair of users is considered similar not because they share the same nationality or mannerisms, but because they treat several items similarly in terms of like and dislike. Likewise, two items are considered similar because people treat them the same way in terms of like and dislike, not because they share the same function or design.
User-Based Collaborative Filtering
As mentioned, user-based (UB) CF measures the similarity between users. Once the system knows who is most similar to the target user (the one we want to recommend to), it computes a prediction score for each item the target user has not seen yet, a score that depends only on the ratings from the most similar users found before. Looking at Figure 1 above, we can see that Guillaume and Christian are `similar` because they have both liked movies M1 and M3. The system detects this similarity and will probably recommend movie M4 to Christian.
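To make this concrete, here is a minimal numpy sketch of user-based CF on a toy 4x4 matrix in the spirit of Figure 1 (the matrix and user names are illustrative, not the actual data): unseen items are scored by the similarity-weighted votes of the other users.

```python
import numpy as np

# Toy interactions matrix: rows = users, columns = movies M1..M4.
# 1 = liked, 0 = no interaction.
R = np.array([
    [1, 0, 1, 1],   # Guillaume liked M1, M3, M4
    [0, 1, 0, 0],
    [1, 0, 1, 0],   # Christian liked M1, M3 -> similar to Guillaume
    [0, 1, 1, 0],
])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

target = 2  # Christian
sims = np.array([cosine_sim(R[target], R[u]) if u != target else 0.0
                 for u in range(R.shape[0])])

# Score each item by the similarity-weighted votes of the other users,
# then mask out the items the target user has already liked.
scores = sims @ R
scores[R[target] == 1] = -np.inf
print(int(np.argmax(scores)))   # -> 3, i.e. movie M4, as in Figure 1
```

Note how Christian's strongest neighbour is Guillaume (cosine ≈ 0.82), so M4, which only Guillaume liked, comes out on top.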
The example we discussed above has only 4 users and 4 items, so we can easily make a recommendation by eye. But in a typical Recommender System, detecting similarity becomes very hard because of the huge number of items (I) and users (N). Matrix Factorization (MF) has been one of the best approaches to this problem, but it still has limitations due to the sparsity of the user-item interactions matrix (most users only interact with a tiny proportion of the items). As of 2018, Variational Autoencoders (VAEs) handle this problem very well. VAEs are generative models built from two connected networks: an encoder and a decoder. Imagine each user represented as a row of the user-item interactions matrix (i.e. an I-dimensional binary vector where a 1 at position i indicates that the user liked item i). The encoder takes this huge vector as input and compresses it into a small, dense representation. This is then passed to the decoder, which tries to reconstruct the original input while minimizing a reconstruction loss. By using VAEs with an appropriate regularization and a reasonable likelihood function, Liang et al. (2018) obtained state-of-the-art (SoTA) results on three real-world recommendation datasets (MovieLens 20M, Million Song Dataset, Netflix Prize).
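As a rough illustration of this encode-compress-decode loop, here is the forward pass of a toy multinomial VAE in numpy. This is only a sketch with randomly initialised weights and made-up layer sizes; it is not Liang et al.'s actual model, which adds input dropout, an annealed KL term, and trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
I, H, D = 1000, 64, 32   # num items, hidden size, latent size (illustrative)

# Randomly initialised weights stand in for a trained model.
W_enc = rng.normal(0, 0.01, (I, H))
W_mu = rng.normal(0, 0.01, (H, D))
W_logvar = rng.normal(0, 0.01, (H, D))
W_dec = rng.normal(0, 0.01, (D, I))

def encode(x):
    h = np.tanh(x @ W_enc)
    return h @ W_mu, h @ W_logvar            # mean and log-variance of q(z|x)

def reparameterize(mu, logvar):
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps   # z = mu + sigma * eps

def decode(z):
    logits = z @ W_dec
    return np.exp(logits) / np.exp(logits).sum()   # multinomial over all items

x = np.zeros(I); x[[3, 17, 256]] = 1.0   # a user who liked three items
mu, logvar = encode(x)                   # I-dim sparse vector -> D-dim stats
z = reparameterize(mu, logvar)           # small, dense representation
x_hat = decode(z)                        # reconstruction over all I items
assert z.shape == (D,) and x_hat.shape == (I,)
```

The key point is the bottleneck: a sparse 1000-dimensional history is squeezed into a 32-dimensional latent vector, and the decoder must spread probability mass back over all items from that alone.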
Item-Based Collaborative Filtering
Item-based (IB) CF, on the other hand, measures the similarity between items. This is where we started in order to improve our Recommender System. We believe that analyzing the items gives us the ability to determine the main factors that drive a user to like them. From there, we can design personalised recommendations more accurately.
Let’s now compute the similarity between each pair of items. Unlike in UB, each item is now represented as a column of the user-item interactions matrix (i.e. an N-dimensional binary vector where a 1 at position u indicates that the item was liked by user u). We train a VAE again so that we can transform each item into a small, dense representation (by taking the output of the encoder), which I call item embeddings.
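In code, switching from the user-based to the item-based view is just a transpose of the interactions matrix (again a toy 4x4 matrix for illustration):

```python
import numpy as np

# Rows = users, columns = items, as before.
R = np.array([[1, 0, 1, 1],
              [0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 0]])

# User-based: each user is a row of R.
# Item-based: each item is a column of R, i.e. a row of R transposed,
# so the same VAE machinery can simply be trained on R.T.
item_vectors = R.T
print(item_vectors[3])   # who liked item 3 -> [1 0 0 0]
```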
So, we now have our very first item embeddings, but the best part hasn’t come yet. Let’s use them to make recommendations. The method is intuitive and simple: we recommend the items most similar to the items the target user liked in the past. More concretely, we first compute the cosine similarity matrix between every pair of items from the item embeddings. Then, to make a recommendation, we compute a similarity score for each item and recommend the most similar ones. Suppose the target user has liked a set of items S in the past. Then, for an unseen item i, we compute its similarity score by taking the average of the similarities between movie i and every movie in S. Figure 2 shows recommendations in a chatbot using exactly the algorithm I’ve just described. We give the bot three Action/Sci-Fi movies (Kung Fu Panda, Avatar, The Amazing Spider-Man), whose embeddings are shown in red. The bot then computes a similarity score for each item and recommends back to us Iron Man, Iron Man 2, and X-Men: First Class (whose embeddings are shown in blue). The recommendation makes a lot of sense: not only are all the recommended movies Action/Sci-Fi movies, but their embeddings also lie close to those of the selected movies.
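The scoring rule above can be sketched in a few lines, assuming we already have item embeddings (random unit vectors stand in for the trained ones here, and the liked set S is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, d = 8, 16
E = rng.normal(size=(n_items, d))               # stand-in item embeddings
E /= np.linalg.norm(E, axis=1, keepdims=True)   # normalise rows

sim = E @ E.T                                   # cosine similarity matrix

S = [0, 2, 5]                                   # items the target user liked
candidates = [i for i in range(n_items) if i not in S]

# Score each unseen item i by its average similarity to the liked set S,
# then recommend the highest-scoring candidate.
scores = {i: sim[i, S].mean() for i in candidates}
top = max(scores, key=scores.get)
```

With real embeddings, `top` would be a movie whose embedding sits close to the red points of Figure 2, such as Iron Man in the chatbot example.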
The results are shown in the table. Different models (including the SoTA one) were tested on the MovieLens-20M dataset, considering only users with at least 40 ratings. The metric we use to evaluate such a recommender system is recall at k (recall@k): the proportion of a user's relevant items found in the top-k recommendations. For example, a recall@20 of 40% means that, on average, about 8 out of 20 recommended movies are ones you actually liked (for users with at least 20 relevant items). Observe that the item-based results are about 8% worse than the user-based (SoTA) ones. One improvement we made is to add an attention layer that learns attention weights for each item, so that the model pays attention to the popular items before making a prediction. This attention layer improves performance by about 3.5-4%.
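For reference, recall@k as used here can be implemented in a few lines; normalising by min(k, number of relevant items) follows the convention of Liang et al. (2018), so a perfect ranking always scores 1.0.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of a user's relevant items recovered in the top-k list,
    normalised by min(k, |relevant|) so the best achievable score is 1.0."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / min(k, len(relevant))

# A user liked 5 movies; 2 of them appear in our top-4 recommendations.
print(recall_at_k([10, 3, 7, 42], relevant=[3, 42, 8, 99, 100], k=4))  # 0.5
```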
The world’s most valuable resource is no longer oil, but data
While doing this project, we also asked some friends whether they read the item or movie descriptions before buying or watching. Most of them said yes. “The descriptions are usually useful,” they added. That made us start to think: what happens for users who might, for example, only like movies about war? It would be better if we could bake “war” into our item embeddings. Thus, we decided to investigate textual item descriptions.
Remember how we used a VAE to create the item embeddings: the encoder compresses a sparse binary item vector into a dense vector and passes it to the decoder, which reconstructs the original input. The idea is that, instead of giving the decoder exactly what the encoder produces, we concatenate some extra information to the output of the encoder and feed this new concatenated vector to the decoder instead. In the worst case, the decoder simply ignores the extra information if it finds it not useful. The extra information here comes from the item's textual description: we used the encoder of the Transformer (proposed in the paper Attention is All You Need) to encode the text, turning a variable-length item description into a fixed 100-dimensional vector.
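The concatenation trick itself is tiny. In this sketch, random vectors stand in for the VAE encoder output and the Transformer text embedding, and the layer sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
D, T, N = 32, 100, 1000   # latent size, text-embedding size, num users

z = rng.normal(size=D)    # VAE encoder output for one item (stand-in)
t = rng.normal(size=T)    # 100-d text vector from the Transformer encoder (stand-in)

# The decoder now receives the latent code with the text glued on.
z_plus_text = np.concatenate([z, t])

# The decoder's first weight matrix simply grows from (D, N) to (D + T, N);
# if the text carries no signal, training can drive those extra rows to ~0,
# which is the "worst case: decoder ignores the text" behaviour described above.
W_dec = rng.normal(0, 0.01, (D + T, N))
logits = z_plus_text @ W_dec
```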
As we see in the table, text clearly helps to improve the performance of the system. Without attention, adding text increases recall at 20 and 50 by about 1.7% and 1.6% respectively. With both text and attention, we obtain the best recall among all item-based approaches: 30.8% and 40.8% for recall at 20 and 50 respectively.
The results obtained from the item-based model look quite promising, although they cannot yet match the user-based results. We cannot judge which approach is better, since we haven't tested on a dataset where the number of items is bigger than the number of users! Anyway, keep in mind that the predictions from user-based and item-based come from two different perspectives: one based on user similarity and the other on item similarity. Therefore, as a next step, we will try to merge both models in order to exploit the strengths of each. For example, learning attention weights for every item from each model could be a good idea.
Liang, D., Krishnan, R., Hoffman, M. and Jebara, T. (2018). Variational Autoencoders for Collaborative Filtering. [online] arXiv.org. Available at: https://arxiv.org/abs/1802.05814 [Accessed 18 Jan. 2019].
Vaswani, A. et al. (2017). Attention is All You Need. [online] arXiv.org. Available at: https://arxiv.org/abs/1706.03762 [Accessed 18 Jan. 2019].