# Discover our publications

### Recommending Burgers based on Pizza Preferences: Addressing Data Sparsity with a Product of Experts

In this paper, we describe a method to tackle data sparsity and create recommendations in domains with limited knowledge about user preferences. We extend variational autoencoder collaborative filtering from a single-domain to a multi-domain setting. The intuition is that user-item interactions in a source domain can augment the recommendation quality in a target domain. This intuition can be taken to its extreme: in a cross-domain setup, the user history in a source domain is enough to generate high-quality recommendations in a target one. We thus create a Product-of-Experts (POE) architecture for recommendations that jointly models user-item interactions across multiple domains. The method is resilient to missing data for one or more of the domains, a situation often found in real life. We present results on two widely used datasets, Amazon and Yelp, which support the claim that holistic user preference knowledge leads to better recommendations. Surprisingly, we find that in some cases, a POE recommender that does not access the target domain user representation can surpass a strong VAE recommender baseline trained on the target domain.

Authors: Martin Milenkoski, Diego Antognini, Claudiu Musat
Publication Date: 26.04.2021
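
The way a product of experts tolerates a missing domain can be illustrated with a small sketch. It assumes each domain encoder outputs a diagonal Gaussian over the user representation and that a standard-normal prior acts as an extra expert, as is common in multimodal VAEs. The abstract does not state these details, and the function name `product_of_experts` and the example numbers are hypothetical.

```python
def product_of_experts(experts, dim):
    """Combine diagonal Gaussian experts into a single Gaussian.

    `experts` is a list with one entry per domain: a (means, variances) pair,
    or None when the user has no history in that domain (assumed convention).
    A standard-normal prior expert is always included, so the result is
    defined even if every domain is missing.
    """
    precision = [1.0] * dim       # the prior N(0, 1) contributes precision 1
    weighted_mean = [0.0] * dim   # the prior mean 0 contributes nothing
    for expert in experts:
        if expert is None:        # missing domain: simply leave the expert out
            continue
        means, variances = expert
        for i in range(dim):
            precision[i] += 1.0 / variances[i]
            weighted_mean[i] += means[i] / variances[i]
    variance = [1.0 / p for p in precision]
    mean = [m * v for m, v in zip(weighted_mean, variance)]
    return mean, variance


# Hypothetical one-dimensional user representations from two domains.
pizza = ([2.0], [1.0])
burgers = ([4.0], [0.5])

both = product_of_experts([pizza, burgers], dim=1)
source_only = product_of_experts([pizza, None], dim=1)
print(both, source_only)
```

Because a product of Gaussians is again Gaussian with summed precisions, dropping an expert only widens the posterior instead of breaking the model, which is one plausible reading of the resilience to missing domains described above.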

### OpenCSI: An Open-Source Dataset for Indoor Localization Using CSI-Based Fingerprinting

Many applications require accurate indoor localization. Fingerprint-based localization methods propose a solution to this problem, but rely on a radio map that is effort-intensive to acquire. We automate the radio map acquisition phase using a software-defined radio (SDR) and a wheeled robot. Furthermore, we open-source a radio map acquired with our automated tool for a 3GPP Long-Term Evolution (LTE) wireless link. To the best of our knowledge, this is the first publicly available radio map containing channel state information (CSI). Finally, we describe first localization experiments on this radio map using a convolutional neural network to regress for location coordinates.

Authors: Arthur Gassner, Claudiu Musat, Alexandru Rusu, Andreas Burg
Publication Date: 16.04.2021

### Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context

Recommender systems research tends to evaluate model performance offline and on randomly sampled targets, yet the same systems are later used to predict user behavior sequentially from a fixed point in time. Simulating online recommender system performance is notoriously difficult, and the discrepancy between online and offline behaviors is typically not accounted for in offline evaluations. This disparity permits weaknesses to go unnoticed until the model is deployed in a production setting. In this paper, we first demonstrate how omitting temporal context when evaluating recommender system performance leads to false confidence. To overcome this, we postulate that offline evaluation protocols can only model real-life use-cases if they account for temporal context. Next, we propose a training procedure to further embed the temporal context in existing models. We use a multi-objective approach to introduce temporal context into traditionally time-unaware recommender systems and confirm its advantage via the proposed evaluation protocol. Finally, we validate on three real-world publicly available datasets that the Pareto fronts obtained with the added objective dominate those produced by state-of-the-art models that are optimized only for accuracy. The results show that including our temporal objective can improve recall@20 by up to 20%.

Authors: Milena Filipovic, Blagoj Mitrevski, Diego Antognini, Emma Lejal Glaude, Boi Faltings, Claudiu Musat
Publication Date: 19.09.2020

### Momentum-based Gradient Methods in Multi-Objective Recommendation

Authors: Blagoj Mitrevski, Milena Filipovic, Diego Antognini, Emma Lejal Glaude, Boi Faltings, Claudiu Musat
Publication Date: 10.09.2020

### Addressing Fairness in Classification with a Model-Agnostic Multi-Objective Algorithm

The goal of fairness in classification is to learn a classifier that does not discriminate against groups of individuals based on sensitive attributes, such as race and gender. One approach to designing fair algorithms is to use relaxations of fairness notions as regularization terms or in a constrained optimization problem. We observe that the hyperbolic tangent function can approximate the indicator function. We leverage this property to define a differentiable relaxation that approximates fairness notions provably better than existing relaxations. In addition, we propose a model-agnostic multi-objective architecture that can simultaneously optimize for multiple fairness notions and multiple sensitive attributes and supports all statistical parity-based notions of fairness. We use our relaxation with the multi-objective architecture to learn fair classifiers. Experiments on public datasets show that our method suffers a significantly lower loss of accuracy than current debiasing algorithms relative to the unconstrained model.

Authors: Kirtan Padh, Diego Antognini, Emma Lejal Glaude, Boi Faltings, Claudiu Musat
Publication Date: 09.09.2020
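
The observation that the hyperbolic tangent can approximate the indicator function can be made concrete with a short sketch. The specific form `tanh(c * max(0, x))`, the sharpness constant `c`, and the helper names are assumptions made for illustration. The abstract does not give the exact relaxation.

```python
import math


def indicator(x):
    """Hard decision 1[x > 0]; its gradient is zero wherever it is defined."""
    return 1.0 if x > 0 else 0.0


def tanh_relaxation(x, c=10.0):
    """Smooth surrogate for 1[x > 0] (assumed form): tanh(c * max(0, x))."""
    return math.tanh(c * max(0.0, x))


def relaxed_positive_rate(scores, c=10.0):
    """Approximate fraction of a group that receives a positive decision."""
    return sum(tanh_relaxation(s, c) for s in scores) / len(scores)


def relaxed_parity_gap(scores_a, scores_b, c=10.0):
    """Relaxed statistical-parity gap between two groups of classifier scores."""
    return abs(relaxed_positive_rate(scores_a, c) - relaxed_positive_rate(scores_b, c))


# Hypothetical decision scores (positive means predicted positive) for two groups.
group_a = [1.0, 1.0, -1.0, -1.0]
group_b = [1.0, -1.0, -1.0, -1.0]
print(relaxed_parity_gap(group_a, group_b))
```

Larger values of `c` bring the surrogate closer to the hard indicator, while the surrogate still has a non-zero gradient for positive scores, so a gap like this could serve as a regularization term or as one objective among several.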

### Benefiting from Bicubically Down-Sampled Images for Learning Real-World Image Super-Resolution

Super-resolution (SR) has traditionally been based on pairs of high-resolution images (HR) and their low-resolution (LR) counterparts obtained artificially with bicubic downsampling. However, in real-world SR, there is a large variety of realistic image degradations and analytically modeling these realistic degradations can prove quite difficult. In this work, we propose to handle real-world SR by splitting this ill-posed problem into two comparatively more well-posed steps. First, we train a network to transform real LR images to the space of bicubically downsampled images in a supervised manner, by using both real LR/HR pairs and synthetic pairs. Second, we take a generic SR network trained on bicubically downsampled images to super-resolve the transformed LR image. The first step of the pipeline addresses the problem by registering the large variety of degraded images to a common, well understood space of images. The second step then leverages the already impressive performance of SR on bicubically downsampled images, sidestepping the issues of end-to-end training on datasets with many different image degradations. We demonstrate the effectiveness of our proposed method by comparing it to recent methods in real-world SR and show that our proposed approach outperforms the state-of-the-art works in terms of both qualitative and quantitative results, as well as results of an extensive user study conducted on several real image datasets.

Publication Date: 06.07.2020

### Interacting with Explanations through Critiquing

Using personalized explanations to support recommendations has been shown to increase trust and perceived quality. However, to actually obtain better recommendations, there needs to be a means for users to modify the recommendation criteria by interacting with the explanation. We present a novel technique using aspect markers that learns to generate personalized explanations of recommendations from review texts, and we show that human users significantly prefer these explanations over those produced by state-of-the-art techniques. Our work's most important innovation is that it allows users to react to a recommendation by critiquing the textual explanation: removing (symmetrically adding) certain aspects they dislike or that are no longer relevant (symmetrically that are of interest). The system updates its user model and the resulting recommendations according to the critique. This is based on a novel unsupervised critiquing method for single- and multi-step critiquing with textual explanations. Experiments on two real-world datasets show that our system is the first to achieve good performance in adapting to the preferences expressed in multi-step critiquing.

Authors: Diego Antognini, Claudiu Musat, Boi Faltings
Publication Date: 01.01.2020

### Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation

We introduce CGA, a conditional VAE architecture, to control, generate, and augment text. CGA is able to generate natural English sentences controlling multiple semantic and syntactic attributes by combining adversarial learning with a context-aware loss and a cyclical word dropout routine. We demonstrate the value of the individual model components in an ablation study. The scalability of our approach is ensured through a single discriminator, independently of the number of attributes. We show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments. As the main application of our work, we test the potential of this new NLG model in a data augmentation scenario. In a downstream NLP task, the sentences generated by our CGA model show significant improvements over a strong baseline, and a classification performance often comparable to adding the same amount of additional real data.

Authors: Giuseppe Russo, Nora Hollenstein, Claudiu Musat, Ce Zhang
Publication Date: 30.04.2020

### A Swiss German Dictionary: Variation in Speech and Writing

We introduce a dictionary containing forms of common words in various Swiss German dialects normalized into High German. As Swiss German is, for now, a predominantly spoken language, there is significant variation in the written forms, even between speakers of the same dialect. To alleviate the uncertainty associated with this diversity, we complement the pairs of Swiss German - High German words with the Swiss German phonetic transcriptions (SAMPA). This dictionary thus becomes the first resource to combine large-scale spontaneous translation with phonetic transcriptions. Moreover, we control for the regional distribution and ensure the equal representation of the major Swiss dialects. The coupling of the phonetic and written Swiss German forms is powerful. We show that they are sufficient to train a Transformer-based phoneme-to-grapheme model that generates credible novel Swiss German writings. In addition, we show that the inverse mapping, from graphemes to phonemes, can be modeled with a Transformer trained with the novel dictionary. This generation of pronunciations for previously unknown words is key in training extensible automated speech recognition (ASR) systems, which are key beneficiaries of this dictionary.

Authors: Larissa Schmidt, Lucy Linder, Sandra Djambazovska, Alexandros Lazaridis, Tanja Samardžić, Claudiu Musat
Publication Date: 31.03.2020

### Fast Cross-domain Data Augmentation through Neural Sentence Editing

Data augmentation promises to alleviate data scarcity. This is most important in cases where the initial data is in short supply. For existing methods, this is also where augmenting is the most difficult, as learning the full data distribution is impossible. For natural language, sentence editing offers a solution, relying on small but meaningful changes to the original sentences. Learning which changes are meaningful also requires large amounts of training data. We thus aim to learn this in a source domain where data is abundant and apply it in a different, target domain, where data is scarce: cross-domain augmentation. We create the Edit-transformer, a Transformer-based sentence editor that is significantly faster than the state of the art and also works cross-domain. We argue that, due to its structure, the Edit-transformer is better suited for cross-domain environments than its edit-based predecessors. We show this performance gap on the Yelp-Wikipedia domain pairs. Finally, we show that due to this cross-domain performance advantage, the Edit-transformer leads to meaningful performance gains in several downstream tasks.

Authors: Guillaume Raille, Sandra Djambazovska, Claudiu Musat
Publication Date: 23.03.2020

### Multi-Gradient Descent for Multi-Objective Recommender Systems

Recommender systems need to mirror the complexity of the environment they are applied in. The more we know about what might benefit the user, the more objectives the recommender system has. There may also be multiple stakeholders (sellers, buyers, shareholders) as well as legal and ethical constraints. Simultaneously optimizing for a multitude of objectives, correlated and not correlated, having the same scale or not, has proven difficult so far. We introduce a stochastic multi-gradient descent approach to recommender systems (MGDRec) to solve this problem. We show that this exceeds state-of-the-art methods in traditional objective mixtures, like revenue and recall. Not only that, but through gradient normalization we can combine fundamentally different objectives, having diverse scales, into a single coherent framework. We show that uncorrelated objectives, like the proportion of quality products, can be improved alongside accuracy. Through the use of stochasticity, we avoid the pitfalls of calculating full gradients and provide a clear setting for its applicability.

Authors: Nikola Milojkovic, Diego Antognini, Giancarlo Bergamin, Boi Faltings, Claudiu Musat
Publication Date: 09.12.2019
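
For two objectives, the idea of combining gradients of very different scales can be sketched with the closed-form minimum-norm combination used in multiple-gradient descent. This is a minimal illustration, not the MGDRec implementation. Normalizing each gradient by its L2 norm is an assumption (the abstract only says "gradient normalization"), and the function names and example gradients are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(g):
    """Scale a gradient to unit L2 norm (assumed normalization scheme)."""
    norm = math.sqrt(dot(g, g))
    return [x / norm for x in g] if norm > 0 else list(g)


def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:              # identical gradients: any weight works
        return 0.5
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, alpha))


def common_descent_direction(g1, g2):
    """Direction that does not increase either objective to first order."""
    g1, g2 = normalize(g1), normalize(g2)
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]


# Hypothetical mini-batch gradients of two objectives with very different scales.
grad_recall = [100.0, 0.0]
grad_revenue = [0.0, 0.01]
print(common_descent_direction(grad_recall, grad_revenue))
```

Without the normalization step, the larger gradient would dominate the combination. With it, both objectives contribute comparably even though their raw scales differ by several orders of magnitude.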

### Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling. To capture new content, our approach will run continuously to keep increasing the corpus over time.

Authors: Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Musat, Andreas Fischer
Publication Date: 30.11.2019

### Multi-Dimensional Explanation of Target Variables from Documents

Automated predictions require explanations to be interpretable by humans. Past work used attention and rationale mechanisms to find words that predict the target variable of a document. Often though, they result in a tradeoff between noisy explanations and a drop in accuracy. Furthermore, rationale methods cannot capture the multi-faceted nature of justifications for multiple targets, because of the non-probabilistic nature of the mask. In this paper, we propose the Multi-Target Masker (MTM) to address these shortcomings. The novelty lies in the soft multi-dimensional mask that models a relevance probability distribution over the set of target variables to handle ambiguities. Additionally, two regularizers guide MTM to induce long, meaningful explanations. We evaluate MTM on two datasets and show, using standard metrics and human annotations, that the resulting masks are more accurate and coherent than those generated by the state-of-the-art methods. Moreover, MTM is the first to also achieve the highest F1 scores for all the target variables simultaneously.

Authors: Diego Antognini, Claudiu Musat, Boi Faltings
Publication Date: 25.09.2019

### Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes

In sequence modeling tasks, the token order matters, but this information can be partially lost due to the discretization of the sequence into data points. In this paper, we study the imbalance between the way certain token pairs are included in data points and others are not. We call this token order imbalance (TOI), and we link the partial sequence information loss to a diminished performance of the system as a whole, both in text and speech processing tasks. We then provide a mechanism to leverage the full token order information (Alleviated TOI) by iteratively overlapping the token composition of data points. For recurrent networks, we use prime numbers for the batch size to avoid redundancies when building batches from overlapped data points. The proposed method achieved state-of-the-art performance in both text- and speech-related tasks.

Authors: Noémien Kocher, Christian Sciuto, Lorenzo Tarantino, Alexandros Lazaridis, Andreas Fischer, Claudiu Musat
Publication Date: 18.09.2019

### Benefiting from Multitask Learning to Improve Single Image Super-Resolution

Despite significant progress toward super-resolving more realistic images with deeper convolutional neural networks (CNNs), reconstructing fine and natural textures remains a challenging problem. Recent works on single image super-resolution (SISR) are mostly based on optimizing pixel-wise and content-wise similarity between recovered and high-resolution (HR) images and do not benefit from the recognizability of semantic classes. In this paper, we introduce a novel approach using categorical information to tackle the SISR problem; we present a decoder architecture able to extract and use semantic information to super-resolve a given image by using multitask learning, simultaneously for image super-resolution and semantic segmentation. To explore categorical information during training, the proposed decoder employs only one shared deep network for two task-specific output layers. At run time, only the layers producing the HR image are used, and no segmentation label is required. Extensive perceptual experiments and a user study on images randomly selected from the COCO-Stuff dataset demonstrate the effectiveness of our proposed method and show that it outperforms the state-of-the-art methods.

Authors: Mohammad Saeed Rad, Behzad Bozorgtabar, Claudiu Musat, Urs-Viktor Marti, Max Basler, Hazim Kemal Ekenel, Jean-Philippe Thiran
Publication Date: 29.07.2019

### Resilient Combination of Complementary CNN and RNN Features for Text Classification through Attention and Ensembling

State-of-the-art methods for text classification include several distinct steps of pre-processing, feature extraction and post-processing. In this work, we focus on end-to-end neural architectures and show that the best performance in text classification is obtained by combining information from different neural modules. Concretely, we combine convolution, recurrent and attention modules with ensemble methods and show that they are complementary. We introduce ECGA, an end-to-end go-to architecture for novel text classification tasks. We prove that it is efficient and robust, as it attains or surpasses the state-of-the-art on varied datasets, including both low and high data regimes.

Authors: Athanasios Giannakopoulos, Maxime Coriou, Andreea Hossmann, Michael Baeriswyl, Claudiu Musat
Publication Date: 28.03.2019

### Expanding the Text Classification Toolbox with Cross-Lingual Embeddings

Most work in text classification and Natural Language Processing (NLP) focuses on English or a handful of other languages that have text corpora of hundreds of millions of words. This is creating a new version of the digital divide: the artificial intelligence (AI) divide. Transfer-based approaches, such as Cross-Lingual Text Classification (CLTC), the task of categorizing texts written in different languages into a common taxonomy, are a promising solution to the emerging AI divide. Recent work on CLTC has focused on demonstrating the benefits of using bilingual word embeddings as features, relegating the CLTC problem to a mere benchmark based on a simple averaged perceptron. In this paper, we explore more extensively and systematically two flavors of the CLTC problem: news topic classification and textual churn intent detection (TCID) in social media. In particular, we test the hypothesis that embeddings with context are more effective, by multi-tasking the learning of multilingual word embeddings and text classification; we explore neural architectures for CLTC; and we move from bi- to multi-lingual word embeddings. For all architectures, types of word embeddings and datasets, we notice a consistent gain trend in favor of multilingual joint training, especially for low-resourced languages.

Authors: Meryem M'hamdi, Robert West, Andreea Hossmann, Michael Baeriswyl, Claudiu Musat
Publication Date: 23.03.2019

### Interpretable Structure-aware Document Encoders with Hierarchical Attention

We propose a method to create document representations that reflect their internal structure. We modify Tree-LSTMs to hierarchically merge basic elements such as words and sentences into blocks of increasing complexity. Our Structure Tree-LSTM implements a hierarchical attention mechanism over individual components and combinations thereof. We thus emphasize the usefulness of Tree-LSTMs for texts larger than a sentence. We show that structure-aware encoders can be used to improve the performance of document classification. We demonstrate that our method is resilient to changes to the basic building blocks, as it performs well with both sentence and word embeddings. The Structure Tree-LSTM outperforms all the baselines on two datasets by leveraging structural clues. We show our model's interpretability by visualizing how our model distributes attention inside a document. On a third dataset from the medical domain, our model achieves competitive performance with the state of the art. This result shows the Structure Tree-LSTM can leverage dependency relations other than text structure, such as a set of reports on the same patient.

Authors: Khalil Mrini, Claudiu Musat, Michael Baeriswyl, Martin Jaggi
Publication Date: 26.02.2019

### Overcoming Multi-Model Forgetting

We identify a phenomenon, which we refer to as multi-model forgetting, that occurs when sequentially training multiple deep networks with partially-shared parameters; the performance of previously-trained models degrades as one optimizes a subsequent one, due to the overwriting of shared parameters. To overcome this, we introduce a statistically-justified weight plasticity loss that regularizes the learning of a model's shared parameters according to their importance for the previous models, and demonstrate its effectiveness when training two models sequentially and for neural architecture search. Adding weight plasticity in neural architecture search preserves the best models to the end of the search and yields improved results in both natural language processing and computer vision tasks.

Authors: Yassine Benyahia, Kaicheng Yu, Kamil Bennani-Smires, Martin Jaggi, Anthony Davison, Mathieu Salzmann, Claudiu Musat
Publication Date: 21.02.2019

### Evaluating the Search Phase of Neural Architecture Search

Neural Architecture Search (NAS) aims to facilitate the design of deep networks for new tasks. Existing techniques rely on two stages: searching over the architecture space and validating the best architecture. NAS algorithms are currently compared solely based on their results on the downstream task. While intuitive, this fails to explicitly evaluate the effectiveness of their search strategies. In this paper, we propose to evaluate the NAS search phase. To this end, we compare the quality of the solutions obtained by NAS search policies with that of random architecture selection. We find that: (i) On average, the state-of-the-art NAS algorithms perform similarly to the random policy; (ii) the widely-used weight sharing strategy degrades the ranking of the NAS candidates to the point of not reflecting their true performance, thus reducing the effectiveness of the search process. We believe that our evaluation framework will be key to designing NAS strategies that consistently discover architectures superior to random ones.

Authors: Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, Mathieu Salzmann
Publication Date: 21.02.2019

### Embedding Individual Table Columns for Resilient SQL Chatbots

Most of the world's data is stored in relational databases. Accessing these requires specialized knowledge of the Structured Query Language (SQL), putting them out of the reach of many people. A recent research thread in Natural Language Processing (NLP) aims to alleviate this problem by automatically translating natural language questions into SQL queries. While the proposed solutions are a great start, they lack robustness and do not easily generalize: the methods require high-quality descriptions of the database table columns, and the most widely used training dataset, WikiSQL, is heavily biased towards using those descriptions as part of the questions. In this work, we propose solutions to both problems: we entirely eliminate the need for column descriptions by relying solely on the column contents, and we augment the WikiSQL dataset by paraphrasing column names to reduce bias. We show that the accuracy of existing methods drops when trained on our augmented, column-agnostic dataset, and that our own method reaches state-of-the-art accuracy while relying on column contents only.

Authors: Bojan Petrovski, Ignacio Aguado, Andreea Hossmann, Michael Baeriswyl, Claudiu Musat
Publication Date: 01.11.2018

### Churn Intent Detection in Multilingual Chatbot Conversations and Social Media

We propose a new method to detect when users express the intent to leave a service, also known as churn. While previous work focuses solely on social media, we show that this intent can be detected in chatbot conversations. As companies increasingly rely on chatbots they need an overview of potentially churny users. To this end, we crowdsource and publish a dataset of churn intent expressions in chatbot interactions in German and English. We show that classifiers trained on social media data can detect the same intent in the context of chatbots. We introduce a classification architecture that outperforms existing work on churn intent detection in social media. Moreover, we show that, using bilingual word embeddings, a system trained on combined English and German data outperforms monolingual approaches. As the only existing dataset is in English, we crowdsource and publish a novel dataset of German tweets. We thus underline the universal aspect of the problem, as examples of churn intent in English help us identify churn in German tweets and chatbot conversations.

Authors: Christian Abbet, Meryem M'hamdi, Athanasios Giannakopoulos, Robert West, Andreea Hossmann, Michael Baeriswyl, Claudiu Musat
Publication Date: 25.08.2018

### DataBright: Towards a Global Exchange for Decentralized Data Ownership and Trusted Computation

It is safe to assume that, for the foreseeable future, machine learning, especially deep learning, will remain both data- and computation-hungry. In this paper, we ask: Can we build a global exchange where everyone can contribute computation and data to train the next generation of machine learning applications? We present an early but running prototype of DataBright, a system that turns the creation of training examples and the sharing of computation into an investment mechanism. Unlike most crowdsourcing platforms, where the contributor gets paid when they submit their data, DataBright pays dividends whenever a contributor's data or hardware is used by someone to train a machine learning model. The contributor becomes a shareholder in the dataset they created. To enable the measurement of usage, a computation platform that contributors can trust is also necessary. DataBright thus merges both a data market and a trusted computation market. We illustrate that trusted computation can enable the creation of an AI market, where each data point has an exact value that should be paid to its creator. DataBright allows data creators to retain ownership of their contribution and attaches to it a measurable value. The value of the data is given by its utility in subsequent distributed computation done on the DataBright computation market. The computation market allocates tasks and subsequent payments to pooled hardware. This leads to the creation of a decentralized AI cloud. Our experiments show that trusted hardware such as Intel SGX can be added to the usual ML pipeline with no additional costs. We use this setting to orchestrate distributed computation that enables the creation of a computation market. DataBright is available for download at .

Authors: David Dao, Andreea Hossmann, Dan Alistarh, Ce Zhang
Publication Date: 13.02.2018

### Diverse Beam Search for Increased Novelty in Abstractive Summarization

Text summarization condenses a text to a shorter version while retaining the important information. Abstractive summarization is a recent development that generates new phrases, rather than simply copying or rephrasing sentences within the original text. Recently, neural sequence-to-sequence models have achieved good results in the field of abstractive summarization, which opens new possibilities and applications for industrial purposes. However, most practitioners observe that these models still use large parts of the original text in the output summaries, making them often similar to extractive frameworks. To address this drawback, we first introduce a new metric to measure how much of a summary is extracted from the input text. Secondly, we present a novel method that relies on a diversity factor in computing the neural network loss to improve the diversity of the summaries generated by any neural abstractive model implementing beam search. Finally, we show that this method not only makes the system less extractive, but also improves the overall ROUGE score of state-of-the-art methods by at least 2 points.

Authors: André Cibils, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl
Publication Date: 05.02.2018
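
One simple way to quantify how much of a summary is extracted from the input text is the share of summary n-grams that also occur in the source. The sketch below is an illustrative stand-in, not the metric defined in the paper. Whitespace tokenization, lowercasing, the default `n=2`, and the function names are all assumptions.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extracted_fraction(source, summary, n=2):
    """Fraction of summary n-grams copied verbatim from the source text.

    1.0 means every n-gram of the summary appears in the source (fully
    extractive); 0.0 means none does.
    """
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:        # summary shorter than n tokens
        return 0.0
    copied = sum(1 for gram in summary_ngrams if gram in source_ngrams)
    return copied / len(summary_ngrams)


source = "the cat sat on the mat"
print(extracted_fraction(source, "the cat sat"))      # copied verbatim
print(extracted_fraction(source, "the cat slept"))    # partly novel
print(extracted_fraction(source, "a feline rested"))  # entirely novel
```

A score like this can be tracked alongside ROUGE to check that an increase in novelty does not come at the expense of summary quality.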

### Submodularity-Inspired Data Selection for Goal-Oriented Chatbot Training Based on Sentence Embeddings

Spoken language understanding (SLU) systems, such as goal-oriented chatbots or personal assistants, rely on an initial natural language understanding (NLU) module to determine the intent and to extract the relevant information from the user queries they take as input. SLU systems usually help users to solve problems in relatively narrow domains and require a large amount of in-domain training data. This leads to significant data availability issues that inhibit the development of successful systems. To alleviate this problem, we propose a technique of data selection in the low-data regime that enables us to train with fewer labeled sentences, and thus at lower labelling costs. We propose a submodularity-inspired data ranking function, the ratio-penalty marginal gain, for selecting data points to label based only on the information extracted from the textual embedding space. We show that the distances in the embedding space are a viable source of information that can be used for data selection. Our method outperforms two known active learning techniques and enables cost-efficient training of the NLU unit. Moreover, our proposed selection technique does not need the model to be retrained in between the selection steps, making it time-efficient as well.

Publication Date: 02.02.2018
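
The abstract above does not give the formula for the ratio-penalty marginal gain, so the sketch below deliberately swaps in a different, standard submodular objective: greedy maximization of facility-location coverage over sentence embeddings. It is meant only to show how embedding-space similarities alone can drive data selection without retraining a model between steps. The function names and the use of cosine similarity are assumptions for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_select(embeddings, budget):
    """Pick up to `budget` indices by greedy facility-location maximization.

    Objective (an assumption, not the paper's function):
    f(S) = sum_i max(0, max_{j in S} sim(i, j)).
    """
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)]
           for i in range(n)]
    best_cover = [0.0] * n  # best similarity of each point to the selected set
    selected = []
    for _ in range(min(budget, n)):
        def gain(j):
            # Marginal gain of adding j: total improvement in coverage.
            return sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))

        candidates = [j for j in range(n) if j not in selected]
        chosen = max(candidates, key=gain)
        selected.append(chosen)
        best_cover = [max(best_cover[i], sim[i][chosen]) for i in range(n)]
    return selected
```

With two tight clusters of embeddings and a budget of two, this greedy rule picks one point from each cluster, which is the kind of coverage a labelling budget benefits from.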

### Goal-Oriented Chatbot Dialog Management Bootstrapping with Transfer Learning

Goal-Oriented (GO) Dialogue Systems, colloquially known as goal-oriented chatbots, help users achieve a predefined goal (e.g. book a movie ticket) within a closed domain. A first step is to understand the user's goal by using natural language understanding techniques. Once the goal is known, the bot must manage a dialogue to achieve that goal, which is conducted with respect to a learnt policy. The success of the dialogue system depends on the quality of the policy, which is in turn reliant on the availability of high-quality training data for the policy learning method, for instance Deep Reinforcement Learning. Due to the domain specificity, the amount of available data is typically too low to allow the training of good dialogue policies. In this paper, we introduce a transfer learning method to mitigate the effects of low in-domain data availability. Compared to the model without transfer learning, our approach improves the bot's success rate by 20% in relative terms for distant domains and more than doubles it for close domains. Moreover, the transfer learning chatbots learn the policy 5 to 10 times faster. Finally, as the transfer learning approach is complementary to additional processing such as warm-starting, we show that their joint application gives the best outcomes.

Authors: Vladimir Ilievski, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl
Publication Date: 01.02.2018

### GitGraph - Architecture Search Space Creation through Frequent Computational Subgraph Mining

The dramatic success of deep neural networks across multiple application areas often relies on experts painstakingly designing a network architecture specific to each task. To simplify this process and make it more accessible, an emerging research effort seeks to automate the design of neural network architectures, using, e.g., evolutionary algorithms, reinforcement learning, or simple search in a constrained space of neural modules. Considering the typical size of the search space (e.g. $10^{10}$ candidates for a $10$-layer network) and the cost of evaluating a single candidate, current architecture search methods are very restricted. They either rely on static pre-built modules to be recombined for the task at hand, or they define a static hand-crafted framework within which they can generate new architectures from the simplest possible operations. In this paper, we relax these restrictions by capitalizing on the collective wisdom contained in the plethora of neural networks published in online code repositories. Concretely, we (a) extract and publish GitGraph, a corpus of neural architectures and their descriptions; (b) create problem-specific neural architecture search spaces, implemented as a textual search mechanism over GitGraph; and (c) propose a method of identifying unique common subgraphs within the architectures solving each problem (e.g., image processing, reinforcement learning), which can then serve as modules in the newly created problem-specific neural search space.

Authors: Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl
Publication Date: 16.01.2018

### Simple Unsupervised Keyphrase Extraction using Sentence Embeddings

Keyphrase extraction is the task of automatically selecting a small set of phrases that best describe a given free text document. Supervised keyphrase extraction requires large amounts of labelled training data and generalizes very poorly outside the domain of the training data. At the same time, unsupervised systems have poor accuracy and often do not generalize well, as they require the input document to belong to a larger corpus also given as input. Addressing these drawbacks, in this paper we tackle keyphrase extraction from single documents with EmbedRank, a novel unsupervised method that leverages sentence embeddings. EmbedRank achieves higher F-scores than graph-based state-of-the-art systems on standard datasets and is suitable for real-time processing of large amounts of Web data. With EmbedRank, we also explicitly increase coverage and diversity among the selected keyphrases by introducing an embedding-based maximal marginal relevance (MMR) for new phrases. A user study including over 200 votes showed that, although reducing the phrases' semantic overlap leads to no gains in F-score, our high-diversity selection is preferred by humans.

Authors: Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl, Martin Jaggi
Publication Date: 13.01.2018
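
The embedding-based maximal marginal relevance (MMR) step mentioned in the abstract above can be sketched in its textbook form: each pick trades off similarity to the document against similarity to phrases already selected. This is a minimal sketch, not EmbedRank's implementation. The function names, the plain (unnormalized) cosine similarities, and the default trade-off `lam` are assumptions for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, cand_vecs, k, lam=0.5):
    """Greedily pick up to k candidate indices by maximal marginal relevance.

    score(c) = lam * sim(c, doc) - (1 - lam) * max_{s in selected} sim(c, s)
    lam = 1.0 ranks purely by relevance; lower values favour diversity.
    """
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(cand_vecs[i], doc_vec)
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, with a document vector `[1.0, 0.0]` and candidates `[[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]`, pure relevance ranking (`lam=1.0`) picks the two near-identical candidates, while a diversity-leaning setting (`lam=0.3`) replaces the second pick with the less similar third candidate.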

### Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss German

Authors: Pierre-Edouard Honnet, Andrei Popescu-Belis, Claudiu Musat, Michael Baeriswyl
Publication Date: 30.10.2017

### Dataset Construction via Attention for Aspect Term Extraction with Distant Supervision

Aspect Term Extraction (ATE) detects opinionated aspect terms in sentences or text spans, with the end goal of performing aspect-based sentiment analysis. The small number of available datasets for supervised ATE and the fact that they cover only a few domains raise the need for exploiting other data sources in new and creative ways. Publicly available review corpora contain a plethora of opinionated aspect terms and cover a larger domain spectrum. In this paper, we first propose a method for using such review corpora to create a new dataset for ATE. Our method relies on an attention mechanism to select sentences that have a high likelihood of containing actual opinionated aspects. We thus improve the quality of the extracted aspects. We then use the constructed dataset to train a model and perform ATE with distant supervision. By evaluating on human-annotated datasets, we show that our method achieves significantly improved performance over various unsupervised and supervised baselines. Finally, we show that sentence selection matters when it comes to creating new datasets for ATE. Specifically, using a set of selected sentences leads to higher ATE performance than using the whole sentence set.

Authors: Athanasios Giannakopoulos, Diego Antognini, Andreea Hossmann, Michael Baeriswyl
Publication Date: 16.09.2017

### Unsupervised Aspect Term Extraction with B-LSTM & CRF using Automatically Labelled Datasets

Aspect Term Extraction (ATE) identifies opinionated aspect terms in texts and is one of the tasks in the SemEval Aspect Based Sentiment Analysis (ABSA) contest. The small number of available datasets for supervised ATE and the costly human annotation for aspect term labelling give rise to the need for unsupervised ATE. In this paper, we introduce an architecture that achieves top-ranking performance for supervised ATE. Moreover, it can be used efficiently as a feature extractor and classifier for unsupervised ATE. Our second contribution is a method to automatically construct datasets for ATE. We train a classifier on our automatically labelled datasets and evaluate it on the human-annotated SemEval ABSA test sets. Compared to a strong rule-based baseline, we obtain a dramatically higher F-score and attain precision values above 80%. Our unsupervised method beats the supervised ABSA baseline from SemEval, while preserving high precision scores.

Authors: Athanasios Giannakopoulos, Andreea Hossmann, Michael Baeriswyl
Publication Date: 15.09.2017

### Contact

Swisscom Digital Lab
EPFL Innovation Park, Building F, 3rd floor
CH-1015 Lausanne