The term "data-driven" has been on everyone's lips for quite some time now, and many companies have made it their goal to become data-driven. The idea is that a company's processes, decisions and activities are based exclusively on data and not on mere intuition or personal experience.

Many machine learning approaches are already in use, be it for customer support, fraud detection, proactive fault recognition and diagnosis, customer segmentation or churn prediction, to name just a few. The focus is usually on the efficiency gained by optimizing or automating processes, on improving the customer experience, or on the strategic, tactical and operational steering of the business.

The basic prerequisite for all these approaches is, of course, data, whose volume and availability have grown exponentially in recent years. From this data, a machine learning algorithm "learns" to "understand" and classify customer requests, to identify a potentially fraudulent transaction or a disruption, to segment customers based on their purchasing behavior or service usage, or to anticipate a customer's switch to a competitor.

While predicting whether a transaction is fraudulent leaves relatively little room for interpretation and can typically be answered with a simple "yes" or "no", the situation in other areas is less trivial. Categorizing customer inquiries and assigning them to the most suitable agent is usually performed on the basis of a human-created classification scheme that has evolved and been extended over the years and now contains redundancies, overlaps and obsolete categories. Since the classification and triage process is also human-driven, there is further room for interpretation as to which category is the "right" one for an individual inquiry.

Every month, Swisscom customer care receives more than one million calls and more than 160,000 written inquiries by email. This generates a large number of examples in which the categorization differs not because of the input data, but because of the prior knowledge, intuition or differing interpretation of the individual who categorizes the customer request.

The algorithm learns from these partially contradictory examples and thus produces a result that is bound to fall short of what is possible, since it represents a cross-section of these differing categorization examples. At best, it is on a par with its human counterpart, although clearly more efficient. As the number of these inquiries is continuously increasing, efficient handling of customer inquiries requires addressing the root cause, which lies in the poor class structure. In other words, the human-driven process is to be transformed into a fully data-driven approach.

This master thesis presents a novel approach that starts with the email data itself, leverages the existing structure and proposes a purely data-driven categorization scheme for customer care emails that can be interactively explored, refined and optimized. To obtain such a structure, cluster analysis (clustering), i.e. the process of grouping a set of elements based on their similarity, is used. Since clustering groups the individual documents according to the similarity of their content, parts that do not contribute to understanding the content have to be removed, such as metadata (email headers, timestamps), greetings, farewells and signatures. To transform the emails into a format that a clustering algorithm can process, two different procedures were evaluated. The first is an established method, TF-IDF, which weights how often a word occurs in an email (term frequency, TF) against how rarely it occurs across all emails (inverse document frequency, IDF). The second is a state-of-the-art embedding approach based on fastText. In contrast to TF-IDF, the contextual relationship between individual words is not lost here: each word is mapped to its own vector representation, which is learned from the contexts in which it occurs.
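
The TF-IDF weighting described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the whitespace tokenization, the unsmoothed formula tf * log(N / df), the function name tf_idf and the three sample emails are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (plain, unsmoothed TF-IDF)."""
    # Assumption: lowercasing and whitespace splitting stand in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many emails does each term occur at least once?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency (share of the email) times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already cleaned email bodies.
emails = [
    "invoice amount is wrong",
    "router is offline",
    "invoice was sent twice",
]
weights = tf_idf(emails)
```

In this toy example, "wrong" occurs in only one email and therefore receives a higher weight in the first email than "invoice", which occurs in two. A term that occurs in every email receives a weight of zero under this unsmoothed variant.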

From all possible vectorization and weighting parameters, the vector dimension, the number of clusters and further parameters, an optimal combination is chosen experimentally by repeatedly re-running the pre-processing steps and the clustering itself and comparing the resulting metrics. In the end, each email is assigned to one of the clusters, but this result is not yet really tangible or usable for the user. A modern web application makes it so by allowing the user to explore, refine and optimize the clustering interactively.
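
The experimental selection of a parameter combination can be sketched for a single parameter, the number of clusters. The following is an illustration under stated assumptions only: the text does not name the clustering algorithm or the metrics, so k-means and the mean silhouette coefficient are stand-ins, the deterministic farthest-point initialization is chosen purely for reproducibility, and the two-dimensional points replace real email vectors.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_init(points, k):
    """Deterministic start: repeatedly pick the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:  # convention: singletons score 0
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # cohesion
        b = min(  # separation: mean distance to the nearest other cluster
            sum(dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items() if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical stand-in for email vectors: three well separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]

# Re-run the clustering for each candidate k and compare the metric afterwards.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

For this toy data, the comparison selects three clusters. In a realistic setting the same loop would also range over the vectorization parameters and the vector dimension, and more than one metric would typically be compared.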


To give the business user an overview of the clustering result, it is visualized: each email is plotted in two dimensions and colored according to its cluster. Emails with similar content can thus easily be identified visually. The user can browse, search and view the individual emails, and each cluster is enriched with key phrases that summarize its content.

The quality metrics determined during the pre-processing are presented in a comprehensible way to point the user to the areas of the clustering with the greatest optimization potential, for example where the previous categories overlap strongly.

Where the clustering does not yet meet the business requirements, the user can "guide" the clustering algorithm by defining constraints. Multiple emails can be defined as belonging together (must-link constraint) and will then be placed in a common cluster after a re-clustering run. Conversely, it can be defined that two emails must not be assigned to the same cluster (cannot-link constraint).
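
How such pairwise constraints can be represented and checked is sketched below. This is only the bookkeeping around the constraints, not the constrained clustering algorithm itself, which is not specified here. The function names, the email indices and the label list are hypothetical.

```python
def merge_must_links(n_items, must_link):
    """Group item indices that are connected by must-link pairs (transitive closure)."""
    parent = list(range(n_items))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def violated_constraints(labels, must_link, cannot_link):
    """Return the constraints that a given cluster assignment does not satisfy."""
    broken = [("must-link", a, b) for a, b in must_link if labels[a] != labels[b]]
    broken += [("cannot-link", a, b) for a, b in cannot_link if labels[a] == labels[b]]
    return broken


# Hypothetical user input: emails 0, 1 and 2 belong together, 0 and 4 must be separated.
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 4)]
labels = [0, 0, 1, 1, 0]  # assumed result of an unconstrained clustering run

groups = merge_must_links(5, must_link)
broken = violated_constraints(labels, must_link, cannot_link)
```

Here the must-link pairs merge emails 0, 1 and 2 into one group, and the assumed assignment violates one constraint of each kind. A re-clustering run would have to resolve both.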

The outcome of the approach proposed in this thesis is a purely data-driven structure, carefully adjusted and fine-tuned to the needs of the business.

For the business user, this provides a clear structure that is less open to interpretation, more comprehensible and that can "grow" with the data and adapt to changes in it.

New topics, products or problems that arise in customer inquiries will lead to new categories, while topics that no longer occur in the emails will also disappear from the classification scheme.

Machine learning applications will benefit from the clearer, less overlapping class structure and the improved performance associated with it.