Using Clustering for Categorization of Support Tickets

Daniel Beneker and Carsten Gips

FH Bielefeld - Campus Minden, Artilleriestraße 9, 32427 Minden, Germany
http://www.fh-bielefeld.de/

Abstract. Support tickets from customers contain much hidden information. Unsupervised machine learning methods are able to discover this hidden information. In this paper we propose the categorization of support tickets using clustering methods in combination with topic models. Furthermore, label generation techniques are used to generate meaningful names for these categories. The results are compared with related research.

Keywords: clustering, support tickets, categorization, topic model, NMF, k-means, information retrieval

Copyright © 2017 by the paper's authors. Copying permitted only for private and academic purposes. In: M. Leyer (Ed.): Proceedings of the LWDA 2017 Workshops: KDML, FGWM, IR, and FGDB. Rostock, Germany, 11.-13. September 2017, published at http://ceur-ws.org

1 Introduction

The volume of digital data generated annually has grown massively in recent years, but only a small fraction of this data is used for analysis. Manual analysis of these data sets is not only difficult but also cost-intensive. It is therefore important to organize and categorize them appropriately. These tasks can be successfully automated with clustering algorithms.

At Mittwald CM Service GmbH & Co. KG a large part of the communication with customers takes place via the company's ticket system. A customer makes a new request through the ticket system, a so-called support ticket. This request is handled by employees, so that usually a conversation sequence consisting of several internal and external tickets is created. External tickets are tickets from the customer as well as responses of the employees to the customer. Internal tickets contain additional information and are visible only to employees. Since these tickets are currently not analyzed, it is difficult to evaluate for which topics the company receives the most requests.

This paper presents an approach to cluster support tickets into categories. The algorithms k-means and NMF are used for this purpose; both belong to the field of unsupervised learning and do not require labelled training data. In a second step, label generation techniques are used to generate meaningful names for the computed categories. These meaningful names are important because, without them, it is not possible to evaluate which topics cost how much time or money. With the help of the named categories, it is also possible that an employee who is an expert in a particular topic processes the corresponding tickets with priority.

In Section 2 we briefly discuss different approaches for clustering documents. Section 3 presents our approach to cluster the support tickets and to extract meaningful names for the found clusters. The results are evaluated in Section 4.

2 Related Work

There are many different approaches to categorizing texts and documents. In [1] Blei uses the LDA (Latent Dirichlet Allocation) algorithm to categorize 17,000 articles from the journal Science into 100 categories. The found categories provided a useful and general overview of the individual subjects. Like the NMF algorithm used in this work, LDA belongs to the family of topic models. An extension to LDA is described in [9]. The hierarchical variant presented there was used to categorize news texts around the disappearance of Malaysia Airlines flight MH-370.
As resulting categories, the individual theories about the status of the plane, such as plane crash or terrorism, were obtained.

In [4], Kuang et al. show how NMF (Nonnegative Matrix Factorization) can be used to cluster documents. They test NMF on five different data sets, including the well-known 20Newsgroups and Reuters collections, and compare the results to a baseline k-means clustering. In four out of five cases, NMF yielded better results. In [8] Shahnaz et al. note that the accuracy of the clustering strongly depends on the number of clusters and on the dataset. With only 2 categories of the Reuters dataset, an accuracy of 99% was achieved; with 20 categories it drops to 54%. On another dataset, however, they achieved over 80% accuracy with 20 categories.

Other approaches attempt to generate training data in a first step using cluster algorithms and keyword lists, and to train a supervised classifier such as SVM or Naive Bayes in a second step. The classifier is then used to categorize documents. These approaches thus combine unsupervised learning methods with methods of supervised learning. In [2] Ko and Seo split the documents into sentences and categorize them by a list of keywords. Those sentences are used to train a Naive Bayes classifier. The results did not differ significantly from those of pure supervised learning, but the approach is helpful when training data must be generated. An approach that uses automatically created keywords to create training data is described in [3]. The authors evaluate their approach on the 20Newsgroups and Reuters datasets and achieved the best results using only a small number of keywords. Furthermore, they obtained a small improvement when humans filter those automatically created keywords.

The algorithm DCF (Description Comes First) is described in [10]. This approach differs from the others: first, label candidates are generated for each text; then the texts are categorized (for example with k-means); in the final step, the centroids are used to select one of the label candidates.

3 Approach

In order to categorize support tickets, the tickets are preprocessed and converted into feature vectors, and cluster algorithms are applied for categorization. In the last step, meaningful names are assigned to the categories (see Figure 1). After the initial clustering, the trained model can be saved in order to keep the categories fixed and to predict the category of new tickets later.

Fig. 1: Approach overview

3.1 Preprocessing

This section describes how the support tickets are preprocessed so that they can be used by algorithms such as k-means and NMF.

Normalization. A support ticket consists of two parts, the subject and the message. For further processing, subject and message are concatenated and converted to lower case. Support tickets can contain many elements that can adversely affect the quality of clustering. After loading the tickets, elements like URLs, e-mail addresses, and account or database names are removed using regular expressions. Single numbers are also removed, as we want strings as categories.

Stopword removal. Stopwords carry little or no information and can therefore be removed. We used the stopword list from the NLTK framework (http://www.nltk.org/), extended by a list of words which could have a negative impact on the automatic naming of the clusters, such as salutation and greeting formulas. Especially short tickets could otherwise be assigned to a common cluster called freundliche Grüße ("kind regards").
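As an illustration, the normalization and stopword steps might look as follows in Python. This is a minimal sketch: the regular expressions are simplified stand-ins (the actual patterns for account and database names are company-specific and not given in the paper), and the extra stopword list is only indicative.

    import re

    from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

    # Illustrative extra stopwords; the real list (greetings, salutations)
    # is company-specific and not given in the paper.
    EXTRA_STOPWORDS = {"hallo", "danke", "freundliche", "gruesse"}
    GERMAN_STOPWORDS = set(stopwords.words("german")) | EXTRA_STOPWORDS

    URL_RE = re.compile(r"https?://\S+")    # URLs
    MAIL_RE = re.compile(r"\S+@\S+")        # e-mail addresses
    NUM_RE = re.compile(r"\b\d+\b")         # single numbers

    def normalize(subject: str, message: str) -> str:
        """Concatenate subject and message, lowercase, strip noisy elements."""
        text = (subject + " " + message).lower()
        for pattern in (URL_RE, MAIL_RE, NUM_RE):
            text = pattern.sub(" ", text)
        return text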
Text tokenization. To split the normalized texts into token lists we used a simple tokenizer with the regular expression /\w\w\w+/u, which captures all words of length 3 and longer. Other characters such as punctuation marks are discarded, since the cluster labels should not contain punctuation.

Generation of feature vectors. To cluster texts algorithmically, they must first be converted into numbers. One way is to use term frequency-inverse document frequency (tf-idf) vectors. These feature vectors are based on two assumptions: first, words that occur more often in a document are more important, i.e., they describe the document better; and second, words that are common in all documents are unimportant. The tf-idf value of a word t in a document d is calculated as follows:

    \mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t    (1)

tf_{t,d} specifies the number of occurrences of the word t in the document d. This value is weighted with the term idf_t, which is calculated as follows:

    \mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t} + 1\right)    (2)

N is the number of all documents. df_t stands for document frequency and specifies in how many documents the word t occurs. For rarely used words the idf_t value is large; for common words it is small. After this step, each document is represented by a feature vector with the length of the vocabulary. At this point, if desired, all words above and below a certain occurrence frequency can be excluded in order to reduce the length of the feature vectors. This has the advantage that, for example, rarely occurring words with spelling mistakes are filtered out.

3.2 Algorithms

We use k-means because it is a simple baseline algorithm, and NMF because it belongs to the topic models and provides descriptive results; this allows us to compare a standard cluster algorithm with a topic model.

k-means. K-means is a simple algorithm to partition data into clusters. It can be divided into three steps. In the first step, the cluster centroids {c_1, ..., c_k} are initialized randomly. In the second step, each data point x is assigned to the cluster with the smallest distance. In the third step, the cluster centroids are recalculated; each centroid is computed as the average of all data points assigned to it. Steps 2 and 3 are repeated until the cluster centroids no longer change.

Non-negative Matrix Factorization (NMF). Non-negative matrix factorization is a method in which a matrix is represented by the product of two smaller matrices. This method can also be used to cluster documents [8]. Given a matrix V, it can be decomposed with NMF into the matrices W and H such that W · H ≈ V. If V is of size m × n, then W has the size m × k and H the size k × n. The goal is that the product of W and H should match the original matrix V as closely as possible. From this the following minimization problem can be derived [8]:

    \min_{W,H} \|V - WH\|_F^2    (3)

This minimization problem can be solved by a projected gradient descent method as described in [6]. \|\cdot\|_F stands for the Frobenius norm. It measures the size of a matrix and is defined as:

    \|K\|_F = \sqrt{\sum_i \sum_j |k_{ij}|^2}    (4)

Figure 2 shows what information the matrices V, W, and H contain. The matrix V contains all the documents to be used for clustering. A document is always represented by a numeric vector of length n, where n corresponds to the size of the vocabulary over all documents in V; as a rule, vectors with weighted word frequencies, such as tf-idf, are used. m stands for the number of documents. An entry V_{mn} of V thus contains a value that reflects the importance of the word n in the document m (if tf-idf is used). k in the smaller matrices W and H stands for the number of clusters. As with k-means, this value must be set by the user beforehand. After decomposing V into the matrices W and H, W contains the assignments of the individual documents to the clusters: the largest entry in each row determines the cluster membership. H contains a probability distribution of words over clusters. The larger the entry H_{ig} in H, the more likely it is that the word g occurs in the documents of cluster i. The matrix H can therefore be used to generate names or descriptions for the respective clusters [8].

Fig. 2: Graphical representation of NMF
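For concreteness, here is a minimal sketch of this pipeline using scikit-learn as one possible implementation; the paper does not prescribe a library. The names reuse the preprocessing sketch above, load_tickets is a hypothetical loader, scikit-learn's tf-idf smoothing differs slightly from Equation (2), and its NMF minimizes the Frobenius objective (3) by coordinate descent rather than by the projected gradient method of [6].

    from sklearn.cluster import KMeans
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    tickets = load_tickets()  # hypothetical: list of (subject, message) pairs
    docs = [normalize(subject, message) for subject, message in tickets]

    # token_pattern mirrors the tokenizer above (words with >= 3 characters);
    # min_df/max_df realize the optional occurrence-frequency cut-offs.
    vectorizer = TfidfVectorizer(token_pattern=r"\w\w\w+",
                                 stop_words=sorted(GERMAN_STOPWORDS),
                                 min_df=2, max_df=0.5)
    V = vectorizer.fit_transform(docs)   # m documents x n vocabulary terms

    k = 30                               # number of clusters, user-defined

    km = KMeans(n_clusters=k, random_state=0).fit(V)
    kmeans_labels = km.labels_           # hard cluster assignment per ticket

    nmf = NMF(n_components=k, random_state=0)
    W = nmf.fit_transform(V)             # m x k: documents over topics
    H = nmf.components_                  # k x n: topics over words
    nmf_labels = W.argmax(axis=1)        # largest row entry -> hard cluster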
NMF belongs to the topic models rather than to classical cluster algorithms such as k-means. Topic models, similar to cluster algorithms, can find patterns in documents. However, they do not split the documents into disjoint clusters, but into so-called topics, that is, clusters that may overlap. For example, a document can belong 90% to topic 1, 8% to topic 3, and 2% to topic 2. A subdivision into disjoint clusters can nevertheless be obtained with a topic model by assigning each document only to the topic with the highest value [1].

3.3 Label generation

Without descriptive cluster names, it is difficult for humans to gather information from the grouped documents. One of the most difficult tasks in document clustering is therefore to generate a description which is understandable to humans. For algorithms from the field of supervised learning this is easily possible via the training data; during clustering, however, no such data are present.

The naming of clusters is often referred to as cluster labeling in the literature and can be done in many different ways. The approaches can be divided into manual, automatic, and semi-automatic ones. Manual naming is done by a human who creates a name or description after visual inspection of some documents of a cluster. Automatic methods attempt to generate descriptions from the most important features within a cluster [5, 10]. Semi-automatic approaches combine these two possibilities, for example by letting a human adjust the automatic descriptions. Another possibility is that a human describes categories with lists of keywords; subsequently, the documents are automatically sorted into these categories [2, 3]. Other approaches use additional external information, such as the WordNet database, which provides lexical-semantic relationships between words [11].

The solution implemented within the scope of this work consists of an automatic generation of labels and offers the option to adjust the labels manually afterwards. Figure 3 introduces the individual steps, which are described in more detail below.

Fig. 3: Cluster labeling steps

Extraction of label candidates. The first step of cluster labeling is to extract label candidates. For each cluster, all words in the vocabulary are sorted according to their relevance to the cluster. When using NMF to cluster the documents, the matrix H already contains all the information needed: for each word and cluster, the entry H_{ig} in H contains a value that reflects the relevance of the word g in cluster i (see Section 3.2). It is therefore sufficient to sort the matrix H row by row; the larger a value, the more important the corresponding word. For k-means as the clustering algorithm, this information can be obtained via the cluster centroids. The cluster centroids are located in the feature space and therefore have the same dimension as the tf-idf feature vectors. In order to obtain a list of words sorted by relevance for k-means, the values of the cluster centroids must be sorted. Figure 4 shows the matrix H with three different feature vectors, each containing the two features dog and cat; for example, to obtain all label candidates for cluster 1, the values of the corresponding row are sorted in descending order. On the right, the figure shows the feature space with the found cluster centroids. A row of the matrix H can be compared to a cluster centroid: sorting the values of a cluster centroid thus corresponds to sorting a row of the matrix H.

Fig. 4: Graphical representation of the feature vectors / cluster centroids: (a) NMF, (b) k-means
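Continuing the scikit-learn sketch from Section 3.2, the extraction of label candidates reduces in both cases to sorting the rows of a matrix; get_feature_names_out assumes scikit-learn 1.0 or later.

    import numpy as np

    feature_names = np.asarray(vectorizer.get_feature_names_out())

    def top_words(weights, top_n=10):
        """Top_n vocabulary words for one weight row, sorted descending."""
        return feature_names[np.argsort(weights)[::-1][:top_n]].tolist()

    nmf_candidates = [top_words(row) for row in H]               # sort H row-wise
    kmeans_candidates = [top_words(c) for c in km.cluster_centers_]  # sort centroids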
POS tagging. Through POS tagging (part-of-speech tagging) it is possible to filter automatically for certain types of words. When naming clusters this is very handy, as it can effectively filter out unwanted words. For this work the Stanford POS tagger (https://nlp.stanford.edu/software/tagger.shtml) was used, which in addition to English also supports German by using the Stuttgart-Tübingen Tag Set (STTS) [7]. The STTS contains over 50 different tags and can, for example, distinguish between different types of conjunctions. The following tags were used:

NN  common noun, such as table, apple, travel
NE  proper noun, such as Smith, Hamburg
FM  foreign language material

Support tickets contain many technical terms, often taken from English, such as domain record or virtual host. In order not to exclude such terms as cluster descriptions, words tagged as foreign-language material were retained in addition to common and proper nouns. All words that do not belong to any of these tags are excluded. For example, POS tagging would remove the words erreichbar ("reachable") and seit ("since") from the list server, erreichbar, fehler, managed, seite, problem, ftp, seit, ssh, datenbank.

Stemming. We used the Snowball stemmer (http://snowball.tartarus.org/) for word stem reduction. For words whose word stem occurs more than once, only the original form of the first occurrence is retained. The list domain, domains, relocation thus becomes domain, relocation after stemming.

Manual adjustment. Manual adjustment is the last step in cluster naming. After the word lists have been filtered with the aid of POS tagging and stemming, the most meaningful word, as well as a list of the n most meaningful words, is stored in a file for each cluster. These entries can be customized by the user.
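A sketch of the POS filtering and stemming steps, using NLTK's wrapper for the Stanford tagger and NLTK's Snowball stemmer; the model and jar paths are placeholders for a local installation of the tagger.

    from nltk.stem.snowball import SnowballStemmer
    from nltk.tag.stanford import StanfordPOSTagger  # requires Java + tagger jars

    # Placeholder paths; the German model uses the STTS tag set.
    tagger = StanfordPOSTagger("models/german-fast.tagger",
                               "stanford-postagger.jar")
    stemmer = SnowballStemmer("german")
    KEEP_TAGS = {"NN", "NE", "FM"}  # common nouns, proper nouns, foreign material

    def filter_candidates(words):
        """Keep nouns and foreign words, then drop words with a repeated stem."""
        kept = [w for w, tag in tagger.tag(words) if tag in KEEP_TAGS]
        seen, result = set(), []
        for word in kept:
            stem = stemmer.stem(word)
            if stem not in seen:          # keep the first occurrence per stem
                seen.add(stem)
                result.append(word)
        return result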
4 Evaluation

This section presents the results of the categorization using k-means and NMF. The data set contains only customer requests, i.e., only the first message written by the customer is included; tickets from employees, as well as automatically generated tickets, are not. This data set was selected because the categorization of the customer requests is the most useful: for example, if new customer requests are categorized directly, they can be processed by the employees who have the most expertise in the respective category. Including all tickets of a conversation would be counterproductive at this point, since only the first ticket of the customer is available at the time a new ticket is categorized. The data set contains the latest 50,000 tickets with an average length of 680 characters; the tickets thus cover a period of approximately one year.

Figure 5 shows the results of the categorization with k-means and NMF into 30 categories. The categorization was also executed with 50 and 70 categories; due to their size, those results are not included. Three different aspects are analyzed for the evaluation: the intersection of categories between k-means and NMF, the distribution of tickets per category, and the development of a category when the number of clusters changes.

Fig. 5: Number of tickets per category - k-means vs. NMF (30 categories)

4.1 Intersection of categories between k-means and NMF

Both algorithms, k-means and NMF, find similar categories, and there is a large intersection between them. In the categorization with 30 categories, 21 categories are found in the results of both algorithms. With 50 categories, 33 are equivalent; with 70 categories, there are still 46 equivalent categories. When increasing the number of categories from 30 to 50, the percentage of identical categories decreases; when increasing from 50 to 70 categories, there is no longer a large percentage change. With 30 categories the share is 70%, with 50 categories 66%, and with 70 categories 65.7%.

For the categorization of the tickets into 30 categories, the intersection of tickets was examined: for all categories found by both k-means and NMF, it was checked how many tickets are included in both categories (see Figure 5). Categories that are duplicated under the same or a similar name were merged for the calculation of the intersection. For some categories, such as ssl, the intersection is relatively high: only about 100 of the over 2,000 tickets which fall into the category ssl with k-means are not in the category ssl with NMF. In some categories, however, the number of included tickets varies greatly. In the category password, NMF has more than twice as many tickets as k-means; however, the NMF category contains almost all the tickets which also fall into this category with k-means. With k-means, some of the small categories are even smaller than with NMF.

4.2 Distribution of tickets per category

To better evaluate the distribution of tickets per category, the number of tickets per category is shown in Figure 5. The diagram shows the distribution for the categorization into 30 categories. It is striking that k-means produces one category with a very large number of tickets: with 30 categories, the largest category contains almost 16,000 tickets, whereas with NMF the largest category contains only about 5,000 tickets, roughly a third of that. A similar pattern can be observed for the categorization with 50 and 70 categories. With 70 categories, the largest k-means category, with nearly 10,000 tickets, contains a fifth of the complete data set; for NMF, the largest category contains only about 1,500 tickets. The difference between k-means and NMF in the largest categories thus increases with the number of clusters. It is also interesting that the name of the largest k-means category varies depending on the number of clusters; the names of the largest categories are fehler ("error"), seite ("page"), and account. NMF tends to find categories of similar size: there are neither very large nor very small categories. In order to better evaluate whether the large k-means categories reflect the actual structure of the tickets, the large category fehler was analyzed further: all tickets assigned by k-means to the category fehler were categorized again with NMF. Only 7% of these tickets were assigned by NMF to the same category fehler. The largest block, with approximately 27% of the tickets, falls into the NMF category seite; the remaining tickets are split relatively evenly over 26 other categories. One possible explanation for the large k-means categories is that outliers, i.e. tickets located at the edges of the clusters, are grouped together.
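Both analyses, the category sizes and the pairwise ticket intersections, can be recomputed from the hard assignments of the sketch in Section 3.2. The following lines are an illustration, not the authors' evaluation code.

    import numpy as np

    # kmeans_labels / nmf_labels: hard assignments from the Section 3.2 sketch
    sizes_km = np.bincount(kmeans_labels, minlength=k)   # tickets per k-means category
    sizes_nmf = np.bincount(nmf_labels, minlength=k)     # tickets per NMF category

    def intersection(cluster_km: int, cluster_nmf: int) -> int:
        """Tickets assigned to cluster_km by k-means AND to cluster_nmf by NMF."""
        return int(np.sum((kmeans_labels == cluster_km)
                          & (nmf_labels == cluster_nmf)))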
4.3 Development of a category when the number of clusters changes

So far, differences between the two algorithms k-means and NMF have been analyzed. In this section we examine the development of a category for different numbers of clusters, using the same algorithm. The categories domain and domains were selected as examples. The term domain is very general in itself and appears in many other areas such as e-mail, SSL certificates, or DNS; for this reason the category domain is particularly interesting. For this evaluation, all 50,000 tickets of the data set were divided into 30, 50, and 70 categories with NMF. With 30 categories, a total of 4,020 tickets fall into the category domain(s). With 50 categories, 78% of these 4,020 tickets are again in the domain(s) category; the remaining tickets are divided among 22 other categories. With 70 categories, only 38% of the tickets fall into the category domain(s), and the tickets are now divided among 39 categories overall. Except for the category ftp, all categories found with 50 clusters were also found with 70 clusters. This means that larger amounts of tickets do not suddenly move to a completely different category, which would be an argument against the quality of the categorization.

5 Advantage of this solution

Compared to a keyword-driven approach, categorization with clustering algorithms has a great advantage: tickets which do not contain the name of a category can still be assigned to that category. Tickets of this kind are difficult to categorize with simple procedures such as a keyword search, where keywords from predefined lists for a category need to be present in a text in order to label it. Ko and Seo used a keyword-driven approach in [3], but would also be able to assign a ticket to a category even though the name of the category is not present in the ticket, because they only used the keywords to create training data. The difference is that they first had to create categories and keywords manually, which is not necessary in our approach.

Listing 1.1 shows a ticket which has been assigned by NMF to the category typo (as a result of the categorization into 30 categories). typo stands for the content management system TYPO3, which is used by many Mittwald customers.

Listing 1.1: Example ticket - category typo
1 Pfad Sendmail
2 Hallo! Wie lautet der Pfad zu Sendmail?
3 Beste Grüße xxxxxx xxx

The ticket itself contains neither the word typo nor any obvious keyword that suggests a relation to TYPO3. However, Sendmail is a software to send e-mails, which is often used in TYPO3 for this purpose; the path to Sendmail must be entered in TYPO3. Our approach still allows this ticket to be assigned automatically to a meaningful category with the help of algorithms like k-means or NMF. Other examples are shown in Listings 1.2 and 1.3, where a mail-related and an SSL-related ticket were correctly recognized and assigned to the categories mail and ssl, respectively.

Listing 1.2: Example ticket - category mail
1 Required TLS im Required Mode
2 Sehr geehrtes Mittwald Support Team
3 ein Kunde wollte wissen, ob die Mittwald Postfächer als Verschlüsselungsart
4 "TLS im Required Mode" unterstützen. Ich weiss nicht ob es die Starttls
5 Methode ist.
6 Vielen Dank und Gruss xxxxxx xxxxxxx

Listing 1.3: Example ticket - category ssl
1 Let's encrypt
2 Wie kann ich bei Mittwald eine Let's encrypt Verschlüsselung nutzen?

The ticket in Listing 1.4 has been assigned to the category php; it asks for changes to the PHP configuration values max_execution_time and max_input_vars without mentioning PHP itself. In this case a keyword-driven approach would have failed.

Listing 1.4: Example ticket - category php
1 Execution Time
2 Hallo! Bitte ändern:
3 max_execution_time: 30 Recommended min value 120
4 max_input_vars: 1000 Recommended min value 3000
5 Danke lg xxxxxxx

Assuming manual categorization takes about 4 seconds per ticket and there are about 120 customer requests per day, 8 minutes of time can be saved per day. The one-time automatic categorization of the existing tickets saves approximately 110 hours. It should also be noted that the categories found in this domain are quite general and thus give a good overview of the topics that customers deal with.
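To make the assignment of a new, unseen ticket concrete, the earlier scikit-learn sketches can be combined into a small prediction function; cluster_names stands for the label file produced in Section 3.3 and is hypothetical.

    def categorize(subject: str, message: str) -> str:
        """Assign a new ticket to one of the previously learned NMF categories."""
        v = vectorizer.transform([normalize(subject, message)])  # fitted tf-idf
        topic = nmf.transform(v).argmax(axis=1)[0]               # strongest topic
        return cluster_names[topic]

    categorize("Pfad Sendmail", "Hallo! Wie lautet der Pfad zu Sendmail?")  # e.g. "typo"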
6 Conclusion and Future Work

The goal of this work was the unsupervised categorization of support tickets. According to current research, there is no definitive solution for this; especially the automatic naming of the computed categories is not easy to solve. The result of a categorization depends strongly on the data set and on the cluster algorithm, and the quality of a categorization is difficult to evaluate. This work has shown that a fully automatic categorization, including the naming of the categories, is feasible and that the found categories reflect meaningful subjects.

To improve the quality of the found categories we propose hierarchical clustering of the support tickets. Such a hierarchy could contain more information than a flat clustering: each category can be divided into further (sub-)categories, providing a more detailed view of the data.

Acknowledgments

We would like to thank Mittwald CM Service GmbH & Co. KG for the cooperation and the provision of the data set.

References

1. Blei, D.: Probabilistic Topic Models. Communications of the ACM 55(4), 77–84 (2012)
2. Ko, Y., Seo, J.: Automatic Text Categorization by Unsupervised Learning. In: Proceedings of the 18th Conference on Computational Linguistics, vol. 1, pp. 453–459. Association for Computational Linguistics (2000)
3. Ko, Y., Seo, J.: Learning with Unlabeled Data for Text Categorization Using a Bootstrapping and a Feature Projection Technique. In: Proc. 42nd Meeting of the Association for Computational Linguistics (ACL'04), pp. 255–262. Association for Computational Linguistics (2004)
4. Kuang, D., Choo, J., Park, H.: Nonnegative Matrix Factorization for Interactive Topic Modeling and Document Clustering. In: Partitional Clustering Algorithms, pp. 215–243. Springer International Publishing (2015)
5. Li, X., Chen, J., Zaiane, O.: Text Document Topical Recursive Clustering and Automatic Labeling of a Hierarchy of Document Clusters. In: Advances in Knowledge Discovery and Data Mining, 17th Pacific-Asia Conference, PAKDD, pp. 197–208 (2013)
6. Lin, C.J.: Projected Gradient Methods for Nonnegative Matrix Factorization. Neural Computation 19(10), 2756–2779 (Oct 2007)
7. Schiller, A., Teufel, S., Stöckert, C.: Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset) (1999), http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf
8. Shahnaz, F., Berry, M.W., Pauca, V.P., Plemmons, R.J.: Document Clustering Using Nonnegative Matrix Factorization. Information Processing & Management 42(2), 373–386 (Mar 2006)
9. Smith, A., Hawes, T., Myers, M.: Hiérarchie: Interactive Visualization for Hierarchical Topic Models. In: Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pp. 71–78. Association for Computational Linguistics (2014)
10. Stefanowski, J., Weiss, D.: Comprehensible and Accurate Cluster Labels in Text Clustering. In: Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pp. 198–209 (2007)
11. Tseng, Y.H.: Generic Title Labeling for Clustered Documents. Expert Systems with Applications 37(3), 2247–2254 (Mar 2010)