Clustering of Lithuanian News Articles

Vilius Pranckaitis, Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania
e-mail: vilius.pranckaitis@ktu.edu, mantas.lukosevicius@ktu.lt

Abstract—There is arguably more research on clustering English texts than texts of any other language. In this article, the process of clustering Lithuanian news articles is studied. For text preprocessing, the effects of stemming, term frequency metrics and feature filtering are investigated. In addition, the following clustering algorithms are compared: k–means, bisecting k–means, and three linkage-method variations of hierarchical clustering. The results show that the k–means algorithm gives the best overall results and that only one of the three hierarchical algorithms produces comparably good results. Term frequency–inverse document frequency (TF–IDF) with stemming significantly increased clustering quality compared to not doing stemming and/or using plain TF. Feature filtering by IDF helped to optimize the k–means algorithm, but reduced the quality when using hierarchical clustering.

Keywords—document clustering; feature selection; k-means; hierarchical clustering; Lithuanian news articles

Copyright © 2017 held by the authors

I. INTRODUCTION

The way people transfer information has changed drastically throughout history. The initial methods were slow and had low capacity, e.g. a messenger carrying as many scrolls as he is physically able to carry. Throughout the centuries new ways to store and transfer information were invented. Nowadays information travels at the speed of light in amounts so large that humans can hardly comprehend them. While the amount of information transferred has increased, human capabilities to consume this information have not improved as much. This only increases the need for automated means to process information: group it, categorize it, find duplicates, etc. One such means is document clustering.

This work studies the process of clustering Lithuanian news articles. The purpose of this study is to examine how different text preprocessing steps and clustering algorithms affect the quality of the clustering produced. This article describes the work done and its results. Similar previous work includes [1, 2, 3].

This document is divided into several sections. Section II introduces the reader to feature selection and clustering of textual data. Section III describes the data set used in this study. Section IV lists the metrics used to evaluate the results. Sections V and VI describe how features were selected and what experiments were performed. Sections VII and VIII provide the analysis of results and the conclusions.
II. A BRIEF INTRODUCTION TO DOCUMENT CLUSTERING

Document clustering is an application of cluster analysis to textual documents. It is unsupervised learning and can be used for finding similar documents, organizing large document collections, detecting duplicate content and search optimization [4].

The document clustering process can be separated into three stages: feature selection, feature extraction, and the application of a clustering algorithm. A more detailed description of these stages is given in the following paragraphs.

A. Feature selection

Feature selection is the process of creating feature vectors from the text of the documents. It includes various text processing steps such as splitting the text into tokens, stemming and lemmatization, removing stop words, and calculating term frequency values.

Text data has a few properties which require a different approach than clustering other kinds of data [5]. While the dimensionality of text data is large, documents usually contain a relatively small number of distinct words (i.e. the data is sparse). In addition, the number of words in two documents may differ by orders of magnitude (e.g. a tweet versus a chapter of a book). Also, some of the words might be common in all kinds of texts, while others only appear in specific types of documents. Measures for these problems can and should be taken in the feature selection step.

Stemming and lemmatization can be used to reduce the problem of high dimensionality. Multiple forms of the same word induce multiple dimensions in a feature vector. By stemming words, multiple dimensions are joined into a single dimension corresponding to a base form of the word. In addition to decreasing dimensionality, this also reduces noise caused by the grammar of a language (which is especially true for the Lithuanian language, in which a word can have many different forms).

There are multiple ways to control the input of a term to the clustering process. A simple stop word list can be used to remove stop words from the text. Other statistical techniques can be used for removing less significant terms, such as term strength, entropy-based ranking and term contribution [5]. A common approach to control how much a term affects the clustering process is term frequency–inverse document frequency (TF–IDF). A numeric value is assigned to every term as a product of the TF and IDF values. TF is calculated per document and is directly proportional to the number of times a term appears in a document. IDF is calculated per corpus and is inversely proportional to the number of documents in which a term appears.

There are multiple variants of the TF–IDF formula. In this work the following formula is used:

    \mathrm{tfidf}(t, d, D) = f_{t,d} \cdot \log \frac{N}{n_t},    (1)

where f_{t,d} is the number of times term t appears in document d, N is the number of documents in the corpus D, and n_t is the number of documents in D containing term t.
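To make the weighting in (1) concrete, below is a minimal Python sketch of this TF–IDF variant, operating on documents that have already been tokenized (and, optionally, stemmed). The function and variable names are illustrative and are not taken from the paper.

```python
import math
from collections import Counter

def tf_idf_vectors(token_docs):
    """Compute TF-IDF weights as in (1): f_{t,d} * log(N / n_t).

    token_docs: list of token lists, one list per document.
    Returns a list of {term: weight} dictionaries, one per document.
    """
    n_docs = len(token_docs)

    # n_t: number of documents in which term t appears.
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in token_docs:
        term_counts = Counter(tokens)          # f_{t,d}
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors

# Tiny illustrative usage on hypothetical tokenized documents.
docs = [["seimas", "balsavo", "del", "biudzeto"],
        ["rinktine", "laimejo", "rungtynes"],
        ["seimas", "svarste", "biudzeto", "projekta"]]
weights = tf_idf_vectors(docs)
```

Note that a term appearing in every document gets weight log(1) = 0, which is exactly how this scheme suppresses corpus-wide stop words.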
B. Feature extraction

While feature selection is a process of filtering relevant features of a text, feature extraction uses the original feature set to build new features. This includes methods for dimensionality reduction, such as Principal Component Analysis, Non-negative Matrix Factorization and Latent Semantic Indexing [5]. The clustering itself can also be used to extract features. For example, in [6] it was shown how to reduce the dimensionality and noise of the features by clustering the words first.

The importance of feature extraction comes from the fact that words tend to correlate with one another. This means that the feature space is bigger than the number of concepts [5]. Dimensionality reduction can be used to transform this big feature space into a smaller space of concepts.

C. Clustering

There are many different approaches to clustering textual data [5]: distance-based clustering, probabilistic methods (e.g. topic modeling), co-clustering (clustering words and documents simultaneously), and clustering with frequent phrases. In this work variations of two distance-based algorithms are studied: k–means clustering and hierarchical agglomerative clustering.

K–means is a simple algorithm producing a flat clustering. During initialization, the algorithm selects k means, which correspond to k clusters. Then two steps are repeated: (1) for every data point, choose the nearest mean and assign the point to the corresponding cluster; (2) recalculate the means by averaging the data points assigned to each cluster. The algorithm terminates when the assignment of the data points does not change between successive iterations.

The clustering produced by the k–means algorithm highly depends on the initial means chosen. A common approach is to run the algorithm multiple times with randomly chosen initial means. In [7] a simple randomized seeding technique is proposed which improves the speed and accuracy of k–means.

The bisecting k-means algorithm is a variation of k–means. This algorithm takes a top-down approach. It starts with a single cluster containing all the data points. Iteratively, the biggest cluster is chosen and split into two parts by running an inner k–means on the data points of that cluster. The algorithm terminates when the chosen number of clusters is reached.

Hierarchical agglomerative clustering is another well-known clustering technique. The algorithm starts by assigning every data point to a separate cluster. Then, iteratively, the two closest clusters are joined into a single cluster. This step is repeated until a single cluster is left. The output of the algorithm is a tree which describes how the clusters were joined.

There are multiple ways to describe the distance between two clusters in terms of distances between data points in those clusters:

• Single linkage. The distance between two clusters is equal to the smallest distance between two data points crossing these clusters (i.e. the closest pair of points across clusters). This resembles Kruskal's minimum spanning tree algorithm.

• Group average linkage (also known as Unweighted Pair Group Method with Arithmetic Mean, UPGMA). In this method, the distance between clusters is the average of the distances between every pair of data points across clusters.

• Complete linkage. This is similar to the single linkage method, but instead of the smallest distance, the largest distance between a pair of data points is taken (i.e. the most distant pair of points across clusters).

The distance-based clustering algorithms require a metric to evaluate the distance between data points. In this work the cosine distance was chosen:

    d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}},    (2)

where \mathbf{x}, \mathbf{y} are feature vectors and n is the dimensionality of the vectors. Cosine distance is a commonly used distance metric and gives good results for textual data [8, 9]. In [1] it was shown that cosine distance works significantly better than Euclidean distance when clustering Lithuanian texts using the k–means algorithm.
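As an illustration of the cosine distance in (2) and the three linkage strategies above, here is a small Python sketch using SciPy on a matrix of TF–IDF row vectors; it is a generic sketch rather than the exact implementation used in the paper, and the function name is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_cosine(tfidf_matrix, n_clusters, method="average"):
    """Agglomerative clustering of row vectors using cosine distance.

    method: "single", "average" (UPGMA) or "complete", as in Section II.C.
    Returns an array of cluster ids, one per document.
    """
    # Condensed matrix of pairwise cosine distances, as defined in (2).
    distances = pdist(tfidf_matrix, metric="cosine")
    # Build the cluster-merge tree with the chosen linkage method.
    tree = linkage(distances, method=method)
    # Cut the tree so that a flat clustering with n_clusters is obtained.
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Illustrative usage on a random stand-in for a TF-IDF matrix.
rng = np.random.default_rng(0)
X = rng.random((20, 50))
labels = hierarchical_cosine(X, n_clusters=5, method="average")
```

Note that the experiments in Section VI extract flat clusters from the tree with the HDBSCAN method [10] rather than a fixed "maxclust" cut; the cut above is only the simplest way to obtain a flat clustering.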
III. DATA SET PREPARATION

The articles for the experiments were taken from three major Lithuanian news websites: delfi.lt, alfa.lt and 15min.lt. A week's worth of articles was retrieved, from January 1st to January 7th, 2016, 3572 articles in total.

The news sites contain different sub-sections, e.g. "news from Lithuania", "crime", "business", etc. Every article is published in one of these sub-sections. To label the data, the names of the sub-sections were used. One problem with such labeling is that different websites have a different number of categories, some of which are more abstract than others (e.g. one website has a single sub-section "sports", while another has multiple sub-sections for "basketball", "football" and so on). To avoid labels being mismatched, they were normalized across the websites by grouping them into several categories. By investigating the websites, a common set of categories was identified:

• Lithuania news (413 articles);
• World news (426 articles);
• Crime (321 articles);
• Business (337 articles);
• Cars (170 articles);
• Sports (602 articles);
• Technologies (129 articles);
• Opinions (99 articles);
• Entertainment (526 articles);
• Life (306 articles);
• Culture (56 articles);
• Other (187 articles which do not fall into the previous categories).

This division into categories shows that some categories are more popular than others. For example, 1/6th of the articles fall into the "sports" category. The standard deviation of the category sizes is 173.6, and their median absolute deviation is 135 (these metrics are also used for the comparison of clustering algorithms further in this article).

Fig. 1 shows the article categories mapped to 2 dimensions using PCA. While the "sports" category lies in the top right quadrant, clearly separated from the other categories, the remaining categories are much more intermixed. This is probably caused by the fact that sports articles have a distinctive vocabulary and that the "sports" category takes up a big part of the data set.

Fig. 1. Article categories mapped to 2 dimensions using PCA. The "sports" category lies in the top right quadrant, clearly separated from the other categories.

IV. MEANS OF EVALUATION

To measure the quality of the clustering, several evaluation metrics were used: precision, recall, F1 score, purity and entropy.

A. Precision and recall

Precision and recall are well-known measures in information retrieval. Both measures deal with the notions of true positive, false positive, true negative and false negative. In the case of clustering, every pair of documents is taken into account. If a pair of documents has the same label and appears in the same cluster, then it is a true positive. The rest of the notions are described in Table I.

TABLE I. CONFUSION MATRIX FOR A PAIR OF DOCUMENTS

                        Same cluster      Different clusters
    Matching labels     True positive     False negative
    Different labels    False positive    True negative

Precision and recall range from 0 to 1 and are defined as:

    \mathrm{precision} = \frac{TP}{TP + FP},    (3)

    \mathrm{recall} = \frac{TP}{TP + FN}.    (4)

B. F1 score

Precision and recall tend to introduce bias when the number of clusters reaches extremes (e.g. recall equals 1 when every item falls into a single cluster). The F1 score reduces this bias by combining both measures:

    F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.    (5)

Because the F1 score does not favor a high or low number of clusters as much as the other metrics, preference is given to this metric throughout the article.

C. Purity

Purity describes the homogeneity of clusters. The purity of a clustering ranges from 0 to 1 and is defined as

    \mathrm{purity} = \frac{1}{N} \sum_{i=1}^{k} |c_i \cap g_j|,    (6)

where N is the number of items, k is the number of clusters, c_i is a chosen cluster, and g_j is the category which has the maximum number of items in c_i.

Purity is biased towards a high number of clusters. For example, when every cluster consists of a single item, purity equals 1.

D. Entropy

Entropy, like purity, describes the homogeneity of the clusters. However, while purity only considers the number of items from the dominating category, entropy considers the whole composition of the cluster. For a single cluster, entropy is defined as

    E(c_i) = - \sum_{j=1}^{m} \frac{|c_i \cap g_j|}{|c_i|} \log \frac{|c_i \cap g_j|}{|c_i|},    (7)

where c_i is a chosen cluster, m is the number of categories, and g_j is a category which has items in cluster c_i. To calculate the entropy of a whole clustering, a weighted average is used:

    E(C) = \sum_{i=1}^{k} \frac{|c_i|}{N} E(c_i),    (8)

where C is the set of clusters c_i, k is the number of clusters, and N is the number of items.

Entropy, like purity, is biased towards a high number of clusters (e.g. the entropy of a single-item cluster is 0).
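The pair-counting metrics (3)–(5) and the cluster-composition metrics (6)–(8) can all be computed from a contingency table of clusters versus categories. Below is a minimal Python sketch consistent with the definitions above; it is an illustration, not the paper's own evaluation code, and the helper names are assumptions.

```python
import math
from collections import Counter

def comb2(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

def evaluate(cluster_ids, labels):
    """Pairwise precision/recall/F1, purity and entropy.

    cluster_ids[i] and labels[i] refer to the same document i.
    """
    n = len(labels)
    cluster_sizes = Counter(cluster_ids)
    label_sizes = Counter(labels)
    cells = Counter(zip(cluster_ids, labels))   # contingency table

    # Pair counts: a true positive is a pair in the same cluster with the same label.
    tp = sum(comb2(v) for v in cells.values())
    same_cluster = sum(comb2(v) for v in cluster_sizes.values())   # TP + FP
    same_label = sum(comb2(v) for v in label_sizes.values())       # TP + FN

    precision = tp / same_cluster if same_cluster else 0.0          # (3)
    recall = tp / same_label if same_label else 0.0                 # (4)
    f1 = (2 * precision * recall / (precision + recall)             # (5)
          if precision + recall else 0.0)

    # Purity (6): fraction of items belonging to each cluster's dominant category.
    purity = sum(
        max(v for (c, _), v in cells.items() if c == cluster)
        for cluster in cluster_sizes
    ) / n

    # Entropy (7)-(8): weighted average of per-cluster entropies.
    entropy = 0.0
    for cluster, size in cluster_sizes.items():
        cluster_entropy = -sum(
            (v / size) * math.log(v / size)
            for (c, _), v in cells.items() if c == cluster
        )
        entropy += (size / n) * cluster_entropy

    return precision, recall, f1, purity, entropy
```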
V. FEATURE SELECTION PROCEDURE

Text preprocessing and the conversion to feature vectors consisted of the following steps:

1) Splitting the text into tokens;
2) Switching characters to lowercase;
3) Stemming (skipped in some experiments);
4) Filtering out a specific percentage of the terms which are rarest according to IDF (skipped in some experiments);
5) Applying TF–IDF (TF used in some experiments);
6) Normalizing the feature vectors.

More details about which feature selection steps were used in which experiments are given in the experiment descriptions.
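A compact Python sketch of this six-step pipeline is given below. The tokenizer, the stemmer and the filtering threshold are placeholders chosen for illustration; the paper does not specify the exact implementations (in particular the Lithuanian stemmer), so all names here should be treated as assumptions.

```python
import math
import re
from collections import Counter

def preprocess(raw_docs, stem=lambda w: w, keep_fraction=1.0, use_idf=True):
    """Steps 1-6 of Section V: tokenize, lowercase, stem, filter by IDF,
    weight by TF or TF-IDF, and L2-normalize.

    stem: a stemming function (identity by default; a Lithuanian stemmer
          would be plugged in here).
    keep_fraction: fraction of terms kept after IDF-based filtering.
    use_idf: if False, plain TF weighting is used instead of TF-IDF.
    """
    # Steps 1-3: tokenize, lowercase, stem.
    token_docs = [
        [stem(tok) for tok in re.findall(r"\w+", doc.lower())]
        for doc in raw_docs
    ]

    # Document frequencies (n_t) for every term.
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))
    n_docs = len(token_docs)

    # Step 4: keep only the most common terms (the rarest terms have the
    # highest IDF, so they are dropped first).
    ranked = sorted(doc_freq, key=doc_freq.get, reverse=True)
    vocab = set(ranked[:max(1, int(len(ranked) * keep_fraction))])

    # Steps 5-6: TF or TF-IDF weighting, then L2 normalization.
    vectors = []
    for tokens in token_docs:
        counts = Counter(t for t in tokens if t in vocab)
        weights = {
            t: c * (math.log(n_docs / doc_freq[t]) if use_idf else 1.0)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors
```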
VI. EXPERIMENTS

Several experiments were conducted during this work, testing text preprocessing steps and clustering algorithms. Below is a detailed description of these experiments.

A. Stemming and term frequency experiment

In this experiment, it was studied how stemming and different term frequency weightings affect the clustering. Four different configurations were used: (1) TF without stemming, (2) TF with stemming, (3) TF–IDF without stemming, and (4) TF–IDF with stemming. No term filtering was applied.

For clustering, the k-means algorithm was used. Each stemming/term frequency configuration was tested by making multiple runs, trying out different numbers of clusters k ∈ [9; 13] with 5 randomly generated sets of initial means (in total, 25 runs per stemming/term frequency configuration).

B. Clustering algorithms experiment

In this experiment, multiple clustering algorithms were tested. In every test the same feature selection procedure was used. The text preprocessing matched the steps mentioned in section "Feature Selection Procedure", except that no term filtering was applied.

Below is the list of clustering algorithms tested:

• K-means;
• Bisecting k-means;
• Hierarchical, single linkage;
• Hierarchical, group average linkage;
• Hierarchical, complete linkage;
• Random (for comparison).

In the case of the k-means algorithms, the previously mentioned configuration was used: 5 randomly generated sets of initial means for each k ∈ [9; 13]. The results of the multiple tests were averaged.

Contrary to k-means, hierarchical clustering does not depend on a random factor. However, to extract a flat clustering, the HDBSCAN extraction method [10] was used, which does not allow direct control of the number of clusters. In HDBSCAN, a single parameter is used: the minimum cluster size. This allows only imprecise control of the number of clusters produced. Because of this, multiple tests were run with different values of the minimum cluster size parameter. The tests which produced from 9 to 13 clusters were picked and their results were averaged.

C. Term filtering experiment

In the term filtering experiment, it was studied how the clustering quality changes in response to increasing the number of terms removed from the feature vectors. Filtering was done by sorting the terms by IDF and removing the ones which appear in the least number of documents. The percentage of terms remaining after filtering ranged from 100% to 10%. The k-means and group average linkage hierarchical algorithms were used, configured as in the clustering algorithms experiment.
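As a sketch of how the k-means part of the experiments above could be driven, the snippet below runs scikit-learn's KMeans for k ∈ [9; 13] with 5 random initializations each on L2-normalized TF–IDF vectors. The vectorizer settings and the helper name are assumptions for illustration; scikit-learn's built-in TF–IDF variant also differs slightly from (1), and its term-filtering options (e.g. min_df) only approximate the IDF-based filtering of Section VI.C.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def run_kmeans_grid(raw_docs, k_values=range(9, 14), n_seeds=5):
    """Cluster documents for every k and seed; returns {(k, seed): labels}."""
    # Steps 1, 2 and 5 of Section V; stemming and IDF filtering are omitted here.
    vectorizer = TfidfVectorizer(lowercase=True)
    X = normalize(vectorizer.fit_transform(raw_docs))   # step 6: L2 normalization

    results = {}
    for k in k_values:
        for seed in range(n_seeds):
            # One random initialization per seed; scikit-learn uses the
            # k-means++ seeding of [7] by default.
            model = KMeans(n_clusters=k, n_init=1, random_state=seed)
            results[(k, seed)] = model.fit_predict(X)
    return results
```

For unit-length vectors, squared Euclidean distance equals 2·(1 − cosine similarity), so Euclidean k-means on normalized TF–IDF vectors is a common approximation of cosine-based clustering.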
VII. RESULTS AND ANALYSIS

A. Stemming and term frequency experiment

The average F1 scores of the different stemming and term frequency combinations are displayed in Fig. 2. The worst results are produced by TF, both without and with stemming. TF–IDF without stemming provides only a small increase, while the F1 score improves by more than 50% when stemming is applied as well.

The results confirm the intuition that TF–IDF and stemming improve clustering by disregarding stop words and reducing the noise caused by language grammar. It is worth noting that stemming was much more effective when combined with TF–IDF. This shows that, when selecting features, one misconfigured step can greatly diminish the effect of the other steps.

Fig. 2. F1 scores of clusterings using different combinations of TF/TF–IDF and stemming. TF–IDF produces better results than TF both with and without stemming. While stemming has a small impact when combined with TF, there is a more than 50% increase in F1 when combining stemming with TF–IDF.

B. Clustering algorithms experiment

Fig. 3 displays the average and maximum values of the F1 score. The best results (both average and maximum) were shown by the k–means and bisecting k–means algorithms (the latter achieving the absolute maximum across all algorithms with no term filtering applied). Close results were shown by the group average linkage hierarchical algorithm. The complete linkage and single linkage versions performed significantly worse, with values closer to random clustering than to the previously mentioned algorithms.

Fig. 3. Average and maximum F1 score of different clustering algorithms. The best results were shown by the k-means and bisecting k-means algorithms, with group average hierarchical clustering coming in a close third.

More quality measures are given in Table II. Different algorithms lead according to different metrics, but two of them stand out: the k–means algorithm has the highest mean value of F1 score and purity, while group average linkage hierarchical clustering leads by precision and entropy.

TABLE II. EVALUATION METRICS OF DIFFERENT CLUSTERING ALGORITHMS (QUALITY MEASURES AND CLUSTER SIZE STATISTICS, AVERAGED)

    Clustering algorithm                  F1 score  Precision  Recall  Purity  Entropy  Standard deviation  Median absolute deviation (MAD)
    K–means                               0.386     0.327      0.482   0.522   0.580    286.5               194.9
    Bisecting k–means                     0.375     0.293      0.532   0.479   0.617    356.7               150.9
    Hierarchical, group average linkage   0.354     0.330      0.389   0.495   0.578    221.4               79.8
    Hierarchical, complete linkage        0.194     0.152      0.277   0.291   0.806    355.1               71.0
    Hierarchical, single linkage          0.187     0.111      0.604   0.287   0.811    853.0               14.0
    Random                                0.100     0.109      0.093   0.173   0.927    16.8                13.2

Among all the algorithms, single linkage hierarchical clustering has the best recall value. This is probably caused by the fact that recall is biased to favor big clusters. In many cases this algorithm produced a single cluster containing more than half of the articles along with multiple small clusters. The cluster size statistics confirm this: single linkage hierarchical clustering has the highest standard deviation, but the lowest median absolute deviation (MAD). The high standard deviation is caused by cluster sizes being distant from the mean. In the case of MAD, a small number of outliers is irrelevant, so a low MAD value means that the sizes of most of the clusters are close to each other.

When comparing cluster size statistics, the group average hierarchical algorithm produces clusters which are close to each other in size. This observation is backed up by relatively low values of standard deviation and median absolute deviation. The cluster size distribution is not as even in the case of the two k–means algorithms. While k–means has a lower standard deviation, bisecting k–means has a lower MAD. In other words, the majority of clusters produced by bisecting k–means are more similar in size than those produced by k–means, but the outliers are more extreme too.

Fig. 4 displays the relation between the number of clusters and the F1 score of the hierarchical clustering algorithms. Group average clustering tends to show higher values when the number of clusters is close to the number of categories. This signals that the underlying concepts found by the algorithm are similar to the categories of the documents. In the case of the single and complete linkage methods, there is no such correlation.

Fig. 4. Relation between F1 score and number of clusters produced by hierarchical clustering algorithms. When the number of clusters gets close to the number of categories, the F1 score for group average hierarchical clustering tends to increase.
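The cluster size statistics in Table II (standard deviation and MAD of the cluster sizes) can be reproduced with a few lines of NumPy; this is a generic sketch of the two statistics, not the paper's code.

```python
import numpy as np
from collections import Counter

def cluster_size_stats(cluster_ids):
    """Standard deviation and median absolute deviation of cluster sizes."""
    sizes = np.array(list(Counter(cluster_ids).values()), dtype=float)
    std = sizes.std()                                   # spread around the mean
    mad = np.median(np.abs(sizes - np.median(sizes)))   # robust to outliers
    return std, mad
```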
C. Term filtering experiment

For the filtering tests, the k-means and group average linkage hierarchical algorithms were chosen. Multiple runs were made, starting from the full feature space of 41 thousand dimensions down to only 10% of the terms which are most common across the documents. The F1 scores are displayed in Fig. 5.

Of the two algorithms, k–means shows less sensitivity to the changes of the feature space. While there is a small decrease in F1 score when moving from 100% to 50% of the dimensionality, the F1 score at 10% rises slightly above the 0.386 achieved with the full feature space. Similar results were shown in [11]. That study compared different metrics for feature selection, including document frequency; the clustering was done on English texts using the k–means algorithm. The variation of the quality metrics when moving from 100% to 10% of the features was relatively small, and a more rapid change occurred going from 10% down to 2%. Reference [2] tested multiple feature selection metrics on Lithuanian and Russian texts, and had the best results at 7% of features remaining.

The results of group average linkage hierarchical clustering differ from the k-means ones. The variance of the F1 score is significantly greater: at the beginning the score increases, then the values drop, and after a temporary increase they eventually drop again. The pruning of the first 10% of features might have reduced the noise in the feature space, which would explain the sudden rise of the F1 score, while further pruning might have started removing relevant features. However, this does not explain the increase in F1 score at 30–20% of the feature space.

Fig. 5. F1 score dependency on the percentage of features used. The k–means algorithm was less sensitive to the change of feature space dimensionality; the F1 score of hierarchical clustering varies significantly more.

When comparing the two clustering algorithms, k-means seems to intrinsically diminish the value of rare terms, while in the case of group average hierarchical clustering the impact of such terms on the clustering results is significantly higher.

D. A drill-down look at the clustering results

During this study, several hundred runs of various clustering configurations were made. In the top 50 tests sorted by F1 score, every single one used one of the two k–means algorithms, only several of them being bisecting k–means. Most of these tests came from the filtering experiment, over half of them having 40% or fewer features remaining.

In order to get a better idea of whether the methods actually produce meaningful clusters, a metric was proposed to measure which terms are most relevant to a cluster. It is a combination of two inverse document frequencies and is defined as

    \mathrm{relevance}(t, c, D) = \frac{\mathrm{idf}(t, D)}{\mathrm{idf}(t, c)},    (9)

where t is a term, c is the set of documents in a cluster, and D is the set of documents in the data set. In other words, term relevance is high if the term is (1) relatively rare across the documents of the data set and (2) relatively common in the documents of the cluster.

The test with the highest F1 score came from the filtering experiment: the k–means algorithm with 20% of features remaining. The most relevant terms of this test are displayed in Table III. As expected, the clusters do not map identically to the categories in the data set. However, the themes of the clusters are easily noticeable (in Table III, some of the themes are "cars", "sports", "Dakar rally", "business", "foreign politics" and "crime"). Despite that, a deeper look at the clusters reveals that it is not uncommon for other themes to appear in the clusters, sometimes not very related to the relevant words.

TABLE III. RELEVANT TERMS (IN LITHUANIAN) OF THE CLUSTERING WITH THE HIGHEST F1 SCORE (CLUSTER SIZE: RELEVANT WORDS)

    143: variklio, dyzelinių, gamintojų, automobilių, elektromobilių, tesla, motors, fiat, volkswagen, lg
    113: šalčio, temperatūra, laipsnių, kritulių, rajoniniai, sniego, provėžoti, hidrometeorologijos, plikledis, prispausto
    907: aktorė, dainininkė, muzikinę, prodiuseris, eurovizijos, koncertinį, meninės, žanro, scenoje, režisierius
    484: rungtynių, ekipa, žaidėjų, turnyro, taškų, pergalę, kamuolį, treneris, rezultatyvaus, įvartį
    420: karinių, partijos, sąjungininkų, nimro, šiitų, obama, narystės, referendumas, sirijos, nato
    308: policijos, ugniagesiai, vpk, neblaivus, ambulatoriškai, prom, patrulių, komisariato, girtumas, vairuojamas
    449: bendrovės, barelį, indeksas, brent, įmonės, valiutos, wti, holding, akcininkų, aplinkosaugos
    121: dakaro, ruožas, ralio, ekipažas, vanagas, juknevičius, lenktynininkai, trasos, lenktynių, benediktas
    232: jums, organizmą, vitaminų, astrologė, cukraus, horoskopas, riebalų, jus, mitybos, ožiaragis
    395: šulinį, saviečių, tragedijos, smurto, mažamečius, smurtaudamas, sugyventinę, sumetė, nužudė, ekspertizė
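Below is a minimal Python sketch of a cluster term relevance score of this kind, combining a corpus-level and a cluster-level IDF as described above. The exact combination and all names are illustrative assumptions rather than the paper's definition.

```python
import math
from collections import Counter

def idf(term, token_docs):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for tokens in token_docs if term in tokens)
    return math.log(len(token_docs) / containing) if containing else float("inf")

def top_relevant_terms(cluster_docs, all_docs, top_n=10):
    """Rank a cluster's terms by corpus IDF relative to in-cluster IDF.

    cluster_docs, all_docs: lists of token lists (cluster_docs is a subset
    of all_docs). A term scores high if it is rare in the whole data set
    but common inside the cluster, as described for (9).
    """
    vocabulary = set(t for tokens in cluster_docs for t in tokens)
    scores = {
        term: idf(term, all_docs) / max(idf(term, cluster_docs), 1e-9)
        for term in vocabulary
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```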
Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007. [8] H. Anna, "Similarity measures for text document clustering," in Proceedings of the Sixth New Zealand, 2008. [9] A. Strehl, J. Ghosh, R. Mooney, "Impact of similarity measures on web- page clustering," in In Workshop on Artificial Intelligence for Web Search (AAAI 2000), 2000. [10] R. J. G. B. Campello, D. Moulavi, J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, pp. 160-172, 2013. Fig. 5. F1 score dependency on the percentage of features used. K–means [11] T. Liu, S. Liu, Z. Chen, W.-Y. Ma, "An evaluation on feature selection algorithm was less sensitive to the change of feature space dimensionality. F1 for text clustering," Icml, pp. 488-495, 2003. score of hierarchical clustering varies significantly more. 32