Clustering of Lithuanian News Articles

Vilius Pranckaitis, Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania
e-mail: vilius.pranckaitis@ktu.edu, mantas.lukosevicius@ktu.lt

Abstract—There is arguably more research on clustering English texts than texts of any other language. In this article, the process of clustering Lithuanian news articles is studied. For text preprocessing, the effects of stemming, term frequency metrics and feature filtering are investigated. In addition, the following clustering algorithms are compared: k–means, bisecting k–means, and three linkage-method variations of hierarchical clustering. The results show that the k–means algorithm gives the best overall results and that only one of the three hierarchical algorithms produces comparably good results. Term frequency–inverse document frequency (TF–IDF) with stemming significantly increased clustering quality compared to not doing stemming and/or using plain TF. Feature filtering by IDF helped to optimize the k–means algorithm, but reduced the quality when using hierarchical clustering.

Keywords—document clustering; feature selection; k-means; hierarchical clustering; Lithuanian news articles

Copyright © 2017 held by the authors

I. INTRODUCTION

The way people transfer information has changed drastically throughout history. The initial methods were slow and had low capacity, e.g. a messenger carrying as many scrolls as he is physically able to carry. Throughout the centuries new ways to store and transfer information were invented. Nowadays information travels at the speed of light in amounts so large that humans can hardly comprehend them. While the amount of information transferred has increased, human capabilities to consume this information have not improved as much. This only increases the need for automated means to process information: group it, categorize it, find duplicates, etc. One such means is document clustering.

This work studies the process of clustering Lithuanian news articles. The purpose of this study is to examine how different text preprocessing steps and clustering algorithms affect the quality of the clustering produced. This article describes the work done and its results. Similar previous work includes [1, 2, 3].

This document is divided into several sections. Section II introduces the reader to feature selection and clustering of textual data. Section III describes the data set used in this study. Section IV lists the metrics used to evaluate the results. Sections V and VI describe how features were selected and what experiments were performed. Sections VII and VIII provide the analysis of results and the conclusions.
II. A BRIEF INTRODUCTION TO DOCUMENT CLUSTERING

Document clustering is an application of cluster analysis to textual documents. It is unsupervised learning and can be used for finding similar documents, organizing large document collections, detecting duplicate content and search optimization [4].

The document clustering process can be separated into three stages: feature selection, feature extraction, and the application of a clustering algorithm. A more detailed description of these stages is given in the following paragraphs.

A. Feature selection

Feature selection is the process of creating feature vectors from the text of the documents. It includes various text processing steps such as splitting the text into tokens, stemming and lemmatization, removing stop words, and calculating term frequency values.

Text data has a few properties which require a different approach than clustering other kinds of data [5]. While the dimensionality of text data is large, documents usually contain a relatively small number of distinct words (i.e. the data is sparse). In addition, the number of words in two documents may differ by orders of magnitude (e.g. a tweet versus a chapter of a book). Also, some of the words might be common in all kinds of texts, while others only appear in specific types of documents. Measures for these problems can and should be taken in the feature selection step.

Stemming and lemmatization can be used to reduce the problem of high dimensionality. Multiple forms of the same word induce multiple dimensions in a feature vector. By stemming words, multiple dimensions are joined into a single dimension corresponding to a base form of the word. In addition to decreasing dimensionality, this also reduces noise caused by the grammar of a language (which is especially true for the Lithuanian language, in which a word can have many different forms).

There are multiple ways to control the input of a term to the clustering process. A simple stop word list can be used to remove stop words from the text. Other statistical techniques can be used for removing less significant terms, such as term strength, entropy-based ranking and term contribution [5]. A common approach to control how much a term affects the clustering process is term frequency–inverse document frequency (TF–IDF). A numeric value is assigned to every term as a product of the TF and IDF values. TF is calculated per document and is directly proportional to the number of times a term appears in a document. IDF is calculated per corpus and is inversely proportional to the number of documents in which a term appears.

There are multiple variants of the TF–IDF formula. In this work the following formula is used:

    \mathrm{tfidf}(t, d, D) = f_{t,d} \cdot \log \frac{N}{n_t},    (1)

where f_{t,d} is the number of times term t appears in document d, N is the number of documents in the corpus D, and n_t is the number of documents in D containing term t.
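To make the weighting in (1) concrete, below is a minimal Python sketch of this TF–IDF variant, operating on documents that have already been tokenized (and, optionally, stemmed). The function and variable names are illustrative and are not taken from the paper.

```python
import math
from collections import Counter

def tf_idf_vectors(token_docs):
    """Compute TF-IDF weights as in (1): f_{t,d} * log(N / n_t).

    token_docs: list of token lists, one list per document.
    Returns a list of {term: weight} dictionaries, one per document.
    """
    n_docs = len(token_docs)

    # n_t: number of documents in which term t appears.
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in token_docs:
        term_counts = Counter(tokens)          # f_{t,d}
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors

# Tiny illustrative usage on hypothetical tokenized documents.
docs = [["seimas", "balsavo", "del", "biudzeto"],
        ["rinktine", "laimejo", "rungtynes"],
        ["seimas", "svarste", "biudzeto", "projekta"]]
weights = tf_idf_vectors(docs)
```

Note that a term appearing in every document gets weight log(1) = 0, which is exactly how this scheme suppresses corpus-wide stop words.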
B. Feature extraction

While feature selection is a process of filtering relevant features of a text, feature extraction uses the original feature set to build new features. This includes methods for dimensionality reduction, such as Principal Component Analysis, Non-negative Matrix Factorization and Latent Semantic Indexing [5]. The clustering itself can also be used to extract features. For example, in [6] it was shown how to reduce the dimensionality and noise of the features by clustering the words first.

The importance of feature extraction comes from the fact that words tend to correlate with one another. This means that the feature space is bigger than the number of concepts [5]. Dimensionality reduction can be used to transform this big feature space into a smaller space of concepts.

C. Clustering

There are many different approaches to clustering textual data [5]: distance-based clustering, probabilistic methods (e.g. topic modeling), co-clustering (clustering words and documents simultaneously), and clustering with frequent phrases. In this work variations of two distance-based algorithms are studied: k–means clustering and hierarchical agglomerative clustering.

K–means is a simple algorithm producing a flat clustering. During initialization, the algorithm selects k means, which correspond to k clusters. Then two steps are repeated: (1) for every data point, choose the nearest mean and assign the point to the corresponding cluster; (2) recalculate the means by averaging the data points assigned to each cluster. The algorithm terminates when the assignment of the data points does not change between successive iterations.

The clustering produced by the k–means algorithm highly depends on the initial means chosen. A common approach is to run the algorithm multiple times with randomly chosen initial means. In [7] a simple randomized seeding technique is proposed which improves the speed and accuracy of k–means.

The bisecting k-means algorithm is a variation of k–means. This algorithm takes a top-down approach. It starts with a single cluster containing all the data points. Iteratively, the biggest cluster is chosen and split into two parts by running an inner k–means on the data points of that cluster. The algorithm terminates when the chosen number of clusters is reached.

Hierarchical agglomerative clustering is another well-known clustering technique. The algorithm starts by assigning every data point to a separate cluster. Then, iteratively, the two closest clusters are joined into a single cluster. This step is repeated until a single cluster is left. The output of the algorithm is a tree which describes how the clusters were joined.

There are multiple ways to describe the distance between two clusters in terms of distances between data points in those clusters:

• Single linkage. The distance between two clusters is equal to the smallest distance between two data points crossing these clusters (i.e. the closest pair of points across clusters). This resembles Kruskal's minimum spanning tree algorithm.

• Group average linkage (also known as Unweighted Pair Group Method with Arithmetic Mean, UPGMA). In this method, the distance between clusters is the average of the distances between every pair of data points across clusters.

• Complete linkage. This is similar to the single linkage method, but instead of the smallest distance, the largest distance between a pair of data points is taken (i.e. the most distant pair of points across clusters).

The distance-based clustering algorithms require a metric to evaluate the distance between data points. In this work the cosine distance was chosen:

    d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}},    (2)

where \mathbf{x}, \mathbf{y} are feature vectors and n is the dimensionality of the vectors. Cosine distance is a commonly used distance metric and gives good results for textual data [8, 9]. In [1] it was shown that cosine distance works significantly better than Euclidean distance when clustering Lithuanian texts using the k–means algorithm.
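As an illustration of the cosine distance in (2) and the three linkage strategies above, here is a small Python sketch using SciPy on a matrix of TF–IDF row vectors; it is a generic sketch rather than the exact implementation used in the paper, and the function name is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_cosine(tfidf_matrix, n_clusters, method="average"):
    """Agglomerative clustering of row vectors using cosine distance.

    method: "single", "average" (UPGMA) or "complete", as in Section II.C.
    Returns an array of cluster ids, one per document.
    """
    # Condensed matrix of pairwise cosine distances, as defined in (2).
    distances = pdist(tfidf_matrix, metric="cosine")
    # Build the cluster-merge tree with the chosen linkage method.
    tree = linkage(distances, method=method)
    # Cut the tree so that a flat clustering with n_clusters is obtained.
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Illustrative usage on a random stand-in for a TF-IDF matrix.
rng = np.random.default_rng(0)
X = rng.random((20, 50))
labels = hierarchical_cosine(X, n_clusters=5, method="average")
```

Note that the experiments in Section VI extract flat clusters from the tree with the HDBSCAN method [10] rather than a fixed "maxclust" cut; the cut above is only the simplest way to obtain a flat clustering.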
III. DATA SET PREPARATION

The articles for the experiments were taken from three major Lithuanian news websites: delfi.lt, alfa.lt and 15min.lt. A week's worth of articles was retrieved, from January 1st to January 7th, 2016, 3572 articles in total.

The news sites contain different sub-sections, e.g. "news from Lithuania", "crime", "business", etc. Every article is published in one of these sub-sections. To label the data, the names of the sub-sections were used. One problem with such labeling is that different websites have a different number of categories, some of which are more abstract than others (e.g. one website has a single sub-section "sports", while another has multiple sub-sections for "basketball", "football" and so on). To avoid labels being mismatched, they were normalized across the websites by grouping them into several categories. By investigating the websites, a common set of categories was identified:

• Lithuania news (413 articles);
• World news (426 articles);
• Crime (321 articles);
• Business (337 articles);
• Cars (170 articles);
• Sports (602 articles);
• Technologies (129 articles);
• Opinions (99 articles);
• Entertainment (526 articles);
• Life (306 articles);
• Culture (56 articles);
• Other (187 articles which do not fall into the previous categories).

This division into categories shows that some categories are more popular than others. For example, 1/6th of the articles fall into the "sports" category. The standard deviation of the category sizes is 173.6, and their median absolute deviation is 135 (these metrics are also used for the comparison of clustering algorithms further in this article).

Fig. 1 shows the article categories mapped to 2 dimensions using PCA. While the "sports" category lies in the top right quadrant, clearly separated from the other categories, the remaining categories are much more intermixed. This is probably caused by the fact that sports articles have a distinctive vocabulary and that the "sports" category takes up a big part of the data set.

Fig. 1. Article categories mapped to 2 dimensions using PCA. The "sports" category lies in the top right quadrant, clearly separated from the other categories.

IV. MEANS OF EVALUATION

To measure the quality of the clustering, several evaluation metrics were used: precision, recall, F1 score, purity and entropy.

A. Precision and recall

Precision and recall are well-known measures in information retrieval. Both measures deal with the notions of true positive, false positive, true negative and false negative. In the case of clustering, every pair of documents is taken into account. If a pair of documents has the same label and appears in the same cluster, then it is a true positive. The rest of the notions are described in Table I.

TABLE I. CONFUSION MATRIX FOR A PAIR OF DOCUMENTS

                        Same cluster      Different clusters
    Matching labels     True positive     False negative
    Different labels    False positive    True negative

Precision and recall range from 0 to 1 and are defined as:

    \mathrm{precision} = \frac{TP}{TP + FP},    (3)

    \mathrm{recall} = \frac{TP}{TP + FN}.    (4)

B. F1 score

Precision and recall tend to introduce bias when the number of clusters reaches extremes (e.g. recall equals 1 when every item falls into a single cluster). The F1 score reduces this bias by combining both measures:

    F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.    (5)

Because the F1 score does not favor a high or low number of clusters as much as the other metrics, preference is given to this metric throughout the article.

C. Purity

Purity describes the homogeneity of clusters. The purity of a clustering ranges from 0 to 1 and is defined as

    \mathrm{purity} = \frac{1}{N} \sum_{i=1}^{k} |c_i \cap g_j|,    (6)

where N is the number of items, k is the number of clusters, c_i is a chosen cluster, and g_j is the category which has the maximum number of items in c_i.

Purity is biased towards a high number of clusters. For example, when every cluster consists of a single item, purity equals 1.

D. Entropy

Entropy, like purity, describes the homogeneity of the clusters. However, while purity only considers the number of items from the dominating category, entropy considers the whole composition of the cluster. For a single cluster, entropy is defined as

    E(c_i) = - \sum_{j=1}^{m} \frac{|c_i \cap g_j|}{|c_i|} \log \frac{|c_i \cap g_j|}{|c_i|},    (7)

where c_i is a chosen cluster, m is the number of categories, and g_j is a category which has items in cluster c_i. To calculate the entropy of a whole clustering, a weighted average is used:

    E(C) = \sum_{i=1}^{k} \frac{|c_i|}{N} E(c_i),    (8)

where C is the set of clusters c_i, k is the number of clusters, and N is the number of items.

Entropy, like purity, is biased towards a high number of clusters (e.g. the entropy of a single-item cluster is 0).
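The pair-counting metrics (3)–(5) and the cluster-composition metrics (6)–(8) can all be computed from a contingency table of clusters versus categories. Below is a minimal Python sketch consistent with the definitions above; it is an illustration, not the paper's own evaluation code, and the helper names are assumptions.

```python
import math
from collections import Counter

def comb2(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

def evaluate(cluster_ids, labels):
    """Pairwise precision/recall/F1, purity and entropy.

    cluster_ids[i] and labels[i] refer to the same document i.
    """
    n = len(labels)
    cluster_sizes = Counter(cluster_ids)
    label_sizes = Counter(labels)
    cells = Counter(zip(cluster_ids, labels))   # contingency table

    # Pair counts: a true positive is a pair in the same cluster with the same label.
    tp = sum(comb2(v) for v in cells.values())
    same_cluster = sum(comb2(v) for v in cluster_sizes.values())   # TP + FP
    same_label = sum(comb2(v) for v in label_sizes.values())       # TP + FN

    precision = tp / same_cluster if same_cluster else 0.0          # (3)
    recall = tp / same_label if same_label else 0.0                 # (4)
    f1 = (2 * precision * recall / (precision + recall)             # (5)
          if precision + recall else 0.0)

    # Purity (6): fraction of items belonging to each cluster's dominant category.
    purity = sum(
        max(v for (c, _), v in cells.items() if c == cluster)
        for cluster in cluster_sizes
    ) / n

    # Entropy (7)-(8): weighted average of per-cluster entropies.
    entropy = 0.0
    for cluster, size in cluster_sizes.items():
        cluster_entropy = -sum(
            (v / size) * math.log(v / size)
            for (c, _), v in cells.items() if c == cluster
        )
        entropy += (size / n) * cluster_entropy

    return precision, recall, f1, purity, entropy
```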
V. FEATURE SELECTION PROCEDURE

Text preprocessing and the conversion to feature vectors consisted of the following steps:

1) Splitting the text into tokens;
2) Switching characters to lowercase;
3) Stemming (skipped in some experiments);
4) Filtering out a specific percentage of the terms which are rarest according to IDF (skipped in some experiments);
5) Applying TF–IDF (TF used in some experiments);
6) Normalizing the feature vectors.

More details about which feature selection steps were used in which experiments are given in the experiment descriptions.
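A compact Python sketch of this six-step pipeline is given below. The tokenizer, the stemmer and the filtering threshold are placeholders chosen for illustration; the paper does not specify the exact implementations (in particular the Lithuanian stemmer), so all names here should be treated as assumptions.

```python
import math
import re
from collections import Counter

def preprocess(raw_docs, stem=lambda w: w, keep_fraction=1.0, use_idf=True):
    """Steps 1-6 of Section V: tokenize, lowercase, stem, filter by IDF,
    weight by TF or TF-IDF, and L2-normalize.

    stem: a stemming function (identity by default; a Lithuanian stemmer
          would be plugged in here).
    keep_fraction: fraction of terms kept after IDF-based filtering.
    use_idf: if False, plain TF weighting is used instead of TF-IDF.
    """
    # Steps 1-3: tokenize, lowercase, stem.
    token_docs = [
        [stem(tok) for tok in re.findall(r"\w+", doc.lower())]
        for doc in raw_docs
    ]

    # Document frequencies (n_t) for every term.
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))
    n_docs = len(token_docs)

    # Step 4: keep only the most common terms (the rarest terms have the
    # highest IDF, so they are dropped first).
    ranked = sorted(doc_freq, key=doc_freq.get, reverse=True)
    vocab = set(ranked[:max(1, int(len(ranked) * keep_fraction))])

    # Steps 5-6: TF or TF-IDF weighting, then L2 normalization.
    vectors = []
    for tokens in token_docs:
        counts = Counter(t for t in tokens if t in vocab)
        weights = {
            t: c * (math.log(n_docs / doc_freq[t]) if use_idf else 1.0)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors
```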
VI. EXPERIMENTS

Several experiments were conducted during this work, testing text preprocessing steps and clustering algorithms. Below is a detailed description of these experiments.

A. Stemming and term frequency experiment

In this experiment, it was studied how stemming and different term frequency weightings affect the clustering. Four different configurations were used: (1) TF without stemming, (2) TF with stemming, (3) TF–IDF without stemming, and (4) TF–IDF with stemming. No term filtering was applied.

For clustering, the k-means algorithm was used. Each stemming/term frequency configuration was tested by making multiple runs, trying out different numbers of clusters k ∈ [9; 13] with 5 randomly generated sets of initial means (in total, 25 runs per stemming/term frequency configuration).

B. Clustering algorithms experiment

In this experiment, multiple clustering algorithms were tested. In every test the same feature selection procedure was used. The text preprocessing matched the steps mentioned in section "Feature Selection Procedure", except that no term filtering was applied.

Below is the list of clustering algorithms tested:

• K-means;
• Bisecting k-means;
• Hierarchical, single linkage;
• Hierarchical, group average linkage;
• Hierarchical, complete linkage;
• Random (for comparison).

In the case of the k-means algorithms, the previously mentioned configuration was used: 5 randomly generated sets of initial means for each k ∈ [9; 13]. The results of the multiple tests were averaged.

Contrary to k-means, hierarchical clustering does not depend on a random factor. However, to extract a flat clustering, the HDBSCAN extraction method [10] was used, which does not allow direct control of the number of clusters. In HDBSCAN, a single parameter is used: the minimum cluster size. This allows only imprecise control of the number of clusters produced. Because of this, multiple tests were run with different values of the minimum cluster size parameter. The tests which produced from 9 to 13 clusters were picked and their results were averaged.

C. Term filtering experiment

In the term filtering experiment, it was studied how the clustering quality changes in response to increasing the number of terms removed from the feature vectors. Filtering was done by sorting the terms by IDF and removing the ones which appear in the least number of documents. The percentage of terms remaining after filtering ranged from 100% to 10%. The k-means and group average linkage hierarchical algorithms were used, configured as in the clustering algorithms experiment.
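As a sketch of how the k-means part of the experiments above could be driven, the snippet below runs scikit-learn's KMeans for k ∈ [9; 13] with 5 random initializations each on L2-normalized TF–IDF vectors. The vectorizer settings and the helper name are assumptions for illustration; scikit-learn's built-in TF–IDF variant also differs slightly from (1), and its term-filtering options (e.g. min_df) only approximate the IDF-based filtering of Section VI.C.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def run_kmeans_grid(raw_docs, k_values=range(9, 14), n_seeds=5):
    """Cluster documents for every k and seed; returns {(k, seed): labels}."""
    # Steps 1, 2 and 5 of Section V; stemming and IDF filtering are omitted here.
    vectorizer = TfidfVectorizer(lowercase=True)
    X = normalize(vectorizer.fit_transform(raw_docs))   # step 6: L2 normalization

    results = {}
    for k in k_values:
        for seed in range(n_seeds):
            # One random initialization per seed; scikit-learn uses the
            # k-means++ seeding of [7] by default.
            model = KMeans(n_clusters=k, n_init=1, random_state=seed)
            results[(k, seed)] = model.fit_predict(X)
    return results
```

For unit-length vectors, squared Euclidean distance equals 2·(1 − cosine similarity), so Euclidean k-means on normalized TF–IDF vectors is a common approximation of cosine-based clustering.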
VII. RESULTS AND ANALYSIS

A. Stemming and term frequency experiment

The average F1 scores of the different stemming and term frequency combinations are displayed in Fig. 2. The worst results are produced by TF, both without and with stemming. TF–IDF without stemming provides only a small increase, while the F1 score improves by more than 50% when stemming is applied as well.

The results confirm the intuition that TF–IDF and stemming improve clustering by disregarding stop words and reducing the noise caused by language grammar. It is worth noting that stemming was much more effective when combined with TF–IDF. This shows that, when selecting features, one misconfigured step can greatly diminish the effect of the other steps.

Fig. 2. F1 scores of clusterings using different combinations of TF/TF–IDF and stemming. TF–IDF produces better results than TF both with and without stemming. While stemming has a small impact when combined with TF, there is a more than 50% increase in F1 when combining stemming with TF–IDF.

B. Clustering algorithms experiment

Fig. 3 displays the average and maximum values of the F1 score. The best results (both average and maximum) were shown by the k–means and bisecting k–means algorithms (the latter achieving the absolute maximum across all algorithms with no term filtering applied). Close results were shown by the group average linkage hierarchical algorithm. The complete linkage and single linkage versions performed significantly worse, with values closer to random clustering than to the previously mentioned algorithms.

Fig. 3. Average and maximum F1 score of different clustering algorithms. The best results were shown by the k-means and bisecting k-means algorithms, with group average hierarchical clustering coming in a close third.

More quality measures are given in Table II. Different algorithms lead according to different metrics, but two of them stand out: the k–means algorithm has the highest mean value of F1 score and purity, while group average linkage hierarchical clustering leads by precision and entropy.

TABLE II. EVALUATION METRICS OF DIFFERENT CLUSTERING ALGORITHMS (QUALITY MEASURES AND CLUSTER SIZE STATISTICS, AVERAGED)

    Clustering algorithm                  F1 score  Precision  Recall  Purity  Entropy  Standard deviation  Median absolute deviation (MAD)
    K–means                               0.386     0.327      0.482   0.522   0.580    286.5               194.9
    Bisecting k–means                     0.375     0.293      0.532   0.479   0.617    356.7               150.9
    Hierarchical, group average linkage   0.354     0.330      0.389   0.495   0.578    221.4               79.8
    Hierarchical, complete linkage        0.194     0.152      0.277   0.291   0.806    355.1               71.0
    Hierarchical, single linkage          0.187     0.111      0.604   0.287   0.811    853.0               14.0
    Random                                0.100     0.109      0.093   0.173   0.927    16.8                13.2

Among all the algorithms, single linkage hierarchical clustering has the best recall value. This is probably caused by the fact that recall is biased to favor big clusters. In many cases this algorithm produced a single cluster containing more than half of the articles along with multiple small clusters. The cluster size statistics confirm this: single linkage hierarchical clustering has the highest standard deviation, but the lowest median absolute deviation (MAD). The high standard deviation is caused by cluster sizes being distant from the mean. In the case of MAD, a small number of outliers is irrelevant, so a low MAD value means that the sizes of most of the clusters are close to each other.

When comparing cluster size statistics, the group average hierarchical algorithm produces clusters which are close to each other in size. This observation is backed up by relatively low values of standard deviation and median absolute deviation. The cluster size distribution is not as even in the case of the two k–means algorithms. While k–means has a lower standard deviation, bisecting k–means has a lower MAD. In other words, the majority of clusters produced by bisecting k–means are more similar in size than those produced by k–means, but the outliers are more extreme too.

Fig. 4 displays the relation between the number of clusters and the F1 score of the hierarchical clustering algorithms. Group average clustering tends to show higher values when the number of clusters is close to the number of categories. This signals that the underlying concepts found by the algorithm are similar to the categories of the documents. In the case of the single and complete linkage methods, there is no such correlation.

Fig. 4. Relation between F1 score and number of clusters produced by hierarchical clustering algorithms. When the number of clusters gets close to the number of categories, the F1 score for group average hierarchical clustering tends to increase.
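The cluster size statistics in Table II (standard deviation and MAD of the cluster sizes) can be reproduced with a few lines of NumPy; this is a generic sketch of the two statistics, not the paper's code.

```python
import numpy as np
from collections import Counter

def cluster_size_stats(cluster_ids):
    """Standard deviation and median absolute deviation of cluster sizes."""
    sizes = np.array(list(Counter(cluster_ids).values()), dtype=float)
    std = sizes.std()                                   # spread around the mean
    mad = np.median(np.abs(sizes - np.median(sizes)))   # robust to outliers
    return std, mad
```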
C. Term filtering experiment

For the filtering tests, the k-means and group average linkage hierarchical algorithms were chosen. Multiple runs were made, starting from the full feature space of 41 thousand dimensions down to only 10% of the terms which are most common across the documents. The F1 scores are displayed in Fig. 5.

Of the two algorithms, k–means shows less sensitivity to the changes of the feature space. While there is a small decrease in F1 score when moving from 100% to 50% of the dimensionality, the F1 score at 10% rises slightly above the 0.386 achieved with the full feature space. Similar results were shown in [11]. That study compared different metrics for feature selection, including document frequency; the clustering was done on English texts using the k–means algorithm. The variation of the quality metrics when moving from 100% to 10% of the features was relatively small, and a more rapid change occurred going from 10% down to 2%. Reference [2] tested multiple feature selection metrics on Lithuanian and Russian texts, and had the best results at 7% of features remaining.

The results of group average linkage hierarchical clustering differ from the k-means ones. The variance of the F1 score is significantly greater: at the beginning the score increases, then the values drop, and after a temporary increase they eventually drop again. The pruning of the first 10% of features might have reduced the noise in the feature space, which would explain the sudden rise of the F1 score, while further pruning might have started removing relevant features. However, this does not explain the increase in F1 score at 30–20% of the feature space.

Fig. 5. F1 score dependency on the percentage of features used. The k–means algorithm was less sensitive to the change of feature space dimensionality; the F1 score of hierarchical clustering varies significantly more.

When comparing the two clustering algorithms, k-means seems to intrinsically diminish the value of rare terms, while in the case of group average hierarchical clustering the impact of such terms on the clustering results is significantly higher.

D. A drill-down look at the clustering results

During this study, several hundred runs of various clustering configurations were made. In the top 50 tests sorted by F1 score, every single one used one of the two k–means algorithms, only several of them being bisecting k–means. Most of these tests came from the filtering experiment, over half of them having 40% or fewer features remaining.

In order to get a better idea of whether the methods actually produce meaningful clusters, a metric was proposed to measure which terms are most relevant to a cluster. It is a combination of two inverse document frequencies and is defined as

    \mathrm{relevance}(t, c, D) = \frac{\mathrm{idf}(t, D)}{\mathrm{idf}(t, c)},    (9)

where t is a term, c is the set of documents in a cluster, and D is the set of documents in the data set. In other words, term relevance is high if the term is (1) relatively rare across the documents of the data set and (2) relatively common in the documents of the cluster.

The test with the highest F1 score came from the filtering experiment: the k–means algorithm with 20% of features remaining. The most relevant terms of this test are displayed in Table III. As expected, the clusters do not map identically to the categories in the data set. However, the themes of the clusters are easily noticeable (in Table III, some of the themes are "cars", "sports", "Dakar rally", "business", "foreign politics" and "crime"). Despite that, a deeper look at the clusters reveals that it is not uncommon for other themes to appear in the clusters, sometimes not very related to the relevant words.

TABLE III. RELEVANT TERMS (IN LITHUANIAN) OF THE CLUSTERING WITH THE HIGHEST F1 SCORE (CLUSTER SIZE: RELEVANT WORDS)

    143: variklio, dyzelinių, gamintojų, automobilių, elektromobilių, tesla, motors, fiat, volkswagen, lg
    113: šalčio, temperatūra, laipsnių, kritulių, rajoniniai, sniego, provėžoti, hidrometeorologijos, plikledis, prispausto
    907: aktorė, dainininkė, muzikinę, prodiuseris, eurovizijos, koncertinį, meninės, žanro, scenoje, režisierius
    484: rungtynių, ekipa, žaidėjų, turnyro, taškų, pergalę, kamuolį, treneris, rezultatyvaus, įvartį
    420: karinių, partijos, sąjungininkų, nimro, šiitų, obama, narystės, referendumas, sirijos, nato
    308: policijos, ugniagesiai, vpk, neblaivus, ambulatoriškai, prom, patrulių, komisariato, girtumas, vairuojamas
    449: bendrovės, barelį, indeksas, brent, įmonės, valiutos, wti, holding, akcininkų, aplinkosaugos
    121: dakaro, ruožas, ralio, ekipažas, vanagas, juknevičius, lenktynininkai, trasos, lenktynių, benediktas
    232: jums, organizmą, vitaminų, astrologė, cukraus, horoskopas, riebalų, jus, mitybos, ožiaragis
    395: šulinį, saviečių, tragedijos, smurto, mažamečius, smurtaudamas, sugyventinę, sumetė, nužudė, ekspertizė
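Below is a minimal Python sketch of a cluster term relevance score of this kind, combining a corpus-level and a cluster-level IDF as described above. The exact combination and all names are illustrative assumptions rather than the paper's definition.

```python
import math
from collections import Counter

def idf(term, token_docs):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for tokens in token_docs if term in tokens)
    return math.log(len(token_docs) / containing) if containing else float("inf")

def top_relevant_terms(cluster_docs, all_docs, top_n=10):
    """Rank a cluster's terms by corpus IDF relative to in-cluster IDF.

    cluster_docs, all_docs: lists of token lists (cluster_docs is a subset
    of all_docs). A term scores high if it is rare in the whole data set
    but common inside the cluster, as described for (9).
    """
    vocabulary = set(t for tokens in cluster_docs for t in tokens)
    scores = {
        term: idf(term, all_docs) / max(idf(term, cluster_docs), 1e-9)
        for term in vocabulary
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```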
Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007. [8] H. Anna, "Similarity measures for text document clustering," in Proceedings of the Sixth New Zealand, 2008. [9] A. Strehl, J. Ghosh, R. Mooney, "Impact of similarity measures on web- page clustering," in In Workshop on Artificial Intelligence for Web Search (AAAI 2000), 2000. [10] R. J. G. B. Campello, D. Moulavi, J. Sander, "Density-based clustering based on hierarchical density estimates," Advances in Knowledge Discovery and Data Mining, pp. 160-172, 2013. Fig. 5. F1 score dependency on the percentage of features used. K–means [11] T. Liu, S. Liu, Z. Chen, W.-Y. Ma, "An evaluation on feature selection algorithm was less sensitive to the change of feature space dimensionality. F1 for text clustering," Icml, pp. 488-495, 2003. score of hierarchical clustering varies significantly more. 32