Application of BIRCH to text clustering

Ilya Karpov (karpovilia@gmail.com), Alexandr Goroslavskiy (sashka_airok@mail.ru)
Federal State Unitary Enterprise "Institute "KVANT"", Moscow

Proceedings of the 14th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" (RCDL-2012), Pereslavl-Zalessky, Russia, October 15-18, 2012.

Abstract

This work presents a clustering technique based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm and LSA methods for clustering large, high-dimensional datasets. We present a document model and a clustering tool for processing texts in Russian and English and compare our results with other clustering techniques. Experimental results for clustering datasets of 10'000, 100'000 and 850'000 documents are provided.

1 Introduction

Many important tasks in IR involve clustering large datasets of unstructured text. Although there is a large set of efficient techniques for clustering abstract datasets, few of them have been applied to clustering textual data. The task is specific for the following reasons: (1) a large number of data points, (2) a large number of clusters and (3) high clustering feature dimensionality. A further problem is the need for high-performance and precise linguistic techniques to process a large set of documents. This work presents a clustering technique based on the BIRCH algorithm and LSA methods for clustering such large, high-dimensional datasets in Russian and English.

Input documents are stored as vectors of normalized terms, as described by the bag-of-words model. The initial dimensionality of nearly 300'000 terms is reduced by a TF-IDF dispersion threshold applied to each term. Term vectors are then analyzed with latent semantic analysis (LSA), as described in [2] and [3]. After that, clustering with the BIRCH algorithm is performed.

2 Related work

A number of approaches have been proposed for clustering large collections of arbitrary data. MST [7], DBSCAN [1], CLOPE [4] and BIRCH [8] are the most suitable techniques for text clustering according to criteria (1)-(3) above. All of them handle high feature dimensionality and have complexity O(n log n) for MST, DBSCAN and CLOPE and O(n log k) for BIRCH. Another method for clustering text datasets, the Canopies algorithm, is described in [4]. The key idea of Canopies is to divide the data into overlapping subsets (canopies); clustering is then performed by measuring exact distances only between points that occur in a common canopy. The algorithm requires O(nk) distance comparisons per iteration of clustering, where n is the number of documents and k is the number of canopies.

3 Preprocessing

3.1 Building the vector space representation

We use the bag-of-words model, in which a document is represented as an unordered collection of terms, disregarding grammar and even word order. In this work a term is the normal form of a word together with its part-of-speech tag: t = <normalized_word, POS>. Different parts of speech are normalized as follows:

ADJECTIVE - subjective case, singular number, masculine gender
NOUN - subjective case, singular number
VERB - infinitive

We tested two strategies to resolve word polysemy: the naive one adds all possible terms to the term vector; the second uses lexical compatibility to choose the most probable variant. Performance and quality are discussed in the Results section.

After normalization, a document in the collection is represented as a vector of term weights, D = (w_1, ..., w_n), counted according to the term frequency - inverse document frequency (TF/IDF) model. The result of this stage is a term matrix containing word weights per document (rows represent documents and columns represent unique terms), as shown below:

W = \begin{pmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
w_{m,1} & w_{m,2} & \cdots & w_{m,n}
\end{pmatrix},

where m is the number of documents and n is the number of unique terms.
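To make the document model concrete, the sketch below builds <normal_form, POS> terms for a toy English sentence. It is only an illustration under stated assumptions, not the authors' tool: the tiny LEXICON dictionary and the to_terms helper stand in for a real morphological analyzer, and the lexical-compatibility disambiguation described above is not modeled.

```python
from collections import Counter

# Toy lemma dictionary standing in for a real morphological analyzer.
# Every entry here is illustrative only.
LEXICON = {
    "large":     ("large", "ADJECTIVE"),
    "document":  ("document", "NOUN"),
    "documents": ("document", "NOUN"),
    "are":       ("be", "VERB"),
    "clustered": ("cluster", "VERB"),
    "into":      ("into", "PREPOSITION"),
    "cluster":   ("cluster", "NOUN"),
    "clusters":  ("cluster", "NOUN"),
}

def to_terms(text, keep_pos=("NOUN",)):
    """Turn raw text into a bag of <normal_form, POS> terms.

    Unknown words are kept as-is and tagged NOUN, which is only a
    placeholder for a real analyzer's guess. Only nouns are kept by
    default, as discussed in Section 3.2.
    """
    terms = []
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")
        lemma, pos = LEXICON.get(token, (token, "NOUN"))
        if pos in keep_pos:
            terms.append((lemma, pos))
    return Counter(terms)

if __name__ == "__main__":
    doc = "Large documents are clustered into clusters."
    print(to_terms(doc))
    # -> Counter({('document', 'NOUN'): 1, ('cluster', 'NOUN'): 1})
```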
3.2 Reducing the clustering space dimension

The first step in dimension reduction is removing "noisy" parts of speech from the document model. It stands to reason that adjectives and verbs bring more noise than useful information when they are disconnected from nouns, so we used only nouns in our experiments.

The next step is selecting the most informative terms in the model. There are several methods for choosing a TF/IDF-based threshold: M. Kiselev uses the sum of term weights over all documents [5] or the total frequency of the term [6]. In this work we use the dispersion of the weight of each term in the collection as the threshold: T = max_i σ(w_i), where i is the term index.

The last step is latent semantic analysis (LSA) of the term matrix. LSA assumes that words that are close in meaning occur close together in text, and uses singular value decomposition (SVD) to reduce the number of terms while preserving the similarity structure among documents.
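As a rough illustration of this pipeline (TF/IDF weighting, dispersion-based term filtering, then SVD), here is a NumPy sketch. The function names, the frac parameter and the interpretation of the dispersion rule as "keep terms whose weight dispersion exceeds a fraction of the maximum" are assumptions of this sketch, not the authors' exact procedure; the fixed 200 LSA dimensions mentioned in Section 6 are likewise only mirrored loosely.

```python
import numpy as np

def tfidf(counts):
    """counts: (n_docs, n_terms) raw term-frequency matrix."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def reduce_terms(weights, frac=0.1):
    """Keep terms whose weight dispersion is at least `frac` of the
    maximum dispersion; `frac` is an assumed tuning knob."""
    disp = weights.var(axis=0)
    keep = disp >= frac * disp.max()
    return weights[:, keep], keep

def lsa(weights, k=200):
    """Project documents onto the top-k left singular directions (LSA)."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]          # document coordinates in LSA space

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.poisson(0.3, size=(50, 400))   # toy corpus
    w = tfidf(counts)
    w, kept = reduce_terms(w)
    docs = lsa(w, k=20)
    print(w.shape, docs.shape)
```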
4 BIRCH algorithm

The BIRCH algorithm is specified in detail in [8]; we only describe its basic ideas and concepts here. The core concepts of BIRCH are the clustering feature, the CF tree and the cluster radius. A clustering feature is a triple summarizing the information we maintain about a cluster.

Def. 1. Given N d-dimensional data points {X_i}, i = 1, 2, ..., N, in a cluster, the Clustering Feature (CF) vector of the cluster is defined as the triple CF = (N, LS, SS), where N is the number of data points in the cluster, LS = \sum_{i=1}^{N} X_i is the linear sum of the data points and SS = \sum_{i=1}^{N} X_i^2 is their square sum.

Def. 2. A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T. Each nonleaf node contains at most B entries of the form [CF_i, child_i], i = 1, 2, ..., B, where child_i is a pointer to the i-th child node and CF_i is the CF of the subcluster represented by that child. A nonleaf node thus represents a cluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries of the form [CF_i], i = 1, 2, ..., L. In addition, each leaf node has two pointers, "prev" and "next", which chain all leaf nodes together for efficient scans. A leaf node also represents a cluster made up of all the subclusters represented by its entries, but every entry in a leaf node must satisfy a threshold requirement with respect to a threshold value T: its diameter (or radius) has to be less than T.

The tree size is a function of T: the larger T is, the smaller the tree. We require a node to fit in a page of size P. Once the dimension d of the data space is given, the sizes of leaf and nonleaf entries are known, and B and L are determined by P; P can therefore be varied for performance tuning.

Def. 3. The cluster radius R is the average Euclidean distance between the data points and the cluster centroid:

R = \left( \frac{1}{N} \sum_{i=1}^{N} (\vec{x}_i - \vec{x}_0)^2 \right)^{1/2},

where x_i are the documents in the cluster, x_0 is the cluster centroid (the arithmetic mean of all documents in the cluster) and N is the number of points in the cluster.

The BIRCH algorithm has three main stages:

1. Given the threshold T, the clustering space is divided into initial clusters in such a way that each cluster radius is smaller than T. BIRCH uses an extremely fast method based on CF trees to form the initial clusters; see [8] for details.
2. The initial cluster centroids are clustered once again with an agglomerative clustering tool, after which new cluster centroids are determined.
3. The new cluster centroids are used as seeds for redistributing the clustering space: each document is reassigned to the nearest seed.

5 Modifications

5.1 Defining the clustering threshold

The main advantage of the BIRCH algorithm is its high performance: given a threshold value T, the cost of clustering is only O(2n), but choosing a bad threshold may cause a significant performance reduction.

The BIRCH authors propose the following method. Initially T is set to 0. If the number of leaves reaches the maximum allowed amount, T is increased and the CF tree is rebuilt. It is assumed that the number of points N_i that can be held in the CF tree at the i-th stage of the algorithm grows linearly with T_i^d, where d is the clustering dimension. The next threshold T_{i+1} is then obtained by linear regression, targeting N_{i+1} = min(2N_i, N), where N is the total number of points in the clustering space. An implementation of this method shows that the threshold either grows too slowly, which causes multiple tree reconstructions, or grows too fast, which causes a loss of accuracy.

We provide a new method for determining the value of T. As in the previous solution, T is initially set to 0. The next threshold T_{i+1} is determined as a function of the cluster size for a random set of clusters. First, a cluster radius R_k = max_i dist(X_0, X_i) is computed as the maximum distance between a point and the cluster centroid for each of Num_r randomly selected clusters. Then a threshold T_{k,i+1} = Round(α · R_k) is computed for each such cluster. The resulting threshold T_{i+1} is the arithmetic mean of the T_{k,i+1}.

5.2 Splitting tree nodes

The original BIRCH algorithm splits a node into two new nodes when the limit of B children is reached. This technique has disadvantages when a node's children are heterogeneous and the node should rather be divided into three, four or more new nodes. We propose agglomerative clustering to split the node: the minimum number of resulting nodes is set to 2 and the maximum to (B - number of children of the parent node + 2).
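The sketch below shows how the structures from Definitions 1-3 and the threshold rule of Section 5.1 fit together. It is a minimal illustration, not the authors' implementation: a full CF tree (nodes, splitting, rebuilds) is omitted, and the values of alpha and num_r are assumed for the example only.

```python
import numpy as np

class CF:
    """Clustering Feature (Def. 1): (N, linear sum, square sum)."""
    def __init__(self, point):
        self.n = 1
        self.ls = np.asarray(point, dtype=float)
        self.ss = float(np.dot(point, point))

    def add(self, point):
        # CF vectors are additive, which makes CF trees cheap to update.
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(np.dot(p, p))

    def radius(self):
        """Average distance to the centroid (Def. 3), from N, LS, SS only."""
        centroid = self.ls / self.n
        return np.sqrt(max(self.ss / self.n - np.dot(centroid, centroid), 0.0))

def next_threshold(clusters, points_by_cluster, num_r=5, alpha=1.5, rng=None):
    """Section 5.1 heuristic (sketch): average of Round(alpha * R_k) over
    num_r randomly chosen clusters, where R_k is the maximum distance from
    a point of cluster k to its centroid."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(clusters), size=min(num_r, len(clusters)), replace=False)
    ts = []
    for k in idx:
        centroid = clusters[k].ls / clusters[k].n
        r_k = max(np.linalg.norm(p - centroid) for p in points_by_cluster[k])
        ts.append(round(alpha * r_k))
    return float(np.mean(ts))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = [rng.normal(size=(10, 3)) + c for c in (0, 5, 10)]   # three toy clusters
    cfs = []
    for block in pts:
        cf = CF(block[0])
        for p in block[1:]:
            cf.add(p)
        cfs.append(cf)
    print([round(cf.radius(), 2) for cf in cfs])
    print(next_threshold(cfs, pts, num_r=2, rng=rng))
```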
6 Results

The method has been tested on the following collections: the Lenta.ru news articles collection (approximately 100'000 text documents covering 2007-2012) and the Russian volume of Wikipedia (850'000 articles as of May 2012).

6.1 Performance results

Experimental clustering results for sets of various sizes from Wikipedia are shown in Table 1.

Table 1: Clustering time (in seconds) for the BIRCH* and MATLAB k-means algorithms

N docs     k-means (rand)   k-means (10%)    BIRCH*
1 000            3.2604          3.4008       1.2034
2 000            8.8453          9.5005       1.3329
5 000           30.6698         46.2855       2.1813
10 000         158.8558        170.1035       5.1964
20 000         396.1333        337.009        8.8000
100 000              -               -        19.66
500 000              -               -        52.89
850 000              -               -       104.59

Clustering was performed on an Intel Core i7-2600 CPU (3.4 GHz) with 16 GB of DDR3 RAM. The first and second columns show the time in seconds for one iteration of the default MATLAB k-means algorithm with two different methods of choosing the initial centers: the first selects centers randomly, the second uses a random 10% subsample of the entire training set. The remaining parameters are as follows: number of clusters - 100, number of iterations - 1, number of dimensions - 200. The third column is based on the modified BIRCH clustering algorithm with the same parameters. The results are also plotted in Figure 1; the n·log(k) curve shows the estimated complexity of the BIRCH algorithm, where n is the number of documents and k is the number of clusters, under the assumption that k = log(n).

[Figure 1: Clustering time vs. number of documents]

6.2 Accuracy results

Accuracy results for a set of 600 articles from the Russian Wikipedia and 10'000 news articles from Lenta.ru are provided in Table 2. Training sets were built as the intersection of hand-made clusterings produced by three assessors. The results were measured with the F-measure as follows. Given a rubric i from the document collection and a cluster j obtained from clustering, let D_{r_i} be the documents of rubric i and D_{c_j} the documents of cluster j. The precision of cluster j with respect to rubric i is

P(i, j) = |D_{r_i} ∩ D_{c_j}| / |D_{c_j}|,

the recall of cluster j with respect to rubric i is

R(i, j) = |D_{r_i} ∩ D_{c_j}| / |D_{r_i}|,

and the F-measure of cluster j with respect to rubric i is

F(i, j) = 2 · P(i, j) · R(i, j) / (P(i, j) + R(i, j)).

The F-measure of the resulting clustering is

F = \sum_{i=1}^{M} \frac{|D_{r_i}|}{N} \max_j F(i, j),

where M is the number of rubrics and N is the total number of documents.
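This metric can be computed directly from two labelings of the same documents. The sketch below does so for toy data; the label encoding (plain per-document lists of rubric and cluster ids) and the function name are assumptions of the sketch, not the authors' evaluation harness.

```python
from collections import defaultdict

def clustering_f_measure(rubrics, clusters):
    """Weighted best-match F-measure described in Section 6.2.

    rubrics[d]  - hand-assigned rubric of document d,
    clusters[d] - cluster id assigned to document d by the algorithm.
    """
    by_rubric, by_cluster = defaultdict(set), defaultdict(set)
    for d, (r, c) in enumerate(zip(rubrics, clusters)):
        by_rubric[r].add(d)
        by_cluster[c].add(d)

    n = len(rubrics)
    total = 0.0
    for r_docs in by_rubric.values():
        best = 0.0
        for c_docs in by_cluster.values():
            overlap = len(r_docs & c_docs)
            if overlap == 0:
                continue
            p = overlap / len(c_docs)          # precision of cluster vs rubric
            rec = overlap / len(r_docs)        # recall of cluster vs rubric
            best = max(best, 2 * p * rec / (p + rec))
        total += len(r_docs) / n * best        # weight by rubric size
    return total

if __name__ == "__main__":
    rubrics  = ["sport", "sport", "politics", "politics", "politics", "science"]
    clusters = [0, 0, 1, 1, 2, 2]
    print(round(clustering_f_measure(rubrics, clusters), 3))   # prints 0.844
```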
All algorithms have a random factor, so we report the average F-measure over a set of 20 measurements. Although BIRCH can predict the number of clusters, we fixed it in order to calculate the F-measure. It should be mentioned that with one iteration the k-means algorithm shows very poor quality, so we used 30 iterations for k-means with random centers and 3 iterations for k-means with the 10% subsample. The total number of iterations was found experimentally as a function of the F-measure dispersion.

Table 2: F-measure results for the BIRCH* and MATLAB k-means algorithms

Measure                k-means (rand),   k-means (10%),     BIRCH*
                       30 iterations     3 iterations
avg(F), 600 docs           0.948             0.933           0.923
σ(F), 600 docs         8.2 · 10^-4       1.2 · 10^-4     1.6 · 10^-4
avg(F), 10 000 docs        0.567             0.572           0.563
σ(F), 10 000 docs      5.4 · 10^-4       2.7 · 10^-4     1.7 · 10^-4

The similarity of the results can be explained by the similar methods of assigning points to clusters at the last stage of all three algorithms. The significant loss of quality on the news collection can be caused by the mixture of many topics within one document and by the different taxonomies underlying the human classification and the statistical clustering. Better results could be achieved by combining clustering with topic-based classification.

6.3 Conclusions

This work has demonstrated the use of the modified BIRCH clustering method for clustering large datasets. In comparison with naive clustering methods such as k-means or EM clustering, computation time has been reduced by more than two orders of magnitude while keeping the same accuracy.

References

[1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In: E. Simoudis, J. Han, U. M. Fayyad (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996, pp. 226-231.

[2] Thomas Hofmann. Probabilistic Latent Semantic Analysis. EECS Department, Computer Science Division, University of California, Berkeley & International Computer Science Institute, 1999.

[3] Thomas Hofmann. Probabilistic Latent Semantic Indexing. Conference on Research and Development in Information Retrieval, 1999.

[4] Andrew McCallum, Kamal Nigam, Lyle H. Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In: KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 169-178.

[5] M. Kiselev. Text Clustering Procedure Based on Pair-wise Proximity of Key Terms and its Comparison with Metric Clustering Methods. In: Internet-Mathematics 2007, Moscow, Yandex Publishing, 2007, pp. 74-83.

[6] M. Kiselev, M. Shmulevich, A. Ehrlich. Automatic clustering method and its applications. Program Products and Systems, No. 2, 2008, pp. 47-50.

[7] Yan Zhou, Oleksandr Grygorash, Thomas F. Hain. Clustering with Minimum Spanning Trees. International Journal on Artificial Intelligence Tools, Vol. 20, No. 1, February 2011, pp. 139-177.

[8] Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.