Application of BIRCH to text clustering

Ilya Karpov (karpovilia@gmail.com), Alexandr Goroslavskiy (sashka_airok@mail.ru)
Federal State Unitary Enterprise "Institute "KVANT"", Moscow

Proceedings of the 14th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" (RCDL-2012), Pereslavl-Zalessky, Russia, October 15-18, 2012.

Abstract

This work presents a clustering technique based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm and LSA methods for clustering large, high-dimensional datasets. We present a document model and a clustering tool for processing texts in Russian and English and compare our results with other clustering techniques. Experimental results for clustering datasets of 10'000, 100'000 and 850'000 documents are provided.

1 Introduction

Many important tasks in IR involve clustering large datasets of unstructured text. Although there is a large set of efficient techniques for clustering abstract datasets, few of them have been applied to clustering textual data. The task is specific for the following reasons: (1) a large number of data points, (2) a large number of clusters and (3) high clustering feature dimensionality. A further problem is the need for high-performance and precise linguistic techniques to process a large set of documents. This work presents a clustering technique based on the BIRCH algorithm and LSA methods for clustering such large, high-dimensional datasets in Russian and English.

Input documents are stored as vectors of normalized terms, as described by the bag-of-words model. The initial dimensionality of nearly 300'000 terms is reduced by a TF-IDF dispersion threshold applied to each term. Term vectors are then analyzed with latent semantic analysis (LSA), as described in [2] and [3]. After that, clustering with the BIRCH algorithm is performed.

2 Related work

A number of approaches have been proposed for clustering large collections of arbitrary data. MST [7], DBSCAN [1], CLOPE [4] and BIRCH [8] are the most suitable techniques for text clustering according to criteria (1)-(3) above. All of them handle high feature dimensionality and have complexity O(n log n) for MST, DBSCAN and CLOPE and O(n log k) for BIRCH. Another method for clustering text datasets, the Canopies algorithm, is described in [4]. The key idea of Canopies is to divide the data into overlapping subsets (canopies); clustering is then performed by measuring exact distances only between points that occur in a common canopy. The algorithm requires O(nk) distance comparisons per iteration of clustering, where n is the number of documents and k is the number of canopies.

3 Preprocessing

3.1 Building the vector space representation

We use the bag-of-words model, in which a document is represented as an unordered collection of terms, disregarding grammar and even word order. In this work a term is the normal form of a word together with its part-of-speech tag: t = <normalized_word, POS>. Different parts of speech are normalized as follows:

ADJECTIVE - subjective case, singular number, masculine gender
NOUN - subjective case, singular number
VERB - infinitive

We tested two strategies to resolve word polysemy: the naive one adds all possible terms to the term vector; the second uses lexical compatibility to choose the most probable variant. Performance and quality are discussed in the Results section.

After normalization, a document in the collection is represented as a vector of term weights, D = (w_1, ..., w_n), counted according to the term frequency - inverse document frequency (TF/IDF) model. The result of this stage is a term matrix containing word weights per document (rows represent documents and columns represent unique terms), as shown below:

W = \begin{pmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
w_{m,1} & w_{m,2} & \cdots & w_{m,n}
\end{pmatrix},

where m is the number of documents and n is the number of unique terms.
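To make the document model concrete, the sketch below builds <normal_form, POS> terms for a toy English sentence. It is only an illustration under stated assumptions, not the authors' tool: the tiny LEXICON dictionary and the to_terms helper stand in for a real morphological analyzer, and the lexical-compatibility disambiguation described above is not modeled.

```python
from collections import Counter

# Toy lemma dictionary standing in for a real morphological analyzer.
# Every entry here is illustrative only.
LEXICON = {
    "large":     ("large", "ADJECTIVE"),
    "document":  ("document", "NOUN"),
    "documents": ("document", "NOUN"),
    "are":       ("be", "VERB"),
    "clustered": ("cluster", "VERB"),
    "into":      ("into", "PREPOSITION"),
    "cluster":   ("cluster", "NOUN"),
    "clusters":  ("cluster", "NOUN"),
}

def to_terms(text, keep_pos=("NOUN",)):
    """Turn raw text into a bag of <normal_form, POS> terms.

    Unknown words are kept as-is and tagged NOUN, which is only a
    placeholder for a real analyzer's guess. Only nouns are kept by
    default, as discussed in Section 3.2.
    """
    terms = []
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")
        lemma, pos = LEXICON.get(token, (token, "NOUN"))
        if pos in keep_pos:
            terms.append((lemma, pos))
    return Counter(terms)

if __name__ == "__main__":
    doc = "Large documents are clustered into clusters."
    print(to_terms(doc))
    # -> Counter({('document', 'NOUN'): 1, ('cluster', 'NOUN'): 1})
```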
3.2 Reducing the clustering space dimension

The first step in dimension reduction is removing "noisy" parts of speech from the document model. It stands to reason that adjectives and verbs bring more noise than useful information when they are disconnected from nouns, so we used only nouns in our experiments.

The next step is selecting the most informative terms in the model. There are several methods for choosing a TF/IDF-based threshold: M. Kiselev uses the sum of term weights over all documents [5] or the total frequency of the term [6]. In this work we use the dispersion of the weight of each term in the collection as the threshold: T = max_i σ(w_i), where i is the term index.

The last step is latent semantic analysis (LSA) of the term matrix. LSA assumes that words that are close in meaning occur close together in text, and uses singular value decomposition (SVD) to reduce the number of terms while preserving the similarity structure among documents.
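As a rough illustration of this pipeline (TF/IDF weighting, dispersion-based term filtering, then SVD), here is a NumPy sketch. The function names, the frac parameter and the interpretation of the dispersion rule as "keep terms whose weight dispersion exceeds a fraction of the maximum" are assumptions of this sketch, not the authors' exact procedure; the fixed 200 LSA dimensions mentioned in Section 6 are likewise only mirrored loosely.

```python
import numpy as np

def tfidf(counts):
    """counts: (n_docs, n_terms) raw term-frequency matrix."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def reduce_terms(weights, frac=0.1):
    """Keep terms whose weight dispersion is at least `frac` of the
    maximum dispersion; `frac` is an assumed tuning knob."""
    disp = weights.var(axis=0)
    keep = disp >= frac * disp.max()
    return weights[:, keep], keep

def lsa(weights, k=200):
    """Project documents onto the top-k left singular directions (LSA)."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]          # document coordinates in LSA space

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.poisson(0.3, size=(50, 400))   # toy corpus
    w = tfidf(counts)
    w, kept = reduce_terms(w)
    docs = lsa(w, k=20)
    print(w.shape, docs.shape)
```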
4 BIRCH algorithm

The BIRCH algorithm is specified in detail in [8]; we only describe its basic ideas and concepts here. The core concepts of BIRCH are the clustering feature, the CF tree and the cluster radius. A clustering feature is a triple summarizing the information we maintain about a cluster.

Def. 1. Given N d-dimensional data points {X_i}, i = 1, 2, ..., N, in a cluster, the Clustering Feature (CF) vector of the cluster is defined as the triple CF = (N, LS, SS), where N is the number of data points in the cluster, LS = \sum_{i=1}^{N} X_i is the linear sum of the data points and SS = \sum_{i=1}^{N} X_i^2 is their square sum.

Def. 2. A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T. Each nonleaf node contains at most B entries of the form [CF_i, child_i], i = 1, 2, ..., B, where child_i is a pointer to the i-th child node and CF_i is the CF of the subcluster represented by that child. A nonleaf node thus represents a cluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries of the form [CF_i], i = 1, 2, ..., L. In addition, each leaf node has two pointers, "prev" and "next", which chain all leaf nodes together for efficient scans. A leaf node also represents a cluster made up of all the subclusters represented by its entries, but every entry in a leaf node must satisfy a threshold requirement with respect to a threshold value T: its diameter (or radius) has to be less than T.

The tree size is a function of T: the larger T is, the smaller the tree. We require a node to fit in a page of size P. Once the dimension d of the data space is given, the sizes of leaf and nonleaf entries are known, and B and L are determined by P; P can therefore be varied for performance tuning.

Def. 3. The cluster radius R is the average Euclidean distance between the data points and the cluster centroid:

R = \left( \frac{1}{N} \sum_{i=1}^{N} (\vec{x}_i - \vec{x}_0)^2 \right)^{1/2},

where x_i are the documents in the cluster, x_0 is the cluster centroid (the arithmetic mean of all documents in the cluster) and N is the number of points in the cluster.

The BIRCH algorithm has three main stages:

1. Given the threshold T, the clustering space is divided into initial clusters in such a way that each cluster radius is smaller than T. BIRCH uses an extremely fast method based on CF trees to form the initial clusters; see [8] for details.
2. The initial cluster centroids are clustered once again with an agglomerative clustering tool, after which new cluster centroids are determined.
3. The new cluster centroids are used as seeds for redistributing the clustering space: each document is reassigned to the nearest seed.

5 Modifications

5.1 Defining the clustering threshold

The main advantage of the BIRCH algorithm is its high performance: given a threshold value T, the cost of clustering is only O(2n), but choosing a bad threshold may cause a significant performance reduction.

The BIRCH authors propose the following method. Initially T is set to 0. If the number of leaves reaches the maximum allowed amount, T is increased and the CF tree is rebuilt. It is assumed that the number of points N_i that can be held in the CF tree at the i-th stage of the algorithm grows linearly with T_i^d, where d is the clustering dimension. The next threshold T_{i+1} is then obtained by linear regression, targeting N_{i+1} = min(2N_i, N), where N is the total number of points in the clustering space. An implementation of this method shows that the threshold either grows too slowly, which causes multiple tree reconstructions, or grows too fast, which causes a loss of accuracy.

We provide a new method for determining the value of T. As in the previous solution, T is initially set to 0. The next threshold T_{i+1} is determined as a function of the cluster size for a random set of clusters. First, a cluster radius R_k = max_i dist(X_0, X_i) is computed as the maximum distance between a point and the cluster centroid for each of Num_r randomly selected clusters. Then a threshold T_{k,i+1} = Round(α · R_k) is computed for each such cluster. The resulting threshold T_{i+1} is the arithmetic mean of the T_{k,i+1}.

5.2 Splitting tree nodes

The original BIRCH algorithm splits a node into two new nodes when the limit of B children is reached. This technique has disadvantages when a node's children are heterogeneous and the node should rather be divided into three, four or more new nodes. We propose agglomerative clustering to split the node: the minimum number of resulting nodes is set to 2 and the maximum to (B - number of children of the parent node + 2).
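The sketch below shows how the structures from Definitions 1-3 and the threshold rule of Section 5.1 fit together. It is a minimal illustration, not the authors' implementation: a full CF tree (nodes, splitting, rebuilds) is omitted, and the values of alpha and num_r are assumed for the example only.

```python
import numpy as np

class CF:
    """Clustering Feature (Def. 1): (N, linear sum, square sum)."""
    def __init__(self, point):
        self.n = 1
        self.ls = np.asarray(point, dtype=float)
        self.ss = float(np.dot(point, point))

    def add(self, point):
        # CF vectors are additive, which makes CF trees cheap to update.
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(np.dot(p, p))

    def radius(self):
        """Average distance to the centroid (Def. 3), from N, LS, SS only."""
        centroid = self.ls / self.n
        return np.sqrt(max(self.ss / self.n - np.dot(centroid, centroid), 0.0))

def next_threshold(clusters, points_by_cluster, num_r=5, alpha=1.5, rng=None):
    """Section 5.1 heuristic (sketch): average of Round(alpha * R_k) over
    num_r randomly chosen clusters, where R_k is the maximum distance from
    a point of cluster k to its centroid."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(clusters), size=min(num_r, len(clusters)), replace=False)
    ts = []
    for k in idx:
        centroid = clusters[k].ls / clusters[k].n
        r_k = max(np.linalg.norm(p - centroid) for p in points_by_cluster[k])
        ts.append(round(alpha * r_k))
    return float(np.mean(ts))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = [rng.normal(size=(10, 3)) + c for c in (0, 5, 10)]   # three toy clusters
    cfs = []
    for block in pts:
        cf = CF(block[0])
        for p in block[1:]:
            cf.add(p)
        cfs.append(cf)
    print([round(cf.radius(), 2) for cf in cfs])
    print(next_threshold(cfs, pts, num_r=2, rng=rng))
```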
6 Results

The method has been tested on the following collections: the Lenta.ru news articles collection (approximately 100'000 text documents covering 2007-2012) and the Russian volume of Wikipedia (850'000 articles as of May 2012).

6.1 Performance results

Experimental clustering results for sets of various sizes from Wikipedia are shown in Table 1.

Table 1: Clustering time (in seconds) for the BIRCH* and MATLAB k-means algorithms

N docs     k-means (rand)   k-means (10%)    BIRCH*
1 000            3.2604          3.4008       1.2034
2 000            8.8453          9.5005       1.3329
5 000           30.6698         46.2855       2.1813
10 000         158.8558        170.1035       5.1964
20 000         396.1333        337.009        8.8000
100 000              -               -        19.66
500 000              -               -        52.89
850 000              -               -       104.59

Clustering was performed on an Intel Core i7-2600 CPU (3.4 GHz) with 16 GB of DDR3 RAM. The first and second columns show the time in seconds for one iteration of the default MATLAB k-means algorithm with two different methods of choosing the initial centers: the first selects centers randomly, the second uses a random 10% subsample of the entire training set. The remaining parameters are as follows: number of clusters - 100, number of iterations - 1, number of dimensions - 200. The third column is based on the modified BIRCH clustering algorithm with the same parameters. The results are also plotted in Figure 1; the n·log(k) curve shows the estimated complexity of the BIRCH algorithm, where n is the number of documents and k is the number of clusters, under the assumption that k = log(n).

[Figure 1: Clustering time vs. number of documents]

6.2 Accuracy results

Accuracy results for a set of 600 articles from the Russian Wikipedia and 10'000 news articles from Lenta.ru are provided in Table 2. Training sets were built as the intersection of hand-made clusterings produced by three assessors. The results were measured with the F-measure as follows. Given a rubric i from the document collection and a cluster j obtained from clustering, let D_{r_i} be the documents of rubric i and D_{c_j} the documents of cluster j. The precision of cluster j with respect to rubric i is

P(i, j) = |D_{r_i} ∩ D_{c_j}| / |D_{c_j}|,

the recall of cluster j with respect to rubric i is

R(i, j) = |D_{r_i} ∩ D_{c_j}| / |D_{r_i}|,

and the F-measure of cluster j with respect to rubric i is

F(i, j) = 2 · P(i, j) · R(i, j) / (P(i, j) + R(i, j)).

The F-measure of the resulting clustering is

F = \sum_{i=1}^{M} \frac{|D_{r_i}|}{N} \max_j F(i, j),

where M is the number of rubrics and N is the total number of documents.
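This metric can be computed directly from two labelings of the same documents. The sketch below does so for toy data; the label encoding (plain per-document lists of rubric and cluster ids) and the function name are assumptions of the sketch, not the authors' evaluation harness.

```python
from collections import defaultdict

def clustering_f_measure(rubrics, clusters):
    """Weighted best-match F-measure described in Section 6.2.

    rubrics[d]  - hand-assigned rubric of document d,
    clusters[d] - cluster id assigned to document d by the algorithm.
    """
    by_rubric, by_cluster = defaultdict(set), defaultdict(set)
    for d, (r, c) in enumerate(zip(rubrics, clusters)):
        by_rubric[r].add(d)
        by_cluster[c].add(d)

    n = len(rubrics)
    total = 0.0
    for r_docs in by_rubric.values():
        best = 0.0
        for c_docs in by_cluster.values():
            overlap = len(r_docs & c_docs)
            if overlap == 0:
                continue
            p = overlap / len(c_docs)          # precision of cluster vs rubric
            rec = overlap / len(r_docs)        # recall of cluster vs rubric
            best = max(best, 2 * p * rec / (p + rec))
        total += len(r_docs) / n * best        # weight by rubric size
    return total

if __name__ == "__main__":
    rubrics  = ["sport", "sport", "politics", "politics", "politics", "science"]
    clusters = [0, 0, 1, 1, 2, 2]
    print(round(clustering_f_measure(rubrics, clusters), 3))   # prints 0.844
```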
All algorithms have a random factor, so we report the average F-measure over a set of 20 measurements. Although BIRCH can predict the number of clusters, we fixed it in order to calculate the F-measure. It should be mentioned that with one iteration the k-means algorithm shows very poor quality, so we used 30 iterations for k-means with random centers and 3 iterations for k-means with the 10% subsample. The total number of iterations was found experimentally as a function of the F-measure dispersion.

Table 2: F-measure results for the BIRCH* and MATLAB k-means algorithms

Measure                k-means (rand),   k-means (10%),     BIRCH*
                       30 iterations     3 iterations
avg(F), 600 docs           0.948             0.933           0.923
σ(F), 600 docs         8.2 · 10^-4       1.2 · 10^-4     1.6 · 10^-4
avg(F), 10 000 docs        0.567             0.572           0.563
σ(F), 10 000 docs      5.4 · 10^-4       2.7 · 10^-4     1.7 · 10^-4

The similarity of the results can be explained by the similar methods of assigning points to clusters at the last stage of all three algorithms. The significant loss of quality on the news collection can be caused by the mixture of many topics within one document and by the different taxonomies underlying the human classification and the statistical clustering. Better results could be achieved by combining clustering with topic-based classification.

6.3 Conclusions

This work has demonstrated the use of the modified BIRCH clustering method for clustering large datasets. In comparison with naive clustering methods such as k-means or EM clustering, computation time has been reduced by more than two orders of magnitude while keeping the same accuracy.

References

[1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In: E. Simoudis, J. Han, U. M. Fayyad (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996, pp. 226-231.

[2] Thomas Hofmann. Probabilistic Latent Semantic Analysis. EECS Department, Computer Science Division, University of California, Berkeley & International Computer Science Institute, 1999.

[3] Thomas Hofmann. Probabilistic Latent Semantic Indexing. Conference on Research and Development in Information Retrieval, 1999.

[4] Andrew McCallum, Kamal Nigam, Lyle H. Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In: KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 169-178.

[5] M. Kiselev. Text Clustering Procedure Based on Pair-wise Proximity of Key Terms and its Comparison with Metric Clustering Methods. In: Internet-Mathematics 2007, Moscow, Yandex Publishing, 2007, pp. 74-83.

[6] M. Kiselev, M. Shmulevich, A. Ehrlich. Automatic clustering method and its applications. Program Products and Systems, No. 2, 2008, pp. 47-50.

[7] Yan Zhou, Oleksandr Grygorash, Thomas F. Hain. Clustering with Minimum Spanning Trees. International Journal on Artificial Intelligence Tools, Vol. 20, No. 1, February 2011, pp. 139-177.

[8] Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.