A Lattice-Based Consensus Clustering Algorithm

Artem Bocharov, Dmitry Gnatyshak, Dmitry I. Ignatov, Boris G. Mirkin, and Andrey Shestakov

National Research University Higher School of Economics
dignatov@hse.ru
http://www.hse.ru

Abstract. We propose a new algorithm for consensus clustering, FCA-Consensus, based on Formal Concept Analysis. As input, the algorithm takes T partitions of a certain set of objects obtained by the k-means algorithm after T runs from different initialisations. The resulting consensus partition is extracted from an antichain of the concept lattice built on a formal context objects × classes, where the classes are the set of all cluster labels from each initial k-means partition. We compare the results of the proposed algorithm, in terms of the ARI measure, with those of state-of-the-art algorithms on synthetic datasets. Under certain conditions, the best ARI values are demonstrated by FCA-Consensus.

Keywords: consensus clustering, k-means, Formal Concept Analysis, ensemble clustering, lattice-based clustering

1 Introduction and related work

Although the subject of consensus partition has been considered in the literature as early as the 1960s ([1], [2]), its popularity stems from concerns of the 21st century, when clustering has become a ubiquitous activity. An innocent user wants to segment their data into homogeneous segments, a.k.a. clusters; they apply clustering tools and see many different solutions whose comparative merits are not clear. Therefore, they need a tool to reconcile all the clusterings produced by different tools, or even by the same tool at different parameter values.

As input, the consensus clustering approach usually takes T partitions of a certain set of objects obtained, for example, by the k-means algorithm after T different executions with possibly different k. The resulting consensus partition is built from the matrix objects × classes, where the classes are the set of all cluster labels from each initial k-means partition. Thus, the main goal of consensus clustering is to find (recover) an optimal partition, i.e. to guess the proper number of resulting clusters and put the objects into each part correctly (see, for example, [3], [4]). To evaluate a consensus clustering method, researchers usually hypothesise that if a particular consensus clustering approach is able to find a proper k and attain high accuracy on pre-labeled datasets, then it can be used in the unsupervised setting.

In [5], consensus clustering algorithms are classified into three main groups: probabilistic approaches [6,7]; direct approaches [3,8,9,10]; and pairwise similarity-based approaches [11,12]. In the last category of methods, the (i, j)-th entry a_ij of the consensus matrix A = (a_ij) shows the number of partitions in which objects g_i and g_j belong to the same cluster (a small illustrative sketch is given at the end of this section).

In the previous papers [13,14], the least-squares consensus clustering approach of [15] was revisited, equipped with a more recent clustering procedure, and compared, on synthetic data of Gaussian clusters, with more recent methods. Here, our main goal is to propose a novel lattice-based consensus clustering algorithm by means of FCA and show its competitive applicability. To the best of our knowledge, a variant of an FCA-based consensus approach was first proposed for clustering genes [16]. Those who are interested in theoretical properties of consensus procedures and their relations to FCA are referred to [17].
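To make the pairwise similarity-based family concrete, the following minimal sketch computes such a consensus (co-association) matrix A = (a_ij) from a T × N matrix of ensemble labels. The sketch is ours, not taken from [11,12]; the names co_association and labels are illustrative.

import numpy as np

def co_association(labels):
    """a_ij = number of partitions in which objects i and j co-cluster.

    labels: (T, N) integer array; row t is the t-th ensemble partition,
    given as one cluster label per object.
    """
    T, N = labels.shape
    A = np.zeros((N, N), dtype=int)
    for row in labels:
        # N x N boolean matrix: True where two objects agree in this run
        A += row[:, None] == row[None, :]
    return A

# Three partitions of four objects; the diagonal equals T = 3.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0]])
print(co_association(labels))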
The paper is organised in five sections. In Section 2, we recall some definitions from FCA, introduce partitions and the lattice of partitions, and show how any partition lattice can be mapped to a concept lattice. In Section 3, we introduce our modification of the Close-by-One algorithm for consensus clustering. In Section 4, we describe our experimental results over synthetic data, both for individual properties of FCA-Consensus and for comparison with state-of-the-art clustering methods. Section 5 concludes the paper and outlines prospective ways of research and development.

2 Basic definitions

First, we recall several notions related to lattices and partitions.

Definition 1. A partition σ of a nonempty set A is a set of its subsets, σ = {B | B ⊆ A}, such that ⋃_{B∈σ} B = A and B ∩ C = ∅ for all distinct B, C ∈ σ. Every element of σ is called a block.

Definition 2. A partition lattice of a set A is an ordered set (Part(A), ∨, ∧), where Part(A) is the set of all possible partitions of A and, for all partitions σ and ρ, supremum and infimum are defined as follows:

σ ∨ ρ = { ⋃_{C ∈ N_ρ(B)} (N_ρ(B) ∪ N_σ(C)) | B ∈ σ },

σ ∧ ρ = { B ∩ C | B ∈ σ and C ∈ ρ },

where N_ρ(B) = {C ∈ ρ | B ∩ C ≠ ∅} for B ∈ σ, and N_σ(C) = {B ∈ σ | B ∩ C ≠ ∅} for C ∈ ρ.

Definition 3. Let A be a set and let ρ, σ ∈ Part(A). The partition ρ is finer than the partition σ, written ρ ≤ σ, if every block B of σ is a union of blocks of ρ. Equivalently, one can use the traditional connection between supremum, infimum, and the partial order in the lattice: ρ ≤ σ iff ρ ∨ σ = σ (equivalently, ρ ∧ σ = ρ).

Now, we recall some basic notions of Formal Concept Analysis (FCA) [18]. Let G and M be sets, called the set of objects and the set of attributes, respectively, and let I be a relation I ⊆ G × M: for g ∈ G, m ∈ M, gIm holds iff the object g has the attribute m. The triple K = (G, M, I) is called a (formal) context. For arbitrary subsets A ⊆ G and B ⊆ M, the Galois connection is given by the following derivation operators:

A′ = {m ∈ M | gIm for all g ∈ A},   (1)
B′ = {g ∈ G | gIm for all m ∈ B}.

The pair (A, B), where A ⊆ G, B ⊆ M, A′ = B, and B′ = A, is called a (formal) concept (of the context K) with extent A and intent B (in this case we also have A″ = A and B″ = B). The concepts, ordered by (A₁, B₁) ≥ (A₂, B₂) ⟺ A₁ ⊇ A₂, form a complete lattice, called the concept lattice B(G, M, I).

Theorem 1 (Ganter & Wille [18]). For a given partially ordered set P = (P, ≤), the concept lattice of the formal context K = (J(P), M(P), ≤) is isomorphic to the Dedekind–MacNeille completion of P, where J(P) and M(P) are the sets of join-irreducible and meet-irreducible elements of P, respectively.

Theorem 2 (this paper). For a given partition lattice L = (Part(A), ∨, ∧), there exists a formal context K = (P₂, A₂, I), where P₂ = {{a, b} | a, b ∈ A and a ≠ b}, A₂ = {σ | σ ∈ Part(A) and |σ| = 2}, and {a, b}Iσ when a and b belong to the same block of σ. The concept lattice B(P₂, A₂, I) is isomorphic to the initial lattice (Part(A), ∨, ∧).

Proof. According to Theorem 1, the concept lattice of the context K_DM = (J(L), M(L), ≤) is isomorphic to the Dedekind–MacNeille completion of L. The Dedekind–MacNeille completion of a lattice is isomorphic to the lattice itself, by definition (as a minimal completion which forms a lattice). So, we have to show that the contexts K and K_DM (or their concept lattices) are isomorphic.
E.g., from [19] (Lemma 1, Chapter 4, Partition Lattices), we have that the atoms of a partition lattice are those of its partitions which have only one block of two elements, the remaining blocks being singletons, and its coatoms are partitions into two blocks. It is evident that all the atoms are join-irreducible, all the coatoms are meet-irreducible, and there are no other irreducible elements of the partition lattice L.

Let σ and ρ be two partitions from L, σ ∈ J(L) and ρ ∈ M(L), with σ ≤ ρ. This means that all blocks of σ are subsets of blocks of ρ, and the non-trivial block {i, j} ∈ σ is a subset of one of the blocks of ρ. Note that A₂ coincides with the coatom set. It directly implies that {i, j}Iρ iff the atom σ with block {i, j} is finer than the coatom ρ. In addition, we can show the correspondence between the elements of L = (Part(A), ∨, ∧) and the formal concepts of B(P₂, A₂, I). Every (A, B) ∈ B(P₂, A₂, I) corresponds to σ = ⋀B, and every pair {i, j} from A is in one of the blocks of σ, where σ ∈ Part(A). Every (A, B) ∈ B_DM(J(L), M(L), ≤) corresponds to σ = ⋀B = ⋁A.

Example 1. In Fig. 1, one can see the diagram of a concept lattice isomorphic to the partition lattice of a 4-element set.

Fig. 1. The line diagram of a concept lattice isomorphic to the partition lattice of a 4-element set (reduced labeling).

3 FCA-Consensus: adding objects one by one

To work in FCA terms, we need to introduce a (formal) partition context that corresponds to the matrix objects × classes from the introduction. Let us consider such a context K_R = (G, ⊔_t M_t, I ⊆ G × ⊔_t M_t), where G is a set of objects, t = 1, . . . , T, and each M_t consists of the labels of all clusters in the t-th k-means partition from the ensemble (an illustrative construction of K_R is sketched before Algorithm 1 below). For example, gIm_t1 means that object g has been assigned to the first cluster by the t-th clustering algorithm of the ensemble.

Our FCA-Consensus algorithm looks for S, an antichain of concepts of K_R, such that for every (A, B) and (C, D) in S the condition A ∩ C = ∅ is fulfilled. Here, the concept extent A corresponds to one of the resulting clusters, and its intent contains all labels of the ensemble members that voted for the objects from A being in one cluster. The input cluster sizes may vary, but it is a reasonable consensus hypothesis that at least ⌈T/2⌉ members should vote for a set of objects to be in a cluster. One can prove the theorem below, where by the true partition we mean the original partition into clusters to be recovered.

Theorem 3. In the concept lattice of a partition context K_R = (G, ⊔_t M_t, I ⊆ G × ⊔_t M_t), there is an antichain of concepts S such that all the extents of its concepts A_i coincide with the blocks S_i of σ, the true partition, if and only if S_i″ = S_i, where i = 1, . . . , |σ|.

Proof. The proof follows from the facts that the blocks of a partition are non-intersecting and that each block should be closed to form a concept extent.

In fact, this happens if all ensemble algorithms have voted for all objects from S_i to belong to the same concept (cluster). However, this is a rather strong requirement, and we should experimentally study good candidates for such an antichain. The algorithm below works incrementally by adding objects one by one and checking a new "canonicity" condition, as in the algorithms ADDI [11] and Close by One (CbO) [20]. Here the stopping condition is of course different: it is |Y| ≥ ⌈T/2⌉, where Y is the intent of the current concept. Moreover, the objects covered at a particular step should not be added with any concept to the antichain S later.
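Before turning to the pseudocode, here is a minimal sketch, ours rather than the paper's, of how the partition context K_R and the derivation operators (·)′ of Eq. (1) could be materialised from the ensemble's label matrix; the function and variable names are illustrative. An attribute (t, c) holds for an object g iff the t-th clusterer put g into cluster c; then gp = g′ and gpp = g″ in Algorithm 1 below correspond to extent_prime({g}, M, I) and intent_prime(extent_prime({g}, M, I), G, I).

def partition_context(labels):
    # K_R = (G, ⊔_t M_t, I) from a list of T label lists over N objects
    G = set(range(len(labels[0])))
    I = {(g, (t, row[g])) for t, row in enumerate(labels) for g in G}
    M = {m for (_, m) in I}
    return G, M, I

def extent_prime(A, M, I):
    # A′ = cluster labels shared by all objects of A (Eq. 1)
    return {m for m in M if all((g, m) in I for g in A)}

def intent_prime(B, G, I):
    # B′ = objects carrying every label of B (Eq. 1)
    return {g for g in G if all((g, m) in I for m in B)}

# Two clusterers over four objects.
G, M, I = partition_context([[0, 0, 1, 1],
                             [0, 0, 0, 1]])
A = {0, 1}
B = extent_prime(A, M, I)      # {(0, 0), (1, 0)}: both runs agree on A
print(intent_prime(B, G, I))   # {0, 1}, i.e. A″ = A, so A is an extent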
Algorithm 1: Main((G, M, I), T)
Input: a partition context (G, M, I) and the number of ensemble clusterers T
Output: a queue S of concepts
1: S = ∅
2: C = ∅
3: for all g ∈ G do
4:   if g ∉ C then
5:     gpp = g″
6:     gp = g′
7:     S.enqueue(gpp, gp)
8:     C = C ∪ gpp
9:   end if
10: end for
11: return Process((G, M, I), T, S)

Thus, the resulting antichain S may not cover all objects, but we can add each non-covered object g to a concept (A, B) ∈ S with the maximal size of the intersection |B ∩ g′|. Traditionally, the algorithm consists of two parts: a wrapper procedure, Main, and a recursive procedure, Process.

Algorithm 2: Process((G, M, I), T, S)
1: Q = S
2: P = ∅
3: Cover = ∅
4: while Q ≠ ∅ do
5:   Q.dequeue(A, B)
6:   if A ∩ Cover = ∅ then
7:     Cover = Cover ∪ A
8:     P.enqueue(A, B)
9:     for all g ∈ min(G \ Cover) do
10:      X = A ∪ {g}
11:      Y = X′
12:      if |Y| ≥ ⌈T/2⌉ then
13:        Z = Y′
14:        if {h | h ∈ Z \ X, h < g} = ∅ then
15:          P.dequeue(A, B)
16:          P.enqueue(Z, Y)
17:          Cover = Cover ∪ Z
18:        end if
19:      end if
20:    end for
21:  end if
22: end while
23: if S = P then
24:   return P
25: end if
26: S = P
27: return Process((G, M, I), T, P)

4 Experimental results

All evaluations are done on synthetic datasets that have been generated using Matlab. Each of the datasets consists of 300 five-dimensional objects comprising three randomly generated spherical Gaussian clusters. The variance of each cluster lies in 0.1–0.3, and its center components are independently generated from the Gaussian distribution N(0, 0.7). Let us denote the partition thus generated by λ, with k_λ clusters. The profile of partitions R = {ρ₁, ρ₂, . . . , ρ_T} for the consensus algorithms is constructed as the result of T runs of the k-means clustering algorithm starting from k random centers.

We carry out the experiments in four settings:
1. Investigation of the influence of the number of true clusters k_λ ∈ {2, 3, 5, 9} under various numbers of minimal votes (Fig. 2):
   a) two clusters case, k_λ = 2, k ∈ {2, 3, 4, 5};
   b) three clusters case, k_λ = 3, k ∈ {2, 3};
   c) five clusters case, k_λ = 5, k ∈ {2, 5};
   d) nine clusters case, k_λ = 9, k ∈ {2, 3, 4, 5, 6, 7, 8, 9}.
2. Investigation of the numbers of clusters of the ensemble clusterers with a fixed number of true clusters k_λ = 5 (Fig. 3):
   a) k = 2; b) k ∈ {2, 3, 4, 5}; c) k ∈ {5}; d) k ∈ {5, 6, 7, 8, 9}; e) k = 9.
3. Investigation of the number of objects N ∈ {100, 300, 500, 1000} (Fig. 4).
4. Comparison with other state-of-the-art algorithms (Fig. 5–8):
   a) two clusters case, k_λ = 2, k ∈ {2, 3, 4, 5};
   b) three clusters case, k_λ = 3, k ∈ {2, 3};
   c) five clusters case, k_λ = 5, k ∈ {2, 5};
   d) nine clusters case, k_λ = 9, k ∈ {2, 3, 4, 5, 6, 7, 8, 9}.

Each experiment encompasses 10 runs for each of the ten generated datasets. Such meta-parameters as the dimension number p = 3, the number of partitions (clusterers) in the ensemble T = 100, and the parameters of the Gaussian distribution have been fixed for each experiment.

After applying the consensus algorithms, the Adjusted Rand Index (ARI) [5] for the obtained consensus partition σ and the generated partition λ is computed as ARI(σ, λ). Given two partitions ρ^a = {R_1^a, . . . , R_{k_a}^a} and ρ^b = {R_1^b, . . . , R_{k_b}^b}, let N_h^a = |R_h^a| be the cardinality of R_h^a, N_hm = |R_h^a ∩ R_m^b|, and N the number of objects; writing C(n, 2) = n(n − 1)/2 for the number of unordered pairs, let

C_a = Σ_h C(N_h^a, 2) = Σ_h N_h^a(N_h^a − 1)/2,

and similarly C_b. Then

ARI(ρ^a, ρ^b) = ( Σ_{h,m} C(N_hm, 2) − C_a C_b / C(N, 2) ) / ( (C_a + C_b)/2 − C_a C_b / C(N, 2) ).   (2)

This criterion expresses the similarity of two partitions; its value is at most 1, where 1 means identical partitions, while values around 0 indicate agreement no better than expected by chance.
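Equation (2) is straightforward to implement; the following sketch, ours and with illustrative names, computes ARI for two partitions given as label sequences, using the pair counts C(n, 2).

from collections import Counter
from math import comb

def ari(rho_a, rho_b):
    # Adjusted Rand Index of two label sequences, as in Eq. (2)
    n = len(rho_a)
    c_a = sum(comb(c, 2) for c in Counter(rho_a).values())  # C_a
    c_b = sum(comb(c, 2) for c in Counter(rho_b).values())  # C_b
    # pair counts of the block intersections N_hm
    s = sum(comb(c, 2) for c in Counter(zip(rho_a, rho_b)).values())
    expected = c_a * c_b / comb(n, 2)
    return (s - expected) / ((c_a + c_b) / 2 - expected)

print(ari([0, 0, 1, 1], [1, 1, 0, 0]))  # identical up to relabeling: 1.0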
4.1 Comparing consensus algorithms

The lattice-based consensus results have been compared with the results of the following algorithms (Fig. 5–8):

– AddRemAdd (ARA) [21,13]
– Voting Scheme (Dimitriadou, Weingessel and Hornik, 2002) [8]
– cVote (Ayad and Kamel, 2010) [9]
– Condorcet and Borda Consensus (Dominguez, Carrie and Pujol, 2009) [10]
– Meta-CLustering Algorithm (MCLA) (Strehl and Ghosh, 2002) [3]
– HyperGraph Partitioning Algorithm (HGPA) [3]
– Cluster-based Similarity Partitioning Algorithm (CSPA) [3]

Fig. 2. Influence of the minimal voting threshold (0–70%) on ARI for different numbers of true clusters (two, three, five, and nine).

To provide the reader with more details, we show the ARI values graphically for each of the ten datasets used. The summarised conclusions are given in the next section.

5 Conclusion

Our experiments lead us to the following conclusions:

– The optimal voting threshold, i.e. the minimum intent size for the resulting antichain of concepts, is not constant; moreover, it is usually not the majority of ensemble members (see Fig. 2).
– Our FCA-based consensus clustering method works better when the number of clusters of the ensemble clusterers is equal to the number of true clusters (see Fig. 3).
– The resulting ARI value depends on the number of objects: the greater the number, the smaller the ARI (see Fig. 4).
– When the number of true clusters is two (and almost always when it is three), our method beats the other algorithms under comparison; in some of these cases the consensus attains 100% accuracy (see Fig. 5–6).
– For larger numbers of clusters, our method is positioned as a median among the methods under comparison (see Fig. 7–8).

Fig. 3. Influence of the numbers of clusters of the ensemble clusterers (k = 2, 2–5, 5, 5–9, 9) on ARI, by dataset (each point is averaged over 10 runs).

Fig. 4. Influence of different numbers of objects (N = 100, 300, 500, 1000) on ARI, by dataset.

Fig. 5. Two clusters. Fig. 6. Three clusters. Fig. 7. Five clusters. Fig. 8. Nine clusters. (ARI by dataset for HGPA, Condorcet, cVote, Vote, Lattice (FCA-Consensus), ARA, Borda, MCLA, and CSPA.)

One straightforward step to be taken is testing our algorithm over real datasets. The algorithm can be modified to work on the space of all partition labels when the number of objects is greater than that of the labels. The algorithm's complexity and time-efficiency should be carefully studied and compared with those of the existing algorithms. An interesting avenue is to consider partition lattices as a search space for finding an optimal partition. For example, one can build a pattern structure [22] over partitions, similar to the one in [23], and analyse the correlation of the stability indices [24] of the partitions, as pattern concepts, with the ARI measure. One may hope that by so doing it could be possible to find or describe "good" regions in the lattice by using the partition union and partition intersection operations.
Acknowledgments

We would like to thank Jaume Baixeries, Amedeo Napoli, Alexei Buzmakov, Mehdi Kaytoue, and Oleg Anshakov for their comments, remarks and help while preparing this paper. The paper was prepared within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE) and supported within the framework of a subsidy by the Russian Academic Excellence Project "5-100". The third co-author was partially supported by the Russian Foundation for Basic Research, grants no. 16-29-12982 and 16-01-00583.

References

1. Régnier, S.: Sur quelques aspects mathématiques des problèmes de classification automatique. Mathématiques et Sciences humaines 82 (1983) 13–29 (reprint of I.C.C. bulletin 4: 1965, 175–191).
2. Mirkin, B.: An approach to the analysis of non-numerical data. In Bagrinovsky, K., ed.: Mathematical Methods of Modeling and Solutions for Economic Problems, Institute of Economics and Organization of Industrial Production, Siberian Branch of the USSR Academy of Sciences (1969) 141–150 (in Russian).
3. Strehl, A., Ghosh, J.: Cluster ensembles — A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3 (2002) 583–617
4. Kuncheva, L.I., Vetrov, D.P.: Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Trans. Pattern Anal. Mach. Intell. 28(11) (2006) 1798–1808
5. Ghosh, J., Acharya, A.: Cluster ensembles. Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery 1(4) (2011) 305–315
6. Topchy, A.P., Jain, A.K., Punch, W.F.: A mixture model for clustering ensembles. In: Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22–24, 2004. (2004) 379–390
7. Wang, H., Shan, H., Banerjee, A.: Bayesian cluster ensembles. Statistical Analysis and Data Mining 4(1) (2011) 54–70
8. Dimitriadou, E., Weingessel, A., Hornik, K.: A combination scheme for fuzzy clustering. IJPRAI 16(7) (2002) 901–912
9. Ayad, H., Kamel, M.S.: On voting-based consensus of cluster ensembles. Pattern Recognition 43(5) (2010) 1943–1953
10. Dominguez, X.S., Carrie, J.S., Pujol, F.A.: Fuzzy clusterers combination by positional voting for robust document clustering. Procesamiento del lenguaje natural 43 (2009) 245–253
11. Mirkin, B.: Clustering: A Data Recovery Approach. CRC Press (2012)
12. Guénoche, A.: Consensus of partitions: a constructive approach. Advances in Data Analysis and Classification 5(3) (2011) 215–229
13. Mirkin, B.G., Shestakov, A.: Least square consensus clustering: Criteria, methods, experiments. In: Advances in Information Retrieval - 35th European Conference on IR Research, ECIR 2013, Moscow, Russia, March 24–27, 2013. Proceedings. (2013) 764–767
14. Mirkin, B., Shestakov, A.: A Note on the Effectiveness of the Least Squares Consensus Clustering. Springer New York, New York, NY (2014) 181–185
15. Mirkin, B., Muchnik, I.: Geometrical interpretation of clustering scoring functions. In: Methods for the Analysis of Multivariate Data in Economics, Novosibirsk, Nauka Publisher (1981) 3–11 (in Russian).
16. Hristoskova, A., Boeva, V., Tsiporkova, E.: A Formal Concept Analysis Approach to Consensus Clustering of Multi-Experiment Expression Data. BMC Bioinformatics 15 (2014) 151
17. Domenach, F., Tayari, A.: Implications of Axiomatic Consensus Properties. In: Algorithms from and for Nature and Life: Classification and Data Analysis. Springer International Publishing, Cham (2013) 59–67
18.
Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. 1st edn. Springer-Verlag New York, Inc., Secaucus, NJ, USA (1999)
19. Grätzer, G.: General Lattice Theory. Akademie-Verlag (1978)
20. Kuznetsov, S.O.: A fast algorithm for computing all intersections of objects from an arbitrary semilattice. Nauchno-Tekhnicheskaya Informatsiya Seriya 2 - Informatsionnye protsessy i sistemy (1) (1993) 17–20
21. Mirkin, B.: Core concepts in Data Analysis: summarization, correlation, visualization. Springer (2011)
22. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: Conceptual Structures: Broadening the Base, 9th International Conference on Conceptual Structures, ICCS 2001, Stanford, CA, USA, July 30–August 3, 2001, Proceedings. (2001) 129–142
23. Codocedo, V., Napoli, A.: Lattice-based biclustering using partition pattern structures. In: ECAI 2014 - 21st European Conference on Artificial Intelligence, 18–22 August 2014, Prague, Czech Republic - Including Prestigious Applications of Intelligent Systems (PAIS 2014). (2014) 213–218
24. Kuznetsov, S.O.: Stability as an estimate of degree of substantiation of hypotheses derived on the basis of operational similarity. Nauchno-Tekhnicheskaya Informatsiya Seriya 2 - Informatsionnye protsessy i sistemy (12) (1990) 21–29