=Paper=
{{Paper
|id=Vol-2037/paper28
|storemode=property
|title=The Projective Clustering Ensemble Problem for Advanced Data Clustering
|pdfUrl=https://ceur-ws.org/Vol-2037/paper_28.pdf
|volume=Vol-2037
|authors=Carlotta Domeniconi,Francesco Gullo,Andrea Tagarelli
|dblpUrl=https://dblp.org/rec/conf/sebd/DomeniconiGT17
}}
==The Projective Clustering Ensemble Problem for Advanced Data Clustering==
The Projective Clustering Ensemble Problem for Advanced Data Clustering

Extended Abstract

Carlotta Domeniconi (George Mason University, USA, cdomenic@gmu.edu), Francesco Gullo (UniCredit, R&D Dept., Italy, gullof@acm.org), and Andrea Tagarelli (University of Calabria, Italy, andrea.tagarelli@unical.it)

Abstract. After more than five decades, a huge number of models and algorithms have been developed for data clustering. While most attention has been devoted to data types, algorithmic features, and application targets, in recent years there has also been an increasing interest in developing advanced data-clustering tools. In this respect, projective clustering and clustering ensembles represent two of the most important directions: the former is concerned with the discovery of subsets of the input data having different, possibly overlapping subsets of features associated with them, while the latter allows for the induction of a prototype consensus clustering from an available ensemble of clustering solutions. In this paper we discuss the current state-of-the-art research in which the problems of projective clustering and clustering ensembles have been revisited and integrated in a unified framework, called Projective Clustering Ensemble (PCE). We discuss how PCE was originally formalized as either a two-objective or a single-objective optimization problem, and how the limitations of such early approaches have been overcome by a metacluster-based formulation. We also summarize the main empirical results, and provide pointers for future research.

1 Introduction

Given a set of data objects as points in a multi-dimensional space (or feature space), the problem of clustering consists of discovering a number of homogeneous subsets of data, called clusters, which are well-separated from each other [8]. Clustering is a key step for a myriad of data-management/data-mining tasks that require the discovery of unknown relationships and patterns in large datasets. Although most clustering approaches usually provide single clustering solutions and/or use the same (typically large) feature space, recent advances have focused on two main aspects: (i) dealing with high dimensionality, and (ii) handling multiple clustering solutions.

Several problems of practical interest are inherently high-dimensional, i.e., they involve data objects represented by large sets of features. A common scenario with high-dimensional data is that several clusters may exist in different subspaces that correspond to different combinations of features. In fact, it is unlikely that all features of the data objects are equally relevant to form meaningful clusters.

Another challenge in the clustering process is that in many real-life domains multiple, and potentially meaningful, groupings of the input data are available, hence providing different views of the data. For instance, in genomics, multiple clustering solutions would be needed to capture the multiple functional roles of genes.
In text mining, documents discuss multiple topics, thus their grouping by content should reflect different informative views, which correspond to multiple (possibly alternative) clusterings.

Recent advances in data clustering have led to the definition of the problems of projective clustering—to deal with high dimensionality—and clustering ensembles—to handle multiple clustering solutions. The goal of projective clustering [9] is to discover projective clusters, i.e., subsets of the input data having different, and possibly overlapping, subsets of features (subspaces) associated with them. Projective clustering aims to solve issues that typically arise in high-dimensional data, such as sparsity and concentration of distances [7]. The problem of clustering ensembles [12], also known as consensus clustering or aggregation clustering, is stated as follows: given a set of clustering solutions, or ensemble, one must derive a consensus clustering that properly summarizes the solutions of the ensemble. The input ensemble is typically generated by varying one or more aspects of the clustering process, such as the clustering algorithm, the parameter setting, and the number of features, objects or clusters.

Projective clustering and clustering ensembles have recently been treated in a unified framework [2,3,4,5,6]. The underlying motivation of that study is that many real-world problems are high-dimensional and lack a priori knowledge. Examples include clustering of multi-view data, privacy-preserving clustering, news or document retrieval based on pre-defined categorizations, and distributed clustering of high-dimensional data. To address both issues simultaneously, the problem of Projective Clustering Ensembles (PCE) is hence formalized, whose goal is to compute a projective consensus clustering from an ensemble of projective clustering solutions. Intuitively, each projective cluster is characterized by a distribution of memberships of the objects as well as a distribution over the features that belong to the subspace of that cluster.

[Figure 1 is not reproduced here.] Fig. 1: Illustration of a projective clustering ensemble and derived consensus clustering. Each gradient refers to the cluster memberships over all objects. Colors denote different feature subspaces associated with the projective clusters [5].

Figure 1 illustrates an example of a projective clustering ensemble with three projective clustering solutions, which are obtained according to different views over the same dataset. A projective cluster is graphically represented as a rectangle filled with a color gradient, where higher intensities correspond to larger membership values of objects to the cluster. Clusters of the same clustering may overlap with their gradient (i.e., objects can have multiple assignments with different degrees of membership), and colors change to denote that different groupings of objects are associated with different feature subspaces. A projective consensus clustering is derived by suitably "aggregating" the ensemble members: the first projective consensus cluster is derived by summarizing $C_1'$, $C_2''$, and $C_2'''$, the second from $C_2'$, $C_3''$, and $C_3'''$, and the third from $C_4'$, $C_1''$, and $C_1'''$. Note that the resulting color in each projective consensus cluster would merge colors in the original projective clusters, which means that a projective consensus cluster is associated with a subset of features shared by the objects in the original clusters.

In this paper we provide an overview of the PCE problem.
We discuss how PCE was originally formalized as either a two-objective or a single-objective optimization problem, and how the limitations of the early approaches have been overcome by a metacluster-based method that is able to jointly consider the object-based and feature-based cluster representations in a single-objective optimization problem. We also summarize the main empirical results obtained on benchmark datasets. We finally provide pointers for future research in such a challenging context.

2 Projective Clustering Ensembles (PCE)

Let $D$ be a set of data objects, where each $o \in D$ is an $|F|$-dimensional point defined over a feature space $F$ (vectorial notation here denotes row vectors). A projective cluster $C$ defined over $D$ is a pair $\langle \Gamma_C, \Delta_C \rangle$. $\Gamma_C$, termed the object-based representation of $C$, is a $|D|$-dimensional real-valued vector whose components $\Gamma_{C,o} \in [0,1]$, $\forall o \in D$, represent the object-to-cluster assignment of $o$ to $C$, i.e., the probability $\Pr(C|o)$ that the object $o$ belongs to $C$. $\Delta_C$, termed the feature-based representation of $C$, is an $|F|$-dimensional real-valued vector whose components $\Delta_{C,f} \in [0,1]$, $\forall f \in F$, represent the feature-to-cluster assignment of the $f$-th feature to $C$, i.e., the probability $\Pr(f|C)$ that the feature $f$ is informative for cluster $C$ ($f$ belongs to the subspace associated with $C$).

The object-based ($\Gamma_C$) and the feature-based ($\Delta_C$) representations of a projective cluster $C$ implicitly define the projective cluster representation matrix (for short, projective matrix) $X_C$ of $C$. $X_C$ is a $|D| \times |F|$ matrix that stores, $\forall o \in D, f \in F$, the probability of the intersection of the events "object $o$ belongs to $C$" and "feature $f$ belongs to the subspace associated with $C$". Under the assumption of independence between the two events, such a probability is equal to the product of $\Pr(C|o) = \Gamma_{C,o}$ and $\Pr(f|C) = \Delta_{C,f}$. Hence, given $D = \{o_1, \ldots, o_{|D|}\}$ and $F = \{1, \ldots, |F|\}$, the matrix $X_C$ can formally be defined as $X_C = \Gamma_C^T \Delta_C$.

A projective clustering solution, denoted by $\mathcal{C}$, is defined as a set of projective clusters that satisfy $\sum_{C \in \mathcal{C}} \Gamma_{C,o} = 1$, $\forall o \in D$, and $\sum_{f \in F} \Delta_{C,f} = 1$, $\forall C \in \mathcal{C}$. The semantics of any projective clustering $\mathcal{C}$ is that, for each projective cluster $C \in \mathcal{C}$, the objects belonging to $C$ are close to each other if (and only if) they are projected onto the subspace associated with $C$.

A projective ensemble $\mathcal{E}$ is defined as a set of projective clustering solutions. No information about the ensemble generation strategy (algorithms and/or setups), nor about the original feature values of the objects within $D$, is provided along with $\mathcal{E}$. Moreover, each projective clustering solution in $\mathcal{E}$ may in general contain a different number of clusters.

The goal of PCE is to derive a projective consensus clustering that properly summarizes the projective clustering solutions within the input projective ensemble.

2.1 Single-objective PCE

The earliest PCE formulation proposed in [5] is based on a single-objective function:

$$\mathcal{C}^* = \arg\min_{\mathcal{C}} \sum_{C \in \mathcal{C}} \sum_{o \in D} \Gamma_{C,o}^{\alpha} \, X_{C,o}, \qquad (1)$$

where

$$X_{C,o} = \sum_{f \in F} (\Delta_{C,f} - \Lambda_{o,f})^2, \qquad \Lambda_{o,f} = \frac{1}{|\mathcal{E}|} \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} \Gamma_{\hat{C},o} \, \Delta_{\hat{C},f}, \qquad (2)$$

and $\alpha > 1$ is an integer. To solve the above optimization problem, the EM-based Projective Clustering Ensembles (EM-PCE) method is proposed in [5]. EM-PCE iteratively looks for the optimal values of $\Gamma_{C,o}$ (resp. $\Delta_{C,f}$) while keeping $\Delta_{C,f}$ (resp. $\Gamma_{C,o}$) fixed, until convergence.
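To make the quantities above concrete, the following minimal NumPy sketch (ours, not the authors' implementation; all names are illustrative, and clusters are assumed to be given as pairs of NumPy arrays) computes the projective matrix $X_C = \Gamma_C^T \Delta_C$ and evaluates the single-objective error of Eqs. (1)-(2) for a candidate consensus clustering:

```python
import numpy as np

def projective_matrix(gamma, delta):
    """Projective matrix X_C = Gamma_C^T Delta_C: the outer product of the
    object-based (|D|-dim) and feature-based (|F|-dim) representations."""
    return np.outer(gamma, delta)

def single_objective_error(clustering, ensemble, alpha=2):
    """Objective of Eqs. (1)-(2). A clustering is a list of (gamma, delta)
    pairs of NumPy arrays; `ensemble` is a list of such clusterings."""
    # Lambda[o, f] = (1/|E|) * sum over all ensemble clusters of
    # Gamma_{C,o} * Delta_{C,f}   (Eq. (2), right)
    lam = sum(projective_matrix(g, d)
              for sol in ensemble for (g, d) in sol) / len(ensemble)
    err = 0.0
    for gamma, delta in clustering:
        # X_{C,o} = sum_f (Delta_{C,f} - Lambda_{o,f})^2   (Eq. (2), left)
        x = ((delta[None, :] - lam) ** 2).sum(axis=1)
        err += (gamma ** alpha * x).sum()   # inner terms of Eq. (1)
    return err
```

EM-PCE would alternately update the $\Gamma$ and $\Delta$ variables to drive this quantity down; the sketch only evaluates it for given representations.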
Weaknesses of single-objective PCE. The objective function in (1) does not allow for a perfect balance between object- and feature-to-cluster assignments when measuring the error of a candidate projective consensus clustering solution. This weakness is formally shown in [3] and avoided by adjusting (1) with a corrective term. The resulting formulation of the problem based on the corrected objective function is the following:

$$\mathcal{C}^* = \arg\min_{\mathcal{C}} \sum_{C \in \mathcal{C}} \sum_{o \in D} \left( \Gamma_{C,o}^{\alpha} \, X_{C,o} + \frac{1}{2|\mathcal{E}|} \, \frac{1}{|D|-1} \, X'_{C,o} \right), \qquad (3)$$

where

$$X'_{C,o} = \frac{1 - \Gamma_{C,o}}{|\mathcal{E}|} \sum_{o' \neq o} \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} \Gamma_{\hat{C},o} \, \Gamma_{\hat{C},o'}.$$

The above optimization problem is tackled in [3] by proposing two different methods. The first one, called E-EM-PCE, follows the same scheme as the EM-PCE algorithm for the early single-objective PCE formulation. The second method, dubbed E-2S-PCE, consists of two sequential steps that handle the object-to-cluster and the feature-to-cluster assignments separately.

2.2 Two-objective PCE

PCE is also formulated in [5] as a two-objective problem, whose functions account for the object-based (function $\Psi_o$) and the feature-based (function $\Psi_f$) representations:

$$\mathcal{C}^* = \arg\min_{\mathcal{C}} \{\Psi_o(\mathcal{C}, \mathcal{E}), \; \Psi_f(\mathcal{C}, \mathcal{E})\}, \qquad (4)$$

where

$$\Psi_o(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \bar{\psi}_o(\mathcal{C}, \hat{\mathcal{C}}), \qquad \Psi_f(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \bar{\psi}_f(\mathcal{C}, \hat{\mathcal{C}}). \qquad (5)$$

Functions $\bar{\psi}_o$ and $\bar{\psi}_f$ are defined as $\bar{\psi}_o(\mathcal{C}', \mathcal{C}'') = \frac{1}{2}\big(\psi_o(\mathcal{C}', \mathcal{C}'') + \psi_o(\mathcal{C}'', \mathcal{C}')\big)$ and $\bar{\psi}_f(\mathcal{C}', \mathcal{C}'') = \frac{1}{2}\big(\psi_f(\mathcal{C}', \mathcal{C}'') + \psi_f(\mathcal{C}'', \mathcal{C}')\big)$, respectively, where

$$\psi_o(\mathcal{C}', \mathcal{C}'') = \frac{1}{|\mathcal{C}'|} \sum_{C' \in \mathcal{C}'} \Big(1 - \max_{C'' \in \mathcal{C}''} J(\Gamma_{C'}, \Gamma_{C''})\Big),$$

$$\psi_f(\mathcal{C}', \mathcal{C}'') = \frac{1}{|\mathcal{C}'|} \sum_{C' \in \mathcal{C}'} \Big(1 - \max_{C'' \in \mathcal{C}''} J(\Delta_{C'}, \Delta_{C''})\Big),$$

and $J(\mathbf{u}, \mathbf{v}) = \mathbf{u}\mathbf{v}^T / \big(\|\mathbf{u}\|_2^2 + \|\mathbf{v}\|_2^2 - \mathbf{u}\mathbf{v}^T\big)$ denotes the Tanimoto coefficient.

The problem defined in (4) is solved by the MOEA-PCE method, in which a Pareto-based Multi-Objective Evolutionary Algorithm is exploited to avoid combining the two objective functions into a single one.

Weaknesses of two-objective PCE. Experimental evidence in [5] has shown that the two-objective PCE formulation is more accurate than the single-objective one. Nevertheless, the original two-objective PCE still suffers from an important conceptual issue: it does not take into consideration the interrelation between the object-based and the feature-based cluster representations. This can be overcome by employing a metacluster-based PCE formulation, which we discuss next.

2.3 Metacluster-based PCE

Metacluster-based PCE is defined in terms of the following single-objective function [4]:

$$\mathcal{C}^* = \arg\min_{\mathcal{C}} \Psi_{of}(\mathcal{C}, \mathcal{E}), \qquad (6)$$

where $\Psi_{of}$ is a function designed to measure the "distance" of a projective clustering solution $\mathcal{C}$ from $\mathcal{E}$ in terms of both data clustering and feature-to-cluster assignment: $\Psi_{of}(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \bar{\psi}_{of}(\mathcal{C}, \hat{\mathcal{C}})$, where $\bar{\psi}_{of}(\mathcal{C}', \mathcal{C}'') = \frac{1}{2}\big(\psi_{of}(\mathcal{C}', \mathcal{C}'') + \psi_{of}(\mathcal{C}'', \mathcal{C}')\big)$, $\psi_{of}(\mathcal{C}', \mathcal{C}'') = \frac{1}{|\mathcal{C}'|} \sum_{C' \in \mathcal{C}'} \big(1 - \max_{C'' \in \mathcal{C}''} \hat{J}(X_{C'}, X_{C''})\big)$, and $\hat{J}$ is the Tanimoto coefficient between the linearized versions of the matrices to be compared.

The PCE formulation reported in (6) is well-suited to measure the quality of a candidate consensus clustering in terms of both object-to-cluster and feature-to-cluster assignments as a whole. As shown in [4], this enables us to overcome the conceptual disadvantages of both early single-objective and two-objective PCE.

To solve the metacluster-based PCE problem, a two-step method is defined in [4]. The method, called CB-PCE, first discovers a suitable metacluster structure, and then computes optimal object- and feature-based representations of each metacluster.
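The following sketch (again ours and purely illustrative, under the same conventions as the previous one: clusters as (gamma, delta) pairs of NumPy arrays; `tanimoto`, `psi`, and `psi_of` are hypothetical names) shows the Tanimoto coefficient and the one-sided clustering distances underlying both the two-objective and the metacluster-based formulations:

```python
import numpy as np

def tanimoto(u, v):
    """Tanimoto coefficient J(u, v) = <u,v> / (||u||^2 + ||v||^2 - <u,v>)."""
    dot = float(np.dot(u, v))
    return dot / (np.dot(u, u) + np.dot(v, v) - dot)

def psi(c1, c2, rep):
    """One-sided distance psi_o (rep=0: Gamma) or psi_f (rep=1: Delta)
    between two projective clusterings: average, over the clusters of c1,
    of 1 minus the best Tanimoto match found in c2 (Sec. 2.2)."""
    return sum(1 - max(tanimoto(c[rep], d[rep]) for d in c2)
               for c in c1) / len(c1)

def psi_of(c1, c2):
    """One-sided metacluster distance of Sec. 2.3: Tanimoto between the
    linearized projective matrices X_C = Gamma^T Delta."""
    mats2 = [np.outer(g, d).ravel() for g, d in c2]
    return sum(1 - max(tanimoto(np.outer(g, d).ravel(), m) for m in mats2)
               for g, d in c1) / len(c1)

# The symmetrized variants average the two directions, e.g.
# 0.5 * (psi(c1, c2, 0) + psi(c2, c1, 0)) for psi_o-bar.
```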
2.4 Enhanced metacluster-based PCE

Although the early metacluster-based PCE formulation in (6) mitigates the issues of two-objective PCE, it still has some limitations. By some rearrangement and omitting constant terms, the objective function in (6) can be rewritten as:

$$\Psi_{of}(\mathcal{C}, \mathcal{E}) = \underbrace{\sum_{C \in \mathcal{C}} \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \min_{\hat{C} \in \hat{\mathcal{C}}} \Big(1 - \hat{J}(X_C, X_{\hat{C}})\Big)}_{\Psi'_{of}(\mathcal{C}, \mathcal{E})} \; + \; \underbrace{\sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} \min_{C \in \mathcal{C}} \Big(1 - \hat{J}(X_C, X_{\hat{C}})\Big)}_{\Psi''_{of}(\mathcal{C}, \mathcal{E})}. \qquad (7)$$

As formally shown in [6], both $\Psi'_{of}$ and $\Psi''_{of}$ suffer from conceptual drawbacks: the optimization of $\Psi'_{of}$ favors projective clustering solutions composed of clusters all coming from the same input ensemble solution, while the optimization of $\Psi''_{of}$ favors the replication of the same projective cluster within the output consensus projective clustering.

To solve these issues, an enhanced metacluster-based PCE formulation is proposed in [6]: given a projective ensemble $\mathcal{E}$ defined over objects $D$ and features $F$, and an integer $K > 0$, find a projective clustering solution $\mathcal{C}^*$ such that $|\mathcal{C}^*| = K$ and

$$\mathcal{C}^* = \arg\min_{\mathcal{C}} \sum_{C \in \mathcal{C}} \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} x(\hat{C}, C) \, T(\hat{C}, C) \qquad (8)$$

subject to

$$\sum_{C \in \mathcal{C}} \Gamma_{C,o} = 1, \;\; \forall o \in D, \qquad \sum_{f \in F} \Delta_{C,f} = 1, \;\; \forall C \in \mathcal{C},$$

$$x(\hat{C}, C) \in \{0, 1\}, \;\; \forall \hat{\mathcal{C}} \in \mathcal{E}, \; \forall \hat{C} \in \hat{\mathcal{C}}, \; \forall C \in \mathcal{C},$$

$$\sum_{C \in \mathcal{C}} x(\hat{C}, C) \geq 1, \;\; \forall \hat{\mathcal{C}} \in \mathcal{E}, \; \forall \hat{C} \in \hat{\mathcal{C}}, \qquad (9)$$

$$\sum_{\hat{C} \in \hat{\mathcal{C}}} x(\hat{C}, C) \geq 1, \;\; \forall \hat{\mathcal{C}} \in \mathcal{E}, \; \forall C \in \mathcal{C}. \qquad (10)$$

The above (NP-hard [6]) problem is solved by E-CB-PCE, i.e., a more effective and more efficient variant of the method devised for the early metacluster-based PCE.
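As a reading aid for Eqs. (8)-(10), here is a small sketch (ours; `x`, `T`, and `solution_of` are illustrative names) that scores a given binary assignment under Eq. (8) and checks the two covering constraints:

```python
import numpy as np

def enhanced_objective(x, T, solution_of):
    """Evaluate Eq. (8) for a candidate assignment.

    x[i, k] = 1 iff input cluster i is assigned to consensus cluster k
              (the x(C_hat, C) variables); T[i, k] is the corresponding
              distance term T(C_hat, C); solution_of[i] is the index of
              the ensemble solution that input cluster i belongs to.
    """
    x, T = np.asarray(x), np.asarray(T)
    sol = np.asarray(solution_of)
    # Eq. (9): every input cluster joins at least one consensus cluster.
    assert (x.sum(axis=1) >= 1).all(), "constraint (9) violated"
    # Eq. (10): every consensus cluster receives at least one cluster
    # from every ensemble solution.
    for s in np.unique(sol):
        assert (x[sol == s].sum(axis=0) >= 1).all(), "constraint (10) violated"
    return float((x * T).sum())   # objective of Eq. (8)
```

A solver such as E-CB-PCE searches over such assignments; the sketch only verifies feasibility and scores a given one.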
3 Experiments

In this section we report the main experimental findings on the various state-of-the-art PCE methods, i.e., MOEA-PCE [2,5], EM-PCE [2,5], E-EM-PCE [3], E-2S-PCE [3], CB-PCE [4], and E-CB-PCE [6].

Datasets. The evaluation was performed on 22 publicly available benchmark datasets, whose main characteristics are shown in Table 1. The first 15 datasets are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml), the next four are from the UCR Time Series Classification/Clustering Page (http://www.cs.ucr.edu/~eamonn/time_series_data), whereas the last three are synthetic datasets from http://dme.rwth-aachen.de/en/OpenSubspace/evaluation.

Table 1: Datasets used in the experimental evaluation

dataset             objects  features  classes
Iris                    150         4        3
Wine                    178        13        3
Glass                   214        10        6
Ecoli                   327         7        5
Yeast                 1,484         8       10
Multiple-Features     2,000       585       10
Segmentation          2,310        19        7
Abalone               4,124         7       17
Waveform              5,000        40        3
Letter                7,648        16       10
Isolet                7,797       617       26
Gisette              13,500     5,000        2
p53-Mutants             300     5,409        2
Amazon                  120    10,000        4
Arcene                  200    10,000        2
Shapes                  160     1,614        9
Tracedata               200       275        4
ControlChart            600        60        6
Twopat                  800       128        4
N30                   1,356        20        8
D75                   1,365        75        7
S2500                 2,262        20        8

Projective-ensemble generation. Projective ensembles were generated by running the well-established projective clustering algorithm LAC [1] on the selected datasets. The diversity of the projective clustering solutions was ensured by randomly choosing the initial centroids and varying LAC's parameter h. For each dataset, we generated 10 different projective ensembles; all results were averaged over these 10 ensembles.

Assessment criteria. The quality of a projective consensus clustering C was assessed by using both external and internal evaluations: the former compares C against a reference classification, whereas the latter is based on the average similarity w.r.t. the solutions of the input projective ensemble. Three variants of the classic F1-measure were used as external criteria, aiming at comparing projective clustering solutions in terms of their object-based representation (F1_o), feature-based representation (F1_f), or both (F1_of). The same criteria were used for internal evaluation too; the difference in this case is that the target consensus clustering C is compared against all solutions in the input ensemble, and an aggregated score is ultimately derived from all individual scores. Again, three variants were employed (internal F1_o, F1_f, and F1_of). For a more detailed definition of all six criteria, please refer to [6].

Results. Table 2 shows the accuracy of all methods for each assessment criterion, averaged over all the selected datasets (dataset-by-dataset scores are available in [6]). In particular, the table shows the average gain achieved by the E-CB-PCE method, which was recognized as the most accurate one. E-CB-PCE was more accurate than both MOEA-PCE and CB-PCE on 16 out of 22 datasets on average, while achieving average gains up to 0.147 (internal F1_of criterion) and 0.078 (external F1_f criterion) w.r.t. MOEA-PCE and CB-PCE, respectively. Larger improvements were produced by E-CB-PCE w.r.t. the early single-objective PCE methods, i.e., EM-PCE, E-EM-PCE, and E-2S-PCE.

Table 2: Average gains of E-CB-PCE w.r.t. the other methods

criterion         MOEA-PCE  EM-PCE  E-EM-PCE  E-2S-PCE  CB-PCE
external F1_of        0       .014     .021      .025     .020
external F1_o        .029     .049     .054      .108     .023
external F1_f        .005     .056     .055      .060     .078
internal F1_of       .147     .186     .194      .237     .019
internal F1_o        .021     .046     .055      .100     .021
internal F1_f        .022     .066     .067      .076    -.010
avg                  .037     .070     .074      .101     .025

Table 3 shows the runtimes (in milliseconds) of the various algorithms involved in the comparison. E-CB-PCE was slower than EM-PCE on most datasets, while clearly outperforming MOEA-PCE: the runtimes of E-CB-PCE were one or two orders of magnitude smaller than those of MOEA-PCE on average, and up to four orders of magnitude smaller on Isolet. Only on one dataset (Glass) was MOEA-PCE more efficient than E-CB-PCE, and even there the runtimes of the two methods remained of the same order of magnitude. Compared to CB-PCE, E-CB-PCE was faster on 11 out of 22 datasets.

Table 3: Execution times (milliseconds)

dataset        MOEA-PCE     EM-PCE  E-EM-PCE  E-2S-PCE     CB-PCE  E-CB-PCE
Iris               2,056        37       109       253         74     1,492
Wine               2,558        29        88       163        144     1,223
Glass              7,712        56       615       248        500     9,177
Ecoli             14,401        59       685       625      1,147     9,337
Yeast            227,878       757    20,438    21,560     30,384    83,008
Mult.-Feat.      490,602   116,334   151,582    88,713  1,562,139   114,368
Segmentation     233,951     1,854    17,385    40,014     25,692    31,780
Abalone        3,411,116     4,105   156,959   430,974    275,053   857,164
Waveform         125,247     4,912    15,905    91,844     10,487     2,179
Letter         2,248,695     9,591   126,222   772,883     70,458   135,832
Isolet        20,676,754    10,100    10,000     8,488     66,136     1,447
Gisette          966,108    34,260    38,216    29,700     93,148     1,450
p53-Mutants       58,695    22,209    22,168    19,347     65,501     1,490
Amazon           395,988    21,556    21,619    20,914    135,446    12,737
Arcene           120,537    21,557    21,499    17,405     81,433     2,413
Shapes           211,654    17,752    19,152    12,473    282,857   106,180
Tracedata         12,777     1,062     1,108       960      9,000     4,716
ControlChart      50,798     1,522     5,397     2,801     13,900    20,708
Twopat            31,850     2,946     5,706     4,606      9,788     5,344
N30              164,969     1,340    13,781    16,243     22,904    44,517
D75              135,297     4,558    13,414    15,477     32,938    30,631
S2500            290,408     2,717    29,223    44,008     39,039    55,607

4 Conclusions and Future Work

We provided an overview of research work on projective clustering ensembles (PCE). We discussed the major reasons behind the formulation of PCE as a new topic in the data clustering field, which required novel computational approaches and algorithmic solutions. We formally presented a summary of existing formulations of PCE, and reported major findings learned from experimental evaluation studies.

Our research work paved the way for the development of data clustering frameworks that can support a variety of current and emerging applications in several domains of data management and analysis. For instance, in the complex network systems area, one interesting direction is to exploit PCE for addressing problems of dimensionality reduction and community detection in time-evolving attributed networks. Another problem that can benefit from our framework is outlier detection in high-dimensional data, which has many of the same challenges as clustering. Some work has been done on outlier ensembles (e.g., [10]) and on subspace outlier detection (e.g., [11]), but a unified framework encompassing both does not exist.

References

1. Domeniconi, C., Gunopulos, D., Ma, S., Yan, B., Al-Razgan, M., Papadopoulos, D.: Locally Adaptive Metrics for Clustering High Dimensional Data. Data Mining and Knowledge Discovery 14(1), 63–97 (2007)
2. Gullo, F., Domeniconi, C., Tagarelli, A.: Projective Clustering Ensembles. In: Proc. IEEE Int. Conf. on Data Mining (ICDM). pp. 794–799 (2009)
3. Gullo, F., Domeniconi, C., Tagarelli, A.: Enhancing Single-Objective Projective Clustering Ensembles. In: Proc. IEEE Int. Conf. on Data Mining (ICDM). pp. 833–838 (2010)
4. Gullo, F., Domeniconi, C., Tagarelli, A.: Advancing Data Clustering via Projective Clustering Ensembles. In: Proc. ACM SIGMOD Int. Conf. on Management of Data. pp. 733–744 (2011)
5. Gullo, F., Domeniconi, C., Tagarelli, A.: Projective Clustering Ensembles. Data Mining and Knowledge Discovery 26(3), 452–511 (2013)
6. Gullo, F., Domeniconi, C., Tagarelli, A.: Metacluster-based Projective Clustering Ensembles. Machine Learning 98(1), 181–216 (2015)
7. Hinneburg, A., Aggarwal, C.C., Keim, D.A.: What Is the Nearest Neighbor in High Dimensional Spaces? In: Proc. VLDB Conf. pp. 506–515 (2000)
8. Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall (1988)
9. Kriegel, H.P., Kröger, P., Zimek, A.: Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering. ACM TKDD 3(1), 1:1–1:58 (2009)
10. Rayana, S., Akoglu, L.: Less is More: Building Selective Anomaly Ensembles. ACM TKDD 10(4), 42:1–42:33 (2016)
11. Sathe, S., Aggarwal, C.C.: Subspace Outlier Detection in Linear Time with Randomized Hashing. In: Proc. IEEE Int. Conf. on Data Mining (ICDM). pp. 459–468 (2016)
12. Strehl, A., Ghosh, J.: Cluster Ensembles — A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research 3, 583–617 (2002)