<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>The Projective Clustering Ensemble Problem for Advanced Data Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>EXTENDED ABSTRACT</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlotta Domeniconi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Gullo</string-name>
          <email>gullof@acm.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Tagarelli</string-name>
          <email>andrea.tagarelli@unical.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>George Mason University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UniCredit, R&amp;D Dept.</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>After more than five decades, a huge number of models and algorithms have been developed for data clustering. While most attention has been devoted to data types, algorithmic features, and application targets, in recent years there has also been increasing interest in developing advanced data-clustering tools. In this respect, projective clustering and clustering ensembles represent two of the most important directions: the former is concerned with the discovery of subsets of the input data having different, possibly overlapping subsets of features associated with them, while the latter allows for the induction of a prototype consensus clustering from an available ensemble of clustering solutions. In this paper we discuss the current state-of-the-art research in which the problems of projective clustering and clustering ensembles have been revisited and integrated in a unified framework, called Projective Clustering Ensembles (PCE). We discuss how PCE was originally formalized as either a two-objective or a single-objective optimization problem, and how the limitations of such early approaches have been overcome by a metacluster-based formulation. We also summarize the main empirical results, and provide pointers for future research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Given a set of data objects as points in a multi-dimensional space (or feature space),
the problem of clustering consists in discovering a number of homogeneous subsets
of data, called clusters, which are well-separated from each other [8]. Clustering is a
key step for a myriad of data-management/data-mining tasks that require the
discovery of unknown relationships and patterns in large datasets. Although most clustering
approaches usually provide a single clustering solution and/or use the same (typically
large) feature space, recent advances have focused on two main aspects: (i)
dealing with high dimensionality, and (ii) handling multiple clustering solutions.</p>
      <p>Several problems of practical interest are inherently high-dimensional, i.e., they
involve data objects represented by large sets of features. A common scenario with
high-dimensional data is that several clusters may exist in different subspaces that correspond
to different combinations of features. In fact, it is unlikely that all features of the data
objects are equally relevant to form meaningful clusters.</p>
      <p>Another challenge in the clustering process is that in many real-life domains
multiple, and potentially meaningful groupings of the input data are available, hence
providing different views of the data. For instance, in genomics, multiple clustering solutions
would be needed to capture the multiple functional roles of genes. In text mining,
documents discuss multiple topics, thus their grouping by content should reflect different
informative views which correspond to multiple (possibly alternative) clusterings.</p>
      <p>[Fig. 1 here: an example projective ensemble of three projective clustering solutions C', C'', C''' over a dataset D, and the projective consensus clustering C* derived from them.]</p>
      <p>
        Recent advances in data clustering have led to the definition of the problems of
projective clustering—to deal with high dimensionality—and clustering ensembles—
to handle multiple clustering solutions. The goal of projective clustering [9] is to
discover projective clusters, i.e., subsets of the input data having different, and possibly
overlapping subsets of features (subspaces) associated with them. Projective clustering
aims to solve issues that typically arise in high-dimensional data, such as sparsity and
concentration of distances [
        <xref ref-type="bibr" rid="ref1">7</xref>
        ]. The problem of clustering ensembles [12], also known
as consensus clustering or aggregation clustering, is stated as follows: given a set of
clustering solutions, or ensemble, one must derive a consensus clustering that properly
summarizes the solutions of the ensemble. The input ensemble is typically generated by
varying one or more aspects of the clustering process, such as the clustering algorithm,
the parameter setting, and the number of features, objects or clusters.
      </p>
      <p>Projective clustering and clustering ensembles have recently been treated in a
unified framework [2,3,4,5,6]. The underlying motivation of that study is that many
real-world problems are high-dimensional and lack a-priori knowledge. Examples include
clustering of multi-view data, privacy preserving clustering, news or document retrieval
based on pre-defined categorizations, and distributed clustering of high-dimensional
data. To address both issues simultaneously, the problem of Projective Clustering
Ensembles (PCE) is hence formalized, whose goal is to compute a projective consensus
clustering from an ensemble of projective clustering solutions. Intuitively, each
projective cluster is characterized by a distribution of memberships of the objects as well
as a distribution over the features that belong to the subspace of that cluster. Figure 1
illustrates an example of projective clustering ensemble with three projective
clustering solutions, which are obtained according to different views over the same dataset.
A projective cluster is graphically represented as a rectangle filled with a color
gradient, where higher intensities correspond to larger membership values of objects to
the cluster. Clusters of the same clustering may overlap with their gradient (i.e.,
objects can have multiple assignments with different degrees of membership), and colors
change to denote that different groupings of objects are associated with different feature
subspaces. A projective consensus clustering is derived by suitably “aggregating” the
ensemble members: the first projective consensus cluster is derived by summarizing C1',
C2'', and C2''', the second from C2', C3'', and C3''', and the third from C4', C1'', and C1'''.
Note that the resulting color in each projective consensus cluster would merge colors
in the original projective clusters, which means that a projective consensus cluster is
associated with a subset of features shared by the objects in the original clusters.</p>
      <p>In this paper we provide an overview of the PCE problem. We discuss how PCE
was originally formalized as either a two-objective or a single-objective
optimization problem, and how the limitations of the early approaches have been overcome
by a metacluster-based method that is able to jointly consider the object-based and
feature-based cluster representations in a single-objective optimization problem. We
also summarize main empirical results obtained on benchmark datasets. We finally
provide pointers for future research in such a challenging context.</p>
    </sec>
    <sec id="sec-2">
      <title>Projective Clustering Ensembles (PCE)</title>
      <p>Let $\mathcal{D}$ be a set of data objects, where each $o \in \mathcal{D}$ is an $|\mathcal{F}|$-dimensional point defined
over a feature space $\mathcal{F}$ (vectorial notation here denotes row vectors). A projective cluster $C$ defined over $\mathcal{D}$ is a pair $\langle \Gamma_C, \Delta_C \rangle$. $\Gamma_C$,
termed the object-based representation of $C$, is a $|\mathcal{D}|$-dimensional real-valued vector whose
components $\Gamma_{C,o} \in [0,1]$, $\forall o \in \mathcal{D}$, represent the object-to-cluster assignment
of $o$ to $C$, i.e., the probability $\Pr(C|o)$ that the object $o$ belongs to $C$. $\Delta_C$, termed
the feature-based representation of $C$, is an $|\mathcal{F}|$-dimensional real-valued vector whose
components $\Delta_{C,f} \in [0,1]$, $\forall f \in \mathcal{F}$, represent the feature-to-cluster assignment of
the feature $f$ to $C$, i.e., the probability $\Pr(f|C)$ that the feature $f$ is informative for
cluster $C$ ($f$ belongs to the subspace associated with $C$).</p>
      <p>The object-based ($\Gamma_C$) and the feature-based ($\Delta_C$) representations of a projective
cluster $C$ implicitly define the projective cluster representation matrix (for short,
projective matrix) $X_C$ of $C$. $X_C$ is a $|\mathcal{D}| \times |\mathcal{F}|$ matrix that stores, $\forall o \in \mathcal{D}, f \in \mathcal{F}$, the
probability of the intersection of the events “object $o$ belongs to $C$” and “feature $f$
belongs to the subspace associated with $C$”. Under the assumption of independence
between the two events, such a probability is equal to the product of $\Pr(C|o) = \Gamma_{C,o}$
and $\Pr(f|C) = \Delta_{C,f}$. Hence, given $\mathcal{D} = \{o_1, \ldots, o_{|\mathcal{D}|}\}$ and $\mathcal{F} = \{1, \ldots, |\mathcal{F}|\}$, the matrix
$X_C$ can formally be defined as $X_C = \Gamma_C^{\mathsf{T}} \Delta_C$.</p>
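      <p>As an illustration of the above definition, the following minimal NumPy sketch (ours, with hypothetical toy values) builds the projective matrix $X_C$ of a single projective cluster as the outer product of its object-based and feature-based representations:</p>
      <preformat>
import numpy as np

# Object-based representation Gamma_C: Pr(C|o) for each of 4 objects (toy values).
gamma_C = np.array([0.9, 0.7, 0.2, 0.1])

# Feature-based representation Delta_C: Pr(f|C) over 3 features (sums to 1).
delta_C = np.array([0.5, 0.3, 0.2])

# Projective matrix X_C = Gamma_C^T Delta_C (a |D| x |F| outer product):
# entry (o, f) is the probability that o belongs to C AND f is in C's subspace,
# under the independence assumption discussed above.
X_C = np.outer(gamma_C, delta_C)

print(X_C.shape)  # (4, 3)
      </preformat>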
      <p>A projective clustering solution, denoted by $\mathcal{C}$, is defined as a set of projective
clusters that satisfy $\sum_{C \in \mathcal{C}} \Gamma_{C,o} = 1$, $\forall o \in \mathcal{D}$, and $\sum_{f \in \mathcal{F}} \Delta_{C,f} = 1$, $\forall C \in \mathcal{C}$. The
semantics of any projective clustering $\mathcal{C}$ is that, for each projective cluster $C \in \mathcal{C}$, the
objects belonging to $C$ are close to each other if (and only if) they are projected onto
the subspace associated with $C$.</p>
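      <p>As a quick illustration of these two constraints, the following helper (ours; each projective cluster is stored as a (Gamma, Delta) pair of NumPy arrays) checks whether a candidate solution satisfies them:</p>
      <preformat>
import numpy as np

def is_valid_solution(clusters, tol=1e-9):
    """Check that the Gamma values of the clusters sum to 1 for every object,
    and that each Delta vector sums to 1 over the features."""
    gammas = np.stack([g for (g, _) in clusters])        # |C| x |D| matrix
    objects_ok = np.allclose(gammas.sum(axis=0), 1.0, atol=tol)
    features_ok = all(np.isclose(d.sum(), 1.0, atol=tol) for (_, d) in clusters)
    return objects_ok and features_ok
      </preformat>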
      <p>A projective ensemble $\mathcal{E}$ is defined as a set of projective clustering solutions. Neither
information about the ensemble generation strategy (algorithms and/or setups) nor the
original feature values of the objects within $\mathcal{D}$ is provided along with $\mathcal{E}$. Moreover, each
projective clustering solution in $\mathcal{E}$ may in general contain a different number of clusters.</p>
      <p>The goal of PCE is to derive a projective consensus clustering that properly
summarizes the projective clustering solutions within the input projective ensemble.</p>
      <sec id="sec-2-1">
        <title>4 Vectorial notation here denotes row vectors.</title>
        <p>2.1</p>
        <sec id="sec-2-1-1">
          <title>Single-objective PCE</title>
        <p>The earliest PCE formulation proposed in [5] is based on a single-objective function:
\[ \mathcal{C}^* = \arg\min\nolimits_{\mathcal{C}} \sum_{C \in \mathcal{C}} \sum_{o \in \mathcal{D}} \Gamma_{C,o}^{\alpha} \, X_{C,o}, \tag{1} \]
where
\[ X_{C,o} = \sum_{f \in \mathcal{F}} (\Delta_{C,f} - \lambda_{o,f})^2, \qquad \lambda_{o,f} = \frac{1}{|\mathcal{E}|} \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} \Gamma_{\hat{C},o} \, \Delta_{\hat{C},f}, \tag{2} \]
and $\alpha &gt; 1$ is a positive integer. To solve the above optimization problem, the
EM-based Projective Clustering Ensembles (EM-PCE) method is proposed in [5]. EM-PCE
iteratively looks for the optimal values of $\Gamma_{C,o}$ (resp. $\Delta_{C,f}$) while keeping
$\Delta_{C,f}$ (resp. $\Gamma_{C,o}$) fixed, until convergence.</p>
        <p>Weaknesses of single-objective PCE. The objective function in (1) does not allow for
a perfect balance between object- and feature-to-cluster assignments when measuring
the error of a candidate projective consensus clustering solution. This weakness is
formally shown in [3] and avoided by adjusting (1) with a corrective term. The resulting
formulation of the problem based on the corrected objective function is the following:
\[ \mathcal{C}^* = \arg\min\nolimits_{\mathcal{C}} \sum_{C \in \mathcal{C}} \sum_{o \in \mathcal{D}} \Gamma_{C,o}^{\alpha} \left( \frac{1}{2|\mathcal{E}|} X_{C,o} + \frac{1}{|\mathcal{D}|} \sum_{o' \neq o} \Gamma_{C,o'}^{\alpha} \Big( 1 - \frac{1}{|\mathcal{E}|} \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} \Gamma_{\hat{C},o} \, \Gamma_{\hat{C},o'} \Big) \right). \tag{3} \]</p>
          <p>The above optimization problem is tackled in [3] by proposing two different
methods. The first one, called E-EM-PCE, follows the same scheme as the EM-PCE
algorithm for the early single-objective PCE formulation. The second method, dubbed
E-2S-PCE, consists of two sequential steps that handle the object-to-cluster and the
feature-to-cluster assignments separately.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Two-objective PCE</title>
          <p>PCE is formulated in [5] also as a two-objective problem, whose functions account for
the object-based (function $\psi_o$) and the feature-based (function $\psi_f$) representations:
\[ \mathcal{C}^* = \arg\min\nolimits_{\mathcal{C}} \, \{ \psi_o(\mathcal{C}, \mathcal{E}), \; \psi_f(\mathcal{C}, \mathcal{E}) \}, \tag{4} \]
where
\[ \psi_o(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \psi_o(\mathcal{C}, \hat{\mathcal{C}}), \qquad \psi_f(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \psi_f(\mathcal{C}, \hat{\mathcal{C}}). \tag{5} \]
Functions $\psi_o$ and $\psi_f$ are defined as $\psi_o(\mathcal{C}', \mathcal{C}'') = \frac{1}{2} (\bar{\psi}_o(\mathcal{C}', \mathcal{C}'') + \bar{\psi}_o(\mathcal{C}'', \mathcal{C}'))$ and
$\psi_f(\mathcal{C}', \mathcal{C}'') = \frac{1}{2} (\bar{\psi}_f(\mathcal{C}', \mathcal{C}'') + \bar{\psi}_f(\mathcal{C}'', \mathcal{C}'))$, respectively, where
\[ \bar{\psi}_o(\mathcal{C}', \mathcal{C}'') = \frac{1}{|\mathcal{C}'|} \sum_{C' \in \mathcal{C}'} \Big( 1 - \max_{C'' \in \mathcal{C}''} J(\Gamma_{C'}, \Gamma_{C''}) \Big), \]
$\bar{\psi}_f$ is defined analogously on the feature-based representations $\Delta$, and
$J(\mathbf{u}, \mathbf{v}) = \mathbf{u} \mathbf{v}^{\mathsf{T}} / (\| \mathbf{u} \|_2^2 + \| \mathbf{v} \|_2^2 - \mathbf{u} \mathbf{v}^{\mathsf{T}})$ denotes the Tanimoto coefficient.</p>
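          <p>For illustration, the Tanimoto coefficient $J$ and the asymmetric distance $\bar{\psi}_o$ can be transcribed directly (our own sketch, not code from [5]; clusters are again (Gamma, Delta) pairs of NumPy arrays):</p>
          <preformat>
import numpy as np

def tanimoto(u, v):
    """Tanimoto coefficient J(u, v) = u.v / (||u||^2 + ||v||^2 - u.v)."""
    dot = float(np.dot(u, v))
    return dot / (np.dot(u, u) + np.dot(v, v) - dot)

def psi_o_bar(c1, c2):
    """bar-psi_o(C', C''): average, over the clusters of C', of one minus the
    best Tanimoto match among the clusters of C'' (object-based views)."""
    return np.mean([1 - max(tanimoto(g1, g2) for (g2, _) in c2)
                    for (g1, _) in c1])

def psi_o(c1, c2):
    """Symmetrized version used in (5)."""
    return 0.5 * (psi_o_bar(c1, c2) + psi_o_bar(c2, c1))
          </preformat>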
          <p>The problem defined in (4) is solved by the MOEA-PCE method, in which a
Pareto-based Multi-Objective Evolutionary Algorithm is exploited to avoid combining the two
objective functions into a single one.</p>
        <p>Weaknesses of two-objective PCE. Experimental evidence in [5] has shown that the
two-objective PCE formulation is more accurate than the single-objective one.
Nevertheless, the original two-objective PCE still suffers from an important conceptual
issue: it does not take into consideration the interrelation between the object-based
and the feature-based cluster representations. This can be overcome by employing a
metacluster-based PCE formulation, which we discuss next.</p>
        <sec id="sec-2-2-1">
          <title>Metacluster-based PCE</title>
          <p>Metacluster-based PCE is defined in terms of the following single-objective function [4]:
\[ \mathcal{C}^* = \arg\min\nolimits_{\mathcal{C}} \, \psi_{of}(\mathcal{C}, \mathcal{E}), \tag{6} \]
where $\psi_{of}$ is a function designed to measure the “distance” of a projective clustering
solution $\mathcal{C}$ from $\mathcal{E}$ in terms of both data clustering and feature-to-cluster assignment:
$\psi_{of}(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \psi_{of}(\mathcal{C}, \hat{\mathcal{C}})$, with
$\psi_{of}(\mathcal{C}', \mathcal{C}'') = \frac{1}{2} (\bar{\psi}_{of}(\mathcal{C}', \mathcal{C}'') + \bar{\psi}_{of}(\mathcal{C}'', \mathcal{C}'))$,
$\bar{\psi}_{of}(\mathcal{C}', \mathcal{C}'') = \frac{1}{|\mathcal{C}'|} \sum_{C' \in \mathcal{C}'} \big( 1 - \max_{C'' \in \mathcal{C}''} \hat{J}(X_{C'}, X_{C''}) \big)$, and $\hat{J}$ is the Tanimoto
coefficient between the linearized versions of the matrices to be compared.</p>
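          <p>The matrix-level coefficient $\hat{J}$ reduces to the vector Tanimoto coefficient applied to flattened projective matrices; a short self-contained sketch (ours):</p>
          <preformat>
import numpy as np

def tanimoto_matrix(X1, X2):
    """Tanimoto coefficient J-hat between linearized projective matrices."""
    u, v = X1.ravel(), X2.ravel()
    dot = float(np.dot(u, v))
    return dot / (np.dot(u, u) + np.dot(v, v) - dot)
          </preformat>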
          <p>The PCE formulation reported in (6) is well-suited to measure the quality of a
candidate consensus clustering in terms of both object-to-cluster and feature-to-cluster
assignments as a whole. As shown in [4], this enables us to overcome the conceptual
disadvantages of both early single-objective and two-objective PCE.</p>
          <p>To solve the metacluster-based PCE problem, a two-step method is defined in [4]. The
method, called CB-PCE, first discovers a suitable metacluster structure, and then
computes optimal object- and feature-based representations of each metacluster.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Enhanced metacluster-based PCE</title>
          <p>Although the early metacluster-based PCE formulation in (6) mitigates the issues of
two-objective PCE, such a formulation still has some limitations. By some
rearrangement and omitting constant terms, the objective function in (6) can be rewritten as:
\[ \psi_{of}(\mathcal{C}, \mathcal{E}) = \underbrace{\sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} \min_{C \in \mathcal{C}} \big( 1 - \hat{J}(X_C, X_{\hat{C}}) \big)}_{\psi'_{of}(\mathcal{C}, \mathcal{E})} + \underbrace{\sum_{C \in \mathcal{C}} \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \min_{\hat{C} \in \hat{\mathcal{C}}} \big( 1 - \hat{J}(X_C, X_{\hat{C}}) \big)}_{\psi''_{of}(\mathcal{C}, \mathcal{E})}. \tag{7} \]
As formally shown in [6], both $\psi'_{of}$ and $\psi''_{of}$ suffer from conceptual drawbacks: the
optimization of $\psi'_{of}$ favors projective clustering solutions composed of clusters coming all
from the same input ensemble solution, while the optimization of $\psi''_{of}$ favors the
replication of the same projective cluster within the output consensus projective clustering.</p>
          <p>To solve these issues, an enhanced metacluster-based PCE formulation is proposed
in [6]: given a projective ensemble $\mathcal{E}$ defined over objects $\mathcal{D}$ and features $\mathcal{F}$, and an
integer $K &gt; 0$, find a projective clustering solution $\mathcal{C}^*$ such that $|\mathcal{C}^*| = K$ and
\[ \mathcal{C}^* = \arg\min\nolimits_{\mathcal{C}} \sum_{C \in \mathcal{C}} \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} x(\hat{C}, C) \, T(\hat{C}, C) \tag{8} \]
subject to
\[ \sum_{C \in \mathcal{C}} \Gamma_{C,o} = 1, \;\; \forall o \in \mathcal{D}; \qquad \sum_{f \in \mathcal{F}} \Delta_{C,f} = 1, \;\; \forall C \in \mathcal{C}; \tag{9} \]
\[ x(\hat{C}, C) \in \{0, 1\}, \qquad \sum_{C \in \mathcal{C}} x(\hat{C}, C) \leq 1, \qquad \sum_{\hat{C} \in \hat{\mathcal{C}}} x(\hat{C}, C) \geq 1, \qquad \forall \hat{\mathcal{C}} \in \mathcal{E}, \; \forall \hat{C} \in \hat{\mathcal{C}}, \; \forall C \in \mathcal{C}, \tag{10} \]
where $T(\hat{C}, C) = 1 - \hat{J}(X_{\hat{C}}, X_C)$ is the Tanimoto-based distance between the projective
matrices of $\hat{C}$ and $C$.</p>
          <p>The above (NP-hard [6]) problem is solved by E-CB-PCE, i.e., a more effective
and more efficient variant of the method devised for early metacluster-based PCE.</p>
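          <p>The optimization in (8)-(10) is a constrained assignment problem; the actual E-CB-PCE strategy is described in [6]. Purely as an illustration of the constraint structure, the following greedy sketch (ours, not the authors' algorithm; it assumes the tanimoto_matrix helper above and represents each cluster by its projective matrix) assigns ensemble clusters to K seed clusters:</p>
          <preformat>
def greedy_metaclusters(ensemble, seeds):
    """Greedily assign each ensemble cluster (given by its projective matrix)
    to one of the K seed matrices: every cluster joins no more than one
    metacluster, and each metacluster takes no more than one cluster from
    each ensemble solution, mirroring the x(.,.) constraints in (10)."""
    metaclusters = {k: [] for k in range(len(seeds))}
    for solution in ensemble:            # one projective clustering of E
        taken = set()                    # seeds already used for this solution
        for X_hat in solution:           # projective matrix of one cluster
            free = [k for k in range(len(seeds)) if k not in taken]
            if not free:
                break                    # more clusters than seeds: leave unassigned
            # closest still-free seed under the distance T = 1 - J_hat
            k = min(free, key=lambda k: 1 - tanimoto_matrix(X_hat, seeds[k]))
            metaclusters[k].append(X_hat)
            taken.add(k)
    return metaclusters
          </preformat>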
        </sec>
      </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>In this section we report the main experimental findings on the various state-of-the-art
PCE methods, i.e., MOEA-PCE [2,5], EM-PCE [2,5], E-EM-PCE [3], E-2S-PCE [3],
CB-PCE [4], and E-CB-PCE [6].</p>
      <p>Datasets. The evaluation was performed on 22 publicly-available benchmark datasets,
whose main characteristics are shown in Table 1. The first 15 datasets are from the UCI
Machine Learning Repository (http://archive.ics.uci.edu/ml), the next four are from the
UCR Time Series Classification/Clustering Page (http://www.cs.ucr.edu/~eamonn/time_series_data),
whereas the last three are synthetic datasets from http://dme.rwth-aachen.de/en/OpenSubspace/evaluation.</p>
      <p>[Table 1 here: main characteristics of the selected datasets: Iris, Wine, Glass, Ecoli, Yeast, Mult.-Feat., Segmentation, Abalone, Waveform, Letter, Isolet, Gisette, p53-Mutants, Amazon, Arcene, Shapes, Tracedata, ControlChart, Twopat, N30, D75, S2500.]</p>
      <p>Projective-ensemble generation. Projective ensembles were generated by running the
well-established projective-clustering algorithm LAC [1] on the selected datasets. The
diversity of the projective clustering solutions was ensured by randomly choosing the
initial centroids and varying LAC’s parameter h. For each dataset, we generated 10
different projective ensembles; all results were averaged over these 10 ensembles.</p>
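      <p>For illustration, a generation loop of this kind can be sketched as follows; run_lac is a hypothetical wrapper around a LAC implementation (no such function ships with [1]), and the grid of h values is assumed:</p>
      <preformat>
import random

def generate_ensembles(dataset, num_ensembles=10, ensemble_size=10):
    """Build projective ensembles for one dataset by perturbing LAC:
    random initial centroids plus a varying h parameter."""
    ensembles = []
    for _ in range(num_ensembles):
        ensemble = []
        for _ in range(ensemble_size):
            h = random.choice([1, 2, 4, 8, 16])   # assumed grid of h values
            seed = random.randrange(10**6)        # randomized centroid init
            # run_lac is hypothetical; it should return a projective
            # clustering solution, i.e., a list of (Gamma, Delta) pairs
            ensemble.append(run_lac(dataset, h=h, seed=seed))
        ensembles.append(ensemble)
    return ensembles
      </preformat>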
      <p>Assessment criteria. The quality of a projective consensus clustering $\mathcal{C}^*$ was assessed by
using both external and internal evaluations: the former compares $\mathcal{C}^*$ against a reference
classification, whereas the latter is based on the average similarity w.r.t. the solutions
of the input projective ensemble. Three variants of the classic F1-measure were used as
external criteria, aiming at comparing projective clustering solutions in terms of their
object-based representation ($F1_o$), feature-based representation ($F1_f$), or both ($F1_{of}$),
respectively. The same criteria were used for internal evaluation too. The difference in
this case is that the target consensus clustering $\mathcal{C}^*$ is compared against all solutions in the
input ensemble, and an aggregated score is ultimately derived from all individual scores.
Again, three variants were employed, accounting for the object-based representation ($\widehat{F1}_o$),
feature-based representation ($\widehat{F1}_f$), or both ($\widehat{F1}_{of}$), respectively. For a more detailed
definition of the $F1_o$, $F1_f$, $F1_{of}$, $\widehat{F1}_o$, $\widehat{F1}_f$, and $\widehat{F1}_{of}$ criteria, please refer to [6].</p>
      <p>Results. Table 2 shows the accuracy of all methods for each assessment criterion,
averaged on all the selected datasets (dataset-by-dataset scores are available in [6]). In particular, the table shows the average gain achieved
by the E-CB-PCE method, which was recognized as the most accurate one. E-CB-PCE
was more accurate than both MOEA-PCE and CB-PCE on 16 out of 22 datasets on
average, while achieving average gains up to 0.147 ($F1_{of}$ assessment criterion) and
0.078 ($F1_f$ assessment criterion) w.r.t. MOEA-PCE and CB-PCE, respectively. Larger
improvements were produced by E-CB-PCE w.r.t. the early single-objective PCE
methods, i.e., EM-PCE, E-EM-PCE, and E-2S-PCE.</p>
      <p>Table 3 shows the runtimes (in milliseconds) of the various algorithms involved in
the comparison. E-CB-PCE was slower than EM-PCE on most datasets, while clearly
outperforming MOEA-PCE. The runtimes of E-CB-PCE were one or two orders of
magnitude smaller than those of MOEA-PCE on average, up to four orders on Isolet.
Only on one dataset (Glass) was MOEA-PCE more efficient than E-CB-PCE, even
though the runtimes of the two methods remained of the same order of magnitude.
Compared to CB-PCE, E-CB-PCE was faster on 11 out of 22 datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>We provided an overview of research work on projective clustering ensembles (PCE).
We discussed the major reasons behind the formulation of PCE as a new topic in the
data clustering field, which required novel computational approaches and algorithmic
solutions. We formally presented a summary of existing formulations of PCE, and
reported major findings learned from experimental evaluation studies.</p>
      <p>Our research work paved the way for the development of data clustering
frameworks that can support a variety of current and emerging applications in several
domains of data management and analysis. For instance, in the complex network systems
area, one interesting direction is to exploit PCE for addressing problems of
dimensionality reduction and community detection in time-evolving attributed networks. Another
problem that can benefit from our framework is outlier detection in high-dimensional
data, which shares many of the same challenges as clustering. Some work has been done
on outlier ensembles (e.g., [10]) and on subspace outlier detection (e.g., [11]), but a
unified framework encompassing both does not exist.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hinneburg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keim</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>What Is the Nearest Neighbor in High Dimen8</article-title>
          .
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Algorithms for Clustering Data</article-title>
          . Prentice-Hall (
          <year>1988</year>
          )
          <article-title>9</article-title>
          .
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          , Kro¨ ger,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Zimek</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering</article-title>
          .
          <source>TKDD</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ), 1:
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          :
          <fpage>58</fpage>
          (
          <year>2009</year>
          )
          <fpage>10</fpage>
          .
          <string-name>
            <surname>Rayana</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akoglu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Less is more: Building selective anomaly ensembles</article-title>
          .
          <source>TKDD</source>
          <volume>10</volume>
          (
          <issue>4</issue>
          ),
          <fpage>11</fpage>
          .
          <string-name>
            <surname>Sathe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          :
          <article-title>Subspace outlier detection in linear time with randomized hashing</article-title>
          .
          <source>In: Proc. IEEE Int. Conf. on Data Mining (ICDM)</source>
          . pp.
          <fpage>459</fpage>
          -
          <lpage>468</lpage>
          (
          <year>2016</year>
          )
          <fpage>12</fpage>
          .
          <string-name>
            <surname>Strehl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>J.: Cluster</given-names>
          </string-name>
          <string-name>
            <surname>Ensembles - A Knowledge Reuse</surname>
          </string-name>
          <article-title>Framework for Combining</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>