1. Introduction

Correlation Clustering: from Local to Global Constraints

(Discussion Paper)

Domenico Mandaglio

Andrea Tagarelli

Francesco Gullo

1 0 DIMES Dept., University of Calabria , 87036 Rende (CS) , Italy 1 UniCredit , Rome , Italy

Given a set of data objects, consider that object pairs are assigned two weights expressing the advantage of putting those objects in the same cluster or in separate clusters, respectively. Correlation clustering partitions the input object set so as to minimize the sum of the intra-cluster negative-type weights plus the sum of the inter-cluster positive-type weights. Existing approximation algorithms provide quality guarantees if the weights are bounded in some way. Regardless of the type, the weight bounds that have been so far studied are local bounds, i.e., constraints that are required to hold for every object pair in isolation. In this paper, we discuss global weight bounds in correlation clustering, and in particular, we derive bounds on edge weights' aggregate functions that are suficient to lead to proved quality guarantees. Our formulation extends the range of applicability of the most prominent existing correlationclustering algorithms thus providing benefits, both theoretical and practical. Also, we showcase our results in a real-world scenario of feature selection for fair clustering.

eol>min-disagreement correlation clustering probability constraint fair clustering

1. Introduction

Correlation clustering [ 1 ] is an important clustering formulation that has received considerable attention from theoreticians and practitioners, and found application in several contexts [ 2 ].

Given a set of objects and nonnegative real weights expressing “positive” and “negative” feeling of clustering any two objects together, min-disagreement correlation clustering (Min-CC) partitions the input object set so as to minimize the sum of the intra-cluster negative-type weights plus the sum of the inter-cluster positive-type weights. Min-disagreement correlation clustering is APX-hard, but eficient constant-factor approximation algorithms exist if the weights are bounded in some way. The weight bounds tackled so far in the literature are said local, as they are required to hold for every single object pair.

In this paper, we discuss the main theoretical and experimental results from our study in [ 3 ], where we introduced the problem of min-disagreement correlation clustering with global weight bounds, i.e., constraints to be satisfied by the input weights altogether. Our main contribution is a suficient condition that establishes when any algorithm achieving a certain Input: Graph = (, ); nonnegative weights +, − , ∀ ∈ Output: Clustering of 1: ← ∅ , ′ ← 2: while ′ ̸= ∅ do 3: pick a pivot vertex ∈ ′ uniformly at random 4: add = {} ∪ { ∈ ′ | (, ) ∈ , + > −} to and remove from ′ approximation under the probability constraint keeps the same guarantee on an input that violates the constraint. This extends the range of applicability of the most prominent existing correlation-clustering algorithms, including the popular Pivot, thus providing both theoretical and practical benefits. Experiments have shown the usefulness of our approach, in terms of both worthiness of employing existing eficient algorithms, and guidance on the definition of weights from feature vectors in a task of fair clustering.

2. Correlation Clustering with local weight bounds

The input of correlation clustering is a set of objects, and two nonnegative, real-valued weights +, − for every (unordered) object pair , ∈ . Any “positive” + (resp. “negative” −) weight expresses the benefit of clustering and together (resp. separately). This input can equivalently be represented as a graph with vertex set and edge weights +, −, for all , ∈ , and with edge (, ) being drawn only if at least one among + and − is nonzero.

In this work we tackle the problem of min-disagreement correlation clustering: Problem 1 (Min-CC [ 4 ]). Given an undirected graph = (, ), with vertex set and edge set ⊆ × , and nonnegative weights +, − ∈ R0+ for all edges ∈ , find a clustering (i.e., an injective function expressing cluster-membership) : → N+ that minimizes ∑︁ − + ∑︁

+. (,)∈,()=() (,)∈,()̸=() (1)

For the sake of presentation, we assume + = − = 0, for all ∈/ , and non-trivial Min-CC instances, i.e., + ̸= − , for some ∈ .

Min-CC is NP-hard [ 1, 5 ] yet dificult to approximate, being it APX-hard even for complete graphs and edge weights (+, − ) ∈ {(0, 1), (1, 0)}, ∀ ∈ [ 6 ]. For general (i.e., not necessarily complete) graphs and unconstrained weights, the best known approximation factor is (log | |) [ 6, 7 ]. This factor improves if restrictions on edge weights are imposed. The one that has received remarkable attention is the probability constraint (pc): A Min-CC instance is said to satisfy the probability constraint if + + − = 1, for all pairs , ∈ .

A Min-CC instance obeying the pc necessarily corresponds to a complete graph (otherwise, any missing edge would violate the pc). Under the pc, Min-CC admits constant-factor guarantees. The best known approximation factor is 4, achievable – as shown in [ 8 ] – by Charikar et al.’s algorithm [ 6 ]. That algorithm is based on rounding the solution to a large linear program (with a number Ω(| |3) of constraints), thus being feasible only on small graphs. Algorithm 2 GlobalCC Input: Graph = (, ); nonnegative weights +, − , ∀ ∈ , satisfying Theorem 1; algorithm A achieving -approximation guarantee for Min-pc-CC Output: Clustering of 1: choose , s.t. ∈ [Δ, + + − ] {Theorem 1} 2: compute = (+ + −) − , ∀, ∈ 3: compute ± = 1 (︀ ± − 2 )︀ , ∀, ∈ (using , defined in Step 1) 4: ← run A on Min-pc-CC instance ⟨′ = (, × ), { +, − }∈ × ⟩

Here, we are particularly interested in the Pivot algorithm [ 4 ], due to its theoretical properties – it achieves a factor-5 expected guarantee for Min-CC under the pc – and practical benefits – it takes (||) time, and is easy-to-implement. Pivot simply picks a random vertex , builds a cluster as composed of and all the vertices such that an edge with + > − exists, and removes that cluster. The process is repeated until the graph has become empty (Algorithm 1).

3. Correlation Clustering with global weight bounds

The weight bounds that have been so far studied are local bounds, i.e., constraints that are required to hold for every object pair in isolation. In this work, we are the first to consider global weight bounds in min-disagreement correlation clustering. We derive bounds on edge weights’ aggregate functions that are suficient to lead to proved quality guarantees. More specifically, for a Min-CC instance ⟨ = (, ), {+, − }∈⟩ we define: + = ︁( |2 | )︁ − 1 ∑︀∈ +, − = ︁( |2 | )︁ − 1 ∑︀∈ − , Δ = max∈ |+ − − |

The main theoretical result of this work is described by the following theorem. Theorem 1. If the condition + + − ≥ Δ holds for a Min-CC instance , then it is possible to construct a Min-CC instance ′ (in linear time and space) such that () the probability constraint holds on ′, and () an -approximate clustering on ′ (i.e., a clustering whose objectivefunction value is no more than times ′’s optimum) is an -approximate clustering on too.

Let Min-pc-CC denote the version of Min-CC operating on instances that satisfy the pc. According to Theorem 1, if + + − ≥ Δ holds for a Min-CC instance, then any -approximation algorithm A for Min-pc-CC can be employed – as a black box – to get an -approximate solution to that Min-CC instance. The algorithm for doing so is simple: given a Min-CC instance , get a Min-pc-CC instance via strict approximation-preserving (sap) reduction, and run the black-box algorithm A on it (Algorithm 2). Being the reduction sap, Algorithm 2 on input ⟨, A⟩ achieves factor- guarantee on .

A noteworthy consequence of this result is that, if a Min-CC instance satisfies our condition, then the Pivot algorithm can be used to get (in linear time and space) a clustering achieving a 5-approximation guarantee on .1 This corresponds to extending the range of validity of 1A probability-constraint-compliant Min-CC instance ′ is derivable from in linear time and space (cf. Th. 1 ()). Pivot on ′ yields a 5-approximate clustering [ 4 ]. A 5-approximate clustering on ′ is a 5-approximate clustering on (cf. Th. 1 ()). Pivot’s guarantee beyond the probability constraint: our global-weight-bounds condition now sufices for the 5-approximation to hold. A key advantage is that our condition is more likely to be satisfied than the probability constraint; for instance, it may happen that a bunch of edges are missing in the graph (thus violating the probability constraint), but, if our condition holds, one can get a 5-approximate clustering with Pivot. We point out that our result is general and holds for any Min-CC algorithm achieving approximation guarantees under the probability constraint. However, the contextualization to the Pivot algorithm is relevant and worth to be exploited, since Pivot achieves the best tradeof between quality guarantees and eficiency.

4. Analysis of the global-weight-bounds condition

Our result can be exploited to quickly yet easily recognize whether employing probabilityconstraint-aware approximation algorithms is a worth choice even if the probability constraint is not met. As an example, consider a graph that violates the probability constraint. So far, that graph would have likely been handled with linear-programming (LP) algorithms [ 6, 7 ], as they achieve (factor-(log | |)) approximation guarantees on general graphs/weights (whereas algorithms like Pivot are just heuristics if the probability constraint does not hold). Instead, our condition can be used as an indicator of whether Pivot can still achieve guarantees even if the probability constraint is violated, thus being preferred over the less eficient LP algorithms. Our experiments confirmed this theoretical finding, i.e. a better fulfilment of our condition corresponds to better performance of Pivot with respect to the LP algorithms, and vice versa. Settings. We selected four real-world graphs, namely Karate, Dolphins, Adjnoun, and Football.2 Note that the small size of such graphs is not an issue because this evaluation stage involves, among others, LP correlation-clustering algorithms, whose Ω(| |3) time complexity makes them unafordable for graphs larger than that. We augmented these graphs with artificially-generated edge weights, to test diferent levels of fulfilment of our global-weight-bounds condition stated in Theorem 1. We controlled the degree of compliance of the condition by a target ratio parameter, defined as = Δ/(+ + − ). The condition is satisfied if and only if ∈ [ 0, 1 ], and smaller target-ratio values correspond to better fulfilment of the condition, and vice versa.

Given a desired target ratio, edge weights are generated as follows. First, all weights are drawn uniformly at random from a desired [, ] range. Then, the weights are adjusted in a two-step iterative fashion, until the desired target ratio is achieved: () the maximum gap Δ is fixed, the weights are changed for pairs that do not contribute to Δ so as to reflect a change in +, − ; () +, − are fixed, Δ is updated by randomly modifying pairs that contribute to Δ. Finally, weight pairs are randomly assigned to the edges.

We compared the performance of Pivot (Algorithm 1 [ 4 ]) to one of the state-of-the-art algorithms achieving factor-(log | |) guarantee on general graphs/ weights [ 6 ]. We dub the latter LP+R, alluding to the fact that it rounds the solution of a linear program. We evaluated correlation-clustering objective, number of output clusters, and runtimes of these algorithms. Results. Figure 1 shows the quality (i.e., Min-CC objective) of the produced clusterings, with the bottom-left insets reporting the ratio between the performance of Pivot and LP+R. Results refer to target ratios varied from [ 0, 3 ], with stepsize 0.1, and weights generated with = 0, = 1.

2Publicly available at http://konect.cc/networks/

For each target ratio, all measurements correspond to averages over 10 weight-generation runs, and each of such runs corresponds to averages over 50 runs of the tested algorithms.

The main goal here is to have experimental evidence that a better fulfilment of our global condition leads to Pivot’s performance closer to LP+R’s one, and vice versa. Figure 1 confirms the above: in all datasets, Pivot performs more closely to LP+R as the target ratio gets smaller. In general, Pivot performs similarly to LP+R for ∈ [ 0, 1 ], while being outperformed for > 1. This conforms with the theory: on these small graphs, factor-5 Pivot’s approximation is close to factor-(log | |) LP+R’s approximation. Pivot achieves the best performance on Football, where it outperforms LP+R even if the condition is not met. This is motivated by Football’s higher clustering coeficient and average degree, which help Pivot sample vertices (and, thus, build clusters) in dense regions of the graph. This is confirmed by the number of clusters (Table 1-(left)): Pivot yields more clusters than LP+R on all datasets but Football.

Concerning execution times, Pivot runs in less than one second, and as expected, it is extremely faster than LP+R, whose runtimes are about 2 seconds in Karate, 37 in Dolphins, and above 770 in Adjnoun and Football.3 The ineficiency of LP+R further emphasizes the importance of our result in extending the applicability of faster algorithms like Pivot.

We complement this stage of evaluation by testing diferent graph densities, and for target ratios = 1 (borderline satisfaction of our condition) and = 20 (far fulfilment of the condition). Again, the results (not shown) meet the expectations: in terms of clustering quality, Pivot performs closely to or better than LP+R for = 1, while the opposite happens for = 20. Denser graphs correspond to better Pivot performance. This is again explained since higher densities favor better Pivot’s random choices. Runtimes are not afected by graph density. This is expected as well, as LP+R runtimes are dominated by the time spent in building and solving the linear program, which depends on the number of vertices only, whereas variations in the runtimes of Pivot cannot be observed due to the small size of the datasets at hand.

5. Application to fair clustering

An important exploitation of our theoretical results concerns the selection of features that lead edge weights to express the best tradeof between an accurate representation of the objects and the suitability of the correlation-clustering weights to ensure approximation guarantees. Our global-weight-bounds condition can be an efective way to the achievement of this tradeof, and it can be fulfilled more easily than local weight bounds (e.g., in case of probability constraint, it

3Experiments were carried out on the Cresco6 cluster (https://www.eneagrid.enea.it)

Let be a set of objects defined over a set of attributes , which is assumed to be partitioned into two sets: , which contains fairness-aware or sensitive attributes (e.g., gender, race, religion), and ¬ , which includes the remaining, non-sensitive attributes. We also assume that part of the attributes might be numerical, and the others as categorical; we will use superscripts and to distinguish the two types, therefore = ∪ and ¬ = ¬ ∪ ¬ .

We consider a twofold fair-clustering objective: cluster the objects such that () the intracluster similarity and the inter-cluster similarity are maximized and minimized, respectively, according to the non-sensitive attributes; () the intra-cluster similarity and the inter-cluster similarity are minimized and maximized, respectively, according to the sensitive attributes. Pursuing this second objective would help distribute similar objects (in terms of sensitive attributes) across diferent clusters, thus helping the formation of diverse clusters. This is beneficial to ensure that the distribution of groups defined on sensitive attributes within each cluster approximates the distribution across the dataset.

The task of fair clustering can be mapped to a Min-CC instance where the positive-type and negative-type weights, respectively, can be defined as follows:

︁( + := + ¬ · ¬ (, ) + (1 − ¬ ) ·

¬ (, ) ︁( − := − ·

(, ) + (1 − ) · (, ) ︁) ︁) (2) (3) |¬ |) − 1) and −

= (|¬ |/(| | + |¬ |) − similarities proportionally to the size of the involved set of attributes, + = (| where = | |/(| |+| |) and ¬ = |¬ |/(|¬ |+|¬ |) are coeficients to weight |/(|

| +

1) are smoothing factors to penalize

correlation-clustering weights that are computed on a small number of attributes (which is any object similarity function defined over the subspace of the attribute set. usually the case for sensitive attributes, and hence negative-type weights), and (· ) denotes Problem 2 (Attribute Selection for Fair Clustering). Given a set of objects over the attribute sets , ¬ , find maximal subsets ⊆ and ¬ ⊆ ¬ , with | | ≥ 1, |¬ | ≥ 1, s.t. the weights computed by Eqs. (2)–(3) satisfy the global-weight-bounds condition in Th. 1. Heuristics. Our first proposal to solve Problem 2 is a greedy heuristic, dubbed

Greedy, which

iteratively removes the attribute that leads to the correlation-clustering weights with the lowest target ratio until our global condition is satisfied. This algorithm runs in (| |2||2) time since, at each iteration, for each candidate attribute to be removed (| |2) similarities are computed to quantify the decrease of the target ratio. We also devised other heuristics which, like Greedy, remove one attribute at time, but exploit some easy-to-compute proxy measures to select the attribute that avoid the pairwise similarity computation for each candidate attribute. The Hlv (resp. Hmv) heuristic removes the least (resp. most) variable attribute where the variability is measured through normalized entropy for categorical attributes and with variation coeficient (capped to 1 if above 1) for numerical features. Hlv_B and Hmv_B, like the previous two heuristics, remove the least and most variable attribute, respectively, but the selection is constrained to the biggest set of features among and ¬ , in order to try to balance their size. Hlv_BW removes the least variable attribute from the set ( or ¬ ) which induces the highest average similarity value using the current weights, whereas Hmv_SW selects the most variable attribute from the set which induces the lowest average similarity value using the current weights. Note that all these heuristics (but Greedy) run in (| |2||) time. Data and results. We considered 4 relational datasets: CreditCardCustomers, Adult, Bank, and Student.4 For each of them, Table 1 shows the number of objects, a pair of values corresponding to the count of non-sensitive and sensitive attributes, and a description of the latter.

Table 2 summarizes results achieved by each of the above heuristics, on the various datasets, according to the following criteria (columns from left to right): number of iterations at convergence, target ratio, percentage of pairs , having + > −; also, computed w.r.t. the full attribute space are: value of the objective function, average Euclidean fairness [ 9 ], average number of clusters, intra-cluster and inter-cluster similarities according to either the subset of sensitive attributes or the subset of non-sensitive attributes, and running time. Euclidean and Jaccard similarity functions are used for numerical and categorical attributes, resp., and the overall similarity is obtained by linear combination analogously to Eqs. (2)–(3). Note that higher values correspond to better performance for -based intra-cluster and ¬ -based inter-cluster similarities, while the opposite holds for the other two measures and the Euclidean fairness. The ifrst row in each table refers to the initial, full-attribute-space status of the relational network, as a baseline, whereby the global-weight-bounds condition is not satisfied.

Hlv_BW and Hlv_B tend to produce solutions that correspond to the highest (i.e., worst) value of the objective function and of the clustering size; this should be ascribed to the fact 4https://www.kaggle.com/sakshigoyal7/credit-card-customers; https://archive.ics.uci.edu/ml/index.php that both heuristics favor the selection of the least variable attributes. By contrast, Hmv_SW is the best performing in terms of objective function and Euclidean fairness. This method also tends to produce very few clusters. Note that, while a higher number of clusters is found to be coupled with a worsening of the objective function, the opposite does not hold in general; also, in contrast to the intuition that a higher percentage of pairs having + > − should favor the grouping into fewer clusters, we observed that the clustering sizes are not necessarily ordered as with the percentage ordering. As far as eficiency, Hmv_B mostly provides the best time performance. While being one order of magnitude slower, Greedy tends to converge in less iterations, as it indeed removes fewer attributes than the other methods; we found that in some cases (e.g., Student, Adult, results not shown), this allows Greedy for compensating its expected higher cost per iteration. Notably, each method lowers the initial target ratio below 1 so as to satisfy the global condition, and the per-dataset best-performing method (not shown) improves all intra-/inter-cluster similarities and Euclidean fairness w.r.t. the baseline.

6. Conclusions

We discussed a novel perspective in correlation clustering, which considers global weight bounds. A suficient condition is defined to extend the range of validity of approximation guarantees beyond local weight bounds, such as the probability constraint. Experimental results, including a case study in fair clustering, put in evidence of the usefulness of our approach.

One interesting future direction we plan to investigate is to define the edge weights (bounds) based on relational properties of the objects under an uncertain data modeling framework [ 10 ].

[1]

Bansal ,

Blum ,

Chawla , Correlation clustering, Mach. Learn . 56 ( 2004 ) 89 - 113 .

[2]

Bonchi ,

García-Soriano , E. Liberty, Correlation clustering: from theory to practice , in: Proc. ACM KDD Conf. , 2014 , p. 1972 .

[3]

Mandaglio ,

Tagarelli ,

Gullo , Correlation Clustering with Global Weight Bounds , in: Proc. ECML-PKDD , 2021 , pp. 499 - 515 . doi: 10 .1007/978-3- 030 -86520-7\_ 31 .

[4]

Ailon ,

Charikar ,

Newman , Aggregating inconsistent information: Ranking and clustering , JACM 55 ( 2008 ) 23 : 1 - 23 : 27 .

[5]

Shamir ,

Sharan ,

Tsur , Cluster graph modification problems , Discret. Appl. Math. 144 ( 2004 ) 173 - 182 .

[6]

Charikar ,

Guruswami ,

Wirth , Clustering with qualitative information , JCSS 71 ( 2005 ) 360 - 383 .

[7]

E. D.

Demaine ,

Emanuel ,

Fiat ,

Immorlica , Correlation clustering in general weighted graphs , TCS 361 ( 2006 ) 172 - 187 .

[8]

G. J.

Puleo ,

Milenkovic , Correlation clustering with constrained cluster sizes and extended weights bounds , SIAM J. Optim . 25 ( 2015 ) 1857 - 1872 .

[9]

S. S.

Abraham , D. P,

S. S.

Sundaram , Fairness in clustering with multiple sensitive attributes , in: Proc. EDBT Conf ., 2020 , pp. 287 - 298 .

[10]

Gullo ,

Ponti ,

Tagarelli ,

Greco , An information-theoretic approach to hierarchical clustering of uncertain data , Inf. Sci . 402 ( 2017 ) 199 - 215 . doi: 10 .1016/j.ins. 2017 . 03 . 030.