Reliable Clustering with Applications to Data Integration

2020

ion of crowdsourcing. Second, community detection applications are hard to evaluate in real-world scenarios due to the lack of ground truth data. We propose a generative model to capture interactions between records that belong to different clusters and devise techniques for efficient cluster recovery. Third, bias in data can arise from discriminatory treatment of marginalized groups, sampling methods or even measurement errors. We study the impact of this bias on generated clusters and develop techniques that guarantee fair representation of different groups. We prove the noise tolerance of our algorithms and back the theory by demonstrating their efficacy and efficiency on various real-world datasets for these applications.

1. INTRODUCTION

With advances in machine learning and the availability of vast amounts of data, Artificial Intelligence based systems are allowed to make autonomous decisions. Software already makes decisions about who gets a loan [24], hiring [1], self-driving car actions that may lead to property damage or human injury [22], medical diagnosis and treatment [28], and every stage of the criminal justice system, including arraignment and sentencing, which determine who goes to jail and who is set free [6]. The stakes of these decisions make the fairness and quality of the employed algorithms critical.

A number of these real-world applications employ entity resolution, community detection, taxonomy construction and outlier detection as key constituents. Clustering is one of the fundamental techniques commonly used to formally study these components. It has been studied for many decades and remains a challenging task that has evolved over time. In the modern era of big data, noise, bias and poor quality data have adversely affected the quality of traditional clustering techniques. Along with quality, it is very important to improve their scalability so that they can run on web-scale datasets. Additionally, clustering is generally an unsupervised task and suffers from a lack of ground truth data for effective evaluation. There has been a lot of interest in devising generative models that simulate real-world interaction between records of different clusters and can be used to benchmark various techniques. In this work, we focus on these different facets of clustering along with its applications towards data integration. Table 1 presents a summary of our contributions.

1.1 Clustering using supervision

Clustering is an intricate problem, especially in the absence of domain knowledge, and the final set of clusters identified using automated techniques can be highly inaccurate and noisy. There has been a lot of recent interest in leveraging humans to answer pairwise queries of the form 'do u and v belong to the same optimal cluster?'. Since humans have much more context and domain knowledge, they can answer such queries quite easily. For this reason, many frameworks have been developed that leverage humans (abstracted as an oracle) to perform entity resolution, one of the traditional applications of oracle-based clustering techniques in data integration.

Table 1: Summary of our contributions.

Robust and Scalable
    Oracle-based Clustering: Entity Resolution
    Semantic concept identification and Feature Enrichment
Generative Model
    Geometric Block Model
Fair and Interpretable
    Fair Correlation Clustering
    Interpretable k-center Clustering

Entity Resolution refers to the task of identifying all records that refer to the same entity. It is one of the classical data management problems and has been studied since the seminal work of Fellegi and Sunter in 1969 [12]. The explosion of data sources has aggravated the presence of duplicates in datasets, elevating the importance of Entity Resolution (abbreviated as ER and often referred to as deduplication). Web-scale algorithms for deduplication and organization of data are the need of the hour. ER has evolved from using rule-based systems to using human annotators for expert guidance. In traditional settings, the goal of ER was to match records obtained from two data sources; it has now evolved to identifying clusters of records that refer to the same entity. The heterogeneity of data sources has raised the amount of noise in these datasets and motivated the study of ER scalability. There has been limited work on holistic approaches to identify entities across multiple sources. We develop techniques that are able to resolve entities in datasets with varied cluster distributions and noise levels. To achieve this goal, we make the following contributions.

Robustness. The oracle's answers can have low accuracy, depending on the difficulty of the query. Prior oracle-based clustering techniques [29, 30, 15] assumed that all the answers returned by the oracle are correct and hence constructed a spanning tree over the queried edges to identify all the matching pairs. In the absence of noise, this is sufficient due to transitivity (if u, v refer to the same entity and v, w refer to the same entity, then u, w can be inferred to refer to the same entity), but it leads to a very poor F-score of the generated clusters even at low error rates. We propose a cost-effective approach [16, 14] that can be added as an extra layer to any oracle-based strategy, helping to preserve the performance guarantees of [31, 29, 15] along with high precision. Instead of constructing a spanning tree over the records, our approach strengthens all the cuts by constructing sparse graphs with strong connectivity properties. We achieve this with the help of expander graphs [5] and prove precision guarantees for our technique. The error correction layer can be tuned (or even turned off), trading off budget for accuracy, thereby providing the flexibility to adapt to different ER applications. In order to efficiently leverage this toolkit, we propose an adaptive technique that changes the connectivity strength of the queried graph based on the noise in the results and the prior similarity of record pairs. We empirically demonstrate that our technique achieves a high F-score over different real-world datasets.
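
To illustrate why inference over single edges is fragile, the following minimal Python sketch contrasts two merge rules on a toy dataset with one wrong oracle answer. It is not the method of [16, 14]: the toy data, the helper names (noisy_oracle, cluster_by_single_edges, cluster_by_majority) and the majority vote over all cross pairs are illustrative assumptions, and the vote is only a stand-in for the expander-based construction, which queries far fewer pairs.

```python
from itertools import combinations

# Toy ground truth: two entities with three records each (illustrative data).
TRUE_ENTITY = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
# One erroneous oracle answer: a3 and b1 are wrongly reported as matching.
WRONG_PAIRS = {frozenset(("a3", "b1"))}

def noisy_oracle(u, v):
    """Answer 'same entity?' and flip the answer on the designated wrong pairs."""
    truth = TRUE_ENTITY[u] == TRUE_ENTITY[v]
    return (not truth) if frozenset((u, v)) in WRONG_PAIRS else truth

def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def groups(parent):
    out = {}
    for r in parent:
        out.setdefault(find(parent, r), []).append(r)
    return sorted(sorted(g) for g in out.values())

def cluster_by_single_edges(records, oracle):
    """Merge on any single 'yes' answer, i.e. trust transitivity blindly."""
    parent = {r: r for r in records}
    for u, v in combinations(records, 2):
        if oracle(u, v):
            parent[find(parent, u)] = find(parent, v)
    return groups(parent)

def cluster_by_majority(records, oracle):
    """Merge two clusters only if most cross pairs are answered 'yes'."""
    clusters = [[r] for r in records]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            votes = [oracle(u, v) for u in clusters[i] for v in clusters[j]]
            if sum(votes) * 2 > len(votes):
                clusters[i] += clusters.pop(j)
                merged = True
                break
    return sorted(sorted(c) for c in clusters)

RECORDS = sorted(TRUE_ENTITY)
single = cluster_by_single_edges(RECORDS, noisy_oracle)
robust = cluster_by_majority(RECORDS, noisy_oracle)
```

On this input, single collapses both entities into one cluster because of the one wrong answer, while robust keeps them apart. The majority rule is itself still fooled when a wrong answer arrives between two singleton clusters, which is one reason it should be read only as an illustration of cut strengthening.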

Scalability. ER is generally preceded by blocking, a pre-processing step for handling large-scale datasets. Blocking is the first step and selects a sub-quadratic number of record pairs to compare in the subsequent steps. It groups similar records into blocks and then selects pairs from the "cleanest" blocks (i.e., those with fewer non-matching pairs) for further comparisons in the pair matching phase. The literature is rich with methods for building and processing blocks [25], but depending on the data, blocking techniques are either (a) so aggressive that they help scale but adversely affect ER accuracy, or (b) so permissive that they potentially harm ER efficiency. Due to these limitations, blocking requires tuning for each dataset and is one of the most time-consuming components of the pipeline.
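
As a rough illustration of the blocking step described above, the sketch below builds token-based blocks and emits candidate pairs only from small blocks. The record texts, the function names and the max_block_size parameter are hypothetical, and block size is used here merely as a crude proxy for block cleanliness; this is not the progressive blocking method of [17].

```python
from collections import defaultdict
from itertools import combinations

def token_blocks(records):
    """Map each lowercase token to the ids of the records containing it."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)
    return blocks

def candidate_pairs(blocks, max_block_size):
    """Emit pairs from small blocks only; larger blocks are skipped entirely."""
    pairs = set()
    for ids in blocks.values():
        if 2 <= len(ids) <= max_block_size:
            pairs.update(combinations(sorted(ids), 2))
    return pairs

# Illustrative records: 1 and 2 describe one product, 3 and 4 another.
RECORDS = {
    1: "Apple iPhone 12 black",
    2: "iPhone 12 by Apple",
    3: "Samsung Galaxy S20 black",
    4: "Galaxy S20 Samsung phone",
}
pairs = candidate_pairs(token_blocks(RECORDS), max_block_size=2)
```

Here three of the six possible pairs survive: the two true matches and the non-match (1, 3), which shares only the token "black". Lowering or raising max_block_size moves the sketch between the aggressive and permissive regimes mentioned in the text, which is the kind of per-dataset tuning the paragraph refers to.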

We propose a new methodology, progressive blocking [17], that overcomes the above limitations by self-regulating blocking and adapting to the properties of each dataset, with no configuration effort. Our approach performs blocking and matching in tandem: pair matching results are fed back to the blocking step to refine and improve its quality. We demonstrate that our technique achieves the best trade-off between the quality of the final results and ER efficiency for a variety of million-scale datasets.

As future work, we are planning to extend oracle-based techniques to perform hierarchical clustering. Hierarchical clustering techniques are very useful for constructing taxonomies, analyzing phylogenetic trees and constructing product catalogs. In this setting, we assume that all the leaf-level records are known and the goal is to organize these records in the form of a type-subtype hierarchy. A pairwise oracle query between two leaf-level records is not sufficient to construct the hierarchy. Therefore, we consider a triplet query consisting of three records, for which the oracle identifies the pair of nodes that are closer to each other than to the third node. The oracle output provides local evidence of the hierarchy and helps uncover its structure. One of the key challenges in this line of work is to efficiently identify a small set of queries that can help recover the hierarchy. For a dataset of n records, the total number of possible triplet queries is O(n^3), and enumerating all such queries is infeasible for million-scale datasets. We leverage pairwise similarities as guidance to quickly identify the most beneficial triplet queries. Our algorithm maintains a hierarchy of all the processed records and iteratively processes each node with the help of the already identified beneficial queries. We show that our technique is able to construct the hierarchy with O(n log n) queries under reasonable assumptions on the similarity distribution. This work is in progress and we are currently evaluating the quality of our techniques with respect to other baselines.
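
The triplet query and the gap between the two query counts can be made concrete with a short Python sketch. The toy hierarchy PARENT and the function names are hypothetical, and the oracle is simulated from a known tree by preferring the pair whose lowest common ancestor is deepest; constants hidden by the O-notation are ignored in the final comparison.

```python
from math import comb, log2

# Hypothetical ground-truth hierarchy, stored as child -> parent.
PARENT = {"root": None, "animal": "root", "plant": "root",
          "cat": "animal", "dog": "animal", "oak": "plant"}

def ancestors(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lca_depth(a, b):
    """Depth of the lowest common ancestor of a and b (the root has depth 0)."""
    above_b = set(ancestors(b))
    for node in ancestors(a):
        if node in above_b:
            return len(ancestors(node)) - 1
    raise ValueError("nodes are not in the same tree")

def triplet_oracle(a, b, c):
    """Return the pair that is closest in the hierarchy (first pair on ties)."""
    return max([(a, b), (a, c), (b, c)], key=lambda pair: lca_depth(*pair))

closest = triplet_oracle("cat", "dog", "oak")

n = 1_000_000
all_triplets = comb(n, 3)      # number of distinct triplet queries
query_budget = n * log2(n)     # order of n log n, constants ignored
```

For one million records, all_triplets exceeds 10^17 while query_budget is on the order of 10^7, which is why exhaustive enumeration is ruled out in the text.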

In addition to oracle-based clustering techniques, we are exploring the use of semantic knowledge, available in the form of knowledge graphs, to identify clusters of web tables and columns that refer to the same concept [18]. Given the scale of data available on the web and the amount of noise and missing information, identifying these clusters is quite challenging. To achieve this goal, we propose an index structure that uses semantic knowledge graphs to quickly identify the distribution of concepts for a particular column. Currently, our index supports text-based attributes but does not work for numerical attributes like population, year, age, etc. Identifying clusters of numerical columns requires additional context from the metadata and other co-occurring columns. We are developing a unified framework to identify semantically coherent clusters of columns and further use these for applications like dataset discovery, feature enrichment, improving search, etc.

2. ABSENCE OF GROUND TRUTH

In this section, we discuss clustering through the lens of community detection over social networks. A plethora of techniques are used to identify clusters of records belonging to the same community. However, all these datasets suffer from the scarcity of ground truth data. To circumvent this drawback, generative models have been proposed to model the interaction between records of different communities. These models help benchmark how well known clustering techniques identify clusters.

The stochastic block model (SBM) is one of the most popular random graph models and generalizes Erdős-Rényi graphs. According to SBM, the edge between each pair of nodes is drawn independently at random, with probability p if the endpoints belong to the same cluster and q if they belong to different clusters. One aspect that SBM does not capture is the 'transitivity rule' (friends having common friends), which is inherent to the formation of communities over social networks. Intuitively, if two nodes x, y are connected by an edge and y, z are connected by an edge, then it is more likely than not that x, z are connected by an edge. Inspired by this, we proposed the geometric block model [20, 19], which models community formation according to random geometric graphs. One key distinction from SBM is that it considers correlated edge formation, capturing the transitivity rule. We empirically validated the model over collaboration networks and co-purchase networks.
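
The contrast between the two models can be sketched in a few lines of Python. The SBM sampler follows the description above directly. The geometric sampler is a simplified illustration under stated assumptions: nodes are placed uniformly on a circle of circumference 1, and the two distance thresholds r_in and r_out are names chosen here for illustration, not notation taken from [20, 19].

```python
import random
from itertools import combinations

def sample_sbm(labels, p, q, rng):
    """SBM: every edge is drawn independently, with probability p or q."""
    edges = set()
    for u, v in combinations(range(len(labels)), 2):
        prob = p if labels[u] == labels[v] else q
        if rng.random() < prob:
            edges.add((u, v))
    return edges

def sample_geometric(labels, r_in, r_out, rng):
    """Geometric sketch: connect nodes whose circular distance is within a radius."""
    pos = [rng.random() for _ in labels]      # random position on the unit circle
    edges = set()
    for u, v in combinations(range(len(labels)), 2):
        d = abs(pos[u] - pos[v])
        d = min(d, 1.0 - d)                   # wrap-around distance, at most 0.5
        radius = r_in if labels[u] == labels[v] else r_out
        if d <= radius:
            edges.add((u, v))
    return edges

LABELS = [0] * 50 + [1] * 50                  # two communities of 50 nodes
rng = random.Random(7)
sbm_edges = sample_sbm(LABELS, p=0.3, q=0.02, rng=rng)
gbm_edges = sample_geometric(LABELS, r_in=0.15, r_out=0.01, rng=rng)
```

In the geometric sketch, two neighbors of the same node are necessarily close to each other on the circle, so edges that share an endpoint are correlated; in the SBM sampler each edge is decided by its own independent coin flip.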

We observed that traditional techniques developed for cluster recovery in SBM could not be used for the geometric block model. We proposed a simple motif-based counting algorithm to identify clusters and showed that it is optimal up to a constant fraction. We tested the effectiveness of our algorithm at recovering clusters over various real-world and synthetic datasets.
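
One way to picture a motif-based counting approach is the following Python sketch, which uses triangles as the motif: an edge is kept only if its endpoints have enough common neighbors, and the connected components of the kept edges are reported as clusters. The function name and the min_common threshold are hypothetical; the actual counting rule and thresholds analyzed in [20, 19] are not reproduced here.

```python
from collections import defaultdict

def recover_by_triangle_counts(edges, min_common):
    """Keep edges whose endpoints share >= min_common neighbors, then take components."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    kept = {node: set() for node in adj}
    for u, v in edges:
        if len(adj[u] & adj[v]) >= min_common:   # triangles through edge (u, v)
            kept[u].add(v)
            kept[v].add(u)

    seen, clusters = set(), []
    for start in sorted(kept):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                              # depth-first search over kept edges
            x = stack.pop()
            component.append(x)
            for y in kept[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        clusters.append(sorted(component))
    return clusters

# Two 4-cliques joined by a single bridge edge (3, 4); illustrative input.
clique_a = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
clique_b = [(4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
clusters = recover_by_triangle_counts(clique_a + clique_b + [(3, 4)], min_common=1)
```

The bridge edge lies in no triangle and is discarded, so the two cliques are returned as separate clusters.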

3. FAIRNESS AND INTERPRETABILITY

There are countless examples where the use of biased systems has led to disastrous consequences. Clustering techniques are used in various applications with societal impact, such as team formation and community detection. Despite their importance, there has been little work on improving the fairness and interpretability of these algorithms. We consider different clustering techniques and devise scalable methods to improve their fairness and interpretability.

Correlation Clustering. Correlation clustering, introduced by Bansal, Blum and Chawla in 2004 [7], has received tremendous attention in the past decade. The problem is NP-complete, and a series of follow-up works has resulted in better approximation ratios, generalizations to weighted graphs, etc. [4, 9, 10]. This problem captures a wide range of applications, including clustering gene expression patterns [8, 23] and the aggregation of inconsistent information [13].

Chierichetti et al. [11] extended the notion of disparate impact to the k-center and k-median objectives, and studied these problems for the case of two groups. Their result was later generalized to multiple groups by Rösner and Schmidt [26]. We generalize the notion of disparate impact [2] to correlation clustering for multiple colors, where the goal is to ensure that the distribution of colors in each cluster is identical to the global distribution. Additionally, we extend the model introduced by Ahmadian et al. [3] for k-center to correlation clustering to ensure that no color is over- or under-represented in any cluster.

More formally, our fairness-aware variant of correlation clustering [7] identifies clusters while ensuring an equal distribution of demographics. Our algorithm proceeds in two steps. In the first step, it identifies a matching between nodes of different colors to construct small clusters that satisfy the fairness constraints. In the second step, it chooses representative nodes (one from each matched cluster) and employs a traditional correlation clustering algorithm to identify the final set of clusters. We prove that our algorithm identifies clusters within a constant-factor approximation of the optimal solution. We further relax the equal distribution constraint and extend our algorithm to lower and upper bound constraints on the number of nodes of each color in a cluster. To further instill trust in the data, we explore multi-objective clustering algorithms to generate explainable clusters with minimal loss in the clustering objective.
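
The two-step structure can be sketched in Python for the special case of two colors with equally many nodes each. Everything concrete here is an assumption made for illustration: the greedy matching, the deterministic pivot rule and the function name are not taken from [2], and the sketch carries no approximation guarantee.

```python
def fair_correlation_clusters(colors, positive):
    """colors: node -> 'red' or 'blue'; positive: set of frozenset pairs marked similar."""
    reds = sorted(n for n, c in colors.items() if c == "red")
    blues = sorted(n for n, c in colors.items() if c == "blue")
    if len(reds) != len(blues):
        raise ValueError("this sketch needs equally many red and blue nodes")

    def similar(u, v):
        return frozenset((u, v)) in positive

    # Step 1: greedy red-blue matching, preferring partners joined by a positive edge.
    free_blues = list(blues)
    partner = {}
    for r in reds:
        b = min(free_blues, key=lambda x: (not similar(r, x), x))
        free_blues.remove(b)
        partner[r] = b

    # Step 2: pivot-style clustering on the red representatives only,
    # then expand every representative back into its matched pair.
    remaining = list(reds)
    clusters = []
    while remaining:
        pivot = remaining[0]                  # deterministic pivot for reproducibility
        group = [pivot] + [r for r in remaining[1:] if similar(pivot, r)]
        remaining = [r for r in remaining if r not in group]
        clusters.append(sorted(group + [partner[r] for r in group]))
    return clusters

COLORS = {"r1": "red", "r2": "red", "r3": "red",
          "b1": "blue", "b2": "blue", "b3": "blue"}
POSITIVE = {frozenset(p) for p in [("r1", "b1"), ("r2", "b2"), ("r3", "b3"), ("r1", "r2")]}
clusters = fair_correlation_clusters(COLORS, POSITIVE)
```

Because whole matched pairs are moved together, every output cluster contains the same number of red and blue nodes regardless of how the second step groups the representatives.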

Interpretable Clustering. Clustering techniques are expected to be inherently interpretable, as the goal is to group similar nodes together. However, as the number of features per record increases, the generated clusters can have poor interpretability. In our work [27], we measure interpretability in terms of the homogeneity of nodes in a cluster with respect to the features of interest for the end-user. We consider the k-center clustering objective and develop techniques that achieve a target level of interpretability, set by a given parameter, with respect to the features of interest. The choice of this parameter determines the trade-off between the clustering objective and interpretability.
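
The tension between the k-center objective and homogeneity can be seen in a small Python sketch. It uses the standard greedy farthest-first heuristic for k-center on one-dimensional points and scores each cluster by the share of its nodes that carry the most common value of a feature of interest. This score is only one plausible reading of homogeneity, assumed here for illustration; the precise definition used in [27] may differ.

```python
from collections import Counter

def greedy_k_center(points, k):
    """Farthest-first traversal on 1-D points; returns center indices and assignments."""
    centers = [0]                                   # arbitrary first center
    dist = [abs(p - points[0]) for p in points]     # distance to the nearest center
    while len(centers) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, abs(p - points[nxt])) for d, p in zip(dist, points)]
    assignment = [min(centers, key=lambda c: abs(p - points[c])) for p in points]
    return centers, assignment

def homogeneity(assignment, feature):
    """Per cluster: fraction of nodes sharing the most common feature value."""
    by_cluster = {}
    for center, value in zip(assignment, feature):
        by_cluster.setdefault(center, []).append(value)
    return {c: Counter(vals).most_common(1)[0][1] / len(vals)
            for c, vals in by_cluster.items()}

POINTS = [0, 1, 2, 10, 11, 12]                      # illustrative data
FEATURE = ["x", "x", "y", "y", "y", "y"]            # feature of interest per point
centers, assignment = greedy_k_center(POINTS, k=2)
scores = homogeneity(assignment, FEATURE)
```

The distance-based clustering splits the points into {0, 1, 2} and {10, 11, 12}, but the first cluster is only two-thirds homogeneous. Moving the point at 2 to the other cluster would make both clusters fully homogeneous at the cost of a larger radius, which is the kind of trade-off the parameter in the text controls.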

Multi-Objective Clustering. With the increased societal impact of clustering techniques, the importance of considering additional constraints like fairness, diversity, interpretability and efficiency has increased. This has motivated the study of multi-objective clustering techniques focused on these objectives. Existing techniques that support multi-objective clustering either leverage a scalarization function, which combines the multiple objectives into a single objective, or find clusters in parallel for each objective and combine the results using different approaches such as a fitness function. Such techniques lose theoretical guarantees with respect to any of the considered objectives. In [21], we consider a lexicographic multi-objective framework in which the optimization objectives are lexicographically ordered and our optimization algorithm follows the same preference. In this setting, the goal is to prioritize primary clustering objectives over ancillary objectives. To further simulate different scenarios, our model uses a slack value that allows minor deviations of the primary objective from its optimal value in order to improve the quality of secondary objectives. Our algorithm processes the different objectives in the order of their preference and generates the final clustering. In case of any violation of clustering objectives, local search techniques are employed to satisfy the corresponding slack values.
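
The role of the slack value can be illustrated with a minimal Python sketch that selects among a handful of precomputed candidate clusterings. Enumerating candidates is an assumption made purely for illustration, as are the candidate names and cost values; the algorithm of [21] constructs the clustering and uses local search rather than filtering a fixed list.

```python
def lexicographic_select(candidates, objectives, slacks):
    """Filter candidates objective by objective, in order of preference.

    Each objective is a cost to minimize; a candidate survives a round if its
    cost is within `slack` of the best cost among the current survivors.
    """
    feasible = list(candidates)
    for objective, slack in zip(objectives, slacks):
        best = min(objective(c) for c in feasible)
        feasible = [c for c in feasible if objective(c) <= best + slack]
    return feasible

# Hypothetical candidates: name -> (primary cost, secondary cost).
CANDIDATES = {"C1": (10.0, 0.9), "C2": (10.5, 0.2), "C3": (14.0, 0.0)}

def primary(name):
    return CANDIDATES[name][0]

def secondary(name):
    return CANDIDATES[name][1]

strict = lexicographic_select(CANDIDATES, [primary, secondary], [0.0, 0.0])
relaxed = lexicographic_select(CANDIDATES, [primary, secondary], [1.0, 0.0])
```

With zero slack the primary objective alone decides and C1 is chosen. Allowing the primary cost to deviate by 1.0 admits C2, which is then preferred for its much better secondary cost, while C3 remains excluded because it sacrifices too much of the primary objective.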

4. CONCLUSION AND FUTURE WORK

In this work, we have studied different facets of clustering, focusing on the robustness, scalability, generative modelling, fairness and interpretability of flat clustering algorithms. We demonstrated the effectiveness of our techniques for clustering, with applications to entity resolution, community detection and societal issues of bias and discrimination. We study entity resolution from the perspective of using oracles as an abstraction of humans to answer pairwise queries and discuss the importance of scalable techniques for web-scale datasets. In community detection, we study generative models to simulate interaction between records of different clusters. Additionally, we study traditional clustering techniques along with fairness and interpretability constraints. As future work, we are extending our work to consider these different facets for hierarchical clustering, for applications like taxonomy construction, knowledge graph construction, data organization and team formation.

ACKNOWLEDGEMENT

The authors would like to thank all the contributors, Barna Saha (advisor), Donatella Firmani, Divesh Srivastava, Arya Mazumdar, Soumyabrata Pal, Saba Ahmadi, Roy Schwartz, Sandhya Saisubramanian, Shlomo Zilberstein, Udayan Khurana, Oktie Hassanzadeh and Kavitha Srinivas.

REFERENCES

[1] Are AI hiring programs eliminating bias or making it worse? Forbes.

[2] Ahmadi, Galhotra, Saha, and Schwartz. Fair correlation clustering, 2020.

[3] Ahmadian, Epasto, Kumar, and Mahdian. Clustering without over-representation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 267–275, 2019.

[4] Ailon, Charikar, and Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1–27, 2008.

[5] Alon and J. H. Spencer. The probabilistic method. John Wiley & Sons, 2004.

[6] Angwin, Larson, Mattu, and Kirchner. Machine bias. ProPublica, May 23, 2016.

[7] Bansal, Blum, and Chawla. Correlation clustering. Machine Learning, 56(1-3), 2004.

[8] Ben-Dor, Shamir, and Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4):281–297, 1999.

[9] Charikar, Guruswami, and Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360–383, 2005.

[10] Chawla, Makarychev, Schramm, and Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 219–228, 2015.

[11] Chierichetti, Kumar, Lattanzi, and Vassilvitskii. Fair clustering through fairlets. In I. Guyon, U. V. Luxburg, Bengio, Wallach, Fergus, Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5029–5037. Curran Associates, Inc., 2017.

[12] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[13] Filkov and Skiena. Integrating microarray data by consensus clustering. International Journal on Artificial Intelligence Tools, 13(04):863–880, 2004.

[14] Firmani, Galhotra, Saha, and Srivastava. Robust entity resolution using a crowdoracle. 2018.

[15] Firmani, Saha, and Srivastava. Online entity resolution using an oracle. PVLDB, 9(5):384–395, 2016.

[16] Galhotra, Firmani, Saha, and Srivastava. Robust entity resolution using random graphs. In SIGMOD, 2018.

[17] Galhotra, Firmani, Saha, and Srivastava. Efficient and effective ER with progressive blocking, 2020.

[18] Galhotra, Khurana, Hassanzadeh, Srinivas, Samulowitz, and Qi. Automated feature enhancement for predictive modeling using external knowledge. ICDM, 2019.

[19] Galhotra, Mazumdar, Pal, and Saha. Connectivity in random annulus graphs and the geometric block model. CoRR, abs/1804.05013, 2018.

[20] Galhotra, Mazumdar, Pal, and Saha. The geometric block model. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[21] Galhotra, Saisubramanian, and Zilberstein. Lexicographically ordered multi-objective clustering. arXiv preprint arXiv:1903.00750, 2019.

[22] N. J. Goodall. Can you program ethics into a self-driving car? IEEE Spectrum, 53(6):28–58, June 2016.

[23] Guo, F. Hüffner, C. Komusiewicz, and Y. Zhang. Improved algorithms for bicluster editing. In M. Agrawal, Du, Duan, and A. Li, editors, Theory and Applications of Models of Computation, pages 445–456, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

[24] Olson. The algorithm that beats your bank manager. CNN Money, March 15, 2011.

[25] Papadakis, Svirsky, Gal, and Palpanas. Comparative analysis of approximate blocking techniques for entity resolution. Proceedings of the VLDB Endowment, 9(9):684–695, 2016.

[26] C. Rösner and M. Schmidt. Privacy preserving clustering with constraints. arXiv preprint arXiv:1802.02497, 2018.

[27] Saisubramanian, Galhotra, and Zilberstein. Balancing the tradeoff between clustering value and interpretability. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES '20, pages 351–357, New York, NY, USA, 2020.

[28] Strickland. Doc bot preps for the O.R. IEEE Spectrum, 53(6):32–60, June 2016.

[29] Vesdapunt, Bellare, and Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071–1082, 2014.

[30] Wang, Kraska, M. J. Franklin, and Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.

[31] Wang, Li, Kraska, M. J. Franklin, and Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD Conference, 2013.