Reliable Clustering with Applications to Data Integration Sainyam Galhotra Supervised by: Barna Saha UMass Amherst sainyam@cs.umass.edu ABSTRACT and every stage of the criminal justice system including ar- Today, a number of critical applications that have a sig- raignment and sentencing that determine who goes to jail nificant societal impact are powered by data driven Arti- and who is set free [6]. The importance of these decisions ficial Intelligence. Given their ubiquity, it is very critical makes fairness and quality of the employed algorithms of to perform accurate and fair analytics. Entity resolution, prime importance. community detection and taxonomy construction are some A number of these real-world applications employ entity of the building blocks of these applications and for these resolution, community detection, taxonomy construction and methods, clustering is one of the common fundamental un- outlier detection as some of their key constituents. Cluster- derlying concept. Therefore, the use of accurate, robust and ing is one of the fundamental techniques that is commonly fair methods for clustering cannot be overstated. This work used to formally study these components. Clustering has focuses on these different facets of clustering. been studied for many decades and is considered a challeng- First, we study the problem of clustering in the presence ing task that has evolved over time. In the modern era of big of supervision, specifically aimed at entity resolution. In data, the problems of noise, bias and poor quality data have this setting, we study the robustness and scalability of the adversely affected the quality of traditional clustering tech- methods that leverage supervision through an oracle i.e an niques. Along with quality, it is very important to improve abstraction of crowdsourcing. their scalability to run on web-scale datasets. Additionally, Second, community detection applications suffer from eval- clustering is generally an unsupervised task and suffers from uation in real world scenarios due to lack of ground truth lack of ground truth data for effective evaluation. There data. We propose a generative model to capture interactions has been a lot of interest in devising generative models to between records that belong to different clusters and devise simulate real-world interaction between records of different techniques for efficient cluster recovery. clusters and benchmark various techniques. In this work, Third, manifestation of bias in data could arise due to we focus on these different facets of clustering along with discriminatory treatment of marginalized groups, sampling its applications towards data integration. Table 1 presents methods or even measurement errors in the data. We study a summary of our contributions. the impact of this bias on generated clusters and develop 1.1 Clustering using supervision techniques that guarantee fair representation from different Clustering is an intricate problem especially due to the groups. We prove the noise tolerance of our algorithms and absence of domain knowledge, and the final set of clusters back the theory by demonstrating the efficacy and efficiency identified using automated techniques can be highly inac- on various real world datasets for these applications. curate and noisy. There has been a lot of recent interest to leverage humans to answer pairwise queries of the form ‘do u and v belong to the same optimal cluster?’. Since hu- 1. INTRODUCTION mans have much more context and domain knowledge, they With the advances in machine learning and availability can answer such queries quite easily. For this reason, many of vast amounts of data, Artificial Intelligence based sys- frameworks have been developed to leverage humans (ab- tems are allowed to make autonomous decisions. Already, stracted as an oracle) to perform entity resolution, one of software makes decisions in who gets a loan [24], hiring [1], the traditional applications of oracle based clustering tech- self-driving car actions that may lead to property damage niques in data integration. or human injury [22], medical diagnosis and treatment [28], Entity Resolution refers to the task of identifying all records that refer to the same entity. Entity resolution is one of the classical data management problems that has been studied since the seminal work of Fellegi and Sunter in 1969 [12]. The explosion of data sources has aggravated the presence of duplicates in a dataset, elevating the importance of Entity Resolution (abbreviated as ER and often referred to as de- duplication). Web-Scale algorithms for de-duplication and organization of data is the need of the hour. ER has evolved Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020. Tokyo, Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted from using rule based systems to using human annotators for private and academic purposes. for expert guidance. In traditional settings, the goal of ER Table 1: Summary of contributions. Robust and Scalable Oracle-based Clustering: Entity Resolution [16, 14, 17] Semantic concept identification and Feature Enrichment [18] Generative Model Geometric Block Model [20, 19] Fair and interpretable Fair Correlation Clustering [2] Interpretable k-center Clustering [27] was to match records obtained from two data sources which with no configuration effort. Our approach performs block- has now evolved to identify a cluster of records referring to ing and matching in tandem, where pair matching results same entity. The heterogeneity of data sources has raised the are fed back to the blocking to refine and improve its amount of noise in these datasets and motivated the study quality. We demonstrate that our technique achieves the of ER scalability. There has been limited work on holis- best trade-of between the quality of final results and ER tic approaches to identify entities across multiple sources. efficiency for a variety of million scale datasets. We develop techniques that are able to resolve entities in As a future work, we are planning to extend oracle-based datasets with varied cluster distributions and noise levels. techniques to perform hierarchical clustering. Hierarchi- To achieve this goal, we make the following contributions. cal clustering techniques are very useful to construct tax- • Robustness. The queries to the oracle can have low ac- onomies, analyze phylogenetic trees and construct product curacy based on their difficulty. Prior oracle-based clus- catalogs. In this setting, we assume that all the leaf level tering techniques [29, 30, 15] assumed that all the answers records are known and the goal is to organize these records returned by the oracle are correct and hence constructed in the form of a type-subtype hierarchy. Pairwise oracle a spanning tree over the queried edges to identify all the query between two leaf level records is not sufficient to con- matching pairs. In the absence of noise, this was sufficient struct the hierarchy. Therefore, we consider a triplet query due to transitivity (if u, v refer to same entity and v, w consisting of three records and the oracle identifies the pair refer to same entity then u,w can be inferred as same en- of nodes that are closer to each other than the third node. tity) but it leads to very poor F-score of generated clusters The oracle output provides a local evidence of the hierarchy even in case of low error. We propose a cost-effective ap- and is helpful to uncover the structure. One of the key chal- proach [16, 14] that can be added as an extra-layer to any lenges in this line of work is to efficiently identify a small oracle-based strategy, helping to preserve the performance set of queries that can help recover the hierarchy. For a guarantees of [31, 29, 15] along with high precision. In- dataset of n records, the total number of possible triplet stead of constructing spanning tree over the records, our queries is O(n3 ) and enumerating all such queries is impos- approach strengthens all the cuts by constructing sparse sible for million scale datasets. We leverage pairwise simi- graphs with strong connectivity properties. We achieve larities as a guidance to quickly identify the most beneficial this with the help of expander graphs [5] and prove pre- triplet queries. Our algorithm maintains a hierarchy of all cision guarantees of our technique. The error correction the processed records and iteratively processes each node layer can be tuned (or even turned off) trading off budget with the help of already identified beneficial queries. We for accuracy, thereby providing flexibility to adapt to dif- show that our technique is able to construct the hierarchy ferent ER applications. In order to efficiently leverage this with O(n log n) queries under reasonable assumptions of the toolkit, we propose an adaptive technique that changes similarity distribution. This work is under progress and we the connectivity strength of the queried graph based on are currently evaluating the quality of our techniques with noise in results and prior similarity of record pairs. We respect to other baselines. empirically demonstrate that our technique achieves high In addition to oracle based clustering techniques, we are F-score over different real world datasets. exploring the use of semantic knowledge present in the form of knowledge graphs to identify clusters of web tables and • Scalability. ER is generally preceded by blocking as a columns that refer to the same concept [18]. Given the scale pre-processing step to handle large scale datasets. Block- of data available over the web, the amount of noise and miss- ing constitutes the first step that selects sub-quadratic ing information, identifying these clusters is quite challeng- number of record pairs to compare in the subsequent steps. ing. To achieve this goal, we propose an index structure that Blocking groups similar records into blocks and then se- uses semantic knowledge graphs to quickly identify the dis- lects pairs from the “cleanest” blocks – i.e., those with tribution of concepts for a particular column. Currently, our fewer non-matching pairs – for further comparisons in the index supports text based attributes but does not work for pair matching phase. The literature is rich with methods numerical attributes like population, year, age, etc. Identi- for building and processing blocks [25], but depending on fying clusters of numerical columns requires additional con- the data, blocking techniques are either (a) too aggressive text from the meta-data and other co-occurring columns. that they help scale but adversely affect ER accuracy, We are developing a unified framework to identify seman- or (b) too permissive to potentially harm ER efficiency. tically coherent clusters of columns and further use these Due to these limitations, blocking require tuning for each for applications like dataset discovery, feature enrichment, dataset and is one of the most time-consuming compo- improving search, etc. nents of the pipeline. We propose a new methodology of progressive blocking [17] 2. ABSENCE OF GROUND TRUTH that overcomes the above limitations by self-regulating In this section, we discuss clustering from the lens of com- blocking and adapting to the properties of each dataset, munity detection over social networks. There are a plethora of techniques that are used to identify clusters of records bution of demographics. Our algorithm proceeds in two referring to same community. However, all these datasets steps. In the first step it identifies a matching between nodes suffer from the scarcity of ground truth data. In order to of different colors to construct small clusters that satisfy fair- circumvent this drawback, generative models have been pro- ness constraints. In the second step it chooses representative posed to model the interaction between records of different nodes (one from each matched clusters) and employs tradi- communities. These models are helpful to benchmark the tional correlation clustering algorithm to identify the final quality of known clustering techniques to identify clusters. set of clusters. We prove that our algorithm identifies clus- Stochastic block model (SBM) is one of the most popu- ters within a constant factor approximation of the optimal lar random graph model that generalizes the Erdős-Renyi solution. We further relax the equal distribution constraint graphs. According to SBM, edges between every pair of and extend our algorithm for a lower and upper bound con- nodes are drawn randomly with probability p if the end- straint on the number of nodes of each color in a cluster. To points belong to the same cluster and q if they belong to further instill trust in the data, we explore multi-objective different clusters. One aspect that SBM does not capture clustering algorithms to generate explainable clusters with is the ‘transitivity rule’ (friends having common friends), minimal loss in the clustering objective. which is inherent to formation of communities over social Interpretable Clustering. Clustering techniques are ex- networks. Intuitively, if two nodes x, y are connected by an pected to be inherently interpretable as the goal is to group edge and y, z are connected by an edge then it is more likely similar nodes together. However, with the increase in num- than not that x, z are connected by an edge. Inspired by ber of features for each record, the generated clusters can this, we proposed the geometric block model [20, 19] that have poor interpretability. In our work [27], we measure in- models community formation according to random geomet- terpretability in terms of the homogeneity of nodes in a clus- ric graphs. One of the key distinction from SBM is that it ter with respect to the features of interest for the end-user. considers correlated edge formation, capturing the proper- We consider the k-center clustering objective and develop ties of transitivity rule. We empirically validated the model techniques to achieve β-interpretability (for a given param- over collaboration networks and co-purchase networks. eter β) with respect to features of interest. The choice of We observed that traditional techniques that were devel- β determines the trade-off between clustering objective and oped for cluster recovery in SBM could not be used for the interpretability. geometric block model. We proposed a simple motif-based Multi-Objective Clustering. With the increased soci- counting algorithm to identify clusters and show that it is etal impact of clustering techniques, the importance of con- optimal upto a constant fraction. We tested the effectiveness sidering additional constraints like fairness, diversity, in- of our algorithm to recover clusters over various real-world terpretability and efficiency has increased. This has mo- and synthetic datasets. tivated the study of multi-objective clustering techniques focused towards these objectives. Existing techniques that 3. FAIRNESS AND INTERPRETABILITY support multi-objective clustering either leverage a scalar- ization function, which combines the multiple objectives into There are a countless number of examples where the use a single objective, or find clusters in parallel for each ob- of biased systems have led to disastrous consequences. Clus- jective and combine the results using different approaches tering techniques are used in various applications like team such as a fitness function. Such techniques lose theoretical formation and community detection which have societal im- guarantees with respect to any of the considered objectives. pact. Given their importance, there has been little work In [21], we consider a lexicographic multi-objective frame- on improving the fairness and interpretability of these al- work where the optimization objectives are lexicographically gorithms. We consider different clustering techniques and ordered and our optimization algorithm follows the same devise scalable methods to improve their fairness and inter- preference. In this setting, the goal is to prioritize primary pretability. clustering objectives over ancillary objectives. To further Correlation Clustering. Correlation clustering, intro- simulate different scenarios, our model uses a slack value duced by Bansal, Blum and Chawla in 2004 [7], has received to improve the quality on secondary objectives and allows tremendous attention in the past decade. The problem is minor deviations of the primary objective from its optimal NP-complete and a series of follow-up work has resulted value. Our algorithm processes the different objectives in in better approximation ratio, generalization to weighted the order of their preference and generates final clustering. graphs, etc. [4, 9, 10]. This problem captures a wide range of Incase of any violation of clustering objectives, local search applications including clustering gene expression patterns [8, techniques are employed to satisfy the corresponding slack 23], and the aggregation of inconsistent information [13]. values. Chierichetti et al. [11] extended the notion of disparate im- pact to k-center and k-median objectives, and studied these problems for the case of two groups. Their result was later 4. CONCLUSION AND FUTURE WORK generalized to multiple groups by Rösner and Schmidt [26]. In this work, we have studied the different facets of clus- We generalize the notion of disparate impact [2] to correla- tering focussing on robustness, scalability, generative mod- tion clustering for multiple colors and our goal is to make elling, fairness and interpretability of flat clustering algo- sure that the distribution of colors in each cluster is identi- rithms. We demonstrated the effectiveness of our techniques cal to the global distribution. Additionally, we extend the to perform clustering with applications towards entity res- model introduced by Ahmadian et al. [3] on k-center to cor- olution, community detection and other societal issues of relation clustering to ensure that no color is over or under bias and discrimination. We study entity resolution from represented in each cluster. the perspective of using oracles as an abstraction of humans More formally, our fairness-aware variant of correlation to answer pairwise queries and discuss the importance of clustering [7] identifies clusters while ensuring equal distri- scalable techniques for web-scale datasets. In community detection, we study generative models to simulate interac- [15] D. Firmani, B. Saha, and D. Srivastava. Online entity tion between records of different clusters. Additionally, we resolution using an oracle. PVLDB, 9(5):384–395, study traditional clustering techniques along with fairness 2016. and interpretability constraints. As a future work, we are [16] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava. working towards extending our work to consider these dif- Robust entity resolution using random graphs. In ferent facets for hierarchical clustering for applications like SIGMOD, 2018. taxonomy construction, knowledge graph construction, data [17] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava. organization and team formation. Efficient and effective er with progressive blocking, 2020. 5. ACKNOWLEDGEMENT [18] S. Galhotra, U. Khurana, O. Hassanzadeh, The authors would like to thank all the contributors, Barna K. Srinivas, H. Samulowitz, and M. Qi. Automated Saha (advisor), Donatella Firmani, Divesh Srivastava, Arya feature enhancement for predictive modeling using Mazumdar, Soumyabrata Pal, Saba Ahmadi, Roy Schwartz, external knowledge. ICDM, 2019. Sandhya Saisubramanian, Shlomo Zilberstein, Udayan Khu- [19] S. Galhotra, A. Mazumdar, S. Pal, and B. Saha. rana, Oktie Hassanzadeh and Kavitha Srinivas. Connectivity in random annulus graphs and the geometric block model. CoRR, abs/1804.05013, 2018. [20] S. Galhotra, A. Mazumdar, S. Pal, and B. Saha. The 6. REFERENCES geometric block model. In Thirty-Second AAAI [1] Are ai hiring programs eliminating bias or making it Conference on Artificial Intelligence, 2018. worse? Forbes. [21] S. Galhotra, S. Saisubramanian, and S. Zilberstein. [2] S. Ahmadi, S. Galhotra, B. Saha, and R. Schwartz. Lexicographically ordered multi-objective clustering. Fair correlation clustering, 2020. arXiv preprint arXiv:1903.00750, 2019. [3] S. Ahmadian, A. Epasto, R. Kumar, and M. Mahdian. [22] N. J. Goodall. Can you program ethics into a Clustering without over-representation. In Proceedings self-driving car? IEEE Spectrum, 53(6):28–58, June of the 25th ACM SIGKDD International Conference 2016. on Knowledge Discovery & Data Mining, pages [23] J. Guo, F. Hüffner, C. Komusiewicz, and Y. Zhang. 267–275, 2019. Improved algorithms for bicluster editing. In [4] N. Ailon, M. Charikar, and A. Newman. Aggregating M. Agrawal, D. Du, Z. Duan, and A. Li, editors, inconsistent information: ranking and clustering. Theory and Applications of Models of Computation, Journal of the ACM (JACM), 55(5):1–27, 2008. pages 445–456, Berlin, Heidelberg, 2008. Springer [5] N. Alon and J. H. Spencer. The probabilistic method. Berlin Heidelberg. John Wiley & Sons, 2004. [24] P. Olson. The algorithm that beats your bank [6] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. manager. CNN Money, March 15, 2011. Machine bias. ProPublica, May 23, 2016. [25] G. Papadakis, J. Svirsky, A. Gal, and T. Palpanas. [7] N. Bansal, A. Blum, and S. Chawla. Correlation Comparative analysis of approximate blocking clustering. Machine learning, 56(1-3), 2004. techniques for entity resolution. Proceedings of the [8] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering VLDB Endowment, 9(9):684–695, 2016. gene expression patterns. Journal of computational [26] C. Rösner and M. Schmidt. Privacy preserving biology, 6(3-4):281–297, 1999. clustering with constraints. arXiv preprint [9] M. Charikar, V. Guruswami, and A. Wirth. Clustering arXiv:1802.02497, 2018. with qualitative information. Journal of Computer [27] S. Saisubramanian, S. Galhotra, and S. Zilberstein. and System Sciences, 71(3):360–383, 2005. Balancing the tradeoff between clustering value and [10] S. Chawla, K. Makarychev, T. Schramm, and interpretability. In Proceedings of the AAAI/ACM G. Yaroslavtsev. Near optimal lp rounding algorithm Conference on AI, Ethics, and Society, AIES ’20, page for correlationclustering on complete and complete 351–357, New York, NY, USA, 2020. k-partite graphs. In Proceedings of the forty-seventh [28] E. Strickland. Doc bot preps for the O.R. IEEE annual ACM symposium on Theory of computing, Spectrum, 53(6):32–60, June 2016. pages 219–228, 2015. [29] N. Vesdapunt, K. Bellare, and N. Dalvi. [11] F. Chierichetti, R. Kumar, S. Lattanzi, and Crowdsourcing algorithms for entity resolution. S. Vassilvitskii. Fair clustering through fairlets. In PVLDB, 7(12):1071–1082, 2014. I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, [30] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. R. Fergus, S. Vishwanathan, and R. Garnett, editors, Crowder: Crowdsourcing entity resolution. PVLDB, Advances in Neural Information Processing Systems 5(11):1483–1494, 2012. 30, pages 5029–5037. Curran Associates, Inc., 2017. [31] J. Wang, G. Li, T. Kraska, M. J. Franklin, and [12] I. P. Fellegi and A. B. Sunter. A theory for record J. Feng. Leveraging transitive relations for linkage. Journal of the American Statistical crowdsourced joins. In SIGMOD Conference, 2013. Association, 64(328):1183–1210, 1969. [13] V. Filkov and S. Skiena. Integrating microarray data by consensus clustering. International Journal on Artificial Intelligence Tools, 13(04):863–880, 2004. [14] D. Firmani, S. Galhotra, B. Saha, and D. Srivastava. Robust entity resolution using a crowdoracle. 2018.