Reliable Clustering with Applications to Data Integration

Sainyam Galhotra
Supervised by: Barna Saha
UMass Amherst
sainyam@cs.umass.edu

ABSTRACT
Today, a number of critical applications with significant societal impact are powered by data-driven Artificial Intelligence. Given their ubiquity, it is critical to perform accurate and fair analytics. Entity resolution, community detection and taxonomy construction are some of the building blocks of these applications, and clustering is the fundamental concept underlying all of these methods. Therefore, the importance of accurate, robust and fair clustering methods cannot be overstated. This work focuses on these different facets of clustering.

First, we study the problem of clustering in the presence of supervision, specifically aimed at entity resolution. In this setting, we study the robustness and scalability of methods that leverage supervision through an oracle, i.e., an abstraction of crowdsourcing.

Second, community detection applications are difficult to evaluate in real-world scenarios due to the lack of ground truth data. We propose a generative model that captures interactions between records belonging to different clusters and devise techniques for efficient cluster recovery.

Third, bias in data can arise from discriminatory treatment of marginalized groups, from sampling methods, or even from measurement errors. We study the impact of this bias on the generated clusters and develop techniques that guarantee fair representation of different groups. We prove the noise tolerance of our algorithms and back the theory by demonstrating their efficacy and efficiency on various real-world datasets for these applications.

Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020, Tokyo, Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted for private and academic purposes.
1. INTRODUCTION
With the advances in machine learning and the availability of vast amounts of data, Artificial Intelligence based systems are increasingly allowed to make autonomous decisions. Already, software makes decisions about who gets a loan [24], hiring [1], self-driving car actions that may lead to property damage or human injury [22], medical diagnosis and treatment [28], and every stage of the criminal justice system, including arraignment and sentencing, that determines who goes to jail and who is set free [6]. The importance of these decisions makes the fairness and quality of the employed algorithms of prime importance.

A number of these real-world applications employ entity resolution, community detection, taxonomy construction and outlier detection as some of their key constituents. Clustering is one of the fundamental techniques commonly used to formally study these components. Clustering has been studied for many decades and is considered a challenging task that has evolved over time. In the modern era of big data, the problems of noise, bias and poor data quality have adversely affected the quality of traditional clustering techniques. Along with quality, it is very important to improve their scalability so that they run on web-scale datasets. Additionally, clustering is generally an unsupervised task and suffers from a lack of ground truth data for effective evaluation. There has been a lot of interest in devising generative models to simulate real-world interactions between records of different clusters and to benchmark various techniques. In this work, we focus on these different facets of clustering along with its applications to data integration. Table 1 presents a summary of our contributions.

Table 1: Summary of contributions.
  Robust and Scalable Oracle-based Clustering: Entity Resolution [16, 14, 17]; Semantic concept identification and Feature Enrichment [18]
  Generative Model: Geometric Block Model [20, 19]
  Fair and Interpretable: Fair Correlation Clustering [2]; Interpretable k-center Clustering [27]

1.1 Clustering using supervision
Clustering is an intricate problem, especially in the absence of domain knowledge, and the final set of clusters identified using automated techniques can be highly inaccurate and noisy. There has been a lot of recent interest in leveraging humans to answer pairwise queries of the form 'do u and v belong to the same optimal cluster?'. Since humans have much more context and domain knowledge, they can answer such queries quite easily. For this reason, many frameworks have been developed that leverage humans (abstracted as an oracle) to perform entity resolution, one of the traditional applications of oracle-based clustering techniques in data integration.

Entity Resolution refers to the task of identifying all records that refer to the same entity. It is one of the classical data management problems and has been studied since the seminal work of Fellegi and Sunter in 1969 [12]. The explosion of data sources has aggravated the presence of duplicates in a dataset, elevating the importance of Entity Resolution (abbreviated as ER and often referred to as de-duplication). Web-scale algorithms for de-duplication and organization of data are the need of the hour. ER has evolved from using rule-based systems to using human annotators for expert guidance. In traditional settings, the goal of ER was to match records obtained from two data sources; it has now evolved to identifying clusters of records that refer to the same entity. The heterogeneity of data sources has increased the amount of noise in these datasets and motivated the study of ER scalability. There has been limited work on holistic approaches to identify entities across multiple sources. We develop techniques that are able to resolve entities in datasets with varied cluster distributions and noise levels. To achieve this goal, we make the following contributions.

• Robustness. The queries to the oracle can have low accuracy depending on their difficulty. Prior oracle-based clustering techniques [29, 30, 15] assumed that all the answers returned by the oracle are correct and hence constructed a spanning tree over the queried edges to identify all the matching pairs. In the absence of noise, this is sufficient due to transitivity (if u, v refer to the same entity and v, w refer to the same entity, then u, w can be inferred to refer to the same entity), but it leads to a very poor F-score for the generated clusters even at low error rates. We propose a cost-effective approach [16, 14] that can be added as an extra layer to any oracle-based strategy, helping to preserve the performance guarantees of [31, 29, 15] along with high precision. Instead of constructing a spanning tree over the records, our approach strengthens all the cuts by constructing sparse graphs with strong connectivity properties. We achieve this with the help of expander graphs [5] and prove precision guarantees for our technique. The error-correction layer can be tuned (or even turned off), trading off budget for accuracy and thereby providing the flexibility to adapt to different ER applications (a toy illustration of this budget-for-accuracy trade-off appears after this list). In order to efficiently leverage this toolkit, we propose an adaptive technique that changes the connectivity strength of the queried graph based on the noise in the results and the prior similarity of record pairs. We empirically demonstrate that our technique achieves a high F-score over different real-world datasets.

• Scalability. ER is generally preceded by blocking as a pre-processing step to handle large-scale datasets. Blocking constitutes the first step that selects a sub-quadratic number of record pairs to compare in the subsequent steps. Blocking groups similar records into blocks and then selects pairs from the "cleanest" blocks (i.e., those with fewer non-matching pairs) for further comparisons in the pair-matching phase. The literature is rich with methods for building and processing blocks [25], but depending on the data, blocking techniques are either (a) so aggressive that they help scale but adversely affect ER accuracy, or (b) so permissive that they potentially harm ER efficiency. Due to these limitations, blocking requires tuning for each dataset and is one of the most time-consuming components of the pipeline. We propose a new methodology of progressive blocking [17] that overcomes the above limitations by self-regulating blocking and adapting to the properties of each dataset, with no configuration effort. Our approach performs blocking and matching in tandem, where pair-matching results are fed back to the blocking step to refine and improve its quality. We demonstrate that our technique achieves the best trade-off between the quality of the final results and ER efficiency for a variety of million-scale datasets.
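As an illustration of the budget-for-accuracy trade-off referenced in the Robustness bullet above, the following is a minimal Python sketch, assuming a simulated noisy crowd oracle, that aggregates repeated pairwise answers by majority vote before applying transitive closure. It is not the expander-graph construction of [16, 14]; the error rate, the per-pair budget k, and the quadratic pass over all pairs are illustrative simplifications.

```python
import random
from itertools import combinations

def noisy_oracle(u, v, truth, error=0.1):
    """Simulated crowd oracle: answers whether u and v refer to the same entity,
    flipping the true answer with probability `error` (illustrative assumption)."""
    answer = truth[u] == truth[v]
    return answer if random.random() > error else not answer

def majority_answer(u, v, truth, k=5):
    """Ask the oracle k times and keep the majority vote, trading extra query
    budget for robustness against noisy responses."""
    votes = sum(noisy_oracle(u, v, truth) for _ in range(k))
    return votes > k / 2

def oracle_clustering(records, truth, k=5):
    """Union-find over majority-voted positive answers; transitivity then
    propagates matches across the connected components."""
    parent = {r: r for r in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in combinations(records, 2):
        if find(u) != find(v) and majority_answer(u, v, truth, k):
            parent[find(u)] = find(v)
    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())

if __name__ == "__main__":
    truth = {i: i % 3 for i in range(12)}            # three ground-truth entities
    print(oracle_clustering(list(truth), truth, k=1))  # k=1: plain transitive closure
    print(oracle_clustering(list(truth), truth, k=5))  # k=5: majority-voted answers
```

With k = 1 this degenerates to the spanning-tree style transitive closure of prior work, where a single wrong answer can merge two entities; increasing k spends more budget per pair to suppress such errors, whereas the actual approach achieves this more economically through sparse graphs with strong cut connectivity.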
As future work, we are planning to extend oracle-based techniques to perform hierarchical clustering. Hierarchical clustering techniques are very useful for constructing taxonomies, analyzing phylogenetic trees and building product catalogs. In this setting, we assume that all the leaf-level records are known and the goal is to organize these records in the form of a type-subtype hierarchy. A pairwise oracle query between two leaf-level records is not sufficient to construct the hierarchy. Therefore, we consider a triplet query consisting of three records, where the oracle identifies the pair of nodes that are closer to each other than to the third node. The oracle output provides local evidence about the hierarchy and helps uncover its structure. One of the key challenges in this line of work is to efficiently identify a small set of queries that can recover the hierarchy. For a dataset of n records, the total number of possible triplet queries is O(n^3), and enumerating all such queries is infeasible for million-scale datasets. We leverage pairwise similarities as guidance to quickly identify the most beneficial triplet queries. Our algorithm maintains a hierarchy of all the processed records and iteratively processes each node with the help of already-identified beneficial queries. We show that our technique is able to construct the hierarchy with O(n log n) queries under reasonable assumptions on the similarity distribution. This work is in progress and we are currently evaluating the quality of our techniques with respect to other baselines.
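To make the triplet-query abstraction concrete, here is a minimal sketch, under simplifying assumptions (numeric records, a noiseless oracle, representative leaves chosen arbitrarily rather than guided by pairwise similarities), of how one triplet query per level can route a new record into a growing binary hierarchy. It is only an illustration of the query model, not the in-progress algorithm described above, and it achieves roughly one query per tree level only when the hierarchy stays balanced.

```python
import random

class Node:
    """Binary hierarchy node: leaves carry a record, internal nodes have two children."""
    def __init__(self, record=None, left=None, right=None):
        self.record, self.left, self.right = record, left, right

    def any_leaf(self):
        """Return a representative leaf record from this subtree."""
        return self.record if self.record is not None else self.left.any_leaf()

def triplet_oracle(x, a, b):
    """Simulated triplet query: of a and b, which is closer to x?
    'Closeness' here is plain numeric distance (an illustrative assumption)."""
    return a if abs(x - a) <= abs(x - b) else b

def insert(root, x):
    """Route x down the current hierarchy, spending one triplet query per level,
    and attach it as a sibling of the leaf it ends up at."""
    if root is None:
        return Node(record=x)
    node, parent, side = root, None, None
    while node.record is None:                      # descend until a leaf is reached
        left_rep, right_rep = node.left.any_leaf(), node.right.any_leaf()
        side = "left" if triplet_oracle(x, left_rep, right_rep) == left_rep else "right"
        parent, node = node, (node.left if side == "left" else node.right)
    merged = Node(left=node, right=Node(record=x))  # new internal node joining leaf and x
    if parent is None:
        return merged
    setattr(parent, side, merged)
    return root

if __name__ == "__main__":
    records = [1, 2, 3, 50, 51, 52, 100, 101]       # three well-separated groups
    random.shuffle(records)
    root = None
    for r in records:
        root = insert(root, r)
```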
Identi- for building and processing blocks [25], but depending on fying clusters of numerical columns requires additional con- the data, blocking techniques are either (a) too aggressive text from the meta-data and other co-occurring columns. that they help scale but adversely affect ER accuracy, We are developing a unified framework to identify seman- or (b) too permissive to potentially harm ER efficiency. tically coherent clusters of columns and further use these Due to these limitations, blocking require tuning for each for applications like dataset discovery, feature enrichment, dataset and is one of the most time-consuming compo- improving search, etc. nents of the pipeline. We propose a new methodology of progressive blocking [17] 2. ABSENCE OF GROUND TRUTH that overcomes the above limitations by self-regulating In this section, we discuss clustering from the lens of com- blocking and adapting to the properties of each dataset, munity detection over social networks. There are a plethora of techniques that are used to identify clusters of records bution of demographics. Our algorithm proceeds in two referring to same community. However, all these datasets steps. In the first step it identifies a matching between nodes suffer from the scarcity of ground truth data. In order to of different colors to construct small clusters that satisfy fair- circumvent this drawback, generative models have been pro- ness constraints. In the second step it chooses representative posed to model the interaction between records of different nodes (one from each matched clusters) and employs tradi- communities. These models are helpful to benchmark the tional correlation clustering algorithm to identify the final quality of known clustering techniques to identify clusters. set of clusters. We prove that our algorithm identifies clus- Stochastic block model (SBM) is one of the most popu- ters within a constant factor approximation of the optimal lar random graph model that generalizes the Erdős-Renyi solution. We further relax the equal distribution constraint graphs. According to SBM, edges between every pair of and extend our algorithm for a lower and upper bound con- nodes are drawn randomly with probability p if the end- straint on the number of nodes of each color in a cluster. To points belong to the same cluster and q if they belong to further instill trust in the data, we explore multi-objective different clusters. One aspect that SBM does not capture clustering algorithms to generate explainable clusters with is the ‘transitivity rule’ (friends having common friends), minimal loss in the clustering objective. which is inherent to formation of communities over social Interpretable Clustering. Clustering techniques are ex- networks. Intuitively, if two nodes x, y are connected by an pected to be inherently interpretable as the goal is to group edge and y, z are connected by an edge then it is more likely similar nodes together. However, with the increase in num- than not that x, z are connected by an edge. Inspired by ber of features for each record, the generated clusters can this, we proposed the geometric block model [20, 19] that have poor interpretability. In our work [27], we measure in- models community formation according to random geomet- terpretability in terms of the homogeneity of nodes in a clus- ric graphs. One of the key distinction from SBM is that it ter with respect to the features of interest for the end-user. 
3. FAIRNESS AND INTERPRETABILITY
There are countless examples where the use of biased systems has led to disastrous consequences. Clustering techniques are used in various applications, such as team formation and community detection, that have societal impact. Despite their importance, there has been little work on improving the fairness and interpretability of these algorithms. We consider different clustering techniques and devise scalable methods to improve their fairness and interpretability.

Correlation Clustering. Correlation clustering, introduced by Bansal, Blum and Chawla in 2004 [7], has received tremendous attention in the past decade. The problem is NP-complete, and a series of follow-up works has resulted in better approximation ratios, generalizations to weighted graphs, etc. [4, 9, 10]. This problem captures a wide range of applications, including clustering gene expression patterns [8, 23] and the aggregation of inconsistent information [13]. Chierichetti et al. [11] extended the notion of disparate impact to the k-center and k-median objectives, and studied these problems for the case of two groups. Their result was later generalized to multiple groups by Rösner and Schmidt [26]. We generalize the notion of disparate impact [2] to correlation clustering with multiple colors, and our goal is to make sure that the distribution of colors in each cluster is identical to the global distribution. Additionally, we extend the model introduced by Ahmadian et al. [3] for k-center to correlation clustering, to ensure that no color is over- or under-represented in any cluster.

More formally, our fairness-aware variant of correlation clustering [7] identifies clusters while ensuring an equal distribution of demographics. Our algorithm proceeds in two steps. In the first step, it identifies a matching between nodes of different colors to construct small clusters that satisfy the fairness constraints. In the second step, it chooses representative nodes (one from each matched cluster) and employs a traditional correlation clustering algorithm to identify the final set of clusters (a toy sketch of this two-step scheme appears at the end of this section). We prove that our algorithm identifies clusters within a constant-factor approximation of the optimal solution. We further relax the equal-distribution constraint and extend our algorithm to lower- and upper-bound constraints on the number of nodes of each color in a cluster. To further instill trust in the data, we explore multi-objective clustering algorithms that generate explainable clusters with minimal loss in the clustering objective.

Interpretable Clustering. Clustering techniques are expected to be inherently interpretable, as the goal is to group similar nodes together. However, with the increase in the number of features per record, the generated clusters can have poor interpretability. In our work [27], we measure interpretability in terms of the homogeneity of the nodes in a cluster with respect to the features of interest to the end-user. We consider the k-center clustering objective and develop techniques to achieve β-interpretability (for a given parameter β) with respect to the features of interest. The choice of β determines the trade-off between the clustering objective and interpretability.

Multi-Objective Clustering. With the increased societal impact of clustering techniques, the importance of considering additional constraints like fairness, diversity, interpretability and efficiency has grown. This has motivated the study of multi-objective clustering techniques focused on these objectives. Existing techniques that support multi-objective clustering either leverage a scalarization function, which combines the multiple objectives into a single objective, or find clusters in parallel for each objective and combine the results using different approaches, such as a fitness function. Such techniques lose theoretical guarantees with respect to any of the considered objectives. In [21], we consider a lexicographic multi-objective framework where the optimization objectives are lexicographically ordered and our optimization algorithm follows the same preference. In this setting, the goal is to prioritize primary clustering objectives over ancillary objectives. To further simulate different scenarios, our model uses a slack value to improve the quality on secondary objectives, allowing minor deviations of the primary objective from its optimal value. Our algorithm processes the different objectives in the order of their preference and generates the final clustering. In case of any violation of the clustering objectives, local search techniques are employed to satisfy the corresponding slack values.
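The two-step scheme forward-referenced in the Correlation Clustering paragraph above can be illustrated with a minimal sketch for two equally sized color groups. The pairing below is arbitrary and the second step uses the simple randomized Pivot algorithm in the spirit of [4], so this is only a toy instance of the approach in [2], not the algorithm that carries its approximation guarantee; the set `positive` encodes the '+' edges of a toy signed instance.

```python
import random

def pivot_correlation_clustering(nodes, positive, seed=0):
    """Pivot / KwikCluster-style step: repeatedly pick a random pivot and cluster
    it together with all remaining nodes joined to it by a '+' edge."""
    rng = random.Random(seed)
    remaining, clusters = set(nodes), []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | {v for v in remaining if frozenset((pivot, v)) in positive}
        clusters.append(cluster)
        remaining -= cluster
    return clusters

def fair_correlation_clustering(red, blue, positive, seed=0):
    """Two-step sketch of the fairness-aware scheme described above. Step 1 pairs
    each red node with a blue node into a balanced 'fairlet' (here an arbitrary
    pairing; the actual algorithm chooses the matching to control its cost).
    Step 2 clusters one representative per fairlet and then expands each cluster."""
    assert len(red) == len(blue), "this sketch assumes equal group sizes"
    fairlets = dict(zip(red, blue))                   # step 1: red -> blue matching
    rep_clusters = pivot_correlation_clustering(list(fairlets), positive, seed)
    return [{x for r in c for x in (r, fairlets[r])} for c in rep_clusters]  # step 2

if __name__ == "__main__":
    red, blue = ["r1", "r2", "r3"], ["b1", "b2", "b3"]
    positive = {frozenset(e) for e in [("r1", "r2"), ("r1", "b1"), ("r2", "b2")]}
    print(fair_correlation_clustering(red, blue, positive))
    # every output cluster contains as many red as blue nodes by construction
```

Because each fairlet is color-balanced, any clustering of the representatives expands into clusters that respect the equal-distribution constraint; the quality of the final solution then hinges on how carefully the matching in step 1 is chosen, which is where the analysis in [2] does its work.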
4. CONCLUSION AND FUTURE WORK
In this work, we have studied the different facets of clustering, focusing on the robustness, scalability, generative modeling, fairness and interpretability of flat clustering algorithms. We demonstrated the effectiveness of our techniques in performing clustering with applications to entity resolution, community detection, and the societal issues of bias and discrimination. We study entity resolution from the perspective of using oracles as an abstraction of humans that answer pairwise queries, and discuss the importance of scalable techniques for web-scale datasets. In community detection, we study generative models that simulate the interaction between records of different clusters. Additionally, we study traditional clustering techniques under fairness and interpretability constraints. As future work, we are working towards extending these different facets to hierarchical clustering, with applications such as taxonomy construction, knowledge graph construction, data organization and team formation.

5. ACKNOWLEDGEMENT
The author would like to thank all the contributors: Barna Saha (advisor), Donatella Firmani, Divesh Srivastava, Arya Mazumdar, Soumyabrata Pal, Saba Ahmadi, Roy Schwartz, Sandhya Saisubramanian, Shlomo Zilberstein, Udayan Khurana, Oktie Hassanzadeh and Kavitha Srinivas.

6. REFERENCES
[1] Are AI hiring programs eliminating bias or making it worse? Forbes.
[2] S. Ahmadi, S. Galhotra, B. Saha, and R. Schwartz. Fair correlation clustering, 2020.
[3] S. Ahmadian, A. Epasto, R. Kumar, and M. Mahdian. Clustering without over-representation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 267–275, 2019.
[4] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1–27, 2008.
[5] N. Alon and J. H. Spencer. The probabilistic method. John Wiley & Sons, 2004.
[6] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. ProPublica, May 23, 2016.
[7] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3), 2004.
[8] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4):281–297, 1999.
[9] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360–383, 2005.
[10] S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 219–228, 2015.
[11] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5029–5037. Curran Associates, Inc., 2017.
[12] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
[13] V. Filkov and S. Skiena. Integrating microarray data by consensus clustering. International Journal on Artificial Intelligence Tools, 13(04):863–880, 2004.
[14] D. Firmani, S. Galhotra, B. Saha, and D. Srivastava. Robust entity resolution using a crowdoracle. 2018.
[15] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. PVLDB, 9(5):384–395, 2016.
[16] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava. Robust entity resolution using random graphs. In SIGMOD, 2018.
[17] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava. Efficient and effective ER with progressive blocking, 2020.
[18] S. Galhotra, U. Khurana, O. Hassanzadeh, K. Srinivas, H. Samulowitz, and M. Qi. Automated feature enhancement for predictive modeling using external knowledge. In ICDM, 2019.
[19] S. Galhotra, A. Mazumdar, S. Pal, and B. Saha. Connectivity in random annulus graphs and the geometric block model. CoRR, abs/1804.05013, 2018.
[20] S. Galhotra, A. Mazumdar, S. Pal, and B. Saha. The geometric block model. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[21] S. Galhotra, S. Saisubramanian, and S. Zilberstein. Lexicographically ordered multi-objective clustering. arXiv preprint arXiv:1903.00750, 2019.
[22] N. J. Goodall. Can you program ethics into a self-driving car? IEEE Spectrum, 53(6):28–58, June 2016.
[23] J. Guo, F. Hüffner, C. Komusiewicz, and Y. Zhang. Improved algorithms for bicluster editing. In M. Agrawal, D. Du, Z. Duan, and A. Li, editors, Theory and Applications of Models of Computation, pages 445–456, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[24] P. Olson. The algorithm that beats your bank manager. CNN Money, March 15, 2011.
[25] G. Papadakis, J. Svirsky, A. Gal, and T. Palpanas. Comparative analysis of approximate blocking techniques for entity resolution. Proceedings of the VLDB Endowment, 9(9):684–695, 2016.
[26] C. Rösner and M. Schmidt. Privacy preserving clustering with constraints. arXiv preprint arXiv:1802.02497, 2018.
[27] S. Saisubramanian, S. Galhotra, and S. Zilberstein. Balancing the tradeoff between clustering value and interpretability. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES '20, pages 351–357, New York, NY, USA, 2020.
[28] E. Strickland. Doc bot preps for the O.R. IEEE Spectrum, 53(6):32–60, June 2016.
[29] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071–1082, 2014.
[30] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.
[31] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD Conference, 2013.