<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reliable Clustering with Applications to Data Integration</article-title>
      </title-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>ion of crowdsourcing. Second, community detection applications su er from evaluation in real world scenarios due to lack of ground truth data. We propose a generative model to capture interactions between records that belong to di erent clusters and devise techniques for e cient cluster recovery. Third, manifestation of bias in data could arise due to discriminatory treatment of marginalized groups, sampling methods or even measurement errors in the data. We study the impact of this bias on generated clusters and develop techniques that guarantee fair representation from di erent groups. We prove the noise tolerance of our algorithms and back the theory by demonstrating the e cacy and e ciency on various real world datasets for these applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        With the advances in machine learning and availability
of vast amounts of data, Arti cial Intelligence based
systems are allowed to make autonomous decisions. Already,
software makes decisions in who gets a loan [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], hiring [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
self-driving car actions that may lead to property damage
or human injury [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], medical diagnosis and treatment [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ],
and every stage of the criminal justice system including
arraignment and sentencing that determine who goes to jail
and who is set free [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The importance of these decisions
makes fairness and quality of the employed algorithms of
prime importance.
      </p>
      <p>A number of these real-world applications employ entity
resolution, community detection, taxonomy construction and
outlier detection as some of their key constituents.
Clustering is one of the fundamental techniques that is commonly
used to formally study these components. Clustering has
been studied for many decades and is considered a
challenging task that has evolved over time. In the modern era of big
data, the problems of noise, bias and poor quality data have
adversely a ected the quality of traditional clustering
techniques. Along with quality, it is very important to improve
their scalability to run on web-scale datasets. Additionally,
clustering is generally an unsupervised task and su ers from
lack of ground truth data for e ective evaluation. There
has been a lot of interest in devising generative models to
simulate real-world interaction between records of di erent
clusters and benchmark various techniques. In this work,
we focus on these di erent facets of clustering along with
its applications towards data integration. Table 1 presents
a summary of our contributions.
1.1</p>
    </sec>
    <sec id="sec-2">
      <title>Clustering using supervision</title>
      <p>Clustering is an intricate problem especially due to the
absence of domain knowledge, and the nal set of clusters
identi ed using automated techniques can be highly
inaccurate and noisy. There has been a lot of recent interest
to leverage humans to answer pairwise queries of the form
`do u and v belong to the same optimal cluster?'. Since
humans have much more context and domain knowledge, they
can answer such queries quite easily. For this reason, many
frameworks have been developed to leverage humans
(abstracted as an oracle) to perform entity resolution, one of
the traditional applications of oracle based clustering
techniques in data integration.</p>
      <p>
        Entity Resolution refers to the task of identifying all records
that refer to the same entity. Entity resolution is one of the
classical data management problems that has been studied
since the seminal work of Fellegi and Sunter in 1969 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
The explosion of data sources has aggravated the presence
of duplicates in a dataset, elevating the importance of Entity
Resolution (abbreviated as ER and often referred to as
deduplication). Web-Scale algorithms for de-duplication and
organization of data is the need of the hour. ER has evolved
from using rule based systems to using human annotators
for expert guidance. In traditional settings, the goal of ER
      </p>
      <sec id="sec-2-1">
        <title>Robust and Scalable</title>
      </sec>
      <sec id="sec-2-2">
        <title>Generative Model</title>
      </sec>
      <sec id="sec-2-3">
        <title>Fair and interpretable</title>
        <p>Oracle-based Clustering: Entity Resolution
Semantic concept identi cation and Feature Enrichment</p>
        <p>Geometric Block Model</p>
        <p>Fair Correlation Clustering
Interpretable k-center Clustering
was to match records obtained from two data sources which
has now evolved to identify a cluster of records referring to
same entity. The heterogeneity of data sources has raised the
amount of noise in these datasets and motivated the study
of ER scalability. There has been limited work on
holistic approaches to identify entities across multiple sources.
We develop techniques that are able to resolve entities in
datasets with varied cluster distributions and noise levels.
To achieve this goal, we make the following contributions.</p>
        <p>
          Robustness. The queries to the oracle can have low
accuracy based on their di culty. Prior oracle-based
clustering techniques [
          <xref ref-type="bibr" rid="ref15 ref29 ref30">29, 30, 15</xref>
          ] assumed that all the answers
returned by the oracle are correct and hence constructed
a spanning tree over the queried edges to identify all the
matching pairs. In the absence of noise, this was su cient
due to transitivity (if u, v refer to same entity and v, w
refer to same entity then u,w can be inferred as same
entity) but it leads to very poor F-score of generated clusters
even in case of low error. We propose a cost-e ective
approach [
          <xref ref-type="bibr" rid="ref14 ref16">16, 14</xref>
          ] that can be added as an extra-layer to any
oracle-based strategy, helping to preserve the performance
guarantees of [
          <xref ref-type="bibr" rid="ref15 ref29 ref31">31, 29, 15</xref>
          ] along with high precision.
Instead of constructing spanning tree over the records, our
approach strengthens all the cuts by constructing sparse
graphs with strong connectivity properties. We achieve
this with the help of expander graphs [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and prove
precision guarantees of our technique. The error correction
layer can be tuned (or even turned o ) trading o budget
for accuracy, thereby providing exibility to adapt to
different ER applications. In order to e ciently leverage this
toolkit, we propose an adaptive technique that changes
the connectivity strength of the queried graph based on
noise in results and prior similarity of record pairs. We
empirically demonstrate that our technique achieves high
F-score over di erent real world datasets.
        </p>
        <p>
          Scalability. ER is generally preceded by blocking as a
pre-processing step to handle large scale datasets.
Blocking constitutes the rst step that selects sub-quadratic
number of record pairs to compare in the subsequent steps.
Blocking groups similar records into blocks and then
selects pairs from the \cleanest" blocks { i.e., those with
fewer non-matching pairs { for further comparisons in the
pair matching phase. The literature is rich with methods
for building and processing blocks [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], but depending on
the data, blocking techniques are either (a) too aggressive
that they help scale but adversely a ect ER accuracy,
or (b) too permissive to potentially harm ER e ciency.
Due to these limitations, blocking require tuning for each
dataset and is one of the most time-consuming
components of the pipeline.
        </p>
        <p>
          We propose a new methodology of progressive blocking [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
that overcomes the above limitations by self-regulating
blocking and adapting to the properties of each dataset,
with no con guration e ort. Our approach performs
blocking and matching in tandem, where pair matching results
are fed back to the blocking to re ne and improve its
quality. We demonstrate that our technique achieves the
best trade-of between the quality of nal results and ER
e ciency for a variety of million scale datasets.
        </p>
        <p>As a future work, we are planning to extend oracle-based
techniques to perform hierarchical clustering.
Hierarchical clustering techniques are very useful to construct
taxonomies, analyze phylogenetic trees and construct product
catalogs. In this setting, we assume that all the leaf level
records are known and the goal is to organize these records
in the form of a type-subtype hierarchy. Pairwise oracle
query between two leaf level records is not su cient to
construct the hierarchy. Therefore, we consider a triplet query
consisting of three records and the oracle identi es the pair
of nodes that are closer to each other than the third node.
The oracle output provides a local evidence of the hierarchy
and is helpful to uncover the structure. One of the key
challenges in this line of work is to e ciently identify a small
set of queries that can help recover the hierarchy. For a
dataset of n records, the total number of possible triplet
queries is O(n3) and enumerating all such queries is
impossible for million scale datasets. We leverage pairwise
similarities as a guidance to quickly identify the most bene cial
triplet queries. Our algorithm maintains a hierarchy of all
the processed records and iteratively processes each node
with the help of already identi ed bene cial queries. We
show that our technique is able to construct the hierarchy
with O(n log n) queries under reasonable assumptions of the
similarity distribution. This work is under progress and we
are currently evaluating the quality of our techniques with
respect to other baselines.</p>
        <p>
          In addition to oracle based clustering techniques, we are
exploring the use of semantic knowledge present in the form
of knowledge graphs to identify clusters of web tables and
columns that refer to the same concept [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Given the scale
of data available over the web, the amount of noise and
missing information, identifying these clusters is quite
challenging. To achieve this goal, we propose an index structure that
uses semantic knowledge graphs to quickly identify the
distribution of concepts for a particular column. Currently, our
index supports text based attributes but does not work for
numerical attributes like population, year, age, etc.
Identifying clusters of numerical columns requires additional
context from the meta-data and other co-occurring columns.
We are developing a uni ed framework to identify
semantically coherent clusters of columns and further use these
for applications like dataset discovery, feature enrichment,
improving search, etc.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>ABSENCE OF GROUND TRUTH</title>
      <p>In this section, we discuss clustering from the lens of
community detection over social networks. There are a plethora
of techniques that are used to identify clusters of records
referring to same community. However, all these datasets
su er from the scarcity of ground truth data. In order to
circumvent this drawback, generative models have been
proposed to model the interaction between records of di erent
communities. These models are helpful to benchmark the
quality of known clustering techniques to identify clusters.</p>
      <p>
        Stochastic block model (SBM) is one of the most
popular random graph model that generalizes the Erdo}s-Renyi
graphs. According to SBM, edges between every pair of
nodes are drawn randomly with probability p if the
endpoints belong to the same cluster and q if they belong to
di erent clusters. One aspect that SBM does not capture
is the `transitivity rule' (friends having common friends),
which is inherent to formation of communities over social
networks. Intuitively, if two nodes x; y are connected by an
edge and y; z are connected by an edge then it is more likely
than not that x; z are connected by an edge. Inspired by
this, we proposed the geometric block model [
        <xref ref-type="bibr" rid="ref19 ref20">20, 19</xref>
        ] that
models community formation according to random
geometric graphs. One of the key distinction from SBM is that it
considers correlated edge formation, capturing the
properties of transitivity rule. We empirically validated the model
over collaboration networks and co-purchase networks.
      </p>
      <p>We observed that traditional techniques that were
developed for cluster recovery in SBM could not be used for the
geometric block model. We proposed a simple motif-based
counting algorithm to identify clusters and show that it is
optimal upto a constant fraction. We tested the e ectiveness
of our algorithm to recover clusters over various real-world
and synthetic datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>FAIRNESS AND INTERPRETABILITY</title>
      <p>There are a countless number of examples where the use
of biased systems have led to disastrous consequences.
Clustering techniques are used in various applications like team
formation and community detection which have societal
impact. Given their importance, there has been little work
on improving the fairness and interpretability of these
algorithms. We consider di erent clustering techniques and
devise scalable methods to improve their fairness and
interpretability.</p>
      <p>
        Correlation Clustering. Correlation clustering,
introduced by Bansal, Blum and Chawla in 2004 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], has received
tremendous attention in the past decade. The problem is
NP-complete and a series of follow-up work has resulted
in better approximation ratio, generalization to weighted
graphs, etc. [
        <xref ref-type="bibr" rid="ref10 ref4 ref9">4, 9, 10</xref>
        ]. This problem captures a wide range of
applications including clustering gene expression patterns [
        <xref ref-type="bibr" rid="ref23 ref8">8,
23</xref>
        ], and the aggregation of inconsistent information [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Chierichetti et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] extended the notion of disparate
impact to k-center and k-median objectives, and studied these
problems for the case of two groups. Their result was later
generalized to multiple groups by Rosner and Schmidt [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
We generalize the notion of disparate impact [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to
correlation clustering for multiple colors and our goal is to make
sure that the distribution of colors in each cluster is
identical to the global distribution. Additionally, we extend the
model introduced by Ahmadian et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on k-center to
correlation clustering to ensure that no color is over or under
represented in each cluster.
      </p>
      <p>
        More formally, our fairness-aware variant of correlation
clustering [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] identi es clusters while ensuring equal
distribution of demographics. Our algorithm proceeds in two
steps. In the rst step it identi es a matching between nodes
of di erent colors to construct small clusters that satisfy
fairness constraints. In the second step it chooses representative
nodes (one from each matched clusters) and employs
traditional correlation clustering algorithm to identify the nal
set of clusters. We prove that our algorithm identi es
clusters within a constant factor approximation of the optimal
solution. We further relax the equal distribution constraint
and extend our algorithm for a lower and upper bound
constraint on the number of nodes of each color in a cluster. To
further instill trust in the data, we explore multi-objective
clustering algorithms to generate explainable clusters with
minimal loss in the clustering objective.
      </p>
      <p>
        Interpretable Clustering. Clustering techniques are
expected to be inherently interpretable as the goal is to group
similar nodes together. However, with the increase in
number of features for each record, the generated clusters can
have poor interpretability. In our work [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], we measure
interpretability in terms of the homogeneity of nodes in a
cluster with respect to the features of interest for the end-user.
We consider the k-center clustering objective and develop
techniques to achieve -interpretability (for a given
parameter ) with respect to features of interest. The choice of
determines the trade-o between clustering objective and
interpretability.
      </p>
      <p>
        Multi-Objective Clustering. With the increased
societal impact of clustering techniques, the importance of
considering additional constraints like fairness, diversity,
interpretability and e ciency has increased. This has
motivated the study of multi-objective clustering techniques
focused towards these objectives. Existing techniques that
support multi-objective clustering either leverage a
scalarization function, which combines the multiple objectives into
a single objective, or nd clusters in parallel for each
objective and combine the results using di erent approaches
such as a tness function. Such techniques lose theoretical
guarantees with respect to any of the considered objectives.
In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we consider a lexicographic multi-objective
framework where the optimization objectives are lexicographically
ordered and our optimization algorithm follows the same
preference. In this setting, the goal is to prioritize primary
clustering objectives over ancillary objectives. To further
simulate di erent scenarios, our model uses a slack value
to improve the quality on secondary objectives and allows
minor deviations of the primary objective from its optimal
value. Our algorithm processes the di erent objectives in
the order of their preference and generates nal clustering.
Incase of any violation of clustering objectives, local search
techniques are employed to satisfy the corresponding slack
values.
4.
      </p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this work, we have studied the di erent facets of
clustering focussing on robustness, scalability, generative
modelling, fairness and interpretability of at clustering
algorithms. We demonstrated the e ectiveness of our techniques
to perform clustering with applications towards entity
resolution, community detection and other societal issues of
bias and discrimination. We study entity resolution from
the perspective of using oracles as an abstraction of humans
to answer pairwise queries and discuss the importance of
scalable techniques for web-scale datasets. In community
detection, we study generative models to simulate
interaction between records of di erent clusters. Additionally, we
study traditional clustering techniques along with fairness
and interpretability constraints. As a future work, we are
working towards extending our work to consider these
different facets for hierarchical clustering for applications like
taxonomy construction, knowledge graph construction, data
organization and team formation.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENT</title>
      <p>The authors would like to thank all the contributors, Barna
Saha (advisor), Donatella Firmani, Divesh Srivastava, Arya
Mazumdar, Soumyabrata Pal, Saba Ahmadi, Roy Schwartz,
Sandhya Saisubramanian, Shlomo Zilberstein, Udayan
Khurana, Oktie Hassanzadeh and Kavitha Srinivas.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] Are ai hiring programs eliminating bias or making it worse? Forbes.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          . Fair correlation clustering,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Epasto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahdian</surname>
          </string-name>
          .
          <article-title>Clustering without over-representation</article-title>
          .
          <source>In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          , pages
          <volume>267</volume>
          {
          <fpage>275</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ailon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Charikar</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Newman</surname>
          </string-name>
          .
          <article-title>Aggregating inconsistent information: ranking and clustering</article-title>
          .
          <source>Journal of the ACM (JACM)</source>
          ,
          <volume>55</volume>
          (
          <issue>5</issue>
          ):1{
          <fpage>27</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Alon</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Spencer</surname>
          </string-name>
          .
          <article-title>The probabilistic method</article-title>
          . John Wiley &amp; Sons,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Angwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mattu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Kirchner</surname>
          </string-name>
          .
          <article-title>Machine bias</article-title>
          .
          <source>ProPublica, May</source>
          <volume>23</volume>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawla</surname>
          </string-name>
          .
          <article-title>Correlation clustering</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>56</volume>
          (
          <issue>1-3</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben-Dor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shamir</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yakhini</surname>
          </string-name>
          .
          <article-title>Clustering gene expression patterns</article-title>
          .
          <source>Journal of computational biology</source>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          -4):
          <volume>281</volume>
          {
          <fpage>297</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Charikar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Guruswami</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Wirth</surname>
          </string-name>
          .
          <article-title>Clustering with qualitative information</article-title>
          .
          <source>Journal of Computer and System Sciences</source>
          ,
          <volume>71</volume>
          (
          <issue>3</issue>
          ):
          <volume>360</volume>
          {
          <fpage>383</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Makarychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schramm</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Yaroslavtsev</surname>
          </string-name>
          .
          <article-title>Near optimal lp rounding algorithm for correlationclustering on complete and complete k-partite graphs</article-title>
          .
          <source>In Proceedings of the forty-seventh annual ACM symposium on Theory of computing</source>
          , pages
          <volume>219</volume>
          {
          <fpage>228</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chierichetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lattanzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          .
          <article-title>Fair clustering through fairlets</article-title>
          . In I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , and R. Garnett, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          , pages
          <fpage>5029</fpage>
          {
          <fpage>5037</fpage>
          . Curran Associates, Inc.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I. P.</given-names>
            <surname>Fellegi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Sunter</surname>
          </string-name>
          .
          <article-title>A theory for record linkage</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>64</volume>
          (
          <issue>328</issue>
          ):
          <volume>1183</volume>
          {
          <fpage>1210</fpage>
          ,
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Filkov</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          .
          <article-title>Integrating microarray data by consensus clustering</article-title>
          .
          <source>International Journal on Arti cial Intelligence Tools</source>
          ,
          <volume>13</volume>
          (
          <issue>04</issue>
          ):
          <volume>863</volume>
          {
          <fpage>880</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Robust entity resolution using a crowdoracle</article-title>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Online entity resolution using an oracle</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>9</volume>
          (
          <issue>5</issue>
          ):
          <volume>384</volume>
          {
          <fpage>395</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Robust entity resolution using random graphs</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>E cient and e ective er with progressive blocking</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Khurana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samulowitz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Qi</surname>
          </string-name>
          .
          <article-title>Automated feature enhancement for predictive modeling using external knowledge</article-title>
          . ICDM,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          .
          <article-title>Connectivity in random annulus graphs and the geometric block model</article-title>
          .
          <source>CoRR</source>
          , abs/
          <year>1804</year>
          .05013,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          .
          <article-title>The geometric block model</article-title>
          .
          <source>In Thirty-Second AAAI Conference on Arti cial Intelligence</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saisubramanian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zilberstein</surname>
          </string-name>
          .
          <article-title>Lexicographically ordered multi-objective clustering</article-title>
          .
          <source>arXiv preprint arXiv:1903.00750</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Goodall</surname>
          </string-name>
          .
          <article-title>Can you program ethics into a self-driving car</article-title>
          ? IEEE Spectrum,
          <volume>53</volume>
          (
          <issue>6</issue>
          ):
          <volume>28</volume>
          {
          <fpage>58</fpage>
          ,
          <year>June 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          , F. Hu ner, C. Komusiewicz, and
          <string-name>
            <surname>Y. Zhang.</surname>
          </string-name>
          <article-title>Improved algorithms for bicluster editing</article-title>
          . In M. Agrawal,
          <string-name>
            <given-names>D.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>A</surname>
          </string-name>
          . Li, editors,
          <source>Theory and Applications of Models of Computation</source>
          , pages
          <volume>445</volume>
          {
          <fpage>456</fpage>
          , Berlin, Heidelberg,
          <year>2008</year>
          . Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Olson</surname>
          </string-name>
          .
          <article-title>The algorithm that beats your bank manager</article-title>
          .
          <source>CNN Money, March</source>
          <volume>15</volume>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Svirsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          .
          <article-title>Comparative analysis of approximate blocking techniques for entity resolution</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>9</volume>
          (
          <issue>9</issue>
          ):
          <volume>684</volume>
          {
          <fpage>695</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>C. R</surname>
          </string-name>
          <article-title>osner and M. Schmidt. Privacy preserving clustering with constraints</article-title>
          .
          <source>arXiv preprint arXiv:1802.02497</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saisubramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zilberstein</surname>
          </string-name>
          .
          <article-title>Balancing the tradeo between clustering value and interpretability</article-title>
          .
          <source>In Proceedings of the AAAI/ACM Conference on AI</source>
          ,
          <string-name>
            <surname>Ethics</surname>
          </string-name>
          , and Society, AIES '
          <volume>20</volume>
          , page
          <volume>351</volume>
          {
          <fpage>357</fpage>
          , New York, NY, USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E.</given-names>
            <surname>Strickland</surname>
          </string-name>
          .
          <article-title>Doc bot preps for the O</article-title>
          .R. IEEE Spectrum,
          <volume>53</volume>
          (
          <issue>6</issue>
          ):
          <volume>32</volume>
          {
          <fpage>60</fpage>
          ,
          <year>June 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>N.</given-names>
            <surname>Vesdapunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bellare</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          .
          <article-title>Crowdsourcing algorithms for entity resolution</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>7</volume>
          (
          <issue>12</issue>
          ):
          <volume>1071</volume>
          {
          <fpage>1082</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          . Crowder:
          <article-title>Crowdsourcing entity resolution</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <volume>1483</volume>
          {
          <fpage>1494</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>Leveraging transitive relations for crowdsourced joins</article-title>
          .
          <source>In SIGMOD Conference</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>