<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Crowdsourced Entity Resolution with Control Queries (Discussion Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sainyam Galhotra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donatella Firmani</string-name>
          <email>donatella.firmani@uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barna Saha</string-name>
          <email>barnag@cs.umass.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Divesh Srivastava</string-name>
          <email>divesh@research.att.com</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Roma Tre University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UMass Amherst</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity resolution (ER) seeks to identify which records in a data set refer to the same real-world entity. Given the diversity of ways in which entities can be represented, ER is known to be a challenging task for automated strategies, but relatively easier for expert humans. Nonetheless, also humans can make mistakes. Our contribution is an error correction toolkit that can be leveraged by a variety of hybrid human-machine ER algorithms, based on a formal way for selecting \control queries" for the human experts. We demonstrate empirically that less recent ER algorithms equipped with our tool can perform even better than most recent ER methods with built-in error correction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Entity resolution (ER) is the problem of identifying which records in a data set
refer to the same underlying real-world entity [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ]. Examples include resolving
pro les of people and businesses, or speci cations of products and services, from
websites and social media. ER can be very intricate, as entities are typically
represented in a wide variety of ways that humans can match and distinguish based
on domain knowledge, but would be challenging for automated strategies. For
these reasons, many frameworks have been developed to leverage humans for
performing entity resolution tasks [
        <xref ref-type="bibr" rid="ref12 ref21">21, 12</xref>
        ], by using, for instance, a crowd-sourcing
platform such as Amazon Mechanical Turk. In these frameworks, humans are
typically asked questions of the kind \do the record u represent the same
entity than the record v?" However, certain record pairs can be challenging to
resolve correctly even for humans: take for instance two used iPhones in a retail
website, and try to verify whether they are the same model or they have some
small di erences, for instance in the hardware equipment. Cross-checking human
answers can require signi cant e ort, resulting in more money invested in the
crowd-sourcing platform or low precision and recall in the computed result.
      </p>
      <p>
        To this end, we design an error correction toolkit that operates outside the
platform, and only requires a small amount of extra questions. Compared to
Copyright c 2019 for the individual papers by the papers authors. Copying
permitted for private and academic purposes. This volume is published and copyrighted by
its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
(a)
(b)
(c)
other error correction methods, our methodology can be applied to any ER
algorithm, boosting its precision and recall and providing formal performance
guarantees. In addition, it can be tuned (or even seamlessly turned o ) trading
o budget for accuracy and adapts to di erent ER applications.
Main intuition. In literature, there are two main ways to solve ER errors.
{ Replicating the same question (i.e. about the same pair) and submitting the
replicas to multiple humans, until enough evidence is collected for labeling
the pair as matching or non-matching (e.g., with voting strategies).
{ Using robust clustering techniques for partitioning records into entities,
minimizing disagreements between answers (e.g., with correlation clustering).
While recent error-correction works such as [
        <xref ref-type="bibr" rid="ref14 ref18 ref19">14, 18, 19</xref>
        ] provide speci c ER
algorithms that incorporate both approaches at the same time, we decouple the
pair-wise error correction layer from the robust clustering, and focus on which
extra-answers would solve disagreements in a principled way, transparently to the
underlying ER logic. At the pair level, we can leverage general purpose
answerquality mechanisms already described in the crowd-sourcing literature [
        <xref ref-type="bibr" rid="ref2 ref23">2, 23</xref>
        ]
(that can take advantage of workers meta-data and incentives, to name a few,
other than just replicating queries) or use simple majority voting, depending on
the application domain. At the cluster level, we proceed as follows.
Expanders. ER strategies based on human experts (e.g., [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref8">21, 22, 20, 8</xref>
        ]) typically
build a spanning forest of positive answers for inferring entities via connectivity,
based on the assumption that all answers are correct. However, the minimal
connectivity of trees yields bad results even in presence of few errors. In a nutshell,
we search for queries that can make the connectivity stronger by transforming
each tree into an expander graph. Graphs with good expansion properties have
indeed provably high connectivity and are cheap to build. Intuitively, every edge
cut of a graph expander is times larger than the smallest side of the cut, while
every edge of a tree is a cut-edge. This di erence is illustrated in Figure 1.
Contributions and outline. We show that the most recent error-correction
algorithms can be outperformed by simpler strategies equipped with our
expander layer. Section 2 contains an informal overview of our approach. Detailed
description of our algorithms and their theoretical basis can be found in [
        <xref ref-type="bibr" rid="ref7 ref9">9, 7</xref>
        ].
Main experimental results of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] are reported in Section 3. Finally, Sections 4
and 5 contain related work and our concluding remarks.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Overview</title>
      <p>
        We model the pair-wise error correction layer as a black box { that is, a noisy
oracle { returning already processed pair-wise answers and their corresponding
con dence score, possibly labeling some queried pairs incorrectly. We formalize
a robust version of the entity resolution problem based on such an oracle and
consider progressive F-measure [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as the metric to be maximized in this setting.
If one plots a curve of F-measure as a function of the number of oracle queries,
progressive F-measure is quanti ed as the area under this curve.
Problem 1 (Noisy Oracle Strategy) Given a set of records V , a noisy
oracle access to C, and a matching probability function pm (possibly de ned on a
subset of V V ), nd the strategy that maximizes progressive F-measure.
Random expansion. Our toolkit guarantees that every cluster { more precisely,
the corresponding subgraph induced by positive answers { has good expansion
properties. Speci cally, every time an ER strategy such as [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref8">21, 22, 20, 8</xref>
        ] asks a
query involving two nodes { say (x; y) { we ask other queries incident to the
corresponding clusters cx and cy, such as (x1; y1) and (x2; y2), with x1; x2 2 cx
and y1; y2 2 cy. Note that extra queries are not replicas of (x; y), and they are
all distinct. If the answers are positive then the degree of nodes within cx [ cy
increases, and if the degree gets high enough then we merge cx and cy together.
We know that random regular graphs have good expansion properties [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] in
practice, and hence choosing the queries randomly from cx cy helps us to
maintain the required expansion. We refer to our approach as random expansion.
Technically, we consider weighted expanders, where the weight of and edge (x; y)
is set to the probability pe(x; y) that the oracle answer to x; y is erroneous.
Example 1 In Figure 1 we show the possible result of having an expander
( = 1) for data in Figure 1a, compared with spanning trees. Both connected
components of the left gure (i.e., trees in the spanning forest) and expander graphs of
right gure yield the same, correct, clustering f1; 2; 3; 4; 5g; f6; 7; 8g; f9g. While
connected components only work in absence of errors, expander produces the
correct clustering also in presence of plausible human errors such as (1; 9), (8; 3)
(false positives) and (3; 5) (false negative). Even though expansion requires more
queries than building a spanning forest, it is far from being exhaustive: in the
larger cluster out of 52 = 10 possible queries, only 5 are asked.
      </p>
      <p>Algorithm 1 High-level description on adaptive( ) pipeline.
1: run node ordering equipped with query edges( ; ; ) upon disagreement
2: boost fscore( )
3: run edge ordering equipped with query edges( ; ; ) upon disagreement
4: boost fscore( )
Expander Toolkit. Our toolkit consists of two methods: query edges() and
boost fscore(), as described next. The input includes the parameter , the
matching probability function pm, and the error probabilities pe.
{ Given a query (u; v) selected by an ER strategy, query edges( ; u; v)
provides an intermediate layer between the ER logic and the noisy oracle:
instead of asking the selected query (u; v) as the considered strategy would do,
query edges( ; u; v) selects a bunch of random queries between the cluster
of u and the cluster of v, and returns a YES answer only if the joint error
probability of the cut is small.
{ The boost fscore( ) method instead provides functionalities for
maintaining the expansion properties of subgraphs that were not considered during
the execution of the ER process, by either adding new edges to weak cuts
or splitting the clusters in expander subgraphs. Running boost fscore( )
at a later phase of the ER process can x premature decisions.</p>
      <p>
        The parameter trades o the number of queries for precision: smaller values of
correspond to sparser clusters, and therefore to less queries. = 0 is equivalent
to using the oracle strategy alone. Greater values of correspond to denser
clusters and to higher precision: we prove that its expected value is 1 O(1=n ).
Deployment of the toolkit. Consider a perfect oracle strategy such as those
in [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref8">21, 22, 20, 8</xref>
        ], assuming all the answers correct and thus asking a spanning
forest. As mentioned, query edges() can be used as a substitute of plain
connectivity for deciding if two clusters refer to the same entity or not. For instance,
in case of node ordering based strategies, such as [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], a singleton node is added
to a cluster by querying a single edge with it. We can replace that querying step
with query edges() which will try to check if the singleton cluster containing
the node refers to the same entity as the one it was supposed to be queried with.
The edge ordering based strategies, such as [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], query the edges in decreasing
order of probabilities to decide if the two endpoint clusters ought to be merged
or not. Similarly, we can make the same decision by replacing this querying step
with query edges() of the two endpoint clusters. We note that the strategy in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
combines node and edge ordering, and can be modi ed to apply random
expansion similarly. Along with the incorporation of query edges(), any strategy can
call boost fscore() towards the end for correcting the errors made in earlier
stages. This shows that our toolkit is independent and easy to use.
Adaptive strategy. In the previous paragraph, we have described a simple
way to merge the expansion toolkit with a perfect oracle strategies to
minimize the e ect of error in the noisy setting. One of the main advantages of the
above described methods is that they maintain high precision throughout the
dataset n k jC1j ref. description
cora 1.9K 191 236 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] Title, author, venue, and date of scienti c papers.
captcha? 244 69 8 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] CAPTCHA images showing four-digit numbers.
gym? 94 12 15 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] Images of gymnastics athletes in di erent positions.
Table 1: Datasets used in our experiments. We report the number of records
n, number of clusters k (i.e., entities), size of the largest cluster jC1j, and the
paper where they appeared rst. Datasets with the symbol ? come with a cache
of answers from the AMT crowd. cora has synthetic noisy oracle answers.
resolution phase. This high precision is achieved at the cost of lower
progressive F-score. A close look at the expansion toolkit shows that we can possibly
improve this shortcoming as well. To this end, we also provide the adaptive()
pipeline in Algorithm 1, building on the ideas of the hybrid() method in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Let query single edge(u; v) refer to a pair-wise oracle query of the kind \do
the record u represent the same entity than the record v?". The intuition of
adaptive() is to switch between query single edge() and query edges()
depending on the current answer. Speci cally, we call query edges() only if the
con dence score returned by the noisy oracle in \disagreement" with the
matching probabilities (e.g., a positive answer in case of low matching probability).
We make use of boost fscore() between the node and edge ordering phases,
for correcting early errors due to the adaptive nature of the strategy. Speci ally,
adaptive() allows temporary non-expander clusters, thus allowing spurious
violations of our invariant. However, we observe in practice that such behaviour
can provide high gains in progressive F-score without losing on precision.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>
        We compare our algorithms with the most recent approaches for crowdsourced
ER [
        <xref ref-type="bibr" rid="ref14 ref18 ref19">14, 18, 19</xref>
        ]. As a frame of reference, we use the ideal curve with that achieves
maximum progressive F-score by progressively growing the true clusters as in
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We ran experiments on a machine with two CPU Intel Xeon E5520 units
with 16 cores each, running at 2.67GHz, with 32GB RAM. We consider both
real and synthetic datasets, as summarized in Table 1. We show performances
not only of adaptive(), but also of other variants considered in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and refer
the reader to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for a more extensive experimental comparison.
      </p>
      <p>
        In Figures 2a and 2b, we compare the progressive F-score of our and
previous strategies. We set the x axis to the number of distinct queries in all the
datasets. The plots show that our adaptive() pipeline achieves more than 90%
F-score with less than 400 queries in case of gym. Similarly for captchas, the
F-score achieved is more than 90% even with the simplest pipeline lazy(), which
correspond to the ER strategy in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] equipped with our boost fscore() method
at the end. This is due to the zero false positives in the oracle answers for
captchas. In such scenario, lazy() has 100% precision. Bene ts of using our
expander pipelines become evident with more challenging dataset, such as gym.
In order to investigate further the e ect of oracle answers on gym, we set input
probability 1 of a pair (u; v) if and only if (u; v) is matching, and 0 otherwise,
and generate synthetic erroneous answers as the error rate for both false
positives and negatives varies in the range [0; 0:3]. In this setting, we observed that
performance of adaptive() is almost ideal up to answer error rate 0:2, which is
higher than the error typically observed in real crowd sourcing platforms such
as Amazon Mechanical Turk [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Figure 2c demonstrates the e ect of changing
the tuning parameter on the performance of adaptive() (multiple queries are
counted separately). It is evident that the progressive F-score improves as the
value is reduced. The downside of this is that it tries to grow a sparser graph,
which has lower precision and the algorithm plateau's out at lower nal F-score.
On the other hand, higher values of make the approach conservative to ask
more queries and reduces the progressive F-score. The above described behavior
is consistent with our theoretical bounds proven in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Alternative forms of expansion. Although random expansion is analytically
proven to require less queries than deterministic expansion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], empirically one
can make use of alternative forms of expansions for selecting \control queries".
The table below compares the number of queries asked by adaptive() on cora
with di erent expansion strategies, dubbed high, low, and uncertain, selecting
queries with the closest matching probability to 1, 0, and 0:5 respectively.
F-score
0:9
0:85
p+; p
Random expansion gets to the same F-score of deterministic alternatives with
less or comparable number of queries. We observed similar results for all the
considered datasets. Even though cora seems to select uncertain as the best
deterministic alternative, a deeper look at the experimental data reveals that
the 0:5 matching probability bucket is the one containing most edges in cora,
and drawing edges from this bucket closely resembles random selection.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Related Works</title>
      <p>
        ER has a long history, from the seminal paper by Fellegi and Sunter in 1969 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which proposed the use of a learning-based approach, to the recently proposed
hybrid human-machine approaches. We focus on the latter line of work, which
we refer to crowdsourced ER, where we can ask questions to humans about two
records representing the same real-world entity.
      </p>
      <p>
        Perfect oracle strategies. The strategies described in [
        <xref ref-type="bibr" rid="ref20 ref22 ref8">20, 22, 8</xref>
        ] abstract the
crowd as a whole by an oracle, which can provide a correct YES/NO answer
for a given pair of items. This allows for leveraging transitivity: when we know
that u matches with v and v matches with w, the matching of u and w is
inferred automatically. The strategies focus on di erent ER logics, which can be
formally compared in terms of the number of questions asked for 100% recall [
        <xref ref-type="bibr" rid="ref15 ref8">8,
15</xref>
        ]. Unfortunately, the strategies do not apply to low quality of answers.
Error-correction methods. In order to deal with erroneous answers by
humans, recent approaches typically ask the same query to individual crowd
workers, rather than making use of control queries to a crowd abstraction such as the
noisy oracle. The strategies in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] submits the same query to multiple crowd
workers, and every answer represents a vote. The goal is to achieve a clear
majority of YES or NO answers for some pairs (u; v), and use those high con dence
pairs for building clusters. The strategies described in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] assume that every
crowd worker has a known error rate pE . In this setting, each YES/NO answer
for a given pair of items can be interpreted as a matching probability. Such
strategies start from an initial clustering and then re nes the solution by asking
to the crowd. Finally, the work in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] uses a combination of k-node queries (for
instance, with k = 3, \are u, v and z the same entity?") and pair-wise queries.
Other Related Works. Wang et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] describe a hybrid human-machine
framework CrowdER, that automatically detects pairs that have a high
likelihood of matching, which are then veri ed by humans. Gokhale et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] propose
a hybrid approach for the end-to-end work ow, making e ective use of active
learning via human labeling. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the authors devise a partial ordering based
technique to resolve entities, which applies for records of text attributes.
Progressive F-score has been discussed in [
        <xref ref-type="bibr" rid="ref13 ref17 ref24">24, 17, 13</xref>
        ]. Finally, estimating or improving
crowd accuracy in general, as in [
        <xref ref-type="bibr" rid="ref11 ref2 ref4">4, 11, 2</xref>
        ], is outside the scope of this paper.
5
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We address the problem of maximizing progressive F-measure for the
crowdsourced ER task. We propose an e cient and cost-e ective layer which can
be applied on typical ER strategies to make them robust to crowd errors, and
present di erent pipelines that use the expansion toolkit to provide high
progressive F-score with provable guarantees. Our experiments over real and synthetic
datasets con rm our theory and show the superiority of our pipelines over the
techniques proposed in the literature. In our future work, we plan to apply our
results to temporal record linkage, where error probability is higher as two records
are farther in time and very small for nearby records of the same entity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.</given-names>
            <surname>Alon</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Spencer</surname>
          </string-name>
          .
          <article-title>The probabilistic method</article-title>
          . John Wiley &amp; Sons,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Bragg</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          .
          <article-title>Optimal testing for crowd workers</article-title>
          .
          <source>In AAMAS</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>Cost-e ective crowdsourced entity resolution: A partial-order approach</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <volume>969</volume>
          {
          <fpage>984</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          .
          <article-title>Aggregating crowdsourced binary ratings</article-title>
          .
          <source>In WWW</source>
          , pages
          <volume>285</volume>
          {
          <fpage>294</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>A. K. Elmagarmid</surname>
            ,
            <given-names>P. G.</given-names>
          </string-name>
          <string-name>
            <surname>Ipeirotis</surname>
            , and
            <given-names>V. S.</given-names>
          </string-name>
          <string-name>
            <surname>Verykios</surname>
          </string-name>
          .
          <article-title>Duplicate record detection: A survey</article-title>
          .
          <source>IEEE Trans. Knowl</source>
          . Data Eng.,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):1{
          <fpage>16</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>I. P.</given-names>
            <surname>Fellegi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Sunter</surname>
          </string-name>
          .
          <article-title>A theory for record linkage</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>64</volume>
          (
          <issue>328</issue>
          ):
          <volume>1183</volume>
          {
          <fpage>1210</fpage>
          ,
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Robust entity resolution using a crowdoracle</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <volume>41</volume>
          (
          <issue>2</issue>
          ):
          <volume>91</volume>
          {
          <fpage>103</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Online entity resolution using an oracle</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>9</volume>
          (
          <issue>5</issue>
          ):
          <volume>384</volume>
          {
          <fpage>395</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Robust entity resolution using random graphs</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <volume>3</volume>
          {
          <fpage>18</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Machanavajjhala</surname>
          </string-name>
          .
          <article-title>Entity resolution: theory, practice &amp; open challenges</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>12</issue>
          ):
          <year>2018</year>
          {
          <year>2019</year>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kale</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>McAfee</surname>
          </string-name>
          .
          <article-title>Who moderates the moderators?: crowdsourcing abuse detection in user-generated content</article-title>
          .
          <source>In EC</source>
          , pages
          <volume>167</volume>
          {
          <fpage>176</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>C. Gokhale</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          <string-name>
            <surname>Naughton</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Rampalli</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Shavlik</surname>
            , and
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          . Corleone:
          <article-title>Hands-o crowdsourcing for entity matching</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <volume>601</volume>
          {
          <fpage>612</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gruenheid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Incremental record linkage</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>7</volume>
          (
          <issue>9</issue>
          ):
          <volume>697</volume>
          {
          <fpage>708</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gruenheid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gatterbauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          .
          <article-title>Faulttolerant entity resolution with the crowd</article-title>
          .
          <source>arXiv preprint arXiv:1512.00537</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazumdar</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          .
          <article-title>A theoretical analysis of rst heuristics of crowdsourced entity resolution</article-title>
          .
          <source>AAAI</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>A. McCallum</surname>
          </string-name>
          ,
          <year>2004</year>
          . https://people.cs.umass.edu/~mccallum/data.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>T.</given-names>
            <surname>Papenbrock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Progressive duplicate detection. Knowledge and Data Engineering</article-title>
          , IEEE Transactions on,
          <volume>27</volume>
          (
          <issue>5</issue>
          ):
          <volume>1316</volume>
          {
          <fpage>1329</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>V.</given-names>
            <surname>Verroios</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Entity resolution with crowd errors</article-title>
          .
          <source>In ICDE Conference</source>
          , pages
          <volume>219</volume>
          {
          <fpage>230</fpage>
          ,
          <string-name>
            <surname>April</surname>
          </string-name>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>V.</given-names>
            <surname>Verroios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou. Waldo</surname>
          </string-name>
          :
          <article-title>An adaptive human interface for crowd entity resolution</article-title>
          .
          <source>In SIGMOD Conference</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>N.</given-names>
            <surname>Vesdapunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bellare</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          .
          <article-title>Crowdsourcing algorithms for entity resolution</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>7</volume>
          (
          <issue>12</issue>
          ):
          <volume>1071</volume>
          {
          <fpage>1082</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. J.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Kraska</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <string-name>
            <surname>Franklin</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Feng</surname>
          </string-name>
          . Crowder:
          <article-title>Crowdsourcing entity resolution</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <volume>1483</volume>
          {
          <fpage>1494</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22. J.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Kraska</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <string-name>
            <surname>Franklin</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>Leveraging transitive relations for crowdsourced joins</article-title>
          .
          <source>In SIGMOD Conference</source>
          , pages
          <volume>229</volume>
          {
          <fpage>240</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Branson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          .
          <article-title>The multidimensional wisdom of crowds</article-title>
          .
          <source>In NIPS</source>
          , pages
          <volume>2424</volume>
          {
          <fpage>2432</fpage>
          , USA,
          <year>2010</year>
          . Curran Associates Inc.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marmaros</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Pay-as-you-go entity resolution. Knowledge and Data Engineering</article-title>
          , IEEE Transactions on,
          <volume>25</volume>
          (
          <issue>5</issue>
          ):
          <volume>1111</volume>
          {
          <fpage>1124</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>