<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Pizzo Calabro (VV), Italy</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Case for Multi-task Active Learning Entity Resolution</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giovanni Simonini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henrique Saccani</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Gagliardelli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Zecchini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Beneventano</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonia Bergamaschi</string-name>
        </contrib>
        <aff><institution>Università degli Studi di Modena e Reggio Emilia</institution>, <country>Italy</country> &lt;name.surname&gt;@unimore.it, @studenti.unimore.it</aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Entity Resolution (ER) is a multi-task process that aims to detect different records in dirty datasets that refer to the same real-world entity. It is a building block for any data integration and cleaning framework. State-of-the-art ER approaches heavily rely on supervised Machine Learning, hence on the availability of training data, which is notoriously hard to get in many contexts. To overcome this burden, many Active Learning (AL) strategies have been applied to the individual tasks of ER to minimize the amount of training data by judiciously choosing which data to label. Yet, no study has been conducted on applying AL to ER in a holistic way. This paper aims to fill this gap by experimentally studying, on real-world datasets, how to tackle ER as a multi-task AL problem, and provides guidelines on how to balance AL across the different ER tasks to get better results.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Resolution</kwd>
        <kwd>Active Learning</kwd>
        <kwd>Data Integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Entity Resolution with Active Learning</title>
        <p>Identifying records in datasets that refer to the same real-world entity is a fundamental task in data cleaning and data integration, known as Entity Resolution (ER).</p>
        <p>
          State-of-the-art ER methods heavily rely on Machine Learning [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], hence on the availability of training data. One way to alleviate this problem is to employ Active Learning (AL), which judiciously chooses the data to label so as to minimize the human intervention that is typically the bottleneck [2, 3, 4, 5]. In practice, AL is commonly employed to train binary classifiers that determine whether a pair of records is a match or not. We call these classifiers matchers. The basic idea is that for pairs of records we can compute similarity/distance measures (e.g., string similarities of the attributes, such as the Jaccard index) that allow us to represent them as feature vectors. Some of these pairs are then labeled by humans to provide a training set for a classifier (e.g., Random Forest, SVM, etc.). AL aims to select the pairs that, once employed to train the learner, will increase its accuracy the most. The major shortcoming is that computing a feature vector for all possible pairs of records in a dataset is typically computationally prohibitive, due to its inherently quadratic complexity. Thus, blocking is employed to discard pairs of records that most likely will not match. The basic idea is to employ “cheap” similarity (blocking) functions that work as surrogates for the “expensive” ones employed by the matchers. Blocking functions, too, can be learned with standard Machine Learning classifiers [6], which can be combined with AL.
        </p>
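        <p>The pool-based AL loop just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses scikit-learn's RandomForestClassifier with least-confidence sampling, and the oracle function is a hypothetical stand-in for the human annotator.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def uncertainty_sampling_loop(X_pool, oracle, budget, seed_size=10):
    """Pool-based active learning with least-confidence sampling.

    X_pool : (n, d) array of feature vectors, one per candidate record pair.
    oracle : function index -> 0/1 label (stands in for the human annotator).
    budget : total number of labels the annotator is willing to provide.
    """
    rng = np.random.default_rng(42)
    # Bootstrap: label a small random seed set.
    y = {int(i): oracle(int(i))
         for i in rng.choice(len(X_pool), size=seed_size, replace=False)}
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    for _ in range(budget - seed_size):
        idx = sorted(y)
        clf.fit(X_pool[idx], [y[i] for i in idx])
        # Confidence of the most likely class; low confidence = informative.
        confidence = clf.predict_proba(X_pool).max(axis=1)
        confidence[idx] = np.inf          # never re-query already-labeled pairs
        nxt = int(np.argmin(confidence))  # least-confident pair
        y[nxt] = oracle(nxt)              # ask the oracle for one more label
    idx = sorted(y)
    clf.fit(X_pool[idx], [y[i] for i in idx])
    return clf, y
```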
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Contribution</title>
        <p>To date, employing AL-based matchers and AL-based blocking for the same ER task has always been studied as two separate problems [7]. In this paper, we investigate how a practitioner should allocate her “budget” for labeling data (i.e., the number of instances she is willing to label given her time constraints) between blocking and matching to achieve the best final result for the ER task at hand. Our preliminary results suggest that modeling ER as a multi-task Active Learning problem can lead to better results than simply splitting the budget between the two tasks. In Section 4 we demonstrate this on four well-known real-world datasets, showing that in some cases a multi-task AL approach can achieve results as good as a state-of-the-art Deep Learning ER algorithm trained with a significantly higher amount of labeled pairs.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Entity Matching (EM) can be interpreted as a binary classification problem applied to record pairs, labeled as matches or non-matches. The application of AL to classification tasks has been covered in the literature [8] with the goal of maximizing classification accuracy while minimizing the required human labeling effort. However, state-of-the-art solutions for EM, like Magellan [9] or DeepMatcher [10], do not exploit AL, relying in some cases on other methods to reduce the labeling effort (e.g., transfer learning [11]).</p>
      <p>If the oracle is a human annotator, AL can be considered a human-in-the-loop situation, discussed in the EM literature [12]. Significant approaches to this problem start from the idea of using crowdsourcing for EM tasks [13, 12]: this is the case of Falcon [3], which proposes the idea of hands-off crowdsourcing (HOC), using the crowd in the major tasks of the EM process.</p>
      <p>AL for EM is often framed as the problem of choosing the best combination of example selector and classifier [7]. In many cases, related work focuses either on matching (considering blocking as a separate pre-processing step) [14] or on blocking [15]; when both steps are considered (e.g., Falcon), the different tasks of the EM pipeline are still treated separately, lacking a holistic approach. In fact, even if multi-task AL has already been used for other tasks [16], to the best of our knowledge it had never been applied to data cleaning before our study.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Study Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Active Learning ER pipeline</title>
        <p>Recall that the goal of blocking is to avoid expensive comparisons. Thus, for each pair, we compute a set of features exploiting blocking functions (details below). These features are used to train a classifier with AL using budget 1: first, the classifier is trained on a minimal number of randomly chosen samples; then, until the budget is consumed, the classifier is refined by asking a user to label the most important pairs. At each iteration, the AL algorithm chooses for labeling the most informative pairs, i.e., those that, once labeled, will improve the accuracy of the classification model the most.</p>
        <p>At the end of this process, only the pairs that are classified as candidates are kept. In the matching phase the same approach is applied, but using a different set of features based on expensive (and more accurate) string similarity measures. Thus, a classifier is trained with AL using budget 2 and applied to all the non-labeled pairs to detect all the matching pairs. Naturally, the matcher is also trained with the pairs labeled in the blocking phase, since they are already available.</p>
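        <p>The two budgets can be wired together as a two-stage pipeline, sketched below. This is a toy illustration under stated assumptions: the AL query strategy is simplified to random sampling, and the function names (train_with_budget, er_pipeline) are illustrative, not from the paper's actual system.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_with_budget(X, oracle, budget, warm=None, seed=0):
    """Label up to `budget` rows (the query strategy is simplified to random
    sampling here) and fit a Random Forest; stands in for the AL loop."""
    y = dict(warm or {})
    rng = np.random.default_rng(seed)
    unlabeled = [i for i in range(len(X)) if i not in y]
    for i in rng.permutation(unlabeled)[:budget]:
        y[int(i)] = oracle(int(i))
    idx = sorted(y)
    clf = RandomForestClassifier(random_state=0).fit(X[idx], [y[i] for i in idx])
    return clf, y

def er_pipeline(X_blocking, X_matching, oracle, budget1, budget2):
    """Stage 1 (blocking) consumes budget 1; the surviving candidate pairs
    go to stage 2 (matching), whose learner reuses the stage-1 labels."""
    blocker, y1 = train_with_budget(X_blocking, oracle, budget1)
    cand = np.flatnonzero(blocker.predict(X_blocking) == 1)   # candidate pairs
    warm = {j: y1[i] for j, i in enumerate(cand) if i in y1}  # reuse labels
    matcher, _ = train_with_budget(X_matching[cand],
                                   lambda j: oracle(int(cand[j])),
                                   budget2, warm=warm)
    return cand[matcher.predict(X_matching[cand]) == 1]       # predicted matches
```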
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Blocking and Matching Framework</title>
        <p>We define an entity profile p as a tuple composed of a unique identifier and a set of name-value pairs. We call profile collection P a set of profiles. Two profiles p_i, p_j ∈ P are called matches if they refer to the same real-world object. Thus, Entity Resolution (ER) is the task of identifying all the matches, given P.</p>
        <p>We employed a schema-agnostic redundant blocking technique (i.e., Token Blocking [17]) combined with meta-blocking [6]. We chose this setting because it has been proven to be an efficient and effective blocking technique that does not require parameter tuning and works even when the schema alignment is incomplete (or missing altogether). Given a profile collection P, Token Blocking generates an inverted index over the tokens appearing in the profiles; then, it considers each entry of that index as a block. This means that two profiles appear together in a block if they share at least one token, and they share as many blocks as the tokens they have in common. The set of resulting blocks is called the block collection B.</p>
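        <p>Token Blocking can be sketched in a few lines. A minimal illustration, assuming whitespace tokenization over lower-cased attribute values; real implementations also normalize tokens and purge oversized blocks.</p>

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(profiles):
    """profiles: dict profile_id -> dict of attribute name/value pairs.
    Returns the block collection: token -> set of profile ids sharing it."""
    index = defaultdict(set)
    for pid, attrs in profiles.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                index[token].add(pid)
    # Each inverted-index entry is a block; singleton blocks yield no comparisons.
    return {tok: pids for tok, pids in index.items() if len(pids) > 1}

def comparisons(blocks):
    """All comparisons induced by a block collection (duplicates removed)."""
    return {tuple(sorted(pair))
            for pids in blocks.values()
            for pair in combinations(pids, 2)}
```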
        <p>We indicate with |B| the number of blocks b ∈ B and with ‖b‖ the number of comparisons ⟨p_i, p_j⟩ involved by the block b. From B we extract the set of features ℱ as described by Papadakis et al. in [6]. Computing these features from B is orders of magnitude faster than computing string-based similarities, which makes them the ideal candidates for generating the features of the blocking phase. Thus, for each comparison in B we extract a feature vector, which can be used for Active Learning. In particular, we employed the best feature set as reported in [6], summarized in Table 1 (notation in Table 2).</p>
        <p>Table 1. Feature set used in the blocking phase (B_i denotes the set of blocks containing p_i, and B_{i,j} = B_i ∩ B_j):</p>
        <p>Node Degree (ND): the number of non-redundant comparisons involving a profile p_i.</p>
        <p>Jaccard Similarity (JS): JS(p_i, p_j) = |B_{i,j}| / (|B_i| + |B_j| − |B_{i,j}|).</p>
        <p>Co-occurrence Frequency-Inverse Block Frequency (CFIBF): CFIBF(p_i, p_j) = |B_{i,j}| · log(|B|/|B_i|) · log(|B|/|B_j|).</p>
        <p>Reciprocal Aggregate Cardinality of Common Blocks (RACCB): RACCB(p_i, p_j) = Σ_{b ∈ B_{i,j}} 1/‖b‖.</p>
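        <p>The pairwise features above can be computed directly from the block collection. A sketch covering JS, CFIBF, and RACCB, assuming blocks are given as a dict from block key to the set of profile ids it contains (the function name and representation are illustrative):</p>

```python
from math import log

def blocking_features(blocks, pid_i, pid_j):
    """Compute JS, CFIBF, and RACCB for one comparison.

    blocks : dict block_key -> set of profile ids (the block collection B).
    B_i is the set of blocks containing p_i; ‖b‖ is the number of
    comparisons in block b, i.e. |b| choose 2.
    """
    Bi = {b for b, pids in blocks.items() if pid_i in pids}
    Bj = {b for b, pids in blocks.items() if pid_j in pids}
    Bij = Bi.intersection(Bj)
    js = len(Bij) / (len(Bi) + len(Bj) - len(Bij))
    cfibf = len(Bij) * log(len(blocks) / len(Bi)) * log(len(blocks) / len(Bj))
    raccb = sum(1 / (len(blocks[b]) * (len(blocks[b]) - 1) / 2) for b in Bij)
    return js, cfibf, raccb
```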
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Matching</title>
        <p>For the matching phase, we employed common similarity/distance functions to compute feature vectors for pairs of candidate matches. We employed a simple heuristic that has been shown to work well on all the datasets: (i) for attributes with long string values (i.e., &gt; 20 characters), we computed token-based measures (namely: Jaccard similarity, Dice similarity, Overlap coefficient, and Cosine similarity); (ii) for attributes with short string values (i.e., &lt; 20 characters), we computed character-based measures (namely: Levenshtein distance and Needleman–Wunsch distance); (iii) for numeric attributes, we computed the ratio r = min(v_i, v_j)/max(v_i, v_j), for all non-null values different from 0.</p>
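        <p>Per-attribute feature extraction following this heuristic can be sketched as below. An illustration only: it computes one representative measure per case (Jaccard for long strings, a normalized Levenshtein similarity for short ones, the min/max ratio for numbers) rather than the full battery listed above.</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pair_features(rec1, rec2):
    """One similarity feature per shared attribute (records are dicts)."""
    feats = []
    for attr in sorted(set(rec1).intersection(rec2)):
        v1, v2 = rec1[attr], rec2[attr]
        if v1 is None or v2 is None:
            feats.append(0.0)
        elif isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
            # Numeric attributes: ratio of the two non-zero values.
            feats.append(min(v1, v2) / max(v1, v2) if v1 and v2 else 0.0)
        elif len(v1) > 20 or len(v2) > 20:
            # Long strings: token-based (Jaccard over whitespace tokens).
            t1, t2 = set(v1.lower().split()), set(v2.lower().split())
            feats.append(len(t1.intersection(t2)) / len(t1.union(t2)))
        else:
            # Short strings: character-based (normalized Levenshtein).
            d = levenshtein(v1.lower(), v2.lower())
            feats.append(1 - d / max(len(v1), len(v2), 1))
    return feats
```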
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>In Table 3 we report the characteristics of the employed datasets—commonly used as ER
benchmark datasets [6, 10, 13, 18].</p>
        <p>We divided them into two main groups:
(a) structured datasets, where the attributes are completely aligned and the attribute values are mostly short, carrying specific information about a certain domain (e.g., the Title of a journal, or the Venue of a conference);
(b) dirty datasets, where some information is scattered among the attributes (e.g., one may find the Brand of a product in the attribute Title) and which also contain long text describing in plain English some characteristics of the entity (e.g., the Description of a product).
As binary classifier, we employ Random Forest1, and uncertainty-based sampling as AL strategy2, for both blocking and matching, employing the features described in Section 3. We chose this Machine Learning (ML) algorithm for its interpretability and because it has been proven to work well for ER [9]. We consider two labeling budgets: 100 and 500. This means that for each experiment the AL for blocking and matching can ask to label 100 or 500 instances in total, respectively. These are representative of two typical scenarios: one where the practitioner has limited time to label instances and one where she has more time (or more collaborators).</p>
        <p>1 https://scikit-learn.org/stable/modules/ensemble.html#forest</p>
        <p>2 https://modal-python.readthedocs.io/en/latest/</p>
        <p>Table 3. Characteristics of the employed datasets: DblpAcm, ScholarDblp, AbtBuy, and AmazonGoogleProducts.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Results</title>
        <p>In our experiment, we allocate a different share of the AL budget to the blocking and the matching classifier, varying it 10% at a time. The results are summarized in Figure 2, grouped by type of dataset (i.e., structured vs. dirty).</p>
        <p>Figure 2: (a) structured datasets with budget=100; (b) structured datasets with budget=500; (c) dirty datasets with budget=100; (d) dirty datasets with budget=500.</p>
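        <p>The 10%-step allocation search can be sketched with a small helper. A toy illustration: the evaluate function, which would run the full blocking-plus-matching pipeline and return its F-score, is a caller-supplied stand-in.</p>

```python
def budget_splits(total, step=0.1):
    """Enumerate (blocking, matching) label-budget allocations, varying
    the blocking share from 10% to 90% in 10% steps."""
    shares = [round(step * k, 2) for k in range(1, round(1 / step))]
    return [(round(total * s), total - round(total * s)) for s in shares]

def best_split(total, evaluate):
    """Pick the allocation maximizing a quality score (e.g. the F-score
    of the resulting ER pipeline under that label split)."""
    return max(budget_splits(total), key=lambda bm: evaluate(*bm))
```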
        <p>The results show that, with as few as 100 labels, AL can learn an effective ER pipeline for all the datasets: in some cases, these results are almost as good as those that can be obtained with a complex Deep Learning (DL) model and thousands of labeled examples, and better than those that can be obtained with traditional ML methods3.</p>
        <sec id="sec-4-2-1">
          <title>4.3.1. Structured datasets.</title>
          <p>For the structured datasets, with just 100 labels as AL budget we can achieve nearly the highest results published so far. Both DeepMatcher [10] and Magellan [9] (the DL and traditional-ML state-of-the-art ER tools, respectively) reach a 0.984 F-score on the structured DblpAcm dataset, but they require a significant amount of labeled data (more than 9k labeled pairs), while we achieve a 0.981 F-score with 100 labeled examples (0.990 with 500). Similarly, for ScholarDblp, our AL reaches a 0.939 (0.972) F-score with a budget of 100 (500) labels, while DeepMatcher achieves a 0.947 F-score with more than 20k labeled pairs.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.3.2. Dirty datasets.</title>
          <p>For dirty datasets, our AL method manages to compete with state-of-the-art solutions relying on a budget of just 500 labeled examples: on AmazonGoogleProducts and AbtBuy, DeepMatcher yields an F-score of 0.693 and 0.628, respectively (trained with more than 6k labeled pairs), while AL achieves 0.463 and 0.521, i.e., 0.03 lower than Magellan for the former and 0.09 higher for the latter.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>Overall, the results in Figure 2 also suggest that the budget for blocking and matching should be split with a 40:60 ratio, which seems to work well for all the datasets.</p>
      <p>We investigated the problem of ER as a multi-task Active Learning problem. Surprisingly, we find that AL can be employed to easily achieve state-of-the-art performance with basically no feature engineering and with a very common ML algorithm. Further, our preliminary results suggest that employing only 40% of the budget for labeling pairs for AL-based blocking and the rest for AL-based matching is the best configuration for all the datasets. We are currently studying how to combine different AL strategies, ML methods, and blocking techniques (involving loose schema information [19]), to provide a more in-depth analysis of the problem.</p>
      <p>3 We refer to the results reported in [10].</p>
      <p>[2] J. Wang, T. Kraska, M. J. Franklin, J. Feng, CrowdER: Crowdsourcing Entity Resolution, PVLDB 5 (2012) 1483–1494.</p>
      <p>[3] S. Das, P. Suganthan G. C., A. Doan, J. F. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Y. Park, Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services, in: SIGMOD, ACM, 2017, pp. 1431–1446.</p>
      <p>[4] E. K. Rezig, L. Cao, G. Simonini, M. Schoemans, S. Madden, N. Tang, M. Ouzzani, M. Stonebraker, Dagger: A Data (not Code) Debugger, in: CIDR, 2020.</p>
      <p>[5] E. K. Rezig, L. Cao, M. Stonebraker, G. Simonini, W. Tao, S. Madden, M. Ouzzani, N. Tang, A. K. Elmagarmid, Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics, PVLDB 12 (2019) 1954–1957.</p>
      <p>[6] G. Papadakis, G. Papastefanatos, G. Koutrika, Supervised Meta-blocking, PVLDB 7 (2014) 1929–1940.</p>
      <p>[7] V. V. Meduri, L. Popa, P. Sen, M. Sarwat, A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching, in: SIGMOD, ACM, 2020, pp. 1133–1147.</p>
      <p>[8] B. Settles, Active Learning Literature Survey, University of Wisconsin-Madison Computer Sciences Technical Report 1648 (2009).</p>
      <p>[9] P. Konda, S. Das, P. Suganthan G. C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. F. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra, Magellan: Toward Building Entity Matching Management Systems, PVLDB 9 (2016) 1197–1208.</p>
      <p>[10] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep Learning for Entity Matching: A Design Space Exploration, in: SIGMOD, ACM, 2018, pp. 19–34.</p>
      <p>[11] Y. Li, J. Li, Y. Suhara, A. Doan, W. Tan, Deep Entity Matching with Pre-Trained Language Models, PVLDB 14 (2020) 50–60.</p>
      <p>[12] D. Firmani, B. Saha, D. Srivastava, Online Entity Resolution Using an Oracle, PVLDB 9 (2016) 384–395.</p>
      <p>[13] B. Mozafari, P. Sarkar, M. J. Franklin, M. I. Jordan, S. Madden, Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning, PVLDB 8 (2014) 125–136.</p>
      <p>[14] K. Qian, L. Popa, P. Sen, Active Learning for Large-Scale Entity Resolution, in: CIKM, ACM, 2017, pp. 1379–1388.</p>
      <p>[15] G. Dal Bianco, M. A. Gonçalves, D. Duarte, BLOSS: Effective meta-blocking with almost no effort, Inf. Syst. 75 (2018) 75–89.</p>
      <p>[16] Y. Zhang, Q. Yang, A Survey on Multi-Task Learning, CoRR abs/1707.08114 (2017).</p>
      <p>[17] G. Papadakis, G. M. Mandilaras, L. Gagliardelli, G. Simonini, E. Thanos, G. Giannakopoulos, S. Bergamaschi, T. Palpanas, M. Koubarakis, Three-dimensional Entity Resolution with JedAI, Inf. Syst. 93 (2020) 101565.</p>
      <p>[18] L. Gagliardelli, S. Zhu, G. Simonini, S. Bergamaschi, BigDedup: A Big Data Integration Toolkit for Duplicate Detection in Industrial Scenarios, in: 25th International Conference on Transdisciplinary Engineering (TE2018), volume 7, 2018, pp. 1015–1023.</p>
      <p>[19] D. Beneventano, S. Bergamaschi, L. Gagliardelli, G. Simonini, BLAST2: An Efficient Technique for Loose Schema Information Extraction from Heterogeneous Big Data Sources, ACM JIDQ 12 (2020) 18:1–18:22.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Konda</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. Suganthan G. C.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Govind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paulsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martinkus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Christie</surname>
          </string-name>
          ,
          <article-title>Magellan: toward building ecosystems of entity matching solutions</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>63</volume>
          (
          <year>2020</year>
          )
          <fpage>83</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>