<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>KG, August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Entity Resolution on Camera Records without Machine Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Zecchini</string-name>
          <email>zecchini.luca@libero.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Simonini</string-name>
          <email>simonini@unimore.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonia Bergamaschi</string-name>
          <email>sonia.bergamaschi@unimore.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Università degli Studi di Modena e Reggio Emilia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Università degli Studi di Modena e Reggio Emilia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>Università degli Studi di Modena e Reggio Emilia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>31</volume>
      <issue>2020</issue>
      <abstract>
        <p>This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities. First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in such a context, not reaching a 0.49 F-score. Then, we propose an efficient solution that exploits annotated lists and regular expressions generated by humans, and that reaches a 0.99 F-score. In our experience, in terms of human effort, our approach was not more expensive than the labelling of match/non-match pairs required by ML-based methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Matching</kwd>
        <kwd>Entity Resolution</kwd>
        <kwd>Data Integration</kwd>
        <kwd>Data Wrangling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Entity resolution.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
    </sec>
    <sec id="sec-3">
      <title>Competition Description</title>
      <p>This paper reports our solution to the ACM SIGMOD 2020 programming contest, which placed as runner-up. During the contest, we were asked to solve an Entity Resolution (ER) problem; in particular, to perform ER on a provided dirty dataset of specifications (i.e., records) referring to cameras, extracted from different sources, with the help of an available labelled dataset, if needed. In this context, the aim of ER was to identify all the specifications that referred to the same real-world camera model. The solutions were evaluated according to the F-measure (harmonic mean of precision and recall, rounded to two decimal places) reached on a secret evaluation dataset.</p>
      <p>Dataset Description. The provided dataset, called D, is composed of 29,787 specifications collected from 24 e-commerce websites. Each specification is a list of name-value pairs describing a camera that is being sold on the website. As illustrated in Table 1, the distribution of the specifications across the sources is not uniform, with a few sources (www.ebay.com and www.alibaba.com) contributing most of the specifications.</p>
      <p>Table 1 (number of specifications per source) lists the 24 sources: buy.net, cammarkt.com, www.alibaba.com, www.buzzillions.com, www.cambuy.com.au, www.camerafarm.com.au, www.canon-europe.com, www.ebay.com, www.eglobalcentral.co.uk, www.flipkart.com, www.garricks.com.au, www.gosale.com, www.henrys.com, www.ilgs.net, www.mypriceindia.com, www.pcconnection.com, www.pricedekho.com, www.price-hunt.com, www.priceme.co.nz, www.shopbot.com.au, www.shopmania.in, www.ukdigitalcameras.co.uk, www.walmart.com, and www.wexphotographic.com.</p>
      <p>As for the attributes, Page Title is the only one present in every specification, while none of the other 4,660 attributes appears in all the sources or, in most cases, in all the specifications from the same source. Furthermore, the attributes suffer from problems of homonymy (same name but different meaning) and synonymy (same meaning but different names). Finally, many details about additional accessories (e.g., lens, bag, tripod, etc.) are provided even if they do not contribute to the identification of the entities—the same holds for many other attributes, such as the color.</p>
      <p>Labelled Data. In addition to D (the complete dataset), a labelled dataset is provided to train a model in case of a machine learning (ML) solution. Differently from D, this dataset contains a list of specification ID pairs and the related binary label: 1, which denotes a match (i.e., both specifications refer to the same real-world camera model), and 0, which denotes a non-match.</p>
      <p>The labelled dataset is provided in two versions: Y, generated from 306 specifications and available since the start of the competition, and W, which is larger (generated from 908 specifications), includes all the couples already present in Y, and was made available only later. Table 2 reports the size of the labelled datasets and their internal distribution of matching and non-matching couples.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Size of the labelled datasets and their internal distribution of matching and non-matching couples.</p></caption>
        <table>
          <thead>
            <tr><th>DATASET</th><th>COUPLES</th><th>MATCHES</th><th>NON-MATCHES</th></tr>
          </thead>
          <tbody>
            <tr><td>Dataset Y</td><td>46,665</td><td>3,582</td><td>43,083</td></tr>
            <tr><td>Dataset W</td><td>297,651</td><td>44,039</td><td>253,612</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The data sources are dirty, meaning that it is possible to find matches even among specifications coming from the same source. Moreover, the transitive closure is applied to the matches: if specification 1 matches with specifications 2 and 3, then 2 also matches with 3.</p>
      <p>Evaluation Process. The competition’s participants were asked to identify all the matching pairs of D. A portion of the ground truth of D was known only to the organizer, who returned the F-measure on that portion as feedback for each submitted solution. The last version submitted before the contest deadline was used to determine the final ranking. After the deadline, the top teams had to provide their code and a guide to execute it, in order to validate and certify the score (through manual inspection and a reproducibility test of the solution) and to measure its execution time; this happened in a network-isolated container with defined specifications (4x30 GHz processor, 16 GB main memory, 128 GB storage, Linux OS). Both training (not included in the final execution time) and execution had to respect a maximum time limit of 12 hours each.</p>
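      <p>The transitive-closure property applied to the matches can be sketched with a small union-find routine. This is our own illustration, not the contest's code; the function name and the numeric identifiers are ours:</p>

```python
# Minimal union-find to apply the transitive closure over matching pairs:
# if 1 matches 2 and 1 matches 3, then 2 and 3 end up in the same cluster.
def close_matches(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    # Group specifications by their root representative.
    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

      <p>For instance, closing the pairs (1, 2) and (1, 3) yields a single cluster containing 1, 2, and 3.</p>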
    </sec>
    <sec id="sec-4">
      <title>Our Solution and Paper Organization</title>
      <p>Our solution is based on the definition of human-crafted rules
and lists to detect the information about brands and models in
the specification titles—all titles contain this information—and to
normalize them in order to perform an equality join completing
the ER process.</p>
      <p>In our experience, the definition of brand-based rules obtained
through the human study of data was not more expensive than
labelling a general training dataset for a state-of-the-art ML-based
solution, which cannot capture with the same accuracy all the
brand-dependent patterns needed for detecting matches—i.e., the
ML-based solution achieves lower F-score.</p>
      <p>The remainder of the paper is organized as follows: Section 2 reviews the related work; Section 3 shows how the ML approach performs on our problem; Section 4 presents our solution and the results in the challenge; finally, the lessons learned are reported in Section 5.</p>
    </sec>
    <sec id="sec-5">
      <title>RELATED WORK</title>
      <p>
        ER (Entity Resolution, a.k.a.: Entity Matching, Duplicate
Detection, Record Linkage) has been studied for decades [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but is still a relevant research problem, since it significantly relies on human effort for labelling training data, for generating rules for blocking, and many other related tasks—see [
        <xref ref-type="bibr" rid="ref1 ref12 ref15 ref3 ref4 ref9">1, 3, 4, 9, 12, 15</xref>
        ] for a survey.
      </p>
      <p>
        Frameworks have been proposed to support practitioners in solving this task within data preparation pipelines [
        <xref ref-type="bibr" rid="ref11 ref13 ref5 ref7">5, 7, 11, 13</xref>
        ]. Among all the solutions proposed for structured data in the literature, the latest research outcomes indicate Machine Learning (ML) approaches as the most promising, achieving the best performance on publicly available benchmarks when training data is available. In particular, the state-of-the-art ER algorithms based on ML (and deep learning) are Magellan [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and DeepMatcher [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], respectively.
      </p>
      <p>Magellan is a tool that allows performing ER by using supervised learning techniques. In practice, in the training phase it takes as input a set of labelled examples of matching and non-matching pairs of specifications and trains a binary classifier. Then, in the inference phase, it predicts the labels of unseen pairs of specifications, i.e., it performs ER on the unlabelled part of the dataset. Of course, an ML classifier needs feature engineering for processing pairs of specifications; thus, Magellan employs similarity functions computed on pairs of corresponding attribute values (e.g., Jaccard and cosine similarity, edit distance, etc.) as features of each specification pair. As classification algorithms, it implements the most common and best-performing ones, such as decision tree, random forest, SVM, and naïve Bayes.</p>
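      <p>To make the feature-engineering idea concrete, a token-based Jaccard similarity between two attribute values can be sketched as follows. This is a minimal illustration of one such feature, not Magellan's actual API:</p>

```python
# Token-based Jaccard similarity between two attribute values, one of the
# kinds of similarity features a Magellan-style classifier is trained on.
# A minimal sketch, not Magellan's actual API.
def jaccard_tokens(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa.union(sb)
    if not union:
        return 0.0
    return len(sa.intersection(sb)) / len(union)
```

      <p>For example, "Canon EOS 450D" and "canon eos rebel xsi" share two of five distinct tokens, giving a similarity of 0.4.</p>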
      <p>Magellan may be sensitive to the definition of the features (e.g., to selecting some similarity functions rather than others or to using different sizes for the tokenization of the attributes); hence, it might sometimes be hard to tune. On the other hand, by employing deep learning, this step is basically avoided: features are automatically extracted by the neural net. Moreover, with the right neural net architecture, the final results are often better. This is essentially the idea behind DeepMatcher. In particular, it provides four different kinds of architecture for EM tasks: (i) SIF, which determines a match or non-match by considering the words present in each attribute value pair, without caring about their order; (ii) RNN, which considers the sequences of words; (iii) Attention, which considers the alignment of words, without caring about word order; (iv) Hybrid, which cares about the alignment of sequences of words and is selected as the default model.</p>
      <p>
        Regular expressions have been used in the ER context for discovering transformations for the entity reconciliation process (e.g., to make the model representations uniform among different specifications of the same camera) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This is known as the golden record approach and exploits labelled data to synthesize regular expressions that can be used to transform strings in a canonical way, for all the attributes of a record—which becomes the golden record at the end of the transformations. Our solution exploits regular expressions to isolate and normalize substrings that refer to a camera model, but allows brand-dependent transformations, which do not generalize to the whole dataset as the golden-record approach assumes.
      </p>
    </sec>
    <sec id="sec-6">
      <title>HOW DID THE ML APPROACH PERFORM IN THIS SETTING?</title>
      <p>As mentioned in Section 1.1, the data sources from which the specifications were extracted have highly heterogeneous schemas, the attributes have ambiguous names, and most of them are used only in a handful of specifications. This makes the schema alignment a hard task in itself. Furthermore, in our experiments we were not able to find any attributes that would appreciably help in the resolution of the ER problem. Thus, we decided not to explore this route further, i.e., we do not perform a schema alignment. As a consequence, EM with Magellan and DeepMatcher is performed on the attribute Page Title, which is always present and in most cases contains the two elements sufficient to uniquely identify a camera: the brand and the model. Notice that both Magellan and DeepMatcher can operate only with an aligned schema, thus a different approach would not be possible.</p>
      <p>We preprocess the labelled dataset to the format required by the libraries and set the attribute Page Title to lowercase; then we generate train, validation, and test sets respecting a 3:1:1 ratio. All of these sets contain matches and non-matches following the same distribution, which reflects the one of the complete labelled dataset.</p>
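      <p>A 3:1:1 split that preserves the match/non-match distribution in every partition can be sketched as follows. This is our own minimal sketch with the standard library; the field names and the seed are our assumptions, not the contest code:</p>

```python
import random

# Stratified 3:1:1 train/validation/test split: matches and non-matches are
# partitioned separately so every set keeps the same label distribution.
# Field names ("label") and the seed are illustrative assumptions.
def stratified_split(pairs, seed=42):
    rng = random.Random(seed)
    matches = [p for p in pairs if p["label"] == 1]
    non_matches = [p for p in pairs if p["label"] == 0]
    train, val, test = [], [], []
    for group in (matches, non_matches):
        rng.shuffle(group)
        n = len(group)
        a, b = 3 * n // 5, 4 * n // 5  # 3:1:1 boundaries
        train += group[:a]
        val += group[a:b]
        test += group[b:]
    return train, val, test
```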
      <p>Because of the extremely variable length of the attribute values, the results obtained by Magellan on Y are extremely poor (best F-measure of 0.40 with naïve Bayes). This is because only a portion of the titles is relevant to determine a match. Since in most cases the useful information (brand and model) is located among the first words of Page Title, the normalization is realized by truncating the string, considering only the first n words, with the best value of n determined empirically as 4. Table 3 and Table 4 show the results obtained on the datasets Y and W, with the best performing model (RNN by DeepMatcher) chosen to face the challenge.</p>
      <p>Moving to the complete dataset requires considering all the pairs of tuples generated by the Cartesian product (887,265,369 pairs of tuples), an amount that, even considering the reduction due to the deletion of reflexive and symmetric pairs, is not computationally affordable; thus, blocking is needed.</p>
      <p>
        We employ blocking to reduce the number of candidates, in
particular we find that token blocking [
        <xref ref-type="bibr" rid="ref10 ref14 ref16">10, 14, 16</xref>
        ] on the truncated
version of the attribute Page Title is a good choice. In practice, token
blocking indexes two specifications in the same block if they share
one of the first four words in their Page Title. Then, all possible pairs
of specifications within each block are considered as candidates.
This blocking strategy produces 3,914 blocks, yielding 54,000,932
candidate pairs, and on the labelled ones (i.e., dataset W ), it achieves
a precision of 0.28 and a recall of 0.99, making it a suitable candidate
set for being processed in an acceptable amount of time.
      </p>
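      <p>The token-blocking step on the truncated titles can be sketched as follows; a simplified illustration, where the function and variable names are ours:</p>

```python
from itertools import combinations

# Token blocking: two specifications land in the same block whenever they
# share one of the first four words of their Page Title; all pairs inside
# a block become candidate pairs. A simplified sketch, names are ours.
def token_blocking(specs, n_words=4):
    blocks = {}
    for spec_id, title in specs.items():
        for token in title.lower().split()[:n_words]:
            blocks.setdefault(token, set()).add(spec_id)
    candidates = set()
    for block in blocks.values():
        for a, b in combinations(sorted(block), 2):
            candidates.add((a, b))
    return blocks, candidates
```

      <p>With titles "canon eos 450d digital slr camera" and "canon eos rebel xsi", the shared tokens "canon" and "eos" place the two specifications in common blocks, so they become a candidate pair.</p>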
      <p>In order to improve precision and recall, we also employ a list of frequent useless words to be deleted from Page Title, increasing the probability of finding relevant information among the 4 maintained words. This list is generated by looking for the most shared words and detecting the ones that do not express meaningful information, before focusing on more specific cases. Further improvements can be given by the use of aliases, to solve the problem of popular synonyms (e.g., "fuji" and "fujifilm").</p>
    </sec>
    <sec id="sec-8">
      <title>Results with the ML approach</title>
      <p>The best performing configuration we were able to find could not go beyond an F-measure of 0.47; this best result has been achieved by employing DeepMatcher with an RNN architecture. In detail, it achieves a recall of 0.85 but a very poor precision of 0.32, due to a serious problem of false positives, not solvable through a simple removal of useless words.</p>
    </sec>
    <sec id="sec-9">
      <title>Why does the ML approach fall short?</title>
      <p>The problem of false positives is linked to the nature of the dataset: the matching can be determined only on small brand-dependent details (e.g., the variation of a single letter or digit). Because of its size, the labelled dataset can cover only a small portion of all the possible relevant cases that can be met considering all the specifications (furthermore, a lot of brands and models of the complete dataset do not even appear in the labelled data). So, the similarity patterns learned by Magellan and DeepMatcher on a few brands and models cannot be effective on the whole dataset, since it is impossible to generalize them.</p>
    </sec>
    <sec id="sec-10">
      <title>A REGEX APPROACH</title>
      <p>Our solution is based on regular expressions (regexes), developed as a variation of the blocking method used in the previous approach. In particular, exploring the data it is possible to notice that in most cases the model is expressed as a string composed of both letters and digits, so it can be retrieved using a regular expression, while the number of widely represented brands is quite limited and manageable through a (human-generated) list.</p>
      <p>In most cases, by performing simple data cleansing operations (e.g., by removing special characters and white spaces), the extraction of the first alphanumeric string allows detecting the right model. Using this in combination with a list of the most popular brands, both the brand and the model are detected in 19,513 out of the 29,787 specifications. In the following we describe the complete procedure performed to get the final, complete result.</p>
    </sec>
    <sec id="sec-11">
      <title>The Procedure</title>
      <p>The camera specifications are read into a dictionary with an identifier ID and the attribute Page Title, considered in its entirety, in lowercase, and with all the punctuation characters replaced by a blank character (except for the character “-”, substituted by the empty string because it is often used inside model names). Then, Page Title is transformed into a list by tokenization, and the elements that have aliases are replaced with their basic form by employing a dedicated human-generated dictionary—e.g., different or misspelled versions used to indicate the same brand, like "fuji" for "fujifilm" or "cannon" for "canon". Afterwards, the list is used for the brand retrieval phase, to identify the brand of each camera—also for this phase a human-generated list, containing popular brand names, is employed.</p>
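      <p>The cleaning and alias-replacement step can be sketched as follows. The alias entries shown are the examples mentioned in the text; the function name is ours:</p>

```python
import string

# Lowercase the title, drop "-" (it often appears inside model names),
# replace the remaining punctuation with blanks, tokenize, and map known
# aliases to a canonical form. Alias entries are the examples from the text.
ALIASES = {"fuji": "fujifilm", "cannon": "canon"}

def normalize_title(title):
    title = title.lower().replace("-", "")
    for ch in string.punctuation:
        title = title.replace(ch, " ")
    return [ALIASES.get(tok, tok) for tok in title.split()]
```

      <p>For example, "Cannon EOS-450D" becomes the token list ["canon", "eos450d"].</p>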
      <p>For extracting the model (needed for the actual ER), four lists are defined for each brand: the prefix list and the suffix list, which contain the recurrent elements that can appear separately from the model strings and that must be considered part of the model name; the model list, which is composed of those only-alphabetical or only-numerical words that must be considered as models (without this list, they would not be detected by the defined regex system); and the exception list, which contains words that are both alphabetical and numerical but must be skipped because they do not represent a model, using also an additional list of suffixes that denote measures (e.g., “mm”, “gb”, etc.). Finally, an equivalence dictionary is employed to conform the different versions of a model name to a chosen standard one.</p>
      <p>If a prefix (suffix), identified through the prefix (suffix) list, appears in the list of tokens representing Page Title, it is concatenated to its next (previous) token. Scanning the resulting list, a model can be detected and assigned to the current specification if a token is both alphabetical and numerical (detected through the defined regular expression) and does not appear in the exception list, or if it is only alphabetical or only numerical and appears in the model list—the first detected model is assigned to the specification.</p>
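      <p>The detection rule can be sketched with a regular expression matching tokens that mix letters and digits. The list contents and patterns below are illustrative placeholders, not the actual human-crafted lists of our solution:</p>

```python
import re

# A token is a candidate model if it mixes letters and digits, unless it is
# in the exception list or looks like a measure (e.g., "55mm"); purely
# alphabetical or numerical tokens count only if whitelisted in the model
# list. The regexes and example lists are illustrative placeholders.
ALNUM_MODEL = re.compile(r"^(?=.*[a-z])(?=.*[0-9])[a-z0-9]+$")
MEASURE = re.compile(r"^[0-9]+(mm|gb|mp|x)$")

def extract_model(tokens, model_list=(), exceptions=()):
    for tok in tokens:
        if ALNUM_MODEL.match(tok) and tok not in exceptions and not MEASURE.match(tok):
            return tok  # the first detected model wins
        if tok in model_list:
            return tok
    return None
```

      <p>On the token list ["canon", "eos", "450d", "18mp"], the first mixed token "450d" is returned, while a measure such as "55mm" is skipped.</p>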
      <p>Once the model has been extracted, further brand-based operations can be performed to normalize it: some prefixes and suffixes are modified or removed (e.g., the suffixes that indicate the colors and that would cause false negatives); some models may have synonyms depending on the geographic area (e.g., Canon EOS 450D is sold in North America as EOS Rebel XSi and in Japan as EOS Kiss X2), so they must be normalized to a single standard in order to avoid false negatives; some models need the retrieval of additional information for the disambiguation (e.g., EOS 5D Mark II, EOS 5D Mark III, and EOS 5D Mark IV must be considered as different models): the models which are sensitive to this problem (e.g., 5D) are stored in an additional brand-related list. If a model appears in this list, the information related to its edition (e.g., mark for Canon) is retrieved from the attribute Page Title and attached to the model name, in order to avoid false positives.</p>
      <p>Thus, the result of the process is that a new attribute, brand_n_model, is added to each specification. This attribute contains the concatenation of the extracted brand and model—provided, of course, that both have been detected.</p>
      <p>If the detection was successful, the specification is appended to the solved_specs list, otherwise to the unsolved_specs list.</p>
      <p>Once this first step of extracting brand and model from the specifications is concluded, the following step is to determine the matching pairs of specifications. An inverted index built from the solved_specs list generates clusters representing entities, by grouping elements according to the perfect match of the attribute brand_n_model. Then, the specifications in the unsolved_specs list are considered as matches only if the content of the attribute Page Title is exactly the same: once again, an inverted index is built to generate clusters.</p>
      <p>We finally generate the list of matching pairs by emitting all
possible pairs of specifications from each identified cluster. We
need this step to submit the final solution and compute its F-score.</p>
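      <p>The final clustering and pair-emission steps can be sketched as follows; a simplified illustration under assumed data shapes (pairs of specification ID and brand_n_model value):</p>

```python
from itertools import combinations

# Group specifications by exact equality of the extracted brand_n_model
# (an equality join via an inverted index), then emit all possible pairs
# from each cluster. Input shape is an assumption: (spec_id, brand_n_model).
def emit_matching_pairs(solved_specs):
    index = {}
    for spec_id, brand_n_model in solved_specs:
        index.setdefault(brand_n_model, []).append(spec_id)
    pairs = []
    for cluster in index.values():
        pairs.extend(combinations(sorted(cluster), 2))
    return pairs
```

      <p>A cluster of three specifications sharing "canon 450d" thus yields the three pairs (1, 2), (1, 3), and (2, 3).</p>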
      <p>Notice that all the lists and rules are human-crafted; yet, in our
experience, this required human labor is no more expensive than
the amount of work required for labelling a training dataset of
match/non-match pairs for Magellan or DeepMatcher.</p>
    </sec>
    <sec id="sec-12">
      <title>4.2 Results</title>
      <p>After all the refinements applied to the method, the final submission reached the top F-measure of 0.99, with a precision of 0.99 and a recall of 0.98. Four other teams were able to reach the top result, so the tie was broken according to the execution time of the solutions, and we concluded the contest as runner-up.</p>
    </sec>
    <sec id="sec-13">
      <title>5 CONCLUSION</title>
      <p>While looking for a good solution to the contest’s ER problem, we investigated the limits of state-of-the-art machine learning (Magellan) and deep learning (DeepMatcher) methods for ER. These methods are able to achieve good results when matches can be identified by means of similarity-based features, but in a lot of real-world scenarios it may happen that matching is based on small variations that make the generalization of the learned patterns to the entire dataset impossible—this is the case of camera models, which have tiny brand-dependent variations to distinguish one camera (an entity) from another. In fact, we found that ER on camera specifications requires human-designed rules and lists (prefix and suffix management, exceptions, etc.), which existing ML methods are not able to synthesize. Yet, in our experience, these rules and lists are no more expensive to build than a labelled dataset of match/non-match pairs required by ML-based methods. This suggests that, when approaching an ER problem, starting to collect match/non-match labels might not be the first thing to do, and might not be necessary at all.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Data Matching - Concepts and Techniques for Record Linkage</article-title>
          , Entity Resolution, and Duplicate Detection. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dong</given-names>
            <surname>Deng</surname>
          </string-name>
          , Wenbo Tao, Ziawasch Abedjan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and
          <string-name>
            <given-names>Nan</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Unsupervised String Transformation Learning for Entity Consolidation</article-title>
          .
          <source>In ICDE</source>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>AnHai</given-names>
            <surname>Doan</surname>
          </string-name>
          , Alon Y. Halevy, and
          <string-name>
            <given-names>Zachary G.</given-names>
            <surname>Ives</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Principles of Data Integration</article-title>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ahmed K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          , Panagiotis G. Ipeirotis, and
          <string-name>
            <given-names>Vassilios S.</given-names>
            <surname>Verykios</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Duplicate Record Detection: A Survey</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>19</volume>
          ,
          <issue>1</issue>
          (
          <year>2007</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yash</given-names>
            <surname>Govind</surname>
          </string-name>
          et al.
          <year>2019</year>
          .
          <article-title>Entity Matching Meets Data Science: A Progress Report from the Magellan Project</article-title>
          .
          <source>In SIGMOD</source>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ivan P.</given-names>
            <surname>Fellegi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alan B.</given-names>
            <surname>Sunter</surname>
          </string-name>
          .
          <year>1969</year>
          .
          <article-title>A theory for record linkage</article-title>
          .
          <source>J. Amer. Statist. Assoc</source>
          .
          <volume>64</volume>
          ,
          <issue>328</issue>
          (
          <year>1969</year>
          ),
          <fpage>1183</fpage>
          -
          <lpage>1210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Luca</given-names>
            <surname>Gagliardelli</surname>
          </string-name>
          , Giovanni Simonini, Domenico Beneventano, and
          <string-name>
            <given-names>Sonia</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SparkER: Scaling Entity Resolution in Spark</article-title>
          .
          <source>In EDBT</source>
          <year>2019</year>
          , Lisbon, Portugal, March 26-29, 2019.
          <fpage>602</fpage>
          -
          <lpage>605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Sidharth</given-names>
            <surname>Mudgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Han</given-names>
            <surname>Li</surname>
          </string-name>
          , Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep Learning for Entity Matching: A Design Space Exploration</article-title>
          .
          <source>In SIGMOD</source>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          and
          <string-name>
            <given-names>Melanie</given-names>
            <surname>Herschel</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>An Introduction to Duplicate Detection</article-title>
          . Morgan &amp; Claypool Publishers.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>George</given-names>
            <surname>Papadakis</surname>
          </string-name>
          , Ekaterini Ioannou, Themis Palpanas, Claudia Niederée, and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Nejdl</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>25</volume>
          ,
          <issue>12</issue>
          (
          <year>2013</year>
          ),
          <fpage>2665</fpage>
          -
          <lpage>2682</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>George</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Georgios M.</given-names>
            <surname>Mandilaras</surname>
          </string-name>
          , Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, and
          <string-name>
            <given-names>Manolis</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Three-dimensional Entity Resolution with JedAI</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>93</volume>
          (
          <year>2020</year>
          ),
          <fpage>101565</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>George</given-names>
            <surname>Papadakis</surname>
          </string-name>
          , Dimitrios Skoutas, Emmanouil Thanos, and
          <string-name>
            <given-names>Themis</given-names>
            <surname>Palpanas</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Blocking and Filtering Techniques for Entity Resolution: A Survey</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>53</volume>
          ,
          <issue>2</issue>
          (
          <year>2020</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>El Kindi</given-names>
            <surname>Rezig</surname>
          </string-name>
          , Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani,
          <string-name>
            <given-names>Nan</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Ahmed</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Elmagarmid</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>12</volume>
          ,
          <issue>12</issue>
          (
          <year>2019</year>
          ),
          <fpage>1954</fpage>
          -
          <lpage>1957</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Simonini</surname>
          </string-name>
          , Sonia Bergamaschi, and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>9</volume>
          ,
          <issue>12</issue>
          (
          <year>2016</year>
          ),
          <fpage>1173</fpage>
          -
          <lpage>1184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Simonini</surname>
          </string-name>
          , Luca Gagliardelli, Sonia Bergamaschi, and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Scaling Entity Resolution: A Loosely Schema-aware Approach</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>83</volume>
          (
          <year>2019</year>
          ),
          <fpage>145</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Simonini</surname>
          </string-name>
          , George Papadakis, Themis Palpanas, and
          <string-name>
            <given-names>Sonia</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Schema-Agnostic Progressive Entity Resolution</article-title>
          .
          <source>In ICDE</source>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>