Entity Resolution on Camera Records without Machine Learning
SIGMOD 2020 Contest Finalist Paper

Luca Zecchini, Giovanni Simonini, Sonia Bergamaschi
Università degli Studi di Modena e Reggio Emilia, Italy
zecchini.luca@libero.it, simonini@unimore.it, sonia.bergamaschi@unimore.it

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities.

First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in this context, not even reaching a 0.49 F-score. We then propose an efficient solution that exploits human-generated annotated lists and regular expressions and reaches a 0.99 F-score. In our experience, in terms of human effort our approach was not more expensive than the labeling of match/non-match pairs required by ML-based methods.

CCS CONCEPTS

• Information systems → Entity resolution.

KEYWORDS

Entity Matching, Entity Resolution, Data Integration, Data Wrangling

1 INTRODUCTION

1.1 Competition Description

This paper reports our solution to the ACM SIGMOD 2020 programming contest, which finished as runner-up. During the contest, we were asked to solve an Entity Resolution (ER) problem; in particular, to perform ER on a provided dirty dataset of specifications (i.e., records) referring to cameras, extracted from different sources, with the help of an available labelled dataset, if needed. In this context, the aim of ER was to identify all the specifications that refer to the same real-world camera model. The solutions were evaluated according to the F-measure (the harmonic mean of precision and recall, rounded to two decimal places) reached on a secret evaluation dataset.

Dataset Description. The provided dataset, called D, is composed of 29,787 specifications collected from 24 e-commerce websites. Each specification is a list of name-value pairs describing a camera being sold on the website. As illustrated in Table 1, the distribution of the specifications across the sources is not uniform: a few sources (www.ebay.com and www.alibaba.com) contribute most of the specifications. The data sources are dirty, meaning that it is possible to find matches even among specifications coming from the same source. Moreover, the transitive closure is applied on the matches: if e1 matches with e2 and e3, then e2 also matches with e3.

SOURCE                     SPECS   SOURCE                       SPECS
buy.net                      358   www.henrys.com                 181
cammarkt.com                 198   www.ilgs.net                   102
www.alibaba.com            7,972   www.mypriceindia.com           347
www.buzzillions.com          832   www.pcconnection.com           211
www.cambuy.com.au            118   www.pricedekho.com             366
www.camerafarm.com.au        120   www.price-hunt.com             327
www.canon-europe.com         164   www.priceme.co.nz              740
www.ebay.com              14,274   www.shopbot.com.au             516
www.eglobalcentral.co.uk     571   www.shopmania.in               630
www.flipkart.com             157   www.ukdigitalcameras.co.uk     129
www.garricks.com.au          130   www.walmart.com                195
www.gosale.com             1,002   www.wexphotographic.com        147

Table 1: Specifications extracted from each source
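Each specification is essentially a small semi-structured record. As a minimal illustration of this format (the identifier convention, attribute names, and values below are hypothetical, not taken from the actual dataset):

```python
# Illustrative example of a specification as a list of name-value pairs.
# The identifier format, attribute names, and values are hypothetical.
spec = {
    "id": "www.ebay.com//1234",  # source-qualified identifier (assumed format)
    "page title": "Canon EOS 5D Mark III 22.3 MP Digital SLR Camera",
    "brand": "Canon",            # present only in some sources
    "sensor type": "CMOS",       # attribute names vary across sources
}
```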
As for the attributes, Page Title is the only one present in every specification, while none of the other 4,660 attributes appears in all the sources or, in most cases, in all the specifications from the same source. Furthermore, the attributes suffer from problems of homonymy (same name but different meaning) and synonymy (same meaning but different names). Finally, many details about additional accessories (e.g., lens, bag, tripod, etc.) are provided even though they do not contribute to the identification of the entities; the same holds for many other attributes, such as the color.

Labelled Data. In addition to D (the complete dataset), a labelled dataset is provided to train a model in case of a machine learning (ML) solution. Unlike D, this dataset contains a list of pairs of specification ids with the related binary label: 1 denotes a match (i.e., both specifications refer to the same real-world camera model), while 0 denotes a non-match.

The labelled dataset is provided in two versions: Y, generated from 306 specifications and available since the start of the competition, and W, which is larger (generated from 908 specifications), includes all the pairs already present in Y, and was made available only later. Table 2 reports the size of the labelled datasets and their internal distribution of matching and non-matching pairs.

DATASET     COUPLES   MATCHES   NON-MATCHES
Dataset Y    46,665     3,582        43,083
Dataset W   297,651    44,039       253,612

Table 2: Characteristics of the labelled datasets
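Since the ground truth is closed under transitivity (Section 1.1), the labelled match pairs implicitly define clusters of specifications. A minimal union-find sketch (our own illustration, not part of the contest code) shows how the closure can be materialized:

```python
from collections import defaultdict

# Toy union-find over specification ids, used to apply the transitive
# closure on labelled matching pairs (illustrative sketch).
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Hypothetical pairs carrying label 1 (match)
match_pairs = [("s1", "s2"), ("s1", "s3")]
for a, b in match_pairs:
    union(a, b)

clusters = defaultdict(set)
for spec_id in list(parent):
    clusters[find(spec_id)].add(spec_id)
# s1, s2, s3 end up in the same cluster, so (s2, s3) is also a match
```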
Evaluation Process. The competition's participants were asked to identify all the matching pairs of D. A portion of the ground truth of D was known only to the organizer, who returned the F-measure on that portion as feedback for each submitted solution. The last version submitted before the contest deadline was used to determine the final ranking. After the deadline, the top teams had to provide their code and a guide to execute it, in order to validate and certify the score (through manual inspection and a reproducibility test of the solution) and to measure its execution time; this happened in a network-isolated container with defined specifications (4x3.0 GHz processor, 16 GB main memory, 128 GB storage, Linux OS). Both training (not included in the final execution time) and execution had to respect a time limit of 12 hours each.

1.2 Our Solution and Paper Organization

Our solution is based on the definition of human-crafted rules and lists to detect the information about brands and models in the specification titles (all titles contain this information) and to normalize it in order to perform an equality join that completes the ER process.

In our experience, the definition of brand-based rules obtained through the human study of the data was not more expensive than labelling a general training dataset for a state-of-the-art ML-based solution, which cannot capture with the same accuracy all the brand-dependent patterns needed for detecting matches; i.e., the ML-based solution achieves a lower F-score.

The remainder of the paper is organized as follows: Section 2 reviews the related work; Section 3 shows how the ML approach performs on our problem; Section 4 presents our solution and the results achieved in the challenge; finally, the lessons learned are reported in Section 5.

2 RELATED WORK

ER (Entity Resolution, a.k.a. Entity Matching, Duplicate Detection, Record Linkage) has been studied for decades [6], but is still a relevant research problem, since it significantly relies on human effort for labelling training data, generating rules, blocking, and many other related tasks; see [1, 3, 4, 9, 12, 15] for surveys. Frameworks have been proposed to support practitioners in solving this task within data preparation pipelines [5, 7, 11, 13]. Among all the solutions proposed in the literature for structured data, the latest research outcomes indicate the Machine Learning (ML) approaches as the most promising, achieving the best performance on publicly available benchmarks when training data is available. In particular, the state-of-the-art ER algorithms based on ML and deep learning are Magellan [5] and DeepMatcher [8], respectively.

Magellan is an ER tool that supports ER through supervised learning techniques. In practice, in the training phase it takes as input a set of labelled examples of matching and non-matching pairs of specifications and trains a binary classifier. Then, in the inference phase, it predicts the labels of unseen pairs of specifications, i.e., it performs ER on the unlabelled part of the dataset. Of course, an ML classifier needs feature engineering for processing pairs of specifications; thus, Magellan employs similarity functions computed on pairs of corresponding attribute values (e.g., Jaccard and cosine similarity, edit distance, etc.) as features of each specification pair. As classification algorithms, it implements the most common and best-performing ones, such as decision tree, random forest, SVM, and naïve Bayes.
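To give a flavor of such features, here is a schematic sketch of a token-based similarity computed on a pair of attribute values (our own illustration, not Magellan's actual API):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# A (tiny) feature vector for one candidate pair; a tool like Magellan
# generates many such similarity features over aligned attributes.
t1 = "Canon EOS 5D Mark III Digital SLR"
t2 = "Canon EOS 5D Mark IV Body Only"
features = [jaccard(t1, t2)]  # extend with cosine similarity, edit distance, etc.
```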
Magellan may be sensitive to the definition of the features (e.g., selecting some similarity functions rather than others, or using different sizes for the tokenization of the attributes); hence, it can sometimes be hard to tune. On the other hand, by employing deep learning this step is basically avoided: features are automatically extracted by the neural network. Moreover, with the right neural network architecture, the final results are often better. This is essentially the idea behind DeepMatcher. In particular, it provides four different kinds of architecture for EM tasks: (i) SIF, which determines a match or non-match by considering the words present in each attribute-value pair, without caring about their order; (ii) RNN, which considers the sequences of words; (iii) Attention, which considers the alignment of words, without caring about word order; (iv) Hybrid, which cares about the alignment of sequences of words and is selected as the default model.

Regular expressions have been used in the ER context for discovering transformations for the entity reconciliation process (e.g., to make the model representations uniform among different specifications of the same camera) [2]. This is known as the golden record approach and exploits labelled data to synthesize regular expressions that can be used to transform strings in a canonical way, for all the attributes of a record, which becomes the golden record at the end of the transformations. Our solution exploits regular expressions to isolate and normalize substrings that refer to a camera model, but allows brand-dependent transformations, which do not generalize to the whole dataset as the golden-record approach assumes.

3 HOW DID THE ML APPROACH PERFORM IN THIS SETTING?

As mentioned in Section 1.1, the data sources from which the specifications were extracted have highly heterogeneous schemas, the attributes have ambiguous names, and most of them are used only in a handful of specifications. This makes schema alignment a hard task in itself. Furthermore, in our experiments we were not able to find any attributes that would appreciably help in the resolution of the ER problem. Thus, we decided not to explore this route further; i.e., we do not perform a schema alignment. As a consequence, EM with Magellan and DeepMatcher is performed on the attribute Page Title, which is always present and in most cases contains the two elements sufficient to uniquely identify a camera: the brand and the model. Notice that both Magellan and DeepMatcher can operate only with an aligned schema, so a different approach would not have been possible.

We preprocess the labelled dataset to the format required by the libraries and set the attribute Page Title to lowercase, then we generate train, validation, and test sets respecting a 3:1:1 ratio. All of these sets contain matches and non-matches following the same distribution, which reflects that of the complete dataset.

Because of the extremely variable length of the attribute values, the results obtained by Magellan on Y are extremely poor (best F-measure of 0.40, with naïve Bayes). This is because only a portion of the titles is relevant to determine a match. Since in most cases the useful information (brand and model) is located among the first words of Page Title, we normalize the titles by truncating the string and keeping only the first n words, with the best value of n determined empirically as 4.
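A minimal sketch of this truncation step (our own illustration):

```python
def truncate_title(title: str, n: int = 4) -> str:
    """Lowercase a Page Title and keep only its first n words."""
    return " ".join(title.lower().split()[:n])

truncate_title("Canon EOS 5D Mark III 22.3 MP Digital SLR Camera")
# -> 'canon eos 5d mark'
```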
Table 3 and Table 4 show the results obtained on the datasets Y and W, with the best-performing model (the RNN by DeepMatcher) chosen to face the challenge.

CLASSIFIER      PRECISION   RECALL   F-SCORE
Decision Tree      92.45%   90.64%    91.54%
Random Forest      93.91%   88.27%    91.00%
RNN                   97%      95%       96%
Attention            100%      73%       84%
Hybrid                98%      73%       84%
SIF                   46%      31%       37%

Table 3: Results on dataset Y

CLASSIFIER            PRECISION   RECALL   F-MEASURE
Random Forest            98.90%   95.03%      96.93%
SVM                      98.29%   93.52%      95.85%
Logistic Regression      96.56%   89.10%      92.68%
Decision Tree            97.72%   87.10%      92.11%
Linear Regression        97.47%   79.27%      87.43%
Naïve Bayes              70.35%   94.13%      80.52%
RNN                      99.59%   96.96%      98.26%

Table 4: Results on dataset W

Moving to the complete dataset requires considering all the pairs of tuples generated by the Cartesian product (887,265,369 pairs), an amount that, even after removing reflexive and symmetric pairs, is not computationally affordable; thus, blocking is needed.

We employ blocking to reduce the number of candidates; in particular, we find that token blocking [10, 14, 16] on the truncated version of the attribute Page Title is a good choice. In practice, token blocking indexes two specifications in the same block if they share one of the first four words of their Page Title. Then, all possible pairs of specifications within each block are considered as candidates. This blocking strategy produces 3,914 blocks, yielding 54,000,932 candidate pairs; on the labelled pairs (i.e., dataset W), it achieves a precision of 0.28 and a recall of 0.99, making the candidate set suitable for processing in an acceptable amount of time.
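The following sketch illustrates token blocking on the truncated titles (our own illustration; function and variable names are ours):

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(titles: dict) -> set:
    """Put two specs in the same block when they share one of the first
    four words of their Page Title, then emit all pairs in each block."""
    blocks = defaultdict(set)
    for spec_id, title in titles.items():
        for token in title.lower().split()[:4]:
            blocks[token].add(spec_id)
    candidates = set()
    for block in blocks.values():
        candidates.update(combinations(sorted(block), 2))
    return candidates
```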
In order to improve precision and recall, we also employ a list of frequent useless words to be deleted from Page Title, increasing the probability of finding relevant information among the 4 maintained words. This list is generated by looking for the most shared words and detecting the ones that do not express meaningful information, before focusing on more specific cases. Further improvements can be obtained through the use of aliases, to solve the problem of popular synonyms (e.g., "fuji" and "fujifilm").

3.1 Results with the ML approach

On D, the best-performing configuration we were able to find, employing DeepMatcher with the RNN architecture, could not go beyond an F-measure of 0.47. In detail, it achieves a recall of 0.85 but a very poor precision of 0.32, due to a serious problem of false positives, which cannot be solved by simply removing useless words.

3.2 Why does the ML approach fall short?

The problem of false positives is linked to the nature of the dataset: a match can often be determined only by small brand-dependent details (e.g., the variation of a single letter or digit). Because of its size, the labelled dataset can cover only a small portion of all the relevant cases encountered across all specifications (furthermore, many brands and models of D do not even appear in W). So, the similarity patterns learned by Magellan and DeepMatcher on a few brands and models cannot be effective on the whole dataset, since it is impossible to generalize them.

4 A REGEX APPROACH

Our solution is based on regular expressions (regexes), developed as a variation of the blocking method used in the previous approach. In particular, by exploring the data it is possible to notice that in most cases the model is expressed as a string composed of both letters and digits, so it can be retrieved using a regular expression, while the number of widely represented brands is quite limited and therefore manageable through a (human-generated) list.

In most cases, after simple data cleansing operations (e.g., removing special characters and white spaces), extracting the first alphanumeric string detects the right model. Using this in combination with a list of the most popular brands, both the brand and the model are detected in 19,513 out of the 29,787 specifications.
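A sketch of this first extraction step (the brand list and the regex shown here are illustrative, not the actual human-generated resources):

```python
import re

BRANDS = {"canon", "nikon", "sony", "fujifilm"}  # tiny illustrative subset
MODEL_RE = re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z\d]+$")  # token mixing letters and digits

def extract_brand_and_model(title: str):
    """Return the first known brand and the first alphanumeric token."""
    cleaned = re.sub(r"[^a-z0-9\s-]", " ", title.lower()).replace("-", "")
    tokens = cleaned.split()
    brand = next((t for t in tokens if t in BRANDS), None)
    model = next((t for t in tokens if MODEL_RE.match(t)), None)
    return brand, model

extract_brand_and_model("Nikon D3200 24.2MP Digital SLR Camera")
# -> ('nikon', 'd3200'); note that a token like '2mp' would also match the
# regex, which is why the exceptions list for measure suffixes is needed
```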
In the following, we describe the complete procedure performed to get the final result.

4.1 The Procedure

The camera specifications are read into a dictionary with an identifier ID and the attribute Page Title, considered in its entirety, in lowercase, and with all punctuation characters replaced by a placeholder character (except for the character "-", which is replaced by the empty string because it is often used inside model names). Then, Page Title is transformed into a list of tokens, and the elements that have aliases are replaced with their basic form by employing a dedicated human-generated dictionary, which maps different or misspelled versions of the same brand, like "fuji" for "fujifilm" or "cannon" for "canon". Afterwards, the list is used for the brand retrieval phase, to identify the brand of each camera; also for this phase, a human-generated list containing popular brand names is employed.

For extracting the model (needed for the actual ER), four lists are defined for each brand: the prefixes list and the suffixes list, which contain the recurrent elements that can appear detached from the model strings but must be considered part of the model name; the model list, which contains the purely alphabetical or purely numerical words that must be considered as models (they would not be detected by the defined regex system without this list); and the exceptions list, which contains words that mix letters and digits but must be skipped because they do not represent a model, relying also on an additional list of suffixes that denote measures (e.g., "mm", "gb", etc.). Finally, an equivalences dictionary is employed to conform the different versions of a model name to a chosen standard one.

If a prefix (suffix), identified through the prefixes (suffixes) list, appears in the list of tokens representing Page Title, it is concatenated to its next (previous) token. Scanning the resulting list, a model can be detected and assigned to the current specification if a token is both alphabetical and numerical (detected through the defined regular expression) and does not appear in the exceptions list, or if it is only alphabetical or only numerical and appears in the model list; the first detected model is assigned to the specification.

Once the model has been extracted, further brand-based operations can be performed to normalize it: some prefixes and suffixes are modified or removed (e.g., the suffixes that indicate colors, which would cause false negatives); some models have synonyms depending on the geographic area (e.g., Canon EOS 450D is sold in North America as EOS Rebel XSi and in Japan as EOS Kiss X2), so they must be normalized to a single standard in order to avoid false negatives; some models need the retrieval of additional information for disambiguation (e.g., EOS 5D Mark II, EOS 5D Mark III, and EOS 5D Mark IV must be considered as different models): the models that are sensitive to this problem (e.g., 5D) are stored in an additional brand-related list, and if a model appears in this list, the information related to its edition (e.g., "mark" for Canon) is retrieved from the attribute Page Title and attached to the model name, in order to avoid false positives.

Thus, the result of the process is that a new attribute, brand_n_model, is added to each specification. This attribute contains the concatenation of the extracted brand and model, provided that both have been detected. If the detection was successful, the specification is appended to the solved_specs list, otherwise to the unsolved_specs list.

Once this first step of extracting brand and model from the specifications is concluded, the following step is to determine the matching pairs of specifications. An inverted index built on the solved_specs list generates clusters representing entities, by grouping elements according to the exact match of the attribute brand_n_model. Then, the specifications in the unsolved_specs list are considered as matches only if the content of the attribute Page Title is exactly the same: once again, an inverted index is built to generate clusters. We finally generate the list of matching pairs by emitting all possible pairs of specifications from each identified cluster; we need this step to submit the final solution and compute its F-score. Notice that all the lists and rules are human-crafted; yet, in our experience, this required human labor is no more expensive than the amount of work required for labelling a training dataset of match/non-match pairs for Magellan or DeepMatcher.
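As an illustration of the final equality join and pair emission described above (our own sketch, not the contest code):

```python
from collections import defaultdict
from itertools import combinations

def emit_matches(solved_specs: dict) -> list:
    """Cluster specs by their brand_n_model value via an inverted index,
    then emit all pairs of specs within each cluster as matches."""
    clusters = defaultdict(list)
    for spec_id, brand_n_model in solved_specs.items():
        clusters[brand_n_model].append(spec_id)
    pairs = []
    for members in clusters.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs

emit_matches({"a1": "canon eos5dmarkiii", "b2": "canon eos5dmarkiii"})
# -> [('a1', 'b2')]
```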
4.2 Results

After all the refinements applied to the method, the final submission reached the top F-measure of 0.99, with a precision of 0.99 and a recall of 0.98. Four other teams were able to reach the top result, so the tie was broken according to the execution time of the solutions, and we concluded the contest as runner-up.

5 CONCLUSION

While looking for a good solution to the contest's ER problem, we investigated the limits of state-of-the-art machine learning (Magellan) and deep learning (DeepMatcher) methods for ER. These methods are able to achieve good results when matches can be identified by means of similarity-based features, but in many real-world scenarios matching is based on small variations that make it impossible to generalize the learned patterns to the entire dataset; this is the case of camera models, which have tiny brand-dependent variations distinguishing one camera (an entity) from another. In fact, we found that ER on camera specifications requires human-designed rules and lists (prefixes and suffixes management, exceptions, etc.), which existing ML methods are not able to synthesize. Yet, in our experience, these rules and lists are no more expensive to build than the labelled dataset of match/non-match pairs required by ML-based methods. This suggests that, when approaching an ER problem, collecting match/non-match labels might not be the first thing to do, and might not be necessary at all.

REFERENCES

[1] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
[2] Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Unsupervised String Transformation Learning for Entity Consolidation. In ICDE 2019.
[3] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data Integration. Morgan Kaufmann.
[4] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 1-16.
[5] Yash Govind et al. 2019. Entity Matching Meets Data Science: A Progress Report from the Magellan Project. In SIGMOD 2019.
[6] Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183-1210.
[7] Luca Gagliardelli, Giovanni Simonini, Domenico Beneventano, and Sonia Bergamaschi. 2019. SparkER: Scaling Entity Resolution in Spark. In EDBT 2019. 602-605.
[8] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In SIGMOD 2018.
[9] Felix Naumann and Melanie Herschel. 2010. An Introduction to Duplicate Detection. Morgan & Claypool Publishers.
[10] George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederée, and Wolfgang Nejdl. 2013. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Trans. Knowl. Data Eng. 25, 12 (2013), 2665-2682.
[11] George Papadakis, Georgios M. Mandilaras, Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, and Manolis Koubarakis. 2020. Three-dimensional Entity Resolution with JedAI. Inf. Syst. 93 (2020), 101565.
[12] George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 53, 2 (2020), 1-42.
[13] El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, and Ahmed K. Elmagarmid. 2019. Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics. Proc. VLDB Endow. 12, 12 (2019), 1954-1957.
[14] Giovanni Simonini, Sonia Bergamaschi, and H. V. Jagadish. 2016. BLAST: A Loosely Schema-aware Meta-blocking Approach for Entity Resolution. Proc. VLDB Endow. 9, 12 (2016), 1173-1184.
[15] Giovanni Simonini, Luca Gagliardelli, Sonia Bergamaschi, and H. V. Jagadish. 2019. Scaling Entity Resolution: A Loosely Schema-aware Approach. Inf. Syst. 83 (2019), 145-165.
[16] Giovanni Simonini, George Papadakis, Themis Palpanas, and Sonia Bergamaschi. 2018. Schema-Agnostic Progressive Entity Resolution. In ICDE 2018.