=Paper=
{{Paper
|id=Vol-2726/paper3
|storemode=property
|title=Entity Resolution on Camera Records Without Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-2726/paper3.pdf
|volume=Vol-2726
|authors=Luca Zecchini,Giovanni Simonini,Sonia Bergamaschi
|dblpUrl=https://dblp.org/rec/conf/vldb/ZecchiniSB20
}}
==Entity Resolution on Camera Records Without Machine Learning==
Entity Resolution on Camera Records without Machine Learning
SIGMOD 2020 Contest Finalist Paper

Luca Zecchini, Università degli Studi di Modena e Reggio Emilia, Italy (zecchini.luca@libero.it)
Giovanni Simonini, Università degli Studi di Modena e Reggio Emilia, Italy (simonini@unimore.it)
Sonia Bergamaschi, Università degli Studi di Modena e Reggio Emilia, Italy (sonia.bergamaschi@unimore.it)

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
ABSTRACT

This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities.

First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in such a context—not even reaching a 0.49 F-score. Then, we propose an efficient solution that exploits human-generated annotated lists and regular expressions and reaches a 0.99 F-score. In our experience, in terms of human effort, our approach was no more expensive than labeling the dataset of match/non-match pairs required by ML-based methods.
CCS CONCEPTS

• Information systems → Entity resolution.

KEYWORDS

Entity Matching, Entity Resolution, Data Integration, Data Wrangling

1 INTRODUCTION

1.1 Competition Description

This paper reports our solution to the ACM SIGMOD 2020 programming contest, in which we finished as runner-up. During the contest, we were asked to solve an Entity Resolution (ER) problem; in particular, to perform ER on a provided dirty dataset of specifications (i.e., records) referring to cameras, extracted from different sources—with the help of an available labelled dataset, if needed. In this context, the aim of ER was to identify all the specifications (i.e., records) that referred to the same real-world camera model. The solutions were evaluated according to the F-measure (harmonic mean of precision and recall, rounded to two decimal places) reached on a secret evaluation dataset.
Dataset Description. The provided dataset, called D, is composed of 29,787 specifications collected from 24 e-commerce websites. Each specification is a list of name-value pairs describing a camera that is being sold on the website. As illustrated in Table 1, the distribution of the specifications across the sources is not uniform, with a few sources (www.ebay.com and www.alibaba.com) contributing most of the specifications.

SOURCE                      SPECS    SOURCE                         SPECS
buy.net                       358    www.henrys.com                   181
cammarkt.com                  198    www.ilgs.net                     102
www.alibaba.com             7,972    www.mypriceindia.com             347
www.buzzillions.com           832    www.pcconnection.com             211
www.cambuy.com.au             118    www.pricedekho.com               366
www.camerafarm.com.au         120    www.price-hunt.com               327
www.canon-europe.com          164    www.priceme.co.nz                740
www.ebay.com               14,274    www.shopbot.com.au               516
www.eglobalcentral.co.uk      571    www.shopmania.in                 630
www.flipkart.com              157    www.ukdigitalcameras.co.uk       129
www.garricks.com.au           130    www.walmart.com                  195
www.gosale.com              1,002    www.wexphotographic.com          147

Table 1: Specifications extracted from each source

As for the attributes, Page Title is the only one present in each specification, while none of the other 4,660 attributes appears in all the sources or, in most cases, in all the specifications from the same source. Furthermore, the attributes suffer from problems of homonymy (same name but different meaning) and synonymy (same meaning but different names). Finally, many details about additional accessories (e.g., lens, bag, tripod, etc.) are provided even if they do not contribute to the identification of the entities—the same holds for many other attributes, such as the color.

Labelled Data. In addition to D (the complete dataset), a labelled dataset is provided to train a model in case of a machine learning (ML) solution. Differently from D, this dataset contains a list of specification id pairs and the related binary label: 1, which denotes a match (i.e., both specifications refer to the same real-world camera model), and 0, which denotes a non-match.

The labelled dataset is provided in two versions: Y, generated from 306 specifications and available since the start of the competition, and W, which is larger (generated from 908 specifications), includes all the couples already present in Y, and was made available only later. Table 2 reports the size of the labelled datasets and their internal distribution of matching and non-matching couples.

DATASET      COUPLES    MATCHES    NON-MATCHES
Dataset Y     46,665      3,582         43,083
Dataset W    297,651     44,039        253,612

Table 2: Characteristics of the labelled datasets

The data sources are dirty, meaning that it is possible to find matches even among specifications coming from the same source. Moreover, the transitive closure is applied on the matches: if e1 matches with e2 and e3, then e2 also matches with e3.
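For illustration, here is a minimal sketch (ours, not the contest code; the ids e1 to e5 are hypothetical) of how this transitive closure over matching pairs can be computed with a union-find structure:

```python
# Minimal union-find over specification ids: if e1 matches e2 and e1 matches
# e3, then e2 and e3 end up in the same set, i.e., they match by transitivity.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

uf = UnionFind()
for e1, e2, label in [("e1", "e2", 1), ("e1", "e3", 1), ("e4", "e5", 0)]:
    if label == 1:
        uf.union(e1, e2)

assert uf.find("e2") == uf.find("e3")  # e2 matches e3 by transitive closure
```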
Evaluation Process. The competition's participants were asked to identify all the matching pairs of D. A portion of the ground truth of D was known only to the organizer, which returned the F-measure on that portion as feedback for each submitted solution. The last version submitted before the contest deadline was used to determine the final ranking. After the deadline, the top teams had to provide their code and a guide to execute it, in order to validate and certify the score (through manual inspection and a reproducibility test of the solution) and to measure its execution time; this happened in a network-isolated container with defined specifications (4×3.0 GHz processor, 16 GB main memory, 128 GB storage, Linux OS). Both training (not included in the final execution time) and execution had to respect a maximum time limit of 12 hours each.
1.2 Our Solution and Paper Organization

Our solution is based on the definition of human-crafted rules and lists to detect the information about brands and models in the specification titles—all titles contain this information—and to normalize it in order to perform an equality join completing the ER process.

In our experience, the definition of brand-based rules obtained through the human study of the data was no more expensive than labelling a general training dataset for a state-of-the-art ML-based solution, which cannot capture with the same accuracy all the brand-dependent patterns needed for detecting matches—i.e., the ML-based solution achieves a lower F-score.

The remainder of the paper is organized as follows: Section 2 reviews the related work; Section 3 shows how the ML approach performs on our problem; Section 4 presents our solution and its results in the challenge; finally, the lessons learned are reported in Section 5.
2 RELATED WORK

ER (Entity Resolution, a.k.a. Entity Matching, Duplicate Detection, Record Linkage) has been studied for decades [6], but is still a relevant research problem, since it significantly relies on human effort for labelling training data, for generating rules, for blocking, and for many other related tasks—see [1, 3, 4, 9, 12, 15] for surveys. Frameworks have been proposed to support practitioners in solving this task within data preparation pipelines [5, 7, 11, 13]. Among all the solutions proposed in the literature for structured data, the latest research outcomes indicate the machine learning (ML) approaches as the most promising, achieving the best performance on publicly available benchmarks when training data is available. In particular, the state-of-the-art ER algorithms based on ML and deep learning are Magellan [5] and DeepMatcher [8], respectively.

Magellan is an ER tool that allows performing ER using supervised learning techniques. In practice, in the training phase it takes as input a set of labelled examples of matching and non-matching pairs of specifications and trains a binary classifier. Then, in the inference phase, it predicts the labels of unseen pairs of specifications, i.e., it performs ER on the unlabelled part of the dataset. Of course, an ML classifier needs feature engineering for processing pairs of specifications; thus, Magellan employs similarity functions computed on pairs of corresponding attribute values (e.g., Jaccard and cosine similarity, edit distance, etc.) as features of each specification pair. As classification algorithms, it implements the most common and best-performing ones, such as decision tree, random forest, SVM, and naïve Bayes.

Magellan may be sensitive to the definition of the features (e.g., by selecting some similarity functions rather than others or by using different sizes for the tokenization of the attributes); hence, it can sometimes be hard to tune. On the other hand, by employing deep learning, this step is basically avoided: features are automatically extracted by the neural network. Moreover, with the right neural network architecture, the final results are often better. This is essentially the idea behind DeepMatcher. In particular, it provides four different kinds of architecture for EM tasks: (i) SIF, which determines a match or non-match by considering the words present in each attribute value pair, without caring about their order; (ii) RNN, which considers the sequences of words; (iii) Attention, which considers the alignment of words, without caring about word order; (iv) Hybrid, which considers the alignment of sequences of words and is selected as the default model.

Regular expressions have been used in the ER context for discovering transformations for the entity reconciliation process (e.g., to make the model representations uniform among different specifications of the same camera) [2]. This is known as the golden record approach and exploits labelled data to synthesize regular expressions that can be used to transform strings in a canonical way, for all the attributes of a record—which becomes the golden record at the end of the transformations. Our solution exploits regular expressions to isolate and normalize substrings that refer to a camera model, but allows brand-dependent transformations, which do not generalize to the whole dataset as the golden-record approach assumes.

3 HOW DID THE ML APPROACH PERFORM IN THIS SETTING?

As mentioned in Section 1.1, the data sources from which the specifications were extracted have highly heterogeneous schemas, the attributes have ambiguous names, and most of them are used only in a handful of specifications. This makes schema alignment a hard task in itself. Furthermore, in our experiments we were not able to find any attributes that would appreciably help in the resolution of the ER problem. Thus, we decided not to explore this route further—i.e., we do not perform a schema alignment. As a consequence, EM with Magellan and DeepMatcher is performed on the attribute Page Title, which is always present and contains in most cases the two elements sufficient to uniquely identify a camera: the brand and the model. Notice that both Magellan and DeepMatcher can operate only with an aligned schema, thus a different approach would not be possible.

We preprocess the labelled dataset to the format required by the libraries and set the attribute Page Title to lowercase; then we generate train, validation, and test sets respecting a 3:1:1 ratio. All of these sets contain matches and non-matches following the same distribution, which reflects that of the complete dataset.
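As an illustration, here is a minimal sketch of such a stratified 3:1:1 split (our reconstruction, not the original contest code; it assumes a pandas DataFrame named pairs with a binary label column):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_3_1_1(pairs: pd.DataFrame, seed: int = 42):
    # 3/5 for training, then the remaining 2/5 halved into validation and test;
    # stratifying on the label keeps the match/non-match ratio in every set.
    train, rest = train_test_split(pairs, train_size=0.6,
                                   stratify=pairs["label"], random_state=seed)
    valid, test = train_test_split(rest, train_size=0.5,
                                   stratify=rest["label"], random_state=seed)
    return train, valid, test
```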
Because of the extremely variable length of the attribute values, the results obtained by Magellan on Y are extremely poor (best F-measure of 0.40 with naïve Bayes). This is because only a portion of the titles is relevant to determine a match. Since in most cases the useful information (brand and model) is located among the first words of Page Title, the normalization is realized by truncating the string, considering only the first n words, with the best value of n empirically determined to be 4. Table 3 and Table 4 show the results obtained on the datasets Y and W, with the best performing model (RNN by DeepMatcher) chosen to face the challenge.

CLASSIFIER       PRECISION    RECALL    F-SCORE
Decision Tree       92.45%    90.64%     91.54%
Random Forest       93.91%    88.27%     91.00%
RNN                    97%       95%        96%
Attention             100%       73%        84%
Hybrid                 98%       73%        84%
SIF                    46%       31%        37%

Table 3: Results on dataset Y
CLASSIFIER             PRECISION    RECALL    F-MEASURE
Random Forest             98.90%    95.03%       96.93%
SVM                       98.29%    93.52%       95.85%
Logistic Regression       96.56%    89.10%       92.68%
Decision Tree             97.72%    87.10%       92.11%
Linear Regression         97.47%    79.27%       87.43%
Naïve Bayes               70.35%    94.13%       80.52%
RNN                       99.59%    96.96%       98.26%

Table 4: Results on dataset W
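For reference, the DeepMatcher architectures above can be trained with a few calls. The following is a minimal sketch based on the library's documented API; the directory and file names are placeholders, not the contest setup:

```python
import deepmatcher as dm

# Pairs are read from CSV files with aligned columns (here, the truncated
# lowercase Page Title of the left and right specification of each pair).
train, validation, test = dm.data.process(
    path="data", train="train.csv", validation="valid.csv", test="test.csv")

model = dm.MatchingModel(attr_summarizer="rnn")  # the best model in Table 4
model.run_train(train, validation, best_save_path="best_model.pth")
model.run_eval(test)
```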
Moving to the complete dataset requires considering all the pairs of tuples generated by the Cartesian product (887,265,369 pairs of tuples), an amount that, even considering the reduction due to the deletion of identity and reflexive pairs, is not computationally affordable; thus, blocking is needed.
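As a quick sanity check of the figure quoted above (29,787 specifications compared against each other):

```python
# Cartesian product of the dataset with itself:
assert 29_787 * 29_787 == 887_265_369
```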
We employ blocking to reduce the number of candidates; in particular, we find that token blocking [10, 14, 16] on the truncated version of the attribute Page Title is a good choice. In practice, token blocking indexes two specifications in the same block if they share one of the first four words in their Page Title. Then, all possible pairs of specifications within each block are considered as candidates. This blocking strategy produces 3,914 blocks, yielding 54,000,932 candidate pairs, and on the labelled ones (i.e., dataset W) it achieves a precision of 0.28 and a recall of 0.99, making it a suitable candidate set for being processed in an acceptable amount of time.
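A minimal sketch of this token blocking scheme follows (the function and variable names are ours, and the specifications are toy examples):

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(specs, num_words=4):
    # specs maps a specification id to its (already truncated) Page Title.
    blocks = defaultdict(set)
    for spec_id, title in specs.items():
        for token in title.split()[:num_words]:
            blocks[token].add(spec_id)
    # Candidate pairs: all pairs of specifications sharing at least one block.
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return blocks, candidates

blocks, candidates = token_blocking({
    "s1": "canon eos 450d digital slr",
    "s2": "canon eos rebel xsi camera",
    "s3": "nikon d90 body only",
})
# "s1" and "s2" share the tokens "canon" and "eos", so they form a candidate pair.
```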
In order to improve precision and recall, we also employ a list of frequent useless words to be deleted from Page Title, increasing the probability of finding relevant information among the 4 retained words. This list is generated by looking for the most shared words and detecting the ones that do not express meaningful information, before focusing on more specific cases. Further improvements can be obtained by the use of aliases, to solve the problem of popular synonyms (e.g., "fuji" and "fujifilm").
3.1 Results with the ML approach

On D, the best performing configuration we were able to find (DeepMatcher with an RNN architecture) could not go beyond an F-measure of 0.47. In detail, it achieves a recall of 0.85 but a very poor precision of 0.32, due to a serious problem of false positives—not solvable through a simple removal of useless words.

3.2 Why does the ML approach fall short?

The problem of false positives is linked to the nature of the dataset: the matching can be determined only on small brand-dependent details (e.g., the variation of a single letter or digit). Because of its size, the labelled dataset can cover only a small portion of all the possible relevant cases that can be encountered across all specifications (furthermore, a lot of brands and models of D do not even appear in W). So, the similarity patterns learned by Magellan and DeepMatcher on a few brands and models cannot be effective on the whole dataset, since it is impossible to generalize them.

4 A REGEX APPROACH

Our solution is based on regular expressions (regex), developed as a variation of the blocking method used in the previous approach. In particular, exploring the data, it is possible to notice that in most cases the model is expressed as a string composed of both letters and digits, so it can be retrieved using a regular expression, while the number of brands with a high distribution is quite limited, hence manageable through a (human-generated) list.

In most cases, by performing simple data cleansing operations (e.g., by removing special characters and white spaces), the extraction of the first alphanumeric string allows detecting the right model. Using this in combination with a list of the most popular brands, both the brand and the model are detected in 19,513 out of the 29,787 specifications. In the following, we describe the complete procedure performed to get the final, complete result.
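As a minimal sketch of this first-alphanumeric heuristic (our reconstruction; the actual cleansing rules and regex of the solution are brand-dependent, as described next):

```python
import re

# A token counts as a model candidate if it contains both letters and digits.
ALNUM_MODEL = re.compile(r"^(?=.*[a-z])(?=.*[0-9])[a-z0-9]+$")

def first_alphanumeric(title: str):
    # Lowercase, drop "-" (often used inside model names), blank out the rest.
    cleaned = re.sub(r"[^a-z0-9 ]", " ", title.lower().replace("-", ""))
    for token in cleaned.split():
        if ALNUM_MODEL.match(token):
            return token
    return None

print(first_alphanumeric("Canon EOS 450-D Digital SLR"))  # -> "450d"
```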
4.1 The Procedure

The camera specifications are read into a dictionary with an identifier ID and the attribute Page Title, considered in its entirety, set to lowercase, and with all punctuation characters replaced by an idle character (except for the character "-", which is substituted by the empty string because it is often used inside model names). Then, Page Title is transformed into a list by tokenization, and the elements that have aliases are replaced with their basic form by employing a dedicated human-generated dictionary—e.g., different or misspelled versions used to indicate the same brand, like "fuji" for "fujifilm" or "cannon" for "canon". Afterwards, the list is used for the brand retrieval phase, to identify the brand of each camera—also for this phase, a human-generated list containing popular brand names is employed.

For extracting the model type (needed for the actual ER), four lists for each brand are defined: the prefixes list and the suffixes list, which contain the recurrent elements that can appear separately from the model strings and that must be considered part of the model name; the model list, which is composed of those purely alphabetical or purely numerical words (which would not be detected by the defined regex system without the use of this list) that must be considered as models; the exceptions list, which contains words that are both alphabetical and numerical but must be skipped because they do not represent a model, using also an additional list of suffixes that denote measures (e.g., "mm", "gb", etc.). Finally, an equivalences dictionary is employed to unify the different versions of a model name by choosing a standard one.

If a prefix (suffix), identified through the prefix (suffix) list, appears in the list of tokens representing Page Title, it is concatenated to its next (previous) token. Scanning the resulting list, a model can be detected and assigned to the current specification if a token is both alphabetical and numerical (detected through the defined regular expression) and does not appear in the exceptions list, or if it is only alphabetical or only numerical and appears in the model list—the first detected model is assigned to the specification.
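A minimal sketch of this detection step for a single brand follows (the lists here are illustrative stand-ins for the human-generated ones, and the helper names are ours):

```python
import re

IS_ALNUM = re.compile(r"^(?=.*[a-z])(?=.*[0-9])[a-z0-9]+$")

def extract_model(tokens, prefixes, suffixes, models, exceptions):
    # Concatenate prefixes with the following token, suffixes with the previous one.
    merged = []
    for tok in tokens:
        if merged and merged[-1] in prefixes:
            merged[-1] += tok
        elif tok in suffixes and merged:
            merged[-1] += tok
        else:
            merged.append(tok)
    # The first token that looks like a model is assigned to the specification.
    for tok in merged:
        if IS_ALNUM.match(tok) and tok not in exceptions:
            return tok
        if tok in models:  # purely alphabetical/numerical model names
            return tok
    return None

tokens = "fujifilm finepix xp 130 waterproof 16mp".split()
print(extract_model(tokens, prefixes={"xp"}, suffixes=set(),
                    models=set(), exceptions={"16mp"}))  # -> "xp130"
```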
Once the model has been extracted, further brand-based operations can be performed to normalize it: some prefixes and suffixes are modified or removed (e.g., the suffixes that indicate colors and would cause false negatives); some models may have synonyms depending on the geographic area (e.g., Canon EOS 450D is sold in North America as EOS Rebel XSi and in Japan as EOS Kiss X2), so they must be normalized to a single standard in order to avoid false negatives; some models need the retrieval of additional information for disambiguation (e.g., EOS 5D Mark II, EOS 5D Mark III, and EOS 5D Mark IV must be considered as different models): the models that are sensitive to this problem (e.g., 5D) are stored in an additional brand-related list. If a model appears in this list, the information related to its edition (e.g., mark for Canon) is retrieved from the attribute Page Title and attached to the model name, in order to avoid false positives.

Thus, the result of the process is that a new attribute is added to each specification: brand_n_model. This attribute contains the concatenation of the extracted brand and model—of course, only if both have been detected. If the detection was successful, the specification is appended to the solved_specs list, otherwise to the unsolved_specs list.

Once this first step of extracting brand and model from the specifications is concluded, the following step is to determine the matching pairs of specifications. An inverted index built from the solved_specs list generates clusters representing entities, by grouping elements according to the perfect match of the attribute brand_n_model. Then, the specifications in the unsolved_specs list are considered as matches only if the content of the attribute Page Title is exactly the same: once again, an inverted index is built to generate clusters.

We finally generate the list of matching pairs by emitting all possible pairs of specifications from each identified cluster. We need this step to submit the final solution and compute its F-score.
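A minimal sketch of these final steps (names are ours, not the contest code):

```python
from collections import defaultdict
from itertools import combinations

def matching_pairs(solved_specs):
    # solved_specs: list of (spec_id, brand_n_model) tuples.
    # The inverted index on brand_n_model groups specifications into clusters.
    clusters = defaultdict(list)
    for spec_id, key in solved_specs:
        clusters[key].append(spec_id)
    # Every cluster emits all of its pairs.
    pairs = []
    for ids in clusters.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs

print(matching_pairs([("s1", "canon450d"), ("s2", "canon450d"),
                      ("s3", "nikond90")]))  # -> [("s1", "s2")]
```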
Notice that all the lists and rules are human-crafted; yet, in our experience, this required human labor is no more expensive than the amount of work required for labelling a training dataset of match/non-match pairs for Magellan or DeepMatcher.
4.2 Results

After all the refinements applied to the method, the final submission reached the top F-measure of 0.99, with a precision of 0.99 and a recall of 0.98. Four other teams were able to reach the top result, so the tie was broken according to the execution time of the solutions, and we concluded the contest as runner-up.

5 CONCLUSION

While looking for a good solution to the contest's ER problem, we investigated the limits of state-of-the-art machine learning (Magellan) and deep learning (DeepMatcher) methods for ER. These methods are able to achieve good results when matches can be identified by means of similarity-based features, but in a lot of real-world scenarios it may happen that matching is based on small variations that make it impossible to generalize the learned patterns to the entire dataset—this is the case of camera models, which have tiny brand-dependent variations distinguishing one camera (an entity) from another. In fact, we found that ER on camera specifications requires human-designed rules and lists (prefixes and suffixes management, exceptions, etc.), which existing ML methods are not able to synthesize. Yet, in our experience, these rules and lists are no more expensive to build than a labelled dataset of match/non-match pairs required by ML-based methods. This suggests that, when approaching an ER problem, starting to collect match/non-match labels might not be the first thing to do, and might not be necessary at all.

REFERENCES

[1] Peter Christen. 2012. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
[2] Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Unsupervised String Transformation Learning for Entity Consolidation. In ICDE 2019.
[3] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data Integration. Morgan Kaufmann.
[4] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 1–16.
[5] Yash Govind et al. 2019. Entity Matching Meets Data Science: A Progress Report from the Magellan Project. In SIGMOD 2019.
[6] Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183–1210.
[7] Luca Gagliardelli, Giovanni Simonini, Domenico Beneventano, and Sonia Bergamaschi. 2019. SparkER: Scaling Entity Resolution in Spark. In EDBT 2019. 602–605.
[8] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In SIGMOD 2018.
[9] Felix Naumann and Melanie Herschel. 2010. An Introduction to Duplicate Detection. Morgan & Claypool Publishers.
[10] George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederée, and Wolfgang Nejdl. 2013. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Trans. Knowl. Data Eng. 25, 12 (2013), 2665–2682.
[11] George Papadakis, Georgios M. Mandilaras, Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, and Manolis Koubarakis. 2020. Three-dimensional Entity Resolution with JedAI. Inf. Syst. 93 (2020), 101565.
[12] George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 53, 2 (2020), 1–42.
[13] El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, and Ahmed K. Elmagarmid. 2019. Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics. Proc. VLDB Endow. 12, 12 (2019), 1954–1957.
[14] Giovanni Simonini, Sonia Bergamaschi, and H. V. Jagadish. 2016. BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution. Proc. VLDB Endow. 9, 12 (2016), 1173–1184.
[15] Giovanni Simonini, Luca Gagliardelli, Sonia Bergamaschi, and H. V. Jagadish. 2019. Scaling Entity Resolution: A Loosely Schema-aware Approach. Inf. Syst. 83 (2019), 145–165.
[16] Giovanni Simonini, George Papadakis, Themis Palpanas, and Sonia Bergamaschi. 2018. Schema-Agnostic Progressive Entity Resolution. In ICDE 2018.