Entity Resolution on Camera Records without Machine Learning
SIGMOD 2020 Contest Finalist Paper

Luca Zecchini, Giovanni Simonini, Sonia Bergamaschi
Università degli Studi di Modena e Reggio Emilia, Italy
zecchini.luca@libero.it, simonini@unimore.it, sonia.bergamaschi@unimore.it

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities.

First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in this context, not even reaching a 0.49 F-score. We then propose an efficient solution that exploits human-generated annotated lists and regular expressions and reaches a 0.99 F-score. In our experience, in terms of human effort our approach was not more expensive than the labeling of match/non-match pairs required by ML-based methods.

CCS CONCEPTS

• Information systems → Entity resolution.

KEYWORDS

Entity Matching, Entity Resolution, Data Integration, Data Wrangling

1 INTRODUCTION

1.1 Competition Description

This paper reports our solution to the ACM SIGMOD 2020 programming contest, which finished as runner-up. During the contest, we were asked to solve an Entity Resolution (ER) problem; in particular, to perform ER on a provided dirty dataset of specifications (i.e., records) referring to cameras, extracted from different sources, with the help of an available labelled dataset, if needed. In this context, the aim of ER was to identify all the specifications that refer to the same real-world camera model. The solutions were evaluated according to the F-measure (the harmonic mean of precision and recall, rounded to two decimal places) reached on a secret evaluation dataset.

Dataset Description. The provided dataset, called D, is composed of 29,787 specifications collected from 24 e-commerce websites. Each specification is a list of name-value pairs describing a camera being sold on the website. As illustrated in Table 1, the distribution of the specifications across the sources is not uniform: a few sources (www.ebay.com and www.alibaba.com) contribute most of the specifications. The data sources are dirty, meaning that it is possible to find matches even among specifications coming from the same source. Moreover, the transitive closure is applied on the matches: if e1 matches with e2 and e3, then e2 also matches with e3.

SOURCE                     SPECS   SOURCE                       SPECS
buy.net                      358   www.henrys.com                 181
cammarkt.com                 198   www.ilgs.net                   102
www.alibaba.com            7,972   www.mypriceindia.com           347
www.buzzillions.com          832   www.pcconnection.com           211
www.cambuy.com.au            118   www.pricedekho.com             366
www.camerafarm.com.au        120   www.price-hunt.com             327
www.canon-europe.com         164   www.priceme.co.nz              740
www.ebay.com              14,274   www.shopbot.com.au             516
www.eglobalcentral.co.uk     571   www.shopmania.in               630
www.flipkart.com             157   www.ukdigitalcameras.co.uk     129
www.garricks.com.au          130   www.walmart.com                195
www.gosale.com             1,002   www.wexphotographic.com        147

Table 1: Specifications extracted from each source
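Each specification is essentially a small semi-structured record. As a minimal illustration of this format (the identifier convention, attribute names, and values below are hypothetical, not taken from the actual dataset):

```python
# Illustrative example of a specification as a list of name-value pairs.
# The identifier format, attribute names, and values are hypothetical.
spec = {
    "id": "www.ebay.com//1234",  # source-qualified identifier (assumed format)
    "page title": "Canon EOS 5D Mark III 22.3 MP Digital SLR Camera",
    "brand": "Canon",            # present only in some sources
    "sensor type": "CMOS",       # attribute names vary across sources
}
```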
As for the attributes, Page Title is the only one present in every specification, while none of the other 4,660 attributes appears in all the sources or, in most cases, in all the specifications from the same source. Furthermore, the attributes suffer from problems of homonymy (same name but different meaning) and synonymy (same meaning but different names). Finally, many details about additional accessories (e.g., lens, bag, tripod, etc.) are provided even though they do not contribute to the identification of the entities; the same holds for many other attributes, such as the color.

Labelled Data. In addition to D (the complete dataset), a labelled dataset is provided to train a model in case of a machine learning (ML) solution. Unlike D, this dataset contains a list of pairs of specification ids with the related binary label: 1 denotes a match (i.e., both specifications refer to the same real-world camera model), while 0 denotes a non-match.

The labelled dataset is provided in two versions: Y, generated from 306 specifications and available since the start of the competition, and W, which is larger (generated from 908 specifications), includes all the pairs already present in Y, and was made available only later. Table 2 reports the size of the labelled datasets and their internal distribution of matching and non-matching pairs.

DATASET     COUPLES   MATCHES   NON-MATCHES
Dataset Y    46,665     3,582        43,083
Dataset W   297,651    44,039       253,612

Table 2: Characteristics of the labelled datasets
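Since the ground truth is closed under transitivity (Section 1.1), the labelled match pairs implicitly define clusters of specifications. A minimal union-find sketch (our own illustration, not part of the contest code) shows how the closure can be materialized:

```python
from collections import defaultdict

# Toy union-find over specification ids, used to apply the transitive
# closure on labelled matching pairs (illustrative sketch).
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Hypothetical pairs carrying label 1 (match)
match_pairs = [("s1", "s2"), ("s1", "s3")]
for a, b in match_pairs:
    union(a, b)

clusters = defaultdict(set)
for spec_id in list(parent):
    clusters[find(spec_id)].add(spec_id)
# s1, s2, s3 end up in the same cluster, so (s2, s3) is also a match
```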
Evaluation Process. The competition's participants were asked to identify all the matching pairs of D. A portion of the ground truth of D was known only to the organizer, who returned the F-measure on that portion as feedback for each submitted solution. The last version submitted before the contest deadline was used to determine the final ranking. After the deadline, the top teams had to provide their code and a guide to execute it, in order to validate and certify the score (through manual inspection and a reproducibility test of the solution) and to measure its execution time; this happened in a network-isolated container with defined specifications (4x3.0 GHz processor, 16 GB main memory, 128 GB storage, Linux OS). Both training (not included in the final execution time) and execution had to respect a time limit of 12 hours each.

1.2 Our Solution and Paper Organization

Our solution is based on the definition of human-crafted rules and lists to detect the information about brands and models in the specification titles (all titles contain this information) and to normalize it in order to perform an equality join that completes the ER process.

In our experience, the definition of brand-based rules obtained through the human study of the data was not more expensive than labelling a general training dataset for a state-of-the-art ML-based solution, which cannot capture with the same accuracy all the brand-dependent patterns needed for detecting matches; i.e., the ML-based solution achieves a lower F-score.

The remainder of the paper is organized as follows: Section 2 reviews the related work; Section 3 shows how the ML approach performs on our problem; Section 4 presents our solution and the results achieved in the challenge; finally, the lessons learned are reported in Section 5.

2 RELATED WORK

ER (Entity Resolution, a.k.a. Entity Matching, Duplicate Detection, Record Linkage) has been studied for decades [6], but is still a relevant research problem, since it significantly relies on human effort for labelling training data, generating rules, blocking, and many other related tasks; see [1, 3, 4, 9, 12, 15] for surveys. Frameworks have been proposed to support practitioners in solving this task within data preparation pipelines [5, 7, 11, 13]. Among all the solutions proposed in the literature for structured data, the latest research outcomes indicate the Machine Learning (ML) approaches as the most promising, achieving the best performance on publicly available benchmarks when training data is available. In particular, the state-of-the-art ER algorithms based on ML and deep learning are Magellan [5] and DeepMatcher [8], respectively.

Magellan is an ER tool that supports ER through supervised learning techniques. In practice, in the training phase it takes as input a set of labelled examples of matching and non-matching pairs of specifications and trains a binary classifier. Then, in the inference phase, it predicts the labels of unseen pairs of specifications, i.e., it performs ER on the unlabelled part of the dataset. Of course, an ML classifier needs feature engineering for processing pairs of specifications; thus, Magellan employs similarity functions computed on pairs of corresponding attribute values (e.g., Jaccard and cosine similarity, edit distance, etc.) as features of each specification pair. As classification algorithms, it implements the most common and best-performing ones, such as decision tree, random forest, SVM, and naïve Bayes.
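To give a flavor of such features, here is a schematic sketch of a token-based similarity computed on a pair of attribute values (our own illustration, not Magellan's actual API):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# A (tiny) feature vector for one candidate pair; a tool like Magellan
# generates many such similarity features over aligned attributes.
t1 = "Canon EOS 5D Mark III Digital SLR"
t2 = "Canon EOS 5D Mark IV Body Only"
features = [jaccard(t1, t2)]  # extend with cosine similarity, edit distance, etc.
```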
Magellan may be sensitive to the definition of the features (e.g., selecting some similarity functions rather than others, or using different sizes for the tokenization of the attributes); hence, it can sometimes be hard to tune. On the other hand, by employing deep learning this step is basically avoided: features are automatically extracted by the neural network. Moreover, with the right neural network architecture, the final results are often better. This is essentially the idea behind DeepMatcher. In particular, it provides four different kinds of architecture for EM tasks: (i) SIF, which determines a match or non-match by considering the words present in each attribute-value pair, without caring about their order; (ii) RNN, which considers the sequences of words; (iii) Attention, which considers the alignment of words, without caring about word order; (iv) Hybrid, which cares about the alignment of sequences of words and is selected as the default model.

Regular expressions have been used in the ER context for discovering transformations for the entity reconciliation process (e.g., to make the model representations uniform among different specifications of the same camera) [2]. This is known as the golden record approach and exploits labelled data to synthesize regular expressions that can be used to transform strings in a canonical way, for all the attributes of a record, which becomes the golden record at the end of the transformations. Our solution exploits regular expressions to isolate and normalize substrings that refer to a camera model, but allows brand-dependent transformations, which do not generalize to the whole dataset as the golden-record approach assumes.

3 HOW DID THE ML APPROACH PERFORM IN THIS SETTING?

As mentioned in Section 1.1, the data sources from which the specifications were extracted have highly heterogeneous schemas, the attributes have ambiguous names, and most of them are used only in a handful of specifications. This makes schema alignment a hard task in itself. Furthermore, in our experiments we were not able to find any attributes that would appreciably help in the resolution of the ER problem. Thus, we decided not to explore this route further; i.e., we do not perform a schema alignment. As a consequence, EM with Magellan and DeepMatcher is performed on the attribute Page Title, which is always present and in most cases contains the two elements sufficient to uniquely identify a camera: the brand and the model. Notice that both Magellan and DeepMatcher can operate only with an aligned schema, so a different approach would not have been possible.

We preprocess the labelled dataset to the format required by the libraries and set the attribute Page Title to lowercase, then we generate train, validation, and test sets respecting a 3:1:1 ratio. All of these sets contain matches and non-matches following the same distribution, which reflects that of the complete dataset.

Because of the extremely variable length of the attribute values, the results obtained by Magellan on Y are extremely poor (best F-measure of 0.40, with naïve Bayes). This is because only a portion of the titles is relevant to determine a match. Since in most cases the useful information (brand and model) is located among the first words of Page Title, we normalize the titles by truncating the string and keeping only the first n words, with the best value of n determined empirically as 4.
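A minimal sketch of this truncation step (our own illustration):

```python
def truncate_title(title: str, n: int = 4) -> str:
    """Lowercase a Page Title and keep only its first n words."""
    return " ".join(title.lower().split()[:n])

truncate_title("Canon EOS 5D Mark III 22.3 MP Digital SLR Camera")
# -> 'canon eos 5d mark'
```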
Table 3 and Table 4 show the results obtained on the datasets Y and W, with the best-performing model (the RNN by DeepMatcher) chosen to face the challenge.

CLASSIFIER      PRECISION   RECALL   F-SCORE
Decision Tree      92.45%   90.64%    91.54%
Random Forest      93.91%   88.27%    91.00%
RNN                   97%      95%       96%
Attention            100%      73%       84%
Hybrid                98%      73%       84%
SIF                   46%      31%       37%

Table 3: Results on dataset Y

CLASSIFIER            PRECISION   RECALL   F-MEASURE
Random Forest            98.90%   95.03%      96.93%
SVM                      98.29%   93.52%      95.85%
Logistic Regression      96.56%   89.10%      92.68%
Decision Tree            97.72%   87.10%      92.11%
Linear Regression        97.47%   79.27%      87.43%
Naïve Bayes              70.35%   94.13%      80.52%
RNN                      99.59%   96.96%      98.26%

Table 4: Results on dataset W

Moving to the complete dataset requires considering all the pairs of tuples generated by the Cartesian product (887,265,369 pairs), an amount that, even after removing reflexive and symmetric pairs, is not computationally affordable; thus, blocking is needed.

We employ blocking to reduce the number of candidates; in particular, we find that token blocking [10, 14, 16] on the truncated version of the attribute Page Title is a good choice. In practice, token blocking indexes two specifications in the same block if they share one of the first four words of their Page Title. Then, all possible pairs of specifications within each block are considered as candidates. This blocking strategy produces 3,914 blocks, yielding 54,000,932 candidate pairs; on the labelled pairs (i.e., dataset W), it achieves a precision of 0.28 and a recall of 0.99, making the candidate set suitable for processing in an acceptable amount of time.
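The following sketch illustrates token blocking on the truncated titles (our own illustration; function and variable names are ours):

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(titles: dict) -> set:
    """Put two specs in the same block when they share one of the first
    four words of their Page Title, then emit all pairs in each block."""
    blocks = defaultdict(set)
    for spec_id, title in titles.items():
        for token in title.lower().split()[:4]:
            blocks[token].add(spec_id)
    candidates = set()
    for block in blocks.values():
        candidates.update(combinations(sorted(block), 2))
    return candidates
```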
In order to improve precision and recall, we also employ a list of frequent useless words to be deleted from Page Title, increasing the probability of finding relevant information among the 4 maintained words. This list is generated by looking for the most shared words and detecting the ones that do not express meaningful information, before focusing on more specific cases. Further improvements can be obtained through the use of aliases, to solve the problem of popular synonyms (e.g., "fuji" and "fujifilm").

3.1 Results with the ML approach

On D, the best-performing configuration we were able to find, employing DeepMatcher with the RNN architecture, could not go beyond an F-measure of 0.47. In detail, it achieves a recall of 0.85 but a very poor precision of 0.32, due to a serious problem of false positives, which cannot be solved by simply removing useless words.

3.2 Why does the ML approach fall short?

The problem of false positives is linked to the nature of the dataset: a match can often be determined only by small brand-dependent details (e.g., the variation of a single letter or digit). Because of its size, the labelled dataset can cover only a small portion of all the relevant cases encountered across all specifications (furthermore, many brands and models of D do not even appear in W). So, the similarity patterns learned by Magellan and DeepMatcher on a few brands and models cannot be effective on the whole dataset, since it is impossible to generalize them.

4 A REGEX APPROACH

Our solution is based on regular expressions (regexes), developed as a variation of the blocking method used in the previous approach. In particular, by exploring the data it is possible to notice that in most cases the model is expressed as a string composed of both letters and digits, so it can be retrieved using a regular expression, while the number of widely represented brands is quite limited and therefore manageable through a (human-generated) list.

In most cases, after simple data cleansing operations (e.g., removing special characters and white spaces), extracting the first alphanumeric string detects the right model. Using this in combination with a list of the most popular brands, both the brand and the model are detected in 19,513 out of the 29,787 specifications.
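A sketch of this first extraction step (the brand list and the regex shown here are illustrative, not the actual human-generated resources):

```python
import re

BRANDS = {"canon", "nikon", "sony", "fujifilm"}  # tiny illustrative subset
MODEL_RE = re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z\d]+$")  # token mixing letters and digits

def extract_brand_and_model(title: str):
    """Return the first known brand and the first alphanumeric token."""
    cleaned = re.sub(r"[^a-z0-9\s-]", " ", title.lower()).replace("-", "")
    tokens = cleaned.split()
    brand = next((t for t in tokens if t in BRANDS), None)
    model = next((t for t in tokens if MODEL_RE.match(t)), None)
    return brand, model

extract_brand_and_model("Nikon D3200 24.2MP Digital SLR Camera")
# -> ('nikon', 'd3200'); note that a token like '2mp' would also match the
# regex, which is why the exceptions list for measure suffixes is needed
```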
In the following, we describe the complete procedure performed to get the final result.

4.1 The Procedure

The camera specifications are read into a dictionary with an identifier ID and the attribute Page Title, considered in its entirety, in lowercase, and with all punctuation characters replaced by a placeholder character (except for the character "-", which is replaced by the empty string because it is often used inside model names). Then, Page Title is transformed into a list of tokens, and the elements that have aliases are replaced with their basic form by employing a dedicated human-generated dictionary, which maps different or misspelled versions of the same brand, like "fuji" for "fujifilm" or "cannon" for "canon". Afterwards, the list is used for the brand retrieval phase, to identify the brand of each camera; also for this phase, a human-generated list containing popular brand names is employed.

For extracting the model (needed for the actual ER), four lists are defined for each brand: the prefixes list and the suffixes list, which contain the recurrent elements that can appear detached from the model strings but must be considered part of the model name; the model list, which contains the purely alphabetical or purely numerical words that must be considered as models (they would not be detected by the defined regex system without this list); and the exceptions list, which contains words that mix letters and digits but must be skipped because they do not represent a model, relying also on an additional list of suffixes that denote measures (e.g., "mm", "gb", etc.). Finally, an equivalences dictionary is employed to conform the different versions of a model name to a chosen standard one.

If a prefix (suffix), identified through the prefixes (suffixes) list, appears in the list of tokens representing Page Title, it is concatenated to its next (previous) token. Scanning the resulting list, a model can be detected and assigned to the current specification if a token is both alphabetical and numerical (detected through the defined regular expression) and does not appear in the exceptions list, or if it is only alphabetical or only numerical and appears in the model list; the first detected model is assigned to the specification.

Once the model has been extracted, further brand-based operations can be performed to normalize it: some prefixes and suffixes are modified or removed (e.g., the suffixes that indicate colors, which would cause false negatives); some models have synonyms depending on the geographic area (e.g., Canon EOS 450D is sold in North America as EOS Rebel XSi and in Japan as EOS Kiss X2), so they must be normalized to a single standard in order to avoid false negatives; some models need the retrieval of additional information for disambiguation (e.g., EOS 5D Mark II, EOS 5D Mark III, and EOS 5D Mark IV must be considered as different models): the models that are sensitive to this problem (e.g., 5D) are stored in an additional brand-related list, and if a model appears in this list, the information related to its edition (e.g., "mark" for Canon) is retrieved from the attribute Page Title and attached to the model name, in order to avoid false positives.

Thus, the result of the process is that a new attribute, brand_n_model, is added to each specification. This attribute contains the concatenation of the extracted brand and model, provided that both have been detected. If the detection was successful, the specification is appended to the solved_specs list, otherwise to the unsolved_specs list.

Once this first step of extracting brand and model from the specifications is concluded, the following step is to determine the matching pairs of specifications. An inverted index built on the solved_specs list generates clusters representing entities, by grouping elements according to the exact match of the attribute brand_n_model. Then, the specifications in the unsolved_specs list are considered as matches only if the content of the attribute Page Title is exactly the same: once again, an inverted index is built to generate clusters. We finally generate the list of matching pairs by emitting all possible pairs of specifications from each identified cluster; we need this step to submit the final solution and compute its F-score. Notice that all the lists and rules are human-crafted; yet, in our experience, this required human labor is no more expensive than the amount of work required for labelling a training dataset of match/non-match pairs for Magellan or DeepMatcher.
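As an illustration of the final equality join and pair emission described above (our own sketch, not the contest code):

```python
from collections import defaultdict
from itertools import combinations

def emit_matches(solved_specs: dict) -> list:
    """Cluster specs by their brand_n_model value via an inverted index,
    then emit all pairs of specs within each cluster as matches."""
    clusters = defaultdict(list)
    for spec_id, brand_n_model in solved_specs.items():
        clusters[brand_n_model].append(spec_id)
    pairs = []
    for members in clusters.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs

emit_matches({"a1": "canon eos5dmarkiii", "b2": "canon eos5dmarkiii"})
# -> [('a1', 'b2')]
```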
4.2 Results

After all the refinements applied to the method, the final submission reached the top F-measure of 0.99, with a precision of 0.99 and a recall of 0.98. Four other teams were able to reach the top result, so the tie was broken according to the execution time of the solutions, and we concluded the contest as runner-up.

5 CONCLUSION

While looking for a good solution to the contest's ER problem, we investigated the limits of state-of-the-art machine learning (Magellan) and deep learning (DeepMatcher) methods for ER. These methods are able to achieve good results when matches can be identified by means of similarity-based features, but in many real-world scenarios matching is based on small variations that make it impossible to generalize the learned patterns to the entire dataset; this is the case of camera models, which have tiny brand-dependent variations distinguishing one camera (an entity) from another. In fact, we found that ER on camera specifications requires human-designed rules and lists (prefixes and suffixes management, exceptions, etc.), which existing ML methods are not able to synthesize. Yet, in our experience, these rules and lists are no more expensive to build than the labelled dataset of match/non-match pairs required by ML-based methods. This suggests that, when approaching an ER problem, collecting match/non-match labels might not be the first thing to do, and might not be necessary at all.

REFERENCES

[1] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
[2] Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Unsupervised String Transformation Learning for Entity Consolidation. In ICDE 2019.
[3] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data Integration. Morgan Kaufmann.
[4] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 1-16.
[5] Yash Govind et al. 2019. Entity Matching Meets Data Science: A Progress Report from the Magellan Project. In SIGMOD 2019.
[6] Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183-1210.
[7] Luca Gagliardelli, Giovanni Simonini, Domenico Beneventano, and Sonia Bergamaschi. 2019. SparkER: Scaling Entity Resolution in Spark. In EDBT 2019. 602-605.
[8] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In SIGMOD 2018.
[9] Felix Naumann and Melanie Herschel. 2010. An Introduction to Duplicate Detection. Morgan & Claypool Publishers.
[10] George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederée, and Wolfgang Nejdl. 2013. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Trans. Knowl. Data Eng. 25, 12 (2013), 2665-2682.
[11] George Papadakis, Georgios M. Mandilaras, Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, and Manolis Koubarakis. 2020. Three-dimensional Entity Resolution with JedAI. Inf. Syst. 93 (2020), 101565.
[12] George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 53, 2 (2020), 1-42.
[13] El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, and Ahmed K. Elmagarmid. 2019. Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics. Proc. VLDB Endow. 12, 12 (2019), 1954-1957.
[14] Giovanni Simonini, Sonia Bergamaschi, and H. V. Jagadish. 2016. BLAST: A Loosely Schema-aware Meta-blocking Approach for Entity Resolution. Proc. VLDB Endow. 9, 12 (2016), 1173-1184.
[15] Giovanni Simonini, Luca Gagliardelli, Sonia Bergamaschi, and H. V. Jagadish. 2019. Scaling Entity Resolution: A Loosely Schema-aware Approach. Inf. Syst. 83 (2019), 145-165.
[16] Giovanni Simonini, George Papadakis, Themis Palpanas, and Sonia Bergamaschi. 2018. Schema-Agnostic Progressive Entity Resolution. In ICDE 2018.