=Paper=
{{Paper
|id=Vol-2726/paper2
|storemode=property
|title=Fast Entity Resolution With Mock Labels and Sorted Integer Sets
|pdfUrl=https://ceur-ws.org/Vol-2726/paper2.pdf
|volume=Vol-2726
|authors=Mark Blacher,Joachim Giesen,Sören Laue,Julien Klaus,Matthias Mitterreiter
|dblpUrl=https://dblp.org/rec/conf/vldb/BlacherGLKM20
}}
==Fast Entity Resolution With Mock Labels and Sorted Integer Sets==
Fast Entity Resolution With Mock Labels and Sorted Integer Sets
SIGMOD 2020 Contest Finalist Paper

Mark Blacher, Julien Klaus, Matthias Mitterreiter, Joachim Giesen, Sören Laue
Friedrich Schiller University Jena, Germany
mark.blacher@uni-jena.de, julien.klaus@uni-jena.de, matthias.mitterreiter@uni-jena.de, joachim.giesen@uni-jena.de, soeren.laue@uni-jena.de

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Entity resolution is the task of identifying equivalent real-world entities across different sources. We found that standard methods based on unsupervised learning are slow during the resolution step. Thus, we propose an alternative approach with mock labels. Mock labels are sets of strings that ideally represent real-world entities. Using a running example, we show how to create mock labels from the training data and use them in the resolution step to match real-world entities efficiently. We applied this approach in the SIGMOD 2020 Programming Contest and achieved significantly faster execution times in the resolution step compared to other top ranked teams. Since all the top ranked teams had the same F-measure, the execution speed of the resolution step was decisive.

CCS CONCEPTS
• Information systems → Entity resolution.

KEYWORDS
entity resolution, data cleansing, programming contest

1 INTRODUCTION
Entity resolution is a problem that occurs in data integration, data cleansing, web search, and e-commerce search scenarios [4]. Given two or more sources, we need to identify records that refer to the same real-world entity, even though there are no unique identifiers that tell us which records from one source match those in the other sources. In addition, records representing the same entity may contain different information. For example, if we use digital cameras as real-world entities, one record may have the brand or the model misspelled, and another may be missing some fields such as pixel count or zoom factor. Furthermore, it is possible that information within individual records is contradictory, such as an incorrect pixel count for a particular camera model. An entity resolution algorithm has to deal with these inconsistencies in the data and identify records corresponding to the same real-world entity as best as possible [2].

1.1 Background
There are basically two types of entity resolution, namely Pair-wise Entity Resolution and Group-wise Entity Resolution.

Definition 1.1 (Pair-wise Entity Resolution). Given a set of objects O, the output is a set of pairs {(o_i, o_j) | o_i, o_j ∈ O, o_i and o_j refer to the same real-world entity}.

Definition 1.2 (Group-wise Entity Resolution). Given a set of objects O, the output is a family of sets {O_i | all objects in O_i refer to the same real-world entity} where O_i ∩ O_j = ∅ if i ≠ j.

Pair-wise Entity Resolution approaches often use similarity or distance functions for determining whether two objects refer to the same real-world entity. Group-wise Entity Resolution approaches additionally use clustering techniques for assigning an object to a group or cluster. The results of these two types of entity resolution are not always consistent [4].
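As a concrete illustration of such a similarity function (our illustration, not part of the approach presented in this paper), the following C++ sketch declares two objects a pair-wise match when the Jaccard similarity of their token sets exceeds a threshold; the function names and the threshold 0.8 are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    std::set<std::string> inter, uni;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(inter, inter.begin()));
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(uni, uni.begin()));
    return uni.empty() ? 0.0 : static_cast<double>(inter.size()) / uni.size();
}

// Pair-wise decision with a hypothetical threshold of 0.8.
bool sameEntity(const std::set<std::string>& a, const std::set<std::string>& b) {
    return jaccard(a, b) > 0.8;
}
```

A group-wise method would additionally cluster objects on top of such pair-wise scores, which is one reason the two output types of Definitions 1.1 and 1.2 can diverge.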
1.2 SIGMOD 2020 Programming Contest
The challenge of the SIGMOD 2020 Programming Contest was to design an entity resolution system for cameras. The input data consisted of about 30,000 e-commerce websites. Each website was provided as a JSON formatted file that specified a real-world product offer. Not every website contained a single valid camera. Camera accessories such as bags or zoom lenses, TVs, and entire camera bundles were also part of the dataset. Also, the product descriptions in the JSON files were inconsistent (varying attribute names, different semantics and syntax). Only the page title attribute was always present. See Figure 1 for an example of a JSON formatted website that represents a specific camera.

  {
    "<page title>": "Canon PowerShot SX 500IS | eBay",
    "brand": "Canon",
    "manufacturer warranty": "No",
    "megapixels": "16.0 MP",
    "model": "SX500IS",
    "optical zoom": "30x"
  }

Figure 1: Example specification for a camera in JSON format. The page title exists on all websites, whereas the brand and model attributes are only specified on eBay websites.

The task was to find as many identical camera pairs as possible among all e-commerce websites (Pair-wise Entity Resolution). A valid submission consisted of a CSV file containing only the matching camera pairs found during the resolution step. The submissions were ranked based on the F-measure [3], which was computed on a secret evaluation dataset. The five finalist teams all achieved an F-measure of 0.99. Because of this tie, the running time of the resolution step was decisive [1].

1.3 Outline
The rest of this paper is organized as follows. Throughout the paper we use a running example. In Section 2 we describe the training step of our approach. We show that by preprocessing the training data and using additional information about synonymous camera names from the web, it is possible to create mock camera labels that roughly represent the real-world camera entities. In Section 3 we describe our resolution step, that is, how we match the contents of the JSON files with the generated mock labels. To speed up the execution time, we use sorted integer sets for the mock labels and the real-world camera descriptions. Furthermore, we give recommendations on how to speed up the cleansing of input data. In Section 4 we evaluate our solution. Finally, in Section 5 we conclude this paper with a discussion of the features and limitations of our approach.

[Figure 2 layout, reconstructed from the extracted text:]
Input:
1. clean page titles (stop words: -, kit, /)
   Canon IXUS 750 - ... - Canon Europe
   Canon EOS 7D kit ... - ShopMania
   Canon EOS 7D Mark II / ... - Australia
   Canon EOS 7D Mark III / ... - Australia
2. synonymous camera names (Wikipedia)
   canon ixus 75 == canon powershot sd750
3. brand + model attributes (eBay websites)
   "brand": "Nikon", "model": "D7000"
Output (labels.txt):
   canon eos 7d
   canon eos 7d mark ii
   canon eos 7d mark iii
   canon powershot sd750, canon ixus 75
   nikon d7000

Figure 2: Semi-automatic generation of mock camera labels. We use three sources to generate the labels. 1. The page titles of websites like canon-europe, shopmania or shopbot are very clean. We extract labels from these websites by using stop words. 2. Camera names may differ between countries. We download synonymous camera names from Wikipedia. 3. We combine the brand and model attributes of eBay websites from the training data to create additional labels. The merged results of the three sources are saved to a text file (labels.txt). Each line in labels.txt contains one or more comma separated sets of strings. Ideally, each line represents a real-world camera entity.
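Step 1 of Figure 2 can be sketched in a few lines of C++ (our reconstruction of the idea, not the original contest code; the function name and signature are assumptions): lowercase a clean page title and keep the tokens before the first stop word.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Extract one mock label from a clean page title by cutting the
// lowercased title at the first stop word (e.g. "-", "kit", "/").
std::string extractLabel(std::string title,
                         const std::vector<std::string>& stopWords) {
    std::transform(title.begin(), title.end(), title.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    std::istringstream in(title);
    std::string token, label;
    while (in >> token) {
        if (std::find(stopWords.begin(), stopWords.end(), token) !=
            stopWords.end())
            break;  // everything after the first stop word is dropped
        if (!label.empty()) label += ' ';
        label += token;
    }
    return label;
}
```

For example, extractLabel("Canon EOS 7D kit ... - ShopMania", {"-", "kit", "/"}) stops at the token kit and yields the label "canon eos 7d" from Figure 2.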
2 TRAINING
Our approach is based on a concept we call mock labels. We call them mock labels because they are not real identifiers, but only sets of semi-automatically generated keywords. They may or may not represent a real-world entity. For the sake of simplicity, we also refer to mock labels in the text as labels. For creating mock camera labels, we use clean websites from the training data, which give rise to about 950 labels. These labels are simply extracted from the corresponding page titles by using stop words. We also use information about synonymous camera names from Wikipedia, which creates another 50 labels. Additionally, we combine the attributes brand and model from provided eBay websites to create roughly another 1800 labels. In Figure 2 we summarize our approach by constructing mock camera labels for the running example.

3 RESOLUTION
In the training step described in the previous section we generated mock camera labels. In this section we describe the resolution step of our implementation, that is, how we match these labels with textual descriptions of real-world cameras. Our resolution step is subdivided into four parts:
• First, we convert the mock camera labels to an efficient internal representation (Section 3.1).
• Second, we create simplified mock labels that allow us to match even incomplete camera descriptions (Section 3.2).
• Third, we extract camera descriptions from the JSON files and preprocess them efficiently (Section 3.3).
• Finally, we match the camera mock labels with the extracted camera descriptions (Section 3.4).

3.1 Internal Representation of Mock Labels
Internally, we represent a mock label as a sorted integer set. Representing mock labels as sorted sequences of integers enables us to compute the matchings with real-world camera descriptions in Section 3.4 faster than would be possible using only strings. In Figure 3 we convert the mock labels from Figure 2 to sorted integer sets.

[Figure 3 layout, reconstructed from the extracted text:]
key-value-mapping: canon=0, eos=1, 7d=2, mark=3, ii=4, iii=5, powershot=6, sd750=7, ixus=8, 75=9, nikon=10, d7000=11
ID  camera label           sorted integer set
0   canon eos 7d           0, 1, 2
1   canon eos 7d mark ii   0, 1, 2, 3, 4
2   canon eos 7d mark iii  0, 1, 2, 3, 5
3   canon powershot sd750  0, 6, 7
3   canon ixus 75          0, 8, 9
4   nikon d7000            10, 11

Figure 3: Conversion of mock camera labels to sorted integer sets. We map each unique token (key) in camera labels to a unique value. Based on these key-value-mappings, we convert camera labels to sorted integer sets. A camera can have different names in different countries. Therefore, repeating IDs reference the same cameras (see, for example, ID=3).
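The conversion in Figure 3 can be sketched as follows (an assumed interface, not the original contest code): every unseen token gets the next free integer, and a label becomes the sorted, duplicate-free vector of its token values.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Convert a whitespace-separated label to a sorted integer set,
// extending the shared token dictionary on the fly.
std::vector<int> toSortedIntSet(const std::string& label,
                                std::unordered_map<std::string, int>& dict) {
    std::istringstream tokens(label);
    std::string token;
    std::vector<int> set;
    while (tokens >> token) {
        // emplace assigns the next free integer only to unseen tokens
        auto it = dict.emplace(token, static_cast<int>(dict.size())).first;
        set.push_back(it->second);
    }
    std::sort(set.begin(), set.end());
    set.erase(std::unique(set.begin(), set.end()), set.end());
    return set;
}
```

With an initially empty dictionary, "canon eos 7d" yields {0, 1, 2} and "canon eos 7d mark ii" then yields {0, 1, 2, 3, 4}, as in Figure 3.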
3.2 Simplified Labels to Handle Missing Tokens
It is not always possible to create a match with a camera label because some tokens may be missing in real-world camera descriptions. We consider tokens that do not represent the camera brand and alphanumeric model information as non-critical for the resolution step. Therefore, we create a second set of camera labels without these specific tokens (see Figure 4). We use the simplified camera labels in cases where the original camera labels fail to create a match.

[Figure 4 layout, reconstructed from the extracted text; removed tokens: eos, powershot]
ID  simplified camera label  sorted integer set
0   canon 7d                 0, 2
1   canon 7d mark ii         0, 2, 3, 4
2   canon 7d mark iii        0, 2, 3, 5
3   canon sd750              0, 7

Figure 4: Simplification of camera labels by removing certain tokens. The (greyed out) tokens eos and powershot are removed from the camera labels. We include only camera labels that are affected by the removal of tokens in the simplified camera labels set. Note that the cameras canon ixus 75 and nikon d7000 are not included in the simplified camera labels set as they contain no tokens that can be removed. Note also that we use the key-value-mappings of the original camera labels to convert the simplified camera labels into an internal representation.

3.3 Efficient Preprocessing of Input Data
Reading the input data from SSD and cleansing it takes up a considerable part of the overall running time of the resolution step. The following findings are important to speed up preprocessing of the input data:
• Reading many small files concurrently, with multiple threads (compared to a single thread), takes advantage of the internal parallelism of SSDs and thus leads to higher throughput [5].
• C-string manipulation functions are often significantly faster than their C++ counterparts. For example, locating substrings with strstr is around five times faster than using the C++ std::string function find.
• Hardcoding regular expressions with while, for, switch or if-else statements results in faster execution times than using standard RegEx libraries, where regular expressions are compiled at runtime into state machines.
• Changing strings in place, instead of treating them as immutable objects, eliminates allocation and copying overhead.

To achieve fast preprocessing of the input data, we use 16 threads to read the JSON files from SSD and clean the page titles. We perform string operations in-place with C library functions, and hardcode our regex expressions (a simplified sketch of such a cleaning pass follows after Figure 5). The result of our preprocessing step is a collection of sorted integer sets that represent the page titles (see Figure 5).

[Figure 5 layout, reconstructed from the extracted text:]
1. extract page titles from JSON files
   a  New Canon Camera EOS 7-D for sale
   b  10 month old Cannon 7d (Mark II) digital camera, good condition
   c  digi cam Nixon d-7000 Full-HD-Video, 16 Mega-pixel
   d  digital Camera Power-shot SD-750IS from Canon 3x Zoom (IXUS 75)
   e  Camera bundle Canon EOS 7d Mark II and Mark III for $840
2. clean page titles (in-place, hardcoded regex expressions)
   a  new canon camera eos 7d for sale
   b  10 month old canon 7d mark ii digital camera good condition
   c  digi cam nikon d7000 fullhdvideo 16 megapixel
   d  digital camera powershot sd750 from canon 3x zoom ixus 75
   e  camera bundle canon eos 7d mark ii and mark iii for 840
3. convert cleaned page titles to sorted integer sets
   a  0, 1, 2
   b  0, 2, 3, 4
   c  10, 11
   d  0, 6, 7, 8, 9
   e  0, 1, 2, 3, 4, 5

Figure 5: Conversion of page titles to sorted integer sets. We carry out all processing steps in parallel with 16 threads. First, we extract the page titles from the JSON websites. Next, we clean the page titles in-place with C-string library functions and hardcoded regex expressions. Finally, we convert the cleaned page titles into sorted integer sets based on the key-value-mappings of the original camera labels.
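The following C++ sketch illustrates the in-place cleaning idea from the list above. It is a simplified stand-in for the hardcoded regex expressions: it lowercases the title, joins hyphenated fragments, and collapses all other separators, without allocating a second string.

```cpp
#include <cctype>

// Minimal in-place cleaning pass (our illustration, not the original
// contest code): lowercase letters, join hyphenated fragments, and turn
// every other non-alphanumeric run into a single space.
void cleanInPlace(char* s) {
    char* out = s;
    bool lastSpace = true;                 // also trims leading separators
    for (const char* in = s; *in; ++in) {
        unsigned char c = static_cast<unsigned char>(*in);
        if (std::isalnum(c)) {
            *out++ = static_cast<char>(std::tolower(c));
            lastSpace = false;
        } else if (c == '-') {
            continue;                      // "Full-HD-Video" -> "fullhdvideo"
        } else if (!lastSpace) {
            *out++ = ' ';                  // collapse runs of separators
            lastSpace = true;
        }
    }
    while (out > s && out[-1] == ' ') --out;  // trim trailing separators
    *out = '\0';
}
```

The real cleaning is contest specific: Figure 5 also shows misspelled brands such as Cannon and Nixon being corrected, which a generic pass like this cannot do.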
3.4 Matching of Mock Labels
After converting the page titles into sorted integer sets, we try to assign unique camera IDs to them. First, we try to match labels from the original camera labels with the page title. If we do not find a match, we try to match the simplified labels with the page title. We consider labels as matches if all the corresponding keywords are also part of the title. If we can match more than one label with the title, then we remove all labels that are subsets of other labels. This allows us, for example, to distinguish between the cameras canon eos 7d and canon eos 7d mark ii, because the first camera description is a subset of the second. If the remaining labels have the same camera ID, then we can clearly assign a camera ID to a page title. Figure 6 shows our approach for assigning unique camera IDs to page titles for the running example.

[Figure 6 layout, reconstructed from the extracted text:]
page titles: a 0, 1, 2 | b 0, 2, 3, 4 | c 10, 11 | d 0, 6, 7, 8, 9 | e 0, 1, 2, 3, 4, 5
1. compute matching camera labels for each page title
   a: 0 {0, 1, 2} | b: 0 {0, 2}, 1 {0, 2, 3, 4} | c: 4 {10, 11} | d: 3 {0, 6, 7}, 3 {0, 8, 9} | e: 0 {0, 1, 2}, 1 {0, 1, 2, 3, 4}, 2 {0, 1, 2, 3, 5}
2. remove labels that are subsets of other labels
   a: 0 {0, 1, 2} | b: 1 {0, 2, 3, 4} | c: 4 {10, 11} | d: 3 {0, 6, 7}, 3 {0, 8, 9} | e: 1 {0, 1, 2, 3, 4}, 2 {0, 1, 2, 3, 5}
3. try to assign a unique camera ID to each page title
   a: 0 | b: 1 | c: 4 | d: 3 | e: ambiguous

Figure 6: Assignment of camera IDs to page titles. First, we compute the matching camera labels for each page title. If all integer tokens of a label are within the title, we consider the label a match. Note that we use the simplified labels for the page title b because we cannot match a label from the original camera labels. Next, we remove labels that are subsets of other labels. If all remaining labels have the same camera ID, we assign the ID to the page title.

To avoid iterating over all camera labels when searching for matches with the page title, we store all camera labels in a sorted vector. We iterate only over relevant ranges in the vector. To look up the relevant ranges, we index the initial keywords of the labels. For even faster matching of camera labels with the page title, we recommend storing the camera labels in a sorted trie data structure, as shown in Figure 7. Finding all labels that match a page title is then a matter of traversing the trie and comparing integer tokens. In real-world camera descriptions it can happen that the tokens of a label are scattered in the page title. The following traversing strategy finds these scattered labels in the page title: if there is no successor in a trie node for the current token in the page title, then ignore the token and continue with the next token in the page title.

[Figure 7 shows the camera labels and the simplified camera labels stored as sorted tries of integer tokens; only the node values survive in the extracted text.]
Figure 7: Storing the mock camera labels in a sorted trie.
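On sorted integer sets, the subset tests in steps 1 and 2 of Figure 6 reduce to a single linear merge via std::includes. The following C++ sketch (assumed data layout and signatures, not the original contest code) shows the match test and the ambiguity check:

```cpp
#include <algorithm>
#include <vector>

struct Label {
    int cameraId;             // synonymous labels share an ID (Figure 3)
    std::vector<int> tokens;  // sorted integer set
};

// A label matches a page title if all its tokens occur in the title.
bool matches(const std::vector<int>& title, const Label& label) {
    return std::includes(title.begin(), title.end(),
                         label.tokens.begin(), label.tokens.end());
}

// Steps 2 and 3 of Figure 6: drop matched labels that are proper subsets
// of other matched labels, then assign an ID only if it is unambiguous.
int assignCameraId(const std::vector<Label>& matched) {
    int id = -1;
    for (const auto& a : matched) {
        bool isSubset = false;
        for (const auto& b : matched)
            if (a.tokens.size() < b.tokens.size() &&
                std::includes(b.tokens.begin(), b.tokens.end(),
                              a.tokens.begin(), a.tokens.end()))
                isSubset = true;  // e.g. "canon eos 7d" inside "... mark ii"
        if (isSubset) continue;
        if (id == -1) id = a.cameraId;
        else if (id != a.cameraId) return -1;  // ambiguous, like title e
    }
    return id;
}
```

On the running example, title e matches the labels with IDs 0, 1, and 2; the subset label 0 is removed, and the differing IDs 1 and 2 make the assignment ambiguous, exactly as in Figure 6.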
Table 1: Comparison of the F-measure and the running times of the resolution step of the five best placed teams. The input data for the resolution step consisted of 29,787 JSON formatted e-commerce websites. Measurements were taken on a laptop running Ubuntu 19.04 with 16 GB of RAM and two Intel Core i5-4310U CPUs. The underlying SSD was a 500 GB 860 EVO mSATA. We cleared the page cache, dentries, and inodes before each run to avoid reading the input data from RAM instead of the SSD.

Team                    Language  F-measure  Running time (s)
PictureMe (this paper)  C++       0.99        0.61
DBGroup@UniMoRe         Python    0.99       10.65
DBGroup@SUSTech         C++       0.99       22.13
eats_shoots_and_leaves  Python    0.99       28.66
DBTHU                   Python    0.99       63.21

4 EVALUATION
Our approach is related to Group-wise Entity Resolution. In the training step we semi-automatically construct mock labels that can be understood as centroids of clusters. Ideally, each label represents one real-world entity. In the contest solution we also have labels that represent several real-world entities, or no real-world entity at all. In the resolution step we try to assign each real-world camera description to one cluster with set operations on sorted integer sets. After assigning all camera descriptions to clusters, we compute the Cartesian product of the cameras in each cluster and store the unique pairs of identical cameras in a CSV file.

The quality of the results of our resolution step depends mainly on the labels generated in the training step. With the 2800 semi-automatically generated mock labels, we achieved an F-measure of 0.97 on the hidden evaluation dataset. To achieve an F-measure of 0.99, we manually added about 200 contest specific labels to the semi-automatically generated ones. We extracted these additional labels by inspecting page titles that did not match any semi-automatically generated labels. Since all five best placed teams reached an F-measure of 0.99, the running time of the resolution step was decisive. We were chosen as the overall winner. Table 1 also shows the running times of the resolution step of the five best placed teams.
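The final output step described above can be sketched as follows (our reconstruction; the CSV column names are an assumption): every cluster holds the spec IDs of offers resolved to the same camera, and each cluster contributes all of its unordered pairs to the submission file.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Write all unordered spec-ID pairs within each cluster to a CSV file,
// turning the group-wise result into the pair-wise contest output.
void writePairs(const std::vector<std::vector<std::string>>& clusters,
                const char* path) {
    std::FILE* f = std::fopen(path, "w");
    if (!f) return;
    std::fprintf(f, "left_spec_id,right_spec_id\n");  // assumed header
    for (const auto& cluster : clusters)
        for (size_t i = 0; i < cluster.size(); ++i)
            for (size_t j = i + 1; j < cluster.size(); ++j)
                std::fprintf(f, "%s,%s\n", cluster[i].c_str(),
                             cluster[j].c_str());
    std::fclose(f);
}
```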
5 CONCLUSIONS
In this paper we presented the entity resolution approach that we submitted to the SIGMOD 2020 Programming Contest. Our approach is based on mock labels that we generate in a training step, and a fast matching algorithm that computes set intersections on sorted integer sets in the resolution step. Our matching algorithm knows nothing about the actual labels. The creation of labels in the training step and the matching of camera descriptions with the labels in the resolution step are decoupled. In contrast, rule-based approaches, in which the rules for clustering are hardcoded into the source code, are rigid and must be written from scratch for new entity resolution tasks.

But there are also limitations to our approach. For the matching we use set operations on integer tokens. One integer token represents one word. If entities can only be distinguished by word order, then our approach needs to be adapted. By allowing n-grams as tokens in the labels, and not just individual words, it is possible to integrate word order semantics into our approach. Syntactic variations in the data can also be challenging for our resolution step. In the contest we solved this problem by preprocessing the input data with hardcoded regex expressions and in-place string operations. A more general approach would be to use distance metrics for deciding which tokens are present in real-world descriptions. Still, the biggest challenge remains, namely to generate representative labels from the data. If the training data is unstructured or incomplete, it may be impossible to extract representative labels for the resolution step. In the Programming Contest, the training data was fairly well-structured and complete, that is, the generation of mock labels was mainly a semi-automatic extraction task. This again shows that it is important to look at the data first before deciding on an approach.

ACKNOWLEDGMENTS
This work has been supported by the German Research Foundation (DFG) grant GI-711/5-1 within the Priority Program 1736 Algorithms for Big Data.

REFERENCES
[1] Database Research Group at Roma Tre University. 2020. ACM SIGMOD Programming Contest 2020. Retrieved June 24, 2020 from http://www.inf.uniroma3.it/db/sigmod2020contest/index.html
[2] Hector Garcia-Molina. 2004. Entity Resolution: Overview and Challenges. In Lecture Notes in Computer Science. Springer, 1–2. https://doi.org/10.1007/978-3-540-30464-7_1
[3] David Menestrina, Steven Euijong Whang, and Hector Garcia-Molina. 2010. Evaluating Entity Resolution Results. Proceedings of the VLDB Endowment 3, 1-2 (Sept. 2010), 208–219. https://doi.org/10.14778/1920841.1920871
[4] Hongzhi Wang. 2014. Innovative Techniques and Applications of Entity Resolution. IGI Global. https://doi.org/10.4018/978-1-4666-5198-2
[5] Zhenyun Zhuang, Sergiy Zhuk, Haricharan Ramachandra, and Badri Sridharan. 2016. Designing SSD-Friendly Applications for Better Application Performance and Higher IO Efficiency. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC). IEEE. https://doi.org/10.1109/compsac.2016.94