=Paper=
{{Paper
|id=Vol-2726/paper2
|storemode=property
|title=Fast Entity Resolution With Mock Labels and Sorted Integer Sets
|pdfUrl=https://ceur-ws.org/Vol-2726/paper2.pdf
|volume=Vol-2726
|authors=Mark Blacher,Joachim Giesen,Sören Laue,Julien Klaus,Matthias Mitterreiter
|dblpUrl=https://dblp.org/rec/conf/vldb/BlacherGLKM20
}}
==Fast Entity Resolution With Mock Labels and Sorted Integer Sets==
Fast Entity Resolution With Mock Labels and Sorted Integer Sets
SIGMOD 2020 Contest Finalist Paper

Mark Blacher, Julien Klaus, Matthias Mitterreiter, Joachim Giesen, Sören Laue
Friedrich Schiller University Jena, Germany
mark.blacher@uni-jena.de, julien.klaus@uni-jena.de, matthias.mitterreiter@uni-jena.de, joachim.giesen@uni-jena.de, soeren.laue@uni-jena.de

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Entity resolution is the task of identifying equivalent real-world entities across different sources. We found that standard methods based on unsupervised learning are slow during the resolution step. Thus, we propose an alternative approach with mock labels. Mock labels are sets of strings that ideally represent real-world entities. Using a running example, we show how to create mock labels from the training data and use them in the resolution step to match real-world entities efficiently. We applied this approach in the SIGMOD 2020 Programming Contest and achieved significantly faster execution times in the resolution step compared to other top ranked teams. Since all the top ranked teams had the same F-measure, the execution speed of the resolution step was decisive.

CCS CONCEPTS
• Information systems → Entity resolution.

KEYWORDS
entity resolution, data cleansing, programming contest

1 INTRODUCTION
Entity resolution is a problem that occurs in data integration, data cleansing, web search, and e-commerce search scenarios [4]. Given two or more sources, we need to identify records that refer to the same real-world entity, even though there are no unique identifiers that tell us which records from one source match those in the other sources. In addition, records representing the same entity may contain different information. For example, if we use digital cameras as real-world entities, one record may have the brand or the model misspelled, and another may be missing some fields such as pixel count or zoom factor. Furthermore, it is possible that information within individual records is contradictory, such as an incorrect pixel count for a particular camera model. An entity resolution algorithm has to deal with these inconsistencies in the data and identify records corresponding to the same real-world entity as best as possible [2].

1.1 Background
There are basically two types of entity resolution, namely Pair-wise Entity Resolution and Group-wise Entity Resolution.

Definition 1.1 (Pair-wise Entity Resolution). Given a set of objects O, the output is a set of pairs {(o_i, o_j) | o_i, o_j ∈ O, o_i and o_j refer to the same real-world entity}.

Definition 1.2 (Group-wise Entity Resolution). Given a set of objects O, the output is a family of sets {O_i | all objects in O_i refer to the same real-world entity} where O_i ∩ O_j = ∅ if i ≠ j.

Pair-wise Entity Resolution approaches often use similarity or distance functions for determining whether two objects refer to the same real-world entity. Group-wise Entity Resolution approaches additionally use clustering techniques for assigning an object to a group or cluster. The results of these two types of entity resolution are not always consistent [4].
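As a concrete illustration of such a similarity function (our illustration, not part of the approach presented in this paper), the following C++ sketch declares two objects a pair-wise match when the Jaccard similarity of their token sets exceeds a threshold; the function names and the threshold 0.8 are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    std::set<std::string> inter, uni;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(inter, inter.begin()));
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(uni, uni.begin()));
    return uni.empty() ? 0.0 : static_cast<double>(inter.size()) / uni.size();
}

// Pair-wise decision with a hypothetical threshold of 0.8.
bool sameEntity(const std::set<std::string>& a, const std::set<std::string>& b) {
    return jaccard(a, b) > 0.8;
}
```

A group-wise method would additionally cluster objects on top of such pair-wise scores, which is one reason the two output types of Definitions 1.1 and 1.2 can diverge.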
1.2 SIGMOD 2020 Programming Contest
The challenge of the SIGMOD 2020 Programming Contest was to design an entity resolution system for cameras. The input data consisted of about 30,000 e-commerce websites. Each website was provided as a JSON formatted file that specified a real-world product offer. Not every website contained a single valid camera. Camera accessories such as bags or zoom lenses, TVs, and entire camera bundles were also part of the dataset. Also, the product descriptions in the JSON files were inconsistent (varying attribute names, different semantics and syntax). Only the page title attribute was always present. See Figure 1 for an example of a JSON formatted website that represents a specific camera.

  {
    "<page title>": "Canon PowerShot SX 500IS | eBay",
    "brand": "Canon",
    "manufacturer warranty": "No",
    "megapixels": "16.0 MP",
    "model": "SX500IS",
    "optical zoom": "30x"
  }

Figure 1: Example specification for a camera in JSON format. The page title exists on all websites, whereas the brand and model attributes are only specified on eBay websites.

The task was to find as many identical camera pairs as possible among all e-commerce websites (Pair-wise Entity Resolution). A valid submission consisted of a CSV file containing only the matching camera pairs found during the resolution step. The submissions were ranked based on the F-measure [3], which was computed on a secret evaluation dataset. The five finalist teams all achieved an F-measure of 0.99. Because of this tie, the running time of the resolution step was decisive [1].

1.3 Outline
The rest of this paper is organized as follows. Throughout the paper we use a running example. In Section 2 we describe the training step of our approach. We show that by preprocessing the training data and using additional information about synonymous camera names from the web, it is possible to create mock camera labels that roughly represent the real-world camera entities. In Section 3 we describe our resolution step, that is, how we match the contents of the JSON files with the generated mock labels. To speed up the execution time, we use sorted integer sets for the mock labels and the real-world camera descriptions. Furthermore, we give recommendations on how to speed up the cleansing of input data. In Section 4 we evaluate our solution. Finally, in Section 5 we conclude this paper with a discussion of the features and limitations of our approach.

[Figure 2 layout, reconstructed from the extracted text:]
Input:
1. clean page titles (stop words: -, kit, /)
   Canon IXUS 750 - ... - Canon Europe
   Canon EOS 7D kit ... - ShopMania
   Canon EOS 7D Mark II / ... - Australia
   Canon EOS 7D Mark III / ... - Australia
2. synonymous camera names (Wikipedia)
   canon ixus 75 == canon powershot sd750
3. brand + model attributes (eBay websites)
   "brand": "Nikon", "model": "D7000"
Output (labels.txt):
   canon eos 7d
   canon eos 7d mark ii
   canon eos 7d mark iii
   canon powershot sd750, canon ixus 75
   nikon d7000

Figure 2: Semi-automatic generation of mock camera labels. We use three sources to generate the labels. 1. The page titles of websites like canon-europe, shopmania or shopbot are very clean. We extract labels from these websites by using stop words. 2. Camera names may differ between countries. We download synonymous camera names from Wikipedia. 3. We combine the brand and model attributes of eBay websites from the training data to create additional labels. The merged results of the three sources are saved to a text file (labels.txt). Each line in labels.txt contains one or more comma separated sets of strings. Ideally, each line represents a real-world camera entity.
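Step 1 of Figure 2 can be sketched in a few lines of C++ (our reconstruction of the idea, not the original contest code; the function name and signature are assumptions): lowercase a clean page title and keep the tokens before the first stop word.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Extract one mock label from a clean page title by cutting the
// lowercased title at the first stop word (e.g. "-", "kit", "/").
std::string extractLabel(std::string title,
                         const std::vector<std::string>& stopWords) {
    std::transform(title.begin(), title.end(), title.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    std::istringstream in(title);
    std::string token, label;
    while (in >> token) {
        if (std::find(stopWords.begin(), stopWords.end(), token) !=
            stopWords.end())
            break;  // everything after the first stop word is dropped
        if (!label.empty()) label += ' ';
        label += token;
    }
    return label;
}
```

For example, extractLabel("Canon EOS 7D kit ... - ShopMania", {"-", "kit", "/"}) stops at the token kit and yields the label "canon eos 7d" from Figure 2.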
2 TRAINING
Our approach is based on a concept we call mock labels. We call them mock labels because they are not real identifiers, but only sets of semi-automatically generated keywords. They may or may not represent a real-world entity. For the sake of simplicity, we also refer to mock labels in the text as labels. For creating mock camera labels, we use clean websites from the training data, which give rise to about 950 labels. These labels are simply extracted from the corresponding page titles by using stop words. We also use information about synonymous camera names from Wikipedia, which creates another 50 labels. Additionally, we combine the attributes brand and model from provided eBay websites to create roughly another 1800 labels. In Figure 2 we summarize our approach by constructing mock camera labels for the running example.

3 RESOLUTION
In the training step described in the previous section we generated mock camera labels. In this section we describe the resolution step of our implementation, that is, how we match these labels with textual descriptions of real-world cameras. Our resolution step is subdivided into four parts:
• First, we convert the mock camera labels to an efficient internal representation (Section 3.1).
• Second, we create simplified mock labels that allow us to match even incomplete camera descriptions (Section 3.2).
• Third, we extract camera descriptions from the JSON files and preprocess them efficiently (Section 3.3).
• Finally, we match the camera mock labels with the extracted camera descriptions (Section 3.4).

3.1 Internal Representation of Mock Labels
Internally, we represent a mock label as a sorted integer set. Representing mock labels as sorted sequences of integers enables us to compute the matchings with real-world camera descriptions in Section 3.4 faster than would be possible using only strings. In Figure 3 we convert the mock labels from Figure 2 to sorted integer sets.

[Figure 3 layout, reconstructed from the extracted text:]
key-value-mapping: canon=0, eos=1, 7d=2, mark=3, ii=4, iii=5, powershot=6, sd750=7, ixus=8, 75=9, nikon=10, d7000=11
ID  camera label           sorted integer set
0   canon eos 7d           0, 1, 2
1   canon eos 7d mark ii   0, 1, 2, 3, 4
2   canon eos 7d mark iii  0, 1, 2, 3, 5
3   canon powershot sd750  0, 6, 7
3   canon ixus 75          0, 8, 9
4   nikon d7000            10, 11

Figure 3: Conversion of mock camera labels to sorted integer sets. We map each unique token (key) in camera labels to a unique value. Based on these key-value-mappings, we convert camera labels to sorted integer sets. A camera can have different names in different countries. Therefore, repeating IDs reference the same cameras (see, for example, ID=3).
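The conversion in Figure 3 can be sketched as follows (an assumed interface, not the original contest code): every unseen token gets the next free integer, and a label becomes the sorted, duplicate-free vector of its token values.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Convert a whitespace-separated label to a sorted integer set,
// extending the shared token dictionary on the fly.
std::vector<int> toSortedIntSet(const std::string& label,
                                std::unordered_map<std::string, int>& dict) {
    std::istringstream tokens(label);
    std::string token;
    std::vector<int> set;
    while (tokens >> token) {
        // emplace assigns the next free integer only to unseen tokens
        auto it = dict.emplace(token, static_cast<int>(dict.size())).first;
        set.push_back(it->second);
    }
    std::sort(set.begin(), set.end());
    set.erase(std::unique(set.begin(), set.end()), set.end());
    return set;
}
```

With an initially empty dictionary, "canon eos 7d" yields {0, 1, 2} and "canon eos 7d mark ii" then yields {0, 1, 2, 3, 4}, as in Figure 3.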
3.2 Simplified Labels to Handle Missing Tokens
It is not always possible to create a match with a camera label because some tokens may be missing in real-world camera descriptions. We consider tokens that do not represent the camera brand and alphanumeric model information as non-critical for the resolution step. Therefore, we create a second set of camera labels without these specific tokens (see Figure 4). We use the simplified camera labels in cases where the original camera labels fail to create a match.

[Figure 4 layout, reconstructed from the extracted text; removed tokens: eos, powershot]
ID  simplified camera label  sorted integer set
0   canon 7d                 0, 2
1   canon 7d mark ii         0, 2, 3, 4
2   canon 7d mark iii        0, 2, 3, 5
3   canon sd750              0, 7

Figure 4: Simplification of camera labels by removing certain tokens. The (greyed out) tokens eos and powershot are removed from the camera labels. We include only camera labels that are affected by the removal of tokens in the simplified camera labels set. Note that the cameras canon ixus 75 and nikon d7000 are not included in the simplified camera labels set as they contain no tokens that can be removed. Note also that we use the key-value-mappings of the original camera labels to convert the simplified camera labels into an internal representation.

3.3 Efficient Preprocessing of Input Data
Reading the input data from SSD and cleansing it takes up a considerable part of the overall running time of the resolution step. The following findings are important to speed up preprocessing of the input data:
• Reading many small files concurrently, with multiple threads (compared to a single thread), takes advantage of the internal parallelism of SSDs and thus leads to higher throughput [5].
• C-string manipulation functions are often significantly faster than their C++ counterparts. For example, locating substrings with strstr is around five times faster than using the C++ std::string function find.
• Hardcoding regular expressions with while, for, switch or if-else statements results in faster execution times than using standard RegEx libraries, where regular expressions are compiled at runtime into state machines.
• Changing strings in place, instead of treating them as immutable objects, eliminates allocation and copying overhead.

To achieve fast preprocessing of the input data, we use 16 threads to read the JSON files from SSD and clean the page titles. We perform string operations in-place with C library functions, and hardcode our regex expressions (a simplified sketch of such a cleaning pass follows after Figure 5). The result of our preprocessing step is a collection of sorted integer sets that represent the page titles (see Figure 5).

[Figure 5 layout, reconstructed from the extracted text:]
1. extract page titles from JSON files
   a  New Canon Camera EOS 7-D for sale
   b  10 month old Cannon 7d (Mark II) digital camera, good condition
   c  digi cam Nixon d-7000 Full-HD-Video, 16 Mega-pixel
   d  digital Camera Power-shot SD-750IS from Canon 3x Zoom (IXUS 75)
   e  Camera bundle Canon EOS 7d Mark II and Mark III for $840
2. clean page titles (in-place, hardcoded regex expressions)
   a  new canon camera eos 7d for sale
   b  10 month old canon 7d mark ii digital camera good condition
   c  digi cam nikon d7000 fullhdvideo 16 megapixel
   d  digital camera powershot sd750 from canon 3x zoom ixus 75
   e  camera bundle canon eos 7d mark ii and mark iii for 840
3. convert cleaned page titles to sorted integer sets
   a  0, 1, 2
   b  0, 2, 3, 4
   c  10, 11
   d  0, 6, 7, 8, 9
   e  0, 1, 2, 3, 4, 5

Figure 5: Conversion of page titles to sorted integer sets. We carry out all processing steps in parallel with 16 threads. First, we extract the page titles from the JSON websites. Next, we clean the page titles in-place with C-string library functions and hardcoded regex expressions. Finally, we convert the cleaned page titles into sorted integer sets based on the key-value-mappings of the original camera labels.
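The following C++ sketch illustrates the in-place cleaning idea from the list above. It is a simplified stand-in for the hardcoded regex expressions: it lowercases the title, joins hyphenated fragments, and collapses all other separators, without allocating a second string.

```cpp
#include <cctype>

// Minimal in-place cleaning pass (our illustration, not the original
// contest code): lowercase letters, join hyphenated fragments, and turn
// every other non-alphanumeric run into a single space.
void cleanInPlace(char* s) {
    char* out = s;
    bool lastSpace = true;                 // also trims leading separators
    for (const char* in = s; *in; ++in) {
        unsigned char c = static_cast<unsigned char>(*in);
        if (std::isalnum(c)) {
            *out++ = static_cast<char>(std::tolower(c));
            lastSpace = false;
        } else if (c == '-') {
            continue;                      // "Full-HD-Video" -> "fullhdvideo"
        } else if (!lastSpace) {
            *out++ = ' ';                  // collapse runs of separators
            lastSpace = true;
        }
    }
    while (out > s && out[-1] == ' ') --out;  // trim trailing separators
    *out = '\0';
}
```

The real cleaning is contest specific: Figure 5 also shows misspelled brands such as Cannon and Nixon being corrected, which a generic pass like this cannot do.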
3.4 Matching of Mock Labels
After converting the page titles into sorted integer sets, we try to assign unique camera IDs to them. First, we try to match labels from the original camera labels with the page title. If we do not find a match, we try to match the simplified labels with the page title. We consider labels as matches if all the corresponding keywords are also part of the title. If we can match more than one label with the title, then we remove all labels that are subsets of other labels. This allows us, for example, to distinguish between the cameras canon eos 7d and canon eos 7d mark ii, because the first camera description is a subset of the second. If the remaining labels have the same camera ID, then we can clearly assign a camera ID to a page title. Figure 6 shows our approach for assigning unique camera IDs to page titles for the running example.

[Figure 6 layout, reconstructed from the extracted text:]
page titles: a 0, 1, 2 | b 0, 2, 3, 4 | c 10, 11 | d 0, 6, 7, 8, 9 | e 0, 1, 2, 3, 4, 5
1. compute matching camera labels for each page title
   a: 0 {0, 1, 2} | b: 0 {0, 2}, 1 {0, 2, 3, 4} | c: 4 {10, 11} | d: 3 {0, 6, 7}, 3 {0, 8, 9} | e: 0 {0, 1, 2}, 1 {0, 1, 2, 3, 4}, 2 {0, 1, 2, 3, 5}
2. remove labels that are subsets of other labels
   a: 0 {0, 1, 2} | b: 1 {0, 2, 3, 4} | c: 4 {10, 11} | d: 3 {0, 6, 7}, 3 {0, 8, 9} | e: 1 {0, 1, 2, 3, 4}, 2 {0, 1, 2, 3, 5}
3. try to assign a unique camera ID to each page title
   a: 0 | b: 1 | c: 4 | d: 3 | e: ambiguous

Figure 6: Assignment of camera IDs to page titles. First, we compute the matching camera labels for each page title. If all integer tokens of a label are within the title, we consider the label a match. Note that we use the simplified labels for the page title b because we cannot match a label from the original camera labels. Next, we remove labels that are subsets of other labels. If all remaining labels have the same camera ID, we assign the ID to the page title.

To avoid iterating over all camera labels when searching for matches with the page title, we store all camera labels in a sorted vector. We iterate only over relevant ranges in the vector. To look up the relevant ranges, we index the initial keywords of the labels. For even faster matching of camera labels with the page title, we recommend storing the camera labels in a sorted trie data structure, as shown in Figure 7. Finding all labels that match a page title is then a matter of traversing the trie and comparing integer tokens. In real-world camera descriptions it can happen that the tokens of a label are scattered in the page title. The following traversing strategy finds these scattered labels in the page title: if there is no successor in a trie node for the current token in the page title, then ignore the token and continue with the next token in the page title.

[Figure 7 shows the camera labels and the simplified camera labels stored as sorted tries of integer tokens; only the node values survive in the extracted text.]
Figure 7: Storing the mock camera labels in a sorted trie.
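On sorted integer sets, the subset tests in steps 1 and 2 of Figure 6 reduce to a single linear merge via std::includes. The following C++ sketch (assumed data layout and signatures, not the original contest code) shows the match test and the ambiguity check:

```cpp
#include <algorithm>
#include <vector>

struct Label {
    int cameraId;             // synonymous labels share an ID (Figure 3)
    std::vector<int> tokens;  // sorted integer set
};

// A label matches a page title if all its tokens occur in the title.
bool matches(const std::vector<int>& title, const Label& label) {
    return std::includes(title.begin(), title.end(),
                         label.tokens.begin(), label.tokens.end());
}

// Steps 2 and 3 of Figure 6: drop matched labels that are proper subsets
// of other matched labels, then assign an ID only if it is unambiguous.
int assignCameraId(const std::vector<Label>& matched) {
    int id = -1;
    for (const auto& a : matched) {
        bool isSubset = false;
        for (const auto& b : matched)
            if (a.tokens.size() < b.tokens.size() &&
                std::includes(b.tokens.begin(), b.tokens.end(),
                              a.tokens.begin(), a.tokens.end()))
                isSubset = true;  // e.g. "canon eos 7d" inside "... mark ii"
        if (isSubset) continue;
        if (id == -1) id = a.cameraId;
        else if (id != a.cameraId) return -1;  // ambiguous, like title e
    }
    return id;
}
```

On the running example, title e matches the labels with IDs 0, 1, and 2; the subset label 0 is removed, and the differing IDs 1 and 2 make the assignment ambiguous, exactly as in Figure 6.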
Table 1: Comparison of the F-measure and the running times of the resolution step of the five best placed teams. The input data for the resolution step consisted of 29,787 JSON formatted e-commerce websites. Measurements were taken on a laptop running Ubuntu 19.04 with 16 GB of RAM and two Intel Core i5-4310U CPUs. The underlying SSD was a 500 GB 860 EVO mSATA. We cleared the page cache, dentries, and inodes before each run to avoid reading the input data from RAM instead of the SSD.

Team                    Language  F-measure  Running time (s)
PictureMe (this paper)  C++       0.99        0.61
DBGroup@UniMoRe         Python    0.99       10.65
DBGroup@SUSTech         C++       0.99       22.13
eats_shoots_and_leaves  Python    0.99       28.66
DBTHU                   Python    0.99       63.21

4 EVALUATION
Our approach is related to Group-wise Entity Resolution. In the training step we semi-automatically construct mock labels that can be understood as centroids of clusters. Ideally, each label represents one real-world entity. In the contest solution we also have labels that represent several real-world entities, or no real-world entity at all. In the resolution step we try to assign each real-world camera description to one cluster with set operations on sorted integer sets. After assigning all camera descriptions to clusters, we compute the Cartesian product of the cameras in each cluster and store the unique pairs of identical cameras in a CSV file.

The quality of the results of our resolution step depends mainly on the labels generated in the training step. With the 2800 semi-automatically generated mock labels, we achieved an F-measure of 0.97 on the hidden evaluation dataset. To achieve an F-measure of 0.99, we manually added about 200 contest specific labels to the semi-automatically generated ones. We extracted these additional labels by inspecting page titles that did not match any semi-automatically generated labels. Since all five best placed teams reached an F-measure of 0.99, the running time of the resolution step was decisive. We were chosen as the overall winner. Table 1 also shows the running times of the resolution step of the five best placed teams.
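The final output step described above can be sketched as follows (our reconstruction; the CSV column names are an assumption): every cluster holds the spec IDs of offers resolved to the same camera, and each cluster contributes all of its unordered pairs to the submission file.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Write all unordered spec-ID pairs within each cluster to a CSV file,
// turning the group-wise result into the pair-wise contest output.
void writePairs(const std::vector<std::vector<std::string>>& clusters,
                const char* path) {
    std::FILE* f = std::fopen(path, "w");
    if (!f) return;
    std::fprintf(f, "left_spec_id,right_spec_id\n");  // assumed header
    for (const auto& cluster : clusters)
        for (size_t i = 0; i < cluster.size(); ++i)
            for (size_t j = i + 1; j < cluster.size(); ++j)
                std::fprintf(f, "%s,%s\n", cluster[i].c_str(),
                             cluster[j].c_str());
    std::fclose(f);
}
```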
5 CONCLUSIONS
In this paper we presented the entity resolution approach that we submitted to the SIGMOD 2020 Programming Contest. Our approach is based on mock labels that we generate in a training step, and a fast matching algorithm that computes set intersections on sorted integer sets in the resolution step. Our matching algorithm knows nothing about the actual labels. The creation of labels in the training step and the matching of camera descriptions with the labels in the resolution step are decoupled. In contrast, rule-based approaches, in which the rules for clustering are hardcoded into the source code, are rigid and must be written from scratch for new entity resolution tasks.

But there are also limitations to our approach. For the matching we use set operations on integer tokens. One integer token represents one word. If entities can only be distinguished by word order, then our approach needs to be adapted. By allowing n-grams as tokens in the labels, and not just individual words, it is possible to integrate word order semantics into our approach. Syntactic variations in the data can also be challenging for our resolution step. In the contest we solved this problem by preprocessing the input data with hardcoded regex expressions and in-place string operations. A more general approach would be to use distance metrics for deciding which tokens are present in real-world descriptions. Still, the biggest challenge remains, namely to generate representative labels from the data. If the training data is unstructured or incomplete, it may be impossible to extract representative labels for the resolution step. In the Programming Contest, the training data was fairly well-structured and complete, that is, the generation of mock labels was mainly a semi-automatic extraction task. This again shows that it is important to look at the data first before deciding on an approach.

ACKNOWLEDGMENTS
This work has been supported by the German Research Foundation (DFG) grant GI-711/5-1 within the Priority Program 1736 Algorithms for Big Data.

REFERENCES
[1] Database Research Group at Roma Tre University. 2020. ACM SIGMOD Programming Contest 2020. Retrieved June 24, 2020 from http://www.inf.uniroma3.it/db/sigmod2020contest/index.html
[2] Hector Garcia-Molina. 2004. Entity Resolution: Overview and Challenges. In Lecture Notes in Computer Science. Springer, 1–2. https://doi.org/10.1007/978-3-540-30464-7_1
[3] David Menestrina, Steven Euijong Whang, and Hector Garcia-Molina. 2010. Evaluating Entity Resolution Results. Proceedings of the VLDB Endowment 3, 1-2 (Sept. 2010), 208–219. https://doi.org/10.14778/1920841.1920871
[4] Hongzhi Wang. 2014. Innovative Techniques and Applications of Entity Resolution. IGI Global. https://doi.org/10.4018/978-1-4666-5198-2
[5] Zhenyun Zhuang, Sergiy Zhuk, Haricharan Ramachandra, and Badri Sridharan. 2016. Designing SSD-Friendly Applications for Better Application Performance and Higher IO Efficiency. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC). IEEE. https://doi.org/10.1109/compsac.2016.94