=Paper=
{{Paper
|id=Vol-2726/paper5
|storemode=property
|title=CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data
|pdfUrl=https://ceur-ws.org/Vol-2726/paper5.pdf
|volume=Vol-2726
|authors=Nan Deng,Wendi Luan,Haotian Liu,Bo Tang
|dblpUrl=https://dblp.org/rec/conf/vldb/DengLLT20
}}
==CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data==
<pdf width="1500px">https://ceur-ws.org/Vol-2726/paper5.pdf</pdf>
<pre>
 CheetahER: An Accurate and Efficient Entity Resolution System
               for Heterogeneous Camera Data
                                           SIGMOD 2020 Programming Contest Finalist Paper


                                       Nan Deng† ,            Wendi Luan† ,           Haotian Liu† ,          Bo Tang†‡∗
                † Department of Computer Science and Engineering, Southern University of Science and Technology
                             ‡ PCL Research Center of Networks and Communications, Peng Cheng Laboratory

                                   {11711004@mail.,11712532@mail.,11613015@mail.,tangb3@}sustech.edu.cn
                                                                                        {
ABSTRACT                                                                                        “<page title>”: “Kodak Eashyshare Z980 | eBay”,
The SIGMOD Programming Contest 2020 raises a real-world entity                                  “beautiful pictures more often automatically”: “Who says you
                                                                                                       can’t have it all? … ”,
resolution problem, which requires to identify product specifica-                               “brand”: “Kodak”,
tions from multiple e-commerce websites that represent the same                                 “bundled items”: “Case or Bag, Lens, Tripod”,
real-world cameras. Entity resolution has been extensively studied                              “megapixels”: “12.0 MP”,
and the general solution framework consists of two phases: blocking                             “model”: “Z980”,
                                                                                                …
and matching. Most existing works focus on the matching phase,
                                                                                        }
which trains (complex) models on large volumes of data and uses
the models to decide whether a pair of descriptions refers to the
same real-world object. However, training a high-quality model is                      Figure 1: An example of the camera specification JSON file
difficult for the SIGMOD contest because there is only a limited
amount of labeled data and the product specifications can be dirty                           Table 1: An illustration of the ground-truth dataset
and incomplete.
   In this paper, we propose CheetahER, an accurate and efficient                           Left specification ID          Right specification ID          Label
entity resolution system. Different from existing works, we focus
on improving the effectiveness of the blocking phase, which is                          www.ebay.com//24887             www.ebay.com//56369                    1
overlooked in both academia researches and industry systems, and                        www.ebay.com//42902            www.mypriceindia.com//57                0
propose a two-phase blocking framework to group the product
specifications according to brand and model. The pre-processing
and data cleaning procedures are also carefully designed to improve                   attribute page title (highlighted in Figure 1) but the other attributes
data quality. CheetahER ranks the 1st in accuracy among 53 teams                      (not their values), such as model and brand (marked in gray), could
and completes the task within 20 seconds. Even though some de-                        be different in different specifications. This makes entity resolution
signs of CheetahER are specialized for camera datasets, its novel                     difficult as the same camera can have different attributes in different
two-phase blocking framework and operators (i.e., merging and                         JSON files. A ground-truth dataset is a CSV file, in which each row
splitting) may generalize to other entity resolution tasks.                           is a record that contains three fields, i.e., left specification id, right
                                                                                      specification id and label. label=1 indicates that the two specifica-
KEYWORDS                                                                              tions refer to the same camera. The participants are required to list
                                                                                      all pairs of specifications that represent the same camera using a
Data Integration, Entity Resolution, Two-phase Blocking
                                                                                      format similar to the ground-truth dataset. The contest ranks the
                                                                                      participants by the accuracy of their results and measures accuracy
1    INTRODUCTION                                                                     using the F1 score, which is defined as
Entity resolution, which identifies different records (e.g., web-pages
                                                                                                                    2 × 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑟𝑒𝑐𝑎𝑙𝑙
and user names) referring to the same real-world entity, is an impor-                                       𝐹1 =                           .
tant task in the database community. In the SIGMOD Programming                                                        𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙
Contest 2020 [3], organizers introduce a real-world entity resolu-                        A general solution to the entity resolution problem consists of
tion problem: finding which camera specifications from 24 different                   two steps, i.e., blocking and matching. The blocking step divides
e-commerce websites represent the same camera. Two types of                           the dataset into a number of groups to reduce the complexity for
datasets are provided to the participants: (i) camera specification                   pair-wise comparison. The matching step enumerates all possible
datasets and (ii) ground-truth datasets. For a camera specification                   pairs in each block and decides whether a pair of records (spec-
dataset, each camera specification is stored as a JSON file and an                    ifications in our case) refers to the same entity. Currently, both
example is provided in Figure 1. All JSON files have a common                         academia and industry focus on the matching step and developed
                                                                                      many learning-based solutions that achieve good accuracy for pair-
∗ Corresponding author: Bo Tang.
                                                                                      wise comparison [1, 2]. However, learning-based solutions is not
DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY     suitable for the contest due to two challenges. (i) Insufficient labeled
4.0).                                                                                 data, only 300k ground-truth pairs are given but there are 900M
DI2KG Workshop’20, August 2020, Tokyo, Japan                                                     Nan Deng† ,   Wendi Luan† ,   Haotian Liu† ,   Bo Tang†‡


                           Input
                                   Preprocessing
                                                      Two-Stage           4.1    Brand-based Blocking
                                                       Blocking
                                                                          We first divide the camera specifications into different blocks ac-
       Camera Specifications
                                                                          cording to their brands. As shown in Figure 3, brand-based blocking
                                                                          consists of two steps: grouping and merging.
                          Output
                                     Matching          Cleaning           Grouping by Brand. In this step, we first extract the brand names
          Camera Pairs                                                    from the brand attribute in the JSON files. Although the brand
                                                                          attribute only exists in a small portion of the specifications, it helps
         Figure 2: System architecture for CheetahER                      to obtain a set of brands that could cover most camera specifications.
                                                                          We then utilize the page title attribute in each JSON file to extract
possible specification pairs. The limited size of the ground-truth
                                                                          more brand information. A camera specification will be grouped
dataset makes it difficult to train high quality models. (ii) Poor data
                                                                          into a brand block if the brand appears in the specification’s page
quality, the specifications are unstructured data with different at-
                                                                          title attribute. For example, the JSON file in Figure 1 has attribute
tributes. We believe that the two challenges are also general for
                                                                          brand with value “Kodak”, then we are able to create a block with
many real-word entity resolution problems.
                                                                          brand name “Kodak”. All JSON files that have “Kodak” as a substring
   Our system, CheetahER, is designed to overcome the aforemen-
                                                                          in its page title attribute will be grouped into this block. We generate
tioned challenges. Different from existing works, we focus on the
                                                                          blocks for all encountered brands, e.g., “Canon”, “Cannon”, “Fuji”,
blocking step and introduce two block operations: merging and
                                                                          “Fujifilm”, as shown in Figure 3. These blocks will go through the
splitting. The merging operation merges two or more blocks into
                                                                          merging step to improve accuracy.
one while the splitting operation splits one block into multiple
                                                                          Merging. Block merging is used to merge different blocks that
blocks. A complete set of rules is also designed to control the execu-
                                                                          correspond to the same brand. As shown in Figure 3, we obtain
tion of the block operations. CheetahER achieves an F1 score of 98%
                                                                          blocks with name “Canon” and “Cannon” after grouping but “Can-
and runs within 20 seconds. We are also honored as finalist, ranking
                                                                          non” is apparently a typo of the correct spelling “Canon”. Different
top-5 on the world-wide leaderboard [4]. Details of CheetahER can
                                                                          blocks can also be created for the same brand due to alias, “Fuji”
be found in our open-sourced code1 .
                                                                          and “Fujifilm” in Figure 3 for example. To handle spelling error and
                                                                          alias, we introduce two criteria for merging the brand blocks.
2    SYSTEM ARCHITECTURE                                                      The first criteria utilizes the regular rules. For two brand blocks
As illustrated in Figure 2, CheetahER has four components: pre-           with brand name 𝐴 and 𝐵, if 𝐴 is the prefix of 𝐵 (e.g. “Fuji” and
possessing, two-stages blocking, cleaning and matching. The prepos-       “Fujifilm”), then block 𝐴 should be merged with block 𝐵. The other
sessing step loads data into memory, extracts useful information          criteria is based on the Levenshtein distance for strings[5].
and organizes the specifications in structured form. The blocking
step indexes the specifications and clusters similar specifications                       
                                                                                          
                                                                                           max(𝑖, 𝑗)                    if min(𝑖, 𝑗) = 0,
into a group, and the cleaning step adjusts the group assignment
                                                                                          
                                                                                          
                                                                                                 lev𝐴,𝐵 (𝑖 − 1, 𝑗) + 1
                                                                          lev𝐴,𝐵 (𝑖, 𝑗) =
                                                                                               
                                                                                               
                                                                                               
of incorrectly classified specifications. Finally, the matching step                      
                                                                                           min lev𝐴,𝐵 (𝑖, 𝑗 − 1) + 1         otherwise,
enumerates all possible specifications pairs in each block as the
                                                                                               lev𝐴,𝐵 (𝑖 − 1, 𝑗 − 1) + 1 𝐴 ≠𝐵
                                                                                               
                                                                                                                          ( 𝑖 𝑗)
                                                                                          
                                                                                              
result. That is, for a block of size 𝑛, 𝑛(𝑛 − 1)/2 matched pairs will
be produced.                                                              Here, 𝐴𝑖 and 𝐴 𝑗 are the sub-strings of 𝐴 and 𝐵 with first 𝑖 and 𝑗
                                                                          characters, respectively. Merging brand blocks with a small Lev-
3    PREPROCESSING                                                        enshtein distance helps to tackle spelling errors, e.g., “Canon” and
                                                                          “Cannon”.
Brand and model can identify a specific camera, and thus the prepro-
cessing step retrieves attributes from which brand and model could
be extracted. The page title attribute is present for JSON files and it
                                                                          4.2    Model Blocking within Brand Blocking
often contains information about brand and model. For example,            In this step, we further divide each brand block into multiple blocks
the page title attribute of spec “www.wexphotographic.com//626”           based on the camera model. Model blocking consists of three phases,
is “Samsung WB350F Digital Smart Camera ...”, which includes its          i.e., grouping, merging and splitting.
brand “Samsung” and model “WB350F”. Brand and model may also              Grouping. Model names cannot be easily extracted as the brand
appear independently in other attributes. However, the model at-          names do. On the one hand, there are many specifications whose
tribute can be ambiguous. e.g. “0002724284400” and “Camera” are           model name is missing from the model attribute; on the other hand,
model attributes in specifications “www.buzzillions.com//854.json”        the value of the model attribute can be ambiguous. For instance,
and “www.ebay.com//60127.json”, respectively. Thus, we only keep          in a specification with id “www.buzzi-llions.com/872”, the value of
the page title and brand attributes in this step.                         the model attribute is “15820728”, but its actual model is “F100fd”
                                                                          with brand “Fujifilm” by scrutinizing the data.
4    TWO-STAGE BLOCKING                                                       According to our observation, there are two patterns for the
                                                                          name of camera models: (i) model names usually consist of only
An illustration of our two-stage blocking method is provided in
                                                                          alphabet, number, space and crossbar; (ii) model names can be
Figure 3, which consists of two phases, brand blocking and model
                                                                          constructed by a combination of prefix and postfix, e.g., “EOS 1Ds”.
blocking. We elaborate the two phases as follows.
                                                                          Based on these observations, we define a suite of regular rules to
1 https://github.com/LUUUAN/EntityMatching_SIGMOD_2020_Contest            extract a set of model names from the “page title” attribute within
CheetahER: An Accurate and Efficient Entity Resolution System
for Heterogeneous Camera Data                                                                                                   DI2KG Workshop’20, August 2020, Tokyo, Japan


                                                                        Camera Specs


                   Grouping             Canon         Cannon     Kodak        …          Nikon         Fuji          Fujifilm
                                                                                                                                         Brand-Based
                                                                                                                                         Blocking
                   Merging                      Canon               …                      …                  Fujifilm


                                        EOS-          EOS                  IXUS          ELPH                    …
                  Grouping                                      …                                    X10                 S1
                                         5D            5D                   155           150


                   Merging                      EOS                               IXUS               X10                 S1              Model Blocking
                                                 5D                                155                                                   within Brand

                                      EOS 5D       EOS 5D       EOS 5D            IXUS
                  Splitting                                                                          X10                 S1
                                      Mark I       Mark II      Mark III           155


                                                        Figure 3: An example of two-stage blocking


each brand block. Similar to brand-based blocking, we can further                        names. The same model name can be expressed in various forms,
group specifications in each brand block into several model blocks,                      for instance, the model “Canon EOS 5D” can have name “EOS-5D”,
using the model name set extracted previously. For instance, the                         “EOS5D”, or even a simple name “5D”. In this case, we ignore the
JSON file shown in Figure 1 will be classified into block “Kodak”                        noncontributory prefixes and postfixes and merge them together.
in the brand-based blocking step. In the model blocking step, its                           The pseudo-code for model block merging is shown in Algo-
model name “Z980” can be extracted using regular rules described                         rithm 1 and the merging threshold 𝑒 is 3 by default. We will show
above. As a result, it will be grouped into the block which contains                     that the model merging strategies are effective and can significantly
all camera manifestations that has “Kodak” and “Z980” in their page                      improve recall in the experiments in Section 5.
title attribute. Similarly, as shown in Figure 3, the brand block
“Canon” will be further divided into model blocks with name “EOS
5D”, “IXUS 155”, etc.                                                                      Algorithm 1: Model Block Merging
Merging. Model block merging is similar to brand block merging                              Input: Model blocks 𝑀𝑡1 , 𝑀𝑡2 , ..., 𝑀𝑡𝑛 in brand block 𝐵𝑡
but the merge conditions are different. We can’t directly apply the                         Output: Model blocks 𝑀𝑡1 , 𝑀𝑡2 , ..., 𝑀𝑡𝑘
brand block merging rules due to two reasons: (i) the similarity                          1 for i=1:n do
between brand blocks is defined using the Levenshtein distance and                        2     for j=i+1:n do
does not work for models, e.g., the Levenshtein distances between                         3         if |𝑀𝑡𝑖 ∩ 𝑀𝑡 𝑗 | > 𝑒 then
“EOS 50D” and “EOS 5D” is only one but they are different models;                         4             Merge 𝑀𝑡𝑖 and 𝑀𝑡 𝑗
(ii) the same camera model can have different model names in                              5     end
different regions. For example, “IXY 140”, “ELPH 150” and “IXUS                           6 end
155” are names of the same model when they are sold in Japan,
America and elsewhere, respectively.
    We propose two criteria for model merging to tackle the afore-
                                                                                         Splitting. After the brand blocking and model blocking steps, some
mentioned problems. Firstly, some e-commerce websites tend to
                                                                                         specifications referring to different entities could be put into the
write all possible names of a model in its page title, i.e., models
                                                                                         same block. We found that this is because some models need to be
names like “IXUS 155” and “ELPH 150” may appear in a single page
                                                                                         further distinguished by their generations. For example, “Canon
title attribute values. In our example, this camera specification will
                                                                                         EOS 5D Mark II” will be blocked to “EOS 5D” by the previous steps,
be classified into both “IXUS 155” and “ELPH 150” model blocks.
                                                                                         losing its generation information “Mark II”. As a result, “Canon
As a heuristic method, we merge two model blocks if their com-
                                                                                         EOS 5D Mark II” will be paired with “Canon EOS 5D Mark I” and
mon specifications is greater than a user-given threshold. This
                                                                                         “Canon EOS 5D Mark III” when generating the solution. This will
merging condition can be formally specified as follows: Given two
                                                                                         severely degrade the precision of the final result. So we design a
brand blocks 𝐴 and 𝐵 with 𝐴 ≠ 𝐵, if |𝐴 ∩ 𝐵| > 𝑒, in which 𝑒 is a
                                                                                         set of regular rules like “.*Mark [(I)|(II)|(III)|(IV)].*” to identify “Mark
threshold parameter, then the two brand blocks can be merged. The
                                                                                         II” from “Canon EOS 5D Mark II”, then extract them and generate
other model merging criteria is based on the format of the model
                                                                                         new blocks for such records.
DI2KG Workshop’20, August 2020, Tokyo, Japan                                                        Nan Deng† ,    Wendi Luan† ,   Haotian Liu† ,   Bo Tang†‡


5    BLOCK CLEANING                                                               1.00

The block cleaning step aims to remove accessories (e.g., lens and                0.96
bags) that should not contribute to the final result. Accessories
may contain several brands and models, and thus may be assigned                   0.92
to multiple blocks and paired with specifications referring to real
camera instances, which hampers the precision of the result. For                  0.88
instance, “New Wide Angle Macro Lens for Canon EOS Digital
Rebel Camera XTi T3i T4i 18 55mm | eBay” describes a camera lens                  0.84
but is assigned to model “XTi”, “T3i” and “T4i”.
                                                                                  0.80
    To solve this problem, a revert list is built to record the model
                                                                                                Recall            Precision         F1-measure
blocks a specification has been assigned to. We detect and delete
                                                                                                CheetahER
accessories according to the length of the reverted list. That is, we
                                                                                                CheetahER without Block Merging
regard a specification as accessory if it is assigned to a large num-
                                                                                                CheetahER without Block Splitting
ber of model blocks. Setting a good length threshold is crucial for
the effectiveness of cleaning. If the length threshold is too small,
some pairs that ought to be matched will be discarded. For example,                 Figure 4: Accuracy of the CheetahER variants
in “Canon EOS 400D Digital Rebel XTi 10 1MP Digital SLR Cam-
era Silver | eBay”, “XTi” is an alias of “EOS 400D”. On the other         focuses on blocking. We design a comprehensive blocking pipeline
hand, if the length threshold is too large, some accessories may          that involves brand blocking, model blocking, block splitting, block
not be identified. We found the optimal length threshold to be 3 by       merging and block cleaning by considering properties of the prob-
traversing all possible values. An exhaustive search is acceptable        lem and dataset. Our experiment results show that CheetahER is
for 2 reasons: (i) the maximal length of the reverted list is 6 for all   accurate an efficient, achieving an F1 score of 0.98 and running
specifications in the dataset, and thus the search space is not large;    within 20 seconds on a standard CPU machine. We think the success
(ii) the cleaning process is very efficient and running block cleaning    of CheetahER shows that it is crucial to consider the characteristics
under a threshold is very fast.                                           of the problem and data in practical data mining problems such as
                                                                          entity resolution. Despite that machine learning based solutions are
6    MATCHING                                                             highly successful for many problems, CheetahER is an example that
The two-stage blocking with split and merge operators has classified      simple rule-based solutions are still valuable if they are properly
the specifications into blocks quite accurately. Thus, the matching       guided by insights from data.
procedure can be made very simple–enumerating all possible pair-
wise combination within each block as the result. Therefore, for a        9    ACKNOWLEDGMENTS
block with size 𝑛, a total 𝑛(𝑛 − 1)/2 matched pairs will be generated     This work was supported by the Science and Technology Innovation
in the final result.                                                      Committee Foundation of Shenzhen (Grant No. JCYJ20180302174301
                                                                          157), the Education Department of Guangdong (Grant No. 2020KZD
7    EXPERIMENTAL RESULTS                                                 ZX1184), the National Science Foundation of China (NSFC No.
We implemented CheetahER using C++ and conducted the experi-              61802163), and PCL Future Regional Network Facilities for Large-
ments on a machine with 4x Intel(R) Core(TM) i7-7700HQ CPU @              scale Experiments and Applications (PCL2018KP001).
2.80GHz and 16 GB memory. To test the gain of block merging and
splitting, we disable them and create two variants of CheetahER.          REFERENCES
The accuracy results are reported in Figure 4, and the precision and      [1] Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan,
recall scores are acquired from contest committee.                            Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeffrey F. Naughton,
                                                                              Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2016. Mag-
   The results show that block merging significantly improves re-             ellan: Toward Building Entity Matching Management Systems. PVLDB 9, 12 (2016),
call, i.e., from 0.88 to 0.97. This is because more specification pairs       1197–1208. https://doi.org/10.14778/2994509.2994535
                                                                          [2] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park,
referring to the same camera can be identified when block merging             Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018.
puts them into the same block. On the other hand, block splitting             Deep Learning for Entity Matching: A Design Space Exploration. In SIGMOD,
improves precision, from 0.93 to 0.99. This is because block splitting        Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). ACM, 19–34.
                                                                              https://doi.org/10.1145/3183713.3196926
avoids generating false positive specification pairs that do not repre-   [3] Database Research Group of the Roma Tre University. [n.d.]. ACM SIG-
sent the same entity. Combining block merging and block splitting,            MOD 2020 Programming Contest. [EB/OL]. http://www.inf.uniroma3.it/db/
the CheetahER achieves an F1 score of 0.98. In the meantime, Cheeta-          sigmod2020contest/task.html.
                                                                          [4] Database Research Group of the Roma Tre University. 2017. ACM SIGMOD 2020
hER is also efficient, running the entire processing piepiline within         Programming Contest Leaderboard. Retrieved July 13, 2020 from http://www.inf.
20 seconds.                                                                   uniroma3.it/db/sigmod2020contest/leaders.html
                                                                          [5] Wikipedia. 2020. Levenshtein distance — Wikipedia, The Free Encyclopedia.
                                                                              https://en.wikipedia.org/wiki/Levenshtein_distance [Online; accessed 13-July-
8    CONCLUSION                                                               2020].

In this work, we develop an entity resolution engine, CheetahER,
for the SIGMOD Programming Contest 2020. Different from pop-
ular methods that heavily rely on machine learning, CheetahER

</pre>