=Paper=
{{Paper
|id=Vol-2726/paper5
|storemode=property
|title=CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data
|pdfUrl=https://ceur-ws.org/Vol-2726/paper5.pdf
|volume=Vol-2726
|authors=Nan Deng,Wendi Luan,Haotian Liu,Bo Tang
|dblpUrl=https://dblp.org/rec/conf/vldb/DengLLT20
}}
==CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data==
CheetahER: An Accurate and Efficient Entity Resolution System for Heterogeneous Camera Data SIGMOD 2020 Programming Contest Finalist Paper Nan Deng† , Wendi Luan† , Haotian Liu† , Bo Tang†‡∗ † Department of Computer Science and Engineering, Southern University of Science and Technology ‡ PCL Research Center of Networks and Communications, Peng Cheng Laboratory {11711004@mail.,11712532@mail.,11613015@mail.,tangb3@}sustech.edu.cn { ABSTRACT “”: “Kodak Eashyshare Z980 | eBay”, The SIGMOD Programming Contest 2020 raises a real-world entity “beautiful pictures more often automatically”: “Who says you can’t have it all? … ”, resolution problem, which requires to identify product specifica- “brand”: “Kodak”, tions from multiple e-commerce websites that represent the same “bundled items”: “Case or Bag, Lens, Tripod”, real-world cameras. Entity resolution has been extensively studied “megapixels”: “12.0 MP”, and the general solution framework consists of two phases: blocking “model”: “Z980”, … and matching. Most existing works focus on the matching phase, } which trains (complex) models on large volumes of data and uses the models to decide whether a pair of descriptions refers to the same real-world object. However, training a high-quality model is Figure 1: An example of the camera specification JSON file difficult for the SIGMOD contest because there is only a limited amount of labeled data and the product specifications can be dirty Table 1: An illustration of the ground-truth dataset and incomplete. In this paper, we propose CheetahER, an accurate and efficient Left specification ID Right specification ID Label entity resolution system. Different from existing works, we focus on improving the effectiveness of the blocking phase, which is www.ebay.com//24887 www.ebay.com//56369 1 overlooked in both academia researches and industry systems, and www.ebay.com//42902 www.mypriceindia.com//57 0 propose a two-phase blocking framework to group the product specifications according to brand and model. The pre-processing and data cleaning procedures are also carefully designed to improve attribute page title (highlighted in Figure 1) but the other attributes data quality. CheetahER ranks the 1st in accuracy among 53 teams (not their values), such as model and brand (marked in gray), could and completes the task within 20 seconds. Even though some de- be different in different specifications. This makes entity resolution signs of CheetahER are specialized for camera datasets, its novel difficult as the same camera can have different attributes in different two-phase blocking framework and operators (i.e., merging and JSON files. A ground-truth dataset is a CSV file, in which each row splitting) may generalize to other entity resolution tasks. is a record that contains three fields, i.e., left specification id, right specification id and label. label=1 indicates that the two specifica- KEYWORDS tions refer to the same camera. The participants are required to list all pairs of specifications that represent the same camera using a Data Integration, Entity Resolution, Two-phase Blocking format similar to the ground-truth dataset. The contest ranks the participants by the accuracy of their results and measures accuracy 1 INTRODUCTION using the F1 score, which is defined as Entity resolution, which identifies different records (e.g., web-pages 2 × 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑟𝑒𝑐𝑎𝑙𝑙 and user names) referring to the same real-world entity, is an impor- 𝐹1 = . tant task in the database community. In the SIGMOD Programming 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙 Contest 2020 [3], organizers introduce a real-world entity resolu- A general solution to the entity resolution problem consists of tion problem: finding which camera specifications from 24 different two steps, i.e., blocking and matching. The blocking step divides e-commerce websites represent the same camera. Two types of the dataset into a number of groups to reduce the complexity for datasets are provided to the participants: (i) camera specification pair-wise comparison. The matching step enumerates all possible datasets and (ii) ground-truth datasets. For a camera specification pairs in each block and decides whether a pair of records (spec- dataset, each camera specification is stored as a JSON file and an ifications in our case) refers to the same entity. Currently, both example is provided in Figure 1. All JSON files have a common academia and industry focus on the matching step and developed many learning-based solutions that achieve good accuracy for pair- ∗ Corresponding author: Bo Tang. wise comparison [1, 2]. However, learning-based solutions is not DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY suitable for the contest due to two challenges. (i) Insufficient labeled 4.0). data, only 300k ground-truth pairs are given but there are 900M DI2KG Workshop’20, August 2020, Tokyo, Japan Nan Deng† , Wendi Luan† , Haotian Liu† , Bo Tang†‡ Input Preprocessing Two-Stage 4.1 Brand-based Blocking Blocking We first divide the camera specifications into different blocks ac- Camera Specifications cording to their brands. As shown in Figure 3, brand-based blocking consists of two steps: grouping and merging. Output Matching Cleaning Grouping by Brand. In this step, we first extract the brand names Camera Pairs from the brand attribute in the JSON files. Although the brand attribute only exists in a small portion of the specifications, it helps Figure 2: System architecture for CheetahER to obtain a set of brands that could cover most camera specifications. We then utilize the page title attribute in each JSON file to extract possible specification pairs. The limited size of the ground-truth more brand information. A camera specification will be grouped dataset makes it difficult to train high quality models. (ii) Poor data into a brand block if the brand appears in the specification’s page quality, the specifications are unstructured data with different at- title attribute. For example, the JSON file in Figure 1 has attribute tributes. We believe that the two challenges are also general for brand with value “Kodak”, then we are able to create a block with many real-word entity resolution problems. brand name “Kodak”. All JSON files that have “Kodak” as a substring Our system, CheetahER, is designed to overcome the aforemen- in its page title attribute will be grouped into this block. We generate tioned challenges. Different from existing works, we focus on the blocks for all encountered brands, e.g., “Canon”, “Cannon”, “Fuji”, blocking step and introduce two block operations: merging and “Fujifilm”, as shown in Figure 3. These blocks will go through the splitting. The merging operation merges two or more blocks into merging step to improve accuracy. one while the splitting operation splits one block into multiple Merging. Block merging is used to merge different blocks that blocks. A complete set of rules is also designed to control the execu- correspond to the same brand. As shown in Figure 3, we obtain tion of the block operations. CheetahER achieves an F1 score of 98% blocks with name “Canon” and “Cannon” after grouping but “Can- and runs within 20 seconds. We are also honored as finalist, ranking non” is apparently a typo of the correct spelling “Canon”. Different top-5 on the world-wide leaderboard [4]. Details of CheetahER can blocks can also be created for the same brand due to alias, “Fuji” be found in our open-sourced code1 . and “Fujifilm” in Figure 3 for example. To handle spelling error and alias, we introduce two criteria for merging the brand blocks. 2 SYSTEM ARCHITECTURE The first criteria utilizes the regular rules. For two brand blocks As illustrated in Figure 2, CheetahER has four components: pre- with brand name 𝐴 and 𝐵, if 𝐴 is the prefix of 𝐵 (e.g. “Fuji” and possessing, two-stages blocking, cleaning and matching. The prepos- “Fujifilm”), then block 𝐴 should be merged with block 𝐵. The other sessing step loads data into memory, extracts useful information criteria is based on the Levenshtein distance for strings[5]. and organizes the specifications in structured form. The blocking step indexes the specifications and clusters similar specifications max(𝑖, 𝑗) if min(𝑖, 𝑗) = 0, into a group, and the cleaning step adjusts the group assignment lev𝐴,𝐵 (𝑖 − 1, 𝑗) + 1 lev𝐴,𝐵 (𝑖, 𝑗) = of incorrectly classified specifications. Finally, the matching step min lev𝐴,𝐵 (𝑖, 𝑗 − 1) + 1 otherwise, enumerates all possible specifications pairs in each block as the lev𝐴,𝐵 (𝑖 − 1, 𝑗 − 1) + 1 𝐴 ≠𝐵 ( 𝑖 𝑗) result. That is, for a block of size 𝑛, 𝑛(𝑛 − 1)/2 matched pairs will be produced. Here, 𝐴𝑖 and 𝐴 𝑗 are the sub-strings of 𝐴 and 𝐵 with first 𝑖 and 𝑗 characters, respectively. Merging brand blocks with a small Lev- 3 PREPROCESSING enshtein distance helps to tackle spelling errors, e.g., “Canon” and “Cannon”. Brand and model can identify a specific camera, and thus the prepro- cessing step retrieves attributes from which brand and model could be extracted. The page title attribute is present for JSON files and it 4.2 Model Blocking within Brand Blocking often contains information about brand and model. For example, In this step, we further divide each brand block into multiple blocks the page title attribute of spec “www.wexphotographic.com//626” based on the camera model. Model blocking consists of three phases, is “Samsung WB350F Digital Smart Camera ...”, which includes its i.e., grouping, merging and splitting. brand “Samsung” and model “WB350F”. Brand and model may also Grouping. Model names cannot be easily extracted as the brand appear independently in other attributes. However, the model at- names do. On the one hand, there are many specifications whose tribute can be ambiguous. e.g. “0002724284400” and “Camera” are model name is missing from the model attribute; on the other hand, model attributes in specifications “www.buzzillions.com//854.json” the value of the model attribute can be ambiguous. For instance, and “www.ebay.com//60127.json”, respectively. Thus, we only keep in a specification with id “www.buzzi-llions.com/872”, the value of the page title and brand attributes in this step. the model attribute is “15820728”, but its actual model is “F100fd” with brand “Fujifilm” by scrutinizing the data. 4 TWO-STAGE BLOCKING According to our observation, there are two patterns for the name of camera models: (i) model names usually consist of only An illustration of our two-stage blocking method is provided in alphabet, number, space and crossbar; (ii) model names can be Figure 3, which consists of two phases, brand blocking and model constructed by a combination of prefix and postfix, e.g., “EOS 1Ds”. blocking. We elaborate the two phases as follows. Based on these observations, we define a suite of regular rules to 1 https://github.com/LUUUAN/EntityMatching_SIGMOD_2020_Contest extract a set of model names from the “page title” attribute within CheetahER: An Accurate and Efficient Entity Resolution System for Heterogeneous Camera Data DI2KG Workshop’20, August 2020, Tokyo, Japan Camera Specs Grouping Canon Cannon Kodak … Nikon Fuji Fujifilm Brand-Based Blocking Merging Canon … … Fujifilm EOS- EOS IXUS ELPH … Grouping … X10 S1 5D 5D 155 150 Merging EOS IXUS X10 S1 Model Blocking 5D 155 within Brand EOS 5D EOS 5D EOS 5D IXUS Splitting X10 S1 Mark I Mark II Mark III 155 Figure 3: An example of two-stage blocking each brand block. Similar to brand-based blocking, we can further names. The same model name can be expressed in various forms, group specifications in each brand block into several model blocks, for instance, the model “Canon EOS 5D” can have name “EOS-5D”, using the model name set extracted previously. For instance, the “EOS5D”, or even a simple name “5D”. In this case, we ignore the JSON file shown in Figure 1 will be classified into block “Kodak” noncontributory prefixes and postfixes and merge them together. in the brand-based blocking step. In the model blocking step, its The pseudo-code for model block merging is shown in Algo- model name “Z980” can be extracted using regular rules described rithm 1 and the merging threshold 𝑒 is 3 by default. We will show above. As a result, it will be grouped into the block which contains that the model merging strategies are effective and can significantly all camera manifestations that has “Kodak” and “Z980” in their page improve recall in the experiments in Section 5. title attribute. Similarly, as shown in Figure 3, the brand block “Canon” will be further divided into model blocks with name “EOS 5D”, “IXUS 155”, etc. Algorithm 1: Model Block Merging Merging. Model block merging is similar to brand block merging Input: Model blocks 𝑀𝑡1 , 𝑀𝑡2 , ..., 𝑀𝑡𝑛 in brand block 𝐵𝑡 but the merge conditions are different. We can’t directly apply the Output: Model blocks 𝑀𝑡1 , 𝑀𝑡2 , ..., 𝑀𝑡𝑘 brand block merging rules due to two reasons: (i) the similarity 1 for i=1:n do between brand blocks is defined using the Levenshtein distance and 2 for j=i+1:n do does not work for models, e.g., the Levenshtein distances between 3 if |𝑀𝑡𝑖 ∩ 𝑀𝑡 𝑗 | > 𝑒 then “EOS 50D” and “EOS 5D” is only one but they are different models; 4 Merge 𝑀𝑡𝑖 and 𝑀𝑡 𝑗 (ii) the same camera model can have different model names in 5 end different regions. For example, “IXY 140”, “ELPH 150” and “IXUS 6 end 155” are names of the same model when they are sold in Japan, America and elsewhere, respectively. We propose two criteria for model merging to tackle the afore- Splitting. After the brand blocking and model blocking steps, some mentioned problems. Firstly, some e-commerce websites tend to specifications referring to different entities could be put into the write all possible names of a model in its page title, i.e., models same block. We found that this is because some models need to be names like “IXUS 155” and “ELPH 150” may appear in a single page further distinguished by their generations. For example, “Canon title attribute values. In our example, this camera specification will EOS 5D Mark II” will be blocked to “EOS 5D” by the previous steps, be classified into both “IXUS 155” and “ELPH 150” model blocks. losing its generation information “Mark II”. As a result, “Canon As a heuristic method, we merge two model blocks if their com- EOS 5D Mark II” will be paired with “Canon EOS 5D Mark I” and mon specifications is greater than a user-given threshold. This “Canon EOS 5D Mark III” when generating the solution. This will merging condition can be formally specified as follows: Given two severely degrade the precision of the final result. So we design a brand blocks 𝐴 and 𝐵 with 𝐴 ≠ 𝐵, if |𝐴 ∩ 𝐵| > 𝑒, in which 𝑒 is a set of regular rules like “.*Mark [(I)|(II)|(III)|(IV)].*” to identify “Mark threshold parameter, then the two brand blocks can be merged. The II” from “Canon EOS 5D Mark II”, then extract them and generate other model merging criteria is based on the format of the model new blocks for such records. DI2KG Workshop’20, August 2020, Tokyo, Japan Nan Deng† , Wendi Luan† , Haotian Liu† , Bo Tang†‡ 5 BLOCK CLEANING 1.00 The block cleaning step aims to remove accessories (e.g., lens and 0.96 bags) that should not contribute to the final result. Accessories may contain several brands and models, and thus may be assigned 0.92 to multiple blocks and paired with specifications referring to real camera instances, which hampers the precision of the result. For 0.88 instance, “New Wide Angle Macro Lens for Canon EOS Digital Rebel Camera XTi T3i T4i 18 55mm | eBay” describes a camera lens 0.84 but is assigned to model “XTi”, “T3i” and “T4i”. 0.80 To solve this problem, a revert list is built to record the model Recall Precision F1-measure blocks a specification has been assigned to. We detect and delete CheetahER accessories according to the length of the reverted list. That is, we CheetahER without Block Merging regard a specification as accessory if it is assigned to a large num- CheetahER without Block Splitting ber of model blocks. Setting a good length threshold is crucial for the effectiveness of cleaning. If the length threshold is too small, some pairs that ought to be matched will be discarded. For example, Figure 4: Accuracy of the CheetahER variants in “Canon EOS 400D Digital Rebel XTi 10 1MP Digital SLR Cam- era Silver | eBay”, “XTi” is an alias of “EOS 400D”. On the other focuses on blocking. We design a comprehensive blocking pipeline hand, if the length threshold is too large, some accessories may that involves brand blocking, model blocking, block splitting, block not be identified. We found the optimal length threshold to be 3 by merging and block cleaning by considering properties of the prob- traversing all possible values. An exhaustive search is acceptable lem and dataset. Our experiment results show that CheetahER is for 2 reasons: (i) the maximal length of the reverted list is 6 for all accurate an efficient, achieving an F1 score of 0.98 and running specifications in the dataset, and thus the search space is not large; within 20 seconds on a standard CPU machine. We think the success (ii) the cleaning process is very efficient and running block cleaning of CheetahER shows that it is crucial to consider the characteristics under a threshold is very fast. of the problem and data in practical data mining problems such as entity resolution. Despite that machine learning based solutions are 6 MATCHING highly successful for many problems, CheetahER is an example that The two-stage blocking with split and merge operators has classified simple rule-based solutions are still valuable if they are properly the specifications into blocks quite accurately. Thus, the matching guided by insights from data. procedure can be made very simple–enumerating all possible pair- wise combination within each block as the result. Therefore, for a 9 ACKNOWLEDGMENTS block with size 𝑛, a total 𝑛(𝑛 − 1)/2 matched pairs will be generated This work was supported by the Science and Technology Innovation in the final result. Committee Foundation of Shenzhen (Grant No. JCYJ20180302174301 157), the Education Department of Guangdong (Grant No. 2020KZD 7 EXPERIMENTAL RESULTS ZX1184), the National Science Foundation of China (NSFC No. We implemented CheetahER using C++ and conducted the experi- 61802163), and PCL Future Regional Network Facilities for Large- ments on a machine with 4x Intel(R) Core(TM) i7-7700HQ CPU @ scale Experiments and Applications (PCL2018KP001). 2.80GHz and 16 GB memory. To test the gain of block merging and splitting, we disable them and create two variants of CheetahER. REFERENCES The accuracy results are reported in Figure 4, and the precision and [1] Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, recall scores are acquired from contest committee. Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeffrey F. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2016. Mag- The results show that block merging significantly improves re- ellan: Toward Building Entity Matching Management Systems. PVLDB 9, 12 (2016), call, i.e., from 0.88 to 0.97. This is because more specification pairs 1197–1208. https://doi.org/10.14778/2994509.2994535 [2] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, referring to the same camera can be identified when block merging Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. puts them into the same block. On the other hand, block splitting Deep Learning for Entity Matching: A Design Space Exploration. In SIGMOD, improves precision, from 0.93 to 0.99. This is because block splitting Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). ACM, 19–34. https://doi.org/10.1145/3183713.3196926 avoids generating false positive specification pairs that do not repre- [3] Database Research Group of the Roma Tre University. [n.d.]. ACM SIG- sent the same entity. Combining block merging and block splitting, MOD 2020 Programming Contest. [EB/OL]. http://www.inf.uniroma3.it/db/ the CheetahER achieves an F1 score of 0.98. In the meantime, Cheeta- sigmod2020contest/task.html. [4] Database Research Group of the Roma Tre University. 2017. ACM SIGMOD 2020 hER is also efficient, running the entire processing piepiline within Programming Contest Leaderboard. Retrieved July 13, 2020 from http://www.inf. 20 seconds. uniroma3.it/db/sigmod2020contest/leaders.html [5] Wikipedia. 2020. Levenshtein distance — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Levenshtein_distance [Online; accessed 13-July- 8 CONCLUSION 2020]. In this work, we develop an entity resolution engine, CheetahER, for the SIGMOD Programming Contest 2020. Different from pop- ular methods that heavily rely on machine learning, CheetahER