=Paper=
{{Paper
|id=Vol-2726/paper6
|storemode=property
|title=An Extensible Block Scheme-Based Method for Entity Matching
|pdfUrl=https://ceur-ws.org/Vol-2726/paper6.pdf
|volume=Vol-2726
|authors=Jiawei Wang,Haizhou Ye,Jianhui Huang
|dblpUrl=https://dblp.org/rec/conf/vldb/WangYH20
}}
==An Extensible Block Scheme-Based Method for Entity Matching==
An Extensible Block Scheme-Based Method for Entity Matching
DI2KG 2020 Challenge Winner Paper

Jiawei Wang∗, Haizhou Ye∗, Jianhui Huang∗
College of Cybersecurity, Jinan University, Guangzhou, China
jiaweiwang.aa@qq.com, 13632259569@163.com, 924285628@qq.com

∗ All authors contributed equally to this research.

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Entity resolution (ER) is of great importance in many real-world applications, including information retrieval, web search, natural language processing, web information integration, and multi-source data analysis. In this paper, we present an extensible block scheme-based method for entity resolution. Specifically, we first preprocess data records from multiple datasets, so as to remove ambiguity and retain the data attributes that are helpful for entity resolution. The processed records are then assigned to different blocks, after which the similarity between records within the same block is computed. Experiment results on the challenge datasets from DI2KG show that our proposed method is promising.

KEYWORDS

Entity Matching, Block Scheme, Block Aggregation, Entity Similarity

1 INTRODUCTION

Entity Resolution (ER) is the task of matching data records from two or more data sources by analyzing the contents that describe identical real-world entities [5]. It remains a challenging problem in data management [9], information retrieval [8], data mining [7], natural language processing, etc. Entity resolution has a wide range of applications, including data cleansing, electronic commerce [6], web search, and public/government data analysis [11].

In this paper, we focus on the problem of matching products recorded on e-commerce websites. Performing the entity matching task on an e-commerce database is challenging because 1) product data is highly heterogeneous; 2) the structure of the data is loosely defined; and 3) there are multiple data sources (for example, in the DI2KG challenge we need to process data sources from different stores). One of the critical missions of ER on large data sets is to reduce the computational complexity so as to improve efficiency. Currently, there are three prominent solutions, i.e., blocking [10], windowing [4], and hybrid approaches [3]. In this paper we use the first one.
By assigning each product to one or more blocks and computing the similarity only for pairs of products within the same block, blocking schemes can reduce the number of pair-wise similarities that need to be computed. Due to the challenges of ER on multi-source data, we need to solve several problems, as described below:

• How to perform data preprocessing and cleansing
• How to design an efficient block scheme and block aggregation method
• How to compute entity similarity within a block, and how to determine transitivity of similarity between entities

Inspired by previous work on rule-based ER methods [2], we develop a scalable ER system framework. Specifically, the core of our framework consists of a block scheme for the data set and similarity computation for the product pairs within a block. The block scheme includes various preprocessing steps, different keyword extraction methods, as well as multiple block scheme aggregators. We solve the ER tasks of the DI2KG 2020 Challenge, which involve Monitor and Notebook product entities, by using our block scheme-based framework.

The paper is organized as follows. In Section 1 we give an introduction to entity resolution and the challenge task in DI2KG. We present our framework for entity resolution in Section 2, including its three main components. In Section 3, we conduct experiments on the challenge datasets to evaluate the effectiveness of our method. Finally, in Section 4 we summarize this paper and outline our future work on entity resolution.

2 THE PROPOSED METHOD

Our method is based on a block scheme and scheme aggregation. As shown in Figure 1, our extensible entity resolution system framework mainly includes three submodules, i.e., Preprocessing, Block Scheme and Scheme Aggregation (BS-SA), and Similarity and Clustering. In this section, we describe the three submodules in detail.

Figure 1: Our extensible entity resolution system framework

2.1 Data Preprocessing

In the data preprocessing step, we remove some less important terms from page titles, and less informative properties from the product description text. Specifically, from Table 1 we can see that there are differences between product profiles. For example, the first product possesses some property while the second product contains neither this property nor the corresponding value. We also observe that for the same products, their model names are the same. So, we try to find model names for entity matching. Before finding the model name, there are two problems to deal with, as given below:

• Which property contains the model name
• How to deal with the situation where model words cannot be extracted from the predetermined key values

Table 1: Example Raw Data

  property                 value
  <page title>             New ELO 1928L 19inch TFT 1280x1024 LCD Non Touch Dual Monitor E939583 7411493021428 | eBay
  aspect ratio             5:4
  brand                    Elo
  contrast ratio           1300:1
  model                    1928L
  screen size              19

  <page title>             Elo TouchSystems 1928L – MBE
  contrast ratio typical   1300:1
  display resolution       1280 x 1024 pixels
  horizontal scan range    31.5 - 80 kHz
  model name               1928L

To solve the above problems, some preprocessing steps are first necessary, such as lowercasing strings and removing some special symbols. Generally, model names have at least one number and one non-numeric character. Hence, we use a regular expression to match the model name. We find that the model name can be obtained precisely from properties like model, model name and so on, and we retain these properties as new descriptions of the product. When these properties cannot be found, we consider the property '<page title>', from which part of the information needs to be removed, such as the resolution and screen size in the '<page title>'. On the other hand, we consider the situation where the model name is separated by spaces, so we remove the space between non-numeric characters and numeric characters to create a model word (i.e., if the model name is something like 'P 230', we change it to 'p230').

Table 2: Example Data after Preprocessing

  property       value
  <page title>   new elo 1928l tft lcd non touch dual monitor e939583 7411493021428 | ebay
  brand          elo
  model          1928l

  <page title>   elo touchsystems 1928l – mbe
  model name     1928l
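The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration under our own naming (the paper does not publish code): lowercasing and symbol removal, joining a space-separated model name, and extracting model words, i.e., tokens with at least one digit and one non-digit character.

```python
import re

def preprocess(text):
    """Lowercase and strip special symbols, keeping word characters."""
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

def join_split_model(name):
    """Join a model name separated by spaces: 'P 230' -> 'p230'."""
    return re.sub(r'\s+', '', name.lower())

def is_model_word(w):
    """A model word has at least one digit and one non-digit character."""
    return any(c.isdigit() for c in w) and any(not c.isdigit() for c in w)

def model_words(text):
    """Extract every model word, taking each whole token as one word."""
    return [w for w in text.split() if is_model_word(w)]
```

For the first record in Table 1, `preprocess` produces the lowercased title of Table 2, and `model_words` would pick out tokens such as '1928l'.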
2.2 Block Scheme and Scheme Aggregation

After the data preprocessing step, the processed data is passed to the Block Scheme and Scheme Aggregation (BS-SA) submodule, through which each product is assigned to a block. Referring to Figure 1, this submodule consists of two parts. First, the products are classified based on a brand list that is obtained from the Internet [2] and supplemented and modified manually. As [12] points out, the brand list provides the algorithm with task-specific information. Then, for the products of each brand, we classify the products into different blocks by extracting model words and using a block scheme aggregator [13]. In this challenge, a model word is defined as a word that contains at least one number and one non-numeric character, and the entire word is extracted as one model word. We adopt two block schemes, i.e., extracting a single model word from the title and from the description, and then aggregating the two schemes with the union aggregator. Eventually, all documents of each brand are mapped to multiple product entities, and each document is mapped to exactly one product entity.
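The two block schemes and the union aggregator can be sketched as follows. This is our own minimal interpretation, assuming each record is a dict with 'title' and 'description' fields and that a brand is recognized by substring match against the brand list; the real system's field names and brand matching may differ.

```python
from collections import defaultdict

def first_model_word(text):
    """First token containing both a digit and a non-digit character."""
    for w in text.split():
        if any(c.isdigit() for c in w) and any(not c.isdigit() for c in w):
            return w
    return None

def block_keys(record):
    """Union-aggregate two block schemes: one model word taken from
    the title and one from the description (whichever exist)."""
    keys = set()
    for field in ('title', 'description'):
        mw = first_model_word(record.get(field, ''))
        if mw:
            keys.add(mw)
    return keys

def build_blocks(records, brand_list):
    """Group records first by brand, then by model-word block key."""
    blocks = defaultdict(list)
    for r in records:
        brand = next((b for b in brand_list if b in r.get('title', '')),
                     'unknown')
        for k in block_keys(r):
            blocks[(brand, k)].append(r)
    return blocks
```

With this sketch, the two Elo records of Table 2 share the key ('elo', '1928l') and therefore land in the same block, so their similarity will be computed.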
The definition tion of model word is as described in Section 2.2, and we compare the horizontal scan 31.5 - 80 kHz model words set of product 𝑎 and the model words set of product range 𝑏, which measures the similarity of two products. Our proposed model name 1928L similarity function consists of two parts. The first part evaluates whether the models of product 𝑎 and 𝑏 are the same, by using function 𝑑𝑖 𝑓 𝑓 𝑀𝑜𝑑𝑒𝑙 (𝑎, 𝑏, 𝑀). Here, 𝑀 is a list we constructed from product descriptions, which consists Table 2: Example Data after Preprocessing of attributes where the model may appear. We count the model words extracted from the attributes in the list 𝑀, and select the property value model word that appears the most frequently as the product model. new elo 1928l tft lcd non touch dual monitor Moreover, if there is no any model word, then we take the first e939583 7411493021428 | ebay model word from ’ ’ as the model. The rationale is that brand elo the product model in the title usually appears first. model 1928l The second part determines whether synonymous attributes elo touchsystems 1928l – mbe of product 𝑎 and product 𝑏 are similar, by calling the function model name 1928l 𝑚𝑤𝑆𝑖𝑚(𝑎, 𝑏, 𝐿𝑖 , 𝑂𝑖 ) multiple times. Here, parameter 𝐿 = [𝐿1, , 𝐿𝑛 ], where 𝐿𝑖 ∈ 𝐿 contains synonymous attributes, and 𝑂 = [𝑂 1, , 𝑂𝑛 ], where operator 𝑂𝑖 ∈ 𝑂 is either ′𝑎𝑛𝑑 ′ or ′𝑜𝑟 ′ , which is applied to To solve the above problems, at first some preprocessing steps are 𝐿𝑖 . Each time 𝑚𝑤𝑆𝑖𝑚(𝑎, 𝑏, 𝐿𝑖 , 𝑂𝑖 ) is invoked, for the model word necessary, such as lowercasing strings and removing some special set extracted from all the attributes of the product in list 𝐿𝑖 , we de- symbols. Generally, the model names have at least one number termine the relationship between two sets, according to the content and one non-numeric character. Hence, we use regular expression of operator 𝑂𝑖 . Specifically, if 𝑂𝑖 is ′𝑎𝑛𝑑 ′ , then the two sets must to match the model name. 
Algorithm 1 shows a high-level overview of how to compute product similarity. Specifically, if the model names of the two products are the same, the algorithm compares whether certain attributes of the products are similar, according to the parameters L and O.

Algorithm 1: Product Similarity Method

Input: Product profiles a and b; list L = [L1, ..., Ln], where each Li in L contains synonymous attributes; list O = [O1, ..., On], where each operator Oi in O is either 'and' or 'or' and corresponds to Li; attributes list M. Furthermore, the following functions are used:
- diffModel(a, b, M) is true if the models of products a and b are different.
- mwSim(a, b, Li, Oi) is the model word similarity between products a and b using the key list Li and operator Oi.
Output: True if product a and product b are the same; otherwise, False.

 1: function ProdSim(a, b, M, L, O)
 2:   if diffModel(a, b, M) then return False
 3:   end if
 4:   for all Li, Oi in L, O do
 5:     if not mwSim(a, b, Li, Oi) then return False
 6:     end if
 7:   end for
 8:   return True
 9: end function
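Algorithm 1 is a thin driver over the two checks. A minimal Python sketch is shown below; the names are ours, and the two predicates are passed in as callables so the fragment stands alone:

```python
def prod_sim(a, b, M, L, O, diff_model, mw_sim):
    """Algorithm 1 sketch: products a and b match iff their models are
    not different and every attribute group L[i] is similar under O[i]."""
    if diff_model(a, b, M):
        return False
    return all(mw_sim(a, b, Li, Oi) for Li, Oi in zip(L, O))
```

In the full system, `diff_model` and `mw_sim` would be implementations of Algorithms 2 and 3; any failing group Li short-circuits the comparison to False.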
Algorithm 2 judges whether the two products have the same model. It takes the most frequently occurring model word, or the first model word in the title, as the model name, and compares the model names of the two products.

Algorithm 2: Model Comparison Method

Input: Product profiles a and b; attributes list M. The following functions are used:
- mw(a, M) extracts the model words from the key-value pairs of product a whose key appears in list M.
- MaxModel(L) finds the model word that appears most frequently in the model word list L.
- isEmpty(L) is true if L is empty.
Output: True if the models of product a and product b are different; otherwise, False.

 1: function diffModel(a, b, M)
 2:   descriptionAmw = mw(a, M)
 3:   descriptionBmw = mw(b, M)
 4:   titleAmw = mw(a, ['<page title>'])
 5:   titleBmw = mw(b, ['<page title>'])
 6:   if isEmpty(descriptionAmw) then
 7:     if titleAmw[0] in titleBmw then return False
 8:     end if
 9:     modelA = titleAmw[0]
10:   else
11:     modelA = MaxModel(descriptionAmw)
12:   end if
13:   if isEmpty(descriptionBmw) then
14:     if titleBmw[0] in titleAmw then return False
15:     end if
16:     modelB = titleBmw[0]
17:   else
18:     modelB = MaxModel(descriptionBmw)
19:   end if
20:   return not (modelA == modelB)
21: end function
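A self-contained Python sketch of Algorithm 2 is given below. The dict-based product profile and the helper names are our assumptions; the '<page title>' key follows the paper's attribute naming.

```python
from collections import Counter

def is_model_word(w):
    return any(c.isdigit() for c in w) and any(not c.isdigit() for c in w)

def mw(product, keys):
    """Model words from the values of `product` whose key is in `keys`."""
    words = []
    for k in keys:
        words += [w for w in product.get(k, '').split() if is_model_word(w)]
    return words

def diff_model(a, b, M):
    """Algorithm 2 sketch: True iff the models of a and b differ."""
    desc_a, desc_b = mw(a, M), mw(b, M)
    title_a, title_b = mw(a, ['<page title>']), mw(b, ['<page title>'])
    if not desc_a:
        if title_a and title_a[0] in title_b:
            return False
        model_a = title_a[0] if title_a else None
    else:
        model_a = Counter(desc_a).most_common(1)[0][0]  # MaxModel
    if not desc_b:
        if title_b and title_b[0] in title_a:
            return False
        model_b = title_b[0] if title_b else None
    else:
        model_b = Counter(desc_b).most_common(1)[0][0]  # MaxModel
    return model_a != model_b
```

On the two Elo profiles of Table 2, both products yield the model '1928l' from the attributes in M, so `diff_model` returns False and the comparison proceeds to the attribute checks.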
The function mwSim(a, b, Li, Oi) in Algorithm 3 is used to compute the similarity between the descriptions of products a and b according to some specific product attributes. The number of times this function is called depends on the length of the parameters L and O. In this algorithm, the model words from some attributes of products a and b are compared. If Oi is 'and', the model word sets must be the same; otherwise, one model word set must be a subset of the other.

Algorithm 3: Model Word Similarity Method

Input: Product profiles a and b; synonymous attributes list Li; operator Oi, either 'and' or 'or', which corresponds to Li; mw(w) extracts the model words for a given set of words w.
Output: True if the model words of products a and b extracted from Li are the same under operator Oi; otherwise, False.

 1: function mwSim(a, b, Li, Oi)
 2:   setA, setB = mw(a, Li), mw(b, Li)
 3:   if Oi == 'and' then
 4:     if setA.issubset(setB) and setB.issubset(setA) then
 5:       return True
 6:     end if
 7:   end if
 8:   if Oi == 'or' then
 9:     if setA.issubset(setB) or setB.issubset(setA) then
10:       return True
11:     end if
12:   end if
13:   return False
14: end function

Due to the transitivity of similarity between products, if A is similar to B, and B is similar to C, then we regard A as similar to C. We perform clustering while calculating the similarity, instead of clustering only after calculating the similarity between all products. This may sacrifice a little accuracy, but significantly improves efficiency.
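The incremental, transitivity-based clustering described above is naturally maintained with a disjoint-set (union-find) structure: each pair judged similar is merged immediately, so clusters grow while similarities are still being computed. This is a sketch of that idea under our own naming, not necessarily the authors' exact implementation.

```python
class DisjointSet:
    """Union-find over product ids: merging as each similar pair is
    found yields the transitive clusters incrementally."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

ds = DisjointSet()
for pair in [('A', 'B'), ('B', 'C')]:  # pairs judged similar in a block
    ds.union(*pair)
# A, B and C now share one cluster root by transitivity
```

Each product then maps to exactly one cluster root, i.e., one product entity, matching the requirement in Section 2.2 that every document is mapped to a single product entity.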
3 EXPERIMENT RESULTS

To evaluate the performance of our proposed method, we conduct experiments on the challenge datasets provided by DI2KG. We summarize the best experiment results of our proposed method in Table 3: on training Dataset Y, the Precision, Recall and F-Measure are 1.000, 0.964 and 0.982, respectively, whereas on Dataset X they are 0.915, 0.967 and 0.940, respectively. In addition, we tested our algorithm on the camera dataset, where the Precision, Recall and F-Measure are 0.98, 0.97 and 0.98 [1].

Table 3: Performance of Our Method on Challenge Datasets

              Precision   Recall   F-Measure
  Dataset Y     1.000      0.964     0.982
  Dataset X     0.915      0.967     0.940

In general, our algorithm achieved good performance on both data sets, that is, it has good extensibility. In order to be suitable for different datasets, we can set parameters such as the brand list, the product descriptions list and the synonymous attributes list, which can be obtained based on rules or machine learning. It is worth mentioning that our algorithm is mainly based on model words; thus, it works best on data sets that mainly rely on product models for matching, such as cameras, displays and other electronic products.

Our observation is that the key to data preprocessing is to remove redundant information. Since in Algorithm 2 we try to use the first model word in '<page title>' for comparison, it is highly recommended that the first model word in the '<page title>' be the model number of the product.

On the other hand, in the block scheme the acquisition of brand information requires some manual intervention.
Although it can be obtained through machine learning or rule-based methods, we found that there are some special circumstances that need to be dealt with. For instance, Alienware was acquired by Dell, so Alienware and Dell belong to the same brand. Since brand classification is the first step in BS-SA, manual intervention for such special situations can improve the recall rate at low cost.

4 CONCLUSION

In this paper, we proposed an extensible block scheme-based method for entity resolution. Specifically, our method consists of three components, i.e., data preprocessing, block scheme and scheme aggregation, and similarity and clustering. Experiments on the DI2KG challenge datasets show that our method achieves F-measure values of 0.982 and 0.940 on Dataset Y and Dataset X, respectively. In our future work, we intend to combine machine learning techniques and consider weighting key properties, so as to find as many correct pairs of products as possible. By using those strategies, we can make full use of the training datasets, mapping product similarity to the range [0, 1].

ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their comments. This paper is partially supported by NSFC under grant number 61972177.

REFERENCES

[1] [n.d.]. ACM SIGMOD 2020 Programming Contest Leaders. http://www.inf.uniroma3.it/db/sigmod2020contest/leaders.html.
[2] [n.d.]. Wikipedia: The free encyclopedia. https://en.wikipedia.org/wiki/Category:Lists_of_consumer_electronics_manufacturers.
[3] Youssef Aassem, Imad Hafidi, and Noureddine Aboutabit. 2020. Enhanced Duplicate Count Strategy: Towards New Algorithms to Improve Duplicate Detection. In Proceedings of the 3rd International Conference on Networking, Information Systems & Security. 1–7.
[4] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
[5] Gianni Costa, Giuseppe Manco, and Riccardo Ortale. 2010. An incremental clustering scheme for data de-duplication. Data Mining and Knowledge Discovery 20, 1 (2010), 152.
[6] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan Rampalli, Jude Shavlik, and Xiaojin Zhu. 2014. Corleone: Hands-off crowdsourcing for entity matching. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 601–612.
[7] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource deep entity resolution with transfer and active learning. arXiv preprint arXiv:1906.08042 (2019).
[8] Selasi Kwashie, Lin Liu, Jixue Liu, Markus Stumptner, Jiuyong Li, and Lujing Yang. 2019. Certus: An effective entity resolution approach with graph differential dependencies (GDDs). Proceedings of the VLDB Endowment 12, 6 (2019), 653–666.
[9] Guoliang Li, Jiannan Wang, Yudian Zheng, and Michael J. Franklin. 2016. Crowdsourced data management: A survey. IEEE Transactions on Knowledge and Data Engineering 28, 9 (2016), 2296–2319.
[10] George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederee, and Wolfgang Nejdl. 2012. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Transactions on Knowledge and Data Engineering 25, 12 (2012), 2665–2682.
[11] Tomer Sagi, Avigdor Gal, Omer Barkol, Ruth Bergman, and Alexander Avram. 2017. Multi-source uncertain entity resolution: Transforming holocaust victim reports into people. Information Systems 65 (2017), 124–136.
[12] Ronald van Bezu, Sjoerd Borst, Rick Rijkse, Jim Verhagen, Damir Vandic, and Flavius Frasincar. 2015. Multi-component similarity method for web product duplicate detection. In Proceedings of the 30th Annual ACM Symposium on Applied Computing. 761–768.
[13] Damir Vandic, Flavius Frasincar, Uzay Kaymak, and Mark Riezebos. 2020. Scalable entity resolution for Web product descriptions. Information Fusion 53 (2020), 103–111.