=Paper= {{Paper |id=Vol-2726/paper6 |storemode=property |title=An Extensible Block Scheme-Based Method for Entity Matching |pdfUrl=https://ceur-ws.org/Vol-2726/paper6.pdf |volume=Vol-2726 |authors=Jiawei Wang,Haizhou Ye,Jianhui Huang |dblpUrl=https://dblp.org/rec/conf/vldb/WangYH20 }} ==An Extensible Block Scheme-Based Method for Entity Matching== https://ceur-ws.org/Vol-2726/paper6.pdf
An Extensible Block Scheme-Based Method for Entity Matching
                                                         DI2KG 2020 Challenge Winner Paper

                    Jiawei Wang∗, Haizhou Ye∗, Jianhui Huang∗
              College of Cybersecurity, Jinan University, Guangzhou, China
        jiaweiwang.aa@qq.com, 13632259569@163.com, 924285628@qq.com

ABSTRACT
Entity resolution (ER) is of great importance in many real-world applications, including information retrieval, web search, natural language processing, web information integration, multi-source data analysis, etc. In this paper, we present an extensible block scheme-based method for entity resolution. Specifically, we first preprocess data records from multiple datasets, so as to remove ambiguity and retain the data attributes that are helpful for entity resolution. Then, the processed records are assigned to different blocks, after which the similarity between records within the same block is computed. Experiment results on the challenge datasets from DI2KG show that our proposed method is promising.

KEYWORDS
Entity Matching, Block Scheme, Block Aggregation, Entity Similarity

1 INTRODUCTION
Entity Resolution (ER) matches data records from two or more data sources by analyzing the contents that describe identical real-world entities [5]. It remains a challenging problem in data management [9], information retrieval [8], data mining [7], natural language processing, etc. Entity resolution has a wide range of applications, including data cleansing, electronic commerce [6], web search, and public/government data analysis [11].
   In this paper, we focus on the problem of matching products recorded on electronic commerce websites. Performing the entity matching task on an e-commerce database is challenging because 1) the product data is highly heterogeneous; 2) the structure of the data is loosely defined; and 3) there are multiple data sources (for example, in the DI2KG challenge we need to process data sources from different stores). One of the critical missions of ER on large data sets is to reduce the computational complexity and thereby improve efficiency. Currently, there are three prominent solutions, i.e., Blocking [10], Windowing [4], and Hybrid methods [3]. In this paper we use the first one. By assigning each product to one or more blocks and computing the similarity only for pairs of products within the same block, blocking schemes reduce the number of pair-wise similarities that need to be computed. Due to the challenges of ER on multi-source data, we need to solve several problems, as described below:
     • How to perform data preprocessing and cleansing
     • How to design an efficient block scheme and block aggregation method
     • How to compute entity similarity within a block, and how to determine transitivity of similarity between entities
   Inspired by previous work on rule-based ER methods [2], we develop a scalable ER system framework. Specifically, the core of our framework consists of a block scheme for the data set and similarity computation of the product pairs within a block. The block scheme includes various preprocessing steps, different keyword extraction methods, as well as multiple block scheme aggregators. Using our block scheme-based framework, we solve the ER tasks of the DI2KG 2020 Challenge that involve the product entities Monitor and Notebook.
   The paper is organized as follows. In Section 1 we give an introduction to entity resolution and the challenge task in DI2KG. We present our framework for entity resolution in Section 2, including its three main components. In Section 3, we conduct experiments on the challenge datasets to evaluate the effectiveness of our method. Finally, in Section 4 we summarize the paper and outline our future work on entity resolution.

∗ All authors contributed equally to this research.

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 THE PROPOSED METHOD
Our method is based on a block scheme and scheme aggregation. As shown in Figure 1, our extensible entity resolution system framework mainly includes three submodules, i.e., Preprocessing; Block Scheme and Scheme Aggregation (BS-SA); and Similarity and Clustering. In this section, we describe the three submodules in detail.

Figure 1: Our extensible entity resolution system framework
2.1 Data Preprocessing
In the data preprocessing step, we remove some less important terms from page titles, and less informative properties from the product description text. Specifically, from Table 1 we can see that there are differences between product profiles. For example, the first product possesses a certain property while the second product contains neither this property nor the corresponding value. We also observe that for the same product, the model names are the same. So, we try to find model names for entity matching. Before finding the model name, there are two problems to deal with, as given below:
     • Which property contains the model name
     • How to deal with the situation in which model words cannot be extracted from the predetermined key-value pairs

                  Table 1: Example Raw Data

  property              value
  <page title>          New ELO 1928L 19inch TFT 1280x1024 LCD Non Touch Dual Monitor E939583 7411493021428 | eBay
  aspect ratio          5:4
  brand                 Elo
  contrast ratio        1300:1
  model                 1928L
  screen size           19
  <page title>          Elo TouchSystems 1928L – MBE
  contrast ratio        1300:1 typical
  display resolution    1280 x 1024 pixels
  horizontal scan range 31.5 - 80 kHz
  model name            1928L

          Table 2: Example Data after Preprocessing

  property              value
  <page title>          new elo 1928l tft lcd non touch dual monitor e939583 7411493021428 | ebay
  brand                 elo
  model                 1928l
  <page title>          elo touchsystems 1928l – mbe
  model name            1928l

   To solve the above problems, some preprocessing steps are necessary first, such as lowercasing strings and removing some special symbols. Generally, a model name has at least one number and one non-numeric character. Hence, we use a regular expression to match the model name. We find that the model name can be obtained precisely from properties such as model, model name, and so on, and we retain these properties as new descriptions of the product. When one is unable to find these properties, we consider the property '<page title>', because part of the information in the '<page title>' needs to be removed, such as the resolution and screen size. On the other hand, we consider the situation in which the model name is separated by spaces, so we remove the space between non-numeric characters and numeric characters to create a model word (i.e., if the model name is something like 'P 230', we change it to 'p230').

2.2 Block Scheme and Scheme Aggregation
After the data preprocessing step, the processed data is passed to the Block Scheme and Scheme Aggregation submodule, through which each product is assigned to a block. Referring to Figure 1, this submodule consists of two parts. First, the products are classified based on a brand list obtained from the Internet [2] and supplemented and modified manually. As noted in [12], the brand list provides the algorithm with task-specific information. Then, for the products of each brand, we classify the products into different blocks by extracting model words and using the block aggregator [13].
   In this challenge, a model word is defined as a word that contains at least one number and one non-numeric character, and the entire word is extracted as one model word. We adopt two block schemes, i.e., extracting a single model word from the title and from the description, and then aggregating the two schemes with the union aggregator. Eventually, all documents of each brand are mapped to multiple product entities, and each document is mapped to only one product entity.

2.3 Entity Similarity within Blocks
The method we propose in this paper for computing entity similarity is based on model words and set comparison. The definition of a model word is as described in Section 2.2, and we compare the model word set of product a and the model word set of product b to measure the similarity of the two products. Our proposed similarity function consists of two parts.
   The first part evaluates whether the models of products a and b are the same, using the function diffModel(a, b, M). Here, M is a list we constructed from the product descriptions, which consists of the attributes in which the model may appear. We count the model words extracted from the attributes in the list M, and select the model word that appears most frequently as the product model. Moreover, if there is no model word at all, we take the first model word from '<page title>' as the model. The rationale is that the product model usually appears first in the title.
   The second part determines whether synonymous attributes of product a and product b are similar, by calling the function mwSim(a, b, Li, Oi) multiple times. Here, the parameter L = [L1, ..., Ln], where each Li ∈ L contains synonymous attributes, and O = [O1, ..., On], where each operator Oi ∈ O is either 'and' or 'or' and is applied to Li. Each time mwSim(a, b, Li, Oi) is invoked, we determine the relationship between the two model word sets extracted from the attributes in list Li of the two products, according to the operator Oi. Specifically, if Oi is 'and', then the two sets must be the same; otherwise, one set must be a subset of the other. For example, a parameter configuration is shown below:

   mwSim(a, b, ['<page title>', 'model', 'model name'], 'or')
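The preprocessing and model-word rules of Sections 2.1 and 2.2 can be sketched in Python as follows. This is our own illustrative sketch: the function names and the exact symbol set kept by the regular expression are assumptions, not the authors' implementation, and the dictionary-based removal of terms such as resolution and screen size from page titles is not shown.

```python
import re

def preprocess(text):
    """Lowercase the string and strip special symbols, keeping
    alphanumerics plus '|' and ':' (our illustrative choice)."""
    text = re.sub(r"[^a-z0-9|: ]+", " ", text.lower())
    return " ".join(text.split())

def normalize_model(name):
    """Join a model name that was split by spaces, e.g. 'P 230' -> 'p230':
    the space between a non-numeric and a numeric character is removed."""
    return re.sub(r"(?<=\D) (?=\d)|(?<=\d) (?=\D)", "", name.lower())

def is_model_word(token):
    """A model word has at least one number and one non-numeric character."""
    return any(c.isdigit() for c in token) and any(not c.isdigit() for c in token)

def model_words(text):
    """Extract the set of model words from a preprocessed string."""
    return {t for t in text.split() if is_model_word(t)}
```

For the first example record, `model_words(preprocess('New ELO 1928L LCD Monitor E939583'))` yields `{'1928l', 'e939583'}`; the purely numeric EAN `7411493021428` is not a model word because it lacks a non-numeric character.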
   Algorithm 1 shows a high-level overview of how to compute product similarity. Specifically, if the model names of the two products are the same, the algorithm compares whether certain attributes of the products are similar, according to the parameters L and O.

Algorithm 1 Product Similarity Method
Input: Product profiles a and b; list L = [L1, ..., Ln], where each Li ∈ L contains synonymous attributes; list O = [O1, ..., On], where each operator Oi ∈ O is either 'and' or 'or' and corresponds to Li; attributes list M. Furthermore, the following functions are used:
      - diffModel(a, b, M) is true if the models of products a and b are different.
      - mwSim(a, b, Li, Oi) is the model word similarity between products a and b, using the key list Li and operator Oi.
Output: True if product a and product b are the same; otherwise, False.
 1: function ProdSim(a, b, M, L, O)
 2:     if diffModel(a, b, M) then return False
 3:     end if
 4:     for all Li, Oi in L, O do
 5:         if not mwSim(a, b, Li, Oi) then return False
 6:         end if
 7:     end for
 8:     return True
 9: end function

   Algorithm 2 judges whether the two products have the same model. It takes the most frequently occurring model word, or the first model word in the title, as the model name, and compares the model names of the two products.

Algorithm 2 Model Comparison Method
Input: Product profiles a and b; attributes list M. Furthermore, the following functions are used:
      - mw(a, M) extracts the model words from the key-value pairs of product a whose keys appear in list M.
      - MaxModel(L) finds the model word that appears most frequently in the model word list L.
      - isEmpty(L) is true if L is empty.
Output: True if the models of product a and product b are different; otherwise, False.
 1: function diffModel(a, b, M)
 2:     descriptionAmw = mw(a, M)
 3:     descriptionBmw = mw(b, M)
 4:     titleAmw = mw(a, ['<page title>'])
 5:     titleBmw = mw(b, ['<page title>'])
 6:     if isEmpty(descriptionAmw) then
 7:         if titleAmw[0] in titleBmw then return False
 8:         end if
 9:         modelA = titleAmw[0]
10:     else
11:         modelA = MaxModel(descriptionAmw)
12:     end if
13:     if isEmpty(descriptionBmw) then
14:         if titleBmw[0] in titleAmw then return False
15:         end if
16:         modelB = titleBmw[0]
17:     else
18:         modelB = MaxModel(descriptionBmw)
19:     end if
20:     return not (modelA == modelB)
21: end function

   Function mwSim(a, b, Li, Oi) in Algorithm 3 is used to compute the similarity between the descriptions of products a and b, according to some special product attributes. The number of times this function is called depends on the length of the parameters L and O. In this algorithm, the model words from some attributes of products a and b are compared. If Oi is 'and', the model word sets must be the same; otherwise, one model word set must be a subset of the other.

Algorithm 3 Model Word Similarity Method
Input: Product profiles a and b; synonymous attributes list Li; operator Oi, which is either 'and' or 'or' and corresponds to Li; mw(a, Li) extracts the model words from the attributes of product a listed in Li.
Output: True if the model words of products a and b extracted from Li are the same under operator Oi; otherwise, False.
 1: function mwSim(a, b, Li, Oi)
 2:     setA, setB = mw(a, Li), mw(b, Li)
 3:     if Oi == 'and' then
 4:         if setA.issubset(setB) and setB.issubset(setA) then
 5:             return True
 6:         end if
 7:     end if
 8:     if Oi == 'or' then
 9:         if setA.issubset(setB) or setB.issubset(setA) then
10:             return True
11:         end if
12:     end if
13:     return False
14: end function

   Due to the transitivity property between products, if A is similar to B and B is similar to C, then we regard A as similar to C. We perform clustering while calculating similarity, instead of clustering only after calculating the similarity between all products. This may sacrifice a little accuracy, but it significantly improves efficiency.
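As a concrete illustration, the three procedures can be translated into Python roughly as follows. This is a sketch under our own naming assumptions (products are represented as attribute dicts, and `mw` stands for the model-word extractor applied to the listed attributes); the guards against empty title model-word lists are our additions for robustness.

```python
from collections import Counter

def mw(product, keys):
    """Model words found in the values of the given attribute keys.
    A model word mixes digits and non-digits, e.g. '1928l'."""
    words = []
    for k in keys:
        for t in str(product.get(k, '')).lower().split():
            if any(c.isdigit() for c in t) and any(not c.isdigit() for c in t):
                words.append(t)
    return words

def diff_model(a, b, M):
    """Algorithm 2: True if the two products' models differ."""
    desc_a, desc_b = mw(a, M), mw(b, M)
    title_a, title_b = mw(a, ['<page title>']), mw(b, ['<page title>'])
    if not desc_a:
        if title_a and title_a[0] in title_b:
            return False                      # first title model word shared
        model_a = title_a[0] if title_a else None
    else:
        model_a = Counter(desc_a).most_common(1)[0][0]
    if not desc_b:
        if title_b and title_b[0] in title_a:
            return False
        model_b = title_b[0] if title_b else None
    else:
        model_b = Counter(desc_b).most_common(1)[0][0]
    return model_a != model_b

def mw_sim(a, b, Li, Oi):
    """Algorithm 3: compare model-word sets under 'and' (equality)
    or 'or' (subset in either direction)."""
    set_a, set_b = set(mw(a, Li)), set(mw(b, Li))
    if Oi == 'and':
        return set_a == set_b
    return set_a <= set_b or set_b <= set_a

def prod_sim(a, b, M, L, O):
    """Algorithm 1: True if products a and b are considered the same."""
    if diff_model(a, b, M):
        return False
    return all(mw_sim(a, b, Li, Oi) for Li, Oi in zip(L, O))
```

On the running example, the two Elo profiles match: both yield model `1928l` from their description attributes, and the model-word set of the second title is a subset of that of the first.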
                                                                            9:         if 𝑠𝑒𝑡𝐴.𝑖𝑠𝑠𝑢𝑏𝑠𝑒𝑡 (𝑠𝑒𝑡𝐵) or 𝑠𝑒𝑡𝐵.𝑖𝑠𝑠𝑢𝑏𝑠𝑒𝑡 (𝑠𝑒𝑡𝐴) then
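The "clustering while matching" strategy is naturally realized with a disjoint-set (union-find) structure, which makes similarity propagate transitively as pairs are processed. The sketch below is our illustration of the idea, not the authors' code; `similar` stands for any pairwise predicate such as ProdSim applied within one block.

```python
class DSU:
    """Disjoint-set union for incremental, transitive clustering."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def cluster_block(ids, similar):
    """Cluster the records of one block: merge every similar pair,
    so similarity propagates transitively (A~B and B~C imply A~C)."""
    dsu = DSU()
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            # Skip the comparison when both records already share a
            # cluster -- the efficiency gain of clustering while
            # matching instead of after computing all similarities.
            if dsu.find(x) != dsu.find(y) and similar(x, y):
                dsu.union(x, y)
    clusters = {}
    for x in ids:
        clusters.setdefault(dsu.find(x), []).append(x)
    return list(clusters.values())
```

For example, if only the pairs (A, B) and (B, C) test similar, `cluster_block` still places A, B, and C in one cluster, and the (A, C) comparison is skipped entirely.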
3 EXPERIMENT RESULTS
To evaluate the performance of our proposed method, we conduct experiments on the challenge datasets provided by DI2KG. We summarize the best experiment results of our proposed method in Table 3: on the training Dataset Y, the Precision, Recall, and F-Measure are 1.000, 0.964, and 0.982, respectively, whereas on Dataset X the Precision, Recall, and F-Measure are 0.915, 0.967, and 0.940, respectively.

Table 3: Performance of Our Method on Challenge Datasets

              Precision    Recall    F-Measure
  Dataset Y     1.000      0.964       0.982
  Dataset X     0.915      0.967       0.940

   In addition, we tested our algorithm on the camera dataset, where the Precision, Recall, and F-Measure are 0.98, 0.97, and 0.98 [1]. In general, we can see that our algorithm achieves good performance on both data sets, that is, it has good extensibility. To suit different datasets, we can set parameters such as the brand list, the product descriptions list, and the synonymous attributes list, which can be obtained based on rules or machine learning. It is worth mentioning that our algorithm is mainly based on model words; thus, it works best for data sets that mainly rely on product models for matching, such as cameras, displays, and other electronic products.
   Our observation is that the key to data preprocessing is to remove redundant information. Since Algorithm 2 tries to use the first model word in '<page title>' for comparison, it is highly recommended that the first model word in the '<page title>' be the model number of the product.
   On the other hand, in the block scheme, acquiring brand information requires some manual intervention. Although it can be obtained through machine learning or rule-based methods, we found that there are some special circumstances that need to be dealt with. For instance, Alienware was acquired by Dell, so Alienware and Dell belong to the same brand. Since brand classification is the first step in BS-SA, manual intervention for such special situations can improve the recall rate at low cost.

4 CONCLUSION
In this paper, we proposed an extensible block scheme-based method for entity resolution. Specifically, our method consists of three components, i.e., data preprocessing; block scheme and scheme aggregation; and similarity and clustering. Experiments on the DI2KG challenge datasets show that our method achieves F-Measure values of 0.982 and 0.940 on Dataset Y and Dataset X, respectively.
   In our future work, we intend to incorporate machine learning techniques and consider weighting key properties, so as to find as many of the best pairs of products as possible. By using those strategies, we can make full use of the training datasets, by mapping product similarity to the range [0, 1].

ACKNOWLEDGMENTS
The authors would like to thank the reviewers for their comments. This paper is partially supported by NSFC under grant number 61972177.

REFERENCES
 [1] [n.d.]. ACM SIGMOD 2020 Programming Contest Leaders. http://www.inf.uniroma3.it/db/sigmod2020contest/leaders.html.
 [2] [n.d.]. Wikipedia: The free encyclopedia. https://en.wikipedia.org/wiki/Category:Lists_of_consumer_electronics_manufacturers.
 [3] Youssef Aassem, Imad Hafidi, and Noureddine Aboutabit. 2020. Enhanced Duplicate Count Strategy: Towards New Algorithms to Improve Duplicate Detection. In Proceedings of the 3rd International Conference on Networking, Information Systems & Security. 1–7.
 [4] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
 [5] Gianni Costa, Giuseppe Manco, and Riccardo Ortale. 2010. An incremental clustering scheme for data de-duplication. Data Mining and Knowledge Discovery 20, 1 (2010), 152.
 [6] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan Rampalli, Jude Shavlik, and Xiaojin Zhu. 2014. Corleone: Hands-off crowdsourcing for entity matching. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 601–612.
 [7] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource deep entity resolution with transfer and active learning. arXiv preprint arXiv:1906.08042 (2019).
 [8] Selasi Kwashie, Lin Liu, Jixue Liu, Markus Stumptner, Jiuyong Li, and Lujing Yang. 2019. Certus: An effective entity resolution approach with graph differential dependencies (GDDs). Proceedings of the VLDB Endowment 12, 6 (2019), 653–666.
 [9] Guoliang Li, Jiannan Wang, Yudian Zheng, and Michael J. Franklin. 2016. Crowdsourced data management: A survey. IEEE Transactions on Knowledge and Data Engineering 28, 9 (2016), 2296–2319.
[10] George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederee, and Wolfgang Nejdl. 2012. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Transactions on Knowledge and Data Engineering 25, 12 (2012), 2665–2682.
[11] Tomer Sagi, Avigdor Gal, Omer Barkol, Ruth Bergman, and Alexander Avram. 2017. Multi-source uncertain entity resolution: Transforming holocaust victim reports into people. Information Systems 65 (2017), 124–136.
[12] Ronald Van Bezu, Sjoerd Borst, Rick Rijkse, Jim Verhagen, Damir Vandic, and Flavius Frasincar. 2015. Multi-component similarity method for web product duplicate detection. In Proceedings of the 30th Annual ACM Symposium on Applied Computing. 761–768.
[13] Damir Vandic, Flavius Frasincar, Uzay Kaymak, and Mark Riezebos. 2020. Scalable entity resolution for Web product descriptions. Information Fusion 53 (2020), 103–111.