 ISCAS_ICIP at MWPD-2020 Task 1: Product Matching
     Based on Deep Entity Matching Frameworks

Cheng Fu1,3, Tianshu Wang1,3, Hao Nie1,3, Xianpei Han1,2 and Le Sun1,2

1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
2 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
3 University of Chinese Academy of Sciences

{fucheng, tianshu2020, niehao2016, xianpei, sunle}@iscas.ac.cn



       Abstract. This paper describes our product matching system developed for the Semantic Web Challenge on Mining HTML-embedded Product Data 2020 (Task 1). Product matching is the task of identifying product offers from different websites that refer to the same real-world product, which is a typical scenario of entity matching (EM). In our system, we implement four state-of-the-art deep learning-based entity matching models and integrate their results to obtain the final product matching predictions. Competition results show that our system achieves promising performance on this task.

       Keywords: Product Matching, Entity Matching, Semantic Web.

       Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1      Introduction

Product matching is the task of identifying product offers from different websites that refer to the same real-world product, which is a typical scenario of entity matching (EM). In this task, product matching is handled as a binary classification problem: given two product offers, decide whether they describe the same product (matching) or not (non-matching). It is critical for many downstream applications such as product knowledge graph construction, product search, and product recommendation.
Entity matching has been studied extensively since the 1950s [7], and a variety of methods for solving the EM problem have been proposed [8, 9]. Existing EM approaches can be roughly divided into two categories: rule-based and machine learning-based. Rule-based approaches resolve entity record pairs using matching rules given by domain experts [10] or learned automatically from labeled examples [11, 12, 13]. Machine learning (ML)-based approaches usually treat entity resolution as a classification problem [14]. Traditional ML approaches include SVM-based models [15], Markov logic-based methods [16], active learning-based solutions [17], etc. Recently, some deep learning-based methods have also been proposed for EM. One main advantage of such approaches is that they can better capture the semantic similarity between textual attributes and can effectively reduce human cost in the EM pipeline [1, 2, 5, 6, 18, 19].
   In our system, we implement four state-of-the-art deep learning-based entity matching models (MPM, Seq2SeqMatcher, HierMatcher and DITTO) and integrate their outputs to obtain the final product matching predictions.


2      System Overview

As shown in Fig. 1, our system consists of three pipeline modules: a pre-processing module, an entity matching module, and a post-processing module. The pre-processing module normalizes all attribute values and completes product entity information by extracting new attribute values. The entity matching module predicts whether two offers refer to the same product using end-to-end entity matching frameworks. The post-processing module refines the matching results produced by the entity matching module via heuristic rules.




                        Fig. 1. Overview of our proposed system.


2.1    Pre-processing

Value Normalization. We use different normalization strategies for the inputs of different subsequent modules. In most cases, given a raw textual value, we remove non-alphanumeric characters and stopwords using NLTK, and then lowercase all remaining tokens. However, when preparing data for model attribute extraction, we skip the lowercasing step and keep some special non-alphanumeric characters (such as “-” and “/”), which serve as important features for model extraction, as sketched below.
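   For concreteness, the following is a minimal Python sketch of the two normalization strategies; the function names and the exact character whitelists are our illustrative assumptions, not the exact code of our system.

# A minimal sketch of the two normalization strategies described above.
# Assumes NLTK's English stopword list is available (nltk.download("stopwords")).
import re
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def normalize_default(text):
    """Default strategy: strip non-alphanumeric characters, drop stopwords, lowercase."""
    tokens = re.sub(r"[^0-9A-Za-z ]+", " ", text).split()
    return " ".join(t.lower() for t in tokens if t.lower() not in STOPWORDS)

def normalize_for_model_extraction(text):
    """Model-extraction strategy: keep case and separators such as '-' and '/',
    which often occur inside model numbers (e.g., 'XPS-9370/A')."""
    tokens = re.sub(r"[^0-9A-Za-z\-/ ]+", " ", text).split()
    return " ".join(t for t in tokens if t.lower() not in STOPWORDS)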


Attribute Extraction. To complete product entity information, we attempt to extract two key attributes for each product offer: brand and model. In the raw datasets, brand is an existing attribute, but its value coverage is low (e.g., 57.4% in the official dataset). Model is a new attribute that usually helps in entity matching. For the brand attribute, we use a vocabulary-based extraction approach. Specifically, we first build a brand vocabulary based on the WDC Product Data Corpus [20] (Computers & Accessories domain in its English version), then use an exact matching strategy against this vocabulary to obtain a brand value for each product offer. When constructing the brand vocabulary, we use the following existing attributes in the corpus: "brand", "brand name", "merk", "manufacturer" and "marca". For the model attribute, we use two strategies. The first is vocabulary-based and similar to the brand extraction; the model vocabulary is built from the "Part Number" and "SKU" attributes of the same corpus. The second is pattern-based: about twenty Python regular expressions filter model candidates from the existing title and description attributes. After obtaining a candidate set of model values for each offer, we use TF-IDF to choose the final one, as sketched below.
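   The sketch below illustrates the pattern-based strategy and the final selection step; the two regular expressions stand in for the roughly twenty patterns of our system, and the "highest IDF weight wins" rule is our simplified reading of the TF-IDF selection.

# A hedged sketch of pattern-based model extraction and TF-IDF-style selection.
import math
import re

# Illustrative stand-ins for the ~20 regular expressions used in our system.
MODEL_PATTERNS = [
    re.compile(r"\b[A-Z]{2,}[-/]?\d{2,}[A-Z0-9/-]*\b"),  # e.g., "XPS-9370"
    re.compile(r"\b\d{2,}[-/]?[A-Z]{2,}[A-Z0-9/-]*\b"),  # e.g., "850-EVO"
]

def model_candidates(text):
    """Collect model-number candidates from a title or description."""
    return {m.group(0) for p in MODEL_PATTERNS for m in p.finditer(text)}

def choose_model(candidates, doc_freq, n_docs):
    """Pick the candidate with the highest IDF weight: rarer strings are more
    likely to be genuinely distinctive model numbers. doc_freq maps a string
    to its document frequency over the corpus."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: math.log(n_docs / (1 + doc_freq[c])))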

2.2    Entity Matching

In this module, we use four recently proposed EM models (MPM, Seq2SeqMatcher, HierMatcher and DITTO) to predict whether two offers refer to the same product, and then use a voting mechanism to integrate their predictions for each product offer pair (see the sketch below). Specifically, in Round 1 of the competition we integrate results from MPM, Seq2SeqMatcher and HierMatcher; in Round 2 we integrate results from MPM, HierMatcher and DITTO. The four models used in our system are introduced in detail below.
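   A minimal sketch of this voting step, assuming each model emits a binary label per offer pair:

def majority_vote(predictions):
    """predictions[m][i] is model m's 0/1 label for offer pair i;
    returns the majority label per pair."""
    n_models = len(predictions)
    return [int(2 * sum(p[i] for p in predictions) > n_models)
            for i in range(len(predictions[0]))]

# Example with three models and two pairs: pair 0 gets two "match" votes.
print(majority_vote([[1, 0], [1, 1], [0, 0]]))  # -> [1, 0]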




                Fig. 2. Framework of the MPM model used in our system.



MPM [1]. MPM is an end-to-end multi-perspective matching model for entity resolution, which can adaptively select the optimal similarity measure for heterogeneous attributes, and which jointly learns and selects similarity measures in an end-to-end way. As shown in Fig. 2, it uses a "compare-select-aggregate" neural framework: it first compares aligned attribute values from multiple perspectives using different similarity measures, then adaptively selects the optimal similarity measure for each attribute via a gate mechanism, and finally aggregates the comparison results of the selected similarity measures from all attributes to make the EM decision.
   In our system, we use four attributes of each product offer as input: brand, model, title and price. For each attribute, we use eight similarity measures of three types (the same as in [1]) to get multi-perspective comparison results, then adaptively select the optimal one. We use pretrained 300-dimensional FastText word embeddings [3] for the two DL-based similarity measures, rnn_sim and hybrid_sim. The hidden size of each GRU layer is set to 256.
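   The following PyTorch sketch illustrates the "select" step for a single attribute; the gate's input (an attribute representation h) and the dimensions are our assumptions, and [1] gives the actual formulation.

import torch
import torch.nn as nn

class SimilarityGate(nn.Module):
    """Gate over k similarity measures for one attribute (illustrative)."""
    def __init__(self, hidden_dim, num_measures):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_measures)

    def forward(self, h, sims):
        # h:    [batch, hidden_dim]   attribute representation
        # sims: [batch, num_measures] scores from the k similarity measures
        weights = torch.softmax(self.gate(h), dim=-1)  # select
        return (weights * sims).sum(dim=-1)            # per-attribute result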


              Fig. 3. Framework of the Seq2SeqMatcher model used in our system.

Seq2SeqMatcher [5]. Seq2SeqMatcher is a deep learning-based entity matching model that aims to handle heterogeneous and dirty cases effectively by modeling EM as a token-level sequence-to-sequence matching task. Fig. 3 shows its architecture, in which each record is linearized as a token sequence, and each token is a pair of the form (attribute, token). From the figure we can see that: 1) it compares records at the token level instead of the attribute level, so no attribute alignment knowledge is needed and heterogeneous schemas are handled naturally; and 2) tokens can be compared across attributes, and the contribution of each token to the final EM decision is learned automatically, so dirty cases can be resolved effectively.
   For this model, we use four attributes of each product offer as input: brand, model, title and price, and we use the same word embeddings and parameter settings as the original paper [5].
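   A minimal sketch of the linearization step, with an illustrative record layout:

def linearize(record):
    """Flatten a record (attribute -> value) into (attribute, token) pairs."""
    return [(attr, tok) for attr, value in record.items()
            for tok in value.split()]

offer = {"brand": "dell", "title": "dell xps 13 laptop", "price": "999.99"}
print(linearize(offer))
# -> [('brand', 'dell'), ('title', 'dell'), ('title', 'xps'), ...]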




              Fig. 4. Framework of the HierMatcher model used in our system.
HierMatcher [6]. HierMatcher is a hierarchical matching network also designed to resolve heterogeneous and dirty entity matching problems. As shown in Fig. 4, it jointly models entity matching at three levels (token, attribute and entity) in a unified neural framework. At the token level, it constructs a cross-attribute token alignment module: by selecting comparison objects for all tokens across all attributes, it can effectively address schema heterogeneity and misplaced-type dirty data. At the attribute level, it uses an attribute-aware attention mechanism that learns to identify the important information for each attribute, and can therefore effectively resolve redundant-type and noisy-type dirty data. Furthermore, by collecting matching evidence level by level, i.e., aggregating comparison results from the token level to the attribute level and then to the entity level, it fully exploits the hierarchical structure of entities.
   For this model, we use four attributes of each product offer as input: brand, model, title and price, and we use the same word embeddings and parameter settings as the original paper [6].




                Fig. 5. Framework of the DITTO model used in our system.



DITTO [2]. DITTO is an entity matching system built on pre-trained Transformer-based language models. It casts EM as a sequence-pair classification problem and fine-tunes such models with a simple architecture. As shown in Fig. 5, given two entities, DITTO serializes them as one sequence and feeds it to the model as input. The model consists of (1) token embeddings and Transformer layers from a pre-trained language model (e.g., BERT) and (2) task-specific layers (linear followed by softmax). Conceptually, the [CLS] token "summarizes" all the contextual information needed for matching as a contextualized embedding vector E′[CLS], which the task-specific layers take as input for classification.
   In our system, we fine-tune our EM model on an uncased 12-layer DistilBERT [4] pre-trained model. We fix the learning rate to 1e-5 and the maximum sequence length to 512. For each product offer, we use the following four attributes for matching: brand, model, title and price.
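   The sketch below shows DITTO-style serialization and a single classification pass with an uncased DistilBERT via the HuggingFace transformers library; the [COL]/[VAL] serialization follows [2], while the example offers and the omission of the fine-tuning loop are our simplifications.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def serialize(offer):
    """Serialize an offer as '[COL] attr [VAL] value ...', as in [2]."""
    return " ".join(f"[COL] {a} [VAL] {v}" for a, v in offer.items())

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # fine-tuning loop omitted

left = {"brand": "dell", "model": "xps13", "title": "dell xps 13", "price": "999"}
right = {"brand": "dell", "model": "xps13", "title": "xps 13 notebook", "price": "989"}

# Encode the serialized pair as one input sequence, capped at 512 tokens.
inputs = tokenizer(serialize(left), serialize(right),
                   truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, 2]: non-match vs. match
pred = int(logits.argmax(dim=-1))    # 1 => matching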

2.3    Post-processing

The post-processing module corrects some obvious errors in the prediction results output by the entity matching module using heuristic rules: for example, pairs with exactly the same title but predicted as non-matching are flipped to matching, and pairs with different brands but predicted as matching are flipped to non-matching (a sketch follows).
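   A sketch of these two corrections; the field names are illustrative, and the paper's "for example" suggests the actual rule set is richer:

def post_process(left, right, pred):
    """Flip obviously wrong predictions; pred is the 0/1 vote result."""
    # Identical non-empty titles predicted as non-matching -> matching.
    if pred == 0 and left["title"] and left["title"] == right["title"]:
        return 1
    # Two different known brands predicted as matching -> non-matching.
    if pred == 1 and left["brand"] and right["brand"] \
            and left["brand"] != right["brand"]:
        return 0
    return pred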


3      Data

In this competition, the officially released training and validation sets are much easier than the test set. Furthermore, they do not cover the product offers contained in the test set well. Therefore, we construct an extended training dataset and an extended validation dataset that are harder and cover the product offers in the test set better. Our extended training set contains 138,461 pairs: 68,461 pairs from the official training set and 70,000 new hard pairs. Our extended validation set contains 6,100 pairs: 1,100 pairs from the official validation set and 5,000 new hard pairs. The additional hard instances (product offer pairs) are obtained by the following two steps.


Step 1: Initial candidate set construction. We first sample 10,000 clusters from a subset of the WDC Product Data Corpus (English version) in which every product belongs to the Computers & Accessories category. We then construct a large dataset of 914,878 product offer pairs using the same strategy as the one used to construct the official training dataset.


Step 2: Hard instance selection. Given each offer pair from the initial candidate set, we use the Jaccard similarity of the two offer titles to select hard samples. Specifically, a positive sample is considered hard if the title similarity of its offers is less than 0.4, and a negative sample is considered hard if the title similarity of its offers is more than 0.6 (see the sketch below). Finally, we randomly sample the required numbers of hard pairs for the extended training and validation sets described above; the positive/negative ratio in the sampled hard pairs is 3:7.
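   A minimal sketch of this selection step, using the thresholds above:

def jaccard(title_a, title_b):
    """Jaccard similarity over whitespace tokens of two titles."""
    sa, sb = set(title_a.split()), set(title_b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def is_hard(title_a, title_b, label):
    """label is 1 for a positive pair, 0 for a negative pair."""
    sim = jaccard(title_a, title_b)
    return sim < 0.4 if label == 1 else sim > 0.6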


4      Evaluation

Table 1 reports the results of our systems in the two rounds of the competition. ISCAS-ICIP is our Round 1 system, integrating the results of MPM, Seq2SeqMatcher and HierMatcher; ISCAS-ICIP (R2) is our Round 2 system, integrating the results of MPM, HierMatcher and DITTO. The table shows that our systems significantly outperform the baseline system built on DeepMatcher.
                      Table 1. Results of our systems in the competition.

                               Systems            Precision       Recall     F1
                        MPM                         80.43         82.66     81.53
                        Seq2SeqMatcher              82.26          81.17    81.71
        Base models
                        HierMatcher                 82.61          82.07    82.34
                        DITTO                       84.48          83.17    83.82
                        ISCAS-ICIP                  83.89          81.33    82.59
        Our systems
                        ISCAS-ICIP (R2)             85.77          84.95    85.36
          Baseline      DeepMatcher                 70.89          74.67    72.73

Specifically, ISCAS-ICIP and ISCAS-ICIP (R2) achieve F1 improvements of 9.8 and 12.6 points on the test set, respectively. This demonstrates that our systems achieve promising performance on the product matching task. Moreover, in both rounds our system outperforms all of the base models it integrates, which demonstrates the effectiveness of our integration strategy.


5      Conclusion

This paper describes our product matching system developed for the Semantic Web Challenge on Mining HTML-embedded Product Data 2020 (Task 1). In our system, we implement four state-of-the-art deep learning-based entity matching models (MPM, Seq2SeqMatcher, HierMatcher and DITTO) and integrate their results to obtain the final product matching predictions. Competition results show that our system achieves promising performance on this task.

Acknowledgements. This work is supported by the National Natural Science Foundation of China under Grants No. U1936207 and 61772505, the National Key Research and Development Program of China under Grant No. 2017YFB1002104, and the Beijing Academy of Artificial Intelligence (BAAI2019QN0502).


References
 1. Cheng Fu, Xianpei Han, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong. End-to-
    end multiperspective matching for entity resolution. In Proceedings of the IJCAI, pages
    4961–4967. AAAI Press, 2019.
 2. Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. Deep entity matching with pre-trained language models. arXiv preprint arXiv:2004.00584, 2020.
 3. Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov. Enriching Word Vec-
    tors with Subword Information. CoRR abs/1607.04606 (2016).
 4. V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proc. EMC2 '19, 2019.
 5. Hao Nie, Xianpei Han, Ben He, Le Sun,Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong.
    Deep sequence-to-sequence entity matching for heterogeneous entity resolution. In Pro-
    ceedings of CIKM, pages 629–638, 2019
 6. Cheng Fu, Xianpei Han, Jiaming He, Le Sun. Hierarchical Matching Network for Hetero-
    geneous Entity Resolution. In Proceedings of the IJCAI,2020.
 7. Howard B Newcombe, James M Kennedy, SJ Axford, and Allison P James. Automatic
    linkage of vital records. Science, 130(3381):954–959, 1959.
 8. AnHai Doan and Alon Y Halevy. Semantic integration research in the database community: A brief survey. AI Magazine, 26(1):83–94, 2005.
 9. Nick Koudas, Sunita Sarawagi, and Divesh Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the ACM SIGMOD, pages 802–803. ACM, 2006.
10. Mauricio A Hernández and Salvatore J Stolfo. The merge/purge problem for large databases. In ACM SIGMOD Record, volume 24, pages 127–138. ACM, 1995.
11. Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, and Raghav Kaushik. Example-driven design of efficient record matching queries. In PVLDB, pages 327–338. VLDB Endowment, 2007.
12. Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng. Entity matching: How simi-
    lar is similar. PVLDB, 4(10):622–633, 2011.
13. Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. Synthesizing entity matching rules by examples. PVLDB, 11(2):189–202, 2017.
14. Ivan P Fellegi and Alan B Sunter. A theory for record linkage. Journal of the American
    Statistical Association, 64(328):1183–1210, 1969.
15. Mikhail Bilenko and Raymond J Mooney. Adaptive duplicate detection using learnable
    string similarity measures. In Proceedings of the ACM SIGKDD, pages 39–48. ACM,
    2003.
16. Parag Singla and Pedro Domingos. Entity resolution with Markov logic. In Proceedings of the ICDM, pages 572–582. IEEE, 2006.
17. Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. In Proceedings of the ACM SIGKDD, pages 269–278. ACM, 2002.
18. Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and
    Nan Tang. Distributed representations of tuples for entity resolution. PVLDB,
    11(11):1454–1467, 2018.
19. Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. Deep learning for entity matching: A design space exploration. In Proceedings of the ACM SIGMOD, pages 19–34. ACM, 2018.
20. Primpeli, A., Peeters, R., Bizer, C.: The WDC training dataset and gold standard for large-scale product matching. In: Companion Proceedings of the 2019 World Wide Web Conference, pp. 381–386. ACM (2019).