PMap: Ensemble Pre-training Models for
               Product Matching

                Natthawut Kertkeidkachorn¹ and Ryutaro Ichise²,¹

        ¹ National Institute of Advanced Industrial Science and Technology,
                                 Tokyo 135-0064, Japan
        ² National Institute of Informatics, Tokyo 101-8430, Japan
                 n.kertkeidkachorn@aist.go.jp, ichise@nii.ac.jp


      Abstract. The Mining the Web of HTML-embedded Product Data (MWPD)
      Challenge aims to benchmark methods for two e-commerce data
      integration tasks: 1) Product Matching and 2) Product Classification.
      In this paper, we present the design of our system, PMap, for the
      Product Matching task of the MWPD Challenge. PMap aggregates the
      results of several state-of-the-art pre-trained models to identify
      identical products. Results on the MWPD dataset show that PMap
      outperforms the baseline and achieves promising performance on the
      product matching task. The code and the system's outputs are available.³


1   Introduction
Due to the growth of online shops in the e-commerce domain, semantic annotation
plays a key role in enhancing the accessibility and visibility of products.
Annotating products with a semantic markup language helps a search engine
retrieve products that meet a user's expectations. However, annotated products
suffer from inconsistency and heterogeneity across cross-sector e-commerce
vendors. As a result, the information about a product can even be conflicting.
Furthermore, without a clear benchmark, it is hard to judge the progress of the
methods in this field. To address these challenges, the Mining the Web of
HTML-embedded Product Data (MWPD) challenge⁴ was introduced. The goal of the
MWPD challenge is to provide a benchmark for methods dealing with two
fundamental tasks in e-commerce data integration: 1) Product Matching and
2) Product Classification.
    In this study, we focus on the Product Matching task. Product Matching is
the task of matching product offers from different websites that refer to the
same real-world product. To deal with the Product Matching task, we introduce
an ensemble of pre-trained models, namely PMap. PMap takes advantage of
contextualized-embedding pre-trained models together with an aggregation
strategy in order to identify identical products.
  Copyright © 2020 for this paper by its authors. Use permitted under Creative
  Commons License Attribution 4.0 International (CC BY 4.0).
³ http://github.com/knatthawut/mwpd
⁴ https://ir-ischool-uos.github.io/mwpd/


Fig. 1. Samples of product offers from the MWPD challenge for the product
matching task [1].


    The rest of the paper is organized as follows. We describe the problem setting
of the product matching on the MWPD challenge in Section 2. Section 3 reports
the design of our approach. In Section 4, the experimental setup details and the
experimental results are presented. We then survey the related work in Section
5. In Section 6, we conclude our work.

2    Problem Setting
A product offer is a collection of textual attributes that describes a real-world
product. Generally, product offers are published as product descriptions with
specification tables, i.e., HTML tables that describe specifications of the offer
such as the price or brand of the product. Samples of product offers are
presented in Figure 1.
     Product Matching in the MWPD challenge is the task of classifying whether
two given product offers are identical, i.e., whether the two product offers
refer to the same real-world object. We can formulate the Product Matching
problem as follows:
     Let $D$ and $D'$ be two collections of product offers from different
resources. We assume that product offers in $D$ and $D'$ have the same schema,
i.e., a product offer is described by the same set of attributes $A$. Given
$D = \{P_{D_1}, P_{D_2}, P_{D_3}, \ldots, P_{D_n}\}$ and
$D' = \{P_{D'_1}, P_{D'_2}, P_{D'_3}, \ldots, P_{D'_n}\}$, where $P_{D_i}$ is
the $i$-th product offer of $D$ and $P_{D'_i}$ is the $i$-th product offer of
$D'$, the objective of product matching is to model the function
$f : (P_{D_i}, P_{D'_i}) \rightarrow \{0, 1\}$. If the two product offers refer
to the same object, the function $f(\cdot)$ returns 1, otherwise 0.
     For example, in Figure 1, product offers a and c are from $D$, and product
offers b and d are from $D'$. The pairs of product offers (a, b) and (c, d) are
given. The pair (a, b) is a matching pair ($f(a, b) = 1$), while the pair
(c, d) is a non-matching pair ($f(c, d) = 0$).


                       Fig. 2. The design pipeline of PMap


3     Approach

We design our system (PMap) as a three-step pipeline. As shown in Figure 2, the
pipeline consists of 1) Pre-processing, 2) Fine-tuning Pre-train Models, and 3)
Ensemble Models. The details of each step are as follows.


3.1   Pre-processing

In the MWPD challenge, the WDC Product Data Corpus⁵ is used as the dataset. It
is derived from the Web Data Commons⁶, which is extracted from the Common
Crawl⁷ by using schema.org annotations. Although some cleaning steps have
already been applied to the dataset [6], we found that further pre-processing
is still necessary because of character-encoding issues and stray symbols in
the data. To pre-process the dataset, we remove symbols and non-alphabetic
characters by using a simple regular expression.
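
The cleaning step can be sketched as follows. The exact regular expression is not given in the text, so the pattern below (keep ASCII letters and whitespace, drop everything else) and the function name `clean_title` are assumptions made for illustration; in particular, whether digits were kept is not stated.

```python
import re

# Assumed pattern: any run of characters that are neither ASCII letters
# nor whitespace. The text only says "a simple regular expression".
_NON_ALPHA = re.compile(r"[^A-Za-z\s]+")


def clean_title(text: str) -> str:
    """Replace symbols and non-alphabetic characters with spaces,
    then collapse repeated whitespace."""
    replaced = _NON_ALPHA.sub(" ", text)
    return " ".join(replaced.split())


if __name__ == "__main__":
    print(clean_title("Sony® WH-1000XM4 / Headphones!!"))
```

Under this assumed pattern, model numbers lose their digits, which may or may not match the behavior of the original system.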


3.2   Fine-tuning Pre-train Models

Fine-tuning Pre-train Models is the core step of PMap. In this section, we explain
the pre-trained models and how we fine-tune them.
⁵ http://webdatacommons.org/largescaleproductcorpus/v2/index.html
⁶ http://webdatacommons.org/structureddata/
⁷ https://commoncrawl.org


Fig. 3. Illustration of the fine-tuning pre-train models for the product matching task.


Pre-train Models, also known as pre-trained language representation models,
have gained wide attention in the NLP community due to their transfer learning
ability. Such models can achieve state-of-the-art performance on various
standard NLP tasks [8] by simply being fine-tuned on the specific task. One of
the state-of-the-art pre-trained contextual language representation models is
BERT [2]. It builds upon a multi-layer bidirectional Transformer encoder, which
is based on the self-attention mechanism. During pre-training, BERT is trained
on a large-scale unlabeled general-domain corpus (BooksCorpus and English
Wikipedia) to perform the masked language modeling task and the next sentence
prediction task. Following the success of BERT, various other pre-trained
models have been introduced, such as DistilBERT [7] and RoBERTa [5]. We can
build various models for product matching by fine-tuning these pre-trained
models.

Fine-tuning optimizes the model for a specific task. The architecture for
fine-tuning pre-trained models for the product matching task is shown in Figure
3. Given the input pair $(P_{D_i}, P_{D'_i})$, the first token of every input
sequence is always the special classification token [CLS]. Following [CLS], the
product offer $P_{D_i}$ is represented as the sequence of tokens of its title,
$P_{D_i} = T_1^{P_{D_i}}, T_2^{P_{D_i}}, T_3^{P_{D_i}}, \ldots, T_n^{P_{D_i}}$,
where $n$ is the length of the title of $P_{D_i}$ after tokenization. Then,
[SEP] is placed after the sequence representation of $P_{D_i}$. After [SEP],
the product offer $P_{D'_i}$ is represented in the same way, as the sequence of
tokens of its title,
$P_{D'_i} = T_1^{P_{D'_i}}, T_2^{P_{D'_i}}, T_3^{P_{D'_i}}, \ldots, T_m^{P_{D'_i}}$,
where $m$ is the length of the title of $P_{D'_i}$ after tokenization. Note
that we initially aimed to treat each product offer as a document and to use
all details of the product offer as the sequence of tokens. However, the
pre-trained models accept input sequences of at most 512 tokens. To fit within
this limit, we decided to use only the title as the representation of a product
offer. As a result, there is still room to investigate the other attributes of
product offers as features.
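
The input layout described above can be sketched as follows. This is an illustration under stated assumptions, not the system's actual code: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the model's subword tokenizer, the trailing [SEP] follows the usual BERT convention for sentence pairs (the text only mentions the [SEP] between the two titles), and trimming the longer title first is an assumed truncation strategy.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def toy_tokenize(title: str) -> List[str]:
    # Stand-in for a subword tokenizer (assumption): lowercase, split on spaces.
    return title.lower().split()


def build_pair_input(title_a: str, title_b: str, max_len: int = 150) -> List[str]:
    """Build the token sequence [CLS] title_a [SEP] title_b [SEP]."""
    tokens_a = toy_tokenize(title_a)
    tokens_b = toy_tokenize(title_b)
    budget = max_len - 3  # reserve room for [CLS] and the two [SEP] tokens
    if budget < 0:
        raise ValueError("max_len must be at least 3")
    # Assumed truncation strategy: drop tokens from the end of the longer title.
    while len(tokens_a) + len(tokens_b) > budget:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()
    return [CLS] + tokens_a + [SEP] + tokens_b + [SEP]


if __name__ == "__main__":
    print(build_pair_input("Canon EOS 80D", "canon eos 80d body"))
```

With a real subword tokenizer the token counts $n$ and $m$ would generally be larger than the number of whitespace-separated words.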
    After feeding the input sequence to the pre-trained model, the final vector
representation $C$ corresponding to [CLS] is used as the representation of the
input sequence and is passed to a shallow neural network that serves as the
classifier. We train the classifier with a cross-entropy loss, computed as
follows:

$$\hat{y} = \sigma(C W^{T}) \tag{1}$$

$$L = -\sum_{(P_{D_i}, P_{D'_i})} \big( y \cdot \log(\hat{y}_0) + (1 - y) \cdot \log(\hat{y}_1) \big) \tag{2}$$

where $\sigma(\cdot)$ is the sigmoid function, $W \in \mathbb{R}^{2 \times |C|}$
is the weight of the classification layer of the shallow neural network used
for fine-tuning, $\hat{y}$ is a 2-dimensional real vector with
$\hat{y}_0, \hat{y}_1 \in [0, 1]$ and $\hat{y}_0 + \hat{y}_1 = 1$, and
$y \in \{0, 1\}$ is the label of the input pair.
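
A small numerical sketch of Eqs. (1) and (2) may help. It assumes that the constraint $\hat{y}_0 + \hat{y}_1 = 1$ is enforced by a softmax over the two scores (for two classes this gives $\hat{y}_0 = \sigma(z_0 - z_1)$, where $z = C W^{T}$); the text itself only names the sigmoid. The loss is written with a leading minus sign so that it is a quantity to be minimized, and the pairing of $y$ with $\hat{y}_0$ follows Eq. (2) as written. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    # Numerically stable softmax (assumed normalization, see the note above).
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(C: Sequence[float], W: Sequence[Sequence[float]]) -> List[float]:
    """Compute y_hat from the [CLS] vector C (length |C|) and W (2 x |C|)."""
    logits = [sum(c * w for c, w in zip(C, row)) for row in W]  # C W^T
    return softmax(logits)


def pair_loss(y: int, y_hat: Sequence[float]) -> float:
    """Cross-entropy term for one pair, using the index pairing of Eq. (2)."""
    return -(y * math.log(y_hat[0]) + (1 - y) * math.log(y_hat[1]))


if __name__ == "__main__":
    y_hat = predict_proba([1.0, 2.0], [[1.0, 0.5], [0.0, 0.0]])
    print(y_hat, pair_loss(1, y_hat))
```

The total loss $L$ is the sum of `pair_loss` over all training pairs.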

3.3   Ensemble Models
Based on preliminary results on the validation dataset, we found that most of
the pre-trained models achieved remarkable performance. However, when we
analyzed the results on individual samples during training, it turned out that
each pre-trained model captured different aspects of the data. For example, we
found that RoBERTa could handle typographical errors, whereas the other models
could not. Motivated by this observation, PMap combines the results of several
pre-trained models to capture different aspects of the dataset and makes the
final prediction from these results.
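
The text does not state which aggregation rule is used to combine the models. As one plausible illustration only, the sketch below assumes simple majority voting over the binary predictions of the individual models; the function name `majority_vote` and the example predictions are hypothetical.

```python
from typing import Dict, List


def majority_vote(predictions: Dict[str, List[int]]) -> List[int]:
    """Combine per-model 0/1 predictions by majority voting (assumed rule)."""
    runs = list(predictions.values())
    if not runs:
        raise ValueError("at least one model is required")
    n = len(runs[0])
    if any(len(r) != n for r in runs):
        raise ValueError("all models must predict the same number of pairs")
    # A pair is labeled 1 when more than half of the models predict 1.
    return [1 if 2 * sum(votes) > len(runs) else 0 for votes in zip(*runs)]


if __name__ == "__main__":
    preds = {
        "bert-large-uncased": [1, 0, 1, 0],
        "roberta-large": [1, 1, 0, 0],
        "roberta-base": [0, 1, 1, 0],
    }
    print(majority_vote(preds))
```

Other rules, such as averaging predicted probabilities or taking the union of positive predictions, would fit the description equally well.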


4     Experiments and Results
In this section, we report the experiments of PMap on the product matching
task of the MWPD challenge.

4.1   Experimental Setup
The experimental setup is as follows:
   Datasets. The Product Matching dataset is derived from the WDC Product Data
Corpus and Gold Standard for Large-Scale Product Matching. The product data
corpus contains 16M product offers. In the product matching task, there are
68,461, 1,100, and 1,500 offer pairs for training, validation, and testing,
respectively.

                  Table 1. The Result on the Product Matching Task


         System                      Precision   Recall F1 (positive pairs only)
         Baseline [6]                 0.7089     0.7467         0.7273
         distilbert-base-uncased∗     0.7810     0.7495         0.7649
         bert-base-uncased∗           0.7848     0.8340         0.8086
         bert-large-uncased∗          0.7943     0.8493         0.8209
         roberta-base∗                0.8210     0.8725         0.8459
         roberta-large∗               0.8476     0.8691         0.8582
         PMap                         0.8204     0.9048         0.8605


    Settings. We select five pre-trained models: distilbert-base-uncased,
bert-base-uncased, bert-large-uncased, roberta-base, and roberta-large. The
pre-trained models are available from the Hugging Face model repository⁸. To
implement the model shown in Figure 3, we employ the
AutoModelForSequenceClassification⁹ implementation. We set the
hyper-parameters for fine-tuning as follows: batch size: 8, 16, or 32 (the
largest batch that fits in memory), learning rate: 2e-5, epochs: 2-4, dropout
rate: 0.1. The maximum sequence length is set to 150 tokens, based on the
length of the titles in the dataset. At test time, we select
bert-large-uncased, roberta-large, and roberta-base for ensembling the results
in the pipeline. This selection is based on observations on the validation
dataset.
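
The settings above can be collected in a small configuration object. The class `FineTuneConfig` and its field names are hypothetical; the default batch size (16) and epoch count (3) are arbitrary picks from the stated ranges, not values reported in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    model_name: str
    batch_size: int = 16         # text: 8, 16, or 32, depending on memory
    learning_rate: float = 2e-5
    num_epochs: int = 3          # text: 2-4
    dropout: float = 0.1
    max_seq_length: int = 150
    num_labels: int = 2          # match / non-match

    def __post_init__(self) -> None:
        if self.batch_size not in (8, 16, 32):
            raise ValueError("batch_size must be 8, 16, or 32")
        if not 2 <= self.num_epochs <= 4:
            raise ValueError("num_epochs must be between 2 and 4")


ALL_MODELS = (
    "distilbert-base-uncased",
    "bert-base-uncased",
    "bert-large-uncased",
    "roberta-base",
    "roberta-large",
)
# Models selected for the ensemble, as stated in the text.
ENSEMBLE_MODELS = ("bert-large-uncased", "roberta-large", "roberta-base")

# With the transformers library, each model would be loaded with
# AutoModelForSequenceClassification.from_pretrained(name, num_labels=2).
configs = {name: FineTuneConfig(model_name=name) for name in ALL_MODELS}

if __name__ == "__main__":
    for name in ENSEMBLE_MODELS:
        print(configs[name])
```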
    Baseline. In the product matching task, Deepmatcher [6], a state-of-the-art
matching method, is used as the baseline. We additionally evaluate each
pre-trained model individually as an ablation study of our approach.
    Evaluation Metrics. Precision, recall, and F1 score on the positive class
(class 1) are calculated.
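
For reference, the positive-class metrics can be computed as in the following sketch. The function name and the toy labels are illustrative and not taken from the challenge's evaluation code.

```python
from typing import Sequence, Tuple


def positive_class_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) computed on class 1 only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(positive_class_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

As a consistency check on Table 1, the PMap row gives $2 \cdot 0.8204 \cdot 0.9048 / (0.8204 + 0.9048) \approx 0.8605$, which agrees with the reported F1.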


4.2     Results

Table 1 reports the results of PMap on the product matching task. The best
precision (0.8476) is obtained by the roberta-large model, while PMap gives the
best recall (0.9048). Overall, PMap achieves the highest F1 score (0.8605),
outperforming both the baseline (0.7273) and every individual pre-trained
model.


5     Related Work

Product Matching is a special case of entity linking, which considers the
disambiguation of real-world entities in the e-commerce domain. There are many
research works related to entity linking [3, 4, 6]. Early works focused on
rule-based and statistics-based methods [3]. Later, machine learning-based
approaches became popular due to their strong performance [4]. In recent years,
deep learning-based approaches have been extremely successful in many
application domains. Deepmatcher [6], one of the deep learning approaches, uses
a deep neural network and achieves state-of-the-art results on the product
matching task. However, we notice that pre-trained models have not yet gained
much attention in the product matching task. Pre-trained models (e.g., BERT
[2]) achieve remarkable results on many NLP tasks. Therefore, it is worthwhile
to explore pre-trained models for the product matching task.

⁸ https://huggingface.co/models
⁹ https://huggingface.co/transformers/model_doc/auto.html
∗ We additionally evaluated these results after the release of the ground truth
  for the test dataset.


6    Conclusion
In this paper, we report on our product matching system, PMap. PMap takes
advantage of pre-trained models to build classifiers and then ensembles their
results to make the final prediction. By fine-tuning pre-trained language
representation models, we achieved a better result than the baseline. In the
future, we plan to investigate other details of the product, such as the
description and price, which are left unprocessed and unused in the current
system. We also plan to validate the results with further pre-trained models,
because new models are released continuously.


References
1. Mining the Web of HTML-embedded Product Data.
   https://ir-ischool-uos.github.io/mwpd/, accessed: 2020-08-30
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
   bidirectional transformers for language understanding. In: Proceedings of the
   2019 Conference of NAACL-HLT. pp. 4171–4186 (2019)
3. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of the American
   Statistical Association 64(328), 1183–1210 (1969)
4. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on
   real-world match problems. Proceedings of the VLDB Endowment 3(1-2), 484–493
   (2010)
5. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
   Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining
   approach. arXiv preprint arXiv:1907.11692 (2019)
6. Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R.,
   Arcaute, E., Raghavendra, V.: Deep learning for entity matching: A design space
   exploration. In: Proceedings of the 2018 SIGMOD. pp. 19–34 (2018)
7. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of
   BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108
   (2019)
8. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: A
   multi-task benchmark and analysis platform for natural language understanding.
   In: Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting
   Neural Networks for NLP. pp. 353–355 (2018)