<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PMap: Ensemble Pre-training Models for Product Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Natthawut Kertkeidkachorn</string-name>
          <email>n.kertkeidkachorn@aist.go.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <email>ichise@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Advanced Industrial Science and Technology</institution>
          ,
          <addr-line>Tokyo 135-0064</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Mining the Web of HTML-embedded Product Data (MWPD) Challenge aims to benchmark methods dealing with two e-commerce data integration tasks: 1) Product Matching and 2) Product Classification. In this paper, we present the design of our system, PMap, for the Product Matching task of the MWPD Challenge. PMap aggregates the results of various state-of-the-art pre-training models to resolve identical products. Results on MWPD show that PMap outperforms the baseline and obtains promising performance for the product matching task. The code and the system's outputs are available.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Due to the growth of online shops in the e-commerce domain, semantic annotation plays a key role in enhancing the accessibility and visibility of products. Annotating products with a semantic markup language helps a search engine retrieve a product according to a user's expectation. However, annotated products suffer from inconsistency and heterogeneity across e-commerce vendors. As a result, this can even lead to situations where a product's information is conflicting. Furthermore, without a clear benchmark, it is hard to judge the progress of the methods in this field. To address these challenges, the Mining the Web of HTML-embedded Product Data (MWPD) challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is introduced. The goal of the MWPD challenge is to provide a benchmark for methods dealing with two fundamental tasks in e-commerce data integration: 1) Product Matching and 2) Product Classification. In this study, we focus on the Product Matching task, i.e., matching product offers from different websites that refer to the same real-world product. To deal with the Product Matching task, we introduce an ensemble of pre-trained models, namely PMap. PMap takes advantage of contextualized-embedding pre-trained models together with an aggregating strategy in order to uncover identical products.</p>
      <p>The rest of the paper is organized as follows. We describe the problem setting
of product matching in the MWPD challenge in Section 2. Section 3 presents
the design of our approach. In Section 4, the experimental setup and the
experimental results are presented. We then survey related work in Section
5. In Section 6, we conclude our work.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Setting</title>
<p>A product offer is a collection of textual attributes that describes a real-world
product. Generally, product offers are published as product descriptions with
specification tables, i.e., HTML tables that describe specifications of the offer,
such as the price or brand of the product. Samples of product offers are
presented in Figure 1.</p>
<p>Product Matching in the MWPD challenge is the task of classifying whether
two given product offers are identical, i.e., whether the two product offers refer
to the same real-world object. We can formulate the Product Matching problem as follows:</p>
<p>Let D and D′ be two collections of product offers from different resources.
We assume that product offers in D and D′ have the same schema, i.e., a product
offer is described by the same set of attributes A. Given D = {PD_1, PD_2, PD_3,
..., PD_n} and D′ = {PD′_1, PD′_2, PD′_3, ..., PD′_n}, where PD_i is the i-th product
offer of D and PD′_i is the i-th product offer of D′, the objective of product
matching is to model the function f : (PD_i, PD′_i) → {0, 1}. If two product offers refer
to the same object, the function f(·) returns 1, otherwise 0.</p>
<p>For example, in Figure 1, the product offers a and c are from D, and the
product offers b and d are from D′. The pairs of product offers (a, b) and
(c, d) are given. The pair (a, b) is a matching pair (f(a, b) = 1), while the
pair (c, d) is a non-matching pair (f(c, d) = 0).</p>
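      <p>To make the formulation concrete, the following minimal Python sketch shows the
interface of f; the ProductOffer attributes and the trivial title-equality rule are
illustrative assumptions only, since PMap instead learns f with fine-tuned
pre-trained models (Section 3).</p>
      <preformat>
from dataclasses import dataclass

@dataclass
class ProductOffer:
    """A product offer: a collection of textual attributes (illustrative schema)."""
    title: str
    brand: str = ""
    price: str = ""

def f(pd_i: ProductOffer, pd_j: ProductOffer) -> int:
    """Returns 1 if the two offers refer to the same real-world product, else 0.
    A trivial stand-in rule; PMap replaces this with a learned classifier."""
    return int(pd_i.title.strip().lower() == pd_j.title.strip().lower())

a = ProductOffer(title="Acme X100 Wireless Mouse")
b = ProductOffer(title="acme x100 wireless mouse")
print(f(a, b))  # 1: a matching pair, as with (a, b) in Figure 1
      </preformat>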
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
<p>We design our system, PMap, as a three-step pipeline. As shown in Figure 2, the
pipeline consists of 1) Pre-processing, 2) Fine-tuning Pre-trained Models, and 3)
Ensemble Models. The details of each step are as follows.</p>
      <sec id="sec-3-1">
        <title>Pre-processing</title>
        <p>
In the MWPD challenge, the WDC Product Data Corpus
(http://webdatacommons.org/largescaleproductcorpus/v2/index.html) is used as the dataset.
It is derived from the Web Data Commons (http://webdatacommons.org/structureddata/),
extracted using schema.org annotations from the Common Crawl (https://commoncrawl.org).
Although some cleaning pre-processing steps were already applied to the dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we found that it is still necessary
to further pre-process the dataset because of character-encoding artifacts and symbols
in the data. To pre-process the dataset, we remove symbols and non-alphabetic
characters with a simple regular expression, as sketched below.
        </p>
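        <p>A minimal sketch of this cleaning step follows; the exact expression used in PMap
is not given in the paper, so the pattern here (keep letters, digits, and spaces) is an
assumption.</p>
        <preformat>
import re

def clean(text: str) -> str:
    """Remove symbols and non-alphabetic characters with a simple regular expression."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # drop symbols / non-alphabetic characters
    return re.sub(r"\s+", " ", text).strip()     # collapse the leftover whitespace

print(clean("Canon EOS-80D (24.2MP)®!"))  # "Canon EOS 80D 24 2MP"
        </preformat>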
      </sec>
      <sec id="sec-3-2">
        <title>Fine-tuning Pre-trained Models</title>
        <p>
Fine-tuning pre-trained models is the core step of PMap. In this section, we explain
the pre-trained models and how to fine-tune them.</p>
        <p>Pre-trained models, also known as pre-trained language representation models,
have gained wide attention in the NLP community due to their transfer-learning
ability. Such pre-trained models can easily achieve state-of-the-art performance on
various standard NLP tasks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by simply fine-tuning the models on specific
tasks. One of the state-of-the-art pre-trained contextual language representation
models is BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It builds upon a multi-layer bidirectional Transformer
encoder, which is based on the self-attention mechanism. During pre-training,
BERT is trained on a large-scale unlabeled general-domain corpus from BooksCorpus
and English Wikipedia to perform the masked language modeling task and the
next sentence prediction task. Based on the success of BERT, various pre-trained
models have since been introduced, such as DistilBERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We can build various models for product matching
by fine-tuning pre-trained models.
      </p>
      <p>Fine-tuning optimizes the model for the specific task. The architecture for
fine-tuning pre-trained models for the product matching task is shown in Figure 3.
Given the input pair (PD_i, PD′_i), the first token of every input sequence is
always the special classification token [CLS]. Following [CLS], the product offer
PD_i is represented as the token sequence T_1, T_2, T_3, ..., T_n of its title,
where n is the length of the tokenized title of PD_i. Then, [SEP] is placed after
the sequence representation of PD_i. After [SEP], the product offer PD′_i is
represented in the same way, as the token sequence T_1, T_2, T_3, ..., T_m of its
title, where m is the length of the tokenized title of PD′_i. Note that we
initially aimed to treat a product offer as a document and use all details of the
product offer as the sequence of tokens. However, the pre-trained models limit the
token sequence to a maximum length of 512. To fit the pre-trained models within
this limitation, we decided to use only the title as the representation of a
product offer. As a result, there is still room to investigate the other
attributes of product offers as features.</p>
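        <p>With the Hugging Face transformers library, the [CLS] ... [SEP] ... [SEP] input
described above is obtained by encoding the two titles as a sentence pair. This is a
sketch under the settings of Section 4 (maximum length 150); the titles are
illustrative.</p>
        <preformat>
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# [CLS] tokens(title of PD_i) [SEP] tokens(title of PD'_i) [SEP]
encoded = tokenizer(
    "dell ultrasharp u2415 24 inch monitor",  # title of PD_i
    "dell u2415 24 led backlit lcd monitor",  # title of PD'_i
    truncation=True,   # keep the pair within the model's length limit
    max_length=150,
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:6])
# e.g. ['[CLS]', 'dell', ...] (exact tokenization depends on the vocabulary)
        </preformat>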
      <p>After feeding the input sequence to the pre-trained model, the final vector
representation C corresponding to [CLS] is used as the representation of the input
sequence and is passed to a shallow neural network to build the classifier. We
train the classifier with a cross-entropy loss computed with the following equations:</p>
      <p>ŷ = σ(CWᵀ)   (1)</p>
      <p>L = −Σ_(PD_i, PD′_i) [ y log ŷ_0 + (1 − y) log ŷ_1 ]   (2)</p>
      <p>where σ(·) is the sigmoid function, W is the classification-layer weight of the
shallow neural network for fine-tuning (W ∈ ℝ^(2×|C|)), ŷ is a 2-dimensional real
vector with ŷ_0, ŷ_1 ∈ [0, 1] and ŷ_0 + ŷ_1 = 1, and y is the label for the input
pair (y ∈ {0, 1}).</p>
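        <p>The following PyTorch sketch mirrors equations (1) and (2); the layer sizes and
the leading minus sign of the loss are our reading of the setup rather than released
code.</p>
        <preformat>
import torch
import torch.nn as nn

hidden_size = 768                          # |C| for bert-base; 1024 for the large models
W = nn.Linear(hidden_size, 2, bias=False)  # classification weight W in R^{2 x |C|}

def pair_loss(C: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """C: [CLS] vectors, shape (batch, |C|); y: gold labels in {0, 1}."""
    y_hat = torch.sigmoid(W(C))            # eq. (1): y_hat = sigma(C W^T)
    # eq. (2), following the paper's component indexing
    return -(y * torch.log(y_hat[:, 0]) + (1 - y) * torch.log(y_hat[:, 1])).mean()

C = torch.randn(4, hidden_size)            # stand-in [CLS] representations
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(pair_loss(C, y))
        </preformat>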
      </sec>
      <sec id="sec-3-3">
        <title>Ensemble Models</title>
        <p>
Based on the preliminary results on the validation dataset, we found that most of
the pre-trained models achieved remarkable performance. However, when we
observed and analyzed the results on individual samples during training, it turned
out that each pre-trained model captured different aspects of the data. For
example, we found that RoBERTa could handle typographical errors, whereas the
others could not. Based on this signal, PMap combines the results from various
pre-trained models to capture the various aspects of the dataset and makes the
final prediction from these results, as sketched below.</p>
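        <p>The paper does not spell out the exact aggregation rule, so the majority vote
over the three selected models below is an assumption consistent with the description
above.</p>
        <preformat>
from collections import Counter

def ensemble_predict(model_preds):
    """model_preds[k][i] is model k's 0/1 label for pair i
    (e.g. bert-large-uncased, roberta-large, roberta-base).
    Returns the majority label for each pair."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_preds)]

bert_large    = [1, 0, 1, 1]
roberta_large = [1, 0, 0, 1]
roberta_base  = [0, 0, 1, 1]
print(ensemble_predict([bert_large, roberta_large, roberta_base]))  # [1, 0, 1, 1]
        </preformat>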
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <p>In this section, we report the experiments of PMap on the product matching
task of the MWPD challenge.</p>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>
The experimental setup is as follows:</p>
        <p>Datasets. The Product Matching dataset is derived from the WDC Product
Data Corpus and Gold Standard for Large-Scale Product Matching. The product
data corpus contains 16M product offers. In the product matching task,
there are 68,461, 1,100, and 1,500 offer pairs for training, validation, and testing,
respectively.</p>
        <p>Settings. We select various pre-trained models, including distilbert-base-uncased,
bert-base-uncased, bert-large-uncased, roberta-base, and roberta-large. The
pre-trained models are available in the Hugging Face repository
(https://huggingface.co/models). To implement the model in Figure 3, we employ the
implementation of AutoModelForSequenceClassification
(https://huggingface.co/transformers/model_doc/auto.html). We set the hyper-parameters
in the fine-tuning process as follows: batch size: 8, 16, or 32 (depending on the largest
batch that can be loaded into memory), learning rate: 2e-5, epochs: 2-4, dropout rate:
0.1. The maximum token length is set to 150 due to the length of the titles in the
dataset. During testing, we select bert-large-uncased, roberta-large, and roberta-base
for ensembling the results in the pipeline. This selection is based on
observations on the validation dataset.</p>
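        <p>A sketch of this fine-tuning configuration with the transformers Trainer API,
using the hyper-parameters listed above; the dataset objects are omitted and would have
to be built from the tokenized MWPD offer pairs (an illustration, not the authors'
released code).</p>
        <preformat>
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

# Any of the listed checkpoints can be plugged in here.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="pmap-bert-large",
    per_device_train_batch_size=16,  # 8, 16, or 32, whichever fits in memory
    learning_rate=2e-5,
    num_train_epochs=3,              # 2-4 in our setup; dropout stays at the default 0.1
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_pairs,   # hypothetical tokenized pair datasets
#                   eval_dataset=valid_pairs)
# trainer.train()
        </preformat>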
      <p>
        Baseline. In the product matching task, deepmatcher [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a state-of-the-art
matching method, is used as the baseline. In addition, we conduct an
experiment on each pre-trained model as an ablation study of our approach.
      </p>
        <p>Evaluation Metrics. Precision, Recall, and F1 score on the positive class
(class 1) are calculated.</p>
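        <p>These scores match the standard scikit-learn computation on the positive class,
e.g.:</p>
        <preformat>
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # gold labels (illustrative values)
y_pred = [1, 0, 0, 1, 1]  # system predictions (illustrative values)
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.667 R=0.667 F1=0.667
        </preformat>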
      </sec>
      <p>We additionally evaluated these results after the release of the ground truth
for the test dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>Product Matching has been studied in a long line of
research works related to entity linking [
        <xref ref-type="bibr" rid="ref3 ref4 ref6">3, 4, 6</xref>
        ]. Early works focused on
rule-based and statistics-based methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Later,
machine learning-based approaches became popular due to
their strong performance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In recent years, deep learning-based approaches have been
extremely successful in many application domains. Deepmatcher [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], one of these
deep learning approaches, models a deep neural network and achieves the state
of the art for the product matching task. However, we notice that pre-trained
models have not yet gained much attention in the product matching task.
Pre-trained models (e.g., BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) achieve remarkable results on many NLP
tasks. Therefore, it is worthwhile to explore pre-trained models for the product
matching task.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we reported our product matching system, PMap. PMap
takes advantage of pre-trained models to build classifiers and then
ensembles their results to make the final prediction. By fine-tuning pre-trained
language representation models, we could achieve a better result
than the baseline. In the future, we plan to investigate the other details of a
product, such as its description and price, which are left unprocessed and unused
in the current system. We also plan to validate the results on other pre-trained
models, because new models come out continuously.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Mining the Web of HTML-embedded Product Data</article-title>
          . https://ir-ischooluos.github.io/mwpd/, accessed: 2020-08-30
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of NAACL-HLT</source>
          . pp.
          <fpage>4171</fpage>
          –
          <lpage>4186</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fellegi</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sunter</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          :
          <article-title>A theory for record linkage</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>64</volume>
          (
          <issue>328</issue>
          ),
          <fpage>1183</fpage>
          –
          <lpage>1210</lpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Kopcke, H.,
          <string-name>
            <surname>Thor</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Evaluation of entity resolution approaches on real-world match problems</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>3</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>484</fpage>
          –
          <lpage>493</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mudgal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rekatsinas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deep</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arcaute</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavendra</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Deep learning for entity matching: A design space exploration</article-title>
          .
          <source>In: Proceedings of the 2018 SIGMOD</source>
          . pp.
          <fpage>19</fpage>
          –
          <lpage>34</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          . arXiv preprint arXiv:1910.01108 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          .
          <source>In: Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          . pp.
          <fpage>353</fpage>
          –
          <lpage>355</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>