=Paper=
{{Paper
|id=Vol-2720/paper3
|storemode=property
|title=A Pre-trained Matching Model Based on Self- and Inter-ensemble For Product Matching Task
|pdfUrl=https://ceur-ws.org/Vol-2720/paper3.pdf
|volume=Vol-2720
|authors=Shiyao Xu,Shijia E,Li Yang,Yang Xiang
|dblpUrl=https://dblp.org/rec/conf/semweb/XuEYX20
}}
==A Pre-trained Matching Model Based on Self- and Inter-ensemble For Product Matching Task==
Shiyao Xu (1,2), Shijia E (2), Li Yang (1,2), and Yang Xiang (1)

1 Tongji University, Shanghai, China
2 Tencent, Shanghai, China
{xushiyao, li.yang}@tongji.edu.cn, tjdxxiangyang@gmail.com, e.shijia@gmail.com

Abstract. The product matching task aims to identify whether a pair of product offers from different websites refer to the same product. While the accumulated semantic annotations of products make it possible to study matching methods based on deep neural networks, product matching remains challenging because it suffers from class imbalance and from the heterogeneity of textual descriptions. In this paper, we treat product matching directly as a semantic text matching problem and propose a pre-trained matching model based on both self- and inter-ensemble. BERT is the main module in our approach for the binary classification of product pairs. We apply two types of ensemble methods: self-ensemble, which uses stochastic weight averaging (SWA) within the same model, and inter-ensemble, which combines the predictions of different models. Additionally, the focal loss is adopted to alleviate the imbalance between positive and negative samples. Experimental results show that our model outperforms existing deep learning matching approaches. The proposed model achieves an F1-score of 85.94% on the test data, which ranks second in Task One of the SWC2020 challenge on Mining the Web of HTML-embedded Product Data. Our implementation has been released at https://github.com/englishbook/product-matching.

Keywords: Product Matching · BERT · Focal Loss · SWA

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In recent years, online shops (e-shops) have increasingly adopted semantic markup languages to describe their products and improve their visibility. These semantic annotations are conducive to further analysis of product offers. However, the different annotation systems used by e-shops often lead to data inconsistencies or conflicts. Product matching is the task of identifying whether a pair of product offers refer to the same product; it is the fundamental technology for building unified systems such as product knowledge graphs. Moreover, product matching can improve the efficiency and experience of online purchasing. For customers, matching the same product across different websites makes it convenient to compare offers and quickly find the best choice. For e-commerce platforms, the matching results can be used for product recommendation. Product matching is therefore a crucial problem in the e-commerce domain.

Product offers are described by textual information (e.g., title, description, and brand), so product matching can be regarded as a semantic text matching problem. Although much previous work on text matching has shown great success with deep neural networks [15, 12, 6], the task remains a great challenge in the e-commerce environment. On the one hand, the semantics of natural language are diverse and complex, especially on the Internet: e-shops like to use exaggerated and newly coined words to attract customers. On the other hand, there is a class imbalance problem: despite the huge number of product offers, most pairs of products do not match, which makes the number of negative samples much larger than the number of positive ones.
To address these limitations, we propose a pre-trained matching model based on both self- and inter-ensemble for product matching. 1) For semantic complexity, we apply pre-trained BERT to model text pairs. BERT [3] is pre-trained on a large-scale unlabeled dataset and then fine-tuned on the downstream product matching task; compared with traditional DNN models, it has learned richer semantic information. 2) For class imbalance, we adopt the focal loss [8] to better optimize the parameters, so that training can focus on the few uncertain samples. 3) For generalization, we combine both self- and inter-ensemble methods: self-ensemble integrates the weights of the same model at different training epochs, while inter-ensemble averages the matching scores produced by different models. Our ensemble model achieves an F1-score of 85.94% in the final evaluation of the product matching task of the SWC2020 challenge.

2 Related Work

With the development of deep learning, a large number of matching models based on deep neural networks have emerged and shown their effectiveness. Previous works mainly focus on the siamese architecture [1]. Those approaches generally take word embeddings of text pairs pre-trained by word2vec [10] as input, convert the word embeddings into text representations, and then compute the similarity between the two text representations. Convolutional neural networks (CNN) [4] and recurrent neural networks (RNN) [11] are the two mainstream methods used for text modeling. ESIM [2] and EACNNs [13] incorporate attention mechanisms to pay more attention to the relevant parts of the text pairs. However, all of these models depend on the quality of the training dataset and face the difficulty of polysemy.

Recently, pre-trained language models have been proposed and have achieved significant improvements on various NLP tasks, with state-of-the-art models such as BERT [3], RoBERTa [9], and XLNet [14]. Their main idea is to pre-train a language model on a large-scale unlabeled corpus before fine-tuning it on downstream tasks. Pre-trained models therefore contain rich semantic information and generate contextualized embeddings instead of fixed ones. In this paper, we adopt pre-trained BERT as the base model to solve the product matching task, where the semantics of the textual descriptions is relatively complex.

3 Model Description

3.1 Data

The WDC product data corpus, the largest publicly available product data corpus, was released by the Web Data Commons project in 2018. The corpus consists of 26 million products originating from 79 thousand websites. Products in the corpus are described by the following properties: id, cluster id, category, title, description, brand, price, and specification table. Products with the same cluster id refer to the same real-world product. In the product matching task of the SWC2020 challenge (https://ir-ischool-uos.github.io/mwpd/index.html), the organizers provide a dataset containing matching and non-matching pairs of products from the Computers & Accessories category only. It is sampled from the product data corpus according to the clusters (68K product pairs for training, 1.1K for validation, and 1.5K for testing). We also utilize an extended training dataset (http://webdatacommons.org/largescaleproductcorpus/v2/index.html) covering all four product categories and derived by the same sampling strategy (214K pairs for training and 4.4K for validation). Matching models can learn more information from the extra categories and thus make more accurate predictions on samples from the Computers category. The statistics of these training sets are listed in Table 1.

Table 1. The statistics of the product matching training sets.

Category    #clusters   #positive   #negative   #samples
Computers         745        9690       58771      68461
All              2481       30198      184463     214661

In addition, some data preprocessing operations are performed before the product information is fed into the models: we lowercase all textual descriptions and remove stopwords (using NLTK).
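The preprocessing step is simple; the snippet below is a minimal sketch of how it could look with NLTK. The helper name normalize_text and the regex tokenization are our own illustrative choices, not details taken from the paper or its released code.

```python
import re

import nltk
from nltk.corpus import stopwords

# The stopword list has to be downloaded once.
nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))


def normalize_text(text):
    """Lowercase a product title/description and drop English stopwords."""
    if not text:  # description fields are often NULL in the corpus
        return ""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


# Example:
# normalize_text("The NEW Intel Core i7 CPU for Gamers")  ->  "new intel core i7 cpu gamers"
```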
3.2 Matching Model

Pre-trained language models have learned from large-scale corpora, so they have a clear advantage when transferred to the e-commerce domain. In this section, we build a product matching model based on pre-trained BERT. The overall framework is shown in Figure 1.

[Fig. 1. The overall framework of the BERT matching model: an input layer that concatenates the token sequences of product A and product B, a BERT encode layer, and a fully connected prediction layer that outputs the matching score.]

Although products are described by many attributes, most of the fields contain NULL values. The title attribute is filled for all products, and the filling rate of the description attribute is relatively high, so we mainly focus on these two attributes. We first concatenate the textual information of a product pair with a [SEP] token, and then add [CLS] and [SEP] tokens at the beginning and the end, respectively, as the input of BERT: x = [[CLS] t_A d_A [SEP] t_B d_B [SEP]], where t and d denote the title and description of products A and B. BERT models the input tokens with its multi-layer bidirectional Transformer encoder and generates high-level representations. The output state of BERT corresponding to the [CLS] token is used as the pair representation. We feed this representation into a fully connected layer with a sigmoid activation function and obtain the final matching score p between the two product offers.
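As a concrete illustration of this architecture, the sketch below assembles the pair classifier with a recent version of the Hugging Face transformers library. It is a reconstruction under our own assumptions (the class name BertMatcher, the bert-base-uncased checkpoint, and the example titles are illustrative), not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class BertMatcher(nn.Module):
    """BERT encoder + fully connected layer producing a matching score p in (0, 1)."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            return_dict=True)
        cls_state = outputs.last_hidden_state[:, 0]      # output state of [CLS]
        return torch.sigmoid(self.classifier(cls_state)).squeeze(-1)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Given a text pair, the tokenizer produces [CLS] ... [SEP] ... [SEP],
# matching the input x = [[CLS] t_A d_A [SEP] t_B d_B [SEP]] described above.
encoded = tokenizer(
    "lenovo thinkpad x1 carbon gen 7 laptop",        # product A (title only)
    "thinkpad x1 carbon 7th gen 14 inch notebook",   # product B (title only)
    truncation=True, max_length=64,                  # 64 tokens for title-only input (Section 4.1)
    padding="max_length", return_tensors="pt",
)

model = BertMatcher()
score = model(**encoded)   # matching probability for the pair
```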
3.3 Focal Loss

In general, the binary cross-entropy loss is directly adopted for binary classification tasks. However, as shown in Table 1, the numbers of positive and negative samples in the dataset are imbalanced, so the easily classified negatives make up the majority of the loss and dominate the gradient. To solve this problem, Lin et al. [8] scale the cross-entropy loss and propose the focal loss, which is formulated as

FL = -α_t (1 - p_t)^γ log(p_t)    (1)

where p_t = p when the ground truth is 1 and p_t = 1 - p otherwise. The weighting factor α ∈ [0, 1] is set according to the class frequency to balance the importance of positive and negative samples; for convenience, α_t is defined analogously to p_t. The focusing parameter γ ∈ [0, 5] is introduced to differentiate easy and hard examples. In this way, samples that are already classified accurately contribute less to the loss, so the model can focus training on the few hard cases.
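A minimal PyTorch sketch of Equation (1), assuming the model outputs the matching probability p after the sigmoid as in Section 3.2; the default values mirror the settings reported later in Section 4.1 (α = 0.75, γ = 2). The function name and signature are our own, not taken from the released code.

```python
import torch


def focal_loss(p, target, alpha=0.75, gamma=2.0, eps=1e-7):
    """Binary focal loss: FL = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p      -- predicted matching probabilities in (0, 1)
    target -- ground-truth labels (1 = matching pair, 0 = non-matching pair)
    """
    p = p.clamp(eps, 1.0 - eps)                  # numerical stability for log()
    p_t = torch.where(target == 1, p, 1.0 - p)   # p_t = p if y = 1, else 1 - p
    alpha_t = torch.where(target == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    return -(alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t)).mean()


# A confidently classified negative (p = 0.05) contributes almost nothing,
# while an uncertain positive (p = 0.40) dominates the batch loss.
loss = focal_loss(torch.tensor([0.05, 0.40]), torch.tensor([0, 1]))
```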
3.4 Ensemble

[Fig. 2. The combination of the self- and inter-ensemble strategies: self-ensemble averages the weights of the same BERT model at epochs n, n+1, ..., n+t; inter-ensemble combines the predictions of models 1 to m into the final result.]

Self-ensemble combines the same model at different stages of training. As illustrated in Figure 2, we use the stochastic weight averaging (SWA) strategy [5] in the training phase: when the model is about to converge, the weights obtained at different epochs are averaged to form the final weights of the model. Compared with the traditional approach of keeping only the weights of the final epoch, SWA helps avoid poor local optima and improves the generalization ability without increasing the training cost.

Inter-ensemble combines different models. Different models have their own advantages; for example, models trained with the cross-entropy loss should predict more accurately on the abundant non-matching samples, while models trained with the focal loss should perform better on the few hard examples. In this paper, multiple models are obtained by training on different datasets, inputs, and loss functions. As illustrated in Figure 2, we average the prediction probabilities of these models as the final result, combining the strengths of the different models.

3.5 Post-processing

Many attributes are not fed into the model during training, but they are undoubtedly useful for product matching. In the SWC2020 challenge, we attempt to take full advantage of them to correct the prediction results. The values of the category attribute are mapped to four unified categories, and two products belonging to different categories cannot be a match. For test pairs predicted as 1 whose products have different categories, we therefore correct the result to 0. The experiments below demonstrate the effectiveness of this post-processing operation. Similarly, the brand and price attributes could be used for correction in the future.
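The two ensemble steps and the category correction reduce to a few lines of code. The sketch below is our own illustration under simplifying assumptions (checkpoint file names, helper names, and data structures are hypothetical), not the released implementation; in particular, averaging checkpoints assumes the saved entries are floating-point parameters.

```python
import torch


def average_checkpoints(paths):
    """Self-ensemble (SWA): average the weights of checkpoints saved near convergence."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in state_dicts[0]:
        # Non-float buffers, if any, should be copied instead of averaged.
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged


def ensemble_scores(score_lists):
    """Inter-ensemble: average the matching probabilities predicted by several models."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]


def correct_by_category(scores, categories_a, categories_b, threshold=0.5):
    """Post-processing: pairs from different unified categories cannot match."""
    return [1 if s >= threshold and ca == cb else 0
            for s, ca, cb in zip(scores, categories_a, categories_b)]


# Hypothetical usage:
# model.load_state_dict(average_checkpoints(["epoch5.pt", "epoch6.pt", "epoch7.pt"]))
# final_scores = ensemble_scores([scores_model_5, scores_model_7, scores_model_12])
# labels = correct_by_category(final_scores, cats_left, cats_right)
```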
4 Experiments

4.1 Experimental Setups

The product textual information is padded or truncated to a fixed length: the maximum length is set to 64 for the input containing only the product title and to 200 for the concatenation of title and description. The pre-trained BERT is initialized with the weights of BERT_base. For the focal loss, α is set to 0.75 and γ is set to 2 to focus on hard positive samples. We optimize with Adam [7] and a constant learning rate of 2 × 10^-5. The SWA and early stopping strategies are applied after the fifth training epoch.

4.2 Results on the Validation Set

The F1-score on the positive class is used as the evaluation metric. We trained multiple models with different architectures and training strategies; models trained on the All dataset are also evaluated on the validation set with all four categories. Table 2 shows the results of these models on the validation set in detail. The performance of pre-trained BERT on the product matching task is significantly better than that of classic matching models (e.g., CNN, ESIM). Incorporating the focal loss and the SWA strategy further improves the BERT models. Moreover, post-processing can indeed correct some prediction errors effectively.

Table 2. The results on the validation set. CE and FL denote cross-entropy and focal loss, respectively. Post F1 is the F1-score after post-processing.

No.  Model        Dataset    Input     Loss  SWA  F1      Post F1
1    Siamese CNN  All        title     CE    ×    0.8445  0.8479
2    ESIM         All        title     CE    ×    0.9167  0.9174
3    BERT         All        title     CE    ×    0.9410  0.9426
4    BERT         All        title     CE    ✓    0.9427  0.9431
5    BERT         All        title     FL    ✓    0.9481  0.9496
6    BERT         All        tit+desc  CE    ✓    0.9369  0.9381
7    BERT         All        tit+desc  FL    ✓    0.9384  0.9411
8    BERT         Computers  title     CE    ×    0.9630  0.9646
9    BERT         Computers  title     CE    ✓    0.9646  0.9662
10   BERT         Computers  title     FL    ✓    0.9585  0.9633
11   BERT         Computers  tit+desc  CE    ✓    0.9700  0.9700
12   BERT         Computers  tit+desc  FL    ✓    0.9672  0.9688

4.3 Final Results

After obtaining these models, we select several that perform well on the validation set and integrate them with the inter-ensemble strategy; averaging the matching scores predicted by multiple models combines their strengths. The results of our ensemble models on the validation set of the Computers dataset are presented in Table 3. Ensemble models are generally better than single models. For the final evaluation on the test data, we submitted the prediction results of our best ensemble models. As shown in Table 4, we achieve an F1-score of 85.94% on the test set, which ranks second. The experimental results demonstrate the effectiveness and generalization ability of the proposed model.

Table 3. The results of our ensemble models on the validation set. The model numbers are given in Table 2.

Ensemble Model  F1
5+7+9           0.9718
5+7+10          0.9689
5+7+11          0.9754
5+7+12          0.9737
5+7+11+12       0.9735

Table 4. The final results on the test set.

Team            Precision  Recall  F1
Baseline        0.7089     0.7467  0.7273
Megagon         0.8268     0.6552  0.7311
ISCAS-ICIP      0.8389     0.8133  0.8259
ASVinSpace      0.8620     0.8210  0.8410
PMap            0.8204     0.9048  0.8605
Ours (5+7+11)   0.8286     0.8838  0.8553
Ours (5+7+12)   0.8063     0.9200  0.8594

5 Conclusion and Future Work

In this paper, we propose a pre-trained matching model based on both self- and inter-ensemble for product matching. Pre-trained BERT is adopted as the base matching model. We incorporate the SWA strategy in the training phase to improve the generalization ability of the models, and we combine the outputs of different models to make full use of their respective advantages. Experimental results show that our model achieves a great improvement over existing state-of-the-art matching models. An interesting direction for future work is to further pre-train BERT on the product data corpus so that it can learn more of the ways in which products are described. Post-processing operations are also worthy of further study, especially in industrial practice.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant No. 2019YFB1704402) and the 2019 Tencent Marketing Solution Rhino-Bird Focused Research Program.

References

1. Bromley, J., Bentz, J.W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Säckinger, E., Shah, R.: Signature Verification using a "Siamese" Time Delay Neural Network. International Journal of Pattern Recognition and Artificial Intelligence 7, 669–688 (1993)
2. Chen, Q., Zhu, X., Ling, Z., Wei, S., Jiang, H., Inkpen, D.: Enhanced LSTM for Natural Language Inference. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. pp. 1657–1668. Association for Computational Linguistics (July 2017)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 4171–4186. Association for Computational Linguistics (June 2019)
4. Hu, B., Lu, Z., Li, H., Chen, Q.: Convolutional Neural Network Architectures for Matching Natural Language Sentences. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. vol. 2, pp. 2042–2050. Curran Associates, Inc. (December 2014)
5. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging Weights Leads to Wider Optima and Better Generalization. Computing Research Repository arXiv:1803.05407 (2018)
6. Kim, S., Kang, I., Kwak, N.: Semantic Sentence Matching with Densely-connected Recurrent and Co-attentive Information. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence. pp. 6586–6593. AAAI Press (January 2019)
7. Kingma, D.P., Ba, J.L.: Adam: A Method for Stochastic Optimization. In: Proceedings of the 3rd International Conference on Learning Representations (May 2015)
8. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal Loss for Dense Object Detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
9. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. Computing Research Repository arXiv:1907.11692 (2019)
10. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
11. Mueller, J., Thyagarajan, A.: Siamese Recurrent Architectures for Learning Sentence Similarity. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. pp. 2786–2792. AAAI Press (February 2016)
12. Wang, Z., Hamza, W., Florian, R.: Bilateral Multi-Perspective Matching for Natural Language Sentences. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. pp. 4144–4150 (August 2017)
13. Xu, S., E, S., Xiang, Y.: Enhanced Attentive Convolutional Neural Networks for Sentence Pair Modeling. Expert Systems with Applications 151, 113384 (2020)
14. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: Generalized Autoregressive Pretraining for Language Understanding. In: Advances in Neural Information Processing Systems 32. pp. 5754–5764 (December 2019)
15. Yin, W., Schütze, H., Xiang, B., Zhou, B.: ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. Transactions of the Association for Computational Linguistics 4, 259–272 (2016)