<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PMap: Ensemble Pre-training Models for Product Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Natthawut Kertkeidkachorn</string-name>
          <email>n.kertkeidkachorn@aist.go.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <email>ichise@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Advanced Industrial Science and Technology</institution>
          ,
          <addr-line>Tokyo 135-0064</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Mining the Web of HTML-embedded Product Data (MWPD) Challenge aims to benchmark methods dealing with two e-commerce data integration tasks: 1) Product Matching and 2) Product Classification. In this paper, we present the design of our system, PMap, for the Product Matching task of the MWPD Challenge. PMap aggregates the results of various state-of-the-art pre-training models to resolve identical products. Results on MWPD show that PMap outperforms the baseline and obtains promising performance for the product matching task. The code and the system's outputs are available.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Due to the growth of online shops in the e-commerce domain, semantic annotation plays a key role in enhancing the accessibility and visibility of products. Annotating products with a semantic markup language helps a search engine retrieve a product according to a user's expectation. However, annotated products suffer from inconsistency and heterogeneity across e-commerce vendors. As a result, this can even lead to situations where a product's information is conflicting. Furthermore, without a clear benchmark, it is hard to judge the progress of the methods in this field. To address these challenges, the Mining the Web of HTML-embedded Product Data (MWPD) challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is introduced. The goal of the MWPD challenge is to provide a benchmark for methods dealing with two fundamental tasks in e-commerce data integration: 1) Product Matching and 2) Product Classification. In this study, we focus on the Product Matching task, i.e., matching product offers from different websites that refer to the same real-world product. To deal with the Product Matching task, we introduce an ensemble of pre-trained models, namely PMap. PMap takes advantage of contextualized-embedding pre-trained models together with an aggregating strategy in order to uncover identical products.</p>
      <p>The rest of the paper is organized as follows. We describe the problem setting
of product matching in the MWPD challenge in Section 2. Section 3 presents
the design of our approach. In Section 4, the experimental setup and the
experimental results are presented. We then survey related work in Section
5. In Section 6, we conclude our work.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Setting</title>
<p>A product offer is a collection of textual attributes that describes a real-world
product. Generally, product offers are published as product descriptions with
specification tables, i.e., HTML tables that describe specifications of the offer,
such as the price or brand of the product. Samples of product offers are
presented in Figure 1.</p>
<p>Product Matching in the MWPD challenge is the task of classifying whether
two given product offers are identical, i.e., whether the two product offers refer
to the same real-world object. We can formulate the Product Matching problem as follows:</p>
<p>Let D and D′ be two collections of product offers from different resources.
We assume that product offers in D and D′ have the same schema, i.e., a product
offer is described by the same set of attributes A. Given D = {PD_1, PD_2, PD_3,
..., PD_n} and D′ = {PD′_1, PD′_2, PD′_3, ..., PD′_n}, where PD_i is the i-th product
offer of D and PD′_i is the i-th product offer of D′, the objective of product
matching is to model the function f : (PD_i, PD′_i) → {0, 1}. If two product offers refer
to the same object, the function f(·) returns 1, otherwise 0.</p>
<p>For example, in Figure 1, the product offers a and c are from D, and the
product offers b and d are from D′. The pairs of product offers (a, b) and
(c, d) are given. The pair (a, b) is a matching pair (f(a, b) = 1), while the
pair (c, d) is a non-matching pair (f(c, d) = 0).</p>
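      <p>To make the formulation concrete, the following minimal Python sketch shows the
interface of f; the ProductOffer attributes and the trivial title-equality rule are
illustrative assumptions only, since PMap instead learns f with fine-tuned
pre-trained models (Section 3).</p>
      <preformat>
from dataclasses import dataclass

@dataclass
class ProductOffer:
    """A product offer: a collection of textual attributes (illustrative schema)."""
    title: str
    brand: str = ""
    price: str = ""

def f(pd_i: ProductOffer, pd_j: ProductOffer) -> int:
    """Returns 1 if the two offers refer to the same real-world product, else 0.
    A trivial stand-in rule; PMap replaces this with a learned classifier."""
    return int(pd_i.title.strip().lower() == pd_j.title.strip().lower())

a = ProductOffer(title="Acme X100 Wireless Mouse")
b = ProductOffer(title="acme x100 wireless mouse")
print(f(a, b))  # 1: a matching pair, as with (a, b) in Figure 1
      </preformat>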
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
<p>We design our system, PMap, as a three-step pipeline. As shown in Figure 2, the
pipeline consists of 1) Pre-processing, 2) Fine-tuning Pre-trained Models, and 3)
Ensemble Models. The details of each step are as follows.</p>
      <sec id="sec-3-1">
        <title>Pre-processing</title>
        <p>
In the MWPD challenge, the WDC Product Data Corpus
(http://webdatacommons.org/largescaleproductcorpus/v2/index.html) is used as the dataset.
It is derived from the Web Data Commons (http://webdatacommons.org/structureddata/),
extracted using schema.org annotations from the Common Crawl (https://commoncrawl.org).
Although some cleaning pre-processing steps were already applied to the dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we found that it is still necessary
to further pre-process the dataset because of character-encoding artifacts and symbols
in the data. To pre-process the dataset, we remove symbols and non-alphabetic
characters with a simple regular expression, as sketched below.
        </p>
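        <p>A minimal sketch of this cleaning step follows; the exact expression used in PMap
is not given in the paper, so the pattern here (keep letters, digits, and spaces) is an
assumption.</p>
        <preformat>
import re

def clean(text: str) -> str:
    """Remove symbols and non-alphabetic characters with a simple regular expression."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # drop symbols / non-alphabetic characters
    return re.sub(r"\s+", " ", text).strip()     # collapse the leftover whitespace

print(clean("Canon EOS-80D (24.2MP)®!"))  # "Canon EOS 80D 24 2MP"
        </preformat>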
      </sec>
      <sec id="sec-3-2">
        <title>Fine-tuning Pre-trained Models</title>
        <p>
Fine-tuning pre-trained models is the core step of PMap. In this section, we explain
the pre-trained models and how to fine-tune them.</p>
        <p>Pre-trained models, also known as pre-trained language representation models,
have gained wide attention in the NLP community due to their transfer-learning
ability. Such pre-trained models can easily achieve state-of-the-art performance on
various standard NLP tasks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by simply fine-tuning the models on specific
tasks. One of the state-of-the-art pre-trained contextual language representation
models is BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It builds upon a multi-layer bidirectional Transformer
encoder, which is based on the self-attention mechanism. During pre-training,
BERT is trained on a large-scale unlabeled general-domain corpus from BooksCorpus
and English Wikipedia to perform the masked language modeling task and the
next sentence prediction task. Based on the success of BERT, various pre-trained
models have since been introduced, such as DistilBERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We can build various models for product matching
by fine-tuning pre-trained models.
      </p>
      <p>Fine-tuning optimizes the model for the specific task. The architecture for
fine-tuning pre-trained models for the product matching task is shown in Figure 3.
Given the input pair (PD_i, PD′_i), the first token of every input sequence is
always the special classification token [CLS]. Following [CLS], the product offer
PD_i is represented as the token sequence T_1, T_2, T_3, ..., T_n of its title,
where n is the length of the tokenized title of PD_i. Then, [SEP] is placed after
the sequence representation of PD_i. After [SEP], the product offer PD′_i is
represented in the same way, as the token sequence T_1, T_2, T_3, ..., T_m of its
title, where m is the length of the tokenized title of PD′_i. Note that we
initially aimed to treat a product offer as a document and use all details of the
product offer as the sequence of tokens. However, the pre-trained models limit the
token sequence to a maximum length of 512. To fit the pre-trained models within
this limitation, we decided to use only the title as the representation of a
product offer. As a result, there is still room to investigate the other
attributes of product offers as features.</p>
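        <p>With the Hugging Face transformers library, the [CLS] ... [SEP] ... [SEP] input
described above is obtained by encoding the two titles as a sentence pair. This is a
sketch under the settings of Section 4 (maximum length 150); the titles are
illustrative.</p>
        <preformat>
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# [CLS] tokens(title of PD_i) [SEP] tokens(title of PD'_i) [SEP]
encoded = tokenizer(
    "dell ultrasharp u2415 24 inch monitor",  # title of PD_i
    "dell u2415 24 led backlit lcd monitor",  # title of PD'_i
    truncation=True,   # keep the pair within the model's length limit
    max_length=150,
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:6])
# e.g. ['[CLS]', 'dell', ...] (exact tokenization depends on the vocabulary)
        </preformat>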
      <p>After feeding the input sequence to the pre-trained model, the final vector
representation C corresponding to [CLS] is used as the representation of the input
sequence and is passed to a shallow neural network to build the classifier. We
train the classifier with a cross-entropy loss computed with the following equations:</p>
      <p>ŷ = σ(CWᵀ)   (1)</p>
      <p>L = −Σ_(PD_i, PD′_i) [ y log ŷ_0 + (1 − y) log ŷ_1 ]   (2)</p>
      <p>where σ(·) is the sigmoid function, W is the classification-layer weight of the
shallow neural network for fine-tuning (W ∈ ℝ^(2×|C|)), ŷ is a 2-dimensional real
vector with ŷ_0, ŷ_1 ∈ [0, 1] and ŷ_0 + ŷ_1 = 1, and y is the label for the input
pair (y ∈ {0, 1}).</p>
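        <p>The following PyTorch sketch mirrors equations (1) and (2); the layer sizes and
the leading minus sign of the loss are our reading of the setup rather than released
code.</p>
        <preformat>
import torch
import torch.nn as nn

hidden_size = 768                          # |C| for bert-base; 1024 for the large models
W = nn.Linear(hidden_size, 2, bias=False)  # classification weight W in R^{2 x |C|}

def pair_loss(C: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """C: [CLS] vectors, shape (batch, |C|); y: gold labels in {0, 1}."""
    y_hat = torch.sigmoid(W(C))            # eq. (1): y_hat = sigma(C W^T)
    # eq. (2), following the paper's component indexing
    return -(y * torch.log(y_hat[:, 0]) + (1 - y) * torch.log(y_hat[:, 1])).mean()

C = torch.randn(4, hidden_size)            # stand-in [CLS] representations
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(pair_loss(C, y))
        </preformat>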
      </sec>
      <sec id="sec-3-3">
        <title>Ensemble Models</title>
        <p>
Based on the preliminary results on the validation dataset, we found that most of
the pre-trained models achieved remarkable performance. However, when we
observed and analyzed the results on individual samples during training, it turned
out that each pre-trained model captured different aspects of the data. For
example, we found that RoBERTa could handle typographical errors, whereas the
others could not. Based on this signal, PMap combines the results from various
pre-trained models to capture the various aspects of the dataset and makes the
final prediction from these results, as sketched below.</p>
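        <p>The paper does not spell out the exact aggregation rule, so the majority vote
over the three selected models below is an assumption consistent with the description
above.</p>
        <preformat>
from collections import Counter

def ensemble_predict(model_preds):
    """model_preds[k][i] is model k's 0/1 label for pair i
    (e.g. bert-large-uncased, roberta-large, roberta-base).
    Returns the majority label for each pair."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_preds)]

bert_large    = [1, 0, 1, 1]
roberta_large = [1, 0, 0, 1]
roberta_base  = [0, 0, 1, 1]
print(ensemble_predict([bert_large, roberta_large, roberta_base]))  # [1, 0, 1, 1]
        </preformat>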
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <p>In this section, we report the experiments of PMap on the product matching
task of the MWPD challenge.</p>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>
The experimental setup is as follows:</p>
        <p>Datasets. The Product Matching dataset is derived from the WDC Product
Data Corpus and Gold Standard for Large-Scale Product Matching. The product
data corpus contains 16M product offers. In the product matching task,
there are 68,461, 1,100, and 1,500 offer pairs for training, validation, and testing,
respectively.</p>
        <p>Settings. We select various pre-trained models, including distilbert-base-uncased,
bert-base-uncased, bert-large-uncased, roberta-base, and roberta-large. The
pre-trained models are available in the Hugging Face repository
(https://huggingface.co/models). To implement the model in Figure 3, we employ the
implementation of AutoModelForSequenceClassification
(https://huggingface.co/transformers/model_doc/auto.html). We set the hyper-parameters
in the fine-tuning process as follows: batch size: 8, 16, or 32 (depending on the largest
batch that can be loaded into memory), learning rate: 2e-5, epochs: 2-4, dropout rate:
0.1. The maximum token length is set to 150 due to the length of the titles in the
dataset. During testing, we select bert-large-uncased, roberta-large, and roberta-base
for ensembling the results in the pipeline. This selection is based on
observations on the validation dataset.</p>
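        <p>A sketch of this fine-tuning configuration with the transformers Trainer API,
using the hyper-parameters listed above; the dataset objects are omitted and would have
to be built from the tokenized MWPD offer pairs (an illustration, not the authors'
released code).</p>
        <preformat>
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

# Any of the listed checkpoints can be plugged in here.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="pmap-bert-large",
    per_device_train_batch_size=16,  # 8, 16, or 32, whichever fits in memory
    learning_rate=2e-5,
    num_train_epochs=3,              # 2-4 in our setup; dropout stays at the default 0.1
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_pairs,   # hypothetical tokenized pair datasets
#                   eval_dataset=valid_pairs)
# trainer.train()
        </preformat>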
      <p>
        Baseline. In the product matching task, deepmatcher [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a state-of-the-art
matching method, is used as the baseline. In addition, we conduct an
experiment on each pre-trained model as an ablation study of our approach.
      </p>
        <p>Evaluation Metrics. Precision, Recall, and F1 score on the positive class
(class 1) are calculated.</p>
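        <p>These scores match the standard scikit-learn computation on the positive class,
e.g.:</p>
        <preformat>
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # gold labels (illustrative values)
y_pred = [1, 0, 0, 1, 1]  # system predictions (illustrative values)
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.667 R=0.667 F1=0.667
        </preformat>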
      </sec>
      <p>We additionally evaluated these results after the release of the ground truth
for the test dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>Product Matching has been studied in a long line of
research works related to entity linking [
        <xref ref-type="bibr" rid="ref3 ref4 ref6">3, 4, 6</xref>
        ]. Early works focused on
rule-based and statistics-based methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Later,
machine learning-based approaches became popular due to
their strong performance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In recent years, deep learning-based approaches have been
extremely successful in many application domains. Deepmatcher [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], one of these
deep learning approaches, models a deep neural network and achieves the state
of the art for the product matching task. However, we notice that pre-trained
models have not yet gained much attention in the product matching task.
Pre-trained models (e.g., BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) achieve remarkable results on many NLP
tasks. Therefore, it is worthwhile to explore pre-trained models for the product
matching task.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we reported our product matching system, PMap. PMap
takes advantage of pre-trained models to build classifiers and then
ensembles their results to make the final prediction. By fine-tuning pre-trained
language representation models, we could achieve a better result
than the baseline. In the future, we plan to investigate the other details of a
product, such as its description and price, which are left unprocessed and unused
in the current system. We also plan to validate the results on other pre-trained
models, because new models come out continuously.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Mining the Web of HTML-embedded Product Data</article-title>
          . https://ir-ischooluos.github.io/mwpd/, accessed: 2020-08-30
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of NAACL-HLT</source>
          . pp.
          <fpage>4171</fpage>
          –
          <lpage>4186</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fellegi</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sunter</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          :
          <article-title>A theory for record linkage</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>64</volume>
          (
          <issue>328</issue>
          ),
          <fpage>1183</fpage>
          –
          <lpage>1210</lpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Kopcke, H.,
          <string-name>
            <surname>Thor</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Evaluation of entity resolution approaches on real-world match problems</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>3</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>484</fpage>
          –
          <lpage>493</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mudgal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rekatsinas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deep</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arcaute</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavendra</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Deep learning for entity matching: A design space exploration</article-title>
          .
          <source>In: Proceedings of the 2018 SIGMOD</source>
          . pp.
          <fpage>19</fpage>
          –
          <lpage>34</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          . arXiv preprint arXiv:1910.01108 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          .
          <source>In: Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          . pp.
          <fpage>353</fpage>
          –
          <lpage>355</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>