ProBERT: Product Data Classification with Fine-tuning BERT Model

Hamada M. Zahera and Mohamed A. Sherif
Data Science Group, Paderborn University, Germany
{hamada.zahera, mohamed.sherif}@uni-paderborn.de

Abstract. In this paper, we describe our submission to the Semantic Web Challenge on Mining HTML-embedded Product Data (MWPD2020). The dataset provides 19K instances of product data collected from various websites. The task is to predict the categories of the product titles in the test set, where categories are defined by the hierarchical taxonomy provided in the training set. In our approach, we present a simple BERT-based model (dubbed ProBERT) for classifying product data into one or more categories. We train our system on product titles and descriptions to learn semantic representations. The participating systems are evaluated using weighted-average precision, recall, and F1-score.

1 Introduction

Recently, many e-commerce websites have been embedding structured product data into their content; according to statistics from Web Data Commons¹, 37% of web pages (on 30% of websites) contain structured data. Consequently, these structured data can be used for product data integration and for optimizing product search services [8]. In addition, product categorization has become essential for providing personalized recommendations and targeted advertisements. However, classifying product data is a challenging task due to the intrinsically noisy nature of product labels and the size of modern e-commerce catalogues. Moreover, each website structures its product data differently, which we refer to as site-specific annotation [5, 1]. For example, one product such as a T-shirt can have different annotation labels on different websites (College>T-Shirts, Clothing>Tops>Shirts, Clothing accessories>Clothing>Tops). Training robust models in such cases requires a large amount of training data with balanced classes. Therefore, automated product classification is needed to organize these data semantically into a universal categorization system, regardless of their site-specific annotation.

In this paper, we explain our method for solving this problem through the Semantic Web Challenge on Mining HTML-embedded Product Data (MWPD2020²). The challenge aims at mining product data embedded in website content. Previous studies [3, 6] focused on categorizing product data on a single e-commerce website and are sensitive to its site-specific content. In this challenge, the goal is to predict each product's categories based on datasets from different websites. We address this task as a multi-label classification problem, where each product can be assigned more than one class (i.e., label or category) simultaneously.

The latest developments in language models (e.g., BERT) have shown impressive gains on a wide variety of natural language tasks, ranging from sentence classification to sequence labeling. In our approach, we propose a BERT-based neural model that categorizes a product based on its metadata, such as the product name, description, or site-specific annotation. In particular, we fine-tune a BERT model to represent product data as low-dimensional contextualized vectors. We feed our model with the product name and description to capture a semantic representation of the product information.

We summarize our main contributions as follows:
– We present ProBERT, a BERT-based model for multi-label product classification based on product metadata (e.g., name, description, and site annotations).
– We conduct experiments to benchmark the impact of different embedding approaches. The results indicate that our method, using contextualized embeddings (BERT), can serve as a good baseline for product classification.

The rest of this paper is organized as follows: We first describe the dataset used in the challenge in Section 2. Then, we present our proposed approach and the official results in Sections 3 and 4, respectively. In Section 5, we conclude the paper with a discussion of future work.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ http://webdatacommons.org/structureddata/2018-12/stats/stats.html
² https://ir-ischool-uos.github.io/mwpd/

2 Dataset

The dataset is provided in JSON format and divided into three subsets: (1) a training set containing approximately 10K product instances, (2) a validation set containing 3K instances, and (3) a test set of 3K instances used for evaluating the submitted systems. The product attributes in the dataset are as follows (an illustrative example record follows after the two lists below):
– ID: the product identification number.
– Name: the product name (an empty string if unavailable).
– Description: the product description, truncated to a maximum of 5K characters (an empty string if unavailable).
– CategoryText: the website-specific category of the product, or breadcrumb (an empty string if unavailable).
– URL: the original web page URL of the product.

Each product may be assigned one or more labels from the following classification levels, corresponding to the three GS1 GPC classification levels:
– lvl1: the level 1 GS1 GPC classification.
– lvl2: the level 2 GS1 GPC classification.
– lvl3: the level 3 GS1 GPC classification.
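To make the schema concrete, the record below is a hypothetical example following the attributes above. All field values are invented for illustration; the lvl1 and lvl2 labels are taken from GS1 GPC labels that occur in the training data (cf. Fig. 2), while the lvl3 value is only a placeholder.

```python
# A hypothetical MWPD2020 product record (all values invented).
example_product = {
    "ID": "12345",
    "Name": "Men's cotton crew-neck t-shirt",
    "Description": "Soft cotton tee with a classic fit ...",
    "CategoryText": "Clothing > Tops > Shirts",  # site-specific breadcrumb
    "URL": "https://www.example.com/products/12345",
    "lvl1": "67000000_Clothing",    # level 1 GS1 GPC label
    "lvl2": "67010000_Clothing",    # level 2 GS1 GPC label
    "lvl3": "<level-3 GPC label>",  # placeholder; level-3 labels not listed here
}
```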
3 Approach

In this section, we present ProBERT, our simple BERT-based model for multi-label product classification. BERT is a pre-trained transformer network [2] that has set new state-of-the-art results for various NLP tasks, including text classification [7] and natural language understanding [4]. Adapting BERT to NLP tasks in a target domain requires a proper fine-tuning strategy, in which a task-specific layer is added on top of the BERT architecture. In this work, we leverage the BERT-Base pre-trained model (uncased: 12 layers, 768 hidden units, 12 attention heads, 110M parameters) and add a fully-connected (i.e., dense) layer on top. For multi-label classification, we use the binary cross-entropy loss function in Eq. 1 and replace the original softmax with a sigmoid activation function. All hyper-parameters remain at their default values, except that we set the maximum sequence length to 30 tokens per input sequence.

L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(H(x_i)) + (1 - y_i) \log(1 - H(x_i)) \right]   (1)

where y_i and H(x_i) denote the ground-truth and predicted categories for each product, and x_i refers to the feature vector obtained from the BERT model.

[Fig. 1: ProBERT: A Fine-tuned BERT Model for Multi-label Product Categorization. The diagram shows the input tokens ([CLS], W_1, ..., [SEP], ..., W_m), BERT's embedding and transformer layers, the resulting feature vector, a fully-connected layer, and sigmoid outputs over the product classes.]

The overall architecture is shown in Fig. 1. We use the combined text of the product title and description as input features. We first apply standard preprocessing, namely lower-casing and lemmatization of the text. Then, BERT-specific preprocessing inserts two special tokens: [CLS] is prepended to the beginning of the text, and [SEP] is inserted after each sentence as an indicator of sentence boundaries. The modified text is then represented as a sequence of tokens X = [w_1, w_2, ..., w_n]. Each token w_i is assigned three kinds of embeddings: a token embedding, a segmentation embedding, and a position embedding. These three embeddings are summed into a single input vector (C), which captures the overall meaning of the input.
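To illustrate the setup described in this section, the following is a minimal sketch in PyTorch using the HuggingFace transformers library. It assumes bert-base-uncased; the class name, variable names, and number of labels (NUM_LABELS) are our own illustrative choices rather than the exact submission code, and the binary cross-entropy of Eq. 1 is computed via the numerically stable logits variant.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

NUM_LABELS = 100      # hypothetical number of GPC categories
MAX_SEQ_LENGTH = 30   # maximum sequence length used in our setup


class ProBERT(nn.Module):
    """BERT-Base (uncased) with a fully-connected layer for multi-label output."""

    def __init__(self, num_labels: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Task-specific dense layer on top of the 768-dim [CLS] vector.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.pooler_output   # pooled [CLS] representation
        return self.classifier(cls_vector)   # one raw logit per label


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = ProBERT(NUM_LABELS)

# Combined product title and description as a single input text;
# the tokenizer adds the [CLS] and [SEP] special tokens.
text = "men's cotton t-shirt. soft cotton tee with a classic fit."
encoding = tokenizer(
    text,
    truncation=True,
    max_length=MAX_SEQ_LENGTH,
    padding="max_length",
    return_tensors="pt",
)

logits = model(encoding["input_ids"], encoding["attention_mask"])
targets = torch.zeros(1, NUM_LABELS)             # multi-hot ground-truth vector
loss = nn.BCEWithLogitsLoss()(logits, targets)   # sigmoid + BCE, as in Eq. 1
probabilities = torch.sigmoid(logits)            # per-label category scores
```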
4 Experiments

4.1 Evaluation

The evaluation metrics used in this challenge are precision, recall, and F1-score. The F1-score in Eq. 2 is the harmonic mean of the precision and recall scores. The organizers used the macro-averaged F1-score as the main metric to compare and rank the participating systems.

F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}   (2)
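As a concrete (toy) illustration of these metrics, the sketch below computes weighted- and macro-averaged precision, recall, and F1-score for multi-label predictions using scikit-learn; the multi-hot matrices are invented and do not come from the challenge data.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Invented multi-hot ground truth and predictions: 4 products, 5 labels.
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 0, 0, 1, 0],
                   [0, 0, 1, 0, 0]])
y_pred = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [0, 1, 1, 0, 0]])

for average in ("weighted", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    print(f"{average:>8} avg: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```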
4.2 Results

The organizers provided an overview of the performance of baselines with different embedding approaches (FastText, CBOW, and Skipgram) on the validation dataset. As shown in Table 1, the baselines are evaluated using both weighted-average and macro-average scores. The experimental results are promising and show that systems based on embedding methods can achieve good F1-scores. Hence, we proposed to employ state-of-the-art contextualized embeddings such as BERT to benchmark system performance.

Table 1: Experimental Results
Model                            Weighted Avg. (P / R / F1)    Macro Avg. (P / R / F1)
Baseline                         85.553 / 84.167 / 84.255      66.164 / 60.709 / 61.542
Baseline+embeddings (CBOW)       86.498 / 86.000 / 85.734      70.639 / 63.925 / 65.551
Baseline+embeddings (Skipgram)   85.453 / 84.911 / 84.575      70.574 / 62.740 / 64.693

The results are reported in terms of three evaluation metrics (precision, recall, and F1-score); the F1-score is ultimately used to compare and rank the participating systems. Table 2 shows the results of the five participating teams and the baseline (FastText). Our team (DICE UPB) submitted one system based on a fine-tuned BERT model. Its performance is close to the baseline system in terms of F1-score (81.84% compared to the baseline's 84.26%). However, we found that feature engineering requires special preprocessing beyond the standard pipeline, due to the nature of the product data: the labels are highly imbalanced, as shown in Figures 2a and 2b, and the descriptions are noisy. We suggest performing the same preprocessing as [8] and changing our strategy for fine-tuning the BERT model to address these challenges properly.

Table 2: System Evaluation Results. R2 refers to systems that participated in the second round. Best results in bold.
System              Precision   Recall   F1-score
Rhinobird           89.01       89.04    88.62
Rhinobird (R2)      88.97       88.72    88.43
Team ISI            87.16       86.85    86.54
ASVinSpace          86.96       86.30    86.10
Megagon             84.98       84.98    84.98
Baseline FastText   85.55       84.17    84.26
DICE UPB            85.30       81.49    81.84

[Fig. 2: Label distributions (log scaled) in the training dataset; panel (a) shows level-1 labels and panel (b) shows level-2 labels.]

5 Conclusion and Future Work

In this paper, we described our approach (ProBERT) for classifying product data based on microdata annotations. Our approach leverages a simple BERT model that builds a single feature vector from a product's title and description and then predicts its categories. Our experiments suggest that ProBERT is a good baseline for benchmarking the task of automatic product classification. In the future, we plan to re-evaluate our approach with different preprocessing and fine-tuning strategies. We will also investigate deeper models with different architectures (e.g., graph-based neural models).

Acknowledgments. This work has been supported by the EU H2020 project KnowGraphs (GA no. 860801) as well as the BMVI projects LIMBO (GA no. 19F2029C) and OPAL (GA no. 19F2028A).
Bibliography

[1] Cevahir, A., Murakami, K.: Large-scale multi-class and hierarchical product categorization for an e-commerce giant. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 525–535 (2016)
[2] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
[3] Gupta, V., Karnick, H., Bansal, A., Jhala, P.: Product classification in e-commerce using distributional semantics. arXiv preprint arXiv:1606.06083 (2016)
[4] Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., Liu, Q.: TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351 (2019)
[5] Kozareva, Z.: Everyone likes shopping! Multi-class product categorization for e-commerce. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1329–1333 (2015)
[6] Xia, Y., Levine, A., Das, P., Di Fabbrizio, G., Shinzato, K., Datta, A.: Large-scale categorization of Japanese product titles using neural attention models. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 663–668 (2017)
[7] Zahera, H.M., Elgendy, I.A., Jalota, R., Sherif, M.A.: Fine-tuned BERT model for multi-label tweets classification. In: TREC (2019)
[8] Zhang, Z., Paramita, M.: Product classification using microdata annotations. In: International Semantic Web Conference, pp. 716–732. Springer (2019)