ProBERT: Product Data Classification with Fine-tuning BERT Model

Hamada M. Zahera and Mohamed A. Sherif
Data Science Group, Paderborn University, Germany
{hamada.zahera, mohamed.sherif}@uni-paderborn.de

Abstract. In this paper, we describe our submission to the Semantic Web Challenge on Mining HTML-embedded Product Data (MWPD2020). The dataset provides 19K instances of product data collected from various websites. The task is to predict the categories of the product titles in the test set, where categories are defined by the hierarchical taxonomy provided in the training set. In our approach, we present a simple BERT-based model (dubbed ProBERT) for classifying product data into one or more categories. We train our system on product titles and descriptions to learn semantic representations. The participating systems are evaluated using weighted-average precision, recall, and F1-score.

1 Introduction

Recently, many e-commerce websites have been embedding structured product data into their content; according to statistics from Web Data Commons¹, 37% of web pages (on 30% of websites) contain structured data. Consequently, these structured data can be used for product data integration and for optimizing product search services [8]. In addition, product categorization has become essential for providing personalized recommendations and targeted advertisements. However, classifying product data is a challenging task due to the intrinsically noisy nature of product labels and the size of modern e-commerce catalogues. Moreover, each website structures its product data differently, which we refer to as site-specific annotation [5, 1]. For example, one product such as a T-shirt can have different annotation labels on different websites (College>T-Shirts, Clothing>Tops>Shirts, Clothing accessories>Clothing>Tops). Training robust models in such cases requires a large amount of training data with balanced classes. Therefore, automated product classification is needed to organize these data semantically into a universal categorization system, regardless of their site-specific annotation.

In this paper, we explain our method for solving this problem through the Semantic Web Challenge on Mining HTML-embedded Product Data (MWPD2020²). The challenge aims at mining product data embedded in website content. Previous studies [3, 6] focused on categorizing product data on a single e-commerce website and are sensitive to its site-specific content. In this challenge, the goal is to predict each product's categories based on datasets from different websites. We address this task as a multi-label classification problem, where each product can be assigned more than one class (i.e., label or category) simultaneously.

The latest developments in language models (e.g., BERT) have shown impressive gains on a wide variety of natural language tasks, ranging from sentence classification to sequence labeling. In our approach, we propose a BERT-based neural model that categorizes a product based on its metadata, such as the product name, description, or site-specific annotation. In particular, we fine-tune a BERT model to represent product data as low-dimensional contextualized vectors. We feed our model with the product name and description to capture a semantic representation of the product information.

We summarize our main contributions as follows:
– We present ProBERT, a BERT-based model for multi-label product classification based on product metadata (e.g., name, description, and site annotations).
– We conduct experiments to benchmark the impact of different embedding approaches. The results indicate that our method, using contextualized embeddings (BERT), can serve as a good baseline for product classification.

The rest of this paper is organized as follows: We first describe the dataset used in the challenge in Section 2. Then, we present our proposed approach and the official results in Sections 3 and 4, respectively. In Section 5, we conclude the paper with a discussion of future work.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ http://webdatacommons.org/structureddata/2018-12/stats/stats.html
² https://ir-ischool-uos.github.io/mwpd/

2 Dataset

The dataset is provided in JSON format and divided into three subsets: (1) a training set containing approximately 10K product instances, (2) a validation set containing 3K instances, and (3) a test set of 3K instances used for evaluating the submitted systems. The product attributes in the dataset are as follows (an illustrative example record follows after the two lists below):
– ID: the product identification number.
– Name: the product name (an empty string if unavailable).
– Description: the product description, truncated to a maximum of 5K characters (an empty string if unavailable).
– CategoryText: the website-specific category of the product, or breadcrumb (an empty string if unavailable).
– URL: the original web page URL of the product.

Each product may be assigned one or more labels from the following classification levels, corresponding to the three GS1 GPC classification levels:
– lvl1: the level 1 GS1 GPC classification.
– lvl2: the level 2 GS1 GPC classification.
– lvl3: the level 3 GS1 GPC classification.
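To make the schema concrete, the record below is a hypothetical example following the attributes above. All field values are invented for illustration; the lvl1 and lvl2 labels are taken from GS1 GPC labels that occur in the training data (cf. Fig. 2), while the lvl3 value is only a placeholder.

```python
# A hypothetical MWPD2020 product record (all values invented).
example_product = {
    "ID": "12345",
    "Name": "Men's cotton crew-neck t-shirt",
    "Description": "Soft cotton tee with a classic fit ...",
    "CategoryText": "Clothing > Tops > Shirts",  # site-specific breadcrumb
    "URL": "https://www.example.com/products/12345",
    "lvl1": "67000000_Clothing",    # level 1 GS1 GPC label
    "lvl2": "67010000_Clothing",    # level 2 GS1 GPC label
    "lvl3": "<level-3 GPC label>",  # placeholder; level-3 labels not listed here
}
```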
3 Approach

In this section, we present ProBERT, our simple BERT-based model for multi-label product classification. BERT is a pre-trained transformer network [2] that has set new state-of-the-art results for various NLP tasks, including text classification [7] and natural language understanding [4]. Adapting BERT to NLP tasks in a target domain requires a proper fine-tuning strategy, in which a task-specific layer is added on top of the BERT architecture. In this work, we leverage the BERT-Base pre-trained model (uncased: 12 layers, 768 hidden units, 12 attention heads, 110M parameters) and add a fully-connected (i.e., dense) layer on top. For multi-label classification, we use the binary cross-entropy loss function in Eq. 1 and replace the original softmax with a sigmoid activation function. All hyper-parameters remain at their default values, except that we set the maximum sequence length to 30 tokens per input sequence.

L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(H(x_i)) + (1 - y_i) \log(1 - H(x_i)) \right]   (1)

where y_i and H(x_i) denote the ground-truth and predicted categories for each product, and x_i refers to the feature vector obtained from the BERT model.

[Fig. 1: ProBERT: A Fine-tuned BERT Model for Multi-label Product Categorization. The diagram shows the input tokens ([CLS], W_1, ..., [SEP], ..., W_m), BERT's embedding and transformer layers, the resulting feature vector, a fully-connected layer, and sigmoid outputs over the product classes.]

The overall architecture is shown in Fig. 1. We use the combined text of the product title and description as input features. We first apply standard preprocessing, namely lower-casing and lemmatization of the text. Then, BERT-specific preprocessing inserts two special tokens: [CLS] is prepended to the beginning of the text, and [SEP] is inserted after each sentence as an indicator of sentence boundaries. The modified text is then represented as a sequence of tokens X = [w_1, w_2, ..., w_n]. Each token w_i is assigned three kinds of embeddings: a token embedding, a segmentation embedding, and a position embedding. These three embeddings are summed into a single input vector (C), which captures the overall meaning of the input.
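To illustrate the setup described in this section, the following is a minimal sketch in PyTorch using the HuggingFace transformers library. It assumes bert-base-uncased; the class name, variable names, and number of labels (NUM_LABELS) are our own illustrative choices rather than the exact submission code, and the binary cross-entropy of Eq. 1 is computed via the numerically stable logits variant.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

NUM_LABELS = 100      # hypothetical number of GPC categories
MAX_SEQ_LENGTH = 30   # maximum sequence length used in our setup


class ProBERT(nn.Module):
    """BERT-Base (uncased) with a fully-connected layer for multi-label output."""

    def __init__(self, num_labels: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Task-specific dense layer on top of the 768-dim [CLS] vector.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.pooler_output   # pooled [CLS] representation
        return self.classifier(cls_vector)   # one raw logit per label


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = ProBERT(NUM_LABELS)

# Combined product title and description as a single input text;
# the tokenizer adds the [CLS] and [SEP] special tokens.
text = "men's cotton t-shirt. soft cotton tee with a classic fit."
encoding = tokenizer(
    text,
    truncation=True,
    max_length=MAX_SEQ_LENGTH,
    padding="max_length",
    return_tensors="pt",
)

logits = model(encoding["input_ids"], encoding["attention_mask"])
targets = torch.zeros(1, NUM_LABELS)             # multi-hot ground-truth vector
loss = nn.BCEWithLogitsLoss()(logits, targets)   # sigmoid + BCE, as in Eq. 1
probabilities = torch.sigmoid(logits)            # per-label category scores
```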
4 Experiments

4.1 Evaluation

The evaluation metrics used in this challenge are precision, recall, and F1-score. The F1-score in Eq. 2 is the harmonic mean of the precision and recall scores. The organizers used the macro-averaged F1-score as the main metric to compare and rank the participating systems.

F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}   (2)
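As a concrete (toy) illustration of these metrics, the sketch below computes weighted- and macro-averaged precision, recall, and F1-score for multi-label predictions using scikit-learn; the multi-hot matrices are invented and do not come from the challenge data.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Invented multi-hot ground truth and predictions: 4 products, 5 labels.
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 0, 0, 1, 0],
                   [0, 0, 1, 0, 0]])
y_pred = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [0, 1, 1, 0, 0]])

for average in ("weighted", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    print(f"{average:>8} avg: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```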
4.2 Results

The organizers provided an overview of the performance of baselines with different embedding approaches (FastText, CBOW, and Skipgram) on the validation dataset. As shown in Table 1, the baselines are evaluated using both weighted-average and macro-average scores. The experimental results are promising and show that systems based on embedding methods can achieve good F1-scores. Hence, we proposed to employ state-of-the-art contextualized embeddings such as BERT to benchmark system performance.

Table 1: Experimental Results
Model                            Weighted Avg. (P / R / F1)    Macro Avg. (P / R / F1)
Baseline                         85.553 / 84.167 / 84.255      66.164 / 60.709 / 61.542
Baseline+embeddings (CBOW)       86.498 / 86.000 / 85.734      70.639 / 63.925 / 65.551
Baseline+embeddings (Skipgram)   85.453 / 84.911 / 84.575      70.574 / 62.740 / 64.693

The results are reported in terms of three evaluation metrics (precision, recall, and F1-score); the F1-score is ultimately used to compare and rank the participating systems. Table 2 shows the results of the five participating teams and the baseline (FastText). Our team (DICE UPB) submitted one system based on a fine-tuned BERT model. Its performance is close to the baseline system in terms of F1-score (81.84% compared to the baseline's 84.26%). However, we found that feature engineering requires special preprocessing beyond the standard pipeline, due to the nature of the product data: the labels are highly imbalanced, as shown in Figures 2a and 2b, and the descriptions are noisy. We suggest performing the same preprocessing as [8] and changing our strategy for fine-tuning the BERT model to address these challenges properly.

Table 2: System Evaluation Results. R2 refers to systems that participated in the second round. Best results in bold.
System              Precision   Recall   F1-score
Rhinobird           89.01       89.04    88.62
Rhinobird (R2)      88.97       88.72    88.43
Team ISI            87.16       86.85    86.54
ASVinSpace          86.96       86.30    86.10
Megagon             84.98       84.98    84.98
Baseline FastText   85.55       84.17    84.26
DICE UPB            85.30       81.49    81.84

[Fig. 2: Label distributions (log scaled) in the training dataset; panel (a) shows level-1 labels and panel (b) shows level-2 labels.]

5 Conclusion and Future Work

In this paper, we described our approach (ProBERT) for classifying product data based on microdata annotations. Our approach leverages a simple BERT model that builds a single feature vector from a product's title and description and then predicts its categories. Our experiments suggest that ProBERT is a good baseline for benchmarking the task of automatic product classification. In the future, we plan to re-evaluate our approach with different preprocessing and fine-tuning strategies. We will also investigate deeper models with different architectures (e.g., graph-based neural models).

Acknowledgments. This work has been supported by the EU H2020 project KnowGraphs (GA no. 860801) as well as the BMVI projects LIMBO (GA no. 19F2029C) and OPAL (GA no. 19F2028A).
Bibliography

[1] Cevahir, A., Murakami, K.: Large-scale multi-class and hierarchical product categorization for an e-commerce giant. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 525–535 (2016)
[2] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
[3] Gupta, V., Karnick, H., Bansal, A., Jhala, P.: Product classification in e-commerce using distributional semantics. arXiv preprint arXiv:1606.06083 (2016)
[4] Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., Liu, Q.: TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351 (2019)
[5] Kozareva, Z.: Everyone likes shopping! Multi-class product categorization for e-commerce. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1329–1333 (2015)
[6] Xia, Y., Levine, A., Das, P., Di Fabbrizio, G., Shinzato, K., Datta, A.: Large-scale categorization of Japanese product titles using neural attention models. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 663–668 (2017)
[7] Zahera, H.M., Elgendy, I.A., Jalota, R., Sherif, M.A.: Fine-tuned BERT model for multi-label tweets classification. In: TREC (2019)
[8] Zhang, Z., Paramita, M.: Product classification using microdata annotations. In: International Semantic Web Conference, pp. 716–732. Springer (2019)