<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Classifying HTML-embedded Product Data Based On Machine Learning Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksandr Matveiev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiia Zubenko</string-name>
          <email>zubenkoanastasia94@gail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry Yevtushenko</string-name>
          <email>yevtushenkods@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Cherednichenko</string-name>
          <email>olha.cherednichenko@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>Kirpicheva st. 2, Kharkiv, 61002</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we explored machine learning approaches that use descriptions and titles to classify footwear by brand. The data were taken from many different online stores. In particular, we created a pipeline that automatically classifies product brands based on the provided data. The dataset is provided in JSON format and contains more than 40,000 rows. The categorization component was implemented using the K-Nearest Neighbour (K-NN) and Support Vector Machine (SVM) algorithms. The results of the pipeline construction were evaluated based on the classification report; in particular, the weighted-average precision was considered, which reached 79.0% for SVM and 72.0% for K-NN.</p>
      </abstract>
      <kwd-group>
        <kwd>Product classification</kwd>
        <kwd>SVM</kwd>
        <kwd>K-Nearest Neighbour</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>machine learning</kwd>
        <kwd>vectorization</kwd>
        <kwd>item matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Today, there is an enormous number of e-shops that allow consumers to buy goods online. As a
result, the number of products sold through e-shops has grown rapidly. A recent study estimated that total
e-commerce retail sales were $791.70 billion in 2020, up 32.4% from the previous year's $598.02 billion.
This is the highest annual e-commerce growth for any year for which data are available, according to
information reported by the Ministry of Trade in 2019 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One of the drivers of this growth was
COVID-19, which further increased e-commerce revenue in 2020 by $105.47 billion
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, web giants such as Amazon reached $100.83 billion in the fourth quarter of 2020, up
a whopping 47.5% from $68.34 billion a year earlier; this is roughly 2.5 times the 19.5% online revenue
growth recorded during the fourth quarter of 2019.
      </p>
      <p>This global trend of e-commerce is forcing all businesses to go online, resulting in an increasing
number of e-commerce stores. Each e-commerce store has a different flow for publishing a new item
on the platform. Some marketplaces, such as Amazon and eBay, allow users to become sellers and add
products themselves. This functionality permits retailers to increase the number of products they sell.
However, the process of adding new products and assigning categories can lead to consistency issues.
A product misclassified in the first place can be hard to find afterwards. Therefore, the correct
categorization of products is critical for all e-commerce platforms, as it speeds up the search for a
specific product and provides better interaction with users by highlighting the correct categories.</p>
      <p>To solve these problems with goods assigned to the wrong category, an automatic tool is needed
that can classify any product by name within the product taxonomy. At the same time, this process
will facilitate human work and further improve the consistency of product categorization on
e-commerce websites.</p>
      <p>In this paper, we apply several approaches to product categorization for the provided data collection.
The data were taken from many different online stores. The dataset, provided as a JSON file, contains
over 40,000 rows. This number of records allows us to train a model to predict the category of future
products.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>This section provides an overview of existing research on product classification based on product
specifications that have been studied with different approaches and methods in recent years.</p>
      <p>Because not all websites use a product classification hierarchy, and those that do may use completely
different ones, a unified product classification across websites is needed in order to provide the user
with useful features such as browsing and searching.</p>
      <p>
        There are several approaches to product data classification. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced a modified Naive
Bayes model for classifying goods, using the standard Naive Bayes model rather than a text classifier.
Although the accuracy is fairly high, the main disadvantage of this approach is choosing the
right weights, as it relies on observing the data and manually assigning weights based on selected
features. Failure to select appropriate weights significantly changes the results. Lin and Shankar
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] investigated effective pre-processing methods and multi-class features to improve classification
accuracy. The paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] discussed the classification process in terms of what a classification is, and
presented a semantic classification model (SCM) for e-catalogs. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], fuzzy set modelling was used to identify
categories, but this model lacked a comparison of classification accuracy for evaluation.
      </p>
      <p>
        Recently, the categorization of goods using product descriptions, such as the work of Chen and
Warren, has aroused great interest [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Despite these efforts, there are not many studies aimed at classifying goods by name
and description.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The product classification pipeline</title>
      <p>At a high level, the goal of our system is to build a multi-class classifier which can accurately
predict the product category of a new, unlabeled product title. The high-level steps are presented in
Figure 1:</p>
      <p>a. Remove topic-neutral words such as articles (a, an, the), prepositions (in, of, at), conjunctions
(and, or, nor), etc. from the documents.</p>
      <p>b. Apply word stemming.</p>
      <p>c. Train the classification models (SVM, K-NN).</p>
      <p>d. Analyse the results.</p>
      <p>The full process is described below.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Classifiers Overview</title>
      <p>We use two classification models: multi-class SVM and K-Nearest Neighbours (K-NN), applied to
the selected features. These two models were chosen to compare a discriminative (SVM) and a
nonparametric (K-NN) approach.</p>
      <p>The classifier is built by learning from the provided dataset and can be used to classify
unknown products by brand in the future. We provide a brief description of each algorithm in this
section.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1.1. SVM Based Categorization</title>
      <p>
        SVM was introduced as an algorithm for text classification by Joachims [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Let S = {(x₁, y₁), …, (xₙ, yₙ)} be a set of n training instances, where each xᵢ is a feature
vector and the category yᵢ ∈ {−1, +1}. SVM learns linear decision rules h(x) = sign(w · x + b),
described by a weight vector w and a threshold b. If S is linearly separable, SVM finds the hyperplane
with maximum Euclidean distance to the closest training instances. If S is non-separable, the amount
of training error is measured using slack variables ξᵢ. Computing the hyperplane is equivalent to
solving the following optimization problem [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]:
      </p>
      <p>minimize: V(w, b, ξ) = ½ (w · w) + C ∑ᵢ₌₁ⁿ ξᵢ (1)</p>
      <p>subject to: ∀ i = 1…n: yᵢ [w · xᵢ + b] ≥ 1 − ξᵢ (2)</p>
      <p>∀ i = 1…n: ξᵢ &gt; 0 (3)</p>
      <p>The factor C in (1) is a parameter used for trading off training error vs. model complexity. The
constraints (2) require that all training instances be classified correctly up to some slack ξᵢ.</p>
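      <p>To make the optimization concrete, the soft-margin objective above can be minimized by plain subgradient descent. The sketch below is illustrative only (the function names, such as train_linear_svm, and the hyperparameters are ours, not the paper's); the experiments in this paper rely on a library implementation.</p>

```python
import random

def train_linear_svm(X, y, C=1.0, epochs=300, lr=0.01):
    """Soft-margin linear SVM h(x) = sign(w.x + b), trained by
    subgradient descent on 1/2 w.w + C * sum(max(0, 1 - y_i (w.x_i + b)))."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        random.shuffle(idx)
        for i in idx:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The regularizer 1/2 w.w always contributes w to the subgradient.
            grad_w = list(w)
            grad_b = 0.0
            if margin < 1:  # hinge loss active: slack xi_i would be > 0
                grad_w = [gw - C * yi_xj for gw, yi_xj in
                          zip(grad_w, (y[i] * xj for xj in X[i]))]
                grad_b = -C * y[i]
            w = [wj - lr * gw for wj, gw in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b

def predict(w, b, x):
    # Decision rule h(x) = sign(w.x + b).
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

      <p>On a small linearly separable sample, the learned hyperplane classifies all training points correctly while C trades off slack against the margin, mirroring equations (1)-(2).</p>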
    </sec>
    <sec id="sec-6">
      <title>3.1.2. K-NN Algorithm</title>
      <p>
        The K-Nearest Neighbour (K-NN) algorithm is one of the most popular classification algorithms [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. The algorithm is based
on finding the most similar objects from sample groups based on their mutual Euclidean distance [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        The algorithm assumes that it is possible to represent documents as points in Euclidean space [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
The distance between two points p = (p₁, p₂) and q = (q₁, q₂) can be calculated as follows:
      </p>
      <p>d(p, q) = d(q, p) = √((q₁ − p₁)² + (q₂ − p₂)²) (4)</p>
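      <p>The distance measures used with K-NN in this paper can be sketched in a few lines of plain Python (the helper names, such as knn_predict, are illustrative, not the paper's code):</p>

```python
import math
from collections import Counter

def euclidean(p, q):
    # Equation (4), generalized to any dimension:
    # d(p, q) = sqrt(sum_i (q_i - p_i)^2)
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def cosine_distance(p, q):
    # 1 - cosine similarity; small when the vectors point the same way.
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / (norm_p * norm_q)

def knn_predict(train, labels, x, k=3, dist=euclidean):
    # Majority vote among the k training points closest to x.
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], x))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

      <p>Swapping dist=cosine_distance into knn_predict reproduces the cosine-similarity variant evaluated later in Section 3.7.2.</p>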
    </sec>
    <sec id="sec-7">
      <title>Exploratory Data Analysis</title>
    </sec>
    <sec id="sec-8">
      <title>3.2.1. Convert a file from JSON format to CSV</title>
      <p>First of all, it is necessary to convert the input format to CSV. This format is more common in Python
and gives us more opportunities to work with data.
</p>
      <p>To do this, we installed the pandas library with the following command:
pip install pandas.</p>
      <p>This library contains the read_json() method, which allows you to load a file into the program and
continue working with it. The read_json() method can take several parameters, but we used only one:
path_or_buf. This parameter is responsible for the path to our JSON file.</p>
      <p>Once we download the file data to the program's memory, we can start working on it. The data
downloaded to the program's memory can be written to a CSV file, using the following method
to_csv(). In this method, we passed the path where we wanted to place our CSV file as a parameter.</p>
      <p>The code needed to convert a file from JSON to CSV can be found in the convert.py script. Run the
file with the following command: python convert.py.</p>
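      <p>The conversion step can be sketched as follows. This is a minimal, illustrative version of what convert.py might contain (the helper name convert_json_to_csv is ours, not the paper's); it only assumes the standard pandas read_json() and to_csv() calls described above.</p>

```python
import pandas as pd

def convert_json_to_csv(json_path, csv_path):
    # read_json() loads the JSON file into a DataFrame; path_or_buf
    # (the first positional argument) is the path to the JSON file.
    df = pd.read_json(json_path)
    # to_csv() writes the DataFrame to the given path; index=False
    # keeps the row index out of the output file.
    df.to_csv(csv_path, index=False)
    return df
```

      <p>Running the script then reduces to a single call such as convert_json_to_csv("input.json", "output.csv").</p>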
    </sec>
    <sec id="sec-9">
      <title>3.2.2. Input analysis</title>
      <p>After we have converted the input file, we can start its analysis. The input file contains 41,664
records and 17 columns.</p>
      <p>Consider the source data contained in the tables. The data is presented in Figures 3 and 4.</p>
      <p>We focused on each of the provided columns separately. This is important because a more detailed
analysis allowed us to understand exactly how to configure the script for automatic data processing.</p>
      <p>The number of null values in the tables was analysed; the result is presented in Figure 5.</p>
      <p>Analysing Figure 5, we concluded that the data contain many null values. However, this function
only calculates the sum of null values per column, so for a column that contains no records at all the
count alone is not informative. The proof of this issue is presented in Figures 3 and 4, where we can
see the empty columns. Thus, before deleting the null rows, an additional manual examination of the
columns is required. The result of our additional analysis is presented in Figure 6.</p>
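      <p>The null-value check described above can be sketched with pandas (null_report is a hypothetical helper name): isnull().sum() gives the per-column null counts, and comparing each count with the total number of rows flags the completely empty columns that require the additional manual examination.</p>

```python
import pandas as pd

def null_report(df):
    # Count missing values in each column.
    counts = df.isnull().sum()
    # A column whose null count equals the row count holds no data
    # at all and needs manual inspection before rows are dropped.
    totally_empty = [c for c in df.columns if counts[c] == len(df)]
    return counts, totally_empty
```
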
    </sec>
    <sec id="sec-10">
      <title>Feature Selection based on the Exploratory Data Analysis</title>
      <p>Based on the data analysis stage, we identified columns that were used for further modelling. Thus,
for the machine learning model, we used: title, description, and brand. The example of the columns and
the data they contain is presented in Figure 7.</p>
      <p>Based on the example provided in Figure 7, we concluded that the existing data cannot be used for
appropriate product categorization because of:
- a large number of empty values;
- data duplication.</p>
      <p>Therefore, before approaching the categorization of the data, we decided to proceed with further
cleaning. So, we developed a component which cleans the input data automatically.</p>
    </sec>
    <sec id="sec-11">
      <title>Pre-Processing. Automatic Cleaning for the Input Data</title>
      <p>Since our solution will work with new data on its own in the future, we developed a script that cleans the
data automatically.</p>
      <p>First of all, we removed rows that contain empty values; otherwise, the algorithm cannot process
the data correctly. To do this, we used the dropna() method that comes with the pandas package. This
method automatically deletes the empty cells.</p>
      <p>Next, duplicates are removed with the drop_duplicates() method. For this method to modify the
current data frame (and not return a new one), set the inplace=True parameter. Since the input data will be
obtained from several resources, we need to process them further.</p>
      <p>The HTML tags were removed, as there was a risk that they might be in our sample. This was done
using the BeautifulSoup() and get_text() methods from the bs4 library.</p>
      <p>Then the special characters which could be in the data were removed. The re library was imported and
its sub() method was used. As the first parameter, we passed the pattern [^a-zA-Z\d], which matches any
character that is not a letter or a digit.</p>
      <p>The next step was to transform all the text data into lowercase and break it down into words. To do
this we used the lower() and split() methods.</p>
      <p>After that, the stop words can be removed; for that, the stopwords() function was used. This function
takes one argument: the language we work with. We passed the value 'english', and the returned
stop-word list was used to filter the words in each cell.</p>
      <p>To start automatic cleaning of the input data, you should run the clear.py script (python clear.py), which
contains all the steps described above. After executing the script, our document contains 3 columns and
10,200 unique cleaned rows. An example of the processed data is shown in Figure 8.</p>
      <p>Thus, after processing 41,664 rows, 10,200 rows were left, which is 24.48% of the initial dataset.</p>
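      <p>The cleaning steps above can be sketched in plain Python. This illustrative version strips tags with a regular expression and uses a tiny stop-word set of our own; the paper itself uses BeautifulSoup's get_text() and NLTK's English stop-word list instead.</p>

```python
import re

# A small illustrative stop-word set; the paper uses NLTK's
# English stop words instead.
STOP_WORDS = {"a", "an", "the", "in", "of", "at", "and", "or", "nor", "is"}

def clean_text(text):
    # 1. Strip HTML tags (the paper uses BeautifulSoup/get_text()).
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Remove special characters, keeping only letters and digits.
    text = re.sub(r"[^a-zA-Z\d]", " ", text)
    # 3. Lowercase and split into words.
    words = text.lower().split()
    # 4. Drop stop words.
    return [w for w in words if w not in STOP_WORDS]
```

      <p>For real data the bs4 and NLTK calls described above are preferable, since regular expressions handle malformed HTML poorly.</p>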
    </sec>
    <sec id="sec-12">
      <title>3.5. Data Transformation. Text Vectorization</title>
      <p>
        Machine learning algorithms usually operate on a numeric feature space. To apply an algorithm to
text, we transformed our text data into vector representations. This is called feature extraction or
vectorization [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>In this paper, we evaluated the performance of three methods: HashingVectorizer and CountVectorizer,
which convert a collection of text data into a matrix of token counts, and the TfidfVectorizer
method, which converts a collection of raw documents into a matrix of TF-IDF features.</p>
      <p>
        HashingVectorizer and CountVectorizer are meant to do the same thing: convert a
collection of text documents to a matrix of token occurrences [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Term frequency-inverse document
frequency (TF-IDF) is a feature vectorization method used to reflect the importance of a term to a
document in the corpus [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. TF-IDF can be calculated as:
      </p>
      <p>wᵢⱼ = tfᵢⱼ × log₂(N / dfᵢ) (5)</p>
      <p>where wᵢⱼ is the weight of term i in document j, N is the number of documents in the collection, tfᵢⱼ is
the term frequency of term i in document j, and dfᵢ is the document frequency of term i in the collection.</p>
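      <p>Equation (5) can be computed directly. The sketch below (tfidf is a hypothetical helper name) weights each term of each tokenized document by tf × log₂(N / df):</p>

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute w_ij = tf_ij * log2(N / df_i) for a list of
    tokenized documents (equation (5))."""
    N = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log2(N / df[t]) for t in tf})
    return weights
```

      <p>A term that occurs in every document gets weight 0, while a term unique to one document gets the full log₂(N) boost, which is exactly the discriminative behaviour TF-IDF is used for here.</p>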
      <p>
        To obtain better results with documents of different length, we used a modified equation [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The vectorization also allows us to calculate the number of unique categories we are going to
classify; as a result, we worked with 323 different classes.
      </p>
    </sec>
    <sec id="sec-13">
      <title>3.6. Modelling Classification Algorithms</title>
      <p>Each row is converted into an appropriate vector representation and used for training and validation.
Both algorithms process the features selected in Section 3.3.</p>
      <p>For the selected, cleaned features we applied the CountVectorizer vectorization function; vectorization
was performed while applying the StratifiedKFold method, which splits the dataset into test groups. The
selected vectorization function allows us to evaluate the performance of the built model and compare
the results with those obtained with K-NN. After applying the vectorization function, the next step was
determining the optimal value of the C parameter. The evaluation of the selected C parameters is presented
in Section 3.7.1.</p>
      <p>Once the features were selected, we vectorized the data. For the K-NN model, we evaluated the performance
by applying each of the vectorizing functions described in Section 3.5. The evaluation is done in Section 3.7.2.</p>
      <p>In the next step, we determined the K value. To determine the distance between vectors for
K-NN, we used both cosine similarity and Euclidean distance.</p>
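      <p>A stratified split of the kind StratifiedKFold performs can be sketched in plain Python (stratified_kfold is an illustrative helper, not the library call): dealing the indices of each class round-robin keeps every fold's class ratio close to the dataset's.</p>

```python
from collections import defaultdict

def stratified_kfold(labels, k):
    """Split indices 0..len(labels)-1 into k folds so that each fold
    roughly preserves the class ratio of `labels`."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's indices round-robin across the folds.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds
```

      <p>With unbalanced data, this is what prevents a fold from containing examples of only one class, the failure mode discussed for regular k-fold in Section 3.7.</p>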
    </sec>
    <sec id="sec-14">
      <title>3.7. Models Evaluation</title>
      <p>The stratified k-fold cross-validation method was used to assess the quality of the model at the initial
stage. Choosing between regular k-fold cross-validation and stratified k-fold cross-validation, we selected
the stratified variant. Because we have unbalanced data, stratified k-fold cross-validation is useful for our
experiment. It was decided not to use regular k-fold cross-validation because we do not have enough data
and that method does not preserve the ratio of classes, which can lead to partitions in which some folds
contain training examples from only one class.</p>
      <p>Stratified cross-validation is suitable for assessing the quality of a classifier without the use of test
data: testing occurs on parts of the training sample that are not known to the classifier. This
assessment approach helps determine whether the system is prone to overfitting. In our experiment, we used
stratified cross-validation with k folds (k = 6) for 10,200 products and 323 categories. Therefore, the
evaluation was done on 1703 products.</p>
      <p>The evaluation was performed in several test phases:
- classification quality;
- text classification speed;
- classification recall according to the categories of product.</p>
      <p>
        Classification results reported in this section are based on an evaluation performed
according to the F1-measure, precision, recall, and accuracy metrics [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. To evaluate the overall
performance of the algorithms on the given dataset, we focused on the F1 macro average. The F1 macro average
calculates the score per class without using weights for the aggregation. The F1 weighted
average also calculates the score for each class independently, but when adding the scores together it uses a
weight that depends on the number of true labels of each class. Therefore, the F1 weighted average favours the
majority class, which we do not want.
      </p>
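      <p>The difference between the macro and weighted averages discussed above can be made concrete with a plain-Python sketch (hypothetical helper names):</p>

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    # One-vs-rest F1 for every class that appears in either list.
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def f1_macro(y_true, y_pred):
    # Unweighted mean over classes: every class counts equally.
    s = f1_per_class(y_true, y_pred)
    return sum(s.values()) / len(s)

def f1_weighted(y_true, y_pred):
    # Weighted by the number of true labels: favours majority classes.
    s = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(s[c] * support[c] for c in s) / len(y_true)
```

      <p>On unbalanced data, a classifier that ignores a minority class is punished by the macro average but largely forgiven by the weighted one, which is why the macro average is used here.</p>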
    </sec>
    <sec id="sec-15">
      <title>3.7.1. SVM Model Evaluation</title>
      <p>We applied different values for the C parameter to ensure that the experimental results faithfully
reflect the performance of the algorithms.</p>
      <p>From the experimental results of the SVM, the C parameter equal to 0.125 is optimal, based on the
execution time of 310.8 sec (5.5 min) and a macro-average F1-score of 72%.</p>
      <p>Also, to measure performance we calculated the number of goods with correct and incorrect
classification, from which the percentages of correctly and incorrectly classified categories were found. The
algorithm creates separate files for the initial and the classified values and automatically compares them.
This function then calculates the sums of correctly and incorrectly predicted values and the corresponding
percentages.</p>
      <p>The output of this function is presented in Figure 8. The comparison is presented in Figure 10.</p>
      <p>In the next step, we determined the K value. The K value of the K-NN algorithm indicates the number
of items from the collection that are closest to the selected row. To determine the distance between
vectors for K-NN, we used both cosine similarity and Euclidean distance.</p>
    </sec>
    <sec id="sec-16">
      <title>3.7.2. K-NN Model Evaluation</title>
      <p>Various scaling methods, such as cosine similarity and Euclidean distance, were used to evaluate the
efficiency of the model. The final analysis of model efficiency is based on the chosen method.
Figures 11-13 represent some results of our experiments.</p>
      <p>Based on the K-NN model evaluation results, the best result for classification by brand was obtained
using the TfidfVectorizer vectorization method and the cosine similarity metric, where the macro
average F1 is 70%. The numbers of goods with correct and incorrect classification and the percentages
of correctly and incorrectly classified categories were calculated in the same way as for SVM, presented in Figure 10.</p>
      <p>Also, we can see that the execution time, which is 9.04 sec for the best result, depends on the selected
scaling method, metrics, and the number of features used for the evaluation.</p>
      <p>Therefore, we can conclude that if the number of input features is increased, the execution time
could become critical, and another faster model can be used.</p>
    </sec>
    <sec id="sec-17">
      <title>4. Conclusion</title>
      <p>In this paper, we present an investigation of two widely used approaches to text categorization:
the K-NN and SVM algorithms.</p>
      <p>The main goal of the research was to evaluate the performance of the two popular algorithms, K-NN
and SVM, to compare their execution times, and to develop an MVP pipeline that can
automatically classify footwear by brand.</p>
      <p>The combination of the K-NN algorithm with different vectorization methods showed good results,
as did SVM with CountVectorizer. However, despite its good performance, the SVM
algorithm has the highest execution time, which can be significant for big marketplaces.</p>
      <p>Therefore, the results reported in this paper are satisfactory; however, they are not
the best that can be achieved, and additional investigation is needed to improve the performance
of the applied algorithms.</p>
      <p>To further study and improve the model, the following steps are suggested:
- get more data to test the models;
- implement an algorithm for automatic search of optimal parameters;
- prepare the developed module for integration with e-commerce stores.</p>
    </sec>
    <sec id="sec-18">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Quarterly retail e-commerce sales in the last quarter of 2020. US Digital Commerce Bureau News</article-title>
          (
          <year>2020</year>
          ) https://www.digitalcommerce360.com/article/quarterly-online-sales/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Young-Gon Modified naïve bayes classifier for e-catalog classification</article-title>
          , Seoul 151-742.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shankar Applying</surname>
          </string-name>
          Machine Learning to Product Categorization Stanford University. CS229.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kim</given-names>
            <surname>Dongkyu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sang-Goo</surname>
            <given-names>Lee</given-names>
          </string-name>
          , Jonghoon Chun,
          <article-title>Juhnyoung LeeA semantic classification model for e-catalogs</article-title>
          .:
          <string-name>
            <surname>Proceedings - IEEE International Conference on E-Commerce</surname>
            <given-names>Technology</given-names>
          </string-name>
          ,
          <string-name>
            <surname>CEC</surname>
          </string-name>
          <year>2004</year>
          , p
          <fpage>85</fpage>
          -
          <lpage>92</lpage>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Wan</surname>
          </string-name>
          , Hongxin; Peng,
          <article-title>Yun A technique of e-commerce goods classification and evaluation based on fuzzy set</article-title>
          .
          <source>Proceedings, International Conference on Internet Technology and Applications</source>
          , ITAP (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jianfu</given-names>
            <surname>Chen</surname>
          </string-name>
          , David Warren.
          <article-title>Cost-sensitive learning for large-scale hierarchical classification</article-title>
          .
          <source>In Proceedings of the 22Nd ACM International Conference on Conference on Information &amp; Knowledge Management, CIKM</source>
          , pages
          <fpage>1351</fpage>
          -
          <lpage>1360</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Neighbor-weighted K-nearest neighbour for unbalanced text corpus</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>28</volume>
          (
          <year>2005</year>
          )
          <fpage>667</fpage>
          -
          <lpage>671</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Supervised and Traditional Term Weighting Methods for Automatic Text Categorization</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , VOL.
          <volume>31</volume>
          , NO.
          <volume>4</volume>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Text Vectorization and Transformation Pipelines. Chapter 4. https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] HashingVectorizer, CountVectorizer https://kavita-ganesan.com/hashingvectorizer-vscountvectorizer/</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Uguz</surname>
          </string-name>
          ,
          <article-title>A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>24</volume>
          (
          <year>2011</year>
          )
          <fpage>1024</fpage>
          -
          <lpage>1032</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. T.-Y.</given-names>
            <surname>Kwok</surname>
          </string-name>
          ,
          <source>Automatic Text Categorization Using Support Vector Machine, Proceedings of International Conference on Neural Information Processing</source>
          , (
          <year>1998</year>
          )
          <fpage>347</fpage>
          -
          <lpage>351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>Text Categorization with Support Vector Machines: Learning with Many Relevant Features</article-title>
          ,
          <source>In Proceedings of 10th European Conference on Machine Learning</source>
          , Chemnitz, Germany, pages
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>A Statistical Learning Model of Text Classification for Support Vector Machines</article-title>
          .
          <source>In Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>128</fpage>
          -
          <lpage>136</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>An improved KNN algorithm for text classification</article-title>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Greer</surname>
          </string-name>
          ,
          <article-title>KNN Model-Based Approach in Classification</article-title>
          , (
          <year>2003</year>
          )
          <fpage>986</fpage>
          -
          <lpage>996</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Ming-Yang</given-names>
          </string-name>
          ,
          <article-title>Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification</article-title>
          ,
          <source>Journal of Network and Computer Applications</source>
          <volume>34</volume>
          (
          <year>2011</year>
          )
          <fpage>722</fpage>
          -
          <lpage>730</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ishida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goto</surname>
          </string-name>
          ,
          <article-title>A Proposal of Extended Cosine Measure for Distance Metric Learning in Text Classification</article-title>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>Machine Learning in Automated Text Categorization</article-title>
          ,
          <source>ACM Computing Surveys</source>
          , Vol.
          <volume>34</volume>
          , No. 1, March 2002
          , pages
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>