<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Classifying HTML-embedded Product Data Based On Machine Learning Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksandr Matveiev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiia Zubenko</string-name>
          <email>zubenkoanastasia94@gail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry Yevtushenko</string-name>
          <email>yevtushenkods@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Cherednichenko</string-name>
          <email>olha.cherednichenko@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>Kirpicheva st. 2, Kharkiv, 61002</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we explored machine learning approaches that use descriptions and titles to classify footwear by brand. The data were taken from many different online stores. In particular, we created a pipeline that automatically classifies product brands based on the provided data. The dataset is provided in JSON format and contains more than 40,000 rows. The categorization component was implemented using the K-Nearest Neighbour (K-NN) and Support Vector Machine (SVM) algorithms. The results of the pipeline construction were evaluated based on the classification report; in particular, the weighted-average precision was considered, which reached 79.0% for SVM and 72.0% for K-NN.</p>
      </abstract>
      <kwd-group>
        <kwd>Product classification</kwd>
        <kwd>SVM</kwd>
        <kwd>K-Nearest Neighbour</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>machine learning</kwd>
        <kwd>vectorization</kwd>
        <kwd>item matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Today, there is an enormous number of e-shops that allow consumers to buy goods online. As a
result, the number of products sold through e-shops has grown rapidly. A recent study estimated that total
e-commerce retail sales were $791.70 billion in 2020, up 32.4% from the previous year's $598.02 billion.
This is the highest annual e-commerce growth for any year for which data are available, according to
information reported by the Ministry of Trade in 2019 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One of the drivers of this growth was
COVID-19, which further increased e-commerce revenue in 2020 by $105.47 billion
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, web giants such as Amazon reached $100.83 billion in the fourth quarter of 2020, up
a whopping 47.5% from $68.34 billion a year earlier; this is roughly 2.5 times the 19.5% online revenue
growth recorded during the fourth quarter of 2019.
      </p>
      <p>This global trend of e-commerce is forcing all businesses to go online, resulting in an increasing
number of e-commerce stores. Each e-commerce store has a different flow for publishing a new item
on the platform. Some marketplaces, such as Amazon and eBay, allow users to become sellers and add
products themselves. This functionality permits retailers to increase the number of products they sell.
However, the process of adding new products and assigning categories can lead to consistency issues.
A product misclassified in the first place can be hard to find afterwards. Therefore, the correct
categorization of products is critical for all e-commerce platforms, as it speeds up the search for a
specific product and provides better interaction with users by highlighting the correct categories.</p>
      <p>To solve these problems with goods assigned to the wrong category, an automatic tool is needed
that can classify any product by name within the product taxonomy. At the same time, this process
will facilitate human work and further improve the consistency of product categorization on
e-commerce websites.</p>
      <p>In this paper, we apply several approaches to product categorization for the provided data collection.
The data were taken from many different online stores. The dataset, provided as a JSON file, contains
over 40,000 rows. This number of records allows us to train a model to predict the category of future
products.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>This section provides an overview of existing research on product classification based on product
specifications that have been studied with different approaches and methods in recent years.</p>
      <p>Because not all websites use a product classification hierarchy, and those that do may use completely
different ones, a unified product classification across websites is needed in order to provide the user
with useful features such as browsing and searching.</p>
      <p>
        There are several approaches to product data classification. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced a modified Naive
Bayes model for classifying goods, using the standard Naive Bayes model rather than a text classifier.
Although the accuracy is fairly high, the main disadvantage of this approach is choosing the
right weights, as it relies on observing the data and manually assigning weights based on selected
features. Failure to select appropriate weights significantly changes the results. Lin and Shankar
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] investigated effective pre-processing methods and multi-class features to improve classification
accuracy. The paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] discussed the classification process in terms of what a classification is, and
presented a semantic classification model (SCM) for e-catalogs. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], fuzzy set modelling was used to identify
categories, but this model lacked a comparison of classification accuracy for evaluation.
      </p>
      <p>
        Recently, the categorization of goods using product descriptions, such as the work of Chen and
Warren, has aroused great interest [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Despite these efforts, there are not many studies aimed at classifying goods by name
and description.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The product classification pipeline</title>
      <p>At a high level, the goal of our system is to build a multi-class classifier which can accurately
predict the product category of a new, unlabeled product title. The high-level steps are presented in
Figure 1:</p>
      <p>a. Remove topic-neutral words such as articles (a, an, the), prepositions (in, of, at), conjunctions
(and, or, nor), etc. from the documents.</p>
      <p>b. Apply word stemming.</p>
      <p>c. Train the classification models (SVM, K-NN).</p>
      <p>d. Analyse the results.</p>
      <p>The full process is described below.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Classifiers Overview</title>
      <p>We use two classification models: multi-class SVM and K-Nearest Neighbours (K-NN), applied to
the selected features. These two models were chosen to compare a discriminative (SVM) and a
nonparametric (K-NN) approach.</p>
      <p>The classifier is built by learning from the provided dataset and can be used to classify
unknown products by brand in the future. We provide a brief description of each algorithm in this
section.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1.1. SVM Based Categorization</title>
      <p>
        SVM was introduced as an algorithm for text classification by Joachims [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Let S = {(x₁, y₁), …, (xₙ, yₙ)} be a set of n training instances, where each xᵢ is a feature
vector and the category yᵢ ∈ {−1, +1}. SVM learns linear decision rules h(x) = sign(w · x + b),
described by a weight vector w and a threshold b. If S is linearly separable, SVM finds the hyperplane
with maximum Euclidean distance to the closest training instances. If S is non-separable, the amount
of training error is measured using slack variables ξᵢ. Computing the hyperplane is equivalent to
solving the following optimization problem [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]:
      </p>
      <p>minimize: V(w, b, ξ) = ½ (w · w) + C ∑ᵢ₌₁ⁿ ξᵢ (1)</p>
      <p>subject to: ∀ i = 1…n: yᵢ [w · xᵢ + b] ≥ 1 − ξᵢ (2)</p>
      <p>∀ i = 1…n: ξᵢ &gt; 0 (3)</p>
      <p>The factor C in (1) is a parameter used for trading off training error vs. model complexity. The
constraints (2) require that all training instances be classified correctly up to some slack ξᵢ.</p>
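      <p>To make the optimization concrete, the soft-margin objective above can be minimized by plain subgradient descent. The sketch below is illustrative only (the function names, such as train_linear_svm, and the hyperparameters are ours, not the paper's); the experiments in this paper rely on a library implementation.</p>

```python
import random

def train_linear_svm(X, y, C=1.0, epochs=300, lr=0.01):
    """Soft-margin linear SVM h(x) = sign(w.x + b), trained by
    subgradient descent on 1/2 w.w + C * sum(max(0, 1 - y_i (w.x_i + b)))."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        random.shuffle(idx)
        for i in idx:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The regularizer 1/2 w.w always contributes w to the subgradient.
            grad_w = list(w)
            grad_b = 0.0
            if margin < 1:  # hinge loss active: slack xi_i would be > 0
                grad_w = [gw - C * yi_xj for gw, yi_xj in
                          zip(grad_w, (y[i] * xj for xj in X[i]))]
                grad_b = -C * y[i]
            w = [wj - lr * gw for wj, gw in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b

def predict(w, b, x):
    # Decision rule h(x) = sign(w.x + b).
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

      <p>On a small linearly separable sample, the learned hyperplane classifies all training points correctly while C trades off slack against the margin, mirroring equations (1)-(2).</p>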
    </sec>
    <sec id="sec-6">
      <title>3.1.2. K-NN Algorithm</title>
      <p>
        The K-Nearest Neighbour (K-NN) algorithm is one of the most popular classification algorithms [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. The algorithm is based
on finding the most similar objects from sample groups based on their mutual Euclidean distance [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        The algorithm assumes that it is possible to represent documents as points in Euclidean space [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
The distance between two points p = (p₁, p₂) and q = (q₁, q₂) can be calculated as follows:
      </p>
      <p>d(p, q) = d(q, p) = √((q₁ − p₁)² + (q₂ − p₂)²) (4)</p>
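      <p>The distance measures used with K-NN in this paper can be sketched in a few lines of plain Python (the helper names, such as knn_predict, are illustrative, not the paper's code):</p>

```python
import math
from collections import Counter

def euclidean(p, q):
    # Equation (4), generalized to any dimension:
    # d(p, q) = sqrt(sum_i (q_i - p_i)^2)
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def cosine_distance(p, q):
    # 1 - cosine similarity; small when the vectors point the same way.
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / (norm_p * norm_q)

def knn_predict(train, labels, x, k=3, dist=euclidean):
    # Majority vote among the k training points closest to x.
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], x))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

      <p>Swapping dist=cosine_distance into knn_predict reproduces the cosine-similarity variant evaluated later in Section 3.7.2.</p>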
    </sec>
    <sec id="sec-7">
      <title>Exploratory Data Analysis</title>
    </sec>
    <sec id="sec-8">
      <title>3.2.1. Convert a file from JSON format to CSV</title>
      <p>First of all, it is necessary to convert the input format to CSV. This format is more common in Python
and gives us more opportunities to work with data.
</p>
      <p>To do this, we installed the pandas library with the following command:
pip install pandas.</p>
      <p>This library contains the read_json() method, which allows you to load a file into the program and
continue working with it. The read_json() method can take several parameters, but we used only one:
path_or_buf. This parameter is responsible for the path to our JSON file.</p>
      <p>Once we download the file data to the program's memory, we can start working on it. The data
downloaded to the program's memory can be written to a CSV file, using the following method
to_csv(). In this method, we passed the path where we wanted to place our CSV file as a parameter.</p>
      <p>The code needed to convert a file from JSON to CSV can be found in the convert.py script. Run the
file with the following command: python convert.py.</p>
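      <p>The conversion step can be sketched as follows. This is a minimal, illustrative version of what convert.py might contain (the helper name convert_json_to_csv is ours, not the paper's); it only assumes the standard pandas read_json() and to_csv() calls described above.</p>

```python
import pandas as pd

def convert_json_to_csv(json_path, csv_path):
    # read_json() loads the JSON file into a DataFrame; path_or_buf
    # (the first positional argument) is the path to the JSON file.
    df = pd.read_json(json_path)
    # to_csv() writes the DataFrame to the given path; index=False
    # keeps the row index out of the output file.
    df.to_csv(csv_path, index=False)
    return df
```

      <p>Running the script then reduces to a single call such as convert_json_to_csv("input.json", "output.csv").</p>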
    </sec>
    <sec id="sec-9">
      <title>3.2.2. Input analysis</title>
      <p>After we have converted the input file, we can start its analysis. The input file contains 41,664
records and 17 columns.</p>
      <p>Consider the source data contained in the tables. The data is presented in Figures 3 and 4.</p>
      <p>We focused on each of the provided columns separately. This is important because a more detailed
analysis allowed us to understand exactly how to configure the script for automatic data processing.</p>
      <p>The number of null values in the tables was analysed; the result is presented in Figure 5.</p>
      <p>Analysing Figure 5, we concluded that the data contain many null values. However, this function
only calculates the sum of null values per column, so for a column that contains no records at all the
count alone is not informative. The proof of this issue is presented in Figures 3 and 4, where we can
see the empty columns. Thus, before deleting the null rows, an additional manual examination of the
columns is required. The result of our additional analysis is presented in Figure 6.</p>
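      <p>The null-value check described above can be sketched with pandas (null_report is a hypothetical helper name): isnull().sum() gives the per-column null counts, and comparing each count with the total number of rows flags the completely empty columns that require the additional manual examination.</p>

```python
import pandas as pd

def null_report(df):
    # Count missing values in each column.
    counts = df.isnull().sum()
    # A column whose null count equals the row count holds no data
    # at all and needs manual inspection before rows are dropped.
    totally_empty = [c for c in df.columns if counts[c] == len(df)]
    return counts, totally_empty
```
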
    </sec>
    <sec id="sec-10">
      <title>Feature Selection based on the Exploratory Data Analysis</title>
      <p>Based on the data analysis stage, we identified columns that were used for further modelling. Thus,
for the machine learning model, we used: title, description, and brand. The example of the columns and
the data they contain is presented in Figure 7.</p>
      <p>Based on the example provided in Figure 7, we concluded that the existing data cannot be used for
appropriate product categorization because of:
- a large number of empty values;
- data duplication.</p>
      <p>Therefore, before approaching the categorization of the data, we decided to proceed with further
cleaning. So, we developed a component which cleans the input data automatically.</p>
    </sec>
    <sec id="sec-11">
      <title>Pre-Processing. Automatic Cleaning for the Input Data</title>
      <p>Since our solution will work with new data on its own in the future, we developed a script that cleans the
data automatically.</p>
      <p>First of all, we removed rows that contain empty values; otherwise, the algorithm cannot process
the data correctly. To do this, we used the dropna() method that comes with the pandas package. This
method automatically deletes the empty cells.</p>
      <p>Next, duplicates are removed with the drop_duplicates() method. For this method to modify the
current data frame (and not return a new one), set the inplace=True parameter. Since the input data will be
obtained from several resources, we need to process them further.</p>
      <p>The HTML tags were removed, as there was a risk that they might be in our sample. This was done
using the BeautifulSoup() and get_text() methods from the bs4 library.</p>
      <p>Then the special characters which could be in the data were removed. The re library was imported and
its sub() method was used. As the first parameter, we passed the pattern [^a-zA-Z\d], which matches any
character that is not a letter or a digit.</p>
      <p>The next step was to transform all the text data into lowercase and break it down into words. To do
this we used the lower() and split() methods.</p>
      <p>After that, the stop words can be removed; for that, the stopwords() function was used. This function
takes one argument: the language we work with. We passed the value 'english', and the returned
stop-word list was used to filter the words in each cell.</p>
      <p>To start automatic cleaning of the input data, you should run the clear.py script (python clear.py), which
contains all the steps described above. After executing the script, our document contains 3 columns and
10,200 unique cleaned rows. An example of the processed data is shown in Figure 8.</p>
      <p>Thus, after processing 41,664 rows, 10,200 rows were left, which is 24.48% of the initial dataset.</p>
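      <p>The cleaning steps above can be sketched in plain Python. This illustrative version strips tags with a regular expression and uses a tiny stop-word set of our own; the paper itself uses BeautifulSoup's get_text() and NLTK's English stop-word list instead.</p>

```python
import re

# A small illustrative stop-word set; the paper uses NLTK's
# English stop words instead.
STOP_WORDS = {"a", "an", "the", "in", "of", "at", "and", "or", "nor", "is"}

def clean_text(text):
    # 1. Strip HTML tags (the paper uses BeautifulSoup/get_text()).
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Remove special characters, keeping only letters and digits.
    text = re.sub(r"[^a-zA-Z\d]", " ", text)
    # 3. Lowercase and split into words.
    words = text.lower().split()
    # 4. Drop stop words.
    return [w for w in words if w not in STOP_WORDS]
```

      <p>For real data the bs4 and NLTK calls described above are preferable, since regular expressions handle malformed HTML poorly.</p>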
    </sec>
    <sec id="sec-12">
      <title>3.5. Data Transformation. Text Vectorization</title>
      <p>
        Machine learning algorithms usually operate on a numeric feature space. To apply an algorithm to
text, we transformed our text data into vector representations. This is called feature extraction or
vectorization [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>In this paper, we evaluated the performance of three methods: HashingVectorizer and CountVectorizer,
which convert a collection of text data into a matrix of token counts, and the TfidfVectorizer
method, which converts a collection of raw documents into a matrix of TF-IDF features.</p>
      <p>
        HashingVectorizer and CountVectorizer are meant to do the same thing: convert a
collection of text documents to a matrix of token occurrences [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Term frequency-inverse document
frequency (TF-IDF) is a feature vectorization method used to reflect the importance of a term to a
document in the corpus [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. TF-IDF can be calculated as:
      </p>
      <p>wᵢⱼ = tfᵢⱼ × log₂(N / dfᵢ) (5)</p>
      <p>where wᵢⱼ is the weight of term i in document j, N is the number of documents in the collection, tfᵢⱼ is
the term frequency of term i in document j, and dfᵢ is the document frequency of term i in the collection.</p>
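      <p>Equation (5) can be computed directly. The sketch below (tfidf is a hypothetical helper name) weights each term of each tokenized document by tf × log₂(N / df):</p>

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute w_ij = tf_ij * log2(N / df_i) for a list of
    tokenized documents (equation (5))."""
    N = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log2(N / df[t]) for t in tf})
    return weights
```

      <p>A term that occurs in every document gets weight 0, while a term unique to one document gets the full log₂(N) boost, which is exactly the discriminative behaviour TF-IDF is used for here.</p>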
      <p>
        To obtain better results with documents of different length, we used a modified equation [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The vectorization also allows us to calculate the number of unique categories we are going to
classify; as a result, we worked with 323 different classes.
      </p>
    </sec>
    <sec id="sec-13">
      <title>3.6. Modelling Classification Algorithms</title>
      <p>Each row is converted into an appropriate vector representation and used for training and validation.
Both algorithms process the features selected in Section 3.3.</p>
      <p>For the selected, cleaned features we applied the CountVectorizer vectorization function; vectorization
was performed while applying the StratifiedKFold method, which splits the dataset into test groups. The
selected vectorization function allows us to evaluate the performance of the built model and compare
the results with those obtained with K-NN. After applying the vectorization function, the next step was
determining the optimal value of the C parameter. The evaluation of the selected C parameters is presented
in Section 3.7.1.</p>
      <p>Once the features were selected, we vectorized the data. For the K-NN model, we evaluated the performance
by applying each of the vectorizing functions described in Section 3.5. The evaluation is done in Section 3.7.2.</p>
      <p>In the next step, we determined the K value. To determine the distance between vectors for
K-NN, we used both cosine similarity and Euclidean distance.</p>
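      <p>A stratified split of the kind StratifiedKFold performs can be sketched in plain Python (stratified_kfold is an illustrative helper, not the library call): dealing the indices of each class round-robin keeps every fold's class ratio close to the dataset's.</p>

```python
from collections import defaultdict

def stratified_kfold(labels, k):
    """Split indices 0..len(labels)-1 into k folds so that each fold
    roughly preserves the class ratio of `labels`."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's indices round-robin across the folds.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds
```

      <p>With unbalanced data, this is what prevents a fold from containing examples of only one class, the failure mode discussed for regular k-fold in Section 3.7.</p>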
    </sec>
    <sec id="sec-14">
      <title>3.7. Models Evaluation</title>
      <p>The stratified k-fold cross-validation method was used to assess the quality of the model at the initial
stage. Choosing between regular k-fold cross-validation and stratified k-fold cross-validation, we selected
the stratified variant. Because we have unbalanced data, stratified k-fold cross-validation is useful for our
experiment. It was decided not to use regular k-fold cross-validation because we do not have enough data
and that method does not preserve the ratio of classes, which can lead to partitions in which some folds
contain training examples from only one class.</p>
      <p>Stratified cross-validation is suitable for assessing the quality of a classifier without the use of test
data: testing occurs on parts of the training sample that are not known to the classifier. This
assessment approach helps determine whether the system is prone to overfitting. In our experiment, we used
stratified cross-validation with k folds (k = 6) for 10,200 products and 323 categories. Therefore, the
evaluation was done on 1703 products.</p>
      <p>The evaluation was performed in several test phases:
- classification quality;
- text classification speed;
- classification recall according to the categories of product.</p>
      <p>
        Classification results reported in this section are based on an evaluation performed
according to the F1-measure, precision, recall, and accuracy metrics [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. To evaluate the overall
performance of the algorithms on the given dataset, we focused on the F1 macro average. The F1 macro average
calculates the score per class without using weights for the aggregation. The F1 weighted
average also calculates the score for each class independently, but when adding the scores together it uses a
weight that depends on the number of true labels of each class. Therefore, the F1 weighted average favours the
majority class, which we do not want.
      </p>
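      <p>The difference between the macro and weighted averages discussed above can be made concrete with a plain-Python sketch (hypothetical helper names):</p>

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    # One-vs-rest F1 for every class that appears in either list.
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def f1_macro(y_true, y_pred):
    # Unweighted mean over classes: every class counts equally.
    s = f1_per_class(y_true, y_pred)
    return sum(s.values()) / len(s)

def f1_weighted(y_true, y_pred):
    # Weighted by the number of true labels: favours majority classes.
    s = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(s[c] * support[c] for c in s) / len(y_true)
```

      <p>On unbalanced data, a classifier that ignores a minority class is punished by the macro average but largely forgiven by the weighted one, which is why the macro average is used here.</p>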
    </sec>
    <sec id="sec-15">
      <title>3.7.1. SVM Model Evaluation</title>
      <p>We applied different values for the C parameter to ensure that the experimental results faithfully
reflect the performance of the algorithms.</p>
      <p>From the experimental results of the SVM, the C parameter equal to 0.125 is optimal, based on the
execution time of 310.8 sec (5.5 min) and a macro-average F1-score of 72%.</p>
      <p>Also, to measure performance we calculated the number of goods with correct and incorrect
classification, from which the percentages of correctly and incorrectly classified categories were found. The
algorithm creates separate files for the initial and the classified values and automatically compares them.
This function then calculates the sums of correctly and incorrectly predicted values and the corresponding
percentages.</p>
      <p>The output of this function is presented in Figure 8. The comparison is presented in Figure 10.</p>
      <p>In the next step, we determined the K value. The K value of the K-NN algorithm indicates the number
of items from the collection that are closest to the selected row. To determine the distance between
vectors for K-NN, we used both cosine similarity and Euclidean distance.</p>
    </sec>
    <sec id="sec-16">
      <title>3.7.2. K-NN Model Evaluation</title>
      <p>Various scaling methods, such as cosine similarity and Euclidean distance, were used to evaluate the
efficiency of the model. The final analysis of model efficiency is based on the chosen method.
Figures 11-13 represent some results of our experiments.</p>
      <p>Based on the K-NN model evaluation results, the best result for classification by brand was obtained
using the TfidfVectorizer vectorization method and the cosine similarity metric, where the macro
average F1 is 70%. The numbers of goods with correct and incorrect classification and the percentages
of correctly and incorrectly classified categories were calculated in the same way as for SVM, presented in Figure 10.</p>
      <p>Also, we can see that the execution time, which is 9.04 sec for the best result, depends on the selected
scaling method, metrics, and the number of features used for the evaluation.</p>
      <p>Therefore, we can conclude that if the number of input features is increased, the execution time
could become critical, and another faster model can be used.</p>
    </sec>
    <sec id="sec-17">
      <title>4. Conclusion</title>
      <p>In this paper, we present an investigation of two widely used approaches to text categorization:
the K-NN and SVM algorithms.</p>
      <p>The main goal of the research was to evaluate the performance of the two popular algorithms, K-NN
and SVM, to compare their execution times, and to develop an MVP pipeline that can
automatically classify footwear by brand.</p>
      <p>The combination of the K-NN algorithm with different vectorization methods showed good results,
as did SVM with CountVectorizer. However, despite its good performance, the SVM
algorithm has the highest execution time, which can be significant for big marketplaces.</p>
      <p>Therefore, the results reported in this paper are satisfactory; however, they are not
the best that can be achieved, and additional investigation is needed to improve the performance
of the applied algorithms.</p>
      <p>To further study and improve the model, the following steps are suggested:
- get more data to test the models;
- implement an algorithm for automatic search of optimal parameters;
- prepare the developed module for integration with e-commerce stores.</p>
    </sec>
    <sec id="sec-18">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Quarterly retail e-commerce sales in the last quarter of 2020. US Digital Commerce Bureau News</article-title>
          (
          <year>2020</year>
          ) https://www.digitalcommerce360.com/article/quarterly-online-sales/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Young-Gon Modified naïve bayes classifier for e-catalog classification</article-title>
          , Seoul 151-742.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shankar Applying</surname>
          </string-name>
          Machine Learning to Product Categorization Stanford University. CS229.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kim</given-names>
            <surname>Dongkyu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sang-Goo</surname>
            <given-names>Lee</given-names>
          </string-name>
          , Jonghoon Chun,
          <article-title>Juhnyoung LeeA semantic classification model for e-catalogs</article-title>
          .:
          <string-name>
            <surname>Proceedings - IEEE International Conference on E-Commerce</surname>
            <given-names>Technology</given-names>
          </string-name>
          ,
          <string-name>
            <surname>CEC</surname>
          </string-name>
          <year>2004</year>
          , p
          <fpage>85</fpage>
          -
          <lpage>92</lpage>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Wan</surname>
          </string-name>
          , Hongxin; Peng,
          <article-title>Yun A technique of e-commerce goods classification and evaluation based on fuzzy set</article-title>
          .
          <source>Proceedings, International Conference on Internet Technology and Applications</source>
          , ITAP (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jianfu</given-names>
            <surname>Chen</surname>
          </string-name>
          , David Warren.
          <article-title>Cost-sensitive learning for large-scale hierarchical classification</article-title>
          .
          <source>In Proceedings of the 22Nd ACM International Conference on Conference on Information &amp; Knowledge Management, CIKM</source>
          , pages
          <fpage>1351</fpage>
          -
          <lpage>1360</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Neighbor-weighted K-nearest neighbour for unbalanced text corpus</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>28</volume>
          (
          <year>2005</year>
          )
          <fpage>667</fpage>
          -
          <lpage>671</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Supervised and Traditional Term Weighting Methods for Automatic Text Categorization</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , VOL.
          <volume>31</volume>
          , NO.
          <volume>4</volume>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Text Vectorization and Transformation Pipelines. Chapter 4. https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] HashingVectorizer, CountVectorizer https://kavita-ganesan.com/hashingvectorizer-vscountvectorizer/</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Uguz</surname>
          </string-name>
          ,
          <article-title>A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>24</volume>
          (
          <year>2011</year>
          )
          <fpage>1024</fpage>
          -
          <lpage>1032</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. T.-Y.</given-names>
            <surname>Kwok</surname>
          </string-name>
          ,
          <source>Automatic Text Categorization Using Support Vector Machine, Proceedings of International Conference on Neural Information Processing</source>
          , (
          <year>1998</year>
          )
          <fpage>347</fpage>
          -
          <lpage>351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>Text Categorization with Support Vector Machines: Learning with Many Relevant Features</article-title>
          ,
          <source>In Proceedings of 10th European Conference on Machine Learning</source>
          , Chemnitz, Germany, pages
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>A Statistical Learning Model of Text Classification for Support Vector Machines</article-title>
          .
          <source>In Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>128</fpage>
          -
          <lpage>136</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>An improved KNN algorithm for text classification</article-title>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Greer</surname>
          </string-name>
          ,
          <article-title>KNN Model-Based Approach in Classification</article-title>
          , (
          <year>2003</year>
          )
          <fpage>986</fpage>
          -
          <lpage>996</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Ming-Yang</given-names>
          </string-name>
          ,
          <article-title>Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification</article-title>
          ,
          <source>Journal of Network and Computer Applications</source>
          <volume>34</volume>
          (
          <year>2011</year>
          )
          <fpage>722</fpage>
          -
          <lpage>730</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ishida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goto</surname>
          </string-name>
          ,
          <article-title>A Proposal of Extended Cosine Measure for Distance Metric Learning in Text Classification</article-title>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>Machine Learning in Automated Text Categorization</article-title>
          ,
          <source>ACM Computing Surveys</source>
          , Vol.
          <volume>34</volume>
          , No. 1, March 2002
          , pages
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>