<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparative analysis of machine learning methods for news categorization in Russian</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>49A, Kronverksky Pr., St. Petersburg, 197101, Russian Federation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nizhny Novgorod State Technical University n.a. R.E. Alekseev</institution>
          ,
          <addr-line>24, Minin st., Nizhny Novgorod, 603950, Russian Federation</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vyatka State University</institution>
          ,
          <addr-line>36, Moskovskaya st., Kirov, 610000, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Text categorization is one of the important areas of research in natural language processing and machine learning. The topic is relevant because automatic categorization methods are in demand for the timely processing of the growing volume of news content published in online media and social networks. The article investigates the influence of the feature selection procedure on the performance of machine learning methods for categorizing news articles: Logistic Regression, Light Gradient Boosted Machine, k-Nearest Neighbors, Random Forest, Naïve Bayes, Support Vector Machine and RuBERT. The research was carried out on a Russian corpus of documents containing texts from six topics: incidents, culture, economics, politics, society and sports. The experiments showed that, for most of the considered methods, the feature selection procedure has a positive effect on categorization quality, analysis speed and memory consumption. Of the considered classifiers, the RuBERT model achieved the best average classification quality on the test corpus, reaching F1=0.882.</p>
      </abstract>
      <kwd-group>
        <kwd>Text categorization</kwd>
        <kwd>machine learning</kwd>
        <kwd>deep learning</kwd>
        <kwd>feature selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Internet contains a huge amount of text data, and this volume is growing rapidly. Every day, a large number of news texts is published on various web resources by the media and by users, and these texts require systematization. An important area of research in the field of natural language processing is therefore the development of effective systems for the automatic categorization of text documents. Text categorization is the assignment of predefined labels (classes) to texts. This paper provides a comparative analysis of popular machine learning methods as applied to the problem of categorizing news articles in Russian. The problem to be solved is multi-class classification of text documents.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        There are many studies devoted to solving the problem of categorizing news articles
in different languages using machine learning methods. The paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] evaluates the
performance of real-time machine learning methods for classifying English news
from the BBC website into five topics: business, entertainment, politics, sports and
tech. Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM),
Decision Tree (DT) and Random Forest (RF) are used as classifiers. The authors
perform feature selection using TF-IDF. The highest accuracy was obtained using LR and is equal to A=95.5%.
      </p>
      <p>
        The article [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] also evaluates classification performance on news texts from the BBC corpus.
In this case, the classifiers NB, SVM, Multilayer Perceptron Neural Network, RF and
DT are used, and TF-IDF is used for feature selection. In this work, NB showed the best
quality, achieving an accuracy of A=96.8%.
      </p>
      <p>
        Sreedevi et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] investigate bag-of-words and bag-of-n-gram text representation
models, as well as four machine learning methods: SVM, NB, k-Nearest Neighbors
(kNN) and Convolutional Neural Network. Testing of methods is performed on 20
NewsGroup and AG's News corpora. According to the results of the experiments, the
highest value of the accuracy was obtained using the SVM with bag-of-words model
and is equal to A=90.8% for the 20 NewsGroup corpus and is equal to A=85.14% for
the AG's News corpus. Also in the article, the authors provide estimates of the
training time and prediction time for algorithms.
      </p>
      <p>
        Luo [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in his research applies the technique of selecting text features based on a
cross-validation procedure. SVM, NB and LR are used to classify news. Testing of
methods is performed on three text corpora: 1) Data1 is categorized into women,
sports, literature, campus; 2) Data2 is categorized into sport, constellation, game,
entertainment; 3) Data3 is categorized into science and technology, fashion, current
event. For the Data1 and Data2 corpora, the best results in terms of the classification
quality were obtained using SVM and are equal to F1=0.86 and F1=0.71,
respectively. For the Data3 corpus the best estimate was obtained using LR and is equal to
F1=0.63.
      </p>
      <p>
        There are a number of works in which the problem of classification of news
articles is solved for the Arabic language. The article [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses a corpus from the BBC
website, containing news from 7 topics, and a corpus from CNN website, containing
news from 6 topics. The authors investigate the influence of preprocessing on the
quality of classification. Three stemming techniques and twelve methods of weighting
terms are explored. C4.5, NB and Discriminative parameter learning for Bayesian
networks for text (DMNBtext) are used as classifiers. Experimental results showed that
the DMNBtext algorithm achieves higher performance compared to other machine
learning algorithms.
      </p>
      <p>
        Qadi et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] categorize news articles into four topics: business, sports,
technology and Middle East. Weights of terms during text vectorization are determined using
TF-IDF. The paper explores 10 popular classical machine learning methods.
According to the experimental results, the best result F1=97.9 belongs to SVM, and the worst
result F1=87.7 belongs to Ada-Boost.
      </p>
      <p>
        The work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] explores 9 neural network models using the corpora AR-5, KH-7, AB-7 and
RT-40, where the number in each corpus name corresponds to its number of topics. On the
AR-5 corpus the best accuracy is A=97.41% (Bidirectional Gated Recurrent Unit), on
the KH-7 corpus the best accuracy is A=96.86% (Convolutional Gated Recurrent
Unit), on the AB-7 corpus the best accuracy is A=94.00% (Convolutional Gated
Recurrent Unit), on the RT-40 corpus the best accuracy is A=64.24% (Convolutional
Neural Network).
      </p>
      <p>This study has the following differences from the existing ones: 1) the problem of
topic classification is solved for Russian; 2) the influence of the number of the most
relevant features, selected on the basis of TF-IDF weights, on the quality of news
classification by topics is investigated; 3) the comparison of traditional machine
learning methods with the modern neural network model BERT, showing state-of-the-art
results in many natural language processing problems, is made; 4) the training time of
the models is estimated, as well as the amount of memory required to store the
models.</p>
    </sec>
    <sec id="sec-3">
      <title>Materials and methods</title>
      <sec id="sec-3-1">
        <title>Method for solving the problem of topic classification</title>
        <p>The solution to the problem of topic classification consists of the following stages:
1. Pre-processing of text corpus documents.</p>
        <p>At the pre-processing stage, html tags and stop words are removed from the texts,
and the tokenization of the texts is performed.</p>
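        <p>As a minimal illustration, the pre-processing stage can be sketched as follows; the stop-word list below is a tiny illustrative stand-in for the one used in the study, and html-tag removal is omitted for brevity.</p>
        <preformat>
```python
# Minimal sketch of tokenization and stop-word removal.
# STOP_WORDS is an illustrative placeholder, not the study's actual list.
import re

STOP_WORDS = {"и", "в", "на"}  # placeholder stop-word list

def preprocess(text):
    tokens = re.findall(r"\w+", text.lower())  # tokenize into word forms
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Матч завершился в Москве"))  # ['матч', 'завершился', 'москве']
```
        </preformat>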
        <p>Separate word forms are used as features.</p>
        <p>2. Feature selection.</p>
        <p>When the feature selection procedure is performed, it is required to determine the feature weights. As a method of weighting features, the statistical measure Term Frequency – Inverse Document Frequency (tfidf) is often used, which for term t and document d in collection D is calculated by the formula:</p>
        <p>tfidf(t, d) = f<sub>t,d</sub> · log(|D| / n<sub>t</sub>),</p>
        <p>where f<sub>t,d</sub> – the frequency of term t in document d; |D| – the total number of documents in the collection D; n<sub>t</sub> – the number of documents in collection D in which the term t occurs.</p>
        <p>After calculating the tfidf, the features are ranked in descending order of weights. The first n features with the highest weight are selected as the most relevant ones.</p>
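        <p>The weighting and ranking procedure can be sketched directly from the formula above. Aggregating each term's weight by its corpus-wide maximum is our assumption for the sketch, since the aggregation rule is not specified here.</p>
        <preformat>
```python
# Sketch of the feature-selection step: compute tfidf per the formula
# tfidf(t, d) = f(t,d) * log(|D| / n_t) and keep the n highest-weighted
# features. The toy corpus below is illustrative only.
import math
from collections import Counter

docs = [
    ["матч", "завершился", "победой"],
    ["бюджет", "принят", "парламентом"],
    ["выставка", "открылась", "музее"],
]
num_docs = len(docs)  # |D|

df = Counter()  # n_t: number of documents containing each term
for toks in docs:
    df.update(set(toks))

def tfidf(term, toks):
    # f(t,d) * log(|D| / n_t), exactly as in the formula above
    return toks.count(term) * math.log(num_docs / df[term])

# Assumption: rank each term by its maximum tfidf weight over the corpus.
scores = {t: max(tfidf(t, toks) for toks in docs) for t in df}
ranked = sorted(scores, key=scores.get, reverse=True)
n = 5
top_features = ranked[:n]  # the n most relevant features
```
        </preformat>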
      </sec>
      <sec id="sec-3-2">
        <title>Text corpus</title>
        <p>To solve the problem of multiclass topic classification, a text corpus was formed from
news articles, each of which belongs to one of six large topics: incidents, culture,
economics, politics, society, sports. The articles were taken from the Internet portals
“Gazeta.ru”, “Lenta.ru”, “Komsomolskaya Pravda”, “RBK” and the news agencies
“Interfax”, “ITAR-TASS”, “RIA Novosti” for the period from 2010 to 2020. The number
of texts in each of the topics is presented in Table 1.</p>
        <p>The created text corpus is unbalanced. The largest topic, “Economics”, contains 38,423 texts. The smallest topic, “Incidents”, contains 10,008 texts.</p>
        <p>The markup of news articles by topics was carried out on the basis of the topics
indicated for these articles on the information resource from which they were taken.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Design of the experiments</title>
        <p>
          The experiments were carried out on a computer with an Intel(R) Xeon(R) CPU
@ 2.30GHz and a Tesla K80 video card, using the Python programming language. Seven
machine learning methods were used to categorize texts, as described in subsection 3.1.
The software implementation of the LR, RF, NB and SVM methods is taken from the scikit-learn library [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the LGBM
method is taken from the lightgbm library [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the RuBERT model is taken from the
DeepPavlov library [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>Word forms serve as text features. The number of features with the highest tfidf weight taken into account in the text representation model was set to 0.01N, 0.05N, 0.1N, 0.25N, 0.5N and N, where N is the total number of features in the training corpus.</p>
        <p>The performance of the categorization was determined by the F1-score calculated
by the formula:</p>
        <p>F1 = 2 · P · R / (P + R), (1)</p>
        <p>where P – precision; R – recall. Macro-averaging was applied to obtain the average value of the F1-score.</p>
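        <p>A toy illustration of the macro-averaged F1-score, with made-up labels rather than the paper's data: formula (1) is applied per class and the per-class values are averaged.</p>
        <preformat>
```python
# Macro-averaged F1 on toy labels: F1 is computed per class via
# F1 = 2*P*R/(P+R), then averaged with equal class weights.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# per-class F1: 1.0, 0.667, 0.8; macro average ≈ 0.822
print(f1_score(y_true, y_pred, average="macro"))
```
        </preformat>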
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The total number of features (word forms) in the training corpus was N=559,108. The
average values of the F1-score, obtained using seven classifiers for a different number
of features with the highest weight, are presented in Table 2 and Figure 1.</p>
      <p>The values of performance measures for the classification of news articles by topics using the two leading models among those considered – RuBERT with N features and SVM with 0.1N features – are presented in Table 5.</p>
      <p>From Table 2 it follows that feature selection can improve the performance of classification of news articles by topics for most machine learning methods. For the LGBM method the best classification quality was obtained at 0.05N features, for RF – at 0.01N and 0.05N features, for kNN, NB and SVM – at 0.1N features. The feature selection for LR and RuBERT did not improve the quality of the classification; for these methods the highest F1-score is achieved with the full set of features.</p>
      <p>Among the considered classifiers, the RuBERT model showed the best results,
reaching F1=0.882. The second result in the quality of classification belongs to the
SVM method and is equal to F1=0.877.</p>
      <p>Based on Tables 3 and 4, we can conclude that a decrease in the number of features has a positive effect on the performance of the classifiers. As the number of features decreases, the training time for the LR, LGBM, RF and SVM models decreases, and the amount of memory required decreases for all models except RuBERT. The SVM classifier had the longest training time: about 7.5 hours with 0.1N features. RuBERT, the best-performing method, trained 2.2 times faster than SVM – in about 3.4 hours. The RF and RuBERT models turned out to be the most demanding in terms of memory, while LR, LGBM and NB required on average an order of magnitude less memory for storing the models.</p>
      <p>Analysis of Table 5 shows that the topics “Culture”, “Economics” and “Sports” are recognized best by the classifiers (F1 varies from 0.952 to 0.990), while the topic “Society” is recognized worst of all (F1=0.700 for SVM and F1=0.731 for RuBERT), because this topic may contain texts that also belong to the five other topics. The largest gap in the F1-score between the SVM and RuBERT models (3.1 percentage points (p.p.)) is observed in the topic “Society” in favor of RuBERT, due to the higher precision of this model (6.6 p.p. higher than SVM). However, SVM has 5.1 p.p. higher precision than RuBERT for “Politics”. The SVM classifier provides higher recall in the topic “Incidents” (4.4 p.p. higher than RuBERT), and RuBERT provides higher recall in the topic “Politics” (4.3 p.p. higher than SVM).</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The problem of text categorization is of great practical importance and can be solved
using machine learning methods. The efficiency of solving the problem is
significantly influenced by data pre-processing, including the selection of the most relevant
features. This study investigates the influence of the number of features selected at the
feature selection stage on the performance of seven classifiers, among which there are
both the classic well-proven SVM and LGBM, and the relatively new and popular
BERT. It was found that the feature selection in most cases improves the quality of
the classification, although it does not give a positive effect for all classifiers.</p>
      <p>Among the considered machine learning methods, the best average classification
quality for six topics was obtained using BERT and was equal to F1=0.882. On
average over the topics (Table 5), RuBERT slightly surpasses SVM in both precision and
recall. The topics “Culture”, “Economics” and “Sports” were recognized most easily by
the classifiers, while the topic “Society” turned out to be the most difficult.</p>
      <p>In future works, it is planned to investigate the effectiveness of machine learning
methods for solving the problem of multi-label classification of news articles in
Russian.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Patro</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.:
          <article-title>Real Time News Classification Using Machine Learning</article-title>
          .
          <source>In: International Journal of Advanced Science and Technology</source>
          ,
          <volume>29</volume>
          (
          <issue>9</issue>
          ),
          <fpage>620</fpage>
          -
          <lpage>630</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Deb</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          et al.:
          <article-title>A Comparative Analysis of News Categorization Using Machine Learning Approaches</article-title>
          .
          <source>International journal of scientific and technology research</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ),
          <fpage>2469</fpage>
          -
          <lpage>2472</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sreedevi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reddy</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Newspaper Article Classification using Machine Learning Techniques</article-title>
          .
          <source>International Journal of Innovative Technology and Exploring Engineering</source>
          ,
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <fpage>872</fpage>
          -
          <lpage>877</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Efficient English text classification using selected Machine Learning Techniques</article-title>
          .
          <source>Alexandria Engineering Journal</source>
          ,
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>3401</fpage>
          -
          <lpage>3409</lpage>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Alshammari</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Arabic Text Categorization using Machine Learning Approaches</article-title>
          .
          <source>International Journal of Advanced Computer Science and Applications</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ),
          <fpage>226</fpage>
          -
          <lpage>230</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Qadi</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          et al.:
          <article-title>Arabic Text Classification of News Articles Using Classical Supervised Classifiers</article-title>
          .
          <source>In: Proceedings of the 2nd International Conference on new Trends in Computing Sciences</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Elnagar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Debsi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Einea</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Arabic text classification using deep learning models</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>57</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          et al.:
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Light Gradient Boosting Machine Homepage, https://github.com/microsoft/LightGBM, last accessed 2021/06/19.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Burtsev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          et al.:
          <article-title>DeepPavlov: Open-source library for dialogue systems</article-title>
          .
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics-System Demonstrations</source>
          ,
          <fpage>122</fpage>
          -
          <lpage>127</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>