<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparative analysis of machine learning methods for news categorization in Russian</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>49A, Kronverksky Pr., St. Petersburg, 197101, Russian Federation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nizhny Novgorod State Technical University n.a. R.E. Alekseev</institution>
          ,
          <addr-line>24, Minin st., Nizhny Novgorod, 603950, Russian Federation</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vyatka State University</institution>
          ,
          <addr-line>36, Moskovskaya st., Kirov, 610000, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Text categorization is one of the important areas of research in natural language processing and machine learning. The topic is relevant because automatic categorization methods are in demand for the timely processing of the growing volume of news content published in online media and social networks. The article investigates the influence of the feature selection procedure on the performance of machine learning methods for categorizing news articles: Logistic Regression, Light Gradient Boosted Machine, k-Nearest Neighbors, Random Forest, Naïve Bayes, Support Vector Machine and RuBERT. The research was carried out on a Russian corpus of documents containing texts from six topics: incidents, culture, economics, politics, society and sports. The experiments showed that, for most of the considered methods, the feature selection procedure has a positive effect on categorization quality, analysis speed and memory consumption. Of the considered classifiers, the RuBERT model achieved the best average classification quality on the test corpus, reaching F1=0.882.</p>
      </abstract>
      <kwd-group>
        <kwd>Text categorization</kwd>
        <kwd>machine learning</kwd>
        <kwd>deep learning</kwd>
        <kwd>feature selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Internet contains a huge amount of text data, and this volume is growing rapidly. Every day, a large number of news texts is published on various web resources by the media and by users, and these texts require systematization. An important area of research in the field of natural language processing is therefore the development of effective systems for the automatic categorization of text documents. Text categorization is the assignment of predefined labels (classes) to texts. This paper provides a comparative analysis of popular machine learning methods as applied to the problem of categorizing news articles in Russian. The problem to be solved is multi-class classification of text documents.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        There are many studies devoted to solving the problem of categorizing news articles
in different languages using machine learning methods. The paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] evaluates the
performance of real-time machine learning methods for classifying English news
from the BBC website into five topics: business, entertainment, politics, sports and
tech. Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM),
Decision Tree (DT) and Random Forest (RF) are used as classifiers. The authors
perform feature selection using TF-IDF. The highest accuracy was obtained using LR and is equal to A=95.5%.
      </p>
      <p>
        The article [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] also evaluates classification performance on news texts from the BBC corpus.
In this case, the classifiers NB, SVM, Multilayer Perceptron Neural Network, RF and
DT are used, and TF-IDF is used for feature selection. In this work, NB showed the best
quality, achieving an accuracy of A=96.8%.
      </p>
      <p>
        Sreedevi et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] investigate bag-of-words and bag-of-n-gram text representation
models, as well as four machine learning methods: SVM, NB, k-Nearest Neighbors
(kNN) and Convolutional Neural Network. Testing of methods is performed on 20
NewsGroup and AG's News corpora. According to the results of the experiments, the
highest value of the accuracy was obtained using the SVM with bag-of-words model
and is equal to A=90.8% for the 20 NewsGroup corpus and is equal to A=85.14% for
the AG's News corpus. Also in the article, the authors provide estimates of the
training time and prediction time for algorithms.
      </p>
      <p>
        Luo [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in his research applies the technique of selecting text features based on a
cross-validation procedure. SVM, NB and LR are used to classify news. Testing of
methods is performed on three text corpora: 1) Data1 is categorized into women,
sports, literature, campus; 2) Data2 is categorized into sport, constellation, game,
entertainment; 3) Data3 is categorized into science and technology, fashion, current
event. For the Data1 and Data2 corpora, the best results in terms of the classification
quality were obtained using SVM and are equal to F1=0.86 and F1=0.71,
respectively. For the Data3 corpus the best estimate was obtained using LR and is equal to
F1=0.63.
      </p>
      <p>
        There are a number of works in which the problem of classification of news
articles is solved for the Arabic language. The article [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses a corpus from the BBC
website, containing news from 7 topics, and a corpus from CNN website, containing
news from 6 topics. The authors investigate the influence of preprocessing on the
quality of classification. Three stemming techniques and twelve methods of weighting
terms are explored. C4.5, NB and Discriminative parameter learning for Bayesian
networks for text (DMNBtext) are used as classifiers. Experimental results showed that
the DMNBtext algorithm achieves higher performance compared to other machine
learning algorithms.
      </p>
      <p>
        Qadi et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] categorize news articles into four topics: business, sports,
technology and Middle East. Weights of terms during text vectorization are determined using
TF-IDF. The paper explores 10 popular classical machine learning methods.
According to the experimental results, the best result F1=97.9 belongs to SVM, and the worst
result F1=87.7 belongs to Ada-Boost.
      </p>
      <p>
        The work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] explores 9 neural network models using the corpora AR-5, KH-7, AB-7 and
RT-40, where the number in each corpus name corresponds to its number of topics. On the
AR-5 corpus the best accuracy is A=97.41% (Bidirectional Gated Recurrent Unit), on
the KH-7 corpus the best accuracy is A=96.86% (Convolutional Gated Recurrent
Unit), on the AB-7 corpus the best accuracy is A=94.00% (Convolutional Gated
Recurrent Unit), on the RT-40 corpus the best accuracy is A=64.24% (Convolutional
Neural Network).
      </p>
      <p>This study has the following differences from the existing ones: 1) the problem of
topic classification is solved for Russian; 2) the influence of the number of the most
relevant features, selected on the basis of TF-IDF weights, on the quality of news
classification by topics is investigated; 3) the comparison of traditional machine
learning methods with the modern neural network model BERT, showing state-of-the-art
results in many natural language processing problems, is made; 4) the training time of
the models is estimated, as well as the amount of memory required to store the
models.</p>
    </sec>
    <sec id="sec-3">
      <title>Materials and methods</title>
      <sec id="sec-3-1">
        <title>Method for solving the problem of topic classification</title>
        <p>The solution to the problem of topic classification consists of the following stages:
1. Pre-processing of text corpus documents.</p>
        <p>At the pre-processing stage, html tags and stop words are removed from the texts,
and the tokenization of the texts is performed.</p>
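        <p>As a minimal illustration, the pre-processing stage can be sketched as follows; the stop-word list below is a tiny illustrative stand-in for the one used in the study, and html-tag removal is omitted for brevity.</p>
        <preformat>
```python
# Minimal sketch of tokenization and stop-word removal.
# STOP_WORDS is an illustrative placeholder, not the study's actual list.
import re

STOP_WORDS = {"и", "в", "на"}  # placeholder stop-word list

def preprocess(text):
    tokens = re.findall(r"\w+", text.lower())  # tokenize into word forms
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Матч завершился в Москве"))  # ['матч', 'завершился', 'москве']
```
        </preformat>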
        <p>Separate word forms are used as features.</p>
        <p>2. Feature selection.</p>
        <p>When the feature selection procedure is performed, it is required to determine the feature weights. As a method of weighting features, the statistical measure Term Frequency – Inverse Document Frequency (tfidf) is often used, which for term t and document d in collection D is calculated by the formula:</p>
        <p>tfidf(t, d) = f<sub>t,d</sub> · log(|D| / n<sub>t</sub>),</p>
        <p>where f<sub>t,d</sub> – the frequency of term t in document d; |D| – the total number of documents in the collection D; n<sub>t</sub> – the number of documents in collection D in which the term t occurs.</p>
        <p>After calculating the tfidf, the features are ranked in descending order of weights. The first n features with the highest weight are selected as the most relevant ones.</p>
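        <p>The weighting and ranking procedure can be sketched directly from the formula above. Aggregating each term's weight by its corpus-wide maximum is our assumption for the sketch, since the aggregation rule is not specified here.</p>
        <preformat>
```python
# Sketch of the feature-selection step: compute tfidf per the formula
# tfidf(t, d) = f(t,d) * log(|D| / n_t) and keep the n highest-weighted
# features. The toy corpus below is illustrative only.
import math
from collections import Counter

docs = [
    ["матч", "завершился", "победой"],
    ["бюджет", "принят", "парламентом"],
    ["выставка", "открылась", "музее"],
]
num_docs = len(docs)  # |D|

df = Counter()  # n_t: number of documents containing each term
for toks in docs:
    df.update(set(toks))

def tfidf(term, toks):
    # f(t,d) * log(|D| / n_t), exactly as in the formula above
    return toks.count(term) * math.log(num_docs / df[term])

# Assumption: rank each term by its maximum tfidf weight over the corpus.
scores = {t: max(tfidf(t, toks) for toks in docs) for t in df}
ranked = sorted(scores, key=scores.get, reverse=True)
n = 5
top_features = ranked[:n]  # the n most relevant features
```
        </preformat>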
      </sec>
      <sec id="sec-3-2">
        <title>Text corpus</title>
        <p>To solve the problem of multiclass topic classification, a text corpus was formed from
news articles, each of which belongs to one of six large topics: incidents, culture,
economics, politics, society, sports. The articles were taken from the Internet portals
“Gazeta.ru”, “Lenta.ru”, “Komsomolskaya Pravda”, “RBK” and the news agencies
“Interfax”, “ITAR-TASS”, “RIA Novosti” for the period from 2010 to 2020. The number
of texts in each of the topics is presented in Table 1.</p>
        <p>The created text corpus is unbalanced. The largest topic, “Economics”, contains 38,423 texts. The smallest topic, “Incidents”, contains 10,008 texts.</p>
        <p>The markup of news articles by topics was carried out on the basis of the topics
indicated for these articles on the information resource from which they were taken.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Design of the experiments</title>
        <p>
          The experiments were carried out on a computer with an Intel(R) Xeon(R) CPU
@ 2.30GHz and a Tesla K80 video card, using the Python programming language. Seven
machine learning methods were used to categorize texts, as described in subsection 3.1.
The software implementation of the LR, RF, NB and SVM methods is taken from the scikit-learn library [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the LGBM
method is taken from the lightgbm library [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the RuBERT model is taken from the
DeepPavlov library [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>Word forms serve as text features. The number of features with the highest tfidf weight taken into account in the text representation model was set to 0.01N, 0.05N, 0.1N, 0.25N, 0.5N and N, where N is the total number of features in the training corpus.</p>
        <p>The performance of the categorization was determined by the F1-score calculated
by the formula:</p>
        <p>F1 = 2 · P · R / (P + R), (1)</p>
        <p>where P – precision; R – recall. Macro-averaging was applied to obtain the average value of the F1-score.</p>
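        <p>A toy illustration of the macro-averaged F1-score, with made-up labels rather than the paper's data: formula (1) is applied per class and the per-class values are averaged.</p>
        <preformat>
```python
# Macro-averaged F1 on toy labels: F1 is computed per class via
# F1 = 2*P*R/(P+R), then averaged with equal class weights.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# per-class F1: 1.0, 0.667, 0.8; macro average ≈ 0.822
print(f1_score(y_true, y_pred, average="macro"))
```
        </preformat>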
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The total number of features (word forms) in the training corpus was N=559,108. The
average values of the F1-score, obtained using seven classifiers for a different number
of features with the highest weight, are presented in Table 2 and Figure 1.</p>
      <p>The values of performance measures for the classification of news articles by topics using the two leading models among those considered – RuBERT with N features and SVM with 0.1N features – are presented in Table 5.</p>
      <p>From Table 2 it follows that feature selection can improve the performance of classification of news articles by topics for most machine learning methods. For the LGBM method the best classification quality was obtained at 0.05N features, for RF – at 0.01N and 0.05N features, for kNN, NB and SVM – at 0.1N features. The feature selection for LR and RuBERT did not improve the quality of the classification; for these methods the highest F1-score is achieved with the full set of features.</p>
      <p>Among the considered classifiers, the RuBERT model showed the best results,
reaching F1=0.882. The second result in the quality of classification belongs to the
SVM method and is equal to F1=0.877.</p>
      <p>Based on Tables 3 and 4, we can conclude that a decrease in the number of features has a positive effect on the performance of the classifiers. As the number of features decreases, the training time for the LR, LGBM, RF and SVM models decreases, and the amount of memory required decreases for all models except RuBERT. The SVM classifier had the longest training time: about 7.5 hours with 0.1N features. RuBERT, the best-performing method, trained 2.2 times faster than SVM – in about 3.4 hours. The RF and RuBERT models turned out to be the most demanding in terms of memory, while LR, LGBM and NB required on average an order of magnitude less memory for storing the models.</p>
      <p>Analysis of Table 5 shows that the topics “Culture”, “Economics” and “Sports” are recognized best by the classifiers (F1 varies from 0.952 to 0.990), while the topic “Society” is recognized worst of all (F1=0.700 for SVM and F1=0.731 for RuBERT), because this topic may contain texts that also belong to the five other topics. The largest gap in the F1-score between the SVM and RuBERT models (3.1 percentage points (p.p.)) is observed in the topic “Society” in favor of RuBERT, due to the higher precision of this model (6.6 p.p. higher than SVM). However, SVM has 5.1 p.p. higher precision than RuBERT for “Politics”. The SVM classifier provides higher recall in the topic “Incidents” (4.4 p.p. higher than RuBERT), and RuBERT provides higher recall in the topic “Politics” (4.3 p.p. higher than SVM).</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The problem of text categorization is of great practical importance and can be solved
using machine learning methods. The efficiency of solving the problem is
significantly influenced by data pre-processing, including the selection of the most relevant
features. This study investigates the influence of the number of features selected at the
feature selection stage on the performance of seven classifiers, among which there are
both the classic well-proven SVM and LGBM, and the relatively new and popular
BERT. It was found that the feature selection in most cases improves the quality of
the classification, although it does not give a positive effect for all classifiers.</p>
      <p>Among the considered machine learning methods, the best average classification
quality for six topics was obtained using BERT and was equal to F1=0.882. On
average over the topics (Table 5), RuBERT slightly surpasses SVM in both precision and
recall. The topics “Culture”, “Economics” and “Sports” were recognized most easily by
the classifiers, while the topic “Society” turned out to be the most difficult.</p>
      <p>In future works, it is planned to investigate the effectiveness of machine learning
methods for solving the problem of multi-label classification of news articles in
Russian.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Patro</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.:
          <article-title>Real Time News Classification Using Machine Learning</article-title>
          .
          <source>In: International Journal of Advanced Science and Technology</source>
          ,
          <volume>29</volume>
          (
          <issue>9</issue>
          ),
          <fpage>620</fpage>
          -
          <lpage>630</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Deb</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          et al.:
          <article-title>A Comparative Analysis of News Categorization Using Machine Learning Approaches</article-title>
          .
          <source>International journal of scientific and technology research</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ),
          <fpage>2469</fpage>
          -
          <lpage>2472</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sreedevi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reddy</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Newspaper Article Classification using Machine Learning Techniques</article-title>
          .
          <source>International Journal of Innovative Technology and Exploring Engineering</source>
          ,
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <fpage>872</fpage>
          -
          <lpage>877</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Efficient English text classification using selected Machine Learning Techniques</article-title>
          .
          <source>Alexandria Engineering Journal</source>
          ,
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>3401</fpage>
          -
          <lpage>3409</lpage>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Alshammari</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Arabic Text Categorization using Machine Learning Approaches</article-title>
          .
          <source>International Journal of Advanced Computer Science and Applications</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ),
          <fpage>226</fpage>
          -
          <lpage>230</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Qadi</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          et al.:
          <article-title>Arabic Text Classification of News Articles Using Classical Supervised Classifiers</article-title>
          .
          <source>In: Proceedings of the 2nd International Conference on new Trends in Computing Sciences</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Elnagar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Debsi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Einea</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Arabic text classification using deep learning models</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>57</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          et al.:
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Light Gradient Boosting Machine Homepage, https://github.com/microsoft/LightGBM, last accessed 2021/06/19.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Burtsev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          et al.:
          <article-title>DeepPavlov: Open-source library for dialogue systems</article-title>
          .
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics-System Demonstrations</source>
          ,
          <fpage>122</fpage>
          -
          <lpage>127</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>