=Paper=
{{Paper
|id=Vol-3396/paper20
|storemode=property
|title=A Conceptual Text Classification Model Based on Two-Factor Selection of Significant Words
|pdfUrl=https://ceur-ws.org/Vol-3396/paper20.pdf
|volume=Vol-3396
|authors=Olesia Barkovska,Vladyslav Kholiev,Anton Havrashenko,Dmytro Mohylevskyi,Andriy Kovalenko
|dblpUrl=https://dblp.org/rec/conf/colins/BarkovskaKHMK23
}}
==A Conceptual Text Classification Model Based on Two-Factor Selection of Significant Words==
A Conceptual Text Classification Model Based on Two-Factor Selection of Significant Words

Olesia Barkovska, Vladyslav Kholiev, Anton Havrashenko, Dmytro Mohylevskyi and Andriy Kovalenko

Kharkiv National University of Radio Electronics, Nauki ave., 14, Kharkiv, 61166, Ukraine

Abstract

The aim of the study is to develop a conceptual text classification model based on a combined method of two-factor selection of significant words in a frequency dictionary. The task is relevant due to the growing amount of textual information in electronic form, which requires organization and classification, for example, in the automatic processing of news flows, the distribution of news texts into catalogs, or the analysis of publications in the scientific field. Efficient processing of text arrays and high-quality search for materials require an accurate correlation of each publication with other publications in a particular scientific field, which confirms the relevance of research on the automatic classification of text documents. The goal was achieved by analyzing how the classification accuracy on the Reuters-21578, NSF and MiniNg20 datasets depends on the choice of significant words in a frequency dictionary built on the basis of TF-IDF. The first study, which selected topic-related words by their frequency, showed that for the analyzed datasets the most informative words are those that occur at least 10 to 15 times in the dataset. The second study, which selected topic-related words by reducing the frequency vector through a threshold on the frequency dictionary, showed that using a range of 2000 to 4000 significant words gives better results for all datasets than using all words in the feature vector.
The proposed combined method of two-factor selection of topic-related words (based on the frequency of topic-related words together with a threshold on the frequency dictionary) outperforms the previous methods on all three datasets and increases the accuracy of text document classification by 2 to 4 percent.

Keywords

Text classification, text representation, TF-IDF, frequency dictionary, acceleration, accuracy

1. Introduction

With the increase in the amount of textual information in electronic form, the task of automatic text classification becomes increasingly relevant. This task arises during the automatic processing of news flows and the distribution of news texts into catalogs (Figure 1). For the convenience of users, directories are organized in a hierarchical structure: a directory consists of several subdirectories, and so on. The task of classification is especially important in the scientific field, where tens of thousands of monographs, articles, preprints, and other types of publications are added annually in each discipline. Effective processing of such arrays and high-quality search for materials relevant to a particular research area require an accurate correlation of each publication with its thematic category [1] for different languages, including Ukrainian [2, 3].

_______________________________
COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20–21, 2023, Kharkiv, Ukraine
EMAIL: olesia.barkovska@nure.ua (O. Barkovska); vladyslav.kholiev@nure.ua (V. Kholiev); anton.havrashenko@nure.ua (A. Havrashenko); dmytro.mohylevskyi@nure.ua (D. Mohylevskyi); andriy.kovalenko@nure.ua (A. Kovalenko)
ORCID: 0000-0001-7496-4353 (O. Barkovska); 0000-0002-9148-1561 (V. Kholiev); 0000-0002-8802-0529 (A. Havrashenko); 0009-0003-2889-6208 (D. Mohylevskyi); 0000-0002-2817-9036 (A. Kovalenko)
© 2023 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Figure 1: Practical applications of text array classifiers (news 11%, science 23%, social networks 15%, mood analysis 18%, miscellaneous 33%)

After converting documents into vector form, this task is well suited to machine learning methods. Text mining is currently developing actively: research is being conducted, and projects and competitions are being launched to identify the most accurate algorithms. Many different methods are used to classify text documents [4, 5]. The k-nearest neighbors method and its modifications are widely used, where the classified object is assigned to the class of the k closest objects in the training set. Another algorithm is Bayesian classification, which is based on calculating the posterior probabilities of classes. A representative of linear classifiers is the support vector machine, which constructs a hyperplane that separates the sample objects optimally. Recently, neural networks have been increasingly used to solve the classification problem [6, 7]. On average, the accuracy of various text classification algorithms varies from 70% to 90% and depends not only on the classification algorithm but also on the quality of the source data.

2. Related Works

Many existing methods of text classification are based on terminological proximity. The text is represented as a vector in Euclidean space, where the coordinate axes are terms, n-grams or lexemes extracted from the text, and the coordinate along each axis is statistical information about them. Thus, the text can be represented as a vector of word occurrence frequencies based on TF, TF-IDF, C-TF-IDF, and other schemes [8, 9]. Another important parameter in text classification is the proximity measure calculated between vectors.
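As an illustration of this representation, here is a minimal sketch of a TF-IDF vectorizer over tokenized documents. The toy corpus and the plain TF * log(N/df) weighting are illustrative assumptions, not the paper's exact setup:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the raw term count inside a document; IDF is log(N / df(t)),
    where df(t) is the number of documents containing term t.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["price", "oil", "oil"], ["oil", "market"], ["sports", "news"]]
vecs = tfidf_vectors(docs)
# "oil" occurs in 2 of 3 documents, so its weight in doc 0 is 2 * log(3/2)
```

Terms that occur in every document get a weight of zero, which is exactly why frequency-based weighting separates topical words from function words.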
Its choice has an impact on the quality of classification. Well-known metrics include the Euclidean distance, the Minkowski distance, the Ochiai coefficient, the Jaccard coefficient, the projection distance, and others. Consider in more detail the main methods used in text classification. These methods belong to supervised machine learning (Table 1). Metric classification methods include the k-nearest neighbors method, where the classified object is assigned to the class of the objects in the training set closest to it. The classic k-nearest neighbors algorithm has many modifications, owing to its high computational complexity and low classification speed. One study compares the results of classifying Fudan University texts using five methods: the classical k-nearest neighbors method, k weighted nearest neighbors, fuzzy k-nearest neighbors, k-nearest neighbors based on Dempster-Shafer theory, and k-nearest neighbors based on the fuzzy integral [10]. It shows that the best accuracy, 86%, is achieved by the algorithm based on the fuzzy integral, while the accuracy of the classical k-nearest neighbors algorithm is only 78%. Another group of classifiers is probabilistic. A widely used algorithm of this class is naive Bayesian classification, the simplest variation of Bayesian classifiers, based on the assumption of feature independence. Since the classical approach to naive Bayesian classification often does not include the weights of the learned features in the conditional probability estimation, Liangxiao Jiang and co-authors propose a naive Bayesian classification with deep feature weighting, which computes feature weights from frequencies in the training data and then takes these weights into account when calculating the probability [11]. In that paper, naive Bayesian classification is used to determine the authorship of texts.
Depending on the representation of the text, for example, in the form of n-grams, the accuracy of the method applied to this task ranged from 40% (with trigrams and tetragrams) to 96.67% (with terms). The study revealed a problem in the parameter estimation process that can affect the accuracy of naive Bayesian text classification. To eliminate it, the authors propose to normalize the text of each document and use the feature weighting method. To improve the performance of naive Bayesian classification, the method of auxiliary functions is also used, the Kullback-Leibler distance is calculated between words, naive Bayesian trees are built, and multinomial, Bernoulli and Gaussian variants of the naive Bayesian classifier are applied. One study shows that multinomial naive Bayesian classification gives a better result when classifying texts (although its accuracy is only 73.4%) than Bernoulli naive Bayesian classification (69.15%). When comparing the three methods based on naive Bayesian classification, it is shown that Bernoulli naive Bayesian classification is comparable in results to the classical one, while the Gaussian naive Bayesian classifier gives the best classification accuracy. One example of a linear classifier is the support vector machine, which constructs a hyperplane that separates the sample objects optimally. There is also classification based on graph theory methods, for example, the random forest method, which builds an ensemble of independent decision trees trained in parallel [12]. A number of studies have suggested ways to improve the performance of the random forest method. Thus, to solve multi-class problems, it is proposed to use the chi-square method for calculating the weights of objects [13].
By using a new feature weighting method for subspace sampling and a tree selection method, the subspace size is effectively reduced and classification performance is improved. Depending on the dataset, the method demonstrates classification accuracy from 72% to 92%. The semantics-aware random forest algorithm on trees of different sizes shows an accuracy of 73-78%, while the accuracy of the classical algorithm is 57-60% [14]. Recently, neural networks have seen increased use for the classification problem. Siwei Lai and co-authors propose recurrent convolutional neural networks for text classification [15]. The authors conclude that the use of neural networks in the classification of text documents helps to avoid the problem of sparse data and to collect more contextual information about entities compared to traditional methods. Convolutional neural networks have shown high accuracy (83.98%) in the classification of patent documents.

Table 1
Classification methods

Method                          Accuracy   Scope      Computational complexity   Classification speed
k-nearest neighbors             78%–86%    86%–91%    High                       Low
Support vector machine          63%–90%    83%–87%    Low                        Low
Naive Bayesian classification   40%–83%    80%–90%    Low                        Low
Random forest                   57%–78%    75%–82%    High                       High
Convolutional neural networks   83.98%     70%–85%    High                       High

There are many studies aimed at comparing the accuracy of text document classification with different methods. Thus, when comparing three methods (k-nearest neighbors based on the fuzzy integral, the support vector machine and Bayesian classification), the support vector machine showed the best accuracy of 90%. When classifying tweets in Turkish, the methods showed different classification results depending on the size of the training sample; the best results, from 63% to 83%, in all three cases were demonstrated by Bayesian classification. When classifying books, the Bayesian classifier also showed the best accuracy, 81%.
However, when classifying Indian and English tweets, although Bayesian classification was the most effective, its accuracy did not exceed 63%. One study uses five classifiers to classify data from news websites: k-nearest neighbors, random forest, the multinomial naive Bayesian classifier, logistic regression, and the support vector machine. The most effective algorithm was the support vector machine, which demonstrated not only a high accuracy of 91% but also the fastest running time, at least one and a half times shorter than the other algorithms studied [16, 17]. Combinations of different classification algorithms are also used to improve classification accuracy [18]. For example, the combination of the k-nearest neighbors and support vector machine algorithms yields a classification accuracy 1 to 2% higher than when these classifiers are used separately. The combination of k-nearest neighbors, the Rocchio algorithm, and the least squares method reduced the number of classification errors by 15%. Thus, on average, the accuracy of various text classification algorithms varies from 70% to 90%. At the same time, classification accuracy depends not only on the chosen algorithm but also on the source data and preprocessing methods [19, 20]. Hence, the analysis and development of a method that rationally processes the source data and classifies it with a lower error rate is a relevant task.

3. Aims and Tasks of the Work

The aim of the work is to create a conceptual model of two-factor text classification on the example of standardized training and test text datasets.
To achieve this goal, the following tasks have to be solved:
- to review the existing methods of text data classification;
- to analyze methods of pre-processing and preparation of input text data;
- to develop a text classification model based on two-factor selection of topic-related words;
- to research the impact of frequency vector reduction on text classification accuracy;
- to analyze the results obtained.

4. Results and Discussion

When choosing a specific text classification algorithm, one should take into account the features of each of them. The issues of determining the set of classifying features, their number, and how to calculate their weights remain unresolved. In deep learning algorithms, classification accuracy depends on the availability of a training set of appropriate size, and preparing such a set is a very time-consuming process. The problem of selecting the parameters of some algorithms at the training stage is also still open. Figure 2 shows a general scheme of the classification process, taking into account the main stages and options for their implementation.

Figure 2: Accepted stages of the automatic text classification process

After analyzing the existing methods of automatic text classification, a new two-factor classification model was developed. It is shown in Figure 3 in IDEF0 notation. In the proposed model, feature selection is performed using a two-factor approach based on the TF-IDF and C-TF-IDF methods. C-TF-IDF is a class-based TF-IDF procedure that can be used to create features from text documents based on the class they are in. The goal of a class-based TF-IDF is to provide all documents within a class with the same class vector. To do this, we must start thinking about TF-IDF in terms of classes rather than individual documents. C-TF-IDF can best be explained as the TF-IDF formula adapted for multiple classes by combining all documents of each class.
This way, each class is transformed into a single document rather than a set of documents.

Figure 3: A conceptual model of text classification based on two-factor selection of topic-related words

In this paper, we implement four main methods: all words (AW), all words with corpus-based pruning (AWP), all words with class-based keyword selection (AWK), and two-stage feature selection with both pruning and keyword selection (AWPK). The AW method is a basic method that uses the standard bag of words with all the words in the feature vector. The bag of words is a useful tool used for various purposes, such as classifying texts as spam/not spam, determining the similarity of texts, and as a simplified way to represent texts for various machine learning tasks at the pre-processing stage. The bag of words records the words found in the text, but it does not take into account their order or semantics, which can be regarded as a shortcoming of the method. Classification of text arrays that takes the semantics of the text into account is available in most modern intelligent models, but their use requires a powerful local computing resource or the cost of renting a computing server. AWP takes into account all the words in the document collection but filters them using a pruning process. This method filters out terms that occur less often than a certain threshold value in the entire training set. We call this threshold value the pruning level (PL). PL = n (n ≥ 1) indicates that terms appearing at least n times in the training set are used in the decision vector, while the rest are ignored. Note that PL = 1 corresponds to the AW method (i.e., no pruning). We perform parameter tuning by analyzing different values for each dataset to find optimal PL values for the AWP method. We conduct experiments with pruning levels from 2 to 30: 2, 3, 5, 8, 13, 20, and 30. In the AWK method, separate keywords are selected for each class.
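The pruning step just described can be sketched in a few lines. The helper name and toy corpus are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter

def prune_vocabulary(train_docs, pruning_level):
    """Keep only terms occurring at least `pruning_level` times in the
    whole training set. PL = 1 keeps everything, i.e. the AW method."""
    counts = Counter(t for doc in train_docs for t in doc)
    return {t for t, c in counts.items() if c >= pruning_level}

train = [["oil", "price", "oil"], ["oil", "market"], ["market", "news"]]
# counts: oil 3, market 2, price 1, news 1
vocab_pl2 = prune_vocabulary(train, 2)   # {"oil", "market"}
```

Raising PL shrinks the feature vector quickly, because word frequencies in text corpora are heavily skewed toward rare terms.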
This method gives equal weight to each class during the keyword selection phase. We experiment with five different numbers of keywords (250, 500, 1000, 2000, and 4000) and compare the results with AW, which includes all words as features in the decision vector. The AWPK method is designed as an optimal combination of AWP and AWK, varying the pruning level and the number of keywords. The parameter values that give the best results in the basic methods are used for the AWPK experiments.

4.1. Performing the experiment

Based on the methods discussed in the previous section, here we determine the optimal parameter values (pruning level and number of keywords) for the methods on all datasets. The experiments were evaluated and the methods were compared with respect to the micro-averaged F-measure (MicroF), which is the average success rate over documents, and the macro-averaged F-measure (MacroF), which is the average success rate over categories [21]. For the three datasets, we analyzed the relationships between the keyword frequency vector and classification accuracy; between the size of the text collections and the quality of classification; and between the classification methods and the text datasets. As a result of the experiments, the impact of reducing the vector of keyword frequencies on the accuracy of text classification is assessed, and the impact of choosing a range of keywords according to the TF-IDF metric on the quality of classification is analyzed. The study of the influence of the choice of keyword rank according to the TF-IDF metric on the quality of classification is the second experiment conducted on the three text collections. In this paper, we use three well-known datasets from the UCI Machine Learning Repository: Reuters-21578 (Reuters), National Science Foundation Research Award (NSF) abstracts, and Mini 20 newsgroups (MiniNg20). These datasets have different characteristics that can be crucial for classification performance.
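The difference between the two evaluation measures can be made concrete with a sketch that computes both from per-class true-positive/false-positive/false-negative counts (the counts and function names are illustrative):

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, 0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f(per_class):
    """per_class: {label: (tp, fp, fn)}.
    MicroF pools the counts over all documents; MacroF averages the
    per-class F1 scores, giving every class equal weight."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in per_class.values()) / len(per_class)
    return micro, macro

# a large well-classified class and a small poorly-classified one
micro, macro = micro_macro_f({"grain": (90, 10, 10), "rare": (1, 4, 4)})
```

On this toy input the rare class drags MacroF down to 0.55 while MicroF stays near 0.87, which mirrors the paper's point that skewed datasets hurt MacroF most.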
Skewness is one of the key properties of a dataset, defined here as the distribution of the number of documents across classes. A low skewness coefficient indicates a balanced dataset with approximately the same number of document samples for each class. Other properties of datasets are the validity of multiple classes per document (a document can belong to more than one topic), document length (e.g., short abstracts or long news articles), split proportions (training and test sets), and level of formality (e.g., formal journal documents or informal Internet forum posts). In our experiments, we use the standard partitions of the Reuters dataset, which contains structured information about news feed articles that can be categorized into several classes, creating a multi-label problem; the collection consists of 21,578 documents. The MiniNg20 dataset is informal, with many grammatical errors, allows only one topic per text, and is balanced, containing the same number of messages for each topic; it consists of 2000 messages. The NSF dataset consists of 129,000 abstracts describing NSF awards for basic research from 1990 to 2003; its level of formality is high, it is not perfectly balanced, but its skewness coefficient is not as high as that of Reuters, and its documents are short because they are abstracts. For NSF, data related to the year 2001 were randomly selected, and five sections were chosen from this year (four sections for training and one for testing). We create five different splits, repeat all tests with them, and take the average as the final result.

4.2. A method for selecting topic-related words based on word frequency

In this experiment, the AWP method was implemented with several PL values (PL = 1 corresponds to AW) for the three datasets.
Table 2 shows the feature number and the micro and macro success rates for each pruning level. The first column of the table shows the method and the value of the PL parameter, separated by a comma. As can be seen, the pruning process improves the success rate of the classifier, and the best results (high accuracy with low feature numbers) are obtained at approximately PL = 13 consistently across all three datasets with both performance measures.

Table 2
AWP success rates (optimal results are highlighted in bold)

Method,     Reuters                     NSF                         MiniNg20
Parameter   Feature#  MicroF  MacroF    Feature#  MicroF  MacroF    Feature#  MicroF  MacroF
AW          20292     85.58   43.83     13424     64.46   46.11     30970     46.42   43.44
AWP,2       12959     85.55   43.84     8492      64.41   46.21     13102     49.73   47.13
AWP,3       9971      85.52   43.93     6328      64.62   46.42     9092      49.64   47.19
AWP,5       7168      85.51   44.56     4528      64.86   46.49     6000      51.26   48.52
AWP,8       5268      85.73   44.91     3376      64.66   46.38     4169      52.48   49.90
AWP,13      3976      85.84   44.85     2478      64.58   46.49     2863      53.62   51.02
AWP,20      3046      86.02   44.55     1875      64.23   46.67     2025      53.78   51.02
AWP,30      2237      81.29   43.59     1419      63.84   46.21     1384      52.89   50.46

Consistent with the generalization that words occurring fewer than 10 to 15 times in a dataset are unlikely to be good indicators for text classification, the pruning-based experiments yielded PL = 13. This result indicates that the common belief in the literature that a pruning level of 2 to 3 is sufficient to eliminate uninformative terms is not true.

4.3. A method for selecting topic-related words based on determining the threshold of a frequency dictionary

In this experiment, the performance of the AWK method was analyzed using different values of the keyword (feature) number parameter. The results are shown in Table 3. The success rates for AW are also included in the table for comparison.
Table 3
AWK success rates (optimal results are highlighted in bold)

Method,     Reuters           NSF               MiniNg20
Parameter   MicroF  MacroF    MicroF  MacroF    MicroF  MacroF
AWK,250     83.69   51.15     62.04   49.51     56.65   55.72
AWK,500     84.71   50.92     62.92   49.31     56.16   55.01
AWK,1000    85.16   51.72     64.69   49.33     53.68   52.17
AWK,2000    85.58   52.03     65.19   49.31     54.04   52.10
AWK,4000    85.84   52.10     65.71   49.35     55.25   53.73
AW          85.58   43.83     64.46   46.11     46.42   43.44

In general, the AWK method with 2000 to 4000 keywords increases the success rate on all datasets compared to the AW method. Therefore, it can be concluded that using a specific set of keywords for each class gives better results than using all the words in the feature vector. When we analyze the results of AWP and AWK together, we see that the improvement of AWP over AW is clear on the balanced dataset (MiniNg20), while the improvement on the skewed datasets (Reuters and NSF) is smaller. On the other hand, the improvement of AWK over AW is more significant than that of AWP on all datasets. This performance gain is more pronounced in the MacroF measure. In corpus-based approaches, documents of rare classes tend to be misclassified because the words of the predominant classes dominate the feature vector. Figure 4 and Figure 5 show the MicroF and MacroF results, respectively, for the class-based and corpus-based approaches with TF-IDF representation of the documents using all words and keywords in the range from 10 to 2000.
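Class-based keyword selection in the spirit of AWK and C-TF-IDF can be sketched by merging each class into one pseudo-document and computing IDF over classes instead of documents. The scoring details and toy data here are simplifying assumptions, not the paper's exact formula:

```python
import math
from collections import Counter

def class_keywords(docs_by_class, k):
    """Select the top-k terms per class by a class-based TF-IDF:
    all documents of a class are merged into one pseudo-document,
    and IDF counts how many classes (not documents) contain a term."""
    n_classes = len(docs_by_class)
    class_tf = {c: Counter(t for d in docs for t in d)
                for c, docs in docs_by_class.items()}
    cf = Counter()                      # class frequency per term
    for tf in class_tf.values():
        cf.update(set(tf))
    keywords = {}
    for c, tf in class_tf.items():
        scored = {t: tf[t] * math.log(n_classes / cf[t]) for t in tf}
        keywords[c] = sorted(scored, key=scored.get, reverse=True)[:k]
    return keywords

kw = class_keywords({"oil":   [["barrel", "price"], ["barrel", "opec"]],
                     "sport": [["match", "price"], ["match", "goal"]]}, 1)
# "price" appears in both classes, so its class IDF is zero and it is
# never selected; each class keeps its own discriminative keyword
```

Because every class gets its own k keywords, rare classes are represented in the feature set even when a dominant class would otherwise crowd them out, which is the effect discussed above for MacroF.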
Figure 4: MicroF for two factors – frequency measure of topic-related words and topic-related words threshold in the frequency dictionary (MicroF, %, for feature numbers 10 to 2000 and all)

Regarding the MicroF results (Figure 4), we can conclude that class-based feature selection achieves a higher MicroF than the corpus-based approach for a small number of keywords. In text classification, most of the learning takes place with a small but important portion of keywords for a class. Class-based feature selection, by definition, focuses on this small portion; the corpus-based approach, on the other hand, finds common keywords that apply to all classes. So, with a small number of keywords, the class-based approach is much more successful in finding the more important class keywords. The corpus-based approach is not successful with such a small portion, but it has a steeper learning curve that reaches a peak value of 86% in the experiments with 2000 corpus-based keywords. For the MacroF results (Figure 5), we found that class-based feature selection provides consistently higher MacroF performance than the corpus-based approach. High asymmetry in the distribution of classes in the dataset negatively affects the MacroF value, since MacroF gives equal weight to each class rather than to each document, and documents of rare classes are more likely to be misclassified. Accordingly, the average value of correct class classifications drops sharply for datasets with many rare classes. Class-based feature selection is very useful under this asymmetry. As mentioned above, even with a small fraction of the words (e.g., 100), the class-based TF-IDF method achieves a 50% success rate, which is much better than the 43.9% success rate of TF-IDF with all words.
Rare classes are successfully characterized by class-based feature selection because each class has its own keywords for the categorization problem. The corpus-based approach performs worse because most of the keywords are selected from the predominant classes, which does not allow rare classes to be fairly represented by their keywords.

Figure 5: MacroF for two factors – frequency measure of topic-related words and topic-related words threshold in the frequency dictionary (MacroF, %, for feature numbers 10 to 2000 and all)

MacroF gives equal weight to each class when determining the success of a classifier. Thus, especially for highly skewed datasets, where rare classes are poorly represented by the selected features, the average value of correct classifications for rare classes drops significantly. This is true for both AW and AWP on skewed datasets, since they use a common feature set for all classes. However, with class-based keyword selection, since each class has its own keywords during classification, sparse classes are characterized more successfully. Thus, we observe a significant increase in the success rate (MacroF) with AWK on the skewed datasets.

4.4. Combined method of two-factor selection of topic-related words

The AWPK method combines the optimal patterns of the AWP and AWK approaches. Therefore, the parameters of the method are the pruning level and the number of keywords. In this experiment, we use the optimal values of these parameters determined in the previous analyses for each dataset: a pruning level of 13 and 2000 or 4000 keywords. The results are shown in Table 4. The table also shows the best performing AW, AWP, and AWK configurations for comparison. As can be seen from the table, the two-factor feature selection approach outperforms the previous approaches.
Selecting the best 2000-4000 keywords for each class after an initial pruning step significantly improves on the best AWP (PL = 13) and AWK (2000-4000 keywords) performance on all three datasets.

Table 4
AWPK success rates (optimal results are highlighted in bold)

Method,         Reuters           NSF               MiniNg20
Parameter       MicroF  MacroF    MicroF  MacroF    MicroF  MacroF
AWPK,13,2000    86.40   53.95     66.06   50.11     57.43   55.66
AWPK,13,4000    86.70   53.98     66.10   50.12     57.43   55.66
AW              85.58   43.83     64.46   46.11     46.42   43.44
AWP,13          85.84   44.85     64.58   46.49     53.62   51.02
AWK,2000        85.58   52.03     65.19   49.31     54.04   52.10
AWK,4000        85.84   52.10     65.71   49.35     55.25   53.73

The diagram in Figure 6 compares the three methods by micro-averaged F-measure, with the AWPK method performing better than the others, especially on the Reuters set. Comparing the three methods by macro-averaged F-measure in Figure 7, it can be seen that the AWPK method is again better than the others, with the best performance on the MiniNg20 dataset. Also, due to the heterogeneity of the data, we see different micro-averaged and macro-averaged F-measure values for different datasets. Hence, we can conclude that the additional effect of corpus-based pruning is amplified when it is combined with the class-based TF-IDF keyword selection metric. As a consequence, the method proposed in this paper, AWPK, gives the best performance. The significance of the results for the three methods was measured using a statistical significance test. We noticed that, in general, each method outperforms its predecessor: AWP and AWK are significantly better than the standard AW method, and AWPK is significantly better than both AWP and AWK. Thus, the most advanced method in this study (AWPK) is the optimal method with two-factor feature selection.
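The two serial steps of AWPK can be sketched as a single pipeline. This is a simplification: plain per-class term frequency stands in for the class-based TF-IDF score, and the names and toy data are illustrative:

```python
from collections import Counter

def awpk_features(docs_by_class, pruning_level, k):
    """Two-factor selection in the spirit of AWPK: step 1 prunes terms
    occurring fewer than `pruning_level` times in the whole training
    set; step 2 keeps the k highest-scoring surviving terms of each
    class (here scored by raw class frequency for brevity)."""
    total = Counter(t for docs in docs_by_class.values()
                    for d in docs for t in d)
    kept = {t for t, c in total.items() if c >= pruning_level}
    features = set()
    for docs in docs_by_class.values():
        tf = Counter(t for d in docs for t in d if t in kept)
        features.update(sorted(tf, key=tf.get, reverse=True)[:k])
    return features

feats = awpk_features({"a": [["x", "x", "y"]],
                       "b": [["z", "z", "w"]]}, pruning_level=2, k=1)
```

The resulting feature set would then feed the classifier (an SVM in the paper's experiments), with the pruning level and keyword count tuned per dataset as in Tables 2 and 3.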
Figure 6: MicroF comparison chart of the three methods

Figure 7: MacroF comparison chart of the three methods

The scientific novelty of the research lies in creating a flexible method for extracting topic-related words from a frequency dictionary in order to increase the accuracy of subsequent SVM classification and to reduce the redundancy of the sample. To this end, the classification accuracy was studied using different values of term occurrence in the dictionary (the occurrence value of 13 was determined empirically) as well as different word thresholds in the dictionary (the threshold of 4000 words was determined empirically). By combining these studies, a modified method for determining topic-related words was proposed, increasing the classification accuracy by 4%.

5. Conclusion

The paper proposes a conceptual model of text classification based on the accepted stages of the automatic text classification process with a modification of the feature selection module. The method of topic-related word selection was improved by combining two factors: the frequency measure of topic-related words and the topic-related word threshold in the frequency dictionary. This was achieved by analyzing how the classification accuracy on the Reuters-21578, NSF and MiniNg20 datasets depends on the choice of topic-related words in the frequency dictionary built on the basis of the TF-IDF method. The first study, which selected topic-related words by their frequency, showed that for the analyzed datasets the most informative words are those that occur at least 10 to 15 times in the dataset.
The second study, on reducing the frequency vector by setting a threshold on the frequency dictionary, showed that keeping between 2000 and 4000 topic-related words gives better results for all datasets than using all words in the feature vector. Then, after determining the optimal parameter values for each method (those with the highest micro-F and macro-F measures), a new two-factor method was proposed, which combines the two approaches. The proposed combined method of two-factor selection of topic-related words outperforms the previous approaches on all three datasets and increases the accuracy of text document classification by 2 to 4 percent.

Possible future work is to apply the two-factor feature selection approach to more semantically oriented text classification methods, such as methods that use language models, linguistic features, or lexical dependencies. Integrating keyword reduction and keyword-count selection into these methods as two serial steps can lead to higher classification performance.
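The two serial steps can be combined into a single selection routine: first apply the occurrence threshold to the frequency dictionary, then rank the surviving candidates by TF-IDF and truncate to the keyword budget. A minimal sketch under simplifying assumptions (tokenized documents, corpus-level rather than class-level TF-IDF, illustrative function name; the defaults PL = 13 and k = 4000 are the empirically chosen values from the study):

```python
import math
from collections import Counter

def two_factor_select(tokenized_docs, pl=13, k=4000):
    """Two-factor selection sketch: (1) drop words occurring fewer than
    `pl` times in the corpus, then (2) keep the `k` candidates with the
    highest TF-IDF scores as the reduced feature vocabulary."""
    n_docs = len(tokenized_docs)
    corpus_freq = Counter(w for doc in tokenized_docs for w in doc)
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    # Factor 1: occurrence threshold in the frequency dictionary.
    candidates = [w for w, f in corpus_freq.items() if f >= pl]
    # Factor 2: TF-IDF ranking truncated to the top-k keywords.
    score = lambda w: corpus_freq[w] * math.log(n_docs / doc_freq[w])
    return sorted(candidates, key=score, reverse=True)[:k]
```

The resulting vocabulary (2000-4000 words in the reported experiments) then defines the feature vector handed to the SVM classifier.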