<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Evaluation of the Keyword Selection Methods Effectiveness for the Fake News Classification⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khrystyna Lipianina-Honcharenko</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Lendiuk</string-name>
          <email>dmytrolenduk@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazar Melnyk</string-name>
          <email>88nazar88@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Myroslav Komar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taras Lendiuk</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>West Ukrainian National University</institution>
          ,
          <addr-line>Lvivska str., 11, Ternopil, 46000</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This study is devoted to the comparison of different methods of choosing keywords for the classification of fake and true news based on Ukrainian and Russian datasets. The TF-IDF, RAKE, Yake!, LSA, LDA, and TextRank methods were used and evaluated by such metrics as accuracy, recall, precision, and F1-measure. The results showed that the TF-IDF and RAKE methods are the most effective for news classification, showing an overall accuracy of 88% and 87.94%, respectively. The TF-IDF method achieved a precision for fake news of 0.90 and a recall for true news of 0.94. Other methods, such as Yake! and LDA, showed significantly lower accuracy, 78% and 67%, respectively. The obtained results indicate the prospects of using the TF-IDF and RAKE methods for the development of a system for the automatic detection of fake news.</p>
      </abstract>
      <kwd-group>
        <kwd>news classification</kwd>
        <kwd>fake news</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>RAKE</kwd>
        <kwd>machine learning</kwd>
        <kwd>accuracy</kwd>
        <kwd>disinformation detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The problem of disinformation, in particular the spread of fake news, is gaining more and more
importance in the modern information environment. This problem is especially acute in the
conditions of hybrid conflicts and information warfare, where false information is used as a tool to
influence public opinion and social stability. Therefore, the development of effective methods for the
automatic detection of fake news is an extremely important task for ensuring objective information to
society. In this context, the use of machine learning and text-based methods for processing large
volumes of data opens up new opportunities for automating the process of identifying false
information.</p>
      <p>This study focuses on the comparison of different keyword selection methods for news
classification, such as TF-IDF, RAKE, Yake!, LSA, LDA, and TextRank. The main goal is to identify
the most effective approaches that can be used to create a fake news detection tool. Through
experiments on sets of news data, the effectiveness of each method was evaluated according to
such indicators as accuracy, recall, precision, and F1-measure. The results of this study will contribute
to the further development of automatic disinformation detection technologies based on machine
learning.</p>
      <p>This work focuses on the comparison of keyword selection methods for the classification of fake
and true news and consists of several sections. Section 2 analyzes existing research in the field of
disinformation detection and text processing methods. Section 3 describes the research
methodology, including the data collection process, pre-processing, and the application of algorithms for
keyword extraction and the construction of classification models. Section 4 provides an analysis of the
obtained results with an assessment of the accuracy of the methods using the precision, recall, and
F1-measure metrics. Section 5 highlights the effectiveness of the TF-IDF and RAKE methods
and provides recommendations for the further development of a system for the automatic detection of
fake news.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Several studies have investigated methods for detecting fake news by comparing machine learning
(ML) and deep learning (DL) approaches. For example, the study of Khaleel Ur et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] compared SVM
(ML) and LSTM (DL), where LSTM achieved an accuracy of 99.54%, while SVM showed 99.29% in
authenticating news articles. Similarly, Chauhan and Palivela [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] noted the high performance of
neural networks, particularly LSTM, which achieved 99.88% accuracy.
      </p>
      <p>
        These studies often use approaches such as Random Forest, Decision Tree, and feature extraction
methods such as TF-IDF. For example, in a study by Johnson et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Random Forest achieved 100%
accuracy, outperforming the Decision Tree method, which achieved 93.64%.
      </p>
      <p>
        The study [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] examines three feature extraction techniques for detecting fake news: TF-IDF,
Count Vectorizer (CV), and Hash Vectorizer (HV). These methods use linguistic features to improve
disinformation identification.
      </p>
      <p>
        The study [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes a new approach to detecting fake news based on the use of Positive and
Unlabeled Learning (PUL) with the addition of an attention mechanism to identify the most relevant
keywords in a news network. The GNEE algorithm based on Graph Attention Networks (GAT) is
used, which allows automatic selection of important terms for news classification. Results showed a
2-10% improvement in the F1-measure in scenarios with limited labeled data (only 10% of fake news
was labeled).
      </p>
      <p>Most comparisons show the benefits of deep learning, especially using LSTM networks, due to
their ability to better handle complex text patterns, while traditional ML models like SVM continue
to provide robust results. The main difference of this study is a thorough comparative analysis of
keyword extraction methods, such as TF-IDF, RAKE, Yake!, KeyBERT, LSA, LDA and TextRank,
combined with classifiers, aimed at a specific geopolitical context using Ukrainian and Russian
datasets, which increases its significance for detecting fake news during the Russian-Ukrainian war.</p>
      <p>Considering this, the purpose of this study is to compare and determine the most effective
methods of choosing keywords for the classification of fake and true news, which will allow creating
an accurate tool for disinformation automatic detection. This study also aims to analyze different
text processing approaches, such as TF-IDF, RAKE, Yake!, LSA, LDA, and TextRank, in order to
evaluate their impact on classification accuracy and identify the most effective ones for further
implementation in the fake news detection system.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Research Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Research architecture</title>
        <p>The research process begins with collecting the news datasets, pre-processing the texts, and
extracting keywords with each of the studied methods. The next stage involves using the received
keywords to build classification models. The purpose of this step is to evaluate how effectively each
keyword extraction method can help detect fake news. At the final stage, the models are evaluated:
accuracy, recall, F1-measure, and other metrics are measured for each method. This approach allows
for a comprehensive analysis of various text processing techniques and their impact on the accuracy
of fake news classification.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Datasets description</title>
        <p>
          Ukrainian News Fake vs True Dataset and Fake News UA Dataset contain news from Ukrainian and
Russian sources, collected for the purpose of classifying fake and authentic news. These datasets are
important for information space analysis and machine learning research related to detecting fake
news during the Russian-Ukrainian war [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          The Ukrainian News Fake vs True Dataset contains approximately 10,700 news headlines
collected from Ukrainian and Russian Telegram channels between February 24 and December 11,
2022, during the full-scale Russian invasion of Ukraine [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This open data set includes both authentic
news and fake news published by Russian sources. In particular, 4,522 titles are classified as fake,
and 6,237 as authentic. Two types of labels are used to classify news: "True" for verified news and
"False" for fake news. Data sources include Telegram channels such as Suspilne Novyny, Perepichka
NEWS, and NR, as well as disinformation channels, including Vox Ukraine and War on Fakes. The
dataset was developed for a university project to classify news and is useful for machine learning
research.
        </p>
        <p>Fake News UA Dataset contains news samples collected from Ukrainian and Russian sources for
the purpose of classifying fake news. The dataset contains various articles and news headlines from
Ukrainian Telegram channels, as well as links to primary sources such as Vox Ukraine, Ukrainian
Pravda, Espreso, and Radio Svoboda. Each record contains the text of the news, its original source,
language (Ukrainian or Russian), as well as a label indicating whether the news is fake. In total, the
set contains 12,749 news items, labeled either "Fake" (3,375 items) or "True". The dataset is
intended for text classification tasks and also contains timestamps that make it possible to
investigate the dynamics of the spread of fake news.</p>
        <p>Analysis of the graph (Figure 2) of the distribution of "Fake" and "True" classes after removing
records with omissions shows an uneven distribution of news between fake and true in the combined
data set. In general, the number of true news (True) is 4,668 (about 58% of the total), which exceeds
the number of fake news (Fake), which is only 3,375 (about 42%).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Description of the used keyword selection methods</title>
        <sec id="sec-3-3-tfidf">
          <title>3.3.1. TF-IDF</title>
        <p>
          TF-IDF (Term Frequency-Inverse Document Frequency) is a method for evaluating the importance
of a term in a document relative to an entire document collection or corpus. It consists of two main
parts: TF (Term Frequency) and IDF (Inverse Document Frequency), which together help reveal how
important a particular term in a particular document is compared to the entire set of documents [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
          <p>Term frequency shows how often a certain term occurs in a particular document. The formula for calculating TF is: \mathrm{tf}(t, d) = \frac{f(t, d)}{\sum_{t' \in d} f(t', d)}, where f(t, d) is the number of occurrences of the term t in the document d, and \sum_{t' \in d} f(t', d) is the total number of all terms in the document d.</p>
        <p>Thus, TF shows the share of a certain term from the total number of terms in the document.</p>
          <p>The inverse document frequency indicates the importance of a term in the entire corpus of documents. If a term occurs in many documents, its importance decreases. The formula for IDF is: \mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}, where |D| is the total number of documents in the corpus and |\{d \in D : t \in d\}| is the number of documents that contain the term t.</p>
          <p>The less frequently a term appears in documents, the higher its IDF value will be. The final TF-IDF formula combines both of these metrics: \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D).</p>
          <p>This means that the importance of a term depends both on its frequency in the document and on its overall prevalence in the corpus of documents. A term that occurs frequently in one document but rarely in others will have a high TF-IDF value. In contrast, terms that occur in many documents will have a low TF-IDF value, even if they occur frequently in a particular document.</p>
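          <p>As an illustration only, this weighting can be reproduced with scikit-learn's TfidfVectorizer; the two toy headlines below are hypothetical and not part of the study data, and scikit-learn applies a smoothed variant of the IDF formula given above.</p>
          <preformat>
# A minimal TF-IDF sketch (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer

headlines = [
    "city reports stable power supply after repairs",    # toy example
    "officials deny claims that the city was captured",  # toy example
]

vectorizer = TfidfVectorizer()                # tf(t, d) * idf(t, D), smoothed
matrix = vectorizer.fit_transform(headlines)  # rows: documents, columns: terms

# Rank the terms of the first headline by their TF-IDF weight.
terms = vectorizer.get_feature_names_out()
weights = matrix[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda pair: -pair[1])[:5])
          </preformat>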
        </sec>
        <sec id="sec-3-3-rake">
          <title>3.3.2. RAKE</title>
          <p>RAKE (Rapid Automatic Keyword Extraction) is a method for automatically extracting keywords from text based on the frequency of terms and their relationship with other terms in the document [<xref ref-type="bibr" rid="ref9">9</xref>]. It does not require prior training on data or linguistic resources (such as dictionaries or lexical bases), making it efficient and fast for identifying relevant keywords.</p>
          <p>The text is divided into phrases consisting of one or more adjacent words. For this, punctuation marks (.,!?), stop words (common words that do not carry a specific semantic load), and other symbols that cannot be part of candidate key phrases are used as delimiters.</p>
          <p>For each term in the candidate phrases, two main indicators are calculated:</p>
          <p>• Term frequency f(t): the number of occurrences of the term in the entire text.</p>
          <p>• Term degree d(t): the number of all terms with which this term appears in the same phrase. This can be thought of as the sum of the lengths of all phrases containing that term.</p>
          <p>The importance of a term is calculated as the ratio of its degree to its frequency: s(t) = \frac{d(t)}{f(t)}.</p>
          <p>This shows the relationship between the number of terms that appear together with a given term and the number of times it appears in the text. The higher the degree of a term relative to its frequency, the more important this word is for the formation of key phrases.</p>
          <p>For each candidate phrase, its total score is calculated as the sum of the scores of all terms included in the phrase: S(\mathrm{phrase}) = \sum_{t \in \mathrm{phrase}} s(t). Thus, phrases containing important terms receive high scores and are considered potential key phrases for the text.</p>
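          <p>The scoring just described can be sketched in a few lines of Python; this is a simplified illustration with a tiny, hypothetical stop-word list, not the implementation used in the experiments.</p>
          <preformat>
# A minimal RAKE sketch: phrases are split at stop words, each term gets
# s(t) = d(t) / f(t), and a phrase score is the sum over its terms.
import re
from collections import defaultdict

STOP_WORDS = {"the", "of", "a", "and", "is", "in", "to"}  # illustrative only

def rake_scores(text):
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:                      # split the word stream at stop words
        if w in STOP_WORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    freq, deg = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1                 # f(t): occurrences in the text
            deg[w] += len(phrase)        # d(t): co-occurring terms per phrase
    return {" ".join(p): sum(deg[w] / freq[w] for w in p) for p in phrases}

print(rake_scores("Rapid automatic extraction of keywords and key phrases in the text"))
          </preformat>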
        </sec>
        <sec id="sec-3-3-yake">
          <title>3.3.3. Yake!</title>
          <p>YAKE! (Yet Another Keyword Extractor) is a method for automatically extracting keywords from text, which is based on the statistical characteristics of terms in the text, without the need for training on external corpora or the use of linguistic resources [<xref ref-type="bibr" rid="ref10">10</xref>]. The main idea of YAKE! is that it uses several metrics to evaluate the importance of each term in the text, taking into account the local context, position, and prevalence of the term.</p>
        <p>First, the text is broken down into candidate terms and phrases using punctuation marks, stop
words, and other symbols that do not belong to key phrases.</p>
          <p>The frequency of the term in the text is calculated, that is, the number of times the term appears in the text. The more often a term appears in a document, the higher its frequency score: \mathrm{tf}(t) = f(t, d), where f(t, d) is the number of occurrences of term t in document d.</p>
        <p>YAKE! takes into account the position at which the term appears in the text. A term that appears
closer to the beginning of a text may have a different weight than one that appears closer to the end.</p>
        <p>YAKE! takes into account whether the term appears in different contexts (phrases). If a term
occurs in only a limited number of contexts, its importance diminishes.</p>
        <p>The relationship of the term with other terms in the candidate phrases is evaluated. If a term
appears frequently with the same other terms, its uniqueness is reduced.</p>
          <p>The final formula for evaluating a key phrase includes a weighted consideration of term frequency, position, prevalence, and uniqueness: S(t) = w_1 \cdot \mathrm{tf}(t, d) + w_2 \cdot \mathrm{pos}(t, d) + w_3 \cdot c(t), where \mathrm{tf}(t, d) is the frequency of the term t, \mathrm{pos}(t, d) is the position of the term t in the document, c(t) is the number of contexts in which the term appears, and w_1, w_2, w_3 are weighting factors that adjust the influence of each component.</p>
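          <p>In practice the method is available through the open-source yake package; the snippet below is a usage sketch on a hypothetical sentence, where lower scores indicate more important key phrases.</p>
          <preformat>
# A minimal YAKE! sketch (assumes `pip install yake`).
import yake

text = "Disinformation spreads quickly through anonymous channels during wartime."

# n: maximum n-gram length, top: number of key phrases to return.
extractor = yake.KeywordExtractor(lan="en", n=2, top=5)
for phrase, score in extractor.extract_keywords(text):
    print(f"{score:.4f}  {phrase}")      # lower score = more important
          </preformat>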
        <sec id="sec-3-3-1">
          <title>3.3.4. KeyBERT</title>
          <p>
            KeyBERT is a method for extracting keywords from text based on the use of transformer models
[
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], in particular the BERT (Bidirectional Encoder Representations from Transformers) model.
Unlike statistical methods such as TF-IDF or RAKE, KeyBERT uses semantic understanding of text
to identify key phrases and words, allowing it to take into account the context of terms in the text.
          </p>
          <p>First, the text is fed as input to the BERT model, which generates multidimensional vector
representations (embeddings) for each term in the text. These vectors reflect the semantic properties
of words and phrases, which allows us to evaluate their similarity.</p>
          <p>KeyBERT uses a vector representation of the entire document or text to extract keywords. The
vectors for each word or phrase in the text are then compared to the document vector to determine
the most relevant terms. If the phrase or word vector is closest to the document vector, that phrase
or word is considered a potential keyword.</p>
          <p>Mathematically, the similarity between a word vector w and a document vector d is calculated using cosine similarity: \mathrm{sim}(w, d) = \frac{w \cdot d}{\|w\| \, \|d\|}. This allows KeyBERT not only to evaluate terms by their frequency or statistical indicators, but also to take into account the semantic context of the text.</p>
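          <p>The ranking step can be illustrated with toy vectors; in the snippet below, the three-dimensional "embeddings" merely stand in for real BERT vectors, which KeyBERT obtains from a transformer model.</p>
          <preformat>
# A sketch of KeyBERT's cosine-ranking step with hypothetical embeddings.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_vec = np.array([0.9, 0.1, 0.3])           # embedding of the whole document
candidates = {                                # candidate phrase embeddings
    "fake news":  np.array([0.8, 0.2, 0.4]),
    "weather":    np.array([0.1, 0.9, 0.1]),
    "propaganda": np.array([0.7, 0.1, 0.5]),
}

ranked = sorted(candidates.items(),
                key=lambda kv: cosine(kv[1], doc_vec), reverse=True)
print(ranked[0][0])                           # candidate closest to the document
          </preformat>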
          <p>KeyBERT supports the extraction of multi-word phrases (n-grams), which allows it to detect not only individual words but also important phrases that may carry more meaning.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.5. LSA</title>
          <p>
            LSA (Latent Semantic Analysis) is a text processing method used to identify hidden semantic
relationships between terms in text documents [
            <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
            ]. The main goal of LSA is to reduce the
dimensionality of the space of documents and words to reveal relationships between terms that are
not obvious at the surface level.
          </p>
          <p>The text corpus is first transformed into a matrix of term-documents. Each row of the matrix
corresponds to a specific term, and each column corresponds to a document. The values in the cells
can be the number of occurrences of the terms in the documents or weighted values such as TF-IDF.
Thus, the initial representation of the text is high-dimensional and sparse.</p>
          <p>Formally, let A be an m×n term-document matrix, where m is the number of terms and n is the number of documents.</p>
          <p>LSA applies singular value decomposition (SVD) to reduce the dimensionality of the original matrix. SVD decomposes the matrix A into three matrices: A = U \Sigma V^{T}, where U is the orthogonal matrix of terms, \Sigma is the diagonal matrix of singular values, containing the singular values ranked from largest to smallest, and V^{T} is the orthogonal matrix of documents.</p>
          <p>Smaller singular values can be discarded, reducing the dimensionality of the space, leaving only the
most significant components.</p>
          <p>After performing SVD, only the largest singular values and their corresponding components in the U and V^{T} matrices are selected. This allows reducing the dimensionality of the matrix, leaving only the main latent semantic structures of the text.</p>
          <p>Thus, the reduced dimensionality matrix makes it possible to analyze documents and terms in a
new latent space, where terms with similar contexts will be close to each other.</p>
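          <p>As a sketch, this pipeline (a TF-IDF matrix reduced by truncated SVD) can be assembled with scikit-learn; the three-document corpus below is hypothetical.</p>
          <preformat>
# A minimal LSA sketch: A ≈ U_k Σ_k V_kᵀ via truncated SVD (assumes scikit-learn).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "elections and parliament dominate political news",
    "president addresses parliament on the economy",
    "markets react to economic policy and inflation",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # document-term matrix
svd = TruncatedSVD(n_components=2)             # keep the 2 largest singular values
latent = svd.fit_transform(tfidf)              # documents in the latent space
print(latent.shape)                            # (3, 2)
          </preformat>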
          <p>In the reduced latent-variable space, terms that often occur in similar contexts, but whose relationship may not be obvious on the surface, become closer together. This helps discover hidden semantic relationships between terms and documents, which allows LSA to work effectively in topic modeling, information retrieval, and text classification tasks.</p>
        </sec>
        <sec id="sec-3-3-lda">
          <title>3.3.6. LDA</title>
          <p>LDA (Latent Dirichlet Allocation) is a statistical topic modeling method used to automatically extract hidden topics from a large collection of text documents [<xref ref-type="bibr" rid="ref14">14</xref>]. The basic idea of LDA is that each document is treated as a mixture of several topics, and each topic is treated as a distribution over terms. The goal of the algorithm is to identify these topics and distribute terms among them.</p>
          <p>Topics in LDA are defined as distributions of terms. These are sets of words that have a high
probability of occurrence within a certain topic. For example, if the topic is about politics, then the
most likely words in it might be "elections", "president", "parliament", etc.</p>
          <p>Each document in the corpus is treated as a mixture of several topics. For example, an economic
policy paper can be a mixture of two topics: economics and politics.</p>
          <p>LDA assumes that the distribution of topics in each document follows a predefined distribution, such as the Dirichlet distribution. This is a parameterized distribution that allows controlling how densely topics are mixed; the hyperparameter \alpha determines this distribution of topics for documents: \theta_d \sim \mathrm{Dirichlet}(\alpha), where \theta_d is the topic distribution vector for document d, and \alpha is a hyperparameter that controls the density of topics in documents.</p>
          <p>Each topic also has its own term distribution, which is likewise modeled using the Dirichlet distribution: \varphi_k \sim \mathrm{Dirichlet}(\beta), where \varphi_k is the term distribution vector for topic k, and \beta is a hyperparameter controlling the density of terms in topics.</p>
          <p>For each term in a document, a topic is first selected based on the distribution of topics for that
document. A term is then selected based on the distribution of terms for the selected topic.</p>
          <p>z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \quad w_{d,n} \sim \mathrm{Multinomial}(\varphi_{z_{d,n}}), where z_{d,n} is the topic for the n-th term in the document d, w_{d,n} is the term itself, chosen according to the topic z_{d,n}, \theta_d is the distribution of topics for the document d, and \varphi_{z_{d,n}} is the distribution of terms for the selected topic z_{d,n}.</p>
          <p>The main task of LDA is to find the distribution of topics for each document and the distribution of terms for each topic. This is done using parameter estimation methods such as Expectation-Maximization (EM) or Variational Inference.</p>
          <p>The generative LDA model is described by the following probability function: P(w, z, \theta, \varphi \mid \alpha, \beta) = \prod_{d} P(\theta_d \mid \alpha) \prod_{k} P(\varphi_k \mid \beta) \prod_{n} P(z_{d,n} \mid \theta_d) \, P(w_{d,n} \mid \varphi_{z_{d,n}}), where w denotes the terms in the documents, z the topics for each term, \theta the distribution of topics for each document, \varphi the distribution of terms for each topic, and \alpha, \beta the hyperparameters of the Dirichlet distributions.</p>
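          <p>A compact sketch of this model in practice, using scikit-learn; the corpus is hypothetical, and doc_topic_prior and topic_word_prior correspond to the hyperparameters \alpha and \beta above.</p>
          <preformat>
# A minimal LDA sketch (assumes scikit-learn).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "elections president parliament vote",
    "economy inflation markets budget",
    "president signs budget after parliament vote",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2,        # number of topics K
                                doc_topic_prior=0.5,   # alpha
                                topic_word_prior=0.1,  # beta
                                random_state=0)
doc_topics = lda.fit_transform(counts)  # theta_d: topic mixture per document
print(doc_topics.round(2))
          </preformat>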
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.3.7. TextRank</title>
        <sec id="sec-3-4-1">
          <title>TextRank</title>
          <p>is an algorithm for extracting keywords and automatically constructing text
annotations, based on graph ranking methods [15]. TextRank is an adaptation of the PageRank
algorithm used by search engines to rank web pages. In the case of TextRank, instead of web pages,
the algorithm works on words or sentences of text.</p>
          <p>The text is represented as a graph, where the vertices are words or sentences, and the edges are
set between those vertices that are "close" in context. For keyword extraction tasks, the vertices are
individual words, for summarization tasks, whole sentences.</p>
        <p>To extract keywords, a term is considered a graph vertex, and an edge connects two terms if they appear within the same window of words (the window usually has a fixed size, such as 2 or 3). For each vertex V_i, the weight of its connections with other vertices can be described through a similarity matrix, which takes into account how frequently words occur together within the window.</p>
        <p>For summarization problems, vertices are whole sentences, and edges connect sentences based on semantic similarity, which can be calculated, for example, using cosine similarity between vector representations of sentences.</p>
        <p>The rank of each vertex is calculated iteratively until convergence is reached: WS(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{WS(V_j)}{|Out(V_j)|}, where WS(V_i) is the rank of vertex V_i, d is the attenuation coefficient (usually d = 0.85), In(V_i) is the set of vertices that have connections with V_i, and |Out(V_j)| is the number of outgoing edges from the vertex V_j.</p>
        <p>Keywords: after the vertices (words) are ranked, the words with the highest rank are chosen as keywords. Text summarization: sentences with the highest ranks are selected to construct the text annotation.</p>
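        <p>The iteration above can be sketched directly on a word co-occurrence graph; this simplified illustration uses a window of 2 and no part-of-speech or stop-word filtering, which real implementations usually add.</p>
        <preformat>
# A minimal TextRank sketch over an undirected co-occurrence graph.
import re
from collections import defaultdict

def textrank(words, window=2, d=0.85, iters=50):
    neighbors = defaultdict(set)
    for i, w in enumerate(words):          # connect words within the window
        for v in words[i + 1:i + window + 1]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)
    ws = {w: 1.0 for w in neighbors}       # initial ranks
    for _ in range(iters):                 # WS(Vi) = (1-d) + d * sum WS(Vj)/|Out(Vj)|
        ws = {w: (1 - d) + d * sum(ws[v] / len(neighbors[v])
                                   for v in neighbors[w])
              for w in neighbors}
    return sorted(ws.items(), key=lambda kv: -kv[1])

tokens = re.findall(r"[a-z]+", "fake news spreads fast and fake stories mislead readers")
print(textrank(tokens)[:3])                # highest-ranked candidate keywords
        </preformat>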
      </sec>
      <sec id="sec-3-5">
        <title>3.4. Classifier</title>
        <p>Based on the authors' previous publication [<xref ref-type="bibr" rid="ref16">16</xref>], a comparative analysis of various machine learning methods for the classification of fake and true news was conducted. The best results were achieved using the Random Forest classifier.</p>
        <p>Random Forest is an ensemble method that uses multiple decision trees to improve classification performance. The model consists of a set of decision trees \{T_1, T_2, \ldots, T_N\}, each of which is independently built on random subsets of the data and features. The final classification is carried out by voting (choosing the class that received the most votes).</p>
        <p>The main steps of Random Forest are as follows.</p>
        <p>Creating decision trees: for each tree, a subset of the training data is randomly selected with replacement (bootstrapping); a random subset of features is selected to construct each node of the tree; tree construction continues until full classification or until the maximum depth limit is reached.</p>
        <p>Voting: for each new example x, each tree T_i in the ensemble predicts a class y_i. The final prediction y is made by majority vote: y = \arg\max_{c} \sum_{i=1}^{N} \mathbb{1}\{T_i(x) = c\}, where N is the number of trees, T_i(x) is the prediction of tree T_i for the feature vector x, and \mathbb{1}\{T_i(x) = c\} is an indicator that tree T_i predicts class c.</p>
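        <p>A compact sketch of this classification stage (TF-IDF features feeding a Random Forest); the four labeled headlines are hypothetical and only illustrate the interface, not the study datasets.</p>
        <preformat>
# A minimal Random Forest sketch (assumes scikit-learn).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

train_texts = [
    "officials confirm humanitarian corridor opened",          # toy data
    "city fully captured, resistance has ended, sources say",  # toy data
    "power restored to most districts after strikes",          # toy data
    "secret plan to surrender the region leaked",              # toy data
]
train_labels = ["True", "Fake", "True", "Fake"]

# N trees vote; the majority class becomes the prediction.
model = make_pipeline(TfidfVectorizer(),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(train_texts, train_labels)
print(model.predict(["officials deny that the city was captured"]))
        </preformat>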
      </sec>
      <sec id="sec-3-6">
        <title>3.5. Evaluation metrics</title>
        <p>Metrics such as precision, recall, F1-score (F-measure), and accuracy are used to evaluate the effectiveness of classification models. Each of these metrics has its own mathematical definition and is used to evaluate a different aspect of model performance [<xref ref-type="bibr" rid="ref17">17</xref>].</p>
        <p>Precision shows the proportion of correct positive predictions among all positive-class predictions. This metric is responsible for the accuracy of predictions regarding positive cases (in our case, for example, fake news). Mathematically, it is represented as \mathrm{Precision} = \frac{TP}{TP + FP}, where TP is the number of true positives and FP the number of false positives.</p>
        <p>High precision indicates that the model rarely errs by classifying true news as fake.</p>
        <p>Recall measures the proportion of correctly predicted positive cases among all actual positive cases. This metric is important for evaluating how well the model detects fake news among all available fake news. The formula for its calculation is \mathrm{Recall} = \frac{TP}{TP + FN}, where FN is the number of false negatives.</p>
        <p>High recall means that the model finds the majority of all fake news, but may be wrong in classifying true news as fake.</p>
        <p>The F1-score is the harmonic mean of precision and recall and is used to balance these two metrics, especially when it is important not only to predict the correct positive cases but also to reduce the number of false positives and false negatives. The mathematical definition of the F1-score is F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.</p>
        <p>This metric is particularly useful in unbalanced data situations where precision and recall may
have different values.</p>
        <p>Accuracy is an overall measure of model performance, showing the proportion of correct predictions among all predicted cases. It takes into account both correctly predicted positive and negative cases. Mathematically, classification accuracy is defined as \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, where TN is the number of true negatives.</p>
        <p>Accuracy is a useful metric for balanced data, but can be misleading in unbalanced class settings
because it can show high accuracy even in cases where one class predominates.</p>
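        <p>For reference, all four metrics can be computed with scikit-learn; the label vectors below are hypothetical, with "Fake" treated as the positive class.</p>
        <preformat>
# A minimal metrics sketch (assumes scikit-learn).
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = ["Fake", "True", "Fake", "True", "Fake", "True"]
y_pred = ["Fake", "True", "True", "True", "Fake", "Fake"]

print("precision:", precision_score(y_true, y_pred, pos_label="Fake"))
print("recall:   ", recall_score(y_true, y_pred, pos_label="Fake"))
print("F1-score: ", f1_score(y_true, y_pred, pos_label="Fake"))
print("accuracy: ", accuracy_score(y_true, y_pred))
        </preformat>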
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section presents the results of a comparative analysis of the effectiveness of different methods
of classifying news according to "Fake" or "True". The studied models use different keyword selection
approaches, such as TF-IDF, RAKE, Yake!, LSA, LDA, and TextRank. The aim of the analysis was to
evaluate the accuracy of the models using precision, recall and F1-score metrics to determine the
best algorithms for detecting fake news in textual data.</p>
      <p>Analysis of the classification report (Figure 3) for the model based on TF-IDF shows an overall
accuracy of 0.88, or 88%. For the "Fake" class, the model achieved a precision of 0.90 but a lower recall
of 0.81, which indicates that the model misses some fake news. In contrast, for the "True" class, the
precision and recall are much closer: the precision is 0.87 and the recall is 0.94, which means that
the model is good at recognizing true news. The final F1-score values for the "Fake" and "True"
classes are 0.86 and 0.90, respectively. The macro-average and weighted average F1-score is 0.88,
indicating a balanced performance of the model between classes, but with a slight advantage in
detecting true news.</p>
      <p>The classification model based on RAKE (Figure 4) showed an overall accuracy of 0.88, or 88%. For
the "Fake" class, the precision is 0.90, but the recall is slightly lower at 0.80, resulting in an
F1-score of 0.85. The "True" class demonstrated a stable precision of 0.87 and a high recall of
0.94, resulting in an F1-score of 0.90. The macro-average and weighted average values of the F1-score are
0.87 and 0.88, respectively, which indicates stable performance of the model for both classes.
Although the model shows balanced results, its effectiveness in detecting fake news is slightly
inferior to its results for true news.</p>
      <p>The Yake! model (Figure 5) demonstrated an accuracy of 0.78, or 78%, which is lower compared to
the other models. For the "Fake" class, the precision was 0.73 and the recall was 0.76, giving an F1-score
of 0.74. For the "True" class, the precision was slightly higher at 0.82 and the recall at 0.79, resulting
in an F1-score of 0.81. The overall results show that the Yake! method performs noticeably worse, especially
for the "Fake" class. The macro-average and weighted average F1-score is 0.78, which
indicates a less balanced performance compared to other methods.</p>
      <p>The LSA model (Figure 6) demonstrated an accuracy of 0.85, or 85%. For the "Fake" class,
the precision was 0.90, but the recall was 0.72, resulting in an F1-score of 0.80. For the "True"
class, the indicators are noticeably higher: precision 0.82, recall 0.94, and F1-score
0.88. Although the overall accuracy is high, the relatively low recall for fake news suggests
that the model may be missing some fake news. The macro-average F1-score is 0.84, and the weighted
average is 0.85, indicating a good balance between classes.</p>
      <p>The classification model based on LDA (Latent Dirichlet Allocation) showed (Figure 7) an accuracy
of 0.67, or 67%, which is significantly lower compared to the other models. For the "Fake"
class, the precision is 0.61 and the recall is 0.56, giving an F1-score of 0.59. For the "True"
class, the precision is higher at 0.70 and the recall at 0.74, which provides an F1-score of
0.72. The macro-average and weighted average values of the F1-score are 0.65 and 0.66,
respectively, which indicates an imbalance in the work of the model, in particular a worse result
for the "Fake" class. The overall results show that the LDA model has difficulty in effectively
classifying fake news.</p>
      <p>The classification model based on TextRank (Figure 8) showed an overall accuracy of 0.77, or 77%.
For the "Fake" class, the precision is 0.72 and the recall is 0.74, giving an F1-score of 0.73. The
"True" class performed better: precision 0.81, recall 0.79, and F1-score 0.80. The
macro-average and weighted average F1-score values are 0.77, which indicates a fairly balanced
performance of the model for both classes, although the precision for "Fake" news is lower than for
"True".</p>
      <p>Comparing all classification methods across the different metrics, TF-IDF and RAKE
were the most effective for news classification, with accuracies of 0.8843 and 0.8794, respectively.
Both methods demonstrated balanced performance for the "Fake" and "True" classes. TF-IDF has a
high precision (0.90) for "Fake" news and a very high recall (0.94) for "True" news, resulting in an
overall F1-score of 0.88. RAKE, although slightly inferior to TF-IDF in recall for "Fake" news
(0.80), still provides stable performance, with an F1-score of 0.85 for "Fake" news and 0.90 for
"True". This makes these two methods the best for detecting fake news in textual data.</p>
      <p>The results of the study showed that the TF-IDF and RAKE methods demonstrated the highest
efficiency in news classification, with an accuracy of about 0.88. These methods showed balanced
performance for both classes ("Fake" and "True"), especially in terms of precision and F1-score. Other
methods, such as LSA, Yake!, TextRank, and LDA, performed worse, particularly in the classification
of fake news, which may be due to their lower ability to accurately identify the features of "Fake"
news. The overall analysis shows that the TF-IDF and RAKE approaches are the most suitable for
more accurate disinformation detection.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>Within the framework of this study, a comparative analysis of keyword selection methods for
classifying news into fake and true was conducted. TF-IDF, RAKE, Yake!, LSA, LDA and TextRank
methods were used, which were applied to two sets of news data containing headlines from
Ukrainian and Russian Telegram channels. The main objective of the study was to compare the
performance of these methods using standard classification evaluation metrics such as precision,
recall, F1-score and accuracy. The process included pre-processing the text, extracting keywords,
building machine learning models and evaluating them based on the results obtained.</p>
      <p>Quantitative results showed that the TF-IDF and RAKE methods were the leaders among all
the tested approaches. TF-IDF demonstrated the highest accuracy of 0.8843, with a high precision
for fake news (0.90) and a very high recall for true news (0.94), providing an F1-score of 0.88. The
RAKE method performed slightly lower, with an overall accuracy of 0.8794 and an F1-score of 0.85
for fake news and 0.90 for true news. Other methods, such as LSA (0.8495), Yake! (0.7805), TextRank
(0.7711), and LDA (0.6673), showed worse results. Particularly low results were obtained in the
classification of fake news, which indicates the difficulty these approaches have in identifying false
information. The proposed approach can be used to detect fake information, in particular during an
information war.</p>
      <p>In the future, it is planned to develop a tool for detecting fake news, which will be based on the
results of this study. The tool will integrate the most effective keyword selection methods, including
TF-IDF and RAKE, to analyze textual data from news sources. The main goal will be to create a
system capable of quickly and accurately identifying fake information based on the latest machine
learning methods.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>[14] D. M. Blei, A. Y. Ng, M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning</p>
        <p>Research 3 (2003) 993-1022.
[15] M. Zhang, X. Li, S. Yue, L. Yang. An empirical study of TextRank for keyword extraction. IEEE</p>
        <p>Access 8 (2020) 178849-178858. https://doi.org/10.1109/ACCESS.2020.3027567
[16] K. Lipianina-Honcharenko, M. Soia, K. Yurkiv, A. Evaluation of the effectiveness of
machine learning methods for detecting disinformation in Ukrainian text data. In: Proceedings
of the Seventh International Workshop on Computer Modeling and Intelligent Systems
(CMIS2024), Zaporizhzhia, Ukraine, May 3, 2024. https://ceur-ws.org/Vol-3702/paper9.pdf
[17] K. Lipianina-Honcharenko, C. Wolff, A. Sachenko, I. Kit, D. Zahorodnia. Intelligent method for
classifying the level of anthropogenic disasters. Big Data and Cognitive Computing 7 (2023) 157.
https://doi.org/10.3390/bdcc7030157</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Khaleel Ur et al. Comparative study of fake news detection between machine learning and deep learning approaches. In: Proceedings of the 1st National Conference on Applications of Soft Computing Techniques in Engineering (NCASCTE-2022), Hyderabad, India, 2023, pp. 50-56. https://www.doi.org/10.56726/IRJMETS-NCASCTE202230</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Chauhan, H. Palivela. Optimization and improvement of fake news detection using deep learning approaches for societal benefit. International Journal of Information Management Data Insights 1 (2021) 100051. https://doi.org/10.1016/j.jjimei.2021.100051</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] E. A. Johnson et al. An experimental comparison of classification tools for fake news detection. International Journal of Advanced Research in Computer and Communication Engineering 10 (2021) 135-141. https://doi.org/10.17148/IJARCCE.2021.10820</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Garg, D. K. Sharma. Linguistic features based framework for automatic fake news detection. Computers &amp; Industrial Engineering 172 (2022) 108432. https://doi.org/10.1016/j.cie.2022.108432</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] M. C. de Souza, M. P. S. Gôlo, A. M. G. Jorge, E. C. F. de Amorim, R. N. T. Campos, R. M. Marcacini, S. O. Rezende. Keywords attention for fake news detection using few positive labels. Information Sciences 663 (2024) 120300. https://doi.org/10.1016/j.ins.2024.120300</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>[6] Ukrainian news</article-title>
          . URL: https://www.kaggle.com/datasets/zepopo/ukrainian
          <article-title>-fake-and-true-news</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sophia</given-names>
            <surname>Matskovych</surname>
          </string-name>
          .
          <article-title>Fake News UA</article-title>
          . URL: https://www.kaggle.com/datasets/sophiamatskovych/fake-news-ua
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          .
          <article-title>An information-theoretic perspective of TF IDF measures</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>39</volume>
          (
          <year>2003</year>
          )
          <fpage>45</fpage>
          -
          <lpage>65</lpage>
          . https://doi.org/10.1016/S0306-
          <volume>4573</volume>
          (
          <issue>02</issue>
          )
          <fpage>00021</fpage>
          -
          <lpage>3</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cowley</surname>
          </string-name>
          .
          <article-title>Automatic keyword extraction from individual documents. Text mining: applications and theory, Editor(s): Michael W</article-title>
          . Berry, Jacob
          <string-name>
            <surname>Kogan</surname>
          </string-name>
          (
          <year>2010</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . https://doi.org/10.1002/9780470689646.ch1
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mangaravite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pasquali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jorge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nunes</surname>
          </string-name>
          , &amp; A. Jatowt. YAKE!
          <article-title>Keyword extraction from single documents using multiple local features</article-title>
          .
          <source>Information Sciences</source>
          <volume>509</volume>
          (
          <year>2020</year>
          )
          <fpage>257</fpage>
          -
          <lpage>289</lpage>
          . https://doi.org/10.1016/j.ins.
          <year>2019</year>
          .
          <volume>09</volume>
          .013
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>P. A. A. B. M. Ogrodniczuk</surname>
          </string-name>
          .
          <article-title>Keyword extraction from short texts with a text-to-text transfer transformer</article-title>
          .
          <source>In Asian Conference on Intelligent Information and Database Systems</source>
          . Singapore: Springer Nature Singapore.
          <year>2022</year>
          , pp.
          <fpage>530</fpage>
          -
          <lpage>542</lpage>
          . https://doi.org/10.1007/
          <fpage>978</fpage>
          -981-19-8234-7_
          <fpage>41</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>T. K. Landauer</surname>
            ,
            <given-names>P. W.</given-names>
          </string-name>
          <string-name>
            <surname>Foltz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Laham</surname>
          </string-name>
          .
          <article-title>An introduction to latent semantic analysis</article-title>
          .
          <source>Discourse processes 25</source>
          (
          <year>1998</year>
          )
          <fpage>259</fpage>
          -
          <lpage>284</lpage>
          . https://doi.org/10.1080/01638539809545028
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] K. Lipianina-Honcharenko, T. Lendiuk, A. Sachenko, O. Osolinskyi, D. Zahorodnia, M. Komar. An intelligent method for forming the advertising content of higher education institutions based on semantic analysis. In: International Conference on Information and Communication Technologies in Education, Research, and Industrial Applications. Cham: Springer International Publishing, September 2021, pp. 169-182. https://doi.org/10.1007/978-3-031-14841-5_11</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] D. M. Blei, A. Y. Ng, M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003) 993-1022.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Zhang, X. Li, S. Yue, L. Yang. An empirical study of TextRank for keyword extraction. IEEE Access 8 (2020) 178849-178858. https://doi.org/10.1109/ACCESS.2020.3027567</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] K. Lipianina-Honcharenko, M. Soia, K. Yurkiv, et al. Evaluation of the effectiveness of machine learning methods for detecting disinformation in Ukrainian text data. In: Proceedings of the Seventh International Workshop on Computer Modeling and Intelligent Systems (CMIS-2024), Zaporizhzhia, Ukraine, May 3, 2024. https://ceur-ws.org/Vol-3702/paper9.pdf</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] K. Lipianina-Honcharenko, C. Wolff, A. Sachenko, I. Kit, D. Zahorodnia. Intelligent method for classifying the level of anthropogenic disasters. Big Data and Cognitive Computing 7 (2023) 157. https://doi.org/10.3390/bdcc7030157</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>