<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hybrid method for building a balanced Ukrainian-language news corpus for fake news detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuliia Sobchuk</string-name>
          <email>yuliia.sob4uk@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sviatoslav Krushelnytskyi</string-name>
          <email>sviatoslav.kru@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khrystyna Lipianina-Honcharenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrii Ivasechko</string-name>
          <email>andrewivasechko@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetiana Drakokhrust</string-name>
          <email>t.drakokhrust@wunu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Information Technologies, West Ukrainian National University</institution>
          ,
          <addr-line>46000 Ternopil</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>This paper presents a method for constructing a balanced Ukrainian-language news corpus for fake-news detection that combines LLM-based controlled generation with editorial verification. The truthful subset is collected from authoritative media using topic- and year-stratified sampling (2022-2025), while fake examples are produced via a parameterized LLM prompt controlling tone, style, manipulation types, and topics. The pipeline comprises multi-stage normalization, stop-word removal, lemmatization (Stanza), language identification, near-duplicate filtering (hybrid cosine/Jaccard-trigram similarity), and human moderation of borderline cases. The resulting corpus contains ~40k texts (~20k “Trusted” and ~20k “Fake”) with an average length of ~250 tokens and a bimodal length distribution. Reproducibility is ensured by publishing data schemas and fixed 80/20 splits. A BiLSTM baseline with FastText (300d) achieves 99.25% accuracy, 0.9925 macro-F1, and 0.985 MCC, with false-positive/false-negative rates ≤0.9%. These results indicate strong class separability and validate the corpus as a benchmark for future studies, including transformer-based models, ablation of synthetic components, robustness assessment, and probability calibration.</p>
      </abstract>
      <kwd-group>
        <kwd>fake news detection</kwd>
        <kwd>Ukrainian corpus</kwd>
        <kwd>large language models</kwd>
        <kwd>LLM-based generation</kwd>
        <kwd>editorial verification</kwd>
        <kwd>text preprocessing</kwd>
        <kwd>BiLSTM</kwd>
        <kwd>FastText</kwd>
        <kwd>reproducibility</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The pipeline involves multi-stage normalization and linguistic processing of texts, language
verification, removal of duplicates and stylistic anomalies, as well as manual moderation of
borderline cases. To ensure reproducibility, stochastic parameters are fixed, data schemas are
published, and train–validation split lists are provided. As a result, a balanced corpus of
approximately 40,000 news texts was obtained (around 20,000 in each of the “Trusted” and “Fake”
classes), with representative thematic and stylistic coverage.</p>
      <p>This paper presents a method for constructing a balanced corpus of Ukrainian-language news
for fake news detection that combines controlled LLM generation with editorial verification.
Section 2 summarizes related work; Section 3 describes the methodology and data collection
pipeline, including stratified selection of reliable materials, parameterized generation of synthetic
examples, preprocessing, and quality control; Section 4 provides the corpus composition and
statistics, as well as baseline benchmarks (including BiLSTM+FastText); Section 5 presents
experimental results with a confusion matrix and analysis of training dynamics; Section 6
formulates conclusions, highlights scientific significance, and outlines directions for future work,
including testing transformer models, conducting ablation studies, assessing robustness, and
addressing ethical considerations in the use of generative AI.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Early attempts at fake news detection were primarily based on classical machine learning
algorithms with manual feature engineering. In particular, the study presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] highlights the
use of methods such as Support Vector Machines (SVM), Naive Bayes classifiers (NB), and Random
Forests (RF). The effectiveness of these algorithms largely depended on the quality of the selected
linguistic and meta-feature characteristics.
      </p>
      <p>
        The research conducted in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focuses on classifying fake news in social media based on textual
content. This work applied four traditional text feature extraction methods (TF-IDF, Count Vector,
Character-level Vector, N-Gram Level Vector) and ten different machine learning and deep
learning classifiers. The obtained results demonstrated that textual fake news can be effectively
classified, with classification accuracy ranging from 81% to 100% depending on the classifier used,
with convolutional neural networks (CNN) showing particularly high effectiveness.
      </p>
      <p>With the rise of deep learning, fake news detection methods have gained new momentum.
Recurrent neural networks, particularly bidirectional LSTMs (Bi-LSTMs), enable models to learn
context in both forward and backward directions.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a combination of BERT and LSTM was used for fake news classification based on
headlines. The model was trained on the FakeNewsNet dataset (PolitiFact, GossipCop), achieving
accuracy improvements of 2.5% and 1.1%, respectively, compared to the baseline BERT model,
confirming the effectiveness of combining transformers and recurrent networks.
      </p>
      <p>
        The study in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed a fake news detection framework that combines analysis of news
content and social context. The model is built on a Transformer architecture with an encoder for
feature extraction and a decoder for predicting the subsequent behavior of the news. To address the
lack of labeled data, the authors applied a custom automatic labeling technique. Experiments with
real-world data showed that the model provides higher accuracy in early detection (within minutes
of dissemination) compared to baseline methods.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors explored the potential of fine-tuning the modern language model GPT-3 for
the task of fake news detection. The model was adapted on the ISOT dataset and demonstrated
high effectiveness, achieving an accuracy of 99.90%, precision of 99.81%, recall of 99.99%, and an F1-score of 99.90%, significantly outperforming existing solutions. These results confirm the promise
of using GPT-3 to combat disinformation in social media and news outlets.
      </p>
      <p>
        The study in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presents the Generative Bidirectional Encoder Representations from
Transformers (GBERT) framework, which combines BERT’s deep contextual understanding with
GPT’s generative capabilities for fake news classification. Both models were fine-tuned on two
real-world benchmark datasets, achieving an accuracy of 95.30%, precision of 95.13%, recall of
97.35%, and an F1-score of 96.23%. The obtained results demonstrate the high efficiency of GBERT
and the potential of this approach in countering the spread of disinformation in the digital
environment.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a review of machine learning algorithms and datasets used for fake news detection was
conducted. Among the most effective models identified were the Stacking Method with 99.9%
accuracy, BiRNN, and CNN — both at 99.8%. Most studies relied on data from controlled
environments (e.g., Kaggle) or from sources without real-time updates, which limits their practical
applicability in social media, where disinformation spreads most actively. The most frequently used
datasets included Kaggle, Weibo, FNC-1, COVID-19 Fake News, and Twitter. The authors
emphasize the need to expand topics beyond political news and to apply hybrid methods in future
research.
      </p>
      <p>
        The study in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] presents the OLTW-TEC method (Online Learning with Sliding Windows for
Text Classifier Ensembles), developed for detecting disinformation in the Ukrainian-language
information space. The approach combines an ensemble of classifiers with a “sliding window”
mechanism for dynamically updating the model to incorporate new data, thereby increasing its
adaptability to changing fake news dissemination tactics. The method was tested on a specially
constructed dataset of authentic and fake news, achieving an accuracy of 93%. The results confirm
the effectiveness of OLTW-TEC and its suitability for operating under information warfare
conditions, as well as its potential for adaptation to other languages and regions.
      </p>
      <p>
        A comparative study [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] showed that RNN, LSTM, and Bi-LSTM models achieve similar results
with around 91% accuracy, although LSTM outperformed in recall and RNN in precision. This
highlights the importance of selecting an architecture that aligns with specific objectives.
      </p>
      <p>
        Further research has focused on transformer architectures. For instance, the authors of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
evaluated the performance of BERT, CNN, Bi-LSTM, and their ensemble combination. The latter
achieved the highest accuracy — 98.24% — demonstrating the effectiveness of hybrid solutions.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], transformers (BERT, RoBERTa, GPT-2) were compared with graph neural networks
(GNN). Transformers showed significantly better results: RoBERTa reached 99.99% on ISOT, and
GPT-2 achieved 99.72% on WELFake, highlighting their ability to work with contextually rich data.
      </p>
      <p>
        The study in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] analyzed the effectiveness of various machine learning methods for detecting
disinformation in Ukrainian-language news collected during the military conflict. Evaluated models
included logistic regression, SVM, random forest, gradient boosting, KNN, decision trees, XGBoost,
and AdaBoost. The random forest demonstrated the best results. The authors emphasize the
importance of adapting models to the specifics of the task and the need for further research in this
area.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] argue that combining transformers with text summarization further
increases accuracy. RoBERTa fine-tuned on summarized content achieved 98.39%, which is among
the highest metrics among modern models.
      </p>
      <p>
        Special attention should be paid to hybrid models that integrate Word2Vec vectors with CNN
and LSTM. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the authors focused on fake news classification using a combination of machine
learning (ML) and natural language processing (NLP) methods based on textual content. They
compared several modern ML models and neural networks. Experiments showed that all traditional
ML models achieved over 85% accuracy, while neural networks outperformed them, reaching over
90% accuracy.
      </p>
      <p>
        Research (Table 1) on fake news detection has evolved from classical machine learning
approaches with manual feature engineering to deep and transformer-based architectures (see [
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref13 ref14 ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">1–14</xref>
        ]). In the early stages, SVM, NB, and RF were applied with textual representations such as TF-IDF
or Bag-of-Words, where accuracy largely depended on feature selection and data domain.
Subsequent studies focused on neural models: RNN/LSTM/Bi-LSTM achieved results around ≈91%
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], while hybrids combining CNN with vector representations (e.g., FastText) improved
performance to 0.99 in Accuracy and 0.97–0.99 in F1-score [
        <xref ref-type="bibr" rid="ref13 ref14">13–14</xref>
        ]. Transformers, particularly
BERT/RoBERTa and their ensembles, reached 98.24% and higher [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], with some well-known
datasets (ISOT, WELFake) reporting values up to 99.99% [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, these metrics vary
significantly across datasets and experimental setups, complicating the generalization of
conclusions and accurate comparison of approaches.
      </p>
      <p>Existing studies indicate progress in methods but reveal several research gaps, particularly for
the Ukrainian-language segment: (i) a lack of large public annotated corpora with transparent
preparation pipelines, fixed splits, and detailed documentation; (ii) limited evaluation under
temporal and domain shift scenarios, with insufficient attention to robustness against
paraphrasing/adversarial attacks and probability calibration; (iii) inadequate analysis of bias across
topics/genres, as well as the impact of synthetic examples on model generalization and stability;
(iv) incomplete reproducibility due to the absence of publicly available code, fixed seeds, and
detailed preprocessing protocols. The scientific significance of this research lies in addressing these
gaps by creating a reproducible Ukrainian-language corpus with balanced classes, a clearly
specified construction and quality control methodology, and establishing benchmark standards that
allow accurate comparison of modern architectures and investigation of their robustness under
realistic conditions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>The following outlines the sequential stages of constructing a balanced corpus of Ukrainian-language news for fake news detection, combining verified texts from reputable media with controlled LLM-generated examples (stages 1–6, Fig. 1).</p>
      <sec id="sec-3-1">
        <title>Stage 1. Corpus formalization.</title>
        <p>Let C = T ∪ F be the final corpus of Ukrainian-language news for the two-class fake news classification task, where T is the set of trusted texts and F is the set of fake texts. Each document is a pair (x_i, y_i), where x_i ∈ Σ* is the text and y_i ∈ {0, 1} is the class label (y = 1 for “Fake”, y = 0 for “Trusted”). The corpus is balanced: |T| = |F| = 20,000, so the prior class probabilities are π₀ = π₁ = 1/2.</p>
        <p>Let the empirical distribution of text lengths in tokens be denoted p̂_L(l). Following preprocessing, the average length is μ̂_L = E_{p̂_L}[L] ≈ 250 tokens.</p>
        <p>Stage 2. Selection of trusted texts.</p>
        <p>Let S = {TCH.ua, Bihus.info, BBC News Україна, …} be the set of sources. Over the time interval t ∈ [2022, 2025], the initial sample is formed as</p>
        <p>U_T = {x : x was published on s ∈ S, t ∈ [2022, 2025]}.</p>
        <p>To avoid dominance of individual sites or topics, stratified sampling is applied by topic τ ∈ {politics, economy, society, defense, health, …} and publication year. Within each stratum (τ, year), random sampling without replacement is performed with an upper limit m_s of documents per source s (anti-dominance cap). Each candidate undergoes manual verification of editorial standards and fact-checking; acceptance is denoted by the predicate R_T(x) ∈ {0, 1}. The final set is:</p>
        <p>T = {x ∈ U_T : R_T(x) = 1}.</p>
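        <p>As an illustrative sketch only (the authors' collection code is not published here), the stratified sampling with a per-source anti-dominance cap m_s can be implemented as follows; the document fields (topic, year, source), the function name, and the fixed seed are our assumptions:</p>

```python
import random
from collections import defaultdict

def stratified_sample(candidates, per_stratum, per_source_cap, seed=42):
    """Sample without replacement within each (topic, year) stratum,
    keeping at most per_source_cap documents from any single source."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    strata = defaultdict(list)
    for doc in candidates:
        strata[(doc["topic"], doc["year"])].append(doc)

    selected = []
    for key in sorted(strata):                # deterministic stratum order
        pool = strata[key][:]
        rng.shuffle(pool)                     # random order within the stratum
        taken, per_source = [], defaultdict(int)
        for doc in pool:
            if len(taken) == per_stratum:
                break
            if per_source[doc["source"]] < per_source_cap:  # anti-dominance cap
                per_source[doc["source"]] += 1
                taken.append(doc)
        selected.extend(taken)
    return selected
```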
      </sec>
      <sec id="sec-3-2">
        <title>Stage 3. Fake text generation.</title>
        <p>Fake texts are generated by a large language model G via LangChain using a parameterized prompt template τ(θ). The control vector is</p>
        <p>θ = (tone, style, type, topic),
where
− tone ∈ {neutral, alarming, reassuring};
− style ∈ {analytical, populist, ironic, factual};
− type ∈ {disinformation, manipulation, emotional influence, propaganda};
− topic ∈ {politics, economy, education, defense, infrastructure}.</p>
        <p>To ensure diversity, θ is covered almost uniformly (Latin square / combinatorial sweep), and G is instructed to use real facts and persons in a fictional context while avoiding fantastical events or clichés. Generation occurs as</p>
        <p>x = G(τ(θ), z),
where z is the stochastic seed/model temperature. Each synthetic text undergoes both automatic and manual quality filtering.</p>
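        <p>The combinatorial sweep over θ can be sketched as below. The prompt wording and names here are hypothetical, since the paper's actual template is not reproduced; only the parameter values are taken from the text:</p>

```python
from itertools import product

# Parameter grid from the control vector θ = (tone, style, type, topic).
TONES = ["neutral", "alarming", "reassuring"]
STYLES = ["analytical", "populist", "ironic", "factual"]
TYPES = ["disinformation", "manipulation", "emotional influence", "propaganda"]
TOPICS = ["politics", "economy", "education", "defense", "infrastructure"]

# Hypothetical template; the published method passes such a template to the
# LLM via LangChain, with instructions to avoid fantastical events/cliches.
TEMPLATE = (
    "Write a {tone}, {style} Ukrainian news item ({type_}) on {topic}. "
    "Use real facts and persons in a fictional context; avoid fantastical "
    "events and obvious cliches."
)

def prompt_sweep():
    """Enumerate one parameterized prompt per combination of θ (full sweep)."""
    for tone, style, type_, topic in product(TONES, STYLES, TYPES, TOPICS):
        yield TEMPLATE.format(tone=tone, style=style, type_=type_, topic=topic)

prompts = list(prompt_sweep())  # 3 * 4 * 4 * 5 = 240 combinations
```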
        <p>Stage 4. Preprocessing.</p>
        <p>Let</p>
        <p>Φ = Λ ∘ Ψ ∘ Norm
be the preprocessing pipeline, where</p>
        <p>Norm(x): lowercase conversion and removal of URLs, hashtags, special characters, and numeric markers;
Ψ(x): stop-word removal;
Λ(x): lemmatization (Stanza for Ukrainian).</p>
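        <p>A minimal sketch of Φ = Λ ∘ Ψ ∘ Norm follows. The stop-word list is an illustrative subset, and the lemmatization step Λ (performed with Stanza's Ukrainian pipeline in the paper) is deliberately omitted to keep the sketch dependency-free:</p>

```python
import re

STOPWORDS = {"і", "та", "в", "на", "що", "це", "до", "з"}  # illustrative subset

def norm(text: str) -> str:
    """Norm(x): lowercase; strip URLs, hashtags, punctuation, numeric markers."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"#\w+", " ", text)           # hashtags
    text = re.sub(r"[^\w\s]|\d", " ", text)     # special characters and digits
    return re.sub(r"\s+", " ", text).strip()

def remove_stopwords(text: str) -> str:
    """Ψ(x): drop stop-words."""
    return " ".join(t for t in text.split() if t not in STOPWORDS)

def preprocess(text: str) -> str:
    """Φ(x) without Λ: lemmatization via Stanza ('uk') would follow here."""
    return remove_stopwords(norm(text))
```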
        <p>The resulting text x′ = Φ(x) is fed into the validation and statistical analysis modules.</p>
        <p>Stage 5. Quality control of texts.</p>
        <p>Three independent acceptance predicates are applied:</p>
        <sec id="sec-3-2-1">
          <title>1. Language identification.</title>
          <p>Let L(x) be the language-identification confidence that x is Ukrainian; a text is accepted only if L(x) ≥ τ_lang.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>2. Anti-duplicate check.</title>
          <p>Let sim(x_i, x_j) = α · cos(tfidf(x_i), tfidf(x_j)) + (1 − α) · J_3(x_i, x_j), where J_3 is the Jaccard similarity of character trigrams. A text x is discarded if max_{j&lt;i} sim(x, x_j) ≥ τ_dup (practically implemented via MinHash/LSH).</p>
          <p>3. Style/logic check. Automatic heuristics (length, anomalous n-gram repetition) plus manual review R(x).</p>
        </sec>
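        <p>The hybrid similarity can be sketched as follows; as a simplification, plain term-frequency cosine stands in for the tf-idf cosine, and the α and τ_dup values are illustrative assumptions, not the paper's thresholds:</p>

```python
from collections import Counter
from math import sqrt

def cosine_tf(a: str, b: str) -> float:
    """Cosine over term-frequency vectors (plain TF stands in for tf-idf)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_trigrams(a: str, b: str) -> float:
    """J_3: Jaccard similarity of character trigrams."""
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def hybrid_sim(a: str, b: str, alpha: float = 0.5) -> float:
    """sim(x_i, x_j) = α·cos(tf(x_i), tf(x_j)) + (1 − α)·J_3(x_i, x_j)."""
    return alpha * cosine_tf(a, b) + (1 - alpha) * jaccard_trigrams(a, b)

def is_duplicate(x: str, accepted: list, tau_dup: float = 0.85) -> bool:
    """Discard x if it is too similar to any already-accepted text."""
    return any(hybrid_sim(x, prev) >= tau_dup for prev in accepted)
```

At corpus scale the pairwise maximum is approximated with MinHash/LSH, as the paper notes; the exhaustive loop above is only for clarity.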
        <sec id="sec-3-2-3">
          <title>Final filter</title>
          <p>The final filter is Q(x) = 1{L(x) ≥ τ_lang, max_{j&lt;i} sim(x, x_j) &lt; τ_dup} ∧ R(x), where τ_lang and τ_dup denote the threshold parameters for language and duplication filtering, respectively.</p>
          <p>The final sets T, F consist only of texts that satisfy Q(x) = 1.</p>
          <p>Stage 6. Corpus splitting.</p>
          <p>The corpus is split into training and validation subsets while preserving class balance:</p>
          <p>C_train ∪ C_val = C, |C_train| ≈ 0.8·|C|, |C_val| ≈ 0.2·|C|.</p>
          <p>Lists of document IDs for each split are stored separately to ensure reproducibility, and all stochastic procedures are fixed using a common seed s.</p>
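          <p>A reproducible balanced split along these lines can be sketched as below; the seed value and the function signature are our assumptions, but the fixed seed and the returned ID lists mirror the reproducibility measures described above:</p>

```python
import random

def balanced_split(ids_trusted, ids_fake, val_frac=0.2, seed=13):
    """80/20 train-validation split preserving class balance.
    The fixed seed and the returned ID lists make the split reproducible."""
    rng = random.Random(seed)

    def split_one(ids):
        ids = sorted(ids)        # deterministic base order before shuffling
        rng.shuffle(ids)
        k = int(len(ids) * val_frac)
        return ids[k:], ids[:k]  # (train, val)

    tr_t, va_t = split_one(ids_trusted)
    tr_f, va_f = split_one(ids_fake)
    return {"train": tr_t + tr_f, "val": va_t + va_f}
```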
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Result</title>
      <p>
        To construct the trusted news class, we used materials from reputable Ukrainian information
sources, including TCH.ua, Bihus.info, BBC News Україна, and others. The full list of trusted
sources is available online [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The news covers the period 2022–2025 and topics related to
politics, economy, society, military events, healthcare, etc. In total, approximately 20,000 texts were
selected, each undergoing manual verification for accuracy and compliance with contemporary
journalistic style. Manual verification was conducted by two authors, with the dataset evenly split
between them for independent assessment.
      </p>
      <p>Fake news was generated using the large language model Gemini 2.0. Generation was
performed via the LangChain interface using a pre-designed prompt template that specified the
parameters of the resulting text.</p>
      <p>Each fake news item was created taking into account the following characteristics: tone (neutral, alarming, reassuring); writing style (analytical, populist, ironic, factual); type of fake (disinformation, manipulation, emotional influence, propaganda); topic (politics, economy, education, defense, infrastructure).</p>
      <sec id="sec-4-1">
        <title>Corpus statistics and baseline evaluation</title>
        <p>The prompt template also instructed the model to incorporate real facts, institutions, and
persons within a fictional context, making the texts as close as possible to authentic media content.
Generation was accompanied by guidelines to avoid fantastical events or obvious clichés.</p>
        <p>Despite automated generation, all fake news items underwent manual verification. Texts with
low plausibility, artificial language, logical inconsistencies, or violations of style guidelines were
filtered out. As a result, a balanced corpus was created, consisting of 20,000 fake and 20,000 trusted
news articles.</p>
        <p>
          The news corpus [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] was cleaned of noisy elements: hyperlinks, special characters, numeric
markers, and hashtags were removed. Texts were converted to lowercase, stripped of stop-words,
and lemmatized using the Stanza library, which supports Ukrainian morphology.
        </p>
        <p>The combined corpus contains approximately 40,000 Ukrainian-language news articles with nearly equal class representation (Fig. 2): Trusted ≈ 20,000, Fake ≈ 20,000–21,000; the deviation from a 50/50 split does not exceed ≈10%. Such balance reduces the risk of metric bias toward the larger class.</p>
        <p>The length distribution exhibits a pronounced bimodality (Fig. 3): the first local peak
corresponds to short notes (~20–60 words), while the second corresponds to full-length articles of
≈250–300 words. The mean length is approximately 250 tokens/words, which matches a typical
news item and provides sufficient context for linguistic features.</p>
        <p>The top-20 lemmas by class (Fig. 4) show the expected dominance of function words
(conjunctions, prepositions), indicating a homogeneous underlying syntactic structure across both
classes. At the same time, differences in content lemmas are noticeable: in the Fake class, terms like
“український” (Ukrainian), “ситуація” (situation), “про” (about), “але” (but) appear more
frequently, whereas in Trusted, “Україна” (Ukraine), “рік” (year), “вони” (they), “для” (for) are
more common. This reflects stylistic distinctions: fake texts tend to use generalizing and evaluative
formulations, while trusted texts feature nominative references to institutions/country and
temporal markers.</p>
        <p>The cosine similarity between the sets of key lemmas was 0.879, indicating a high lexical
overlap. This suggests that fake and trusted news often share the same topical vocabulary, which
makes the classification task realistic and shifts the discriminative power towards stylistic and
contextual features rather than mere word occurrence. The heatmap of cosine similarities (Fig. 5)
further illustrates this overlap, showing the strong lexical proximity between the two classes.</p>
        <p>The bimodality in text lengths reflects two dominant forms of news presentation (short “notes”
and full-length articles), which is useful for building robust models: the classifier is exposed to
different styles and text volumes. High model metrics on the balanced corpus confirm strong class
separability and the quality of data preparation (cleaning, language and duplicate control).
Differences in content lemmas illustrate stylistic signals that can be used as interpretable features
or for further bias analysis.</p>
        <p>For vector representation, FastText in skip-gram mode was applied. The vectorizer was trained
on the preprocessed corpus with the following hyperparameters: vector size – 300, number of
epochs – 15, context window width – 5, minimum word frequency – 10.</p>
        <p>News vectorization was performed by truncating or padding with zero vectors to a fixed length
of 100 tokens. The classifier architecture is based on a bidirectional LSTM network with additional
Dropout and Dense layers. Optimization was performed using Adam with an initial learning rate of
0.001.</p>
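        <p>The truncate-or-pad vectorization step can be sketched as follows. The embed mapping stands in for trained FastText word vectors (e.g. a gensim model's word-vector lookup), and the zero-vector fallback for out-of-vocabulary tokens is our assumption rather than a detail stated in the paper:</p>

```python
import numpy as np

MAX_LEN, DIM = 100, 300  # fixed sequence length and FastText vector size

def vectorize(tokens, embed, max_len=MAX_LEN, dim=DIM):
    """Map tokens to embeddings, then truncate or zero-pad to max_len rows.
    `embed` is any token -> vector mapping; unknown tokens stay as zeros."""
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for i, tok in enumerate(tokens[:max_len]):   # truncate long texts
        vec = embed.get(tok)
        if vec is not None:
            mat[i] = vec                         # short texts keep zero padding
    return mat
```

The resulting (100, 300) matrices are what the BiLSTM with Dropout and Dense layers consumes during training with Adam.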
        <p>After training on 80% of the dataset and validating on the remaining 20%, the model achieved an
accuracy of 99.25% (Table 2). Precision and recall coefficients exceed 0.99 for both classes, which is
also confirmed by the confusion matrix (Fig. 5). The training dynamics are shown in Figure 6,
illustrating a gradual decrease in the loss function without signs of overfitting.</p>
        <p>The column “Support” indicates the number of instances belonging to each class in the test dataset. Overall classification performance (see Fig. 5): Accuracy = 0.9925 (99.25%), macro-averaged F1 = 0.9925, Matthews correlation coefficient (MCC) = 0.985 (see Table 2). From the confusion matrix (Fig. 6):</p>
        <p>Fake class: TP = 4208, FN = 26, FP = 36, TN = 3979; Precision = 0.9915, Recall = 0.9939, F1 =
0.9927; TPR = 0.9939, TNR = 0.9910, FNR = 0.0061, FPR = 0.0090;
Trusted class: TP = 3979, FN = 36, FP = 26, TN = 4208; Precision = 0.9935, Recall = 0.9910, F1
= 0.9923; TPR = 0.9910, TNR = 0.9939, FNR = 0.0090, FPR = 0.0061.</p>
        <p>The average balanced accuracy equals (TPR_Fake + TPR_Trusted)/2 = 0.99245, corresponding to a BER = 0.00755. The low false positive and false negative rates (≤ 0.9%) in each class confirm strong class separability and the absence of bias toward any label.</p>
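        <p>The balanced accuracy, BER, and MCC above follow directly from the reported confusion-matrix counts and can be recomputed as a quick consistency check:</p>

```python
from math import sqrt

# Confusion-matrix counts reported for the Fake class (Trusted is the mirror).
TP, FN, FP, TN = 4208, 26, 36, 3979

tpr_fake = TP / (TP + FN)             # recall on Fake  ≈ 0.9939
tpr_trusted = TN / (TN + FP)          # recall on Trusted ≈ 0.9910
balanced_accuracy = (tpr_fake + tpr_trusted) / 2
ber = 1 - balanced_accuracy           # balanced error rate

# Matthews correlation coefficient from the same counts.
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
```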
        <p>The training dynamics (Fig. 7) show a monotonic increase in accuracy on both the training and validation sets, reaching ≈0.99 and plateauing after approximately the 7th epoch. The loss function decreases steadily across both subsets without divergence. The absence of rising validation error and the minimal generalization gap indicate no signs of overfitting under the chosen hyperparameters (FastText 300d, window = 5, epochs = 15, min_count = 10; BiLSTM + Dropout + Dense, Adam optimizer, η = 10⁻³).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study introduces a balanced corpus of Ukrainian-language news for fake news detection,
comprising ~40,000 texts (≈20k “Trusted” and ≈20k “Fake”) from 2022–2025. Trusted data were
sourced from verified media, while synthetic fakes were generated with LLMs under controlled
prompts, followed by normalization, filtering, and lemmatization. The corpus shows clear stylistic
differences between classes and an average length of ≈250 tokens, making it suitable for machine
learning.</p>
      <p>Evaluation with a BiLSTM + FastText model achieved accuracy of 99.25% and macro-F1 of
0.9925, confirming both the quality of the dataset and the feasibility of automated fake news
detection. Misclassification rates remained below 1%, with stable learning dynamics and no
overfitting.</p>
      <p>The dataset and approach can be applied in practice for media monitoring and early detection of
disinformation in Ukraine. Future work will include benchmarking transformer models, robustness
testing, and releasing artifacts to support reproducible research and regular updates of the corpus.</p>
      <sec id="sec-5-1">
        <title>Declaration on Generative AI</title>
        <p>The authors have not employed any Generative AI tools.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Alluri</surname>
            ,
            <given-names>C. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reddy</surname>
            ,
            <given-names>K. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarma</surname>
            ,
            <given-names>B. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Fake news detection: a systematic review and knowledge mapping</article-title>
          .
          <source>The Journal of Supercomputing</source>
          ,
          <volume>79</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1770</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdulrahman</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Baykara</surname>
          </string-name>
          ,
          <article-title>"Fake News Detection Using Machine Learning and Deep Learning Algorithms,"</article-title>
          <source>2020 International Conference on Advanced Science and Engineering (ICOASE)</source>
          , Duhok, Iraq,
          <year>2020</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>23</lpage>
          , doi: 10.1109/ICOASE51841.2020.9436605.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaushik</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raj</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Fake News Classification using transformer based enhanced LSTM and BERT</article-title>
          .
          <source>International Journal of Cognitive Computing in Engineering</source>
          ,
          <volume>3</volume>
          ,
          <fpage>98</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Raza</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Fake news detection based on news content and social contexts: a transformer-based approach</article-title>
          .
          <source>International journal of data science and analytics</source>
          ,
          <volume>13</volume>
          (
          <issue>4</issue>
          ),
          <fpage>335</fpage>
          -
          <lpage>362</lpage>
          . https://doi.org/10.1007/s41060-021-00302-z.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hemina</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boumahdi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Remmide</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>A Cross-Validated Fine-Tuned GPT-3 as a Novel Approach to Fake News Detection</article-title>
          . In:
          <string-name>
            <surname>Zantout</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ragab Hassen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the International Conference on Applied Cybersecurity (ACS) 2023. ACS 2023. Lecture Notes in Networks and Systems</source>
          , vol
          <volume>760</volume>
          . Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Dhiman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juneja</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nauman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Muhammad</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>GBERT: A hybrid deep learning model based on GPT-BERT for fake news detection</article-title>
          .
          <source>Heliyon</source>
          ,
          <volume>10</volume>
          (
          <issue>16</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Villela</surname>
            ,
            <given-names>H. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrêa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro</surname>
            ,
            <given-names>J. S. D. A. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabelo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Carvalho</surname>
            ,
            <given-names>D. B. F.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Fake news detection: a systematic literature review of machine learning algorithms and datasets</article-title>
          .
          <source>Journal on Interactive Systems</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ),
          <fpage>47</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lipianina-Honcharenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yurkiv</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ivasechkо</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2024</year>
          , May).
          <article-title>Evaluation of the effectiveness of machine learning methods for detecting disinformation in Ukrainian text data</article-title>
          . In
          <source>Proceedings of the Seventh International Workshop on Computer Modeling and Intelligent Systems (CMIS-2024)</source>
          , Zaporizhzhia, Ukraine.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Airlangga</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Advancing fake news detection: a comparative study of RNN, LSTM, and Bidirectional LSTM architectures</article-title>
          .
          <source>Jurnal Teknik Informatika C.I.T Medicom</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Kuntur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krzywda</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wróblewska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paprzycki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ganzha</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Comparative Analysis of Graph Neural Networks and Transformers for Robust Fake News Detection: A Verification and Reimplementation Study</article-title>
          .
          <source>Electronics</source>
          ,
          <volume>13</volume>
          (
          <issue>23</issue>
          ),
          <fpage>4784</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Saadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belhadef</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guessas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hafirassou</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Enhancing Fake News Detection with Transformer Models and Summarization</article-title>
          .
          <source>Engineering, Technology &amp; Applied Science Research</source>
          ,
          <volume>15</volume>
          ,
          <issue>3</issue>
          (Jun.
          <year>2025</year>
          ),
          <fpage>23253</fpage>
          -
          <lpage>23259</lpage>
          . https://doi.org/10.48084/etasr.10678.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Lipianina-Honcharenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodyanskiy</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kustra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ivasechkо</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>OLTW-TEC: online learning with sliding windows for text classifier ensembles</article-title>
          .
          <source>Frontiers in Artificial Intelligence</source>
          ,
          <volume>7</volume>
          ,
          <fpage>1401126</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Hashmi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yayilgan</surname>
            ,
            <given-names>S. Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamin</surname>
            ,
            <given-names>M. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Abomhara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Advancing fake news detection: Hybrid deep learning with fasttext and explainable ai</article-title>
          .
          <source>IEEE Access</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>C.-M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.-H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kristiani</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verma</surname>
            ,
            <given-names>V. K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.-T.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Fake News Classification Based on Content Level Features</article-title>
          .
          <source>Applied Sciences</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1116</fpage>
          . https://doi.org/10.3390/app12031116.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          Dataset. (
          <year>2025</year>
          ).
          <article-title>Trusted sources for fake news detection</article-title>
          .
          <source>Google Drive</source>
          . https://drive.google.com/drive/folders/13c_QRvuMuXTByYZkzbJOq4VcXzURniVx?usp=drive_link.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Sobchuk</surname>
            ,
            <given-names>Yulia</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>fake_true_ukrainian_news_dataset.csv</article-title>
          . figshare. Dataset. https://doi.org/10.6084/m9.figshare.29257568.v1.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>