<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GRU for fake news detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivan Peleshchak</string-name>
          <email>ivan.r.peleshchak@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasyl Lytvyn</string-name>
          <email>vasyl.v.lytvyn@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <email>victoria.a.vysotska@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksii Khobor</string-name>
          <email>oleksii.r.khobor@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mykhailo Luchkevych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Lviv, S. Bandery str. 12, 79000</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study introduces a lightweight neural architecture built on TinyBERT, incorporating a Cross-Attention mechanism, Low-Rank Adaptation (LoRA), and a bidirectional Bi-GRU layer for the task of automatic fake news detection. The explosive growth of the digital information space has intensified the spread of disinformation, while traditional fact-checking techniques struggle to keep pace. Current neural solutions based on large transformer models are often impractical, as they demand extensive computational resources and lack the ability to explicitly capture semantic mismatches between a news headline and its body. The proposed approach addresses these issues by utilizing a distilled TinyBERT backbone that reduces computational overhead, together with a Cross-Attention block designed to highlight headline-body incongruence. LoRA adapters further minimize the number of trainable weights, enabling efficient fine-tuning, while the Bi-GRU layer preserves sequential context and refines information aggregation. Experiments on the Fake News Classification dataset show that the model surpasses classical machine learning methods, reaching 99% accuracy, an F1-score of 0.985, and a ROC AUC of 0.998. With its efficiency, rapid adaptability to new domains, and strong interpretability through attention visualization, the model is well-suited for deployment in real-time fact-checking systems.</p>
      </abstract>
      <kwd-group>
        <kwd>transformer models</kwd>
        <kwd>fake news detection</kwd>
        <kwd>AI fact-checking</kwd>
        <kwd>TinyBERT</kwd>
        <kwd>Cross-Attention</kwd>
        <kwd>Bi-GRU</kwd>
        <kwd>LoRA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid growth of digital content has made “AI-driven disinformation” one of the most serious
global threats. Disinformation can undermine elections, polarize societies, and cost the world
economy tens of billions of dollars annually. Traditional, manual fact-checking cannot keep up
with the speed of these “digital wildfires”: news flows spread at a rate of hundreds of thousands of
publications per hour, while the full verification of a single article may take journalists from half an
hour to several days.</p>
      <p>With the emergence of large language models (LLMs), automated verification has become
technically feasible; however, the vast majority of existing neural solutions still face significant
limitations.</p>
      <p>Modern “heavyweight” transformers (such as Longformer-Large and multimodal GNN stacks)
contain hundreds of millions of parameters, making them unsuitable for deployment on editorial
CMS servers, in browser extensions, or on mobile devices. Single-branch models with CLS pooling
perform significantly worse at capturing the main rhetorical trick of fake news – the incongruence
between a “sensational” headline and a relatively innocuous news body. Studies show that this
mismatch is one of the most informative indicators of media manipulation, yet conventional
encoders fail to capture it – requiring a dedicated mechanism for the interaction of two text
streams [1]. Moreover, fully fine-tuning large transformer architectures demands substantial
computational resources and time, limiting their applicability in dynamic information
environments.</p>
      <p>ORCID: 0000-0002-7481-8628 (I. Peleshchak); 0000-0002-9676-0180 (V. Lytvyn); 0000-0001-6417-3689 (V. Vysotska); 0000-0002-3210-036X (O. Khobor); 0000-0002-2196-252X (M. Luchkevych).</p>
      <p>Thus, there is a need for a compact, highly efficient, and interpretable neural network model
capable of detecting fake news in real time, particularly through precise analysis of the relationship
between headline and body content, while remaining suitable for rapid adaptation to new topics
and linguistic contexts without significant resource costs.</p>
      <p>To address these issues, this study proposes a new compact neural network architecture. Its
backbone is TinyBERT-4L-312D – a distilled transformer that preserves the quality and
functionality of BERT-Base while being 7.5 times smaller [2]. On top of its outputs, a bidirectional
cross-attention mechanism is constructed between the headline and the news body, making the
model sensitive to semantic conflicts. To maintain the number of parameters within the range of
several hundred thousand, we employ Low-Rank Adaptation (LoRA): instead of full fine-tuning,
only low-rank adapters are updated, reducing the number of trainable weights by a factor of 32
without loss of quality [3]. Additionally, sequential context is aggregated through a lightweight
bidirectional Bi-GRU (64 hidden units).</p>
      <p>Thus, the proposed network is:</p>
      <list list-type="bullet">
        <list-item><p>Accessible – suitable for real-time deployment even on resource-constrained devices (e.g., mobile platforms);</p></list-item>
        <list-item><p>Specialized – designed to detect headline–body incongruence, which is often overlooked by more general LLMs;</p></list-item>
        <list-item><p>Interpretable – cross-attention heatmaps and Grad-CAM visualizations over the GRU reveal the reasoning behind classifications;</p></list-item>
        <list-item><p>Flexible in fine-tuning – LoRA adapters are lightweight enough to adapt the model to a local media environment within minutes.</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. Analysis of recent research and publications</title>
      <p>The first substantial attempts at automatic fake-news detection relied on recurrent and
convolutional networks. In [4], a two-layer Bi-LSTM augmented with lexical and stylistic features
outperformed traditional SVMs and became a springboard for deep approaches. Subsequently, [5]
applied hierarchical attention by decomposing text into words and sentences; this architecture
added roughly three percentage points to the F1 score, demonstrating the value of multi-level text
modeling. The authors of [6] showed that a news item’s propagation path on social media is no less
informative than the content itself: their hierarchical propagation network markedly improved
detection accuracy by integrating temporal and authorial relations.</p>
      <p>The next phase of countering disinformation was marked by rapid advances in graph-based and
transformer approaches. In [7], a heterogeneous graph network with attention over node types –
authors, topics, and the text itself – brought models closer to the real structure of the media
ecosystem. Around the same time, [8] showed that deep CNNs combined with LSTM stacks deliver
steady gains, though their model approached the limits of available GPU memory.</p>
      <p>The advent of distilled transformers [9] enabled injecting external knowledge into hierarchical
attention. Meanwhile, [10] demonstrated that combining CNN filters with HAN and simple stylistic
cues can boost accuracy without substantially inflating the model.</p>
      <p>Researchers then turned to robustness and multimodality. The authors of [11] were the first to
leverage cross-attention between headline and body, showing that incongruence between these
components significantly improves classification accuracy. In [12], an adversarially trained
network was proposed to build resilience to lexical substitutions, but the model became heavier
and more resource-hungry. Work [13] fused text, images, and user graphs, surpassing
state-of-the-art results, yet at the cost of hundreds of millions of parameters and reliance on social-media APIs.</p>
      <p>The survey in [14] concluded that fully fine-tuning large transformers is increasingly
impractical; instead, Low-Rank Adaptation (LoRA) allows one to update only a tiny fraction of
weights. This idea was further developed in [15], which showed that small LoRA layers can be
ensembled with LLMs to reach results competitive with flagship systems – albeit at a high
hardware cost.</p>
      <p>Despite these advances, significant gaps remain. Most models still impose excessive hardware
demands, making deployment on newsroom laptops or in browser plugins infeasible [16-20]. Many
approaches cannot explicitly assess the mismatch between headline and body, even though this
imbalance is a defining characteristic of propagandistic material.</p>
      <p>The primary objective of this study is to address the aforementioned limitations and to design
and experimentally validate a compact neural network architecture based on the distilled
transformer TinyBERT, enhanced with a Cross-Attention mechanism, Low-Rank Adaptation
(LoRA), and a bidirectional recurrent Bi-GRU layer for automatic fake news detection. The
proposed solution is intended to deliver high classification accuracy with significantly reduced
computational costs, while enabling explicit analysis of semantic consistency between headline and
body text, rapid adaptation to new topics, and seamless integration into real-time fact-checking
systems on devices with limited computational resources.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Data flow scheme in the model</title>
        <p>The data stream in the model begins at the level of raw CSV records, where each news item is
represented as a pair of rows – the headline and the body text. These two fragments pass through a
preprocessing block that removes HTML tags, normalizes Unicode encoding, and converts the text
to lowercase. The cleaned strings are then fed into the TinyBERT WordPiece tokenizer: the
headline is transformed into a sequence of up to 64 tokens, while the body is represented as a
sequence of up to 448 tokens; if necessary, excess words are truncated, and shorter texts are padded
with [PAD] tokens. Both sequences are supplemented with special markers [CLS] and [SEP], and a
binary attention_mask is generated in parallel to distinguish actual tokens from padding (Fig. 1).</p>
        <p>Next, both sections are independently fed into a shared TinyBERT encoder, which keeps its
weights frozen during the initial training phase and functions as a contextual transformer: each
token receives a 312-dimensional vector enriched with both local and global semantics. The
outputs of the headline and body are then passed into the cross-attention module: headline tokens
act as queries, while body tokens serve as keys and values, and vice versa. This layer aligns the two
text streams, highlighting those word and sentence pairs where semantic conflicts or confirmations
occur. All large projection matrices in this block remain frozen; only their low-rank LoRA adapters
are updated, keeping the size of the new trainable space minimal.</p>
        <p>The result of the cross-attention is a unified sequence containing the “stitched” information of
the headline and body. This sequence is passed into a compact bidirectional GRU with 64 hidden
units. The recurrent layer processes it forward and backward, accumulating sequential context, and
at each position produces a combined importance vector. To reduce the temporal dimension into a
fixed size, global max-pooling is applied: the maximum activations across all tokens are selected,
representing the strongest “arguments” for or against the article’s credibility. The resulting
128-dimensional vector undergoes dropout and is then passed through a minimal linear layer that
outputs a single logit – the fake-news score. A sigmoid function converts this logit into a
probability, after which a threshold decision determines whether the news is labeled as fake.</p>
        <p>Thus, data flows from raw text to final classification through five logical stages – normalization,
tokenization, contextualization via the transformer, cross-attention fusion, and compact recurrent
aggregation – each adding informational value while staying within the computational limits of
even resource-constrained platforms.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Preprocessing</title>
        <p>The Fake News Classification dataset [16], used for training and testing this model, contains 37,106
Fake entries and 35,028 True entries. Each record consists of four fields: Serial Number (starting
from 0), Headline (the news headline), Text (the news body), and Label (1 = fake, 0 = real).</p>
        <p>Before feeding the news content into the tokenizer, each headline and body undergo a
preprocessing pipeline designed to remove inconsistencies that could distort frequency statistics
and corrupt the feature space. First, the text is cleaned of residual HTML markup: tags, service
attributes, and entity codes are converted into plain characters. Next, Unicode normalization is
applied so that all “composite” letters are reduced to a single code – this is particularly important
for words with diacritics, which would otherwise be treated as separate tokens.</p>
        <p>After encoding unification, the text is converted to lowercase, since TinyBERT is trained in
uncased mode and does not differentiate between uppercase and lowercase characters. The next
step involves harmonizing typography: quotation marks («…», „…“) and different types of dashes
are replaced with standard ASCII equivalents. This operation reduces the number of unique tokens
and ensures that the WordPiece tokenizer correctly merges word segments without introducing
extraneous tokens.</p>
        <p>Subsequently, multiple consecutive spaces are compressed into a single space, while tabs and
line breaks are replaced with spaces to prevent the tokenizer from creating empty or partial tokens.
Finally, non-printable control characters are removed; these often appear in copied text fragments
and can cause rendering errors or an explosion in the number of unknown tokens.</p>
        <p>As a result of these steps, a clean, stylistically uniform string is obtained, which can be safely
passed to the TinyBERT WordPiece tokenizer. This guarantees a stable vocabulary and overall
improves the quality of subsequent text encoding.</p>
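        <p>For illustration, the cleaning steps above can be sketched in plain Python; the specific regular expressions and replacement table below are our own illustrative choices, not the exact pipeline used in the study:</p>

```python
import html
import re
import unicodedata

def preprocess(text: str) -> str:
    """Illustrative cleaning pipeline: HTML removal, Unicode normalization,
    lowercasing, typography harmonization, whitespace compression."""
    text = re.sub(r"<[^>]+>", " ", text)          # strip residual HTML tags
    text = html.unescape(text)                    # decode entities like &amp;
    text = unicodedata.normalize("NFC", text)     # compose diacritics into single code points
    text = text.lower()                           # TinyBERT is uncased
    for src, dst in {"«": '"', "»": '"', "„": '"', "“": '"', "”": '"',
                     "–": "-", "—": "-"}.items():  # ASCII-normalize quotes and dashes
        text = text.replace(src, dst)
    text = "".join(ch for ch in text if ch.isprintable() or ch == " ")  # drop control chars
    text = re.sub(r"\s+", " ", text).strip()      # collapse spaces, tabs, newlines
    return text
```

        <p>Each step mirrors one stage of the pipeline described above; the order matters, since entity decoding must precede character filtering.</p>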
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Tokenization and Sequence Construction</title>
        <p>
          After normalization, the text proceeds to the tokenization stage, where raw characters are
converted into numerical indices interpretable by the neural network. For this purpose, the
WordPiece tokenizer trained alongside TinyBERT is employed. Its vocabulary contains 30,522
units, ranging from whole words to characteristic subwords. The operation of the TinyBERT
WordPiece tokenizer (with vocabulary size |V| = 30,522) is described by formula (1): for a string s, it
returns a sequence of token indices T(s) = (x_1, ..., x_L).
        </p>
        <p>
          T : str → {0, ..., |V| − 1}
          (1)
        </p>
        <p>This segmentation makes it possible to preserve rare terms and proper names by splitting them
into shorter, already known lexemes, while keeping the vocabulary compact. The tokenizer reads
the string from left to right, selecting the longest substring present in the vocabulary; if a word is
not found in full, the algorithm recursively splits it until each part is recognized. Unknown
fragments are replaced with the special token [UNK], although after the earlier normalization steps
such cases are rare.</p>
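        <p>The greedy longest-match-first procedure can be sketched as follows; the toy vocabulary used in the test is hypothetical and far smaller than the real 30,522-unit one:</p>

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match-first segmentation, as in WordPiece.
    Continuation pieces carry the '##' prefix; unmatched words map to [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:                  # shrink the candidate from the right
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub            # mark non-initial subwords
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:                     # no piece matched -> unknown token
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces
```

        <p>Rare terms are thus decomposed into known subwords (e.g., a verb into stem plus "##ing"), keeping the vocabulary compact.</p>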
        <p>During batch formation, the headline and article body are processed separately. The maximum
length is set to 64 tokens Lt=64 (including special markers) for the headline and 448 tokens
Lb=448 for the body. If a sequence is shorter, it is padded with [PAD] tokens until it reaches the
fixed length. If it is longer, the headline is truncated to the required size, while the body is
processed using a sliding window with a stride of 128 tokens, ensuring the model can gradually see
the entire text without catastrophic loss of the ending.</p>
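        <p>A minimal sketch of the padding and sliding-window logic, assuming token ids are already available; the function name and defaults are illustrative:</p>

```python
def pad_or_window(ids: list, max_len: int, stride: int = 128,
                  pad_id: int = 0) -> list:
    """Fixed-length chunks: short sequences are padded with pad_id,
    long ones are split into overlapping windows with the given stride."""
    if len(ids) <= max_len:
        return [ids + [pad_id] * (max_len - len(ids))]
    windows, start = [], 0
    while start < len(ids):
        chunk = ids[start:start + max_len]
        windows.append(chunk + [pad_id] * (max_len - len(chunk)))
        if start + max_len >= len(ids):     # last window already covers the end
            break
        start += stride
    return windows
```

        <p>With max_len = 448 and stride = 128, a 600-token body yields three overlapping windows, so the ending of the article is never discarded.</p>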
        <p>Next, special markers are inserted into both sequences. At the beginning, [CLS] is placed – a
universal flag whose vector is often used for classification in classical BERT-like architectures,
though in our case it remains within the sequence as a regular token. At the end of each section,
[SEP] is added to signal a logical boundary to the neural network. As a result, two arrays of
input_ids are obtained: one for the headline tokens and one for the body tokens.</p>
        <p>In parallel, binary attention_masks are generated: elements corresponding to actual tokens are
marked with ones, while positions filled with [PAD] are marked with zeros. During further
processing in the transformer, these zeros are masked, preventing the network from spending
computation on empty positions and ensuring no gradient “leakage” into padding tokens.</p>
        <p>Finally, for each token, a positional index is assigned – its order number within the sequence.
This index is mapped to a vector from a dedicated positional embedding table and added to the
token embedding vector, together forming a complete 312-dimensional token representation. After
this step, the headline and body matrices share a unified format and are ready for input into the
shared TinyBERT block.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. TinyBERT-4L-312D Block</title>
        <p>The core of the entire architecture is the distilled transformer TinyBERT, consisting of only four
layers of the “classical” Attention + Feed-Forward block. At the input, it receives the preprocessed
embedding matrices of the headline and body, where each token is represented by a
312-dimensional vector.</p>
        <p>The first operation is the addition of positional and token embeddings: the lexical vector
extracted from the embedding table is summed with the positional vector encoding the token’s
order in the sequence.</p>
        <p>For each input token x_i ∈ {0, ..., V − 1}, the vector representation is formed as e_i = E_{x_i} + P_i,
where E ∈ R^{V×d} is the token embedding table and P ∈ R^{Lmax×d} is the positional embedding table.</p>
        <p>The resulting matrices are sequentially passed through four transformer layers. Each layer
consists of twelve self-attention heads that simultaneously construct query, key, and value
matrices. After the attention weights are normalized with the softmax function, tokens receive
contextual vectors enriched with information about lexical surroundings, syntax, and general
linguistic patterns. Each such block concludes with a two-layer feed-forward network employing
the GELU nonlinearity, along with two residual connections followed by LayerNorm. This internal
“rhythm” – attention → normalization → FFN → normalization – is repeated four times,
providing sufficient depth to capture complex political and journalistic vocabulary without
overloading memory.</p>
        <p>
          A key aspect is that the same TinyBERT copy processes both the headline and the body (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ). As a
result, weights are effectively shared between the two streams, reducing memory consumption by
half compared to approaches that maintain separate encoders for headline and body.
        </p>
        <p>
          H_t = TinyBERT(X_title, m_t),  H_b = TinyBERT(X_body, m_b),
          (2)
where H_t ∈ R^{Lt×d} and H_b ∈ R^{Lb×d} denote the hidden states of the news title and body text,
respectively.
        </p>
        <p>In the first training stage, all transformer parameters are frozen: it functions as a ready-made
extractor of linguistic features distilled from the larger BERT-Base model. This allows for rapid
convergence, as the network optimizes only lightweight LoRA adapters within the cross-attention
module and the weights of the Bi-GRU. Only after the classifier stabilizes are the top two
transformer layers unfrozen and fine-tuned with a low learning rate – thus avoiding catastrophic
forgetting of general linguistic representations while enabling the model to adapt more precisely to
the rhetoric of the news domain.</p>
        <p>The distilled nature of TinyBERT means that it retains nearly the full linguistic competence of
its “parent” BERT-Base, yet contains 7.5 times fewer parameters and operates several times faster.</p>
        <p>In summary, the shared TinyBERT provides the essential “alphabetical” foundation for the
entire system, offering compact yet deep contextual representations of each token while
maintaining resource efficiency – an indispensable requirement for building a practical, widely
deployable tool for automated news verification.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. LoRA Block</title>
        <p>After both the headline and the body text pass through the shared TinyBERT, each token already
carries a full 312-dimensional contextual vector. However, up to this point the two fragments have
not “communicated” with each other: the transformer has processed them independently. To
enable the model to capture either a semantic gap or, conversely, a confirmation between a
sensational headline and the news body, a dedicated cross-attention layer is introduced.</p>
        <p>
          The idea is as follows: the hidden-state matrix of the headline is treated as queries, while the
hidden-state matrix of the news body is split into keys and values. Using the scaled dot-product
mechanism, attention weights are computed to indicate how strongly each word in the headline
“resonates” with each word in the body. The process is then mirrored: body tokens are placed in
the query role, and the headline serves as keys and values. The result is two streams of mutual
attention that literally “stitch” the two texts together. In practice, this bidirectional alignment is
sufficient for the model to detect a classic fake-news tactic: a sensational “click-bait” headline that
is not substantiated by the news content. The bidirectional attention mechanism for the
title → body direction is described by formula (3):
        </p>
        <p>
          Q_t = H_t (W_Q^{tb} + ΔW_Q^{tb}),
K_b = H_b (W_K^{tb} + ΔW_K^{tb}),
U_b = H_b (W_U^{tb} + ΔW_U^{tb}),
Attn_{t⇒b} = softmax(Q_t K_b^T / √d_h) U_b ∈ R^{Lt×d},
          (3)
where d_h = d / h = 156 and h = 2.
        </p>
        <p>In the specialized cross-attention block between the headline and the body text, the number of
attention heads is reduced to two in order to lower computational costs. This reduction also
decreases the risk of overfitting and gradient noise. In particular, with d_h = 156, each attention
head models semantics more effectively due to the increased dimensionality of its representation.</p>
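        <p>For intuition, one direction of the scaled dot-product mechanism can be sketched in pure Python; the tiny dimensions are purely illustrative, and the LoRA increments ΔW are assumed to be already folded into the projected matrices:</p>

```python
import math

def cross_attention(Q, K, U):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_h)) U.
    Q: L_t x d_h queries (headline); K, U: L_b x d_h keys/values (body)."""
    d_h = len(Q[0])
    out = []
    for q in Q:
        # similarity of this headline token to every body token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_h) for k in K]
        m = max(scores)                             # stabilize softmax numerically
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]             # attention distribution over body tokens
        # weighted mixture of body value vectors
        out.append([sum(w * u[j] for w, u in zip(weights, U))
                    for j in range(len(U[0]))])
    return out
```

        <p>A headline token whose query strongly matches one body key receives (almost) exactly that body token's value vector, which is what makes the attention map readable as an incongruence heatmap.</p>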
        <p>The body → title direction is implemented symmetrically, after which the results are
concatenated along the temporal axis: F = [Attn_{t⇒b}, Attn_{b⇒t}] ∈ R^{(Lt+Lb)×d}. The raw parameters of
the Q, K, U layers and the output projection remain frozen – they were inherited from TinyBERT
during distillation and already encode universal linguistic patterns. Adjusting the entire set of
hundreds of thousands of weights for a single localized task would be inefficient; therefore,
Low-Rank Adaptation (LoRA) is applied. The essence of LoRA is that each large matrix is augmented
with a small rank-4 decomposition matrix, which becomes the only “trainable” component.</p>
        <p>Each “large” matrix (for example W_Q^{tb}) remains fixed, while only its low-rank decomposition is
optimized: ΔW = α·A·B, where A ∈ R^{d×r} and B ∈ R^{r×dh}. Thus, with rank r = 4, the number of
trainable weights becomes d·r + r·dh = 1872 instead of d·dh = 48672, yielding a parameter
reduction of 96.2%. A scaling factor α = 32 (the LoRA default) is applied to preserve gradient norms.
In our case, only A, B, and the LayerNorm biases are trained, while all other weights remain frozen.</p>
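        <p>The parameter arithmetic above can be checked directly with a small helper (illustrative, not part of the model code):</p>

```python
def lora_param_counts(d: int, d_h: int, r: int):
    """Trainable parameters of a rank-r LoRA pair (A: d x r, B: r x d_h)
    versus full fine-tuning of the d x d_h projection matrix."""
    lora = d * r + r * d_h              # only A and B are updated
    full = d * d_h                      # the frozen projection's size
    reduction = 100 * (1 - lora / full) # percentage of weights no longer trained
    return lora, full, round(reduction, 1)
```

        <p>For d = 312, d_h = 156, r = 4 this reproduces the 1872 vs. 48672 figures and the 96.2% reduction quoted in the text.</p>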
        <p>After computing attention, the two resulting tensors are concatenated into a single long
sequence: now every word in the headline “knows” which sentences it aligns with, and vice versa.
An additional LayerNorm and dropout (0.1) stabilize the statistics before passing the sequence to
the Bi-GRU, which now processes not two separate fragments but a single, semantically coherent
“dialogue” between them.</p>
        <p>Thus, the cross-attention layer fulfills two critical functions: it makes the model sensitive to
semantic incongruence – the core of most fake publications – and it enables interpretability, since
the attention heatmap can be easily visualized to show an editor exactly which headline words
“resonated” with which passages of the text. All of this is achieved without parameter explosion,
thanks to LoRA-based lightweight fine-tuning.</p>
      </sec>
      <sec id="sec-3-5-1">
        <title>3.6. Bi-GRU-64 Block</title>
        <p>Once cross-attention has completed the “stitching” of the headline and body, we obtain a long
sequence of tokens, where each vector now encodes not only its local context but also its semantic
relation to the other part of the news item. However, classification requires a single fixed-length
document representation. Reducing hundreds of tokens to just a few numbers can be done in
several ways, but simple averaging or relying on the [CLS] vector often erases the causal and
discursive nuances. For this reason, the next step employs a compact bidirectional GRU with 64
hidden units in each direction, h_k = [h→_k, h←_k] ∈ R^128.</p>
        <p>The GRU functions as a learnable pooling mechanism: by traversing the sequence both
left-to-right and right-to-left, it progressively accumulates salient information, and through its built-in
reset and update gates, it decides which contextual components to retain and which to discard.
Each update requires only two parameter sets instead of the three used by LSTMs, thus avoiding
unnecessary computation. The choice of 64 hidden units is a compromise: large enough to preserve
long-term dependencies between phrases, yet small enough to remain within the bounds of a
lightweight architecture.</p>
        <p>At the final stage, global max-pooling is applied across the temporal axis:
g_j = max_{k=1,...,L} h_{k,j}, g ∈ R^128. Instead of relying on the last hidden state, this technique “selects” the
strongest activations across all positions, ensuring that no critical sentence is lost, even if it appears
far from the text’s conclusion. An additional dropout layer (rate 0.2) is applied to the pooled vector,
filtering out random spikes and mitigating overfitting.</p>
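        <p>The pooling step amounts to a column-wise maximum over the Bi-GRU outputs (illustrative sketch with tiny dimensions):</p>

```python
def global_max_pool(h):
    """Column-wise max over the temporal axis: g_j = max_k h[k][j].
    h is an L x D list of per-token Bi-GRU outputs; returns one D-dim vector."""
    return [max(h[k][j] for k in range(len(h))) for j in range(len(h[0]))]
```

        <p>Because the maximum is taken over all positions, a strongly activated phrase contributes to g no matter where it occurs in the article.</p>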
        <p>In this way, the Bi-GRU fulfills several roles simultaneously. It introduces sequential context
absent from the “flat” self-attention layers of the transformer, concentrates on the most significant
phrases, and, most importantly, achieves this with minimal computational overhead. The result is a
128-dimensional vector that encodes both the linguistic and structural “profile” of the news item –
this is the representation passed to the classification layer.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.7. Classification Layer (Classification “Head”)</title>
        <p>After the bidirectional GRU compresses the entire news item into a compact 128-dimensional
vector g, the final stage is decision-making: whether the publication should be classified as fake or
considered credible. This role is fulfilled by a minimalist yet sufficient classification head,
consistent with the overall “lightweight” design of the model.</p>
        <p>First, the vector g is passed through a Dropout layer with a probability of 0.2. During training,
this randomly zeroes out one-fifth of the components, forcing the network not to over-rely on a
few “loud” features and preventing overfitting to isolated triggers – such as excessive exclamation
marks or cliched words.</p>
        <p>
Next, a single linear layer produces the logit z, which reflects the cumulative “weight of
evidence” for fakeness. To convert this logit into a probability, the sigmoid function
ŷ = 1 / (1 + e^(−z)) is applied, mapping any real value into the (0, 1) range. During inference,
this is the value presented to the user: for example, 0.83 corresponds to an 83% confidence
that the article is fake. By default, the
decision threshold is set at 0.5; however, after training, the model is calibrated on the validation set
to determine the threshold that maximizes the F1 score. This calibration allows tailoring the
precision/recall trade-off to the needs of a specific newsroom: some may prioritize avoiding false
accusations, while others may focus on catching every fake.
        </p>
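        <p>A minimal sketch of the probability mapping and the F1-maximizing threshold calibration described above; the helper names and the 0.01-step threshold grid are our own illustrative choices:</p>

```python
import math

def sigmoid(z: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def best_threshold(probs, labels, grid=None):
    """Pick the decision threshold that maximizes F1 on a validation set."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

        <p>Shifting the chosen threshold up trades recall for precision, which is how a newsroom can tune the model toward fewer false accusations or fewer missed fakes.</p>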
        <p>
          The entire model is optimized using Focal Binary Cross-Entropy (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ).
        </p>
        <p>
          Φ = −((1 − ŷ)^γ · y · log(ŷ) + ŷ^γ · (1 − y) · log(1 − ŷ)), γ = 2
          (4)
        </p>
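        <p>Formula (4) translates directly into code; this per-example form is an illustrative sketch of the loss, not the training implementation:</p>

```python
import math

def focal_bce(y: int, y_hat: float, gamma: float = 2.0) -> float:
    """Focal binary cross-entropy, formula (4): easy examples are down-weighted
    by (1 - y_hat)^gamma for positives and by y_hat^gamma for negatives."""
    return -((1 - y_hat) ** gamma * y * math.log(y_hat)
             + y_hat ** gamma * (1 - y) * math.log(1 - y_hat))
```

        <p>With γ = 0 the expression reduces to ordinary binary cross-entropy; γ = 2 concentrates the gradient on hard, borderline cases.</p>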
        <p>In summary, the classification head completes the work of the network while upholding three
key principles: minimal resource footprint, the ability to provide detailed explanations for each
decision, and flexibility in adapting thresholds or calibrating probabilities for new domains.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.8. Model Training</title>
        <p>The training process begins with preparing the Fake News Classification corpus. The CSV file is
stratified into training, validation, and test subsets in a 70:15:15 ratio, preserving the fake/real class
balance in each part. To verify metric stability on the same split, five-fold cross-validation is
performed. Model training was conducted on a Tesla T4 GPU.</p>
        <p>Optimization is carried out using the AdamW algorithm with classical moments (0.9, 0.999) and
ε=1⋅10−8. All “heavy” Frozen-TinyBERT matrices remain excluded from updates – gradients do not
pass through them – so memory required for gradient storage is significantly reduced compared to
full fine-tuning. Training proceeds in two stages. In the first stage, lasting two epochs, the
transformer remains a feature extractor: only the LoRA adapters in the cross-attention block, the
projection before the GRU, the GRU itself, and the linear classifier are trainable. These layers use a
relatively high learning rate of 2 · 10⁻⁴, enabling the model to quickly adapt to the rhetoric of news
without altering TinyBERT’s deep linguistic base. Once the loss curve stabilizes, the second stage
begins: the top two transformer blocks are unfrozen but trained with a lower learning rate of 1 ·
10⁻⁵, while other layers continue updating at the initial step. This freeze-schedule preserves
TinyBERT’s large knowledge base while giving the model just enough freedom to fine-tune context
for the discourse domain.</p>
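        <p>
          The two learning-rate groups of the second stage could be wired up roughly as follows. This is a hypothetical sketch: the helper and the module names ("encoder.layer.2/3" for the top two unfrozen blocks) are assumptions, since layer naming depends on the TinyBERT implementation used.
        </p>

```python
import torch

def build_optimizer(model, head_lr=2e-4, top_block_lr=1e-5):
    """AdamW with two parameter groups mirroring the freeze schedule."""
    head_params, top_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue  # frozen TinyBERT matrices receive no gradients at all
        # assumption: the top two encoder blocks are named "encoder.layer.2/3"
        if "encoder.layer.2" in name or "encoder.layer.3" in name:
            top_params.append(p)  # unfrozen transformer blocks, low LR
        else:
            head_params.append(p)  # LoRA adapters, projection, GRU, classifier
    return torch.optim.AdamW(
        [{"params": head_params, "lr": head_lr},
         {"params": top_params, "lr": top_block_lr}],
        betas=(0.9, 0.999), eps=1e-8)
```

        <p>Keeping the unfrozen blocks at a 20× smaller step is what protects TinyBERT's distilled linguistic knowledge while still allowing domain adaptation.</p>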
        <p>Learning rates in both groups follow a short linear warm-up – cosine decay schedule. The first
10% of iterations serve as warm-up, preventing unstable gradient spikes; afterward, the step
gradually decays to zero, ensuring smooth convergence. At each step, losses are computed using
focal binary cross-entropy: the scaling factor γ suppresses the influence of already correctly
classified examples and concentrates gradients on hard, borderline cases; label smoothing (ε=0.05)
filters out random spikes caused by data noise.</p>
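        <p>
          The warm-up/cosine schedule corresponds to a multiplier applied to each group's base learning rate; a minimal sketch:
        </p>

```python
import math

def lr_scale(step, total_steps, warmup_frac=0.1):
    """Linear warm-up over the first 10% of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step >= warmup_steps:
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return step / warmup_steps
```

        <p>The multiplier rises from 0 to 1 during warm-up, suppressing early gradient spikes, and then decays smoothly to zero.</p>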
        <p>Several measures counter overfitting. Each vector after the GRU passes through dropout with
probability 0.2; GRU gate weights undergo mixout regularization (p = 0.1), which stochastically reverts a
fraction of the weights to their previous values and prevents the model from overfitting to dataset-specific artifacts. Gradients are clipped
to a norm of 1.0 when necessary. Early stopping is applied if macro-F1 on validation does not
improve for two consecutive epochs.</p>
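        <p>
          The early-stopping rule can be captured in a few lines (an illustrative sketch, not the authors' training code):
        </p>

```python
class EarlyStopping:
    """Stop when validation macro-F1 fails to improve for `patience` epochs."""
    def __init__(self, patience=2):
        self.patience = patience
        self.best = -1.0
        self.bad_epochs = 0

    def step(self, macro_f1):
        if macro_f1 > self.best:
            self.best = macro_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True means: stop training
```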
        <p>After the final epoch, the checkpoint with the best F1 score is retained. Stochastic Weight
Averaging Gaussian (SWAG) is then applied over the last ten optimizer states: weights are
smoothed, providing additional fractional improvements in stability. Finally, the decision threshold
is calibrated: validation data determine the value that maximizes F1, which is saved in the
configuration file.</p>
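        <p>
          The weight-smoothing step can be sketched as plain SWA-style averaging of the last checkpoints; SWAG additionally fits a Gaussian over the weights for uncertainty estimates, which is omitted in this simplified illustration.
        </p>

```python
import torch

def average_checkpoints(state_dicts):
    """Average the weights of the last saved states (SWA-style smoothing)."""
    avg = {k: v.clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    n = float(len(state_dicts))
    for k in avg:
        avg[k] /= n
    return avg
```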
        <p>In this way, the full data → model cycle remains feasible even on consumer hardware, ensures
reproducible results, and allows LoRA adapters to be fine-tuned for new topics or local information
landscapes within minutes – all without risking the loss of general linguistic knowledge distilled
into TinyBERT.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>We present the results of the experimental comparison between the proposed TinyBERT with
Cross-Attention + LoRA + Bi-GRU model and classical “lightweight” machine learning methods,
including logistic regression, linear SVM, Bernoulli Naive Bayes, k-nearest neighbors (k = 7),
random forest, gradient boosting, and XGBoost. Comparing our model with full fine-tuning of large
transformers – where training times are often measured in hours – was deemed impractical.</p>
      <p>Experiments were conducted on the English-language Fake News Classification corpus [21],
split into training, validation, and test subsets. Evaluation metrics included accuracy, weighted
F1-score, and the area under the ROC curve (ROC AUC). The purpose of the study was not only to
measure the absolute effectiveness of each approach in distinguishing fake from real news but also
to assess the trade-off between classification quality and computational cost during fine-tuning.
Interpretation of the results highlights the practicality of integrating a transformer backbone with
lightweight adaptation and sequential context modeling to improve both accuracy and stability in
text classification tasks.</p>
      <p>In the comparative experiments, our model achieved the highest performance across all
evaluated algorithms: Accuracy = 99%, weighted F1-score = 0.985, and ROC AUC = 0.998. The
results of other models are summarized in Table 2. The class-specific accuracy metrics confirm the
high reliability of the proposed architecture across both categories. For the real news class, the
model achieved precision = 0.98 and recall = 0.99, while for the fake news class it reached precision
= 0.99 and recall = 0.98. This balance ensures a minimal rate of both false positives and false
negatives.</p>
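      <p>
        Assuming the rows of a confusion matrix index the true class and the columns the predicted class (a convention, not stated in the source), the per-class figures above can be recovered as:
      </p>

```python
import numpy as np

def per_class_metrics(cm):
    """Precision/recall per class from a 2x2 confusion matrix (row = true, col = predicted)."""
    metrics = {}
    for c, name in enumerate(["real", "fake"]):
        tp = cm[c][c]
        precision = tp / cm[:, c].sum()  # over everything predicted as class c
        recall = tp / cm[c, :].sum()     # over everything truly of class c
        metrics[name] = (precision, recall)
    return metrics
```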
      <p>From the perspective of computational cost, training the LoRA adapter together with
finetuning the Bi-GRU in the proposed model required approximately 778 seconds. While this is
significantly longer than the training time for linear models (Logistic Regression and Linear SVM ≈
105 s; Bernoulli NB and K-NN ≈ 88 s), our model delivers superior classification quality, as
demonstrated in Table 2. In comparison to full fine-tuning of large transformers – where training
can take hours – the choice of TinyBERT reduced the computational load and overall training time
considerably (778 s).</p>
      <p>Regarding inference, TinyBERT with cross-attention and Bi-GRU achieves an average latency of
about 5 ms on a Tesla T4 GPU, making the model well-suited for integration into real-time services.</p>
      <p>This compromise between accuracy and resource consumption positions TinyBERT +
Cross-Attention + LoRA + Bi-GRU as an attractive solution for real-time fact-checking tasks.</p>
      <p>In the context of confusion matrices, it is evident that classical algorithms handle the trade-off
between false positives and false negatives differently. For instance, in the case of logistic
regression (Fig. 2), the metrics appear symmetrical; however, its recognition accuracy is
approximately 3% lower than that of our proposed model. A drawback of logistic regression is that
it performs effectively only when the data distribution is close to linearly separable and fails to
capture complex n-gram interactions.</p>
      <p>For Linear SVM (Fig. 3), the model reliably avoids missing fake news but fails to identify
approximately 14% of real cases, as reflected in the false-negative classifications shown in the
upper-right corner of the confusion matrix.</p>
      <p>Bernoulli NB (Fig. 4) exhibits an almost uniform distribution of errors but overall suffers from
overly generalized and partially excessive predictions. In contrast, K-NN with k = 7 (Fig. 5)
frequently confuses feature-similar examples, resulting in a blurred diagonal in the confusion
matrix.</p>
      <p>In the case of Random Forest (Fig. 6), a much clearer diagonal can be observed; however, some
real posts are occasionally labeled as fake due to the model’s excessive sensitivity to feature
correlations. Overall, the model achieved an accuracy comparable to logistic regression, at around
96%. Nevertheless, it exhibited the longest training time among the classical approaches because of
the large number of hyperparameters involved. Gradient Boosting (Fig. 7) demonstrates balanced
behavior, although the model is not free from false-positive classifications.</p>
      <p>XGBoost (Fig. 8) produced the weakest result, classifying nearly all posts as fake. However, it
should be noted that the training was conducted using default hyperparameters, and the model was
not specifically tuned for this task.</p>
      <p>In contrast, the confusion matrix for TinyBERT + Cross-Attention + LoRA + Bi-GRU (Fig. 9) is
nearly perfect: both real and fake news are classified correctly in 99% of cases, with only a minimal
number of false positives and false negatives.</p>
      <p>Limitations and Future Directions. Despite the high classification performance, this study has
several limitations that must be considered when interpreting the results and planning system
deployment. First, the experimental evaluation relied exclusively on the English-language Fake
News Classification corpus; therefore, the transferability of the model to texts in other languages or
cultural contexts remains uncertain. Second, the classification task was limited to textual features
only: multimodal data such as images, videos, or graphical metadata – elements that could provide
critical cues for distinguishing fake from real – were not included. Although the TinyBERT
architecture with cross-attention has strong potential for multimodal integration, combining text
with images, videos, or graphical metadata could significantly improve fake news detection, albeit
at the cost of higher computational demands. It should also be noted that training was conducted
on a single Tesla T4 GPU, which poses challenges for scaling to larger models or datasets.</p>
      <p>The further development of the proposed architecture envisions extending its capabilities and
enhancing efficiency under real-world application scenarios. A promising direction involves
experimenting with multimodal data by fusing textual information with additional sources such as
images, videos, and graphical metadata. Integrating these modalities into the model (e.g., through
cross-modal attention mechanisms) would improve classification accuracy and enable deeper
analysis of content – particularly in cases where fake news is accompanied by manipulative visual
material.</p>
      <p>Another important direction is optimizing the model for portable devices and embedded
systems. Planned research includes exploring different quantization techniques to reduce
computational overhead and inference time without significant accuracy loss. Specifically,
post-training INT8 quantization is considered for the entire model, while lower-precision formats such
as FP8 or INT4 may be applied to computationally intensive components. Implementing these
approaches with platform-specific tools – TensorFlow Lite with NNAPI for Android, Core ML for
iOS, and ONNX Runtime Mobile for ARM devices – would accelerate model execution and reduce
memory consumption.</p>
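      <p>
        As a hedged illustration of the post-training INT8 idea, PyTorch's dynamic quantization can shrink the linear layers of a TinyBERT-sized head. This is a sketch on a stand-in model, not the planned deployment path through TensorFlow Lite, Core ML, or ONNX Runtime Mobile; the layer sizes are assumptions.
      </p>

```python
import torch

# Stand-in for a TinyBERT-sized classification head (312 = TinyBERT hidden size).
model = torch.nn.Sequential(
    torch.nn.Linear(312, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

# Post-training dynamic quantization: weights stored in INT8, activations in FP32.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 312))
```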
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The proposed compact neural network architecture – TinyBERT with Cross-Attention, LoRA, and
Bi-GRU – demonstrated high effectiveness in the task of automatic fake news detection. The
obtained results significantly outperformed classical machine learning models (logistic regression,
SVM, Bernoulli Naïve Bayes, k-NN, random forest, gradient boosting, and XGBoost), achieving 99%
accuracy, a weighted F1-score of 0.985, and a ROC AUC of 0.998.</p>
      <p>The main advantage of the developed model lies in its ability to explicitly detect inconsistencies
between the headline and the body of a news item through the use of the cross-attention
mechanism. This approach substantially improves classification accuracy, particularly in the
context of widespread clickbait headlines, which traditional neural models are not always able to
interpret correctly.</p>
      <p>The application of Low-Rank Adaptation (LoRA) significantly reduced the computational
resources required for fine-tuning the model while preserving performance – an essential
advantage for practical deployment on devices with limited processing power.</p>
      <p>The addition of a bidirectional recurrent Bi-GRU layer to the transformer backbone preserved
sequential context and enabled effective aggregation of important semantic features, thereby
improving both classification quality and interpretability of the results.</p>
      <p>Practical benefits of the proposed model include its suitability for integration into real-world
fact-checking services thanks to a low inference time (5 ms on a Tesla T4 GPU), the ability to be
quickly fine-tuned for new topics and linguistic contexts, and a high level of decision
interpretability via attention heatmaps and Grad-CAM.</p>
      <p>Comparison of confusion matrices showed that the proposed model effectively balanced
precision and recall, producing fewer misclassifications compared to other algorithms.</p>
      <p>In conclusion, the TinyBERT + Cross-Attention + LoRA + Bi-GRU model is a highly effective
and resource-optimized solution, well-suited for integration into real-time fake news detection
systems, with considerable potential for further development and adaptation to new application
scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The research was carried out with the grant support of the National Research Fund of Ukraine
"Development of an Information System for Automatic Detection of Disinformation Sources and
Inauthentic User Behavior in Chats", project registration number 33/0012 from 3/03/2025
(2023.04/0012).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>
        During the preparation of this work, the authors used Grammarly for grammar and spelling
checking. After using this service, the authors reviewed and edited the content as needed and take
full responsibility for the publication’s content.
      </p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>
[6] K. Shu, et al., Hierarchical propagation networks for fake news detection: Investigation and
exploitation, in: Proceedings of the International AAAI Conference on Web and Social Media,
vol. 14, 2020, pp. 626–637. doi:10.1609/icwsm.v14i1.7329.
[7] Q. Chang, X. Li, Z. Duan, Graph global attention network with memory: A deep learning
approach for fake news detection, Neural Netw. (2024) 106115. doi:10.1016/j.neunet.2024.106115.
[8] E. Qawasmeh, M. Tawalbeh, M. Abdullah, Automatic identification of fake news using deep
learning, in: 2019 Sixth International Conference on Social Networks Analysis, Management
and Security (SNAMS), Granada, Spain, 22–25 October 2019. doi:10.1109/snams.2019.8931873.
[9] Y. Dun, et al., KAN: Knowledge-aware attention network for fake news detection, in:
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35 (1), 2021, pp. 81–89.
doi:10.1609/aaai.v35i1.16080.
[10] J. Alghamdi, Y. Lin, S. Luo, Enhancing hierarchical attention networks with CNN and stylistic
features for fake news detection, Expert Syst. Appl. (2024) 125024. doi:10.1016/j.eswa.2024.125024.
[11] W. Zhang, et al., Cross-attention multi-perspective fusion network based fake news
censorship, Neurocomputing (2024) 128695. doi:10.1016/j.neucom.2024.128695.
[12] S. Maham, et al., ANN: Adversarial news net for robust fake news classification, Sci. Rep. 14 (1)
(2024). doi:10.1038/s41598-024-56567-4.
[13] D. Zhou, et al., GS2F: Multimodal fake news detection utilizing graph structure and guided
semantic fusion, ACM Trans. Asian Low-Resour. Lang. Inf. Process. (2024). doi:10.1145/3708536.
[14] S. Kuntur, et al., Under the influence: A survey of large language models in fake news
detection, IEEE Trans. Artif. Intell. (2024) 1–21. doi:10.1109/tai.2024.3471735.
[15] M.M. Kabir, et al., CUET-NLP_MP@DravidianLangTech 2025: A transformer and LLM-based
ensemble approach for fake news detection in Dravidian, in: Proceedings of the Fifth
Workshop on Speech, Vision, and Language Technologies for Dravidian Languages,
Albuquerque, New Mexico, USA, 2025, pp. 420–426. doi:10.18653/v1/2025.dravidianlangtech-1.75.
[16] V. Vysotska, O. Markiv, D. Svyshch, L. Chyrun, S. Chyrun, R. Romanchuk, Fake news
identification based on NLP, big data analysis and deep learning technology, in: Proc. 2024
IEEE 17th Int. Conf. on Advanced Trends in Radioelectronics, Telecommunications and
Computer Engineering (TCSET), IEEE, 2024, pp. 199–204.
[17] V. Vysotska, K. Przystupa, L. Chyrun, S. Vladov, Y. Ushenko, D. Uhryn, Z. Hu, Disinformation,
fakes and propaganda identifying methods in online messages based on NLP and machine
learning methods, Int. J. Comput. Netw. Inf. Secur. 16 (5) (2024) 57–85.
[18] V. Vysotska, K. Przystupa, Y. Kulikov, S. Chyrun, Y. Ushenko, Z. Hu, D. Uhryn, Recognizing
fakes, propaganda and disinformation in Ukrainian content based on NLP and machine
learning technology, Int. J. Comput. Netw. Inf. Secur. 17 (1) (2025) 92–127.
[19] V. Vysotska, S. Mazepa, L. Chyrun, O. Brodyak, I. Shakleina, V. Schuchmann, NLP tool for
extracting relevant information from criminal reports or fakes/propaganda content, in: Proc.
2022 IEEE 17th Int. Conf. on Computer Sciences and Information Technologies (CSIT), IEEE,
2022, pp. 93–98.
[20] V. Vysotska, L. Chyrun, S. Chyrun, I. Holets, Information technology for identifying
disinformation sources and inauthentic chat users’ behaviours based on machine learning,
CEUR Workshop Proc. 3723 (2024) 466–483.
[21] Fake news classification, Kaggle dataset, 8 October 2023. Available at:
https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoon</surname>
          </string-name>
          , et al.,
          <article-title>Learning to detect incongruence in news headline and body text via a graph neural network</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>36195</fpage>
          -
          <lpage>36206</lpage>
          . doi:10.1109/access.2021.3062029.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiao</surname>
          </string-name>
          , et al.,
          <article-title>TinyBERT: Distilling BERT for natural language understanding</article-title>
          , in:
          <source>Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          , Stroudsburg, PA, USA,
          <year>2020</year>
          . doi:10.18653/v1/2020.findings-emnlp.372.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          , et al.,
          <article-title>LoRA-Adv: Boosting text classification in large language models through adversarial low-rank adaptations</article-title>
          ,
          <source>IEEE Access</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          . doi:10.1109/access.2025.3579539.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Taher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moussaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moussaoui</surname>
          </string-name>
          ,
          <article-title>Automatic fake news detection based on deep learning, FastText and news title</article-title>
          ,
          <source>Int. J. Adv. Comput. Sci. Appl.</source>
          <volume>13</volume>
          (
          <issue>1</issue>
          ) (
          <year>2022</year>
          ). doi:10.14569/ijacsa.2022.0130118.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <article-title>SADHAN: Hierarchical attention networks to learn latent aspect embeddings for fake news detection</article-title>
          ,
          <source>in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR '19)</source>
          , New York, USA,
          <year>2019</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>204</lpage>
          . doi:10.1145/3341981.3344229.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>