<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Misrepresentation to Quantities: Labeling Misinformation Types in South Indian Language Summaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachana Nagaraju</string-name>
          <email>rachananagaraju20@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hosahalli Lakshmaiah Shashirekha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <addr-line>Mangalore, Karnataka</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>The widespread adoption of Large Language Models (LLMs) has introduced new risks associated with the fluent generation of factually incorrect or misleading content. Addressing this challenge requires fine-grained tools capable of not only detecting misinformation but also distinguishing among types of factual errors. The Prompt RecOvery for MisInformation Detection (PROMID)-2025 shared task, held as part of the Forum for Information Retrieval Evaluation (FIRE)-2025, directly targets this issue by encouraging systems that analyze and classify incorrectness in LLM-generated outputs. We, team MUCS, participated in Subtask 2, which requires classifying LLM-generated summaries into one of four predefined misinformation categories: misrepresentation, fabrication, false attribution, and incorrect quantities. The subtask is challenging due to the semantic similarity between classes and the multilingual setting that includes Kannada, Malayalam, Tamil, and Telugu. We propose three multilingual Deep Learning (DL) pipelines: (i) Bidirectional Long Short Term Memory (BiLSTM) model, (ii) Transformer + BiLSTM hybrid model, and (iii) Bidirectional Gated Recurrent Unit (BiGRU) model, to categorize each data point into one of the four different categories based on the presence of factual incorrectness in the summaries. Each model employs language-aware pre-processing, subword-aware tokenization, and contextual encoders tailored to sequence modeling. The Transformer + BiLSTM model integrates transformer encoders, BiLSTM layers, and multi-head self-attention to capture both global and local dependencies. In contrast, the BiLSTM and BiGRU models use simpler recurrent architectures combined with attention mechanisms to reduce computational overhead. To address mild class imbalance, we apply focal loss during training along with mixed-precision optimization for efficiency. The proposed models obtained their best performances with the BiLSTM model ranking 1st in both Tamil and Telugu, the BiGRU model ranking 2nd in Kannada, and the Transformer + BiLSTM model securing a top-3 position in Malayalam. These results demonstrate the utility of hybrid neural modeling and linguistically-informed pre-processing for multilingual misinformation classification in LLM-generated content.</p>
      </abstract>
      <kwd-group>
        <kwd>Misinformation detection</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Transformer-BiLSTM</kwd>
        <kwd>Multilingual NLP</kwd>
        <kwd>South Indian languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The emergence of powerful LLMs such as Generative Pre-trained Transformer (GPT)-4, Pathways
Language Model (PaLM), and Large Language Model Meta AI (LLaMA) has significantly transformed
the landscape of natural language generation. These models exhibit unprecedented fluency, contextual
awareness, and language understanding across a wide range of tasks. However, their susceptibility to
generating factually incorrect outputs, commonly referred to as hallucinations, has become a matter of
growing concern [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. In many use cases, including news summarization, health advisories, or
educational content generation, such hallucinated information can mislead users, propagate misinformation,
and erode trust in Artificial Intelligence (AI) systems.
      </p>
      <p>
        Combating misinformation in LLM-generated content is not only a matter of detecting falsehoods but
also of understanding their origin. Many instances of misinformation can be traced back to ambiguous or
misleading prompts that guide the model to generate specific types of incorrect summaries. Traditional
misinformation detection systems often neglect this generative aspect, focusing on surface-level textual
features or post-analysis of static web content [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To address this gap, PROMID [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] - a shared task at
FIRE 2025, aims to foster research on detecting and analyzing misinformation through the lens of both
its content and its generative context. PROMID-2025 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] shared task contains three subtasks: Subtask 1:
Prompt Recovery from LLM generated misinformative text, Subtask 2: Misinformation Detection in
LLM generated text, and Subtask 3: Misinformation Detection in social media texts, and we participated
in only Subtask 2. Given a piece of LLM-generated text with misinformation, the objective of Subtask
2 is to categorize each data point into one of the four different categories based on the presence of
factual incorrectness in the summaries. The four fine-grained classes of factual incorrectness are:
misrepresentation, fabrication, false attribution, and incorrect quantities, and Subtask 2 [
        <xref ref-type="bibr" rid="ref6">6</xref>
] is offered
in four South Indian languages - Kannada, Malayalam, Tamil, and Telugu. This demands a nuanced
understanding of not only the summary but also its deviation from grounded source material. The
task becomes even more complex in a multilingual setting, particularly across the given low-resource,
morphologically rich South Indian languages.
      </p>
      <p>We, team MUCS, address the challenges of Subtask 2 by designing a language-agnostic misinformation
classification framework capable of capturing both contextual semantics and syntactic nuances that
differentiate hallucinated content types. We develop three end-to-end multilingual DL pipelines: (i)
BiLSTM model, (ii) Transformer + BiLSTM hybrid model, and (iii) BiGRU model, to categorize each
data point into one of the four different categories based on the presence of factual incorrectness in the
summaries. Each pipeline incorporates robust text pre-processing, custom subword-aware tokenization,
and neural encoders designed to classify misinformation into one of four predefined categories across
four South Indian languages - Kannada, Malayalam, Tamil, and Telugu. The architectures vary in
complexity: the Transformer + BiLSTM model includes transformer encoders, multi-head self-attention,
and BiLSTM layers, while the BiLSTM and BiGRU models use purely recurrent layers along with
attention mechanisms tailored to their depth. These configurations allow the models to learn both
global dependencies and local sequential patterns, which are critical for distinguishing subtle
misinformation cues. To address the mild class imbalance present in the dataset, we apply the focal loss function with
class-aware weighting. Additionally, mixed-precision training is used for computational efficiency in
deeper configurations.</p>
      <p>
        Performances of the proposed models varied across languages depending on the architecture. The
BiLSTM model attained Rank 1 in both Tamil and Telugu, the BiGRU model achieved Rank 2 in Kannada,
and the Transformer + BiLSTM model secured a consistent top-3 position in Malayalam. These results
underscore the benefits of combining linguistic preconditioning, cross-lingual representation learning,
and architecture-specific modeling choices for classifying misinformation. Overall, this work contributes
to the broader goal of developing robust and prompt-resilient misinformation detection systems for
LLM-generated content [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The subsequent sections of this paper detail the related works (Section 2), methodology (Section 3),
experiments, results, and implications of our approach (Section 4), followed by the conclusion and future
work (Section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Recent advancements in natural language generation have intensified efforts to detect hallucinated or
misleading content automatically. Misinformation classification, particularly in LLM-generated text,
intersects with factuality evaluation, hallucination detection, prompt sensitivity, and deep contextual
modeling. Researchers have proposed a broad spectrum of models, evaluation metrics, and benchmarks
to address these challenges, ranging from Question Answer (QA) based fidelity evaluation to fine-grained
hallucination categorization.</p>
      <p>
        Ji et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] present a foundational taxonomy of hallucinations, categorized into intrinsic (information
not grounded in the source) and extrinsic (conflicting with source content). The authors systematically
evaluated summarization models, notably BART and PEGASUS, on CNN/DailyMail, XSum, and WebNLG
datasets. They observed that ROUGE, a commonly used metric, is incapable of penalizing factual
mismatches, thereby overestimating performance. The authors also reviewed recent evaluation tools
- FactCC, DAE, and QuestEval, finding their F1-based factuality improvements modest compared to
human evaluation. While the survey is comprehensive, it remains focused on English data and does not
cater to multilingual or low-resource hallucination scenarios. Zhou et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] focus on hallucination
phenomena in large-scale instruction-tuned models like GPT-2, GPT-3, and T5. The authors identified
various failure cascades, especially under weak prompt specification and retrieval-agnostic generation.
They test strategies such as Retrieval-Augmented Generation (RAG) and fact-aware k-nearest neighbor
decoding on datasets like WebGPT and TriviaQA, achieving 10–12% reductions in factual errors. While
the interventions offer measurable improvements, the solutions are tightly coupled with access to
document-level retrieval systems, posing scalability issues in disconnected or unknown knowledge
domains.
      </p>
      <p>
        Manakul and Gales [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduced SelfCheckGPT, a zero-reference evaluation method that detects
hallucinations by measuring variation between multiple LLM responses under the same input. Evaluated
across XSum, Wikibio, and CommonGen, SelfCheckGPT outperforms supervision-based tools like
DAE and QAEval, achieving an average F1 score of 0.72. Because SelfCheckGPT treats the LLM as
a black box, it is suitable for commercial models (e.g., ChatGPT) where internal architectures are
inaccessible. Nevertheless, the model’s reliability depends on generative diversity: hallucinations
may go undetected if the model consistently repeats incorrect outputs. Rashkin et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose
TruthfulQA, a benchmark system built to expose models to adversarial factual errors through misleading
or underspecified prompts. They show that even few-shot GPT-3 achieves only 58% accuracy when
prompted with deceptive or ambiguous inputs. Their analysis reveals the models’ alignment with
misinformation commonly found on the web due to pretraining. TruthfulQA is useful in diagnosing
hallucination types related to real-world disinformation, but its design is task-specific—limited to QA
rather than open-ended summarization or generation contexts.
      </p>
      <p>
        Gupta and Vishwakarma [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] address fine-grained misinformation detection in the context of
COVID-19 tweets. They apply BERT and RoBERTa classifiers to a custom-annotated dataset with misinformation
categories that include denial, satire, and conspiracy. Achieving macro F1 scores of up to 0.76, their
approach significantly outperformed traditional linear and tree-based models like Support Vector Machine
(SVM) and Decision Trees. However, class imbalance and semantic overlap between misinformation
types reduce per-class precision and recall, particularly for satire vs. sarcasm. Furthermore, the dataset
is monolingual and domain-specific, restricting generalizability. Wright and Pavlick [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] present Factool
- a factuality evaluation tool built for summarization. It integrates dependency-based heuristics and
semantic entailment checks into a unified scoring mechanism. On XSum and CNN/DailyMail (evaluated
against human annotations), Factool improves alignment scores by roughly 15% over BART and T5
outputs. Its integration of linguistic parsing provides control over soft factual inconsistencies. However,
the tool requires high-quality syntactic parsers and cannot be easily deployed in non-English or noisy
language settings, limiting its scalability.
      </p>
      <p>
        Alam et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] investigated misinformation detection in multilingual settings using multilingual
BERT and XLM-R. Applied to COVID-19 claim datasets in Arabic, Hindi, Tamil, and Bengali, the models
demonstrate that fine-tuned multilingual encoders outperform machine-translated pipelines by a margin
of 12–15% F1. While the work confirms that transfer learning can be beneficial across low-resource
setups, it is constrained to binary classification and does not integrate type-level misinformation tags like
hallucinated quantities or misattribution. Krishna et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] analyzed factual consistency by comparing
model output (T5, PEGASUS) on summarization datasets (SAMSum, XSum) using both lexical and
semantic evaluation frameworks. They illustrated that hallucinations, especially false attribution and
fabrication, often go unpunished by standard metrics (BLEU, ROUGE), with up to 40% of incorrect
content rated acceptable due to lexical overlap. They introduce a claim-type tagging scheme combined
with human judgment alignment but stop short of offering a machine-learnable model for hallucination
classification. Min et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] propose FactScore - a sentence-level hallucination evaluator combining
dense retrieval and entailment verification. Initially applied to SciFact and PubMedQA, it achieves 85%
factual agreement without requiring reference summaries. FactScore is particularly advantageous in
domain-specific settings such as biomedical summarization. Yet, its performance in multilingual or
informal narratives remains unexplored, limiting its utility in non-academic LLM-generated content
domains.
      </p>
      <p>
        Razeghi et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] explored the impact of prompt template variations on GPT-3’s ability to produce
accurate answers and observed a variability of up to 20% in task accuracy (SuperGLUE and ARC)
under slight changes to input phrasing. Their study demonstrates that prompts themselves introduce
biases and instability, often leading to contradictory answers. While insightful, the study focuses
primarily on zero-shot settings and does not extend the prompt variation analysis to hallucination
detection. Augenstein et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] present a fine-grained taxonomy of hallucination classes—such as
numeric misrepresentation, source fabrication, and attribute drift—and apply the framework to GPT-3
outputs across QA, summarization, and text generation. Human annotations reveal that 60% of analyzed
outputs contain overlapping hallucination types. The study identifies multi-layered error patterns but
does not provide an automated model to perform this categorization, leaving the system useful largely
as an annotation benchmark.
      </p>
      <p>
        Collectively, these studies offer strong foundational insights into hallucination detection and
misinformation classification. Yet, challenges remain with respect to multilingual support, hallucination
attribution granularity, prompt variability, and label-level explainability. Although SelfCheckGPT and
FactScore contribute to progress in zero-reference evaluation, and taxonomy efforts enhance
annotation clarity [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], existing systems still struggle with cross-lingual generative hallucinations that lack
structured evidence or precise prompts. Bridging these gaps is critical for the scalable and reliable
deployment of LLMs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section describes the three proposed multilingual DL pipelines. Each pipeline processes summaries
and source articles in South Indian languages - Kannada, Malayalam, Tamil, and Telugu, and assigns one
of the four misinformation class labels to the given input. The models differ in tokenization, sequence
encoders, attention mechanisms, and optimization strategies. Each pipeline follows a general structure
of: (i) pre-processing the input, (ii) tokenization, (iii) sequence encoding with neural layers, and (iv)
classification. The proposed end-to-end multilingual DL pipelines for Subtask 2 are shown in Figure 1
and a description of the models is given in the following sub-sections.</p>
      <sec id="sec-3-1">
        <title>3.1. BiLSTM Model</title>
        <p>The BiLSTM model serves as a strong baseline in our system. It combines a deep recurrent encoder
with multi-head self-attention to model sequential dependencies and incorporate global context for
effective hallucination detection. The details of the model are given below:</p>
        <p>• Text Pre-processing:
– URLs and email addresses are removed using regular expressions.
– Punctuation characters (e.g., !"#$...~) are stripped.
– Whitespace is normalized by collapsing multiple spaces and trimming.
– No long-token removal is used in this basic version.
• Tokenization:
– We employ a BasicTokenizer that splits input text on whitespace characters.
– Language indicator tokens (&lt;ta&gt;, &lt;kn&gt;) are retained.
– Tokens longer than 10 characters are split into fixed-length, non-overlapping 5-character chunks, inspired by Byte-Pair Encoding (BPE) principles (https://huggingface.co/blog/how-to-preprocess).
– Vocabulary is created using tokens with a frequency ≥2 in the training set (a sketch follows this list).</p>
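        <p>To make the pre-processing and tokenization concrete, the following is a minimal Python sketch under the assumptions stated above (function names such as basic_tokenize and build_vocab are illustrative, not the actual implementation):</p>
        <preformat><![CDATA[
import re
from collections import Counter

def preprocess(text: str) -> str:
    """Strip URLs/emails and punctuation, then normalize whitespace.
    Angle brackets are kept so language tags like <ta> and <kn> survive."""
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)      # URLs and emails
    text = re.sub(r"[!\"#$%&'()*+,\-./:;=?@\[\\\]^_`{|}~]", " ", text)  # punctuation
    return re.sub(r"\s+", " ", text).strip()               # collapse spaces

def basic_tokenize(text: str, max_len: int = 10, chunk: int = 5) -> list[str]:
    """Whitespace split; tokens longer than max_len are cut into
    non-overlapping chunk-character pieces (the final piece may be shorter)."""
    tokens = []
    for tok in preprocess(text).split():
        if len(tok) > max_len:
            tokens.extend(tok[i:i + chunk] for i in range(0, len(tok), chunk))
        else:
            tokens.append(tok)
    return tokens

def build_vocab(corpus: list[str], min_freq: int = 2) -> dict[str, int]:
    """Index tokens seen at least min_freq times; 0/1 reserved for <pad>/<unk>."""
    counts = Counter(t for text in corpus for t in basic_tokenize(text))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab
]]></preformat>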
        <p>• Model and Training Loop:
– The architecture is based on BiLSTM (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html), a recurrent model that captures both past and future context by processing the sequence in forward and backward directions.
– Each token is embedded into a 256-dimensional vector.
– A linear layer with dropout simulates positional information.
– A 2-layer BiLSTM with 512 hidden units per direction outputs 1024-dimensional contextual embeddings.
– A Multi-Head Self-Attention layer (https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) with 8 heads learns token importance for downstream aggregation.
– Masked average pooling skips padding and compresses the sequence into a fixed-length vector.
– A Multi-Layer Perceptron (MLP) classifier with dimensions 1024 → 512 → 256 → 4 computes the class prediction. MLPs are fully connected layers that use GELU activations [17], LayerNorm, and dropout for regularization.
– We apply Weighted Cross-Entropy Loss, where class weights are computed as the inverse frequency from the training labels.
– Optimization uses the AdamW optimizer [18], an adaptive gradient method with decoupled weight decay.
– Learning rate is controlled via the OneCycleLR scheduler [19], which warms up and gradually cools down the learning rate for better convergence.
– Gradient clipping is applied to prevent exploding gradients.</p>
        <p>– Training is conducted on mini-batches (size 16) for 10 epochs.</p>
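        <p>A minimal PyTorch sketch of this encoder, following the dimensions stated above (class and variable names are ours, and the dropout rates are illustrative assumptions):</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Embedding -> BiLSTM -> multi-head self-attention -> masked
    average pooling -> MLP, as described in Section 3.1."""
    def __init__(self, vocab_size: int, num_classes: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256, padding_idx=0)
        self.pos = nn.Sequential(nn.Linear(256, 256), nn.Dropout(0.1))
        self.lstm = nn.LSTM(256, 512, num_layers=2, bidirectional=True,
                            batch_first=True, dropout=0.1)
        self.attn = nn.MultiheadAttention(embed_dim=1024, num_heads=8,
                                          batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(1024, 512), nn.GELU(), nn.LayerNorm(512), nn.Dropout(0.2),
            nn.Linear(512, 256), nn.GELU(), nn.LayerNorm(256), nn.Dropout(0.2),
            nn.Linear(256, num_classes))

    def forward(self, ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # ids: (B, T) token ids; mask: (B, T), 1 for real tokens, 0 for padding
        x = self.pos(self.embed(ids))                     # (B, T, 256)
        x, _ = self.lstm(x)                               # (B, T, 1024)
        x, _ = self.attn(x, x, x, key_padding_mask=~mask.bool())
        m = mask.unsqueeze(-1).float()
        pooled = (x * m).sum(1) / m.sum(1).clamp(min=1)   # masked average pooling
        return self.mlp(pooled)                           # (B, num_classes) logits
]]></preformat>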
        <p>This model provides a balance between performance and computational simplicity, offering strong results
across two languages while using a purely recurrent encoder architecture. Its relatively lightweight
design makes it particularly effective for resource-constrained settings.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Transformer + BiLSTM Model</title>
        <p>This hybrid model combines a Transformer Encoder with a BiLSTM network, enabling a fusion of global
attention-based contextual modeling and sequential dependency learning. It is the most architecturally
complex of the three approaches. The details of the model are given below:</p>
        <p>• Text Pre-processing:
– Same as in the BiLSTM model, but includes a filter to discard tokens with length ≥50, since extreme-length tokens often indicate noise from LLM-generated or malformed inputs.
• Tokenization:
– The EnhancedTokenizer splits on whitespace and segments tokens &gt;10 characters using overlapping 5-character windows with stride 3 (a sketch follows this list).
– This overlapping subword approach captures richer morphological structures for agglutinative languages.
– Vocabulary is constructed with min_freq = 2.</p>
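        <p>A sketch of the overlapping-window segmentation, assuming the window and stride values stated above (the tail handling is our choice; the actual EnhancedTokenizer may differ):</p>
        <preformat><![CDATA[
def enhanced_tokenize(text: str, max_len: int = 10,
                      window: int = 5, stride: int = 3) -> list[str]:
    """Whitespace split; tokens longer than max_len are segmented into
    overlapping window-character pieces with the given stride, so adjacent
    pieces share characters (useful for agglutinative morphology)."""
    tokens = []
    for tok in text.split():
        if len(tok) > max_len:
            pieces = [tok[i:i + window]
                      for i in range(0, len(tok) - window + 1, stride)]
            if (len(tok) - window) % stride:     # cover trailing characters
                pieces.append(tok[-window:])
            tokens.extend(pieces)
        else:
            tokens.append(tok)
    return tokens

# e.g. a 12-character token yields pieces starting at offsets 0, 3, 6 and 7
]]></preformat>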
        <p>• Model and Training Loop:
– Tokens are mapped to 384-dimensional embeddings plus learned positional embeddings.
– A 3-layer Transformer Encoder applies multi-head self-attention and feed-forward layers to model global context.
– Outputs are passed to a 2-layer BiLSTM (384 hidden units per direction).
– The two representations (Transformer + BiLSTM) are concatenated and projected down to 768-d using a linear layer.
– After dropout and LayerNorm, masked average pooling yields a fixed vector.
– An MLP classifier (768 → 384 → 4) predicts the class (a sketch of the full model follows this list).</p>
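        <p>The bullet points above can be read as the following PyTorch sketch; the wiring is our interpretation, and the number of attention heads and the dropout rates are assumptions not stated in the text:</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    """Transformer encoder + BiLSTM, fused by concatenation and a
    768-d projection, then masked average pooling and an MLP head."""
    def __init__(self, vocab_size: int, max_len: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 384, padding_idx=0)
        self.pos = nn.Embedding(max_len, 384)               # learned positions
        layer = nn.TransformerEncoderLayer(d_model=384, nhead=8,
                                           dim_feedforward=1536,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.lstm = nn.LSTM(384, 384, num_layers=2, bidirectional=True,
                            batch_first=True, dropout=0.1)
        self.proj = nn.Linear(384 + 768, 768)               # fuse both views
        self.norm = nn.LayerNorm(768)
        self.drop = nn.Dropout(0.2)
        self.head = nn.Sequential(nn.Linear(768, 384), nn.GELU(),
                                  nn.Dropout(0.2), nn.Linear(384, 4))

    def forward(self, ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        pos = torch.arange(ids.size(1), device=ids.device)
        x = self.embed(ids) + self.pos(pos)                 # (B, T, 384)
        t = self.encoder(x, src_key_padding_mask=~mask.bool())
        r, _ = self.lstm(t)                                 # (B, T, 768)
        fused = self.norm(self.drop(self.proj(torch.cat([t, r], dim=-1))))
        m = mask.unsqueeze(-1).float()
        pooled = (fused * m).sum(1) / m.sum(1).clamp(min=1) # masked avg pooling
        return self.head(pooled)                            # (B, 4) logits
]]></preformat>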
        <p>• Loss and Optimization:
– We use FocalLoss [20] with focusing parameter γ = 2.0 and class weighting α (a sketch follows this list).
– It improves learning by penalizing easy examples and focusing on hard-to-classify ones.
– Optimizer: AdamW with weight decay.
– Scheduler: OneCycleLR with warm-up phase.
– Mixed-precision training is enabled via Automatic Mixed Precision (AMP, https://pytorch.org/docs/stable/amp.html).
– Gradient accumulation is used to emulate larger batch sizes.</p>
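        <p>A minimal sketch of the focal loss [20] with per-class weights, matching the γ = 2.0 and inverse-frequency α used here (the α normalization shown is one common choice, not necessarily ours):</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multi-class focal loss: scales the (class-weighted) cross-entropy
    by (1 - p_t)^gamma so easy examples contribute little to the loss."""
    def __init__(self, alpha: torch.Tensor, gamma: float = 2.0):
        super().__init__()
        self.register_buffer("alpha", alpha)   # (num_classes,) class weights
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        weighted_ce = F.cross_entropy(logits, target, weight=self.alpha,
                                      reduction="none")   # -alpha_t * log(p_t)
        p_t = torch.exp(-F.cross_entropy(logits, target, reduction="none"))
        return ((1.0 - p_t) ** self.gamma * weighted_ce).mean()

# alpha from inverse class frequencies of the training labels, e.g.:
# counts = torch.bincount(train_labels, minlength=4).float()
# alpha = counts.sum() / (4 * counts)
]]></preformat>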
        <p>This architecture benefits from the transformer’s ability to capture long-range dependencies and the
BiLSTM’s ability to maintain sequential coherence. It performed robustly on morphologically rich
scripts, demonstrating its suitability for complex linguistic structures.</p>
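        <p>For completeness, a sketch of the mixed-precision training step with gradient accumulation described above; model, loader, criterion, optimizer, and scheduler are assumed to be defined elsewhere, and the accumulation factor and clipping norm are illustrative:</p>
        <preformat><![CDATA[
import torch

scaler = torch.cuda.amp.GradScaler()
ACC_STEPS = 4                                    # emulate a 4x larger batch

for step, (ids, mask, labels) in enumerate(loader):
    with torch.cuda.amp.autocast():              # mixed-precision forward pass
        loss = criterion(model(ids, mask), labels) / ACC_STEPS
    scaler.scale(loss).backward()                # scaled gradients accumulate
    if (step + 1) % ACC_STEPS == 0:
        scaler.unscale_(optimizer)               # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)                   # skips the step on inf/NaN
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()                         # OneCycleLR steps per batch
]]></preformat>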
      </sec>
      <sec id="sec-3-3">
        <title>3.3. BiGRU Model</title>
        <p>The BiGRU model is a lightweight architecture that discards transformer components in favor of a
purely recurrent encoder with attention. It serves as a strong baseline in low-resource scenarios. The
details of the model are given below:</p>
        <p>• Text Pre-processing:
– Same as in the Transformer + BiLSTM model: URL and punctuation removal, space normalization, and dropping tokens &gt;50 characters.
• Tokenization:
– SimpleTokenizer applies minimal whitespace-based splitting.
– No subword segmentation is used (baseline).
• Model:
– 256-dimensional token embeddings are processed by a 2-layer Bidirectional GRU (BiGRU, https://pytorch.org/docs/stable/generated/torch.nn.GRU.html) with 512 units per direction.
– Additive attention computes token-level importance using: Linear(1024 → 512) → Tanh → Linear(512 → 1) followed by Softmax (a sketch follows this list).
– A weighted sum over token vectors produces a document representation.
– MLP classifier: 1024 → 512 → 256 → 4.
• Loss and Optimization:
– FocalLoss with γ = 2.0 and class-weighted α.
– AdamW optimizer.
– CosineAnnealingLR scheduler (https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html) reduces the learning rate smoothly.
– The model is trained with batch size 32; the best checkpoint is chosen based on the lowest loss.</p>
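        <p>The additive attention above corresponds to the following sketch (layer and class names are ours):</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Linear(1024 -> 512) -> Tanh -> Linear(512 -> 1) -> Softmax,
    followed by a weighted sum over the BiGRU token vectors."""
    def __init__(self, hidden: int = 1024, proj: int = 512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(hidden, proj), nn.Tanh(),
                                   nn.Linear(proj, 1))

    def forward(self, h: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # h: (B, T, 1024) BiGRU outputs; mask: (B, T), 1 for real tokens
        scores = self.score(h).squeeze(-1)                     # (B, T)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (B, T, 1)
        return (weights * h).sum(dim=1)                        # (B, 1024) document vector
]]></preformat>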
        <p>Due to its simplicity, fast convergence, and minimal memory overhead, the BiGRU model is ideal for
deployment in real-time applications or where computational resources are limited.</p>
        <p>The configuration of hyperparameters used in training these three models is provided in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>This section presents the empirical evaluation of our proposed system on Subtask 2 of the PROMID-2025
shared task. We report both quantitative performance scores and qualitative insights. The evaluation
is designed to assess the system’s generalization across diverse linguistic structures in four Indian
languages: Kannada, Malayalam, Tamil, and Telugu. Each model is evaluated using Precision, Recall,
Macro F1-score, and overall classification Accuracy. A description of the dataset, followed by a detailed
analysis of model performance for each language, is provided below.</p>
      <p>Table 2 describes the dataset fields: a short title or headline summarizing the article topic; additional or extended versions of the title, if available; the full ground-truth article body used as the reference source; a hallucinated or misleading summary, typically generated by an LLM; a label with one of four misinformation categories (misrepresentation, fabrication, false attribution, or incorrect quantities); and a human-written, factually accurate summary of the article.</p>
      <sec id="sec-4-1">
        <title>4.1. Data Description</title>
        <p>The multilingual misinformation classification dataset [21, 22] released as part of the PROMID-2025
[23] shared task contains hallucinated summaries generated by LLMs paired with gold-standard articles
in four South Indian languages: Kannada, Malayalam, Tamil, and Telugu. Each summary is labeled with
one of four incorrectness types: misrepresentation, fabrication, false attribution, and incorrect quantities.
The overall category-level distribution across all training data is as follows:
• Misrepresentation: 29.7%
• Fabrication: 25.3%
• False Attribution: 25.3%
• Incorrect Quantities: 19.7%
Table 2 gives the description of the datasets and Table 3 shows the class-wise distribution.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and Analysis</title>
        <p>The models are evaluated using four standard metrics: macro-averaged Precision, Recall, F1-score, and
overall Accuracy, and ranked based on macro-averaged F1-score. We tested the proposed multilingual
models: BiLSTM model, Transformer + BiLSTM hybrid model, and BiGRU-based model, on each
language. Table 4 presents the performance metrics for each model across the four languages, and
Figure 2 offers a visual comparison of team-wise macro F1-scores of all the participating teams in
Subtask 2 across the four languages, with one panel per language: (a) Kannada, (b) Malayalam, (c) Tamil, and (d) Telugu.</p>
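        <p>The reported metrics can be reproduced with scikit-learn as follows (a sketch; y_true and y_pred denote the gold and predicted label ids):</p>
        <preformat><![CDATA[
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Macro-averaged Precision, Recall, and F1 plus overall Accuracy;
    systems are ranked by the macro F1-score."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"precision": p, "recall": r, "macro_f1": f1,
            "accuracy": accuracy_score(y_true, y_pred)}
]]></preformat>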
        <p>Our analysis reveals that the BiLSTM model shows the strongest performance for Tamil and Telugu,
achieving the highest F1-scores and accuracies. This suggests that simpler architectures with adequate
attention mechanisms, when paired with class-weighted loss, can effectively capture essential patterns
in morphologically rich languages. For Kannada, the BiLSTM model attains the best F1-score, whereas
the BiGRU model yields the highest accuracy. This variation suggests that while BiGRU may be better
at general predictions, the BiLSTM model maintains a better precision-recall balance. For Malayalam, the
BiGRU model outperforms both the BiLSTM and Transformer + BiLSTM variants across all metrics,
demonstrating that lightweight architectures can still perform competitively for certain languages.</p>
        <p>Although the dataset is relatively balanced overall, some class imbalance, particularly under the
incorrect quantities category, introduces challenges. Combined with semantic overlap between categories
such as false attribution and fabrication, this affects low-recall cases. Performance differences also
underline the influence of architecture depth, loss strategy, and attention mechanisms. The use of focal
loss with class-aware weighting and sequence-level attention improves outcome reliability, especially
in low-resource, multilingual settings.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>Our multilingual DL system for Subtask 2 of the PROMID-2025 shared task demonstrates strong
performance in classifying LLM-generated summaries into fine-grained misinformation categories across
four South Indian languages: Kannada, Malayalam, Tamil, and Telugu. Leveraging hybrid
transformer-recurrent architectures, attention-based encoding, and class-aware focal loss, the Transformer + BiLSTM
model effectively captures both sequential and contextual signals required to detect subtle factual
inconsistencies. While the attention mechanism and focal loss are employed across all three models,
the hybrid architecture most directly benefits from combined transformer and recurrent components.
The BiLSTM model achieved Rank 1 in Telugu and Tamil, the BiGRU model achieved Rank 2 in Kannada, and
the Transformer + BiLSTM model obtained Rank 3 in Malayalam, based on macro F1-scores submitted
to the official leaderboard. These results reflect not only the adaptability of our architecture across
typologically distinct languages but also the effectiveness of pre-processing strategies, subword-aware
tokenization, and stable training under a mildly imbalanced class distribution. To further improve the
system, we plan to explore methods for enhancing model interpretability, particularly in generating
explanations for the predicted misinformation types. We also intend to evaluate lightweight model
variants for more efficient deployment in practical applications. These directions are aimed at making
LLM-based misinformation detection more scalable, interpretable, and robust in multilingual scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>In the course of preparing this paper, we made limited use of a generative AI assistant to support the
writing process. The tool was used primarily for language refinement, section structuring, and LaTeX
formatting consistency. All technical content, including experimental design, model implementation,
and results, was conceived, executed, and validated entirely by the authors. The AI assistant did not
contribute novel research ideas, nor did it influence the reported findings. Its role was strictly supportive,
comparable to using grammar checkers or typesetting tools, and all content included in this manuscript
has been carefully reviewed and approved by the authors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Survey of Hallucination in Natural Language Generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , T. Liu,
          <article-title>A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>43</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3703155. doi:10.1145/3703155.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zafarani</surname>
          </string-name>
          ,
          <article-title>Fake News Detection: A Survey</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>53</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shasirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          , G. Pasi, T. Mandl,
          <article-title>Overview of the First Shared Task on Prompt Recovery for Misinformation Detection (PROMID</article-title>
          <year>2025</year>
          ), in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Chakraborty (Eds.), Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation, Varanasi, India</article-title>
          .
          <source>December 17-20</source>
          ,
          <year>2025</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shasirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          , G. Pasi, T. Mandl,
          <article-title>Prompt recovery for misinformation detection at fire 2025, in: Proceedings of the 17th Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          , FIRE '25,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Key Takeaways from the Second Shared Task on Indian Language Summarization (ILSUM</article-title>
          <year>2023</year>
          ), in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2023 -
          <article-title>Forum for Information Retrieval Evaluation (FIRE-WN</article-title>
          <year>2023</year>
          ), Goa, India,
          <source>December 15-18</source>
          ,
          <year>2023</year>
          , volume
          <volume>3681</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>724</fpage>
          -
          <lpage>733</lpage>
          . URL: https://ceur-ws.org/Vol-3681/T8-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Vishwakarma</surname>
          </string-name>
          , Detecting Fine-grained
          <source>Misinformation Categories in Covid-19 Tweets, in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>13557</fpage>
          -
          <lpage>13564</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <article-title>A Survey of Hallucination in Large Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.02123</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Manakul</surname>
          </string-name>
          , M. Gales,
          <article-title>SelfCheckgpt: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL</article-title>
          ),
          <year>2023</year>
          , pp.
          <fpage>8434</fpage>
          -
          <lpage>8451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Rashkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stiennon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>TruthfulQA: Measuring How Models Mimic Human Falsehoods, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL</article-title>
          ),
          <year>2021</year>
          , pp.
          <fpage>3214</fpage>
          -
          <lpage>3229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wright</surname>
          </string-name>
          , E. Pavlick,
          <article-title>Factool: Factuality Evaluation for Abstractive Summarization, in: Findings of the Association for Computational Linguistics: ACL</article-title>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>2217</fpage>
          -
          <lpage>2231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alhindi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          , U. Umer,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Fighting the Covid-19 Infodemic: Modeling the Performance of Multilingual Misinformation Detectors</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6896</fpage>
          -
          <lpage>6911</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Krishna</surname>
          </string-name>
          , L. Ma, G. Durrett,
          <article-title>Faithful or Not? Revisiting Factual Consistency Evaluation in Abstractive Summarization</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          , W.-t. Yih, H. Hajishirzi,
          <article-title>FactScore: Fine-Grained Factuality Scoring for Content Hallucination</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Razeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Logan IV</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Impact of Prompt Formatting and Perturbation on Gpt-3's Zero-Shot Performance, in: Findings of the Association for Computational Linguistics: EMNLP</article-title>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>4649</fpage>
          -
          <lpage>4662</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiseman</surname>
          </string-name>
          ,
          <source>Taxonomy of Hallucinations in Natural Language Processing, Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] D. Hendrycks, K. Gimpel, Gaussian Error Linear Units (GELUs), arXiv preprint arXiv:1606.08415 (2016). URL: https://arxiv.org/abs/1606.08415.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization, 2019. URL: https://arxiv.org/abs/1711.05101. arXiv:1711.05101.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. N. Smith, A Disciplined Approach to Neural Network Hyper-parameters: Part 1 – Learning Rate, Batch Size, Momentum, and Weight Decay, 2018. URL: https://arxiv.org/abs/1803.09820. arXiv:1803.09820.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal Loss for Dense Object Detection, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007. doi:10.1109/ICCV.2017.324.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] G. K. Shahi, Y. Mejova, Too Little, Too Late: Moderation of Misinformation around the Russo-Ukrainian Conflict, in: Proceedings of the 17th ACM Web Science Conference (WebSci ’25), Association for Computing Machinery, 2025. doi:10.1145/3717867.3717876.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] G. K. Shahi, T. A. Majchrzak, AMUSED: An Annotation Framework of Multimodal Social Media Data, in: F. Sanfilippo, O.-C. Granmo, S. Y. Yayilgan, I. S. Bajwa (Eds.), Intelligent Technologies and Applications, Springer International Publishing, Cham, 2022, pp. 287–299.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] S. Satapara, P. Mehta, D. Ganguly, S. Modha, Fighting Fire with Fire: Adversarial Prompting to Generate a Misinformation Detection Dataset, CoRR abs/2401.04481 (2024). URL: https://doi.org/10.48550/arXiv.2401.04481. doi:10.48550/ARXIV.2401.04481. arXiv:2401.04481.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>