<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hallucination Detection and Mitigation in Scientific Text Simplification using Ensemble Approaches: DS@GT at CLEF 2025 SimpleText</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishna Chaitanya Marturi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heba H. Elwazzan</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our methodology for the CLEF 2025 SimpleText Task 2, which focuses on detecting and evaluating creative generation and information distortion in scientific text simplification. Our solution integrates multiple strategies: we construct an ensemble framework that leverages a BERT-based classifier, a semantic similarity measure, a natural language inference model, and large language model (LLM) reasoning. These diverse signals are combined using meta-classifiers to enhance the robustness of spuriousness and distortion detection. Additionally, for grounded generation, we employ an LLM-based post-editing system that revises simplifications based on the original input texts.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Simplification</kwd>
        <kwd>hallucination detection</kwd>
        <kwd>LLMs</kwd>
        <kwd>CLEF 2025</kwd>
        <kwd>SimpleText</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rise of social media’s approach to learning via "quick bites", demand for lay-accessible content
has surged. As a result, there is an increasing need for scientific information to be distilled into short,
concise, and factually accurate content while remaining relatively simple. This combination of criteria
is inherently challenging to satisfy, and there remain many disparate challenges to tackle, such as the
risk of oversimplification, the potential for misinformation, and the loss of critical context.</p>
      <p>Text Simplification has long been a goal in natural language processing, and has recently become a
prominent task in machine learning research. Recognizing the core ideas of a paragraph, extracting
the most pertinent information, and paraphrasing it is no small feat—even for humans. What poses an
obstacle for machine learning techniques is the lack of clear measures for evaluating the "goodness" of
a simplified text. Several metrics have been devised, but they mostly compare the simplification to a
reference human summary, which is subjective in itself.</p>
      <p>With the advent of LLMs, many NLP tasks have benefited, and text simplification is no exception.
However, new problems naturally arose. One such problem is hallucinations, where an LLM generates
spurious information without any basis in the reference or source text. The causes of hallucination
in LLMs have been heavily researched, but remain unclear. This complicates mitigation endeavors,
requiring considerable effort to prevent LLMs from producing misinformation or irrelevant content.</p>
      <p>
        The SimpleText track of CLEF 2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has several tasks focusing on the problem of simplifying scientific
text. In this paper, we detail our submission to its second task, hallucination detection and mitigation
in text simplification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There are three subtasks:
• Task 2.1: Identifying creative generation at document level
• Task 2.2: Detection and classification of information distortion errors in simplified sentences
• Task 2.3: Avoiding creative generation and performing grounded generation by design
In this paper, we present a unified system that tackles the three subtasks of hallucination detection
using a combination of machine learning models and large language models (LLMs). For Tasks 2.1
and 2.2, we use multiple strategies to evaluate whether a simplified sentence is spurious or distorted,
including a fine-tuned BERT classifier, a semantic similarity model that compares embeddings, an
entailment model trained to detect contradictions, and an LLM prompted to act as a reasoning-based
evaluator. The results from these different components are combined using a small neural network
that learns how to make the final decision based on all available signals. For Task 2.3, which involves
generating grounded simplifications, we prompt an LLM to revise simplified text using the original
source as a reference, correcting any inaccurate or unrelated information. This layered setup allows us
to catch different types of hallucinations and enforce factual consistency across tasks.
      </p>
      <p>The paper is organized as follows: Section 2 reviews the state of the art in hallucination detection and
mitigation in text simplification; Section 3 outlines our approach to each subtask; Section 4 presents
and discusses the results; and Section 5 concludes the paper and discusses possible future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Hallucination in natural language generation (NLG) refers to the phenomenon where models produce
content that is not supported or entailed by the input source [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This issue is particularly critical
in scientific text simplification, where factual fidelity is essential. Several prior works have explored
hallucination in the context of machine translation, summarization, and more recently, simplification.
      </p>
      <p>
        Early efforts to detect hallucinations relied on heuristic-based methods, such as lexical overlap or
n-gram precision metrics (e.g., BLEU), which have been shown to be inadequate for capturing semantic
inconsistencies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Recent advances incorporate entailment-based evaluation using natural language
inference (NLI) models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which judge whether the generated text is logically supported by the source.
      </p>
      <p>
        To mitigate hallucination in simplification, researchers have investigated controlled generation
frameworks. These include constrained decoding [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], post-editing using retrieval-based systems [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
and reinforcement learning with factuality-based rewards [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In the context of scientific texts, hallucination is more severe due to the density and complexity of the
source material. Nishino et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed SciTLDR, a dataset for extreme summarization of scientific
papers, and highlighted hallucination issues in LLM outputs. Similarly, Vendeville et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] introduced
fine-grained annotations for information distortion errors in simplification, enabling targeted detection
of hallucinations, overgeneralizations, and contradictions.
      </p>
      <p>
        Large Language Models (LLMs), such as GPT and LLaMA, have shown strong performance in
simplification tasks but remain prone to hallucination [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Techniques such as few-shot prompting,
retrieval-augmented generation (RAG), and post-hoc correction using entailment scores have been used
to reduce hallucinations in their output.
      </p>
      <p>Our work builds on these foundations by combining multiple detection strategies. We further propose
a grounded generation framework where LLMs act as faithful post-editors.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The focus of Task 2 is on identifying and evaluating creative generation and information distortion in
text simplification.</p>
      <sec id="sec-3-1">
        <title>3.1. Spurious Text Detection - Task 2.1</title>
        <p>This task requires identifying whether the input text is spurious or not, both with access to its source
abstract (subtask: sourced) and without access to the source (subtask: post-hoc). We apply an overall
ensemble approach for both subtasks; however, the approaches diverge based on the availability of the
source text.</p>
        <p>
          The spurious detection problem is tackled by combining four complementary approaches into a
unified framework. Each approach contributes distinct signals about the spuriousness of an input text.
The outputs from these methods are further fused using an ensemble model:
1. BERT Classifier: A fine-tuned BERT-based binary classification model trained to detect whether
an input sentence is spurious or not. It captures lexical and syntactic cues through supervised
learning.
2. Cosine Similarity Score: A semantic similarity score computed using sentence embeddings. It
measures how closely the input text aligns with the source in embedding space. Low similarity
may indicate off-topic or fabricated content, while high similarity suggests grounding.
3. Pre-trained NLI Model: A natural language inference (NLI) model trained on entailment
tasks. Given a reference text, it is used to assess whether the input text contradicts, entails, or is
unrelated to the reference, providing a logic-based signal for spuriousness.
4. LLM as a Judge: A large language model prompted with task-specific instructions to act as a
reasoning-based evaluator. It reviews the given reference and input text and provides a verdict on
whether the input contains hallucinations, contradictions, or irrelevant content [Appendix A.1].
5. Spurious Ensemble Detector: A lightweight meta-classifier that takes as input the prediction
scores from the BERT Classifier, Cosine similarity score, NLI model, and LLM Judge. It combines
their outputs using a small neural network to make a final spuriousness prediction, leveraging
the strengths of all four sources.
3.1.1. Sourced
In the sourced setting, the input sentence is provided with a source abstract. The BERT Classifier is
trained only on input text without using any source information. As such, it relies entirely on intrinsic
textual features and learned patterns to make determinations about the spuriousness of the input text.
To effectively utilize long source abstracts, which often exceed the standard input limits of transformer
models, we adopt a document chunking strategy. Each source abstract is segmented into overlapping
passages (or "chunks") of fixed length—specifically, 100 words per chunk with a 50-word overlap. This
ensures coverage and contextual continuity across segments. Figure 1 illustrates the architecture of our
system in this setting, and the different components are described below:
• Cosine Similarity Score: For each input text, we compute the maximum cosine similarity
between its embedding and those of all its source abstract chunks, using the Sentence Transformer
model multi-qa-MiniLM-L6-cos-v1. This score captures the semantic alignment between
the input and its source content.
• Pre-trained NLI Model: We utilize the natural language inference model
facebook/bart-large-mnli to evaluate semantic consistency. For each chunk of the
source abstract, we compute the entailment and contradiction probabilities with respect to the
input text and retain the highest values across all chunks.
• LLM as Judge: A large language model, llama-3.3-70b-versatile, is prompted with both
the source abstract and the input text using few-shot prompting. The model assigns scores in the
range [0, 1] for four dimensions: spuriousness, over-generalization, contradiction, and vagueness,
based on its reasoning capabilities.
• Spurious Ensemble Detector: A three-layer neural network classifier is trained to aggregate the
eight probabilistic features derived from the previous approaches. This ensemble model predicts
whether the input text is spurious by learning from complementary signals across the BERT classifier,
similarity, entailment, and judgment-based reasoning.
3.1.2. Post-hoc
In the post-hoc setting, the input texts are provided without access to their source abstracts. There is
no change in how the BERT Classifier is implemented compared to the sourced setting. The model is
trained only on the input texts to determine if they are spurious or not. As illustrated in Figure 2, the
system architecture mirrors that of the sourced setting with some differences that are described below:
• Dense Passage Retrieval: We utilize a pre-trained sentence embedding model,
multi-qa-MiniLM-L6-cos-v1, to encode both the input text and the chunks derived
from the source abstracts. Each source abstract is first segmented into 100-word chunks. These
chunks are then embedded into a dense vector space. Given an input sentence, we compute its
embedding and retrieve the top-5 most semantically similar chunks using cosine similarity. Let q
be the embedding of the input sentence and c1, c2, . . . , cn be the embeddings of the source chunks.
The top-k chunks (here, k = 5) with the highest cosine similarity scores are selected as the most
relevant context.
        </p>
        <p>• Cosine Similarity Score: For each input text, the highest cosine similarity score from the Dense Passage Retrieval step is used.
• Pre-trained NLI Model: For the top-5 most semantically similar chunks retrieved, we compute
the entailment and contradiction probabilities with respect to the input text and retain the highest
values across all chunks.
• LLM as Judge: The top-5 highest cosine similar chunks retrieved for each input text are
concatenated. A large language model, llama-3.3-70b-versatile, is prompted with concatenated
source chunks and the input text using few-shot prompting.
• Spurious Ensemble Detector: As in the sourced setting, a three-layer neural network
classifier is trained to aggregate the eight probabilistic features derived from the previous approaches.</p>
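        <p>The chunking, retrieval, and NLI-aggregation steps described above can be sketched as follows. This is an illustrative reconstruction rather than the exact competition code: the function names are ours, the embeddings are assumed to come from a sentence-embedding model such as multi-qa-MiniLM-L6-cos-v1, and the NLI label order is the one documented for facebook/bart-large-mnli (contradiction, neutral, entailment).</p>
        <preformat>
```python
import numpy as np

def chunk_words(text, size=100, overlap=50):
    """Segment a source abstract into overlapping fixed-length word chunks
    (100 words per chunk with a 50-word overlap)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Rank chunk embeddings by cosine similarity to the input-text embedding;
    return the indices of the top-k chunks and the maximum similarity score."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:k]
    return order, float(sims[order[0]])

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def max_nli_scores(chunk_logits):
    """Given per-chunk NLI logits in (contradiction, neutral, entailment)
    order, return the highest entailment and contradiction probabilities
    across all chunks, as fed to the ensemble detector."""
    probs = np.array([softmax(row) for row in chunk_logits])
    return float(probs[:, 2].max()), float(probs[:, 0].max())
```
        </preformat>
        <p>In the sourced setting, the maximum similarity over all chunks is the cosine feature; in the post-hoc setting, the same ranking doubles as the dense retrieval step.</p>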
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Detecting and Classifying Information Distortion Errors - Task 2.2</title>
        <p>
          This task focuses on detecting information distortion in simplified sentences and classifying them into
different types of errors [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We approached the multi-label classification problem through distinct
strategies, leveraging both transformer-based models and large language models (LLMs).
        </p>
        <p>We explored fine-tuning a RoBERTa-large model for multi-label text classification. Similarly,
we investigated LLM-based classification [Appendix A.2] using LLaMA-3.3-70B-Versatile to flag
errors in the simplified sentence when compared to the source sentence. As illustrated in Figure 3, we
combined the strengths of both approaches through an ensemble framework, where the probability
outputs from the DeBERTa model and the binary flags from the LLM are used as inputs to a three-layer
neural network-based meta-classifier.</p>
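        <p>To make the fusion concrete, the meta-classifier can be sketched as a forward pass over the concatenated signals. This is an illustrative, untrained sketch (the function names and 16-unit hidden width are our assumptions; the 15 labels follow the error taxonomy listed in Appendix A.2):</p>
        <preformat>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_weights(n_in, n_hidden, n_out, seed=0):
    """Random weights for the sketch; the real meta-classifier is trained."""
    rng = np.random.default_rng(seed)
    shapes = [(n_hidden, n_in), (n_hidden,),
              (n_hidden, n_hidden), (n_hidden,),
              (n_out, n_hidden), (n_out,)]
    return [rng.normal(scale=0.1, size=s) for s in shapes]

def meta_classifier_forward(transformer_probs, llm_flags, weights):
    """Concatenate the transformer's per-label probabilities with the LLM's
    binary flags and pass them through a small three-layer network, giving
    one sigmoid score per error label."""
    x = np.concatenate([transformer_probs, llm_flags])
    w1, b1, w2, b2, w3, b3 = weights
    h1 = np.maximum(0.0, w1 @ x + b1)
    h2 = np.maximum(0.0, w2 @ h1 + b2)
    return sigmoid(w3 @ h2 + b3)
```
        </preformat>
        <p>In our runs a trained network plays this role; the random initialization here only illustrates the data flow.</p>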
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Avoiding Creative Generation by performing Grounded Generation - Task 2.3</title>
        <p>This task requires a pair of Task 1 text simplification submissions, where one functions as the baseline
approach and the other applies grounding to avoid overgeneration or other additional content not in
the source documents or sentences.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. LLM-Based Grounded Generation</title>
          <p>To ensure factual consistency in the simplified text, we leverage LLaMA-3.3-70B-Versatile as a
grounded generator. The model is prompted with both the baseline simplified text and a corresponding
source text and tasked with producing a corrected version of the simplified text if necessary. The goal is
to revise the baseline simplified text to eliminate any hallucinations, contradictions, fabricated content,
or overgeneralizations, ensuring that the output remains strictly grounded in the provided reference.</p>
          <p>We use a structured prompt to guide the LLM’s reasoning and generation:</p>
          <p>You are given:
- A reference document
- An input text that may contain errors such as fabricated content, contradictions, hallucinations or
overgeneration.</p>
          <p>Your task is to revise the input text so that it is fully grounded in the reference document. The
corrected version must:
- Be factually consistent with the reference
- Avoid introducing unrelated or inaccurate information
Return only the corrected version of the input text if it is needed, otherwise return the same input
text.</p>
          <p>Reference Document: {reference_doc}
Input Text: {input_text}</p>
          <p>Corrected Text:</p>
          <p>As depicted in Figure 4, this prompting strategy enables the LLM to act as a post-editing agent that
enforces alignment between the simplified text and the source text. When the baseline simplification
is already grounded and accurate, the model returns it unchanged; otherwise, it produces a revised
version that faithfully reflects the content and intent of the original reference.</p>
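          <p>The post-editing step then amounts to filling this template and issuing one chat call per simplification. A minimal sketch follows; the prompt string reproduces the template above, while the commented-out client call assumes an OpenAI-compatible chat API and is illustrative rather than the exact setup used:</p>
          <preformat>
```python
GROUNDING_PROMPT = """You are given:
- A reference document
- An input text that may contain errors such as fabricated content, contradictions, hallucinations or overgeneration.

Your task is to revise the input text so that it is fully grounded in the reference document. The corrected version must:
- Be factually consistent with the reference
- Avoid introducing unrelated or inaccurate information
Return only the corrected version of the input text if it is needed, otherwise return the same input text.

Reference Document: {reference_doc}
Input Text: {input_text}

Corrected Text:"""

def build_grounding_prompt(reference_doc, input_text):
    """Fill the post-editing template with a source document and a baseline simplification."""
    return GROUNDING_PROMPT.format(reference_doc=reference_doc, input_text=input_text)

# Sending the filled prompt is then a single chat call, e.g. with any
# OpenAI-compatible client (illustrative; not the exact competition setup):
# response = client.chat.completions.create(
#     model="llama-3.3-70b-versatile",
#     messages=[{"role": "user", "content": build_grounding_prompt(ref, text)}],
# )
```
          </preformat>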
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Evaluation of Spurious Text Detection - Task 2.1</title>
        <p>We evaluate the performance of our ensemble approach separately for both sub-tasks: sourced, where
the input text is accompanied by its source abstract, and post-hoc, where the source is unavailable. In
addition to reporting results for each setting independently, we also analyze whether the absence of
source information in the post-hoc setting leads to any significant degradation in detection performance.
4.1.1. Sourced
To evaluate the effectiveness of our approach, we compare the performance of our ensemble model,
which integrates predictions from the BERT Classifier, LLM as Judge, and NLI model, against each
individual approach. Table 1 summarizes the evaluation metrics, including Accuracy, Precision, Recall,
F1 Score, ROC AUC, and AUPRC for this setting.
While the BERT Classifier alone performs strongly with an F1 score of 0.95, the ensemble model offers
a more balanced performance across metrics, especially in terms of precision and recall. The LLM
as Judge provides high precision (0.94) but lower recall (0.76), suggesting conservative but accurate
flagging of spurious text. The NLI model performs comparatively poorly in isolation, likely due to its
sensitivity to phrasing and lack of contextual understanding. However, when combined in the ensemble,
it contributes complementary signals, enhancing robustness.</p>
        <p>Overall, the ensemble model achieves high performance (ROC AUC: 0.68) and demonstrates the
effectiveness of combining shallow, semantic, and reasoning-based components for the spuriousness
detection task.
4.1.2. Post-hoc
In the post-hoc setting, the system must detect spurious content without access to the source abstract.
We perform a similar comparison of the ensemble model—combining the BERT Classifier,
LLM as Judge, and NLI Entailment model—with each individual component model. Table 2 reports
Accuracy, Precision, Recall, F1 Score, ROC AUC, and AUPRC in this setting.
The results show that the BERT Classifier alone performs very competitively, achieving the highest
Accuracy (0.91), even slightly outperforming the ensemble model (0.90). This suggests that, in the
absence of source information, the classifier trained purely on input text features is highly effective. The
LLM as Judge exhibits high precision (0.95), indicating that it is conservative and reliable in flagging
spurious text. However, it sacrifices recall (0.78), which reduces its overall F1 performance. While the
ensemble approach does not outperform the BERT-only model in this post-hoc setting, it maintains
robust and consistent performance by integrating diverse perspectives, making it resilient to individual
model weaknesses.</p>
        <sec id="sec-4-1-1">
          <title>4.1.3. Overall Performance</title>
          <p>We observe that the performance of the ensemble model remains remarkably stable across both settings.
The accuracy and F1 score show only a marginal decline when source information is unavailable
(accuracy drops from 0.91 to 0.90), suggesting that the model is largely resilient to the absence of source
grounding.</p>
          <p>However, a more noticeable decline is observed in the ROC AUC metric, which drops from 0.68 in
the sourced setting to 0.64 in the post-hoc setting. This suggests that while the model still performs
well in binary classification, its ability to rank predictions with high confidence across the full score
distribution is somewhat diminished without access to source context.</p>
          <p>Overall, the results indicate that the absence of the source abstract does not significantly impair the
model’s detection capabilities, thanks to the strong contribution of the BERT-based classifier and LLM
judgment mechanisms.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation of Information Distortion Error Classification - Task 2.2</title>
        <p>We evaluate the performance of the ensemble model that aggregates the probability outputs from
the DeBERTa-based classifier and the binary flags from the LLM to form a robust meta-classifier for
detecting various types of information distortion errors.
The ensemble model consistently outperforms individual classifiers across all error categories. It
achieves the highest F1 score (0.763) and AUC-PR (0.561) for correctly identifying instances with
no errors, indicating strong precision and recall in distinguishing clean outputs. For all four error
categories—Fluency (A), Alignment (B), Information (C), and Simplification (D)—the ensemble model
yields superior F1 and AUC-PR scores compared to standalone models like RoBERTa, LLaMA, and BERT.</p>
        <p>While the LLaMA-based model shows competitive performance, particularly in Fluency (A) and
Alignment (B), it falls short of the ensemble approach in detecting more nuanced errors such as
Information loss (C) and Simplification issues (D). The BERT model, by contrast, lags in performance
across all metrics, highlighting the advantage of using more advanced or combined architectures.</p>
        <p>Overall, these results demonstrate the benefit of leveraging an ensemble method that integrates
DeBERTa and LLM-based predictions, especially in scenarios requiring nuanced error detection.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation of Grounded Generation over Creative Generation - Task 2.3</title>
        <p>Table 4 compares grounded submissions with baseline systems (denoted by ⋆) for CLEF 2025 SimpleText
Task 2.3. Across both test datasets (37 aligned Cochrane-auto abstracts and 217 plain language summaries),
grounded systems generally exhibit higher semantic fidelity, as evidenced by consistently higher BLEU
scores. For instance, llama_summary_simplification_grounded outperforms the baseline with
BLEU scores of 15.00 vs. 7.63 and 9.89 vs. 5.32. These results suggest that grounded simplification
preserves the semantic content of the original text more faithfully. This is further supported by higher
Levenshtein similarity scores and lower deletion proportions, indicating that grounded systems retain
more of the original wording and structure.</p>
        <p>However, the SARI scores—measuring the balance between addition, deletion, and copying operations
for simplification—tend to be slightly lower for grounded models compared to their baseline counterparts.
For example, plan_guided_llama (baseline) achieves a SARI of 42.98 versus 33.41 for its grounded
variant on the 217 plain language summaries test dataset, suggesting that baseline models introduce
more aggressive simplification operations.</p>
        <p>This reflects a fundamental trade-of: grounded models, while producing more faithful and
semantically aligned simplifications (as evidenced by higher BLEU and lexical similarity), may be less efective
in performing bold rewrites or deletions that lead to simpler outputs, thus lowering their SARI scores.
In summary, grounded systems are advantageous when semantic fidelity and contextual accuracy are
prioritized.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>Large Language Models (LLMs) have demonstrated strong capabilities in the domain of scientific
text simplification. However, they are also prone to hallucinations and overgeneration, which can
compromise the faithfulness and factual accuracy of the generated content.</p>
      <p>In this work, we proposed ensemble-based approaches to identify spurious content and
information distortion errors in simplified text. Our methods combine the strengths of individual
components—including BERT-based classifiers, NLI models, and LLM-based classifiers—that together form a
robust detection framework. For grounded simplification, we further demonstrated that LLMs serve
effectively as high-precision post-editors, capable of revising simplified text to maintain consistency
with the source document while correcting factual errors.</p>
      <p>Our experiments show that spurious text detection degrades only slightly in post-hoc settings
where the source document is unavailable. However, a notable drop in ROC AUC suggests that
additional investigation is needed to better understand this discrepancy. Furthermore, in comparing
grounded generation with baseline simplification, we observe a clear trade-off between simplification
and faithfulness. Grounded outputs are more accurate but often more complex.</p>
      <p>Future work will explore techniques to optimize this trade-off, aiming to preserve simplicity without
sacrificing factual correctness or introducing hallucinations.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>
        We thank the Data Science at Georgia Tech (DS@GT) CLEF competition group for their support. This
research was supported in part through research cyberinfrastructure resources and services provided by
the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology,
Atlanta, Georgia, USA [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Gemini for grammar and spelling
check, as well as assistance in the code for the conducted experiments. After using these tools, the
authors reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Prompt Templates</title>
      <sec id="sec-8-1">
        <title>A.1. LLM as Judge Prompt</title>
        <p>You are an expert annotator tasked with evaluating whether an input text is spurious when compared to a source document.</p>
        <p>An input text is **spurious** if:
- It fabricates information not grounded in any source.
- It misrepresents or contradicts the source documents.
- It is too general, trivial, or irrelevant in the context of the documents, even if technically true.</p>
        <p>Please review the following examples to guide your evaluation:
Example 1:
SOURCE DOCUMENT:
Online social media provide users with opportunities to engage with diverse opinions and can spread misinformation.
INPUT TEXT:
Social media always spreads misinformation.
RESPONSE:
{
"spuriousness": 1.0,
"over_generalization": 0.8,
"contradiction": 0.6,
"vagueness": 0.5
}
Example 2:
SOURCE DOCUMENT:
We propose a new welfare criterion that allows us to rank alternative financial market structures in the presence of belief heterogeneity.
INPUT TEXT:
We propose a new economic theory to manage inflation.
Example 3:
SOURCE DOCUMENT:
We analyze economies with complete and incomplete financial markets and restricted trading possibilities like borrowing limits.
INPUT TEXT:
We analyze economies with complete and incomplete financial markets.
RESPONSE:
{
"spuriousness": 0.1,
"over_generalization": 0.4,
"contradiction": 0.0,
"vagueness": 0.2
}</p>
        <p>SOURCE DOCUMENT:
{source}
INPUT TEXT:
{input_text}
Please answer with a score between 0 and 1 for each of the following:
1. Spuriousness (fabricated, irrelevant, or ungrounded):
2. Over-generalization (too broad or omits key details):
3. Contradiction (misrepresents or opposes the source):
4. Vagueness (too imprecise, lacks specificity):
Do not include any text or commentary outside of the JSON response format below.</p>
        <p>Respond in this JSON format:
{
"spuriousness": float,
"over_generalization": float,
"contradiction": float,
"vagueness": float
}</p>
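        <p>Downstream of this prompt, the judge's JSON reply is turned into the four numeric features consumed by the ensemble detector. A small illustrative helper (our sketch, not the competition code) that clamps scores to [0, 1] and defaults missing keys:</p>
        <preformat>
```python
import json

JUDGE_KEYS = ("spuriousness", "over_generalization", "contradiction", "vagueness")

def parse_judge_response(raw):
    """Parse the judge's JSON reply into the four scores, clamping each to
    [0, 1] and defaulting missing keys to 0.0."""
    data = json.loads(raw)
    return [min(1.0, max(0.0, float(data.get(key, 0.0)))) for key in JUDGE_KEYS]
```
        </preformat>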
      </sec>
      <sec id="sec-8-2">
        <title>A.2. LLM-based Classification Prompt</title>
        <p>You a r e an e x p e r t l i n g u i s t s p e c i a l i z i n g i n t e x t s i m p l i f i c a t i o n .</p>
        <p>Given a s o u r c e s e n t e n c e and a s i m p l i f i e d v e r s i o n , your t a s k i s t o
s c o r e between 0 and 1 f o r any o f t h e i n f o r m a t i o n d i s t o r t i o n
e r r o r s i n t h e s i m p l i f i e d s e n t e n c e .</p>
        <p>P l e a s e r e v i e w t h e f o l l o w i n g example t o g u i d e your e v a l u a t i o n :
Example 1 :
SOURCE SENTENCE : We c o n d u c t e d t h e e x p e r i m e n t s with a d a t a s e t o f 94
c h e s t CTs ( l a b o r a t o r y c o n f i r m e d 39 v i r a l b r o n c h i o l i t i s c a u s e d by
human p a r a i n f l u e n z a ( HPIV ) , 34 n o n t u b e r c u l o u s m y c o b a c t e r i a l (NTM)
, and 21 normal c o n t r o l ) .</p>
        <p>SIMPLIFIED SENTENCE: 'The tests were Out of these, there were 39 cases with viral bronchiolitis from HPIV, 34 cases of non-tuberculosis mycobacteria (NTM), and 21 healthy people for comparison.</p>
        <p>RESPONSE:
{{
'No error': 0,
'A1. Random generation': 0,
'A2. Syntax error': 1,
'A3. Contradiction': 0,
'A4. Simple punctuation/grammar errors': 0,
'A5. Redundancy': 0,
'B1. Format misalignment': 0,
'B2. Prompt misalignment': 0,
'C1. Factuality hallucination': 0,
'C2. Faithfulness hallucination': 0,
'C3. Topic shift': 0,
'D1.1. Overgeneralization': 0,
'D1.2. Overspecification of Concepts': 0,
'D2.1. Loss of Informative Content': 0,
'D2.2. Out-of-Scope Generation': 0
}}
Now evaluate the next pair:
SOURCE SENTENCE:
{source_sentence}
SIMPLIFIED SENTENCE:
{simplified_sentence}
Do not include any text or commentary outside of the JSON response format below.</p>
        <p>Respond in this JSON format:
{{
'No error': float,
'A1. Random generation': float,
'A2. Syntax error': float,
'A3. Contradiction': float,
'A4. Simple punctuation/grammar errors': float,
'A5. Redundancy': float,
'B1. Format misalignment': float,
'B2. Prompt misalignment': float,
'C1. Factuality hallucination': float,
'C2. Faithfulness hallucination': float,
'C3. Topic shift': float,
'D1.1. Overgeneralization': float,
'D1.2. Overspecification of Concepts': float,
'D2.1. Loss of Informative Content': float,
'D2.2. Out-of-Scope Generation': float
}}</p>
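        <p>The doubled braces in the template above follow Python's str.format escaping convention. As an illustrative sketch (the TEMPLATE excerpt here is abbreviated, not the full prompt), only the two sentence slots are substituted, while the braces of the example JSON survive literally:</p>
        <p>
```python
# Sketch of template filling: doubled braces render as literal braces,
# single-brace slots are replaced by str.format.
TEMPLATE = (
    "SOURCE SENTENCE:\n{source_sentence}\n"
    "SIMPLIFIED SENTENCE:\n{simplified_sentence}\n"
    "Respond in this JSON format:\n{{'No error': float, ...}}"
)

prompt = TEMPLATE.format(
    source_sentence="We studied 94 chest CTs.",
    simplified_sentence="94 chest scans were studied.",
)
print(prompt)
```
        </p>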
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2025 SimpleText track: Simplify scientific texts (and nothing more)</article-title>
          , in: J.
          <string-name>
            <surname>Carillo de Albornoz</surname>
          </string-name>
          , et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), LNCS, Springer-Verlag,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2025 SimpleText Task 2: Identify and Avoid Hallucination</article-title>
          , in: G.
          <string-name>
            <surname>Faggioli</surname>
          </string-name>
          , et al. (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2025</year>
          ), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Fries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bohnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>On faithfulness and factuality in abstractive summarization</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kryściński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Evaluating the factual consistency of abstractive text summarization</article-title>
          ,
          <source>in: EMNLP</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <article-title>Multi-fact correction in abstractive text summarization</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Revisiting factual evaluation of summarization via question answering</article-title>
          ,
          <source>in: ACL Findings</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>Factual error correction for abstractive summarization via reinforcement learning</article-title>
          ,
          <source>in: NAACL</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nishino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <article-title>SciTLDR: Extreme summarization of scientific documents</article-title>
          ,
          <source>in: EMNLP</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>De Loor</surname>
          </string-name>
          ,
          <article-title>Resource for error analysis in text simplification: New taxonomy and test collection</article-title>
          ,
          <source>arXiv preprint arXiv:2505.16392</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/pdf/2505.16392.pdf, to appear
          <source>in SIGIR '25: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ziems</surname>
          </string-name>
          , et al.,
          <article-title>Can large language models be consistently trusted for factuality detection?</article-title>
          ,
          <source>arXiv preprint arXiv:2305.15005</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>PACE</surname>
          </string-name>
          ,
          <article-title>Partnership for an Advanced Computing Environment (PACE</article-title>
          ),
          <year>2017</year>
          . URL: http://www.pace.gatech.edu.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>