<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DUTH at CLEF 2025 SimpleText Track: Tackling Scientific Text Simplification and Hallucination Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgios Arampatzis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Avi Arampatzis</string-name>
          <email>avi@ee.duth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Democritus University of Thrace, Department of Electrical and Computer Engineering</institution>
          ,
          <addr-line>Xanthi</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents the participation of the DUTH team in the CLEF 2025 SimpleText Track, focusing on the automatic simplification of scientific texts and the detection of hallucinations. For the simplification tasks, we employed large instruction-tuned language models (LLMs), such as FLAN-T5-Large and BART-SAMSum. Experiments at both the sentence level (Task 1.1) and the document level (Task 1.2) showed that scaling up the model and curating the content significantly improve simplification quality. The models demonstrated the ability to preserve semantic accuracy, even in complex contexts. In the field of hallucination detection (Task 2), we applied both binary and multi-class classification methods, based on lexical and semantic representations. Tree-based ensemble learning models, such as Extra Trees and Random Forest, achieved top performance in identifying erroneous content, under both posthoc and sourced conditions. However, the fine-grained classification of error types (Task 2.2) revealed substantial challenges, particularly in detecting semantic deviations such as factuality hallucinations. Future work will focus on incorporating contextual embeddings, applying few-shot learning, and enhancing the robustness of the models.</p>
      </abstract>
      <kwd-group>
<kwd>Simplification</kwd>
        <kwd>Hallucination</kwd>
        <kwd>Scientific Texts</kwd>
        <kwd>Instruction Tuning</kwd>
        <kwd>Document-level Simplification</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Posthoc Annotation</kwd>
        <kwd>Ensemble Methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Scientific texts are often characterized by dense terminology and complex syntactic structures, making
them difficult to understand for lay audiences. As a result, non-experts frequently avoid engaging with
primary scientific literature, instead relying on simplified or secondary sources—such as blogs or social
media—which may contain distorted or unreliable interpretations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Automatic text simplification,
particularly within scientific domains, aims to bridge this accessibility gap by transforming complex
content into more comprehensible forms while preserving factual accuracy and intended meaning.
      </p>
      <p>
        The CLEF 2025 SimpleText Track [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was introduced to support the systematic evaluation of scientific
text simplification systems and to address the growing concern of hallucinations—spurious content not
grounded in the source—often produced by generative language models. In this context, we participated
in two core tasks of the track:
• Task 1: Simplify Scientific Text, which includes:
      </p>
      <p>– Task 1.1: Sentence-level simplification</p>
      <p>
        – Task 1.2: Document-level simplification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
• Task 2: Identify and Avoid Hallucination, which focuses on detecting erroneous or fabricated
information in simplified outputs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
      </p>
      <p>
        Prior work in scientific text simplification includes both supervised approaches using aligned
corpora [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] and prompt-based techniques leveraging large language models (LLMs) such as GPT-3 or
T5 [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Although LLMs generate fluent outputs, they are prone to hallucinations—posing a significant
risk in scientific applications where factual consistency is paramount. Consequently, recent research
has increasingly focused on evaluating and mitigating hallucinated content in generated text [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
      <p>
        This paper presents our submission to Tasks 1 and 2 of the CLEF 2025 SimpleText track. For full
details on task definitions, datasets, and evaluation protocols, we refer the reader to the official track
and task overview papers [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ].
      </p>
      <p>The remainder of this paper is organized as follows: Section 2 describes our methodology, including
data preprocessing, system architecture, and prompt engineering strategies. Section 3 presents the
experimental results and analysis. Section 4 concludes with key findings and discusses directions for
future work.</p>
      <p>2. Experimental Setup</p>
      <sec id="sec-1-1">
<title>2.1. Task 1: Text Simplification – Simplify Scientific Text</title>
        <sec id="sec-1-1-1">
          <title>2.1.1. Task 1.1: Sentence-level Scientific Text Simplification</title>
          <p>2.1.1.1. Dataset Table 1 provides a detailed overview of the dataset configuration used in Task 1.1,
which focuses on the simplification of scientific texts at the sentence level. Each split is characterized by
the number of unique complex sentences, the total number of entries, and the presence or absence of
reference simplifications.</p>
          <p>The training set contains 11,452 unique complex sentences and 11,510 total entries, each paired with
a corresponding simplification. The validation and internal test sets consist of 1,695 and 1,510 unique
sentences, respectively, and are used for model tuning and intermediate evaluation.</p>
          <p>In contrast, the final test set comprises 9,086 unique sentences and 9,160 entries, but does not
include reference simplifications. This test set is used for the official system evaluation and leaderboard
submission via the Codabench platform. The absence of gold outputs ensures an unbiased, blind
evaluation of system performance.</p>
<p>The difference between the number of unique sentences and total entries stems from cases in which
multiple simplifications are provided for a single complex sentence, as defined by the dataset schema in
the CSV files.</p>
          <p>2.1.1.2. Methodology The simplification approach adopted for Task 1.1 leverages large pretrained
language models tailored for sequence-to-sequence tasks. These models operate in a zero-shot setting
and are prompted to rewrite complex scientific sentences into simpler, more accessible forms while
preserving core meaning and factual accuracy.</p>
          <p>To guide generation, task-specific prompts were applied where relevant, encouraging the models to
produce outputs that align with simplification goals. The decoding strategy emphasized determinism
and brevity, ensuring that the generated sentences were concise and syntactically well-formed.</p>
          <p>This methodology exploits the generalization capabilities of large-scale foundation models to perform
scientific text simplification without additional supervision, demonstrating their feasibility for specialized
communication tasks in scientific domains.</p>
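          <p>A minimal sketch of this zero-shot setup is given below. The prompt wording and decoding values (beam width, output-length cap) are illustrative assumptions rather than the exact configuration used, and the simplify helper is hypothetical; only the FLAN-T5 model family is taken from our experiments.</p>

```python
# Sketch of zero-shot, prompt-based sentence simplification (Task 1.1).
# Prompt template and decoding parameters are illustrative assumptions.

def build_prompt(complex_sentence: str) -> str:
    """Prepend a task-specific simplification instruction to a sentence."""
    return f"Simplify the following scientific sentence: {complex_sentence}"

# Deterministic, brevity-oriented decoding: beam search, no sampling,
# capped output length.
GENERATION_KWARGS = {
    "num_beams": 4,
    "do_sample": False,
    "max_new_tokens": 64,
}

def simplify(sentences, model_name="google/flan-t5-large"):
    """Zero-shot simplification with a seq2seq model (requires transformers)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer([build_prompt(s) for s in sentences],
                       return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, **GENERATION_KWARGS)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

          <p>Disabling sampling and using beam search keeps generation deterministic across runs, which matches the decoding strategy described above.</p>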
        </sec>
        <sec id="sec-1-1-2">
          <title>2.1.2. Task 1.2 – Document-level Scientific Text Simplification</title>
          <p>2.1.2.1. Dataset Table 2 presents an overview of the document-level dataset distribution used in
Task 1.2, which focuses on the simplification of full scientific abstracts. Each split is characterized
by the number of unique complex documents, total entries (rows), and the availability of reference
simplifications.</p>
          <p>The training set comprises 3,967 documents, each paired with corresponding simplifications. The
validation and internal test sets contain 500 and 502 documents, respectively, and include gold-standard
simplifications. These subsets are intended for model development, hyperparameter tuning, and
preliminary evaluation.</p>
          <p>The final test set includes 666 complex documents without reference simplifications. It serves as the
blind evaluation input for official submissions on Codabench. The absence of target outputs ensures fair
and unbiased scoring of participants’ systems.</p>
          <p>Notably, the number of total entries equals the number of unique documents across all subsets,
indicating that each row corresponds to a single abstract. Unlike Task 1.1, document-level simplification
requires handling discourse structure, paragraph segmentation, and sentence-level transformations in a
holistic and coherent manner.</p>
          <p>2.1.2.2. Methodology The approach followed for Task 1.2 leverages a large pretrained language
model to simplify full scientific abstracts, addressing the broader context and discourse structure inherent
to document-level content. The model is guided through natural language prompts that explicitly define
the simplification objective.</p>
          <p>Each document is treated as a single input unit, and the model is prompted to generate a simplified
version that preserves terminological precision, semantic fidelity, and coherence across multiple
sentences. This is achieved by prepending structured instructions (e.g., “Simplify the following scientific
document:”) to the complex abstract.</p>
          <p>The methodology highlights the ability of large-scale instruction-tuned models to handle extended
scientific discourse and produce simplified outputs that remain faithful to the original content—without
the need for fine-tuning on domain-specific simplification data.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.1.3. Implementation and Environment</title>
          <p>All experiments were implemented in Python 3.10, using the Transformers library from Hugging
Face and PyTorch (v2.1.0). Execution was performed on a compute node equipped with an NVIDIA
RTX A6000 GPU.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Controlled Creativity: Identify and Avoid Hallucination</title>
        <sec id="sec-1-2-1">
          <title>2.2.1. Task 2.1 – Identify Creative Generation at Document Level</title>
          <p>2.2.1.1. Dataset Table 3 presents a quantitative summary of the posthoc subset used in Task 2 of the
CLEF 2025 SimpleText track, which evaluates hallucination detection in simplified scientific texts. This
subset consists of system-generated simplifications that were annotated after generation to determine
whether they contain hallucinated content.</p>
          <p>The table reports both the number of unique simplified sentences and the total number of entries,
which may include duplicates due to multiple annotations or metadata variants. Specifically:</p>
          <p>The posthoc training set contains 13,137 unique simplified sentences and 13,519 total entries, indicating
that some examples appear more than once due to secondary annotations, such as annotator disagreement
or metadata variation.</p>
          <p>The posthoc test set includes 3,249 unique sentences and 3,293 entries, and is used to evaluate
hallucination detection models under controlled conditions.</p>
          <p>This subset is particularly valuable for training models that must generalize to noisy, real-world
outputs from text simplification systems. Its posthoc nature provides a realistic evaluation setting, where
hallucinations are assessed independently of the system that produced the simplification.</p>
<p>The distinction between unique instances and total entries offers insight into the annotation process
and potential variance introduced by human labeling or system generation artifacts.</p>
          <p>The sourced training set contains 13,120 unique simplified sentences and 13,514 total entries. Minor
redundancy may arise from sentence variants, additional annotations (e.g., multiple annotators), or
metadata replication.</p>
          <p>The sourced test set includes 3,318 unique sentences and 3,379 entries, and serves as the benchmark
for evaluating hallucination detection models on source-aligned simplifications.</p>
          <p>The explicit grounding offered by this dataset makes it particularly suitable for supervised learning
and fine-grained hallucination evaluation. In combination with the posthoc subset, it supports robust
model development across both real-world and controlled hallucination scenarios.</p>
          <p>2.2.1.2. Methodology To address the detection of spurious or hallucinated content in simplified
scientific sentences, we adopted a supervised binary classification framework based on lexical features.
Two parallel models were developed—one for the sourced and one for the posthoc subsets—using the
same processing pipeline.</p>
          <p>Each classifier was trained to distinguish between factually accurate and spurious simplifications
using an ensemble-based learning approach. Specifically, we employed the ExtraTreesClassifier,
a non-parametric ensemble method that aggregates multiple randomized decision trees to improve
robustness and generalization.</p>
          <p>Input sentences were vectorized using a TF-IDF representation over a vocabulary of the 3,000 most
informative terms. To address class imbalance in the training data, the minority class was upsampled
via random oversampling, resulting in a balanced training set. Predictions were then generated for the
test instances and exported in structured format for evaluation.</p>
<p>This approach demonstrates the effectiveness of combining simple lexical representations with
ensemble learning methods for hallucination detection in scientific text simplification.</p>
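          <p>A self-contained sketch of this pipeline follows. The 3,000-term TF-IDF vocabulary matches the description above; the number of trees and other hyperparameters are illustrative assumptions.</p>

```python
# Sketch of the Task 2.1 pipeline: TF-IDF features over 3,000 terms,
# random oversampling of the minority class, Extra Trees classifier.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import resample

def train_hallucination_detector(texts, labels, seed=0):
    vec = TfidfVectorizer(max_features=3000)
    X = vec.fit_transform(texts).toarray()
    y = np.asarray(labels)

    # Upsample the minority class to balance the training set.
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    X_min, y_min = X[y == minority], y[y == minority]
    X_up, y_up = resample(X_min, y_min,
                          n_samples=int(counts.max() - counts.min()),
                          replace=True, random_state=seed)
    X_bal = np.vstack([X, X_up])
    y_bal = np.concatenate([y, y_up])

    clf = ExtraTreesClassifier(n_estimators=200, random_state=seed)
    clf.fit(X_bal, y_bal)
    return vec, clf
```

          <p>At prediction time, test sentences are transformed with the same fitted vectorizer before being passed to the classifier.</p>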
        </sec>
        <sec id="sec-1-2-2">
          <title>2.2.2. Task 2.2 – Detect and Classify Information Distortion Errors in Simplified Sentences</title>
          <p>2.2.2.1. Dataset Table 5 summarizes the dataset used in Task 2.2 of the CLEF 2025 SimpleText Track,
which targets fine-grained error annotation in sentence-level simplifications. The dataset supports
supervised training and evaluation of systems capable of identifying specific simplification errors, such
as hallucinations, faithfulness violations, and discourse-level inconsistencies.</p>
          <p>The training set comprises 42,392 annotated entries corresponding to 35,621 unique simplified
sentences. Each entry includes one or more categorical labels indicating the presence of error types (e.g.,
factuality hallucination, topic shift, overgeneralization). Due to the multi-label structure and multiple
annotations per sentence, individual sentences may appear more than once in the dataset.</p>
          <p>The test set contains 2,659 entries derived from 1,537 unique complex source sentences. Unlike the
training data, test instances do not include simplified outputs, enabling blind evaluation: systems must
infer likely errors solely based on the input sentence.</p>
          <p>This dataset plays a key role in advancing error-aware simplification systems, providing a structured
foundation for training multi-class classifiers and enabling performance breakdown by error type across
diverse semantic and pragmatic dimensions.</p>
          <p>2.2.2.2. Methodology For the fine-grained detection of hallucination errors in simplified
scientific text, we adopt a multi-label classification framework grounded in semantic similarity. Sentence
pairs—comprising the original and the simplified version—are embedded into dense semantic vectors
using a pretrained sentence encoder (all-mpnet-base-v2), enabling the model to capture
meaning-preserving or distorting transformations.</p>
          <p>A multi-output classifier is trained on these embeddings to predict the presence of specific hallucination
categories, as defined by a structured error taxonomy. To address class imbalance and data sparsity, the
training set is augmented with synthetic examples and oversampling techniques. Label-wise thresholds
are tuned via validation-based F1 maximization to ensure calibrated predictions across error types.</p>
          <p>Implementation Details. We used the all-mpnet-base-v2 model from the SentenceTransformers
library to generate fixed-size semantic embeddings. The encoder operated in inference-only mode,
with no task-specific fine-tuning; that is, its parameters remained frozen during training. Only the
downstream classifier—a MultiOutputClassifier using either Logistic Regression or Random Forest
as the base estimator—was trained on the extracted embeddings. This lightweight architecture enables
efficient yet effective classification of hallucination error types in scientific simplifications.</p>
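          <p>A compact sketch of this classifier head is given below. Since running the frozen all-mpnet-base-v2 encoder is beyond a short example, synthetic vectors stand in for the real 768-dimensional sentence-pair embeddings; the dimensions, estimator settings, and 0.5 starting thresholds are illustrative assumptions.</p>

```python
# Sketch of the Task 2.2 head: a MultiOutputClassifier over frozen
# sentence-pair embeddings. Synthetic vectors stand in for real mpnet
# embeddings; sizes and settings are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
n, dim, n_labels = 200, 32, 4           # stand-ins for the real sizes
X = rng.normal(size=(n, dim))           # pretend pair embeddings
Y = (X[:, :n_labels] > 0).astype(int)   # synthetic multi-label targets

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# One probability column per error type; label-wise thresholds would be
# tuned by validation-based F1 maximization — 0.5 is just a starting point.
probs = np.column_stack([p[:, 1] for p in clf.predict_proba(X)])
thresholds = np.full(n_labels, 0.5)
preds = (probs >= thresholds).astype(int)
```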
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>2.3. Implementation and Environment</title>
        <p>All experiments were implemented in Python 3.10, using the PyTorch framework (v2.1.0) in
conjunction with the Hugging Face transformers and sentence-transformers libraries. For the
classification tasks, models were built using scikit-learn, including both tree-based ensemble
methods (e.g., Extra Trees, Random Forest) and linear classifiers (e.g., Logistic Regression).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Results</title>
      <sec id="sec-2-1">
        <title>3.1. Evaluation Metrics</title>
        <p>We evaluate sentence simplification and hallucination detection using a combination of reference-based,
semantic, and classification-based metrics.</p>
        <sec id="sec-2-1-1">
          <title>3.1.1. Sentence Simplification (Tasks 1.1 and 1.2)</title>
          <p>
            The main evaluation metrics include:
• SARI [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]: Evaluates the quality of added, deleted, and retained n-grams with respect to reference
simplifications.
• BLEU [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]: Measures n-gram overlap with reference texts, though it is less sensitive to
simplification quality.
• BERTScore [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]: Computes semantic similarity between system outputs and references using
contextual embeddings.
• LENS [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]: A learned metric trained on human-annotated simplification quality ratings.
• SLE [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]: A classifier-based, reference-less metric that distinguishes simplified from non-simplified
outputs.
          </p>
        </sec>
        <sec id="sec-2-1-2">
<title>3.1.2. Hallucination Detection (Task 2.1)</title>
          <p>
            For binary classification of hallucinated content, we report standard classification metrics:
• Accuracy [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]: Proportion of correct predictions over all predictions.
• Precision [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]: Proportion of predicted positives that are correct.
• Recall [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]: Proportion of actual positives that are correctly predicted.
• F1-score [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]: Harmonic mean of precision and recall.
          </p>
          <p>
            • ROC AUC [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]: Area under the ROC curve, indicating overall class separability.
          </p>
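          <p>A toy worked example of these metrics, using scikit-learn; the labels and scores here are hypothetical.</p>

```python
# Worked example of the binary classification metrics listed above.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 1, 0, 1]          # gold: hallucinated or not
y_pred  = [1, 1, 0, 0, 1, 1, 0, 1]          # hard predictions
y_score = [0.9, 0.8, 0.4, 0.2, 0.6, 0.7, 0.1, 0.95]  # scores for ROC AUC

acc  = accuracy_score(y_true, y_pred)   # 6/8 correct = 0.75
prec = precision_score(y_true, y_pred)  # 4 TP / 5 predicted positives = 0.8
rec  = recall_score(y_true, y_pred)     # 4 TP / 5 actual positives = 0.8
f1   = f1_score(y_true, y_pred)         # harmonic mean of the two = 0.8
auc  = roc_auc_score(y_true, y_score)   # 14/15 correctly ranked pairs
```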
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Task 1.1 – Sentence-level Scientific Text Simplification</title>
        <sec id="sec-2-2-1">
          <title>3.2.1. Experimental Results</title>
          <p>Table 6 presents detailed results for sentence-level scientific text simplification (Task 1.1), evaluated
using three metrics: SARI (original), SARI (auto), and the final Score. These metrics assess simplification
quality in terms of information added, deleted, and retained, incorporating both reference-based and
automatic evaluations.</p>
          <p>The FLAN-T5-Large model outperforms all others, achieving a SARI (original) of 35.35 and a high
SARI (auto) of 38.73. This indicates a strong ability to generate simplified outputs that preserve core
semantic content while enhancing accessibility. Its consistent performance across human and automatic
references demonstrates the robustness of large-scale, instruction-tuned models in zero-shot settings.</p>
          <p>The BART-SAMSum model, despite being pretrained on dialogue summarization data, performs
competitively with a SARI (original) of 29.68, surpassing the generic BART model (23.84). This suggests
that pretraining on abstractive, paraphrastic tasks can effectively transfer to scientific simplification,
even in the presence of domain mismatch.</p>
          <p>In contrast, smaller variants such as FLAN-T5-Base and FLAN-T5-XL yield significantly lower
scores (19.51 and 18.78, respectively), underscoring the impact of model scale on simplification quality.
These results support the hypothesis that both size and instruction tuning are key factors in enabling
generalization without task-specific supervision.</p>
<p>Finally, the gap between SARI (original) and SARI (auto) offers additional insights into evaluation
alignment. The top-performing FLAN-T5-Large exhibits strong agreement across both metrics, suggesting
its outputs align well with both human references and automated paraphrases—further validating its
generalization capacity.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Task 1.2 – Document-level Scientific Text Simplification</title>
        <sec id="sec-2-3-1">
          <title>3.3.1. Experimental Results</title>
          <p>The evaluation results for Task 1.2 indicate that models incorporating domain adaptation or content
cleaning strategies yield improved performance in document-level scientific text simplification. The
top-performing system, bart-samsum_clean, achieved a score of 36.998, demonstrating the benefit
of leveraging dialogue-style summarization pretraining combined with targeted refinement.</p>
          <p>Closely following, flan-t5-xl_clean and flan-t5-xxl_clean achieved scores of 36.620
and 35.813, respectively, confirming the positive effect of scaling and data curation. The
flan-t5-large_co variant, presumably optimized with contrastive objectives, also performed
competitively with a score of 34.612.</p>
          <p>In contrast, flan-t5-base—the smallest model—achieved a lower score of 33.130, suggesting a
performance ceiling for models lacking sufficient capacity or instruction tuning. This reinforces the
sensitivity of document-level simplification to both model scale and pretraining configuration.</p>
          <p>Overall, the results highlight the importance of instruction tuning, scaling, and input refinement
in achieving high-quality simplifications that preserve coherence and semantic fidelity—crucial for
expert-to-lay communication in scientific domains.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Task 2.1 – Identify Creative Generation at Document Level</title>
        <sec id="sec-2-4-1">
          <title>3.4.1. Experimental Results</title>
          <p>The best-performing model is the Extra Trees classifier, which achieves an F1-score and Score
of 0.948, alongside a high Recall (0.974) and Accuracy (0.904). These results underscore the model’s
robustness in identifying hallucinated content, even in imbalanced or sparse feature settings, affirming
the strength of ensemble tree-based methods.</p>
          <p>Random Forest follows closely with a Score of 0.945 and an F1-score of 0.945, further confirming
the efficacy of ensemble approaches. Both models effectively balance precision and recall, which is
essential for minimizing both false positives and false negatives in hallucination detection pipelines.</p>
          <p>Support Vector Classifier and XGBoost achieve moderate performance, with Scores of 0.879 and 0.874,
respectively. Despite being more complex learners, they lag behind the ensemble methods, possibly due
to their sensitivity to data representation or hyperparameter tuning.</p>
          <p>Linear models like Logistic Regression and Ridge Regression also perform competitively, reaching
F1-scores of 0.863 and 0.862. Their success suggests that even without complex architectures,
high-dimensional TF-IDF representations can be effectively leveraged to detect semantic inconsistencies.</p>
          <p>
            At the lower end, Gradient Boosting and KNN scored lowest (0.784, 0.210), reflecting limited
generalization in sparse, high-dimensional spaces. The weak KNN performance aligns with prior findings [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]
on its inefficacy in complex multi-label settings.
          </p>
          <p>Overall, the findings confirm that tree-based ensemble models, particularly Extra Trees and Random
Forest, are highly effective in posthoc hallucination detection. Their ability to handle feature sparsity,
combined with robust discriminative performance, makes them suitable for integration into real-world
simplification quality assurance systems.</p>
          <p>The results presented in Table 9 demonstrate that ensemble-based classifiers achieve superior
performance in the sourced hallucination detection setting. Specifically, Extra Trees and Random Forest
attain the highest overall scores (F1-score: 0.950 and 0.945, respectively), indicating their robustness
in capturing subtle lexical or semantic cues related to spurious content. Both models exhibit excellent
recall (0.974 and 0.964) while maintaining high precision, suggesting a balanced ability to identify
hallucinated instances without overfitting.</p>
          <p>Among the linear models, Ridge Regression and Logistic Regression perform consistently well (F1-scores:
0.861 and 0.860), showing that even without non-linear transformations, TF-IDF-based representations
provide strong discriminative power. The Support Vector Classifier also demonstrates notable performance
(F1-score: 0.881), with an accuracy of 0.799 and ROC AUC of 0.688, confirming its capacity to construct
expressive hyperplanes for this binary classification task.</p>
          <p>SGD Classifier and Naive Bayes yield slightly lower performance (F1-scores: 0.842 and 0.838,
respectively), yet still maintain reasonable balance between precision and recall, affirming their utility as
lightweight and interpretable alternatives.</p>
          <p>The lowest performing model is Gradient Boosting, with an F1-score of 0.768 and recall of just 0.642,
despite a high precision of 0.955. This suggests that while the model is highly conservative in predicting
hallucinations (yielding few false positives), it fails to recall a significant portion of true hallucinated
cases — potentially due to overfitting or an inability to generalize across sparse lexical input.</p>
          <p>Overall, the sourced hallucination detection results corroborate the effectiveness of tree-based
ensembles and strong linear classifiers, which consistently achieve a desirable trade-off between precision
and recall, making them well-suited for reliable identification of hallucinated content in scientific text.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>3.5. Task 2.2 – Detect and Classify Information Distortion Errors in Simplified Sentences</title>
        <sec id="sec-2-6-1">
          <title>3.5.1. Evaluation Metrics</title>
          <p>The evaluation of Task 2.2 (Detect and Classify Information Distortion Errors) is framed as a multi-label
classification problem, where each simplified sentence may exhibit multiple error types drawn from a
predefined taxonomy.</p>
          <p>System performance is assessed using:
• Precision, Recall, and F1-score per error class;
• Macro-averaged F1-score across all labels, to account for class imbalance and to provide an overall
measure of system effectiveness.</p>
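          <p>On toy multi-label data, the per-class and macro-averaged F1 computation looks as follows; the labels and predictions are hypothetical.</p>

```python
# Illustration of per-class and macro-averaged F1 for multi-label error
# detection (rows = sentences, columns = hypothetical error types).
import numpy as np
from sklearn.metrics import f1_score

Y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 1],
                   [0, 0, 0]])
Y_pred = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])

per_class = f1_score(Y_true, Y_pred, average=None)     # one F1 per error type
macro_f1  = f1_score(Y_true, Y_pred, average="macro")  # unweighted mean
```

          <p>Macro averaging weights every error type equally, so rare but important categories (e.g., factuality hallucinations) count as much as frequent ones.</p>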
          <p>
            This evaluation setup enables a fine-grained assessment of a model’s ability to detect both surface-level
issues (e.g., grammar errors) and deeper semantic inconsistencies (e.g., factual hallucinations), in line
with prior work on multi-label learning [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] and factual consistency evaluation in text generation [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
The task design and metric definitions follow the CLEF 2025 SimpleText guidelines [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ].
          </p>
        </sec>
        <sec id="sec-2-6-2">
          <title>3.5.2. Experimental Results</title>
          <p>The classification results for Task 2.2 reveal substantial variability in performance across error categories,
highlighting the inherent difficulty of multi-label hallucination detection in scientific simplification.</p>
          <p>The system demonstrates strong performance in detecting sentences labeled as having no errors,
with an F1-score of 0.496, driven by high recall (0.937) but limited precision. This suggests that the
model tends to over-predict error-free cases, successfully retrieving many valid simplifications, albeit
with a high false-positive rate.</p>
          <p>In contrast, performance on hallucination-related categories remains low. For instance, Factuality
Hallucination (C1) and Prompt Misalignment (B2) achieve F1-scores of only 0.025 and 0.014,
respectively—reflecting the subtle, context-dependent nature of these phenomena and the challenge of
reliably capturing them from limited input representations.</p>
          <p>Some categories, such as Loss of Informative Content (D2.1) and Faithfulness Hallucination
(C2), yielded relatively higher F1-scores (0.290 and 0.185), suggesting that models are more capable of
detecting content reduction or minor semantic inconsistencies compared to abstract hallucination types.</p>
          <p>Overall, these findings indicate that while surface-level or structural errors (e.g., syntactic mistakes or
overgeneralization) are more tractable, deeper semantic distortions and hallucinations remain difficult
to detect using current feature-based classifiers—emphasizing the need for richer contextual modeling
or task-specific representation learning.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Discussion and Conclusions</title>
      <p>This work presented our participation in the CLEF 2025 SimpleText Track, addressing both scientific
text simplification and hallucination detection. For simplification (Tasks 1.1 and 1.2), instruction-tuned
large language models—such as FLAN-T5-Large and BART-SAMSum—demonstrated strong zero-shot
capabilities, particularly when scaled or enhanced through content cleaning. Notably, sentence-level
simplification benefited from increased model capacity, while document-level tasks required
coherence-aware prompting strategies.</p>
      <p>Looking ahead, we plan to investigate few-shot prompting for Tasks 1.1 and 1.2, incorporating
in-context examples to further improve simplification quality—especially in domains requiring
terminological precision and semantic fidelity.</p>
      <p>For hallucination detection (Task 2.1), tree-based ensemble classifiers (Extra Trees, Random Forest)
proved highly effective in both posthoc and sourced conditions. While these methods perform well
using lexical features, future work will explore transformer-based classifiers (e.g., fine-tuned BERT ) to
assess whether contextualized embeddings can better capture subtle inconsistencies beyond shallow
representations.</p>
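      <p>
        For concreteness, the kind of shallow lexical features such tree ensembles consume can be sketched as below. The three features (token overlap, novel-token rate, length ratio) are illustrative only, not the exact feature set of our submitted runs.
      </p>

```python
# Hypothetical lexical features for a (source, simplification) pair, of
# the kind a tree ensemble such as Extra Trees could consume; this is an
# illustrative sketch, not the exact feature set of our submitted runs.

def lexical_features(source, simplification):
    src = source.lower().split()
    simp = simplification.lower().split()
    src_set, simp_set = set(src), set(simp)
    if not simp_set:
        return [0.0, 0.0, 0.0]
    # Share of simplification tokens also present in the source.
    overlap = len(simp_set.intersection(src_set)) / len(simp_set)
    # Share of simplification tokens absent from the source: a crude
    # signal for potentially hallucinated content.
    novel = len(simp_set.difference(src_set)) / len(simp_set)
    # Compression or expansion relative to the source length.
    length_ratio = len(simp) / len(src) if src else 0.0
    return [overlap, novel, length_ratio]

feats = lexical_features(
    "The medicine caused drowsiness and fatigue.",
    "The medicine made the person tired and sleepy.",
)
print([round(f, 2) for f in feats])  # prints [0.43, 0.57, 1.33]
```

      <p>
        One such feature vector per sentence pair, stacked into a matrix, is the shallow representation a tree ensemble would be trained on.
      </p>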
      <p>Task 2.2 further revealed limitations in capturing fine-grained semantic distortions, with substantial
variation in performance across error categories. To address this, we aim to incorporate contextual
embeddings from transformer encoders (e.g., BERT, RoBERTa) into classification pipelines, and apply
hierarchical modeling and curriculum learning to better capture inter-error dependencies.
Enhancing data diversity via augmentation and annotation bootstrapping will also be critical for improving
generalization in underrepresented categories.</p>
      <p>Overall, our goal is to develop models that are both simplification-aware and hallucination-resilient,
supporting faithful and accessible communication of scientific content to non-expert audiences.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Acknowledgments</title>
      <p>We gratefully acknowledge the organizers of the CLEF 2025 SimpleText Track for their dedicated efforts
in designing and coordinating the track. The datasets, tools, and evaluation infrastructure they provided
formed a crucial basis for the development and assessment of our systems.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-6">
      <title>Appendix A: Submission Information</title>
      <p>• 5 runs for Task 1.1
• 5 runs for Task 1.2
• 10 runs for Task 2.1 (Posthoc)
• 8 runs for Task 2.1 (Sourced)
• 1 run for Task 2.2</p>
    </sec>
    <sec id="sec-7">
      <title>Appendix B: Prompt Examples for Task 1</title>
      <sec id="sec-7-1">
        <title>Task 1.1 – Sentence-level Simplification (FLAN-T5-XL).</title>
        <p>The following prompt was used in a zero-shot setting:
Prompt: Simplify: The medicine caused drowsiness and fatigue.</p>
        <p>Output: The medicine made the person tired and sleepy.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Task 1.1 – Sentence-level Simplification (BART-SAMSum).</title>
        <p>For BART, no explicit instruction was used. The model received the raw sentence directly as input:
Input: The medicine caused drowsiness and fatigue.</p>
        <p>Output: The drug made people feel tired and sleepy.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Task 1.2 – Document-level Simplification.</title>
        <p>The prompt used for document-level simplification was:</p>
        <p>Prompt: Simplify the following scientific document:
In this study, we investigate the structural behavior of graphene-based materials under
varying thermal and mechanical conditions. Our findings demonstrate significant
improvements in tensile strength and flexibility when integrated into polymer composites.
Output: This study looks at how graphene materials behave under heat and stress. The
results show they become stronger and more flexible in plastics.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , et al.,
          <article-title>Overview of clef 2025 simpletext track: Simplify scientific texts (and nothing more)</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), LNCS, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakker</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the clef 2025 simpletext task 1: Simplify scientific text</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2025</year>
          ),
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the clef 2025 simpletext task 2: Identify and avoid hallucination</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2025</year>
          ),
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , M. Lapata,
          <article-title>Sentence simplification with deep reinforcement learning</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics (TACL) 5</source>
          (
          <year>2017</year>
          )
          <fpage>365</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J. O.</given-names>
            <surname>Suarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Seddah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>Controllable sentence simplification with a constrained seq2seq model</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , p.
          <fpage>3537</fpage>
          -
          <lpage>3550</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          , et al.,
          <article-title>Prompting gpt for text simplification: A case study</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Madaan</surname>
          </string-name>
          , et al.,
          <article-title>Text simplification with large language models</article-title>
          ,
          <source>in: arXiv preprint arXiv:2302.13971</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bohnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>On faithfulness and factuality in abstractive summarization</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , p.
          <fpage>1906</fpage>
          -
          <lpage>1919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. R.</given-names>
            <surname>Zaiane</surname>
          </string-name>
          , et al.,
          <article-title>Evaluating the factual consistency of abstractive text summarization</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: NAACL</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2025 SimpleText track: Simplify scientific texts (and nothing more)</article-title>
          , in: J. Carrillo de Albornoz,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the clef 2025 simpletext task 1: Simplify scientific text</article-title>
          ,
          <source>in: [22]</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of the clef 2025 simpletext task 2: Identify and avoid hallucination</article-title>
          ,
          <source>in: [22]</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoles</surname>
          </string-name>
          ,
          <article-title>Optimizing statistical machine translation for text simplification</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          ,
          <year>2016</year>
          , p.
          <fpage>560</fpage>
          -
          <lpage>570</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , p.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>Bertscore: Evaluating text generation with bert</article-title>
          ,
          <source>International Conference on Learning Representations (ICLR)</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , G. Durrett,
          <article-title>Lens: A learned evaluation metric for sentence simplification</article-title>
          ,
          <source>in: Proceedings of ACL</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fawcett</surname>
          </string-name>
          ,
          <article-title>An introduction to roc analysis</article-title>
          ,
          <source>Pattern recognition letters 27</source>
          (
          <year>2006</year>
          )
          <fpage>861</fpage>
          -
          <lpage>874</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D. M. W.</given-names>
            <surname>Powers</surname>
          </string-name>
          ,
          <article-title>Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation</article-title>
          ,
          <source>Journal of Machine Learning Technologies</source>
          <volume>2</volume>
          (
          <year>2011</year>
          )
          <fpage>37</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Arampatzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Perifanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Symeonidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arampatzis</surname>
          </string-name>
          ,
          <article-title>DUTH at SemEval-2023 Task 9: An Ensemble Approach for Twitter Intimacy Analysis</article-title>
          ,
          <source>in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>1225</fpage>
          -
          <lpage>1230</lpage>
          . URL: https://aclanthology.org/2023.semeval-1.170. doi:10.18653/v1/2023.semeval-1.170.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Katakis</surname>
          </string-name>
          ,
          <article-title>Multi-label classification: An overview</article-title>
          ,
          <source>International Journal of Data Warehousing and Mining (IJDWM) 3</source>
          (
          <year>2007</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vendeville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wrigley</surname>
          </string-name>
          ,
          <article-title>Overview of the clef 2025 simpletext task 2: Identify and avoid hallucination</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.), Working Notes of CLEF 2025:
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>