<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OpenFact at CheckThat! 2025: Application of Self-Reflecting and Reasoning LLMs for Fact-Checking Claim Normalization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcin Sawiński</string-name>
          <email>marcin.sawinski@ue.poznan.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Węcel</string-name>
          <email>krzysztof.wecel@ue.poznan.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ewelina Księżniak</string-name>
          <email>ewelina.ksiezniak@ue.poznan.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Systems, Poznań University of Economics and Business</institution>
          ,
          <addr-line>Al. Niepodległości 10, 61-875 Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a system for the claim normalization task, developed for the CLEF 2025 CheckThat! Task 2 competition. We evaluated large language models of approximately 8B parameters, including LLaMA 3.1, DeepSeek-R1, and GPT-4.1-mini, using the METEOR score as the primary metric. GPT-4.1-mini with supervised fine-tuning emerged as the best-performing approach, ranking second or third on six out of seven languages offered in the zero-shot setting. Our study also explores self-reflection and multiple candidate selection techniques, finding that while self-reflection did not improve METEOR scores, it helped reduce factual errors. These insights highlight the need to balance metric-driven evaluation with qualitative analysis for effective claim normalization in real-world scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>check-worthiness</kwd>
        <kwd>fact-checking</kwd>
        <kwd>fake news detection</kwd>
        <kwd>language models</kwd>
        <kwd>claim normalization</kwd>
        <kwd>LLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The task of claim normalization involves transforming unstructured textual content, such as social
media posts, into concise and factually accurate claims suitable for downstream fact-checking [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Large language models (LLMs) have demonstrated strong potential for generative and summarization
tasks, suggesting their suitability for claim normalization. However, evaluating claim normalization
outputs is complex, as metrics such as METEOR often prioritize stylistic similarity over factual correctness.
This discrepancy is particularly important for fact-checking applications, where factual inaccuracies
can lead to misleading interpretations.</p>
      <p>This paper presents a system for claim normalization developed for the CLEF 2025 CheckThat! Task
2 competition. We evaluated decoder-only LLMs, including LLaMA 3.1, DeepSeek-R1, and
GPT-4.1-mini, and explored techniques such as self-reflection, multiple candidate generation, and
supervised fine-tuning. We posed the following research questions to examine the application of LLMs
to the claim normalization task:
• RQ1: How well are different LLMs suited for the claim normalization task?
• RQ2: Do reasoning fine-tuned models outperform base chat models?
• RQ3: How does self-reflection impact LLM performance on the claim normalization task?
• RQ4: How does multiple candidate generation and selection impact performance on the claim
normalization task?
• RQ5: Can supervised fine-tuning of LLMs improve performance on the claim normalization task?</p>
      <p>Our findings show that while self-reflection did not consistently improve METEOR scores, it helped
reduce factual errors in generated claims. Fine-tuning the GPT-4.1-mini model further improved
performance, demonstrating the importance of task-specific adaptation for robust claim normalization.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Sundriyal et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed the Check-worthiness Aware Claim Normalization (CACN) approach,
which leverages chain-of-thought prompting and reverse check-worthiness. The goal of chain-of-thought
prompting is to decompose a complex task into a sequence of simpler subtasks, using examples
to encourage step-by-step reasoning. Reverse check-worthiness, on the other hand, prioritizes claims
that meet the criteria for check-worthiness.
      </p>
      <p>
        Our study seeks to build upon the research of Sundriyal et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] by leveraging newer, more
capable LLMs and examining the impact of increased test-time compute on system performance. While
advancements in foundational models remain a primary focus within the field, it is widely acknowledged
that comprehensive systems built around LLMs can yield even greater improvements. We investigated
the use of reasoning-tuned models and self-reflection as potential methods to enhance performance.
      </p>
      <p>
        Reasoning-tuned models have gained significant attention by topping multiple leaderboards. However,
it has been observed that their performance varies depending on the complexity of the task, and in
some cases, tasks are better addressed by models without reasoning fine-tuning. A recent study [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
highlights that reasoning models often struggle to adapt generation length to task complexity, with
both overly short and excessively long generations leading to degraded performance. We decided to
run a direct comparison of the same model architecture with and without reasoning-tuning.
      </p>
      <p>
        Automated methods to auto-correct large language models (LLMs) aim to improve their output
without human intervention. These approaches include self-correction, where the LLM refines its
own outputs using self-generated feedback iteratively; generation-time correction, where LLMs adjust
outputs during generation guided by feedback from critic models or external knowledge sources; and
post-hoc correction, where outputs are refined after generation, leveraging external tools, knowledge
bases, or multi-agent debates. These strategies address errors like hallucinations, unfaithful reasoning,
and toxic content, offering flexible and scalable ways to enhance LLM performance autonomously [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Self-Refine [<xref ref-type="bibr" rid="ref4">4</xref>] is an iterative refinement approach in which the same LLM generates an initial answer, then
provides feedback on its own answer, and finally refines the answer using that feedback. Such an iterative,
self-correcting loop requires no external supervision, additional training, or reinforcement learning, and can
be applied across diverse tasks, especially those with complex or nuanced quality criteria.</p>
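      <p>A minimal sketch of such a self-refine loop is given below; the <monospace>llm</monospace> function is a hypothetical stand-in for any chat-completion call, and the prompts are illustrative rather than those used in [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
      <preformat>
# Minimal sketch of a Self-Refine-style loop; llm() is a hypothetical
# wrapper around any chat-completion API.
def self_refine(task: str, llm, max_iters: int = 3) -> str:
    answer = llm(f"Solve the task:\n{task}")
    for _ in range(max_iters):
        feedback = llm(
            f"Task:\n{task}\n\nAnswer:\n{answer}\n\n"
            "Give concrete feedback on how to improve the answer, "
            "or reply exactly STOP if it cannot be improved."
        )
        if feedback.strip() == "STOP":
            break  # the model judges its own answer as final
        answer = llm(
            f"Task:\n{task}\n\nAnswer:\n{answer}\n\n"
            f"Feedback:\n{feedback}\n\nRewrite the answer accordingly."
        )
    return answer
      </preformat>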
      <p>The self-correction approach has also been criticized: it has been shown that LLMs often fail to correct their
responses without external feedback, and that self-correction can even degrade performance [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>The dataset provided by the organizers comprised three splits: train, dev, and test. The train and dev
splits covered 13 languages, while the test split included 20 languages (see Table 1). The train and dev
splits contained both post and normalized claim columns, whereas the test split included only the post
column.</p>
      <p>An initial quality check of the train and dev splits revealed several issues that could negatively impact
training and evaluation:
• Language mismatch: the post and normalized claim were in different languages.
• Referenced media: the post text referred to external media (e.g., images or videos) that were
not available but necessary to formulate the claim.
• Content mismatch: the claim text referenced facts not present in the post text.
• Multiple claims: the post text contained multiple possible claims or detailed elements, while the
normalized claim arbitrarily selected only one relevant detail for fact-checking.</p>
      <p>To address these issues, we first filtered out examples with language mismatches. We then used
gpt-4.1-mini to flag semantic mismatches between the post and normalized claim, identify external
media references, and detect if the normalized claim contained information that was absent in the post.</p>
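      <p>A simplified sketch of this filtering step follows; the model name matches the one we used, while the <monospace>langdetect</monospace>-based language check and the prompt wording are illustrative assumptions, not the exact implementation.</p>
      <preformat>
# Illustrative data-quality filter: language check plus LLM flagging.
# Assumes the openai and langdetect packages are installed.
from langdetect import detect
from openai import OpenAI

client = OpenAI()

def keep_example(post: str, claim: str) -> bool:
    # Language mismatch: post and normalized claim must agree.
    if detect(post) != detect(claim):
        return False
    # Ask the LLM to flag semantic mismatch, external media references,
    # or claim content absent from the post.
    answer = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Reply FLAG if the claim mismatches the post, relies on "
                "unavailable external media, or adds facts absent from the "
                f"post; otherwise reply OK.\nPost: {post}\nClaim: {claim}"
            ),
        }],
    ).choices[0].message.content
    return answer.strip() == "OK"
      </preformat>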
      <p>Examples flagged in this process or excluded due to content policy violations (e.g., hate speech,
jailbreak, self-harm, sexual or violent content) were removed from further processing.</p>
      <p>Examples containing multiple or highly complex claims were retained. Overall, approximately one
third of the provided examples were used for training and evaluation, as shown in Table 1.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Data processing pipelines</title>
      <p>We modeled the system as an LLM-based self-reflection agent that iteratively improves outputs until no
further improvements are observed. We also introduced multiple initial candidate generations by
adjusting temperature or seed settings, depending on the model. This step was motivated by the
observation that in some cases, the initial claim phrasing anchored the model to specific details of the
post throughout subsequent iterations.</p>
      <p>The complete claim normalization process consisted of three steps:
• Initial claim extraction: generation of up to three claim candidates using a prompt with
guidelines for claim normalization.
• Improvement via self-reflection: iterative refinement of each claim candidate through multiple
self-reflection steps. The process was capped at a maximum number of steps but stopped earlier
if no changes were detected compared to the previous iteration.
• Selection with LLM-as-a-judge: the model was presented with up to three improved claim
candidates and tasked with selecting the best one.</p>
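      <p>A condensed sketch of these three steps is shown below; <monospace>llm</monospace> is a hypothetical wrapper around the chat calls described in this section, and the prompts paraphrase our guidelines rather than reproduce them.</p>
      <preformat>
# Condensed sketch of the pipeline: initial candidates, capped
# self-reflection with early stopping, and LLM-as-a-judge selection.
def normalize(post: str, guidelines: str, llm,
              n_candidates: int = 3, max_reflections: int = 5) -> str:
    # Step 1: up to three initial candidates (varied seeds).
    candidates = [
        llm(f"{guidelines}\n\nPost:\n{post}\n\nExtract the normalized claim.",
            seed=s)
        for s in range(n_candidates)
    ]
    # Step 2: iterative self-reflection, stopping early at a fixed point.
    refined = []
    for claim in candidates:
        for _ in range(max_reflections):
            new_claim = llm(
                f"{guidelines}\n\nPost:\n{post}\nCurrent claim:\n{claim}\n\n"
                "Refine the claim to best match the guidelines.",
                seed=0,  # deterministic refinement
            )
            if new_claim == claim:
                break  # no change detected, stop reflecting
            claim = new_claim
        refined.append(claim)
    # Step 3: the judge picks the best refined candidate.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(refined))
    choice = llm(f"{guidelines}\n\nPost:\n{post}\nCandidates:\n{numbered}\n\n"
                 "Reply with the number of the best candidate.", seed=0)
    return refined[int(choice.strip()) - 1]
      </preformat>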
      <p>Results were collected at three pipeline steps to measure the impact of different techniques:
• Initial claim extraction (Ini) — Baseline results obtained by generating a normalized claim
with a single call to the LLM for each post. This score reflects the basic output quality of the LLMs
without any additional techniques.
• Self-reflection (Ref) — Results achieved by applying the self-reflection technique to previously
generated outputs. In this step, the model received the post text, the normalized claim from
initial extraction or the previous iteration, and the claim normalization guidelines. The prompt
instructed the model to refine the normalized claim to best match the guidelines. This process
was repeated up to ten times for Llama and DeepSeek and up to five times for GPT and GPT FT, or
until no further changes were made. All iterations were run with temperature set to zero or the
same seed to ensure determinism.
• Candidate selection (Sel) — For Llama and DeepSeek, the initial claim extraction step was
repeated three times with different random seeds, resulting in varied claim formulations. The
second and third outputs were not reported in the previous steps. All three outputs were
independently refined using the self-reflection loop described above. Finally, the LLM was presented
with the three improved candidates and tasked to select the best one that matched the guidelines.
For GPT and GPT FT, the variability in initial outputs was very low with temperature set below 1,
and the self-reflection step with temperature zero consistently produced the same claim phrasing.</p>
      <p>As a result, this step was omitted for the GPT models.</p>
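      <p>For Llama and DeepSeek, the seed-varied initial generation can be expressed as in the sketch below, using the <monospace>ollama</monospace> Python client; the model tag and sampling options are plausible assumptions about our local setup.</p>
      <preformat>
# Sketch of seeded candidate generation with the ollama Python client;
# the model tag "llama3.1:8b" is an assumption about the local setup.
import ollama

def initial_candidates(post: str, guidelines: str, n: int = 3):
    claims = []
    for seed in range(n):
        response = ollama.chat(
            model="llama3.1:8b",
            messages=[
                {"role": "system", "content": guidelines},
                {"role": "user", "content": post},
            ],
            options={"seed": seed, "temperature": 0.7},
        )
        claims.append(response["message"]["content"])
    return claims
      </preformat>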
      <p>The LLMs were queried in chat mode, using message chains composed of a system and a user message.
Identical guidelines for claim normalization were used across all three processing steps.</p>
      <p>We applied supervised fine-tuning using the train datasets for all languages, converting them into
message chains that included system and user parts, combined with the assistant part derived from the
normalized claim column.</p>
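      <p>This conversion can be sketched as follows; the JSONL layout matches the OpenAI chat fine-tuning format, while the file names and the guidelines placeholder are assumptions.</p>
      <preformat>
# Sketch: convert (post, normalized claim) rows into chat-format JSONL
# for supervised fine-tuning; file names are placeholders.
import csv
import json

GUIDELINES = "..."  # the same claim normalization guidelines as in prompts

with open("train.csv", encoding="utf-8") as src, \
        open("train.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        example = {
            "messages": [
                {"role": "system", "content": GUIDELINES},
                {"role": "user", "content": row["post"]},
                {"role": "assistant", "content": row["normalized claim"]},
            ]
        }
        dst.write(json.dumps(example, ensure_ascii=False) + "\n")
      </preformat>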
    </sec>
    <sec id="sec-5">
      <title>5. Models</title>
      <p>We limited our study to LLMs with approximately 8B parameters. Larger models were excluded due
to hardware constraints, longer inference times, and prohibitive costs, given the large-scale nature of
the claim normalization task. Smaller models were excluded because of low performance observed in
preliminary experiments.</p>
      <p>The models selected for this study were: LLaMA 3.1 8B, DeepSeek-R1 8B (a reasoning fine-tuned
version of LLaMA 3.1 8B), and GPT-4.1-mini. While the specifications of GPT-4.1-mini are not publicly
available, we assumed its parameter count to be between 7B and 9B, making it a fair comparison to the
other models. LLaMA and DeepSeek models were run using 4-bit grouped quantization with the Ollama
backend. The hardware consisted of four NVIDIA GeForce RTX 2080 Ti GPUs, yielding a generation
speed of approximately 300 tokens per second.</p>
      <p>GPT-4.1-mini was used in its default configuration and also fine-tuned for this task using Azure
OpenAI services with the following hyperparameters: number of epochs 3, batch size 4, and LR multiplier
2.</p>
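      <p>With the OpenAI Python client against Azure OpenAI, such a job can be launched roughly as sketched below; the endpoint, API version, and file handling are simplified placeholders, while the hyperparameters match those listed above.</p>
      <preformat>
# Rough sketch of launching the fine-tuning job; endpoint and API
# version are placeholders, the hyperparameters match the paper.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",  # placeholder
    api_version="2024-10-21",  # placeholder
    api_key="...",
)

training_file = client.files.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    model="gpt-4.1-mini",
    training_file=training_file.id,
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 2,
    },
)
print(job.id)
      </preformat>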
      <p>In the experiments section, we refer to the models as follows:
• LLaMA 3.1 8B — Llama
• DeepSeek-R1 8B — DeepSeek
• GPT-4.1-mini — GPT
• GPT-4.1-mini fine-tuned on the train dataset — GPT FT</p>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental Results</title>
      <sec id="sec-6-1">
        <title>6.1. METEOR score results analysis</title>
        <p>All models and techniques were evaluated using the METEOR score on the dev dataset split. For readability,
all metrics in Table 2 are presented as percentages.</p>
        <p>Results are reported per pipeline step to highlight the impact of different techniques.</p>
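        <p>Per-example METEOR scores can be computed as in the sketch below, which uses NLTK's implementation; the organizers' scorer may differ in tokenization details, so this is an assumption about tooling.</p>
        <preformat>
# Sketch of per-example METEOR scoring with NLTK; the official
# evaluation script may tokenize differently.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # resource required by meteor_score

def meteor_percent(reference: str, hypothesis: str) -> float:
    score = meteor_score([reference.split()], hypothesis.split())
    return 100.0 * score  # reported as a percentage, as in Table 2

print(meteor_percent(
    "Pro-Pakistan slogans shouted outside an airport in India",
    '"Pakistan Zindabad" slogans were raised in Silchar Airport by AIUDF members.',
))
        </preformat>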
        <sec id="sec-6-1-1">
          <title>6.1.1. Initial Claim Extraction</title>
          <p>The Llama model achieved a METEOR score of 24 across all languages using only the initial claim
generation step. The DeepSeek model yielded a lower METEOR score of 16. In contrast, the GPT model
achieved a METEOR score of 42 with the same approach.</p>
          <p>A language-wise breakdown showed the same trend: GPT consistently outperformed Llama, while
Llama outperformed DeepSeek in every language within the dev dataset.</p>
          <p>Although the DeepSeek model produced promising “thinking” traces—iterating over most of the
guideline items to generate normalized claims—this was not reflected in the final scores. This discrepancy
may be partly attributed to DeepSeek’s lower adherence to instructions and the addition of extraneous
text to the final output (e.g., “The claim extracted from the post is: . . . ”).</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. Self-reflection</title>
          <p>The METEOR scores for outputs processed through self-reflection loops were not consistently higher
than those of the initially generated outputs. While specific language-model intersections showed some
improvement, no clear overall trend or consistent gains were observed.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>6.1.3. Candidate Selection Results</title>
          <p>Generating multiple initial outputs and selecting the best candidate using the LLM as a judge did not
lead to improvements in METEOR scores.</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>6.1.4. Supervised Fine-tuning</title>
          <p>Fine-tuning the GPT model on the train dataset improved the METEOR score by 16 across all languages.
This approach proved to be the most effective adaptation technique tested in the study.</p>
          <table-wrap id="tbl3">
            <label>Table 3</label>
            <caption>
              <p>Example claim candidates and their METEOR scores</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th>Candidate</th>
                  <th>METEOR</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>1. Dandelion is able to grow 98% of cancer cells within 4 hours</td>
                  <td>84</td>
                </tr>
                <tr>
                  <td>2. Dandelion is able to kill 1% of cancer cells within 4 hours</td>
                  <td>84</td>
                </tr>
                <tr>
                  <td>3. Dandelion kills 98% of cancer cells within 2 days</td>
                  <td>63</td>
                </tr>
                <tr>
                  <td>4. Dandelion kills nearly all cancer cells in 2 days</td>
                  <td>25</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Additional manual analysis</title>
        <p>Scoring claim normalization results presents several challenges. First, the METEOR score does not
adequately recognize semantically equivalent sentences and tends to prioritize stylistic features over
factual accuracy. As shown in Table 3, incorrect claims 1 and 2—which reverse the meaning or alter
numerical values—achieve high scores of 84, while correct claims 3 and 4, which use alternative
phrasings, score lower at 63 and 25, respectively.</p>
        <p>Analysis of the results indicates that high METEOR scores (above 80) are only observed when the post is
short and contains a well-formulated claim that can be extracted verbatim. For more complex examples,
exact phrasing alignment with the annotator’s version is nearly impossible, leading to significantly
lower scores. In these cases, generated claims often contain substantial factual errors or omissions that
the METEOR metric fails to capture.</p>
        <p>Second, there is no established gold standard for claim normalization. With multiple annotation
authors and styles, it is often impossible to determine whether the reference or the generated claim is
more appropriate when both are factually correct but differ in the level of detail. This situation results
in low METEOR scores, even though both reference and generated claims may be accurate. An example
from the dev dataset illustrating this challenge is provided below.</p>
        <p>Example:
Post:
#AIUDF_Pakistan_Zindabad. #AntiNationalAIUDF
"Pakistan Zindabad" Slogans Raised in Silchar Airport By AIUDF Members. #AIUDF supreme
#BadruddinAjmal should immediately resign as MP ! #Shame #Shame I request District administration should
take after the mtr immediately. DC Cachar
Reference:
Pro-Pakistan slogans shouted outside an airport in India</p>
        <p>Note that the reference claim includes additions such as “Pro-Pakistan”, “shouted”, and “outside”,
which were introduced by the annotator and are not present in the original post text. Conversely,
important details like WHO: “AIUDF Members” and WHERE: “Silchar” were omitted in the reference.
Llama output – initial extraction – attempt 1 – METEOR 17:
Slogans "Pakistan Zindabad" were raised in Silchar Airport by AIUDF Members.</p>
        <p>The claim generated by Llama received a METEOR score of only 17, despite being factually accurate
and complete.</p>
        <p>Llama output – self-reflection – attempt 1 – METEOR 11:
AIUDF members allegedly raised pro-Pakistani slogans ("Pakistan Zindabad") at Silchar Airport.</p>
        <p>It could be argued that the final claim generated by Llama after self-reflection was in fact improved:
it retained all relevant information and accompanied “Pakistan Zindabad” with the more general term
“pro-Pakistani slogans”, as in the reference. However, the METEOR score dropped further to 11. The addition
of the word “allegedly” was not consistent with the claim normalization strategy.</p>
        <p>Llama output – initial extraction – attempt 2 – METEOR 13:
"Pakistan Zindabad" slogans were raised in Silchar Airport by AIUDF members.</p>
        <p>Note: I’ve removed hashtags, emojis, and repeated text to extract the main claim, which is about the
slogan "Pakistan Zindabad" being raised at the airport.</p>
        <p>On a second attempt, the Llama model included additional explanatory text while still receiving a METEOR
score of 13.</p>
        <p>Llama output – self-reflection – attempt 2 – METEOR 12:
AIUDF members allegedly raised "Pakistan Zindabad" slogans at Silchar Airport.</p>
        <p>Again, the self-reflection process objectively improved the output by removing extraneous text, but the
METEOR score further decreased to 12.</p>
        <p>DeepSeek output – initial extraction – attempt 1 – METEOR 16:
"Pakistan Zindabad" Slogans Raised in Silchar Airport By AIUDF Members. #AIUDF supreme
#BadruddinAjmal should immediately resign as MP !
DeepSeek produced a meaningful output on the first attempt; however, it ignored the instruction to
remove hashtags and received a METEOR score of 16.</p>
        <p>DeepSeek output – self-reflection – attempt 1 – METEOR 16:
"Pakistan Zindabad Slogans Were Raised by AIUDF Members at Silchar Airport, Calling for Badruddin
Ajmal’s Resignation as MP."
The self-reflection process in DeepSeek removed the hashtags and polished the phrasing of the claim,
but this improvement was not reflected in any change in the METEOR score. It is also worth noting
that DeepSeek attempted to generate a more complex claim by combining the “slogans raised” and
“Ajmal should resign” information pieces.</p>
        <p>GPT output – initial extraction – attempt 1 – METEOR 22:
"Pakistan Zindabad" slogans were raised in Silchar Airport by All India United Democratic Front
members.</p>
        <p>GPT received a METEOR score of 22 by generating a claim similar to that produced by Llama, likely
due to the expansion of an abbreviation and the addition of the word “India”, which overlapped with
the reference claim.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>Our experiments indicate that claim normalization can be effectively handled by decoder-only LLMs,
such as GPT-4.1-mini, even without fine-tuning. While the METEOR metric provides a quantitative
measure of performance, it does not always align with factual correctness or task-specific nuances.
This limitation underscores the importance of qualitative review when evaluating claim normalization
outputs.</p>
      <p>Subtle aspects of fact-checking practice are challenging to encode as prompting guidelines. This may
explain why fine-tuning—despite the GPT model already performing well with default weights—further
improved performance. Fine-tuning likely helps align the model more closely with task-specific details
that are hard to capture through prompting alone.</p>
      <p>Interestingly, our results suggest that strict adherence to normalization guidelines may be particularly
useful for clustering posts, even when these clusters differ from claims extracted by professional
fact-checkers. In this sense, prioritizing guideline adherence could offer practical advantages beyond just
accuracy metrics.</p>
      <p>The self-reflection approach did not yield improved METEOR scores. However, manual analysis of
the outputs revealed that self-reflection effectively removed many factual errors present in the initial
generations. This highlights the potential of iterative improvement to refine outputs, even if it is not
reflected in automated scoring.</p>
      <p>A final observation concerns model-specific differences: DeepSeek’s “thinking” process, while
thorough, appears less efficient, as many tokens are devoted to reasoning rather than concise output
generation. In contrast, the more straightforward Llama outputs—despite scoring lower overall than
fine-tuned GPT—remain practically usable, especially in resource-constrained scenarios where
processing speed and scalability are critical.</p>
      <p>The system based on the fine-tuned GPT-4.1-mini ranked second for Czech (with a METEOR score of 21.44
on the test split), Bengali (29.59), and Marathi (30.48), and third for Greek (23.33), Telugu (45.59),
Romanian (23.50), Dutch (18.66), and Punjabi (26.96) out of 20 languages in the CLEF 2025 CheckThat!
Task 2 competition. It is important to note that the system ranked second or third in 6 out of 7 languages
for which only test data were available (the so-called zero-shot setting).</p>
      <p>Overall, our findings emphasize that decoder-only LLMs can achieve strong results for claim
normalization, and that practical trade-offs (such as hardware limitations and processing time) should be
carefully considered when selecting models and techniques for real-world deployments.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT to check grammar, spelling, and style.
The tool was applied to selected paragraphs, and all corrections were manually reviewed and approved.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The research is supported by the project “OpenFact – artificial intelligence tools for verification of
veracity of information sources and fake news detection” (INFOSTRATEG-I/0035/2021-00), granted
within the INFOSTRATEG I program of the National Center for Research and Development, under the
topic: Verifying information sources and detecting fake news.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>From chaos to clarity: Claim normalization to empower fact-checking</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>6594</fpage>
          -
          <lpage>6609</lpage>
          . URL: https://aclanthology.org/2023.findings-emnlp.439/. doi:10.18653/v1/2023.findings-emnlp.439.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <article-title>Between underthinking and overthinking: An empirical study of reasoning length and correctness in LLMs</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.00127. arXiv:2505.00127.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saxon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nathani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>12</volume>
          (
          <year>2024</year>
          ) 484–506. URL: https://aclanthology.org/2024.tacl-1.27/. doi:10.1162/tacl_a_00660.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, P. Clark, Self-refine: Iterative refinement with self-feedback, 2023. URL: https://arxiv.org/abs/2303.17651. arXiv:2303.17651.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, D. Zhou, Large language models cannot self-correct reasoning yet, 2024. URL: https://arxiv.org/abs/2310.01798. arXiv:2310.01798.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>