<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tarık Saraç</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Selin Mergen</string-name>
          <email>s.mergen@etu.edu.tr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mucahid Kutlu</string-name>
          <email>mucahidkutlu@qu.edu.qa</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Qatar University</institution>
          ,
          <addr-line>Doha</addr-line>
          ,
          <country country="QA">Qatar</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TOBB University of Economics and Technology</institution>
          ,
          <addr-line>Ankara</addr-line>
          ,
          <country country="TR">Türkiye</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present our work developed for the scientific web discourse detection task (Task 4a) of CheckThat! 2025. We propose a novel council debate method that simulates structured academic discussions among multiple large language models (LLMs) to identify whether a given tweet contains (i) a scientific claim, (ii) a reference to a scientific study, or (iii) mentions of scientific entities. We explore three debating methods: i) single debate, where two LLMs argue for opposing positions while a third acts as a judge; ii) team debate, in which multiple models collaborate within each side of the debate; and iii) council debate, where multiple expert models deliberate together to reach a consensus, moderated by a chairperson model. We choose council debate as our primary model as it outperforms others in the development test set. Although our proposed method did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), it ranked first in detecting references to scientific studies.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Discourse Detection</kwd>
        <kwd>Debating Method</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this work, we present our approach for Subtask 4a (Scientific Web Discourse Detection) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of the
CheckThat! 2025 shared task [
        <xref ref-type="bibr" rid="ref2 ref20">2</xref>
        ]. The subtask involves three binary classification problems for a given tweet:
• Category 1: Does the tweet contain a scientific claim?
• Category 2: Does the tweet reference a scientific study or publication?
• Category 3: Does the tweet mention scientific entities, such as a university or scientist?
In this work, we propose a few-shot classification approach in which LLMs engage in debate to
reach a final decision. Specifically, we introduce three distinct debating strategies:
single debate, team
debate, and council debate. In the single debate setting, two LLMs argue from opposing perspectives,
while a third model serves as the judge. In team debate, multiple models collaborate on each side; team
members first discuss internally before presenting their collective arguments to the opposing team.
Finally, in the council debate approach, a group of expert models discuss together to reach a consensus,
moderated by a chairperson model. As we observed a strong correlation between the positive classes in
Category 2 and Category 3, we adopt a simple heuristic: if our method predicts that a tweet contains a
reference to a scientific study or publication (Category 2), we also label it as positive for Category 3 (i.e.,
containing mentions of scientific entities).
      </p>
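      <p>The Category 2 to Category 3 propagation described above can be sketched in a few lines of Python; the function name and dictionary layout are ours, purely for illustration:

```python
def apply_category_heuristic(preds: dict) -> dict:
    """Propagate a positive Category 2 label to Category 3.

    `preds` maps category ids (1, 2, 3) to binary labels. A tweet that
    references a scientific study (Category 2) is also labeled as
    mentioning scientific entities (Category 3).
    """
    out = dict(preds)  # leave the caller's dict untouched
    if out.get(2) == 1:
        out[3] = 1
    return out
```

The heuristic runs after the three per-category debates, so a positive Category 2 verdict overrides a negative Category 3 one.</p>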
      <p>
        In our experiments, we utilize several LLMs, including Gemma3 (12B) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Qwen3 (8B) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
DeepSeekR1 (8B) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Phi-4 (14B) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Mistral (7B) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], LLaMA 3.1 (8B) [
        <xref ref-type="bibr" rid="ref8">8</xref>
], o4-mini, and Claude-4.
      </p>
      <p>When using
only open-source models, the council debate method outperforms other approaches on the development
set, and the team debate method is superior to the single debate method. Based on these results, we
selected the council debate method as our primary method.</p>
      <p>In the official rankings, our model ranked 8th (F1 = 0.7273) in Category 1 (detecting scientific claims)
and 9th (F1 = 0.7766) in Category 3 (detecting mentions of scientific entities). However, it achieved 1st
place (F1 = 0.7805) in Category 2 (detecting references to scientific studies).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Scientific Web Discourse Detection</title>
        <p>
          Detecting scientific claims and related content in social media, especially on platforms like Twitter,
has become a growing area of research in recent years [
          <xref ref-type="bibr" rid="ref1 ref2 ref20 ref9">1, 2, 9</xref>
          ]. Shared tasks such as CLEF CheckThat!
(Task 4a) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], SemEval-2023 Task 8 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and the FIRE-2023 CLAIMSCAN challenge [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] have focused on
distinguishing scientific claims, references, and science-related entities from general online discourse.
Most competitive systems in these tasks use supervised learning, with large pre-trained transformer
models like BERT or RoBERTa fine-tuned on carefully annotated datasets, often augmented with
auxiliary features such as tweet metadata or ensemble techniques [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. While these methods deliver
strong results, they rely on large labeled datasets and task-specific training, which can limit adaptability
to new domains or languages. In contrast, our approach utilizes the few-shot reasoning capabilities
of LLMs and uses a debate-style framework, where multiple models collectively reason and reach a
consensus on the presence of scientific discourse, rather than a single model making a prediction. Thus,
our methods offer better flexibility and reduce reliance on large annotated training datasets.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Debating LLMs for Scientific Discourse Detection</title>
        <p>
Recently, debate-based multi-agent frameworks using LLMs have been proposed to improve model
reasoning, evaluation, and decision-making [
          <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
          ]. In these systems, multiple LLMs (sometimes
with different roles or perspectives) discuss, argue, or collaborate on a task before producing a final
judgment, often resulting in outcomes that align more closely with human assessment than those from
a single model. Earlier debate methods typically featured two agents (one "pro" and one "con") with a
judge, or small fixed panels, mainly for open-ended generation or evaluation tasks [
          <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
          ]. However,
homogeneous panels can sometimes fall into biased agreement or fail to explore diverse viewpoints [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
Eo et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] address efficiency by triggering debates only on uncertain cases. Our council debate
method builds on these ideas by using a more diverse set of LLM agents, with each agent contributing
its own reasoning. Unlike previous works focused on generation quality evaluation, we apply the
debate framework directly to scientific discourse classification in social media. By combining several
LLMs in a structured decision process, our approach aims to reduce individual model bias and promote
more careful evaluation, resulting in more reliable detection of scientific claims, references, and entities
in noisy online environments.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methods</title>
      <p>We explored three different multi-LLM approaches for scientific discourse detection, each building
upon the previous method’s insights. All approaches use structured deliberation processes but differ
in their organizational structure and decision-making mechanisms. In our methods, we use several
different prompts. The actual prompts used in our study are given in Appendix A. We apply each
method independently for each category. However, when a tweet references a scientific study (Category
2), we also assign a positive label to Category 3 (i.e., containing scientific entities), as such references
typically imply the presence of scientific terms. An overview of these three debate frameworks is
illustrated in Figure 1. Now we explain each debate method in detail.</p>
      <sec id="sec-3-1">
        <sec id="sec-3-1-1">
          <title>3.1. Single Debate Method</title>
          <p>Our first approach implements a traditional debate format where two LLMs argue for opposing positions
while a third model acts as a judge. This method processes each category independently through separate
debates.</p>
          <p>Algorithm 1 describes our single debate method. For each interaction with the LLMs, a brief prompt
is given to summarize the actual prompt; the full prompts are provided in Appendix A.3. The algorithm
takes the tweet and the classification category as parameters. We first set the models used in the
debate process and assign specific models to different roles: a proponent model P that argues the
tweet contains the category, an opponent model O that argues against it, and a judge model J that
makes the final decision [Lines 2-4]. The debate begins with opening statements where the proponent
generates supporting arguments and the opponent generates opposing arguments for the given tweet
and category [Lines 5-6]. We maintain a transcript that records all arguments throughout the debate
[Line 7]. The core debate consists of R rounds where the proponent rebuts the opponent’s argument,
and the opponent rebuts the proponent’s argument in response [Lines 8-12]. After the R rebuttal
rounds, both sides provide closing statements to summarize their positions [Lines 13-15]. Finally, the
judge model evaluates the complete debate transcript and makes a classification decision for the current
category [Lines 16-17].</p>
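          <p>A minimal Python sketch of this debate loop, assuming a generic `generate(model, prompt)` callable that stands in for any LLM backend; the prompt strings are abbreviations of the actual prompts in Appendix A.3:

```python
from typing import Callable, List

def single_debate(tweet: str, category: str, rounds: int,
                  generate: Callable[[str, str], str],
                  proponent: str, opponent: str, judge: str) -> str:
    """Two models argue opposing positions; a third judges the transcript."""
    # Opening statements for and against the category.
    pro = generate(proponent, f"argue tweet '{tweet}' contains {category}")
    con = generate(opponent, f"argue tweet '{tweet}' does NOT contain {category}")
    transcript: List[str] = [pro, con]
    # Rebuttal rounds: each side answers the other's latest argument.
    for _ in range(rounds):
        pro = generate(proponent, f"rebut the opponent argument: {con}")
        con = generate(opponent, f"rebut the proponent argument: {pro}")
        transcript += [pro, con]
    # Closing statements, then the judge's verdict over the full transcript.
    transcript += [generate(proponent, "provide closing statement"),
                   generate(opponent, "provide closing statement")]
    return generate(judge, f"evaluate debate for {category}: " + " | ".join(transcript))
```

In our runs the judge reply is then parsed into a binary label, per the response format in Appendix A.3.</p>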
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Team Debate Method</title>
          <p>Building on the single debate approach, we propose a team-based method where multiple models
collaborate within each side of the debate. This approach maintains the adversarial structure while
adding diverse perspectives within each team. Importantly, team members discuss among themselves
before presenting their arguments to the opposing team.</p>
          <p>Algorithm 2 describes our team debate method. For each interaction with the LLMs, a brief prompt
is given to summarize the actual prompt; the full prompts are provided in Appendix A.4. We assign teams
of models to different roles: a proponent team TP that argues the tweet contains the category, an
opponent team TO that argues against it, and a judge model J that makes the final decision [Lines
2-4]. The debate begins with internal team discussions where proponent team members collaborate to
develop their strategy and arguments, followed by opponent team members doing the same [Lines
5-6]. Each team then presents their opening statements, with all proponent team members arguing for
the category and all opponent team members arguing against it [Lines 7-8]. We maintain a transcript
that records all arguments throughout the debate [Line 9]. The core debate consists of R rounds where
teams first conduct internal discussions to plan their rebuttals, then present coordinated responses to
the opposing team’s arguments [Lines 10-15]. Finally, the judge model evaluates the complete team
debate transcript and makes a classification decision for the current category [Lines 17-18].</p>
          <p>Algorithm 1 Single Debate Method
1: Input: tweet text t, category c, maximum rounds R
2: P ← proponent model for category c
3: O ← opponent model for category c
4: J ← judge model
5: a_0 ← Generate(P, "argue tweet t contains category c")
6: b_0 ← Generate(O, "argue tweet t does NOT contain category c")
7: T ← [a_0, b_0]
8: for r = 1 to R do
9:   a_r ← Generate(P, "rebut opponent argument b_{r-1}")
10:  b_r ← Generate(O, "rebut proponent argument a_r")
11:  T ← T + [a_r, b_r]
12: end for
13: ca ← Generate(P, "provide closing statement")
14: cb ← Generate(O, "provide closing statement")
15: T ← T + [ca, cb]
16: d ← Generate(J, "evaluate debate T for category c")
17: return ExtractClassification(d)</p>
          <p>Algorithm 2 Team Debate Method
1: Input: tweet text t, category c, team size k, maximum rounds R
2: TP ← proponent team of size k for category c
3: TO ← opponent team of size k for category c
4: J ← judge model
5: SP ← InternalDiscussion(TP, "develop strategy for category c")
6: SO ← InternalDiscussion(TO, "develop strategy against category c")
7: AP ← TeamArguments(TP, "argue tweet t contains category c")
8: AO ← TeamArguments(TO, "argue tweet t does NOT contain category c")
9: T ← [AP, AO]
10: for r = 1 to R do
11:  SP_r ← InternalDiscussion(TP, "plan rebuttals to TO arguments")
12:  SO_r ← InternalDiscussion(TO, "plan rebuttals to TP arguments")
13:  AP_r ← TeamArguments(TP, "present coordinated rebuttals")
14:  AO_r ← TeamArguments(TO, "present coordinated rebuttals")
15:  T ← T + [AP_r, AO_r]
16: end for
17: d ← Generate(J, "evaluate team debate T for category c")
18: return ExtractClassification(d)</p>
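          <p>The internal-discussion step, in which each member builds on teammates’ earlier turns, can be sketched as follows; the names and prompt wording are ours, and `generate` again stands in for the LLM backend:

```python
from typing import Callable, List

def team_round(team: List[str], instruction: str,
               generate: Callable[[str, str], str]) -> List[str]:
    """One internal-discussion pass over a team of models.

    Each member is prompted in turn and sees the contributions made by
    earlier teammates, so later turns can build on (and avoid repeating)
    earlier ones.
    """
    turns: List[str] = []
    for member in team:
        context = " | ".join(turns)  # teammates' contributions so far
        turns.append(generate(member, f"{instruction} (discussion so far: {context})"))
    return turns
```

The same pass serves both InternalDiscussion and TeamArguments, differing only in the instruction and in whether the opposing team’s arguments are included in the context.</p>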
        </sec>
        <sec id="sec-3-1-3">
          <title>3.3. Council Debate Method</title>
          <p>Our final approach implements a collaborative council structure where multiple expert models deliberate
together to reach consensus, moderated by a chairperson model. This method moves away from
adversarial debate toward collaborative decision-making.</p>
          <p>Algorithm 3 describes our council debate method. For each interaction with the LLMs, a brief prompt
is given to summarize the actual prompt; the full prompts are provided in Appendix A.5. Firstly, we assign a
council of expert models and a chairperson model to moderate the discussion [Lines 2-3]. The
process begins with each council member providing an initial assessment and vote for the current
category [Line 4]. We then check whether the initial votes have reached the consensus threshold. If
consensus is achieved, we finalize the decision using the majority vote [Lines 5-7]. Otherwise, we
proceed with up to R rounds of structured discussion in which the chairperson summarizes the current state
and guides the focus, and each council member then responds to the discussion and updates their
position [Lines 8-15]. After each round, we check whether consensus has been reached or the votes
have stabilized, in which case the discussion finishes early [Lines 12-14]. Finally, we use the majority vote from the
final round as our decision [Line 16].</p>
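          <p>The consensus test and majority vote are straightforward; a sketch with our 80% threshold (helper names are ours):

```python
from collections import Counter
from typing import List

def majority_vote(votes: List[str]) -> str:
    """Return the most common vote (e.g. "YES" or "NO")."""
    return Counter(votes).most_common(1)[0][0]

def consensus_reached(votes: List[str], threshold: float = 0.8) -> bool:
    """True when the leading vote's share meets the consensus threshold."""
    leading = Counter(votes).most_common(1)[0][1]
    return leading / len(votes) >= threshold
```

With five council members and a 0.8 threshold, consensus requires at least four agreeing votes.</p>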
          <p>To provide concrete examples of how our debating methods work in practice, we present conversation
logs from each method in Appendix B, showing how different models interact to reach a final decision.</p>
          <p>Algorithm 3 Council Debate Method
1: Input: tweet text t, category c, consensus threshold θ, maximum rounds R
2: C ← assign council members for category c
3: chair ← assign chairperson
4: V ← CollectVotes(C, "assess tweet t for category c")
5: if isConsensusReached(V, θ) then
6:   return MajorityVote(V)
7: else
8:   for r = 1 to R do
9:     s ← Generate(chair, "summarize discussion and guide focus")
10:    D ← CollectResponses(C, s, "update positions")
11:    V ← ExtractVotes(D)
12:    if isConsensusReached(V, θ) or VotesStabilized(V) then
13:      break
14:    end if
15:  end for
16:  return MajorityVote(V)
17: end if</p>
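          <p>Putting the pieces together, the council loop can be sketched in Python as below; `generate` and `vote_of` (which reads a YES/NO vote out of a member’s reply) are placeholders for the LLM backend and response parser:

```python
from collections import Counter
from typing import Callable, List

def council_debate(tweet: str, category: str, members: List[str], chair: str,
                   generate: Callable[[str, str], str],
                   vote_of: Callable[[str], str],
                   threshold: float = 0.8, max_rounds: int = 5) -> str:
    """Members vote; the chair moderates until consensus or stabilization."""
    replies = [generate(m, f"assess tweet '{tweet}' for {category}") for m in members]
    votes = [vote_of(r) for r in replies]
    for _ in range(max_rounds):
        top, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= threshold:
            break  # consensus reached
        summary = generate(chair, "summarize discussion and guide focus")
        replies = [generate(m, f"update your position given: {summary}")
                   for m in members]
        new_votes = [vote_of(r) for r in replies]
        if new_votes == votes:
            break  # votes stabilized; stop early
        votes = new_votes
    return Counter(votes).most_common(1)[0][0]  # majority vote
```
</p>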
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>Dataset. We conducted experiments using the official CheckThat! 2025 Task 4a datasets, which include
training, development, and test sets with tweets labeled for three categories of scientific content. Table
1 provides the label distribution for the datasets. As our approach does not need training, we only use
the development set to pick our primary model.</p>
        <p>Models. Our council debate framework was implemented using the Ollama framework for local
model access, with parallel processing capabilities to handle multiple concurrent debates efficiently. We
systematically evaluated different model combinations across our three proposed methods using the
following models:
• Gemma3 (12B): Architecture Gemma3, 12.2B parameters, Q4_K_M quantization
• Qwen3 (8B): Architecture Qwen3, 8.19B parameters, Q4_K_M quantization
• DeepSeek-R1 (8B): Architecture Llama, 8.03B parameters, Q4_K_M quantization
• Phi4 (14B): Architecture Phi3, 14.7B parameters, Q4_K_M quantization
• Mistral (7B): Architecture Llama, 7.52B parameters, Q4_0 quantization
• Llama3.1 (8B): Architecture Llama, 8.03B parameters, Q4_K_M quantization
• o4-mini: OpenAI’s commercial model accessed via API
• Claude-4: Anthropic’s commercial model accessed via API
Configuration: For our three proposed methods, we used the following configurations:
• Council Debate: Five council members (Gemma3, Qwen3, DeepSeek-R1, Phi4, Mistral), with
Llama3.1 serving as the chairperson.</p>
        <p>• Team Debate: We tested two configurations with Llama3.1 serving as the judge: (1) a same-teams
configuration with five members on each side using all models, and (2) a different-teams
configuration with Team A (Gemma3, Qwen3, Mistral) versus Team B (DeepSeek-R1, Phi4, Llama3.1), each
team having three members.
• Single Debate: We tested both same-model configurations (the same model as both proponent and
opponent) and different-model configurations, with Llama3.1 serving as the judge. The different
model pairings include DeepSeek-R1 vs Qwen3, Gemma3 vs Phi4, and o4-mini vs Claude-4.</p>
        <p>Key hyperparameters were set as follows: minimum consensus threshold of 80% for council and
team debates, maximum of 5 discussion rounds for collaborative methods, and 3 rounds for adversarial
single debates. The system included checkpointing capabilities to ensure robustness during large-scale
processing. All models were accessed through the Ollama framework with the quantization settings
specified above.</p>
        <p>Baseline Models. In order to better analyze the impact of our debating methods, we also use each
LLM we picked separately for few-shot detection. For these baseline models, we use the detailed
category descriptions with examples (provided in Appendix A.2) for few-shot learning prompts.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results on Development Set</title>
        <p>We evaluated all three proposed methods and baselines on the development set to select our best
approach for the final submission. Table 2 shows the results.</p>
        <p>Our observations on experiments with the development set are as follows. Firstly, the council debate
method achieves the highest macro F1-score and excels particularly in Category 1 (i.e., scientific claims
detection) with an F1-score of 0.8756. Secondly, the commercial model pairing of o4-mini vs Claude-4
showed superior performance in Categories 2 and 3 with F1-scores of 0.8923 and 0.8478, respectively.
This is likely due to their larger model sizes. Thirdly, ignoring the commercial models, the team debate
method shows improved performance over single debate by incorporating collaborative discussion
within teams, with different team configurations outperforming same-team configurations. Lastly, the
comparison with individual models demonstrates the significant impact of our debate methods. In
particular, all debate approaches outperform detection using models individually.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Official Ranking</title>
        <p>At the time of the submission deadline, we did not have results for the single debate method with o4-mini
vs. Claude-4. Therefore, we picked the council debate method as our primary method and submitted
the results accordingly. On the test set, our council debate method achieves 0.7273 (ranked 8th), 0.7805
(ranked 1st), and 0.7766 (ranked 9th) for Category 1, Category 2, and Category 3, respectively. Our
results show that while LLMs are not particularly effective at detecting scientific claims or scientific
entities, they perform well in identifying references to scientific studies.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we present our participation in CheckThat! 2025 Task 4a, the scientific web discourse
detection task. We propose three debating methods where multiple LLMs discuss with each other to
classify tweets. The debate frameworks enable models to have opposing views and try to convince each
other. Among three methods, the council debate framework, in which expert models discuss together
to reach a consensus, moderated by a chairperson model, outperforms other debating frameworks.</p>
      <p>Although our council debate framework showed relatively weak performance in Category 1 (8th
place, F1 = 0.7273) and Category 3 (9th place, F1 = 0.7766), it achieved the highest F1-score (0.7805) in
Category 2.</p>
      <p>In future work, we plan to extend our debating framework to other classification tasks. We also aim
to investigate the impact of prompt design and utilizing other LLMs.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4.5 and Claude Sonnet 4 in order to: Grammar
and spelling check. After using these tools/services, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>Appendix</title>
    </sec>
    <sec id="sec-8">
      <title>A. Prompts and System Messages</title>
      <p>This appendix contains the system prompts and message templates used in our three multi-LLM
approaches, as well as the category descriptions used in the task.</p>
      <sec id="sec-8-1">
        <title>A.1. System Prompts</title>
        <p>Category Titles:
1. Contain scientific claims
2. Reference to scientific studies/publications
3. Mention any scientific entities</p>
        <sec id="sec-8-1-1">
          <p>Proponent System Prompt:</p>
          <p>You are a scientific content detector participating in a formal debate. Your job is to argue
why the given tweets [CATEGORY TITLE]. Be thorough and precise in your analysis.
Provide specific evidence from the text and URLs to support your arguments. Keep your
responses concise and focused on the strongest evidence.</p>
          <p>Opponent System Prompt:</p>
          <p>You are a scientific content critic participating in a formal debate. Your job is to argue why
the given tweets do NOT [CATEGORY TITLE]. Be thorough and precise in your analysis.
Provide specific evidence from the text and URLs to support your arguments. Keep your
responses concise and focused on the strongest counter-evidence.</p>
          <p>Judge System Prompt:</p>
          <p>You are a neutral judge evaluating a debate about whether the tweets [CATEGORY
TITLE]. Based on the debate transcripts and the tweets themselves, determine if the tweets
[CATEGORY TITLE]. Provide your classification and a brief explanation of your decision,
including which arguments from the debate you found most compelling, in the following
JSON format: {"category": 0 or 1, "explanation": [EXPLANATION]}</p>
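          <p>Since judge replies sometimes wrap this JSON in extra prose, a tolerant parser helps. A sketch follows; the helper name is ours, and the fallback to the negative class is an illustrative design choice, not part of the task definition:

```python
import json
import re

def extract_classification(reply: str) -> int:
    """Pull the {"category": 0 or 1, ...} verdict out of a judge reply.

    Searches for the first JSON object instead of parsing the whole reply,
    since models often surround the JSON with explanatory text.
    """
    match = re.search(r"\{.*?\}", reply, re.DOTALL)
    if match:
        try:
            return int(json.loads(match.group(0))["category"])
        except (json.JSONDecodeError, KeyError, ValueError, TypeError):
            pass
    return 0  # unparseable reply: fall back to the negative class
```
</p>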
        </sec>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Category Descriptions</title>
        <p>• Example 1 (Positive): "Diabetes Research and New Registry Aim to Improve
Outcomes
https://www.acc.org/latest-in-cardiology/articles/2015/03/04/16/32/diabetesresearch-and-new-registry-aim-to-improve-outcomes?wt.mc_id=twitter #NCDR"
Explanation: Contains a URL to cardiology articles (domain: acc.org, path: /articles) and
mentions research directly.
• Example 2 (Positive): "With our current lifestyle, most of us are sleep deprived, which
creates problems like mood disorders, weakened immunity, weight gain and diabetes. A
review of 16 studies found that sleeping for less than 6 to 8 hours a night increases the
risk of early death by as much as 12%. image"
Explanation: References "a review of 16 studies", indicating a scientific literature review.
• Example 3 (Negative): "How can this be unfolding?
https://www.bbc.co.uk/news/health53990068"
Explanation: BBC news link about health but not specifically referencing scientific
studies or publications.
• Example 4 (Negative): "How Employment Can Change the Life of Someone with a
Disability
http://www.tennesseeworks.org/how-employment-can-change-the-life-of-someonewith-a-disability-and-everyone-involved/"
Explanation: General informational content from a non-academic source, not referencing
scientific studies.
• Example 5 (Negative): "how u treat others will reflect on how u feel"
Explanation: General life advice without scientific backing or evidence.</p>
      </sec>
      <sec id="sec-8-3">
        <title>A.3. Single Debate Method Prompts</title>
        <p>Proponent Prompt:</p>
        <p>Task: You are participating in a formal debate about whether a given tweet [CATEGORY
TITLE].
Rules &amp; Definitions: - @user tags represent anonymized usernames from the original
tweet - ”image” at the end indicates an attached image - Examine both tweet content and
URLs carefully - Focus on science rather than technology, religion, or politics
Your Role: Argue why the tweet [CATEGORY TITLE]. Provide specific evidence from the
text and URLs to support your arguments.
Previous Discussion: [DEBATE CONTEXT IF ANY]
Response Format: Provide thorough analysis with specific evidence. Keep responses
concise and focused on the strongest evidence supporting your position. ONLY write your
response, do not include any other text.</p>
        <p>Opponent Prompt:</p>
        <p>Task: You are participating in a formal debate about whether a given tweet [CATEGORY
TITLE].</p>
        <p>Category Description: [CATEGORY DESCRIPTION]
Rules &amp; Definitions: - @user tags represent anonymized usernames from the original
tweet - ”image” at the end indicates an attached image - Examine both tweet content and
URLs carefully - Focus on science rather than technology, religion, or politics
Your Role: Argue why the tweet does NOT [CATEGORY TITLE]. Provide specific evidence
from the text and URLs to support your arguments.</p>
        <p>Tweet: [TWEET TEXT]
Previous Discussion: [DEBATE CONTEXT IF ANY]
Response Format: Provide thorough analysis with specific counter-evidence. Keep
responses concise and focused on the strongest evidence against the classification. ONLY
write your response, do not include any other text.</p>
        <p>Judge Prompt:</p>
        <p>Task: You are a neutral judge in a debate about whether a given tweet [CATEGORY TITLE].
Evaluate the debate and make the final classification decision.</p>
        <p>Category Description: [CATEGORY DESCRIPTION]
Rules &amp; Definitions: - @user tags represent anonymized usernames from the original
tweet - ”image” at the end indicates an attached image - Consider both tweet content and
URLs in your decision - Focus on science rather than technology, religion, or politics
Tweet: [TWEET TEXT]
Debate Transcript: [FULL DEBATE DISCUSSION]
Response Format: Provide your classification and explanation in the following JSON
format: {"category": 0 or 1, "explanation": "Brief explanation of your decision, including
which arguments you found most compelling"}</p>
      </sec>
      <sec id="sec-8-4">
        <title>A.4. Team Debate Method Prompts</title>
        <p>Team Member (Internal Discussion) Prompt:</p>
        <p>Task: You are a member of a [PROPONENT/OPPONENT] team discussing whether a tweet
[CATEGORY TITLE]. Collaborate with teammates to develop strategy.
Rules &amp; Definitions: - @user tags represent anonymized usernames from the original
tweet - ”image” at the end indicates an attached image - Examine both tweet content and
URLs carefully - Focus on science rather than technology, religion, or politics
Your Role: Discuss with teammates to develop coordinated arguments and strategy. Share
your perspective and build upon teammates’ viewpoints.</p>
        <p>Tweet: [TWEET TEXT]
Team Discussion: [INTERNAL TEAM DISCUSSION CONTEXT]
Response Format: Share your analysis and strategic insights with your team. Focus on
identifying strongest arguments and coordinating with teammates to avoid repetition.
Team Member (External Debate) Prompt:</p>
        <p>Task: You are a member of a [PROPONENT/OPPONENT] team in a formal debate about
whether a tweet [CATEGORY TITLE]. Present coordinated arguments to the opposing
team.</p>
        <p>Category Description: [CATEGORY DESCRIPTION AND GUIDELINES]
Rules &amp; Definitions: - @user tags represent anonymized usernames from the original
tweet - ”image” at the end indicates an attached image - Examine both tweet content and
URLs carefully - Focus on science rather than technology, religion, or politics
Your Role: Based on your team’s internal discussion, present arguments and respond to
the opposing team’s points.</p>
        <p>Tweet: [TWEET TEXT]
Team Strategy: [INTERNAL TEAM DISCUSSION RESULTS]
Debate Context: [EXTERNAL DEBATE DISCUSSION]
Response Format: Present clear, coordinated arguments that build upon your team’s
strategy. Respond to the opposing team’s points while avoiding repetition with teammates.
Team Debate Judge Prompt:</p>
        <p>Task: Evaluate a team debate about whether a tweet [CATEGORY TITLE] and make the
final classification decision.</p>
        <p>Category Description: [CATEGORY DESCRIPTION AND GUIDELINES]
Rules &amp; Definitions: - @user tags represent anonymized usernames from the original
tweet - "image" at the end indicates an attached image - Consider both tweet content and
URLs in your decision - Focus on science rather than technology, religion, or politics
Tweet: [TWEET TEXT]
Team Debate Transcript: [FULL TEAM DEBATE DISCUSSION]
Response Format: Provide your classification and explanation in the following JSON
format: {"category": 0 or 1, "explanation": "Brief explanation based on the most convincing
team arguments"}</p>
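<p>The judge's response format is intended to be machine-readable JSON, but in practice models sometimes wrap the object in extra prose. A minimal, tolerant parser sketch (the function name and fallback strategy are our own illustration, not the paper's implementation):</p>

```python
import json
import re

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's {"category": ..., "explanation": ...} response.

    Tries strict JSON first, then falls back to extracting the first
    {...} block, since LLMs often surround the JSON with commentary.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in judge response")
        return json.loads(match.group(0))
```

<p>A parser like this lets the pipeline recover the classification even when the judge prefixes its verdict with a sentence of explanation.</p>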
      </sec>
      <sec id="sec-8-5">
        <title>A.5. Council Debate Method Prompts</title>
        <p>Council Member Prompt:</p>
        <p>Task: You are a member of a scientific council discussing whether a tweet [CATEGORY
TITLE]. Collaborate with other council members to reach consensus.
Rules &amp; Definitions: - @user tags represent anonymized usernames from the original
tweet - "image" at the end indicates an attached image - Examine both tweet content and
URLs carefully - Focus on science rather than technology, religion, or politics
Your Role: 1) Share your perspective on whether the tweet [CATEGORY TITLE], 2)
Respond to points made by other council members, 3) Provide specific evidence from the
tweet to support your position.
Council Discussion: [ONGOING DISCUSSION CONTEXT]
Chairperson Summary: [CHAIRPERSON GUIDANCE IF ANY]
Response Format: Provide a concise, analytical, evidence-based contribution. After
presenting your arguments, conclude with your vote: VOTE: [YES/NO]. Use the following
JSON format: {"vote": "YES" or "NO", "explanation": "Brief explanation of your decision,
including which arguments you found most compelling"}
Chairperson Prompt:</p>
        <p>Task: You are the Chairperson of a scientific council discussing whether a tweet
[CATEGORY TITLE]. Moderate the discussion and guide consensus building.</p>
        <p>Category Description: [CATEGORY DESCRIPTION AND GUIDELINES]
Rules &amp; Definitions: - @user tags represent anonymized usernames from the original
tweet - "image" at the end indicates an attached image - Consider both tweet content and
URLs in moderation - Focus on science rather than technology, religion, or politics
Your Role: 1) Moderate the discussion, 2) Summarize key points, 3) Determine when the
discussion should end, 4) Guide focus to unresolved issues.</p>
        <p>Tweet: [TWEET TEXT]
Council Discussion: [CURRENT DISCUSSION STATE]
Vote Status: [CURRENT VOTING RESULTS]
Response Format: Summarize key points from the discussion and guide council members
to focus on specific aspects. The discussion should end when a clear consensus emerges
(at least 80% agreement), no new substantive arguments are being presented, or the
maximum number of rounds is reached. Be impartial and facilitate productive discussion.
Use the following JSON format: {"status": "CONSENSUS REACHED" or "CONSENSUS NOT
REACHED", "summary": "Brief summary of the discussion"}</p>
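<p>The chairperson's stopping rule (consensus at a &gt;=80% majority) can be checked mechanically over the collected votes. A minimal sketch; the function name and tuple return are our own illustration of the rule, not the paper's code:</p>

```python
def consensus_reached(votes: list[str], threshold: float = 0.8) -> tuple[bool, str]:
    """Check the chairperson's consensus rule over YES/NO votes.

    Returns (reached, majority_label); consensus requires the majority
    side to hold at least `threshold` of all votes cast.
    """
    yes = sum(v == "YES" for v in votes)
    no = len(votes) - yes
    majority = "YES" if yes >= no else "NO"
    share = max(yes, no) / len(votes)
    return share >= threshold, majority
```

<p>With five council members, a 3-2 split gives only 60% agreement and the discussion continues, while a 4-1 split reaches the 80% threshold, matching the council log in Appendix B.</p>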
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Conversation Logs</title>
      <p>This appendix contains detailed conversation logs from our debating methods, demonstrating how
different models interact to reach classification decisions. Table 3 presents an example interaction
between LLMs in the Single Debate Method for Category 1 (i.e., Does the tweet contain a scientific
claim?). The participants are as follows: DeepSeek-R1 (8B) as proponent, Gemma3 (12B) as opponent,
and Llama3.1 (8B) as judge.
Opponent (Gemma3): "The tweet discusses a legal case about Harvard admissions and cites
support from Asian-American groups. It presents no scientific finding or research result, so
it does not contain a scientific claim."
Proponent (DeepSeek-R1): "That support is often grounded in empirical social-science studies
on admission bias. Even if implicit, it points to a scientific claim about systemic bias in
university admissions."
Judge (Llama3.1): summarizes the opening statements...</p>
      <p>Opponent (Gemma3): "Indirect implications aren't enough—there's no scientific language,
data, or study citation in the tweet itself."
Proponent (DeepSeek-R1): "Social-science claims often surface without jargon. The attached URL
could point to a data-driven study; brevity doesn't negate the claim's scientific nature."
Judge (Llama3.1) recap: summarizes the process...</p>
      <p>Opponent (Gemma3): "Speculating about hidden studies isn't evidence. The tweet simply
reports community support—a societal fact, not a scientific one."
Proponent (DeepSeek-R1): "True, explicit evidence is absent; but citing discrimination necessarily
leans on established research. That connection keeps the door open to a scientific claim."
Judge (Llama3.1) recap: summarizes the process...</p>
      <p>Opponent (Gemma3): "In sum, no explicit scientific claim or evidence appears in the tweet."
Proponent (DeepSeek-R1): "There's an implicit research-based claim about admission bias,
though not spelled out."
Judge (Llama3.1), final decision: {"category": 0, "explanation": "No explicit scientific
claim found; tweet merely describes a legal event without scientific evidence."}</p>
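<p>The exchange in Table 3 follows a fixed turn order in each round. A minimal orchestration sketch, assuming a generic <monospace>ask(model, prompt)</monospace> helper that returns a model's text response (the helper, prompts, and model identifiers below are illustrative, not the paper's implementation):</p>

```python
def run_single_debate(ask, tweet: str, rounds: int = 3) -> str:
    """Run proponent/opponent turns with judge recaps, then a final verdict.

    `ask(model, prompt)` is assumed to return the model's text response.
    """
    transcript = []
    for _ in range(rounds):
        for role, model in [("Proponent", "deepseek-r1:8b"),
                            ("Opponent", "gemma3:12b")]:
            context = "\n".join(transcript)
            reply = ask(model, f"Tweet: {tweet}\nDebate so far:\n{context}\n"
                               f"Respond as the {role}.")
            transcript.append(f"{role}: {reply}")
        # The judge summarizes after each round to keep the context compact.
        recap = ask("llama3.1:8b", "Summarize the debate:\n" + "\n".join(transcript))
        transcript.append(f"Judge recap: {recap}")
    # Final classification in JSON form, as in the judge prompt above.
    return ask("llama3.1:8b", "Give the final JSON verdict:\n" + "\n".join(transcript))
```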
      <p>Table 4 presents an example interaction between LLMs in the Team Debate Method for Category 2 (i.e.,
Does the tweet reference a scientific study/publication?). The participants are as follows. Proponent Team:
Gemma3, Qwen3, and Mistral; Opponent Team: DeepSeek-R1, Phi4, and Llama3.1; Judge: Llama3.1.</p>
      <p>R1 – Team discussion. Gemma3 (Pro. Team): "The Bloomberg URL provided could potentially link to studies
regarding admissions biases or discrimination."
Qwen3 (Pro. Team): "Agree; typically, lawsuits regarding admissions include references to
demographic studies or research findings."
...
DeepSeek-R1 (Opp. Team): "The tweet itself mentions no specific studies, just an article from
Bloomberg, a news outlet."
Phi4 (Opp. Team): "Correct, we should highlight the distinction between news reports
and scholarly references."</p>
      <p>R1 – Debate. Proponent Team: "The provided Bloomberg URL likely discusses data or findings relevant
to admissions discrimination, inherently implying reference to scientific studies on racial biases
or admissions processes."
Opponent Team: "While possible, the tweet explicitly does not reference any scientific
studies or publications. Bloomberg generally provides news reports rather than scholarly articles."</p>
      <p>R2 – Team discussion. DeepSeek-R1 (Opp. Team): "Our stance remains solid; tweet lacks explicit academic references."
Phi4 (Opp. Team): "Highlight absence of specific terms like 'study' or 'research'."
...
Qwen3 (Pro. Team): "We must clarify our point; referencing an admissions lawsuit usually
draws on research findings."
Gemma3 (Pro. Team): "We should also acknowledge indirect referencing clearly."</p>
      <p>R2 – Debate. Proponent Team: "Admissions-related lawsuits almost always involve referencing
demographic studies or analyses, even if indirectly. The URL is strongly suggestive of such
scholarly content."
Opponent Team: "Indirect implications are insufficient. The tweet itself explicitly
provides no reference to specific scientific studies or academic literature."</p>
      <p>R3 – Team discussion. Phi4 (Opp. Team): "Reaffirm explicit reference absence."
Llama3.1 (Opp. Team): "Consider the practical differentiation clearly between news and academia."
...
Pro. Team: "We can reiterate strongly that the Bloomberg article is likely supported
by scholarly analysis."
Pro. Team: "Maintain emphasis on indirect referencing."</p>
      <p>R3 – Debate. Proponent Team: "Given Bloomberg's reputation, reporting on admissions lawsuits
typically includes statistical or demographic research findings."
Opponent Team: "Reputation alone doesn't establish explicit reference. No concrete
scientific studies or papers explicitly appear in the tweet."
Both Teams: Both teams repeat similar positions, with proponents emphasizing indirect
scholarly implications and opponents maintaining the lack of explicit reference.
Judge (Llama3.1) – Final decision: {"category": 0, "explanation": "Although the URL could
potentially discuss research, the tweet itself explicitly lacks direct reference to any
scientific studies or publications."}</p>
      <p>Table 5 presents an example interaction between LLMs in the Council Debate Method for Category 3 (i.e.,
Does the tweet mention scientific entities?). The council members are Gemma3, Qwen3, DeepSeek-R1,
Phi4, and Mistral, while Llama3.1 acts as chairperson.</p>
      <p>Round 1. Gemma3: "Harvard is explicitly mentioned, and universities conducting research
qualify as scientific entities. VOTE: YES"
Qwen3: "Harvard University is widely recognized as a scientific institution. VOTE: YES"
DeepSeek-R1: "Harvard clearly counts as a scientific entity given its status and role
in scientific research. VOTE: YES"
Phi4: "The tweet mentions Harvard but doesn't explicitly refer to its scientific
or academic nature, just a lawsuit. VOTE: NO"
Mistral: "No explicit indication of Harvard's scientific role or research in the tweet. VOTE: NO"
Chairperson (Llama3.1): YES: 3, NO: 2 (Consensus: 60% - Not reached). "Initial consensus not
reached; clarify whether simply mentioning Harvard is sufficient to consider it a scientific entity."</p>
      <p>Round 2. Gemma3: "Harvard's extensive involvement in scientific research is widely
recognized. Mention alone implicitly references its academic nature. VOTE: YES"
Qwen3: "Agreeing with Gemma3. Harvard's recognition as a prominent
research institution implies scientific entity status. VOTE: YES"
DeepSeek-R1: "Reaffirming YES. The broad academic and scientific recognition of
Harvard inherently qualifies its mention as referencing a scientific entity. VOTE: YES"
Phi4: "While Harvard is a known research institution, the tweet context
doesn't reference research or science explicitly, only legal action. VOTE: NO"
Mistral: "Still NO. Explicit context in the tweet is a lawsuit; no scientific research
or entity context directly mentioned."
Chairperson (Llama3.1): YES: 3, NO: 2 (Consensus: 60% - Still not reached). "Consensus not
achieved yet. Consider explicitly if context within the tweet matters or if inherent
recognition suffices."</p>
      <p>Round 3. Gemma3: "The nature of Harvard as inherently academic/scientific overrides the
specific context of the tweet. Harvard itself implies scientific credibility. VOTE: YES"
Qwen3: "Harvard's identity inherently conveys scientific affiliation regardless
of tweet context. I maintain YES. VOTE: YES"
DeepSeek-R1: "Context in tweets can be implicit. Harvard's mention is sufficient to
implicitly consider scientific entity recognition. VOTE: YES"
Phi4: "Given Harvard's strong reputation and implicit association with
science and academia, I'm persuaded that implicit mention suffices here. VOTE: YES"
Mistral: "Context explicitly remains unrelated to science. Still NO due to explicit
context only."
Chairperson (Llama3.1): YES: 4, NO: 1 (Consensus: 80% - Achieved)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Kartal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Boland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bringay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! Lab Task 4 on Scientific Web Discourse</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.), Working Notes of CLEF 2025 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          ,
          CLEF
          <year>2025</year>
          , Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <surname>V. V.</surname>
          </string-name>
          ,
          <article-title>The clef-2025 checkthat! lab: Subjectivity, fact-checking, claim normalization, and retrieval</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Gemma Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vieillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Merhej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Matejovicova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rivière</surname>
          </string-name>
          , et al.,
          <source>Gemma 3 technical report, arXiv preprint arXiv:2503.19786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          , et al.,
          <source>Qwen3 technical report, arXiv preprint arXiv:2505.09388</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Song,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , S. Ma,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          , et al.,
          <article-title>Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:2501.12948</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Behl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hewett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Javaheripi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          , et al.,
          <source>Phi-4 technical report, arXiv preprint arXiv:2412.08905</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. de las Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Lavaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Sayed</surname>
          </string-name>
          , Mistral 7b,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models, arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bringay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <article-title>Scitweets-a dataset and annotation framework for detecting scientific online discourse</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3988</fpage>
          -
          <lpage>3992</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Khetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wadhwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wallace</surname>
          </string-name>
          , S. Amir,
          <article-title>SemEval-2023 task 8: Causal medical claim identification and related PIO frame extraction from social media posts</article-title>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2266</fpage>
          -
          <lpage>2274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          , T. Chakraborty,
          <article-title>Overview of the claimscan-2023: Uncovering truth in social media through claim detection and identification of claim spans</article-title>
          ,
          <source>in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Panchendrarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research</article-title>
          ,
          <source>Natural Language Processing Journal</source>
          <volume>7</volume>
          (
          <year>2024</year>
          )
          <fpage>100066</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Chateval: Towards better llm-based evaluators through multi-agent debate</article-title>
          ,
          <source>arXiv preprint arXiv:2308.07201</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Debatrix: Multi-dimensional debate judge with iterative chronological analysis based on LLM</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>14575</fpage>
          -
          <lpage>14595</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hughes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Valentine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ruis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radhakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <article-title>Debating with more persuasive llms leads to more truthful answers</article-title>
          ,
          <source>in: Proceedings of the 41st International Conference on Machine Learning</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>23662</fpage>
          -
          <lpage>23733</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Can chatgpt defend its belief in truth? evaluating llm reasoning via debate</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>11865</fpage>
          -
          <lpage>11881</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Taubenfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <article-title>Systematic biases in llm simulations of debates</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Estornell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Multi-llm debate: Framework, principals, and interventions</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>37</volume>
          (
          <year>2024</year>
          )
          <fpage>28938</fpage>
          -
          <lpage>28964</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Eo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Zi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>Debate only when necessary: Adaptive multiagent collaboration for efficient llm reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2504.05047</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>