<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SINAI at Touché: From Generation to Evaluation through Multistep and Comparative Prompting for Retrieval-Augmented Debate</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>María Estrella Vallecillo-Rodríguez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María Teresa Martín-Valdivia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arturo Montejo-Ráez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, SINAI, CEATIC, Universidad de Jaén</institution>
          ,
          <addr-line>23071</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This article describes the participation of the SINAI research group in the Retrieval-Augmented Debating shared task at CLEF 2025, which includes two subtasks: Subtask 1 focuses on generating multi-turn argumentative responses using retrieved evidence, while Subtask 2 addresses the automatic evaluation of debate responses based on quality, quantity, relation, and manner. For both subtasks, we employed the instruction-tuned LLaMA3.1-8B-Instruct model with a structured, multi-step prompting strategy to guide the model's reasoning. In Subtask 1, the generation process was divided into five stages, from analyzing the dialogue tone and argumentative strategy to formulating retrieval queries and generating the final response. This enabled the model to produce concise, coherent, and well-supported counterarguments, leading to a 4th place ranking overall. For Subtask 2, we explored three prompting paradigms (Zero-shot, Few-shot, and Analyzer strategies) to assess the model's ability to classify responses according to the four evaluation metrics. Experimental results demonstrate the effectiveness of structured reasoning, particularly with the Analyzer strategy, which achieved competitive performance across all metrics and led in Manner. Our system illustrates the potential of open-source language models for structured, retrieval-enhanced argumentative dialogue generation and evaluation, even when competing against proprietary models.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Argument retrieval</kwd>
        <kwd>Argumentative Response Generation</kwd>
        <kwd>Debate Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) are increasingly integrated into everyday life, but their widespread
use also raises concerns about the reliability of the content they generate—particularly when based on
online sources that may be false or misleading. To address this, it is essential to develop systems that
make their reasoning transparent, enabling users to evaluate the credibility of the underlying evidence.
This is especially important in the context of social media, where hate speech often circulates in the form
of offensive statements lacking valid argumentation. While automatic counter-narrative generation has
been explored in different languages such as English [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Spanish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] among others [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], existing
approaches frequently fall short in argumentative richness. Incorporating structured argumentation
could not only expose flawed reasoning but also contribute to influencing perspectives or informing
bystanders. Additionally, automated systems can engage consistently over time, potentially reducing
the spread of harmful content.
      </p>
      <p>
        The Retrieval-Augmented Debating shared task was proposed to develop generative retrieval systems
capable of arguing against users, with the goal of supporting opinion formation, confirmation, or
debate training. It includes two subtasks. The first focuses on building a multi-turn debating system
that responds to random claims by counterattacking or defending previous arguments, using distinct
retrieved arguments and limiting responses to 60 words. The second subtask aims to automatically
evaluate such systems using four metrics: quantity (informativeness), quality (truthfulness), relation
(relevance), and manner (clarity). To support these tasks, the organizers released the ClaimRev dataset
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which includes arguments retrieved from the Kialo platform (https://www.kialo.com/) and simulated debates based on 100
claims from the ChangeMyView subreddit (https://www.reddit.com/r/changemyview/) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        To explore the reasoning capabilities of current large language models, we selected
LLaMA3.1-8B-Instruct [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for both tasks. In Subtask 1, we propose a system based on a multistep prompt strategy, as
used in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], to guide the model through a complex task by dividing it into manageable steps. This ensures
that the model has a clear goal at each stage, allowing it to build coherent responses incrementally.
The first step involves analyzing the tone and style of the conversation to determine the appropriate
argumentative approach—either by identifying weak points in the opponent’s response or addressing
their main idea—and selecting the argumentative perspective and type of evidence to retrieve. This
analysis is guided by principles from logic (supporting conclusions with premises), dialectics (interactive
discourse rules) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and rhetoric (capturing and persuading the audience) [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ], as discussed in
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In the second step, the model generates up to three queries to retrieve arguments, specifying the
target idea and whether it intends to support or refute it. The argument retrieval is then performed
using an Elasticsearch API, which returns six different arguments. Steps three and four involve
filtering and selecting the most relevant arguments per query and then refining that selection for final
use. Finally, in step five, the model generates the final response, integrating the selected arguments
while adapting tone, style, and perspective; any incompatible arguments may be excluded if they do
not align with the intended rhetorical strategy. For the second subtask, we experiment with different
prompting strategies [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], as the way information is presented to the model plays a crucial role in
shaping its final response. Specifically, for each evaluation metric, we first examine the behavior of the
model in a zero-shot learning (ZSL) setting, where it does not receive examples. This allows us to assess
its prior knowledge of the task. We then test a one-shot setting to determine whether the model can
generalize from a single example. Finally, we explore a few-shot setting in which the model is provided
with 1, 3, or 5 examples per possible label for the given metric. Based on these examples, the model is
instructed to generate a list of reasons that justify the choice of one label over another. In a subsequent
step, this reasoning is incorporated into the prompt, and the model is asked to assign a label to the text
under evaluation.
      </p>
      <p>The rest of this paper is structured as follows: Section 2 provides a detailed overview of the proposed
system developed for the shared task. Section 3 describes the dataset employed and outlines the
methodology adopted to address the task. In Section 4, we present the experimental results obtained
during the development and evaluation phases. Lastly, Section 5 offers concluding remarks and a
discussion of our findings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System overview</title>
      <p>
        This section describes the developed systems designed to address the subtasks of Retrieval-Augmented
Debating [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] at CLEF 2025. Due to the significant differences between the two subtasks, this section
is organized into two subsections. The first addresses response generation and argument retrieval
strategies, whereas the second focuses on evaluating debate systems according to various metrics,
resembling a ranking task. All prompts used to implement each strategy can be found in Appendix A,
specifically in Subsections A.1 and A.2, corresponding to subtask 1 and subtask 2, respectively.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Subtask 1: Debate response generator</title>
        <p>As explained above, the task is highly complex. It requires retrieving arguments that counter the
claims of our opponent or support the position of the system. Once the relevant arguments have been
identified, the system must generate an appropriate response. To handle this complexity, we divided the
overall process into smaller, more manageable steps. This decomposition not only simplifies the task for
the model, but also helps reduce hallucinations and improves the relevance of the results. The proposed
system consists of the following five steps: (1) Identification of the textual expression and argumentative
strategy to be used in the generated response, (2) Query generation for the construction of search
queries to retrieve relevant arguments from the database, (3) Initial argument retrieval: Selects the most
relevant arguments of each query, (4) Filter all the selected arguments of each query to select the most
suitable arguments for response generation, and (5) Counter-response generation that produces the
final response opposing the input claim, using the selected arguments. A visual representation of these
steps can be found in Figure 1.</p>
        <p>Once all steps have been defined, we now provide a more detailed explanation of each one:
• Step 1: Discourse and Argumentative Strategy Analyzer. In this step, the system analyzes
the current debate context to define how it should respond in terms of tone, style, argumentative
strategy, type of argument, and perspective. The tone can be Neutral, Respectful,
Assertive/Critical, or Inquisitive. The style may be Academic, Colloquial, Socratic, Logical, or Rhetorical.
Argumentative strategies include Comprehensive (respond to all points), Focused (target weaker
points), Principled (challenge underlying assumptions), and Free. The argument types are Logical,
Ethical, Emotional, and Analogical, while perspectives include Economic, Moral/Ethical, Scientific,
Pragmatic, Historical, and Cultural. Once all of these dimensions are defined, the system proceeds
to generate search queries accordingly.
• Step 2: Query Generator for Argument Retrieval. In this step, the system evaluates the
opponent’s input and, based on the argumentative strategy defined earlier, selects up to three
key ideas to either attack or support. It then formulates one query per idea to search for relevant
arguments in a retrieval database. The goal is to ensure that the selected ideas align with the
desired argumentative strategy and that the queries are focused, relevant, and diverse enough to
enrich the subsequent response generation.
• Step 2.5: Argument Retrieval via Elasticsearch. In this phase, the system retrieves arguments
using a basic Elasticsearch setup, without additional embeddings due to computational limitations.
For each query, three retrieval strategies are applied. First, the (1) Text Strategy retrieves two
arguments based on textual similarity to the target idea. Then, depending on whether the
model’s objective is to support or attack the idea, different chains of arguments are retrieved.
If the goal is to attack, the (2) Attack strategy retrieves two attack arguments targeting the idea
(searching by the attack field in Elasticsearch), and the (3) Support strategy then collects two
support arguments for that idea; for each of those, the system further retrieves
one argument that supports the attack associated with each support argument. Conversely, if
the goal is to support the idea, the (3) Support strategy retrieves two supporting arguments
(using the support field in Elasticsearch), along with the (2) Attack strategy that retrieves two
attack arguments directed at the idea; for each attack argument, one supporting argument for its
corresponding attack is retrieved. This layered retrieval process enables the system to construct
a small argument graph that reflects both direct and indirect relations aligned with the model’s
stance.
• Step 3: Argument Selection per Query. Given the large number of arguments retrieved per
query, and to avoid exceeding token limits in the prompts, the system performs a first filtering step.
From the 6 arguments retrieved for each query, it selects the top three, regardless of the strategy
they came from. This filtering is based on the preferences determined in Step 1 (argument type),
and the model is prompted accordingly to choose the most relevant and contextually appropriate
ones.
• Step 4: Final Argument Selection. In this step, the model selects the three best arguments
overall from among those shortlisted in the previous step. It is free to distribute them as it sees
fit—for instance, choosing one argument per query, all from one query, or even using just two if
deemed stronger.
• Step 5: Response Generation. Finally, based on the argument selection made
in Step 4 and the tone, style, and argumentative strategy from Step 1, the system generates
a final response of no more than 60 words.</p>
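The per-query retrieval described in Step 2.5 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our exact implementation: the `search` callable stands in for the Elasticsearch API, and the field names (`text`, `attack`, `support`) and the dict-based return format are hypothetical.

```python
def retrieve_for_query(search, idea):
    """Gather six candidate arguments for one query (sketch).

    `search(field, value, size)` is a stand-in for an Elasticsearch
    call and returns a list of argument dicts.
    """
    candidates = []
    candidates += search("text", idea, 2)     # (1) Text strategy: textual similarity
    candidates += search("attack", idea, 2)   # (2) Attack strategy: arguments attacking the idea
    candidates += search("support", idea, 2)  # (3) Support strategy: arguments supporting the idea
    return candidates
```

In a real system, `search` would issue something like a match query against the argument index; here a stub is enough to show the shape of the six-argument candidate pool that Step 3 then filters.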
        <p>Each decision taken by the model in the process must be accompanied by a brief justification, which
is the only content made visible to the LLM at the subsequent steps that need to use these aspects.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Subtask 2: Evaluation system</title>
        <p>For the second subtask, which is based on developing a system that tries to evaluate response-generation
systems in debates based on four different aspects (quantity, quality, relation, and manner), the organizers
provide data with yes/no answers, making the task similar to binary classification. In addition, each
metric is formulated as a yes/no question to facilitate the evaluation of specific aspects of the response.
For instance, the question associated with the quantity metric is: “Does the response contain at least
one (attack or defense) argument, and at most one of each type of defense and attack?”. For quality,
the question is: “Can the response be deduced from the retrieved arguments?”. The relation metric
asks: “Is the response coherent with the conversation and does it express a contrary stance to the user?”.
Finally, the manner metric is evaluated using the question: “Is the response clear and precise?”. In our
proposed method, we conduct the experiments shown in Figure 2. We include all of them, even if they
are experimental setups, in one figure as a summary of the system, since depending on the metric being
evaluated, one type of system performed better than another. In summary, and as can be seen in the
figure, our system receives in its prompt a description of the task it has to perform and the question it
must answer with yes or no. Now, depending on the strategy applied, this prompt will include examples
or not. For example, in the first strategy with ZSL, it will not receive any example from the dataset. In
the second strategy, related to FSL, it will receive one example where the answer is yes and another
where the answer is no. Finally, in the third strategy, the approach is somewhat different. The model will first
have to analyze why a person answered yes to the metric question or why they answered no. For that,
the model will receive 1, 3, or 5 examples of each type of label. With the analysis done and the reasons
provided for answering yes or no, this reasoning will be included in a ZSL-style prompt to try to guide
the reasoning of the model.</p>
        <p>It is important to highlight that the questions the models must answer appear in Section 4. Also,
depending on the metric to be evaluated, each submitted system follows a different strategy: Quantity uses the
analysis strategy, where the model receives only one example of each answer type (yes/no); Quality
also uses this analysis strategy, but with five examples of each answer type; the Relation metric
applies a ZSL strategy; and finally, the Manner metric uses the third strategy related to analysis with
just one example per answer type.</p>
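The prompt construction behind the three strategies can be sketched as follows. This is an illustrative assembly under assumptions: the wording of the prompt pieces and the function name `build_prompt` are ours, not the exact prompts from Appendix A.

```python
def build_prompt(task_description, question, strategy, examples=None):
    """Assemble a Subtask 2 prompt (sketch; wording is illustrative).

    `examples` is a list of (text, label) pairs. For FSL1 the positive
    example is presented first; for FSL2 the negative one is.
    """
    parts = [task_description, f"Question: {question}"]
    if strategy in ("FSL1", "FSL2") and examples:
        ordered = sorted(examples, key=lambda e: e[1] != "yes")  # "yes" examples first
        if strategy == "FSL2":
            ordered = ordered[::-1]
        for text, label in ordered:
            parts.append(f"Example: {text}\nAnswer: {label}")
    parts.append("Answer with yes or no.")
    return "\n\n".join(parts)
```

A ZSL prompt contains only the task description and the metric question; the Analyzer strategy would instead inject model-generated reasoning into a ZSL-style prompt, as described above.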
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental setup</title>
      <p>
        3.1. Data
To run our experiments, we used the data provided by the organizers. Among the provided datasets, we
ifnd a database of arguments retrieved from the ClaimRev dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These arguments consisted of an
id linking them to their corresponding ClaimRev_id, a general topic to which the argument belongs, a
list of labels indicating the categories associated with the topic, an argument that is attacked by the
current one, another argument that is supported by it, the text of the argument, a list of references to
back up the argument, and a field indicating whether the argument (support or attack) was originally
part of the ClaimRev dataset or was automatically generated by the organizers.
      </p>
      <p>
        In total, 287,156 arguments are provided, covering 1,522 distinct topics, including prominent themes
such as ‘Politics’, ‘Ethics’, ‘Society’, ‘Religion’, and ‘Philosophy’, with 1,210 diferent associated labels.
These arguments are publicly accessible through the Elasticsearch API. It is important to note that
the dataset also includes embeddings computed using the stella_en_400M_v5 model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. These
embeddings allow participants to perform argument retrieval based on vector similarity. However, due
to computational limitations, we were unable to leverage these embeddings and thus decided not to use
them in our experiments.
      </p>
      <p>
        To simulate debates and provide training data, the organizers selected 100 claims from the
ChangeMyView subreddit and simulated a series of debates with five interaction turns each. Additionally, for
Subtask 2, annotations are provided for each turn in terms of four metrics: quality, quantity, relation,
and manner [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Each annotation consists of a label (‘yes’ or ‘no’) indicating whether the respective
question metric is satisfied (in Section 2.2 these questions are mentioned).
      </p>
      <p>To illustrate the class distribution in this classification task, Table 1 shows the frequency of each label
for each metric in the initial dataset.</p>
      <p>For the generation of the prompts that we are going to use to conduct our experiments in Subtask
2, we removed instances with unknown labels. From the remaining dataset, we selected 20 instances,
equally divided between the labels ‘yes’ and ‘no’. The rest of the dataset (476 instances) is used to
evaluate the proposed strategies.</p>
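The data preparation just described can be sketched as a small splitting routine. This is a hedged illustration: the function name, the instance format, and the use of a seeded random sample are our assumptions, not the exact selection procedure.

```python
import random

def split_for_prompts(instances, n_prompt=20, seed=0):
    """Drop unknown labels, pick a balanced prompt pool, and keep the
    rest for evaluation (sketch of the Subtask 2 data preparation)."""
    known = [x for x in instances if x["label"] in ("yes", "no")]
    rng = random.Random(seed)
    yes = [x for x in known if x["label"] == "yes"]
    no = [x for x in known if x["label"] == "no"]
    # Half the prompt pool from each label, chosen at random.
    prompt_pool = rng.sample(yes, n_prompt // 2) + rng.sample(no, n_prompt // 2)
    chosen = {id(x) for x in prompt_pool}
    eval_set = [x for x in known if id(x) not in chosen]
    return prompt_pool, eval_set
```

With 496 labeled instances this yields the 20-instance prompt pool and 476-instance evaluation set reported above.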
      <sec id="sec-3-1">
        <title>3.2. Experiments and Selected Models</title>
        <p>
          Across both tasks, our goal is to analyze the generalization ability of LLMs, their reasoning
behavior under different prompting strategies, and how instruction-tuned language models perform in
argumentation-related tasks without requiring additional fine-tuning. For this reason and due to
computational constraints, we selected the LLaMA3.1-8B-Instruct model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for both subtasks.
        </p>
        <p>Regarding the proposed experiments, in this section we present a minimal summary of the experiments
described in the previous section (Section 2):
• Subtask 1. For this subtask, only one experiment is proposed. This experiment is based on
a multistep prompting strategy. We divided the task of generating a response that takes into
account diferent retrieved arguments into several small steps, such as analyzing tone and style to
respond to the opponent, reasoning about the strategic approach, selecting the type of arguments,
and determining the argument perspective. Furthermore, the model must formulate various
queries to retrieve arguments from Elasticsearch and finally generate a response that incorporates
all aspects evaluated throughout the steps. With this experiment, we aim to analyze the model’s
ability to generate coherent, contextually appropriate, and argumentatively structured responses
with minimal supervision.
• Subtask 2. The experiments for this task focus on different prompting strategies, as we aim
to evaluate the model’s knowledge and its capacity to reason based on a few examples, using
the model in its base form without requiring task-specific fine-tuning. The following prompting
strategies are proposed:
– Zero-shot Learning (ZSL): The task is explained to the model, and it is presented with
a question to which it must respond with “yes” or “no”, in order to evaluate the dialogue
using the selected metric. The model is then expected to directly answer the question. This
experiment serves as the baseline, where the key factor under evaluation is the model’s
prior knowledge about the task.
– Few-shot Learning (FSL): This strategy is similar to the previous one, except that the
model is provided with two examples: one where the answer to the metric-related question
is “yes”, and another where it is “no”. The objective of this experiment is to analyze whether
the model, after observing the examples, can perform better classifications. In this case we
try two variants of FSL: FSL1, where the positive example appears first in the prompt, and
FSL2, where the negative example appears first, in order to understand whether the order of
the presented examples affects the model’s classification.
– Analyzer Strategy: This strategy is divided into two stages. In the first stage, the model is
given a set of examples where the answer to the metric-related question is “yes”, and an
equal number of examples where the answer is “no”. The number of examples may vary (1,
3, or 5). After reviewing the examples, the model is asked to explain why someone would
answer “yes” based on the positive examples, and similarly for “no” based on the negative
ones. These explanations must be provided in a generalized form. In the second stage, the
metric-related question is posed again, and the model is asked to answer with “yes” or “no”,
taking into account the reasoning it generated in the previous step. This strategy aims
to analyze whether the reasoning of the model is useful in guiding its final response, and
to determine the optimal number of examples required to support coherent and helpful
reasoning.</p>
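The two-stage Analyzer strategy can be sketched as a pair of calls to the model. This is an illustration under assumptions: `llm` is any prompt-to-text callable, and the prompt wording is ours, not the exact prompts from Appendix A.

```python
def analyzer_classify(llm, question, pos_examples, neg_examples, target):
    """Two-stage Analyzer strategy (sketch).

    Stage 1 asks the model to generalise reasons for each label from
    balanced examples; stage 2 reuses that reasoning in a ZSL-style
    prompt to label the target text.
    """
    stage1 = (
        f"Question: {question}\n"
        + "\n".join(f"Answered yes: {e}" for e in pos_examples)
        + "\n"
        + "\n".join(f"Answered no: {e}" for e in neg_examples)
        + "\nExplain, in general terms, why someone would answer yes, "
          "and why someone would answer no."
    )
    reasons = llm(stage1)
    stage2 = (
        f"Question: {question}\nReasoning to consider:\n{reasons}\n"
        f"Text: {target}\nAnswer with yes or no."
    )
    return llm(stage2).strip().lower()
```

The number of examples per label (1, 3, or 5) simply changes the length of `pos_examples` and `neg_examples`.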
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        In this section, we present the results of each subtask. Specifically, it includes the outcomes of our
experiments conducted during the development phase (Subsection 4.1) and the results obtained by the
systems that were ultimately submitted to the TIRA.io platform [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for the final evaluation (Subsection
4.2).
      </p>
      <p>
        As previously mentioned, this task is divided into two subtasks: the first focuses on generating
responses to argue against a simulated debate partner, and the second aims to evaluate the systems
developed for Subtask 1. To assess Subtask 1, the organizers proposed a manual evaluation carried out
by human annotators based on the evaluation metrics defined for Subtask 2 (as described in Section 2.2).
For Subtask 2, which is focused on the binary classification of whether a generated response and its
corresponding argument meet predefined criteria, the organizers chose to evaluate the systems using
standard binary classification metrics: macro-precision, macro-recall, and macro-F1 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
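Since evaluation rests on macro-averaged metrics, a self-contained sketch of macro-F1 for the binary yes/no setting may be useful; the function name and signature are ours.

```python
def macro_f1(gold, pred, labels=("yes", "no")):
    """Macro-averaged F1: the unweighted mean of per-label F1 scores."""
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because each label contributes equally regardless of frequency, macro-F1 penalises systems that ignore the minority class, which matters given the imbalanced label distribution noted in Section 3.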
      <sec id="sec-4-1">
        <title>4.1. Development Phase</title>
        <p>During the development phase for Subtask 1, we focused on assessing how the model performed with
the provided prompt through manual review. This was necessary because the task is difficult to evaluate
using a single, exact metric that captures all relevant aspects as a human would. When we observed
that the model consistently failed to generate appropriate responses—across multiple systems—we
interpreted this as an indication that it was not properly understanding the assigned task. In those
cases, we refined the prompt accordingly. This process was carried out iteratively until we obtained a
system capable of generating coherent texts that integrated the retrieved arguments in alignment with
the human annotator’s perspective.</p>
        <p>Table 2 reports the results of the developed systems for Subtask 2 across four categories (Relation,
Quality, Manner, and Quantity) with macro-F1 selected as the primary metric due to its balanced
reflection of both precision and recall, especially relevant in imbalanced datasets.</p>
        <p>The results highlight the varying effectiveness of different prompting strategies in a base,
non-fine-tuned language model. Some key findings include:
• Zero-shot Learning Strategy (ZSL, baseline). It achieved the highest F1-score for the Relation
category (F1: 0.515), indicating a relatively strong inherent understanding of this aspect without
prior examples. However, performance on other categories—particularly Quantity (F1: 0.266)—was
considerably weaker. This suggests that while foundational knowledge exists, its application
across diferent communicative aspects remains inconsistent without additional guidance.
• Few-shot Learning (FSL, with two examples). Two FSL variants were tested: FSL1 (positive
example first) and FSL2 (negative example first). Overall, FSL did not significantly outperform ZSL
or the Analyzer Strategy. For example, in the Relation aspect, FSL1 and FSL2 yielded F1-scores of
0.404 and 0.360, respectively—both trailing behind ZSL. These findings suggest that providing
only a few direct examples may be insufficient for improving classification accuracy in complex
reasoning tasks. Notably, FSL1 consistently outperformed FSL2, underscoring the influence of
example ordering.
• Analyzer Strategy: This approach produced the most promising results in several categories. For
instance, Analysis5 achieved the highest F1-score for Quality (F1: 0.575), while Analysis1 led in
Manner (F1: 0.497) and Quantity (F1: 0.389). These outcomes suggest that prompting the model to
explicitly reason—especially with a greater number of guiding examples—enhances classification
performance. The comparison across Analysis1, Analysis3, and Analysis5 also indicates that the
optimal number of examples for effective reasoning is task-specific. Interestingly, ZSL maintained
superior performance in the Relation category, implying that additional reasoning may not
always be necessary for certain well-understood communicative principles. This also points to
the importance of example selection; randomly chosen examples may not sufficiently support
high-quality reasoning.</p>
        <p>These findings emphasize the critical role of prompt design in leveraging pre-trained language
models for classification tasks. While zero-shot approaches provide a solid baseline for aspects like
Relation, strategies that incorporate structured reasoning, such as the Analyzer Strategy, yield significant
performance gains in more nuanced aspects such as Quality, Manner, and Quantity. In contrast, Few-shot
Learning with limited examples shows limited impact, suggesting that mere exposure is less effective
than guided reasoning for complex evaluative tasks.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Phase</title>
        <p>(0.70 vs. 0.35), Relation (0.86 vs. 0.32), and Manner (0.59 vs. 0.80), demonstrating greater informativeness
and clarity. The low score in the “Quality” metric (0.02 vs. 1.00) may stem from the model’s tendency
to significantly alter the style of retrieved arguments during generation, reducing their traceability.
Notably, unlike most top-performing teams, which relied on large-scale commercial models such as GPT-4.1,
Gemini 2.5, or Claude Opus, our system is based on an open-source model (LLaMA3.1-8B-Instruct),
showing that a competitive and resource-efficient alternative is possible by analyzing different aspects
of the debate context.</p>
        <p>For the second subtask, in addition to the proposed methods, we implemented a strategy named
BEST, which applies the best-performing method per metric based on development results: ZSL for
Relation, Analysis5 for Quality, and Analysis1 for Manner and Quantity. Results are shown in Table 4.
Our systems achieved competitive performance using only the open-source LLaMA3.1-8B-Instruct
model. The best overall result came from the Analyzer strategy with five examples (Analysis5), reaching
a macro-F1 of 0.56, suggesting that guided reasoning with balanced examples improves consistency. The
“Best” configuration also performed well (F1 of 0.55), allowing adaptation to each dimension’s specific
needs. In general, Analyzer outperformed other approaches, especially with more examples, likely
due to the intermediate reasoning phase. ZSL served as a reasonable baseline (F1 of 0.52), while FSL
strategies underperformed (FSL1: 0.39, FSL2: 0.35), indicating that example inclusion alone is insufficient
without reflection. No significant differences were observed between FSL1 and FSL2, though models
tended to perform slightly better when the positive example appeared first. Overall, our results show
that structured and adaptive reasoning strategies can yield solid performance even with non-commercial
models.</p>
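The BEST configuration amounts to a per-metric dispatch table over the development-phase winners. A minimal sketch (the table mirrors the text above; the function name is ours):

```python
# Development-phase winner per metric, as reported above.
BEST_STRATEGY = {
    "relation": "ZSL",
    "quality": "Analysis5",
    "manner": "Analysis1",
    "quantity": "Analysis1",
}

def best_strategy_for(metric):
    """Return the strategy the BEST configuration applies for a metric."""
    return BEST_STRATEGY[metric.lower()]
```

This per-dimension routing is what lets the BEST configuration adapt to the specific needs of each evaluation aspect.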
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work presents the system developed by the SINAI team for the shared task Retrieval-Augmented
Debating, which includes two subtasks: the first involves building an automatic system to generate
responses that argue against a simulated debate partner by retrieving arguments from a database; the
second focuses on evaluating such responses across four metrics: Quantity (informativeness), Quality
(truthfulness), Relation (relevance to the conversation), and Manner (clarity). To tackle both tasks, we
used the open-source model LLaMA3.1-8B-Instruct, chosen for its accessibility and lower computational
cost. For Subtask 1, our system was based on a detailed analysis of debate elements such as tone,
topic, and perspective, guiding the retrieval and generation process. This approach led us to a 4th
place ranking, behind the task’s baseline and proprietary state-of-the-art models. Our system showed
strong performance in generating informative, relevant, and clear responses, although there is room for
improvement in truthfulness, as the generated answers did not always clearly align with the retrieved
content. This highlights a promising direction for future work, particularly in reinforcing factual
consistency. For Subtask 2, we explored various prompting strategies without fine-tuning, aiming to
assess the reasoning capabilities of the model. Our open-source systems achieved competitive results
compared to proprietary approaches, particularly excelling in the Manner metric, where we led in
precision. Despite limitations compared to large-scale models like GPT-4 or Gemini, we remained
competitive in Relation and Quantity. The use of both Zero-Shot and Few-Shot Learning strategies
underscores the exploratory and adaptive nature of our approach. Overall, the results demonstrate
that, with open models and thoughtful design, it is possible to effectively address complex semantic
evaluation tasks.</p>
      <p>
After all that has been observed throughout this work, considerable work still lies
ahead, which we plan to address gradually. Regarding the system that automatically
generates responses in an argumentative way, we aim to carry out a thorough analysis to determine
whether the model tends to adopt certain tones and styles, and whether these variations influence the
selection of argument types. In the argument retrieval phase, we are not only interested in continuing
with arguments from Kialo, but also in exploring other databases, argument types, or retrieval methods
based on LLMs. In this regard, we propose a system based on agents that has demonstrated good results
in different tasks [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], where by providing a search engine with internet access, the model could
autonomously respond to questions and analyse the type of arguments it retrieves, evaluating the
performance of such dynamically sourced systems [
        <xref ref-type="bibr" rid="ref20">20</xref>
]. As for the second subtask, as future work
we envisage choosing the examples we show in the prompt more rigorously, using the knowledge
of a human expert rather than selecting them randomly. In addition, although we deliberately avoided
fine-tuning in these experiments to evaluate the performance of the base model, we believe that a slight
adaptation to the task could bring noticeable improvements. Moreover, in both subtasks, we do not
intend to limit ourselves to models such as LLaMA, but rather to explore others with a higher
number of parameters or different architectures such as Mistral [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], Qwen [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and models oriented to
argumentation such as Veritas-12B [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by Project CONSENSO (PID2021-122263OB-C21), Project
MODERATES (TED2021-130145B-I00), and Project SocialTox (PDC2022-133146-C21) funded by
MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o and DeepL for grammar, spelling,
and translation checking. After using these services, the authors reviewed and edited the content as needed
and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Used prompts</title>
      <sec id="sec-8-1">
        <title>A.1. Subtask 1: Development of debate system</title>
        <p>You are an expert debater and you are having a dialogue with another user. Your task is to analyze with
what TONE and STYLE you should answer your opponent. Keep in mind that the idea you are defending
is contrary to the one he is showing. It is also very important that you define the APPROACH STRATEGY
to be followed, as well as the TYPE OF ARGUMENTS you should retrieve or the PERSPECTIVE from
which the argumentation should be elaborated. The different options of each category are:
TONE:
- Neutral: Based on facts and logic (without using emotions or making personal judgments)
- Respectful: acknowledges the other person’s point of view without disqualifying it
- Assertive/Critical: defends a position clearly and firmly. Without being aggressive
- Inquisitive: tries to ask questions to invite reflection or question a statement.</p>
        <p>STYLE:
- Academic: follows a formal, structured style, supported by sources. The language to be used is usually
technical and includes many quotations and data.
- Colloquial: It follows an informal style, with a natural and close language.
- Socratic: makes use of questions that invite reflection or lead to a contradiction.
- Logical: based on the rational structure of the argument (usually including a series of premises,
conclusions, syllogisms and formal deductions).
- Rhetorical: appeals to emotions
APPROACH STRATEGY: refers to how the counterargument(s) should address the original text
- Comprehensive: addresses all points made
- Focused: focus on those arguments that are weaker.
- Principled: questions the underlying principles or assumptions of the original message.
- Free (explore other approaches not mentioned in this text).</p>
        <p>TYPE OF ARGUMENTS:
- Logical (based on reason and facts)
- Ethical (values, rights)
- Emotional (empathy, human impact)
- Analogical (establishing examples and similarities)
PERSPECTIVE:
- Economic
- Ethical/Moral
- Scientific
- Practical/pragmatic
- Historical
- Cultural
The format of your response has to consist of a unique JSON with exactly these keys:
- tone: a string with the selected TONE (Neutral, Respectful, Assertive/Critical, or Inquisitive).
- justification_tone: an explanation about why you have selected that tone.
- style: a string with the selected STYLE (Academic, Colloquial, Socratic, Logical or Rhetorical).
- justification_style: an explanation about why you have selected that style.
- approach_strategy: a string with the selected APPROACH STRATEGY (Comprehensive, Focused,
Principled or Free).
- justification_approach: an explanation about why you have selected that approach strategy.
- type_of_arguments: a string with the selected TYPE OF ARGUMENTS (Logical, Ethical, Emotional, or
Analogical).
- justification_type: an explanation about why you have selected that type of arguments.
- perspective: a string with the selected PERSPECTIVE (Economic, Ethical/Moral, Scientific,
Practical/pragmatic, Historical, or Cultural).
- justification_perspective: an explanation about why you have selected that perspective.
Specific information about the established debate is included below. Remember that you have to analyze
the way you should answer and oppose to your opponent’s last message:
{debate_dialogue}
You are an expert debater and you are having a dialogue with another user. Your task is to analyze how
to find arguments or counter-arguments to respond to your opponent’s last message. Keep in mind that the
idea you are defending is contrary to the one he is showing.</p>
        <p>Please respond with only a JSON object that contains as keys the word SEARCH followed by a number
with the step to search (SEARCH_1, SEARCH_2, . . . ). Each key has as value a dictionary with the following
keys:
- opponent_idea: a string with the idea shown by your opponent that you want to use to search the
arguments; please try to be specific and include the complete opponent idea. You must include the specific
subject in your sentence, removing unspecific subjects like its, his or her.
- field_to_look: a string with two possible values (SUPPORT or ATTACK). For example, you can identify the
opponent idea and look for arguments that support his idea and later look for arguments that attack the
supporting arguments, or you can look for arguments that attack the idea of your opponent, or instead
you may prefer to find arguments that attack your opponent’s idea and retrieve arguments that support your
main idea.
- justification: a string with a justification of why you have selected that decision.
It is important that you make at most 3 different searches. Specific information about the
established debate is included below.
{debate_dialogue}
Take into account that previously you decided that the arguments or counterarguments that you are
looking for should address your opponent’s message with a {FirstStep[approach_strategy]} strategy that
consists of {description[FirstStep[approach_strategy]]}. The justification of your previous decision is:
{FirstStep[justification_approach]}
You are an expert debater and you are having a dialogue with another user. Your task is to select the 3
best arguments in order to generate a reply to your opponent’s last message in a future step.
Specific information about the established debate is included below.
{debate_dialogue}
The information you are looking for is listed below:
{SecondStepResponse_Query_X}
We search for arguments using 3 different strategies:
- ATTACK STRATEGY is for arguments that attack the opponent_idea.
- SUPPORT STRATEGY is looking for arguments that attack some arguments that support the
opponent_idea.
- TEXT STRATEGY is looking at the similarity between your opponent_idea and the text of the retrieved
argument
The information retrieved in each strategy is shown next:
{RetrievedArguments}
To select the best arguments take into consideration that you want to select
{FirstStep[ type_of_arguments ]} arguments. The justification for selecting these types of arguments
appears next: {FirstStep[ justification_type ]}
You are an expert debater and you are having a dialogue with another user. Your task is to select a
maximum of 3 arguments to answer the opponent’s last message. Keep in mind that the idea you are
defending is contrary to the one he is showing.</p>
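The TEXT strategy above ranks stored arguments by their textual similarity to the opponent_idea. As a toy illustration only (the system's actual embedding-based retriever is not reproduced here; the mini argument database and function names are invented), a bag-of-words cosine-similarity sketch:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def text_strategy(opponent_idea: str, arguments: list[str], k: int = 3) -> list[str]:
    """Return the k stored arguments most similar to the opponent idea."""
    query = Counter(opponent_idea.lower().split())
    return sorted(arguments,
                  key=lambda arg: cosine(query, Counter(arg.lower().split())),
                  reverse=True)[:k]

# Invented mini argument database for illustration.
args_db = [
    "nuclear power plants have strong safety records",
    "renewable energy is cheaper than nuclear power",
    "coal mining damages local ecosystems",
]
top = text_strategy("nuclear power is unsafe", args_db, k=2)
print(top[0])  # → renewable energy is cheaper than nuclear power
```

A dense-embedding retriever would replace the `Counter`-based vectors with sentence embeddings while keeping the same rank-by-similarity shape.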
        <p>Specific information about the established debate is included below.
{debate_dialogue}
Previously, some arguments were retrieved based on different aspects. The following information contains
the aspect used to search for arguments and the selected arguments for each criterion.
{SEARCH_1: {aspect: {SecondStepResponse[SEARCH_1][field_to_look]} to</p>
        <p>{SecondStepResponse[SEARCH_1][opponent_idea]},
retrieved_arguments: {ThirdStepQuery1[arguments]},
justification: {ThirdStepQuery1[justification]}},
{SEARCH_2: [...]},
{SEARCH_3: [...]}}
Now respond in a JSON format with the keys:
- aspect_argument_1: a string with the exact text of the aspect of the first selected argument.
- retrieved_argument_1: a string with the exact text of the first selected argument.
- justification_argument_1: a string with the justification of the selection of the first selected argument.
- aspect_argument_2: a string with the exact text of the aspect of the second selected argument.
- retrieved_argument_2: a string with the exact text of the second selected argument.
- justification_argument_2: a string with the justification of the selection of the second selected
argument.
- aspect_argument_3: a string with the exact text of the aspect of the third selected argument.
- retrieved_argument_3: a string with the exact text of the third selected argument.
- justification_argument_3: a string with the justification of the selection of the third selected argument.
You are an expert debater and you are having a dialogue with another user. Your task is to generate a
response to your last opponent message. Keep in mind that the idea you are defending is contrary to the
one he is showing.</p>
        <p>Specific information about the established debate is included below.
{debate_dialogue}
Your answer should take into account the following parameters: { tone: {firstStep[ tone ]},
justification_tone: {firstStep[ justification_tone ]},
style: {firstStep[ style ]},
justification_style: {firstStep[ justification_style ]},
perspective: {firstStep[ perspective ]},
justification_perspective: {firstStep[ justification_perspective ]},
}
Retrieved arguments to elaborate the answer: {responseFourhtStep}
Please respond in JSON format containing as keys the words:
- response: a string with the response to your last opponent message. This answer has to be of a
maximum of 60 words.
- justification: a string with the explanation about how you generate your answer.
- arguments: a list of the string with the exact retrieved arguments that you used to elaborate the final
answer and that have relation with your response.</p>
        <p>Moreover, take into account that your answer must meet the following criteria:
- Quantity: be informative and provide enough information for the opponent to understand your
position. Does the response contain at least one (attack or defense) argument, and at most one of each
type of defense and attack?
- Quality: be truthful. Can the response be deduced from the retrieved arguments?
- Relation: be relevant. Is the response coherent with the conversation and does it express a contrary
stance to the user?
- Manner: be clear. Is the response clear and precise?
- Length: be of at most 60 words</p>
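The five prompts above are chained sequentially: each step's JSON reply fills the placeholders of the next prompt. A minimal sketch of this orchestration, where `generate` is a stub standing in for the LLaMA3.1-8B-Instruct call and the shortened templates are illustrative, not the system's actual code:

```python
import json

def generate(prompt: str) -> str:
    """Stub for the model call; the real system queries LLaMA3.1-8B-Instruct.
    Here it returns a canned JSON answer so the sketch runs end to end."""
    return json.dumps({"tone": "Neutral", "approach_strategy": "Focused"})

def run_step(template: str, **slots) -> dict:
    """Fill a prompt template, query the model, and parse its JSON reply."""
    raw = generate(template.format(**slots))
    return json.loads(raw)  # each prompt asks the model to emit one JSON object

dialogue = "User: Nuclear power is unsafe."

# Step 1: decide tone, style, and approach strategy for this dialogue.
step1 = run_step("Analyse the debate and answer as JSON.\nDialogue:\n{debate_dialogue}",
                 debate_dialogue=dialogue)

# Step 2: the chosen strategy is injected into the retrieval-query prompt,
# mirroring the {FirstStep[approach_strategy]} placeholder above.
step2_prompt = ("Find arguments with a {approach} strategy.\n"
                "Dialogue:\n{debate_dialogue}").format(
                    approach=step1["approach_strategy"], debate_dialogue=dialogue)

print(step1["tone"], "|", step2_prompt.splitlines()[0])
```

In the real pipeline, steps 3–5 continue the same pattern, threading the retrieved arguments and the step-1 tone/style decisions into the final response prompt.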
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Subtask 2: Evaluation of debate systems</title>
        <p>Prompt ZSL (Subtask 2)
You are an expert in evaluating and analyzing the quality of the answers and arguments used
during various rounds of debate. Your task is to evaluate the last answer of the system based on the
{evaluation_metric} metric:
{evaluation_question_metric}
Below you have the conversation and the arguments retrieved:
{debate_dialogue}
Please respond only with yes or no to the question about
{evaluation_metric}( {evaluation_metric_question} )</p>
        <p>Analysis Strategy Prompt for Step 2 (Subtask 2)
You are an expert in evaluating and analyzing the quality of the answers and arguments used during
various rounds of debate. Your task is to evaluate the last answer of the system based on the
{evaluation_metric} metric:
{evaluation_question_metric}
Now I will give you some reasons to answer with yes or no.
- Respond with yes if {answer_model_step1[yes]}.
- Respond with no if {answer_model_step1[no]}.</p>
        <p>Now is your turn. Below you have the information of the debate:
{debate_dialogue}
Please respond only with yes or no to the question about
( {evaluation_question_metric} )</p>
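The Analyzer strategy thus makes two model calls per metric: the first elicits reasons for answering yes or no, and the second injects those reasons before requesting the final yes/no verdict. A runnable sketch with a stubbed model; `generate` and `analyzer_evaluate` are hypothetical names, not the system's actual code:

```python
import json

def generate(prompt: str) -> str:
    """Stub for LLaMA3.1-8B-Instruct: step 1 returns reasons as JSON,
    step 2 returns only the final yes/no verdict."""
    if "give reasons" in prompt:
        return json.dumps({"yes": "the reply is clear and precise",
                           "no": "the reply is ambiguous or verbose"})
    return "yes"

def analyzer_evaluate(dialogue: str, metric: str, question: str) -> str:
    # Step 1: ask the model under which conditions each verdict applies.
    reasons = json.loads(generate(
        f"For the {metric} metric ({question}), give reasons to answer yes or no."))
    # Step 2: feed those reasons back and request only yes or no.
    prompt2 = (f"Evaluate the last system answer on {metric}.\n"
               f"Respond yes if {reasons['yes']}. Respond no if {reasons['no']}.\n"
               f"Debate:\n{dialogue}\nAnswer only yes or no.")
    return generate(prompt2).strip().lower()

verdict = analyzer_evaluate("User: ... System: ...", "Manner",
                            "Is the response clear and precise?")
print(verdict)  # → yes
```

The intermediate reasoning phase is what distinguishes this strategy from plain Few-shot prompting, consistent with the result that Analyzer outperformed FSL in our experiments.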
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Fanton, H. Bonaldi, S. S. Tekiroğlu, M. Guerini, Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2021.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M.-E. Vallecillo-Rodríguez, M.-V. Cantero-Romero, I. Cabrera-De-Castro, A. Montejo-Ráez, M.-T. Martín-Valdivia, CONAN-MT-SP: A Spanish corpus for counternarrative using GPT models, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italy, 2024, pp. 3677-3688. URL: https://aclanthology.org/2024.lrec-main.326.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] H. Bonaldi, M. E. Vallecillo-Rodríguez, I. Zubiaga, A. Montejo-Raez, A. Soroa, M.-T. Martín-Valdivia, M. Guerini, R. Agerri, The first workshop on multilingual counterspeech generation at COLING 2025: Overview of the shared task, in: H. Bonaldi, M. E. Vallecillo-Rodríguez, I. Zubiaga, A. Montejo-Ráez, A. Soroa, M. T. Martín-Valdivia, M. Guerini, R. Agerri (Eds.), Proceedings of the First Workshop on Multilingual Counterspeech Generation, Association for Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 92-107. URL: https://aclanthology.org/2025.mcg-1.10/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Y.-L. Chung, E. Kuzmenko, S. S. Tekiroglu, M. Guerini, CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2819-2829. URL: https://www.aclweb.org/anthology/P19-1271. doi:10.18653/v1/P19-1271.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] G. Skitalinskaya, J. Klaff, H. Wachsmuth, Learning from revisions: Quality assessment of claims in argumentation at scale, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 1718-1729. URL: https://aclanthology.org/2021.eacl-main.147.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Gohsen, N. Mirzakhmedova, H. Scells, M. Aliannejadi, M. Fröbe, J. Kiesel, B. Stein, Touché 25 RAD claims, 2025. URL: https://doi.org/10.5281/zenodo.15401620. doi:10.5281/zenodo.15401620.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] AI@Meta, Introducing Llama 3.1: Our most capable models to date, 2024. URL: https://ai.meta.com/blog/meta-llama-3-1/.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Y. Fu, H. Peng, A. Sabharwal, P. Clark, T. Khot, Complexity-based prompting for multi-step reasoning, in: The Eleventh International Conference on Learning Representations, 2022.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] F. H. van Eemeren, In what sense do modern argumentation theories relate to Aristotle? The case of pragma-dialectics, Argumentation 27 (2013) 49-70. URL: https://doi.org/10.1007/s10503-012-9277-4. doi:10.1007/s10503-012-9277-4.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. A. Herrick, The History and Theory of Rhetoric: An Introduction, 7th ed., Routledge, 2020. URL: https://doi.org/10.4324/9781003000198. doi:10.4324/9781003000198.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] K. Hogan, The Psychology of Persuasion: How to Persuade Others to Your Way of Thinking, Pelican Publishing, 2010. URL: https://books.google.es/books?id=FAHzLM-pY7cC.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. W. Simons, Persuasion in Society, 2nd ed., Routledge, 2011. URL: https://doi.org/10.4324/9780203933039. doi:10.4324/9780203933039.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] R. Morado, Funciones básicas del discurso argumentativo [Basic functions of argumentative discourse] (????). URL: https://revistas.uam.es/ria/article/view/8195. doi:10.15366/ria2013.6.007, number: 6.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] W. Wang, V. W. Zheng, H. Yu, C. Miao, A survey of zero-shot learning: Settings, methods, and applications, ACM Trans. Intell. Syst. Technol. 10 (2019). URL: https://doi.org/10.1145/3293318. doi:10.1145/3293318.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Kiesel, Ç. Çöltekin, M. Gohsen, S. Heineking, M. Heinrich, M. Fröbe, T. Hagen, M. Aliannejadi, T. Erjavec, M. Hagen, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, H. Scells, I. Zelch, M. Potthast, B. Stein, Overview of Touché 2025: Argumentation Systems, in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. García Seco de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 16th International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2025.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Jasper and Stella: distillation of SOTA embedding models</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2412.19048. arXiv:2412.19048.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elstner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Joho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caputo</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval. 45th European Conference on IR Research (ECIR</source>
          <year>2023</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          . doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sokolova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lapalme</surname>
          </string-name>
          ,
          <article-title>A systematic analysis of performance measures for classification tasks</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>45</volume>
          (
          <year>2009</year>
          )
          <fpage>427</fpage>
          -
          <lpage>437</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0306457309000259. doi:10.1016/j.ipm.2009.03.002.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K. C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Comparative analysis of open-source frameworks for agentic AI systems: Capabilities, design philosophies, and development experiences</article-title>
          ,
          <source>World Scientific Annual Review of Fintech</source>
          0 (0). URL: https://doi.org/10.1142/S2811004824500015. doi:10.1142/S2811004824500015.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yeginbergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oronoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agerri</surname>
          </string-name>
          ,
          <article-title>Dynamic knowledge integration for evidence-driven counterargument generation with large language models</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2503.05328. arXiv:2503.05328.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <article-title>mistralai/Mistral-Small-3.1-24B-Instruct-2503 · Hugging Face</article-title>
          . URL: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <article-title>Qwen/Qwen3-30B-A3B-GGUF · Hugging Face</article-title>
          . URL: https://huggingface.co/Qwen/Qwen3-30B-A3B-GGUF.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <article-title>soob3123/Veritas-12B · Hugging Face</article-title>
          . URL: https://huggingface.co/soob3123/Veritas-12B.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>