<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Small LLMs for Argument Mining in Education: Argument Component Identification, Classification, and Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucile Favero</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Antonio Pérez-Ortiz</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanja Käser</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nuria Oliver</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Argument mining algorithms analyze the argumentative structure of essays, making them a valuable tool for enhancing education by providing targeted feedback on the students' argumentation skills. While current methods often use encoder or encoder-decoder deep learning architectures, decoder-only models remain largely unexplored, offering a promising research direction. This paper proposes leveraging open-source, small Large Language Models (LLMs) for argument mining through few-shot prompting and fine-tuning. The small size of these models and their open-source nature ensure accessibility, privacy, and computational efficiency, enabling schools and educators to adopt and deploy them locally. Specifically, we perform three tasks: segmentation of student essays into arguments, classification of the arguments by type, and assessment of their quality. We empirically evaluate the models on the “Feedback Prize - Predicting Effective Arguments” dataset of grade 6-12 student essays and demonstrate how fine-tuned small LLMs outperform baseline methods in segmenting the essays and determining the argument types, while few-shot prompting yields comparable performance to that of the baselines in assessing quality. This work highlights the educational potential of small, open-source LLMs to provide real-time, personalized feedback, enhancing independent learning and writing skills while ensuring low computational cost and privacy.</p>
      </abstract>
      <kwd-group>
        <kwd>argument mining</kwd>
        <kwd>educational feedback</kwd>
        <kwd>small LLMs</kwd>
        <kwd>accessibility</kwd>
        <kwd>privacy</kwd>
        <kwd>computational efficiency</kwd>
        <kwd>deployment</kwd>
        <kwd>argument segmentation</kwd>
        <kwd>classification</kwd>
        <kwd>quality assessment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Writing well-structured essays can be challenging for students, as doing so requires not only quality argument
components but also cohesive connections between them [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. However, many students struggle to
meet these requirements, often due to a lack of clear guidance on effective argumentation. High
teacher-student ratios exacerbate this issue, limiting the individualized support that teachers can provide. To
address this challenge, automatic argument mining has emerged as a promising solution, aiming to
analyze the argumentative structure of essays and deliver targeted feedback to help students strengthen
their arguments and overall essay structure [
        <xref ref-type="bibr" rid="ref24 ref5">24, 5</xref>
        ].
      </p>
      <p>
        Effective argument mining entails several subtasks, such as segmenting the essay into distinct
argument components, classifying their type, assessing their quality and establishing relationships
between them [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. These tasks can be performed sequentially or in parallel. State-of-the-art approaches
in argument mining rely on encoder or encoder-decoder deep neural network-based architectures
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, to date, decoder-only models remain underexplored for certain subtasks, suggesting a
promising area for future exploration [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ].
      </p>
      <p>
        While high-performing argument mining methods have potential, they are often difficult to access and
scale, especially in educational settings [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Automated Essay Scoring (AES) systems offer an alternative
solution for grading essays, focusing on overall evaluation rather than on assessing individual arguments
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. However, limited research has explored the integration of argument mining with argument quality
assessment. This work aims to bridge this gap by providing detailed feedback that helps students
critically analyze their arguments and enhance their writing skills [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>We propose leveraging open-source (i.e., freely available with at least open weights), small Large Language Models (LLMs) to perform a complete
argument mining pipeline (from segmenting the text into arguments to classifying their type and
assessing their quality) in an educational context. By means of few-shot prompting and fine-tuning, these
models can execute these tasks locally on the student’s laptop, ensuring accessibility and maintaining
computational efficiency. Figure 1 depicts the proposed method’s pipeline, outlining each stage of the
process.</p>
      <p>Contributions In this paper, we make several contributions to this domain. First, we address the
gap in leveraging small, open-source LLMs for argument mining, combining argument segmentation,
type classification, and quality assessment. Second, we propose a computationally efficient and
privacy-preserving approach, enabling local analysis on standard devices through fine-tuning and few-shot
prompting of the LLMs. Finally, by evaluating our approach on a benchmark dataset of student essays,
we demonstrate its ability to deliver actionable feedback on a local computer, fostering improved
writing skills for students in grades 6-12 while preserving privacy. Our method advances argument mining
in resource-constrained educational settings and highlights the transformative potential of LLMs in
personalized education.</p>
      <p>This paper is organized as follows: Section 2 reviews the most relevant literature, providing the
background and context for our research. In Section 3, we describe our methodology. Section 4 presents
and analyses our experimental results. Finally, Section 5 provides a discussion of the findings, followed
by a conclusion and an outline of the limitations in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Argument mining Argument mining is a complex field that aims to identify, classify, and analyze
argumentative structures within text [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], drawing inspiration from frameworks such as Toulmin’s
model of argumentation [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. Argument mining involves numerous subtasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], including argument
detection, classification, assessment, and relation prediction, making end-to-end solutions particularly
challenging [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Thus, despite its importance, few studies tackle the full argument mining pipeline due
to its complexity and methodological diversity [
        <xref ref-type="bibr" rid="ref28 ref3 ref6">6, 3, 28</xref>
        ].
      </p>
      <p>
        State-of-the-art methods in argument mining typically rely on deep neural networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Recently,
advancements in Large Language Models (LLMs) have pushed the field forward. For instance, T5 [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]
has been applied effectively to argument mining tasks [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], while models like Longformer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and
BERT-based approaches [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] have demonstrated competitive performance across various subtasks. More
complex systems provide end-to-end solutions by combining models like BART [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] with prompting
and graph-based approaches [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], or by leveraging graph prefix tuning to enhance discourse-level
understanding [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Recently, Gorur et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] demonstrated that prompt-tuned, open-source models
like Llama-2 [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] and Mixtral [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] can outperform state-of-the-art RoBERTa-based baselines [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] in
identifying agreement and disagreement relations among arguments. However, to the best of our
knowledge, no research to date has explored the use of open-source, small LLMs for the combined tasks
of argument classification and quality assessment. In this paper, we aim to fill this gap.
      </p>
      <p>
        Educational multi-task argument mining Educational multi-task argument mining focuses on
extracting, classifying, and evaluating arguments in student essays—a challenging problem due to the
noisy, resource-constrained nature of student writing [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Beyond the tasks of argument segmentation
and classification, assessing the quality of arguments is essential for evaluating their persuasiveness
and coherence [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. Existing approaches, such as Longformer-based classification methods [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and
graph-based frameworks [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], have made contributions to this area.
      </p>
      <p>
        Providing meaningful feedback from such analyses is particularly impactful in educational contexts.
Actionable feedback enables students and educators to identify strengths and areas for improvement,
with standardized scoring systems serving as valuable tools to guide learning and enhance outcomes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Moreover, incorporating discourse-level features has been shown to improve performance by offering
deeper insights into argument structures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Tasks and models</title>
        <p>Given a human-written essay, the goal is to perform three tasks: (1) segmenting it into arguments;
(2) classifying each argument into one of seven categories—Lead, Position, Claim, Counterclaim,
Rebuttal, Evidence, Concluding Statement; and (3) assessing the quality of each argument using three
levels—Ineffective, Adequate, Effective. Segmentation is conducted first, followed by argument type
classification and quality assessment. These latter tasks can be performed independently or jointly.</p>
        <p>
          We investigate the effectiveness of three open-source, small LLMs, namely Qwen 2.5 7B [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], Llama
3.1 8B [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and Gemma 2 9B [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], for these tasks using both few-shot prompting and fine-tuning. We
compare against a state-of-the-art baseline [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ] and GPT-4o mini [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to shed light on the performance
of open-source vs. closed-source models. According to de facto common capacity frames (see https://huggingface.co/blog/jjokah/small-language-model), models
with up to ∼ 10B parameters are considered “small”. This categorization reflects the current landscape
of LLMs. While 10B parameters were once considered very large, modern LLMs often exceed 100B
parameters, with state-of-the-art commercial models surpassing 500B. In this context, models with
up to 10B parameters are relatively small, allowing us to analyze research-grade LLMs that remain
practically accessible. Appendix A.7 provides additional details on these models, as well as experimental
results for three additional small, open-source LLMs (Llama 3.2 3B, DeepSeek R1 7B, and OLMo 2 7B).
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Few-shot prompting</title>
        <p>
          The first approach applies few-shot prompt-tuning sequentially: first, the LLM is prompted to segment
the essay into arguments; then another prompt is used for argument type classification and/or quality
assessment. The prompts are built by concatenating the following textual elements (a code sketch of this
assembly is shown after the list):
1. Few-shot examples, given in the format depicted by the examples included in Tables 7, 8 and
9 in Appendix A.3, depending on the task. The label of the type of argument or its quality is added
between each argument in the essay. The example essays are extracted from the training split used
in Ding et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and described in Section 3.3.
2. The essay. The segmentation task takes the original essay as input. For argument type classification
and quality assessment, the essay is provided in a pre-segmented format, as illustrated in Table 7 in
Appendix A.3. In this format, each argument is separated by a designated SEP label, which corresponds
to the output of the segmentation task.
3. The query, specifying the role of the LLM and the overall instructions for the task. The detailed
formulations for the segmentation, argument type classification and quality assessment tasks are
provided in Table 5 in Appendix A.2.2.
4. Output requirements. For the argument type classification and argument quality assessment tasks,
the model is asked to generate the output in a specific JSON format, described in the prompt as follows:
“For the given argument component, identify its [type] and/or [quality]. Provide the output as a JSON
object with the key: [TYPE] and/or [QUALITY].” For segmentation, the output is described in the prompt
as follows: "Place &lt;SEP&gt; immediately at the end of each segment. Preserve all original words, spacing, and
order."
5. The specific argument of the essay that the LLM is asked to analyze in the argument type
classification and argument quality assessment tasks.
        </p>
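        <p>To make this construction concrete, the following minimal Python sketch assembles a type-classification prompt from the five elements above and parses the model’s JSON answer. The function names, the #-prefixed section markers, and the exact wording are illustrative assumptions, not the verbatim prompts of the paper (those are listed in Table 5 in Appendix A.2.2).</p>
        <preformat>
import json

SEP = "&lt;SEP&gt;"  # separator marking argument boundaries in the pre-segmented essay


def build_type_prompt(few_shot_examples, segmented_essay, argument):
    """Assemble a few-shot prompt for argument type classification.

    Follows the ordering described above: (1) few-shot examples, (2) the
    pre-segmented essay, (3) the query/instructions, (4) the output
    requirements, and (5) the argument to analyze. Wording is illustrative.
    """
    query = ("You are a strict AI evaluator specializing in detecting the type "
             "of argument components in essays.")
    output_req = ("For the given argument component, identify its type. "
                  "Provide the output as a JSON object with the key: TYPE.")
    return "\n\n".join([
        "\n\n".join(few_shot_examples),   # 1. few-shot examples
        f"#ESSAY: {segmented_essay}",     # 2. essay segmented with SEP labels
        f"#QUERY: {query}",               # 3. role and task instructions
        f"#OUTPUT: {output_req}",         # 4. required JSON output format
        f"#ARGUMENT: {argument}",         # 5. the argument to classify
    ])


def parse_type(raw_output):
    """Extract the predicted label from the model's JSON answer, if well formed."""
    try:
        return json.loads(raw_output)["TYPE"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed output; the caller may retry (see Appendix A.4)
</preformat>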
        <p>Examples of the expected output format and the handling of incorrect outputs are detailed in
Appendix A.4. After segmentation, classifying the entire text at once may cause label mismatches,
misaligning labels with their arguments. To simplify this task for smaller LLMs, we classify one argument
per prompt. However, the full essay is still provided as context, given that the type classification and
quality assessment of each argument depend on the essay’s overall argumentation strategy.</p>
        <p>Also note that the specific structure, content and phrasing of each prompt have been meticulously
designed after an intensive effort of prompt engineering. Any changes to the sequence or omission of
parts significantly affect both the accuracy and the format of the output. An alternative approach to
prompt-tuning the LLM would involve asking it to replace each separator with its corresponding label
(type, quality, or both), as demonstrated in the fine-tuning process described in the next section. With
this method, the output should exactly replicate the original essay, with classified labels inserted at the
appropriate points between arguments. However, this approach did not deliver good performance with
the small LLMs as these models struggle to reproduce the input text word-for-word and often introduce
errors such as adding extra words, omitting parts of the text, or inserting additional separators. For
output examples, see Appendix A.3.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Fine-tuning</title>
        <p>The second approach involves fine-tuning an LLM for the following tasks: (1) argument segmentation;
(2) argument type classification; (3) argument quality assessment; and (4) argument type classification
and quality assessment together. The three small, open-source models were fine-tuned on Google Colab
using an A100 GPU on each of the tasks. To establish a comparative upper bound, we also fine-tuned
GPT-4o mini using OpenAI’s API. Further details on the fine-tuning of this model can be found in
Appendix A.2.3.</p>
        <p>
          Data Fine-tuning requires the use of a dataset. In our study, we used a collection of argumentative
essays written by U.S. students in grades 6-12, annotated by expert raters. The dataset is derived from
the Kaggle competition “Feedback Prize - Predicting Effective Arguments” (https://www.kaggle.com/competitions/feedback-prize-effectiveness/data), which constitutes a subset
of the PERSUADE 2.0 Corpus. This subset includes approximately 6,900 essays with a total of 26,000
argument components, representing just over a quarter of the corpus. The essays were selected to
achieve a balanced distribution of discourse elements across varying levels of effectiveness [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. The
dataset encompasses the seven argument types derived from Toulmin’s argumentation model [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]:
Lead, Position, Claim, Counterclaim, Rebuttal, Evidence, and Concluding Statement; a quality assessment
for each argument is also provided: Ineffective, Adequate, or Effective.
        </p>
        <p>
          For fine-tuning, we used the split provided by Ding et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] consisting of 3,353 essays (i.e., 29,440
arguments) for the training set and 419 essays (i.e., 3,614 arguments) for the validation set. Detailed
statistics of this dataset split can be found in Ding et al. [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ].
        </p>
        <p>Setting We performed the fine-tuning of the small, open-source LLMs using the SFTTrainer module
from the TRL library (https://huggingface.co/docs/trl/sft_trainer). We employed Unsloth (https://github.com/unslothai/unsloth) to optimize performance and reduce memory usage.</p>
        <p>
          Additionally, we incorporated Low-Rank Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and Quantized Low-Rank
Adaptation (QLoRA) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to further reduce memory demands and improve the fine-tuning speed.
An early stopping criterion was applied during fine-tuning. The
specific hyperparameters used in the fine-tuning can be found in Appendix A.2.3. Regarding GPT-4o
mini, we used OpenAI’s API for fine-tuning this model.
        </p>
        <p>Input sequence and target Both the input sequence and the fine-tuning target were formatted in
accordance with the previously described few-shot prompting methodology. For the segmentation
task, the input was the original essay, while the target output was the corresponding essay segmented
by a designated SEP label (See Table 7 in Appendix). In contrast, for the joint task of argument type
classification and quality assessment, the input consisted of the essay already segmented with the
SEP labels (See Table 7 in Appendix). The target, in this instance, was defined as the same essay
further partitioned into discrete arguments, with each argument interleaved with labels specifying
both the type of argument and its quality (See Table 10 in Appendix). This approach ensures
methodological consistency across tasks and facilitates a systematic evaluation of model performance
on both segmentation and combined argument type classification and quality assessment tasks. See
Appendix A.2.3 for more details.</p>
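        <p>A minimal sketch of how such input/target pairs could be built from an annotated essay is shown below. It assumes each annotated argument is available as a dictionary with "text", "type", and "quality" keys; this mirrors the formats of Tables 7 and 10 but is not the authors’ preprocessing code.</p>
        <preformat>
SEP = "&lt;SEP&gt;"  # boundary marker used in the segmented essays


def make_pairs(arguments):
    """Build (input, target) fine-tuning pairs from a list of annotated arguments.

    `arguments` is assumed to be a list of dicts with "text", "type" and
    "quality" keys, in essay order; a sketch, not the exact preprocessing.
    """
    essay = " ".join(a["text"] for a in arguments)
    segmented = " ".join(f'{a["text"]} {SEP}' for a in arguments)
    labeled = " ".join(f'{a["text"]} &lt;{a["type"]}, {a["quality"]}&gt;' for a in arguments)

    segmentation_pair = (essay, segmented)  # task (1): essay -> essay with SEP labels
    joint_pair = (segmented, labeled)       # task (4): segmented essay -> type + quality labels
    return segmentation_pair, joint_pair


example = [
    {"text": "I think the face is a natural landform.", "type": "Position", "quality": "Adequate"},
    {"text": "No life has been discovered on Mars yet.", "type": "Claim", "quality": "Adequate"},
]
print(make_pairs(example)[1][1])
</preformat>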
        <p>
          Inference The fine-tuned models were run on an Apple M1 Pro laptop with 32 GB RAM using
Ollama (https://github.com/ollama/ollama), an open-source framework that enables users to run, create, and share LLMs locally on their
machines. We did not provide any few-shot examples demonstrating how to perform the task or how
to specify the output format. To evaluate performance, we used the same test set as Ding et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
Fine-tuning is expected to enhance the model’s ability to generate outputs that closely mirror the input
essay and conform better to the specified output format than the non-fine-tuned models. See Table 3 in
the Appendix for a summary of experiment variants.
        </p>
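        <p>As an illustration, the snippet below queries a locally served model through Ollama’s HTTP API. It assumes an Ollama server is running on its default port and that the fine-tuned weights have been imported under a model name such as the placeholder used here; it is a sketch of local inference, not the exact evaluation harness.</p>
        <preformat>
import requests


def generate_locally(prompt, model="llama3.1:8b"):
    """Send a prompt to a model served locally by Ollama and return its answer.

    The model name is a placeholder; a fine-tuned model imported into Ollama
    would be addressed by whatever name it was registered under.
    """
    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
</preformat>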
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          We performed all our evaluations on the test set of the “Feedback Prize - Predicting Effective Arguments”
dataset. We use the same test set employed by Ding et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], consisting of 419 essays with a total
of 3,711 arguments. Detailed statistics of this dataset split can be found in Ding et al. [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. For the
segmentation task, the essay has to be segmented into arguments. For the argument type classification
task, the segmented arguments need to be classified into one of seven types: Lead, Position, Claim,
Counterclaim, Rebuttal, Evidence, and Concluding Statement. In the case of the quality assessment task,
the possible values are: Ineffective, Adequate, Effective. See Table 5 in Appendix A.2.2 for a description
of each label.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance metrics</title>
        <p>Segmentation F1 score We compute the F1 score for segmentation at the token level using the BIO
tagging scheme: the first token of each argument is tagged as B, tokens inside an argument are tagged as I,
and O denotes tokens that are not part of any argument. The O tag never occurs in our task, since every
token is assumed to belong to an argument.</p>
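        <p>A minimal sketch of this token-level evaluation is given below: it converts a SEP-segmented essay into B/I tags and computes a macro F1 over them with scikit-learn. It assumes the predicted segmentation preserves the original tokens, so that gold and predicted tag sequences have the same length; the exact implementation used in the paper may differ.</p>
        <preformat>
from sklearn.metrics import f1_score

SEP = "&lt;SEP&gt;"


def bio_tags(segmented_essay):
    """Convert a SEP-segmented essay into token-level B/I tags.

    Every token belongs to some argument, so the O tag never occurs here.
    """
    tags = []
    for segment in segmented_essay.split(SEP):
        tokens = segment.split()
        if tokens:
            tags.extend(["B"] + ["I"] * (len(tokens) - 1))
    return tags


def segmentation_f1(gold_segmented, pred_segmented):
    """Macro F1 over the B and I tags (assumes identical tokenization)."""
    gold, pred = bio_tags(gold_segmented), bio_tags(pred_segmented)
    return f1_score(gold, pred, labels=["B", "I"], average="macro")
</preformat>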
        <p>
          Type and quality F1 score We adopt the evaluation method used by Ding et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. A predicted
argument $p$ with at least 50% overlap with a gold argument $g$ is considered a match, i.e.,
$\min(o_p, o_g) \geq 0.5$, where $o_p = |p \cap g| / |p|$ is the overlap of the predicted argument with the
gold argument and $o_g = |p \cap g| / |g|$ is the overlap of the gold argument with the predicted argument.
        </p>
        <p>
Matched predicted arguments are considered true positives if they are of the same type (or quality) as
the gold argument; otherwise, they are counted as false negatives. Unmatched predicted arguments
are considered false positives and labeled as Echec.</p>
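        <p>The matching criterion can be sketched as follows, representing each argument as a set of token indices; the threshold and the handling of unmatched predictions follow the description above, while the helper itself is illustrative.</p>
        <preformat>
def is_match(pred_span, gold_span, threshold=0.5):
    """Decide whether a predicted argument matches a gold argument.

    Both spans are assumed to be sets of token indices. A prediction counts
    as a match when both overlap ratios reach the threshold.
    """
    if not pred_span or not gold_span:
        return False
    inter = len(pred_span &amp; gold_span)
    o_pred = inter / len(pred_span)  # overlap of the prediction with the gold argument
    o_gold = inter / len(gold_span)  # overlap of the gold argument with the prediction
    return min(o_pred, o_gold) &gt;= threshold


# Matched predictions with the correct type (or quality) are true positives;
# unmatched predictions are false positives and receive the Echec label.
print(is_match({0, 1, 2, 3}, {2, 3, 4, 5}))  # 0.5 overlap on both sides -> True
</preformat>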
        <p>Spelling errors LLMs are required to reproduce each input essay—originally written by
children—segmented with the SEP separator. Because these essays contain numerous spelling errors,
smaller LLMs often attempt to correct them automatically, thereby altering the text and complicating
direct comparisons with the ground truth. Since our goal is not to address spelling errors and the
dataset labels do not account for them, we first correct the essays before providing them to the LLM to
ensure accurate and consistent segmentation. To do so, we use language-tool-python, a wrapper
for LanguageTool.</p>
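        <p>A minimal sketch of this pre-correction step with the language-tool-python wrapper is shown below; the rule configuration actually used is not specified here, so the default English checker is assumed.</p>
        <preformat>
import language_tool_python


def correct_spelling(essay):
    """Correct obvious spelling errors in an essay before segmentation."""
    tool = language_tool_python.LanguageTool("en-US")
    try:
        return tool.correct(essay)  # applies LanguageTool's suggested replacements
    finally:
        tool.close()


print(correct_spelling("I think that the face is a natural landform becuase ..."))
</preformat>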
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Baselines</title>
        <p>
          We compare our method with several encoder-based variants following the approach proposed by
Ding et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Their framework employs BERT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for argument type and quality assessment and a
Longformer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for token-level segmentation, leveraging the Longformer’s ability to handle long-text
classification (see Appendix A.2.1). Notably, their method is the only approach in the literature that
utilizes this dataset for the same tasks while processing segmented essays as input. Additionally, as
mentioned earlier, we compare our results with GPT-4o mini [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], both vanilla and fine-tuned (using the
same fine-tuning data as with the other small LLMs, see Appendix A.2.3) versions, to have a sense of
the upper performance bound achieved by a commercial model.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Results</title>
        <sec id="sec-4-4-1">
          <title>4.4.1. Argument segmentation</title>
          <p>Figure 2 compares the performance of the small, open-source models on the segmentation task, evaluated
in their best few-shot settings (either zero or three) and fine-tuned configurations. Additionally, we
include the Longformer as a state-of-the-art baseline, and GPT-4o mini (with three-shot learning and
fine-tuned) as a commercial upper bound. Error bars indicate the standard deviation across three runs.
All LLMs surpass the Longformer’s performance, demonstrating the advantages of large-scale
pretraining and transfer learning. Fine-tuned models exhibit substantial performance gains over their
few-shot counterparts, underscoring the effectiveness of supervised adaptation. Among all small
open-source models, the fine-tuned Llama 3.1 8B achieves the highest F1 score of 87.52, an increase of 18.05
points or 26.00% in segmentation performance over the baseline (Longformer), and 6.41 points (3.7%)
below GPT-4o mini’s performance. Figure 4 in Appendix A.5.4 depicts the overlap between the
inferred segmentation and the gold segmentation across models. Fine-tuned Llama 3.1 8B achieves
the highest overlap with the ground truth, outperforming Longformer and aligning with the macro F1
results. Figure 5 in Appendix A.5.3 reports the average number of arguments per essay across models
for further segmentation analysis.</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>4.4.2. Argument type and quality assessment</title>
          <p>Figure 3 shows the macro-averaged F1 scores for the argument type (left) and quality (right) tasks,
respectively, across various models and experimental settings. The figure compares the performance of
the three small open-source models in the best few-shot (either zero or three) and fine-tuned settings,
along with the baseline and GPT-4o mini (three-shot and fine-tuned). The F1 scores are reported
for two task setups: individual classification (indiv.) and joint classification of both argument type
and quality together (joint). Results are shown for two segmentation conditions: gold (provided) and
inferred. Transparent colors indicate the gold condition, while less transparent colors represent inferred
segmentation. Circles and triangles correspond to the joint and individual setups, respectively. Error
bars represent the standard deviation of the results.</p>
          <p>Argument type classification Regarding the type classification task, models using the gold
segmentation consistently outperform the models with inferred segmentation and the baseline. We also
observe significant improvements when fine-tuning the small, open-source LLMs, both with the gold
and inferred segmentation. The best-performing model with the gold segmentation is Gemma 2 9B
fine-tuned, with an F1 score of 79.74, which is 10.27 points or 14.78% larger than the baseline (BERT)
but 5.1% below GPT-4o mini’s performance. The best-performing model with inferred segmentation is
Llama 3.1 8B fine-tuned, with an F1 score of 51.45, which is 12.27 points or 31.32% larger than the
baseline (Longformer + BERT) but 19.85% lower than GPT-4o mini fine-tuned. Generally, models yield
better performance in the joint than in the individual setup.</p>
          <p>Argument quality assessment Regarding the argument quality assessment task, models using the
gold segmentation also consistently outperform the models with inferred segmentation. In this case,
the best-performing small open-source model is Gemma 2 9B with three-shot learning, achieving an F1
score of 44.56, which is 19.96 points or 81.14% larger than the baseline’s performance but 18.98% lower
than the GPT-4o mini fine-tuned performance. With inferred segmentation, the best performing small,
open-source LLM (Gemma 2 9B with three-shot learning) achieves similar performance to the baseline,
both of them below that of GPT-4o mini. Interestingly, fine-tuning, in this case, tends to worsen the
performance across models, particularly when given the gold segmentation.</p>
          <p>Note that the results for the fine-tuned Llama 3.1 8B, Qwen 2.5 7B, and GPT-4o mini are absent in the individual
setup due to repeated inference failures. Refer to Section 6 for more details.</p>
        </sec>
        <sec id="sec-4-4-3">
          <title>4.4.3. Label-level performance analysis</title>
          <p>To better understand the previously reported performance figures, we summarize the results at the
label level for the best-performing model (Llama 3.1 8B fine-tuned in the argument type classification
task and Gemma 2 9B three-shot learning in the argument quality assessment task) in the joint setup
and the inferred segmentation configuration. Tables 1, 2 (and 12 in the Appendix) depict the precision,
recall and F1 score for the argument type classification, quality assessment and segmentation tasks,
respectively. Furthermore, Tables 14, 15 and 13 in Appendix contain the corresponding confusion
matrices.</p>
          <p>Argument type classification As seen in Table 1, the model is the most accurate in classifying
Concluding statements, followed by Evidence. In contrast, the model exhibits the lowest performance
when classifying Rebuttals and Counterclaims. Interestingly, the model has very high recall (96.97%) but
low precision (48.48%) when classifying Lead arguments, suggesting that it often over-labels arguments
as Lead. There are large misclassification rates of Claim, Position, and Evidence into Echec entries,
confirming that segmentation errors contribute to lowering the performance in the classification task,
see Table 14 in the Appendix.</p>
          <p>Quality assessment Regarding quality assessment, the best performance of the model is on the
Adequate label, struggling with the Ineffective (low recall) and especially the Effective (low precision)
labels. In fact, Adequate tends to be over-predicted by the model. Part of the classification errors are
also due to segmentation mistakes, as reflected by the Echec column of the confusion matrix in Table 15
in the Appendix.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this paper, we have explored the potential of three small, open-source LLMs—namely, Qwen 2.5
7B, Llama 3.1 8B, Gemma 2 9B—to perform three argument mining tasks in an educational setting:
argument segmentation, argument type classification, and argument quality assessment. We have
experimented with both few-shot prompting and fine-tuning, comparing small open-source LLMs to
commercial LLMs (GPT-4o mini) and state-of-the-art encoders. From these extensive experiments, we
draw several findings.</p>
      <p>First, small and open-source LLMs are able to effectively perform argument mining tasks with
significantly better performance than state-of-the-art baselines (Longformer and BERT). Commercial
small LLMs, such as GPT-4o mini, yield the best performance.</p>
      <p>Second, different approaches provide the best results depending on the task. Whereas
fine-tuned models consistently outperform few-shot prompting in the argument segmentation and type
classification tasks, we observe the opposite behavior in the argument quality assessment task.</p>
      <p>
        Third, model performance in the tasks of interest does not necessarily increase with the number of
shots when performing few-shot prompting, which is consistent with what has been reported in the
literature [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. This finding is particularly evident in the case of Llama 3.1 8B, which exhibits the best
performance with zero-shot prompting. The deterioration in performance as the number
of shots increases is probably due to the added complexity of longer prompts, which the model
struggles to process.
      </p>
      <p>
        Fourth, joint fine-tuning setups where both tasks were carried out at the same time tend to yield
better results than individual setups where the tasks were performed independently, showing a strong
link between classifying argument types and assessing their quality [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Fifth, the automatic segmentation and classification of the type of argument seem to be easier tasks
than the assessment of the quality of the arguments, likely due to difficulties in creating high-quality
and consistent ground truth quality assessments across essays (see Appendix A.6 for a discussion of the annotation quality) [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ].
      </p>
      <p>
        Finally and most importantly, this study highlights the potential of open-source, small LLMs, running
locally on personal computers, to support students in the development of their essay-writing skills. Our
framework prioritizes privacy and accessibility, addressing the challenge of developing efficient models
for local use without consuming too many resources [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and future work</title>
      <p>
        In this paper, we have presented a study of the potential of small, open-source LLMs for argument
mining, investigating their effectiveness in both few-shot prompting and fine-tuning setups.
Fine-tuning proved especially valuable for argument segmentation and type classification, where small,
open-source LLMs significantly outperform state-of-the-art approaches by 18.05 points in argument
segmentation and 10.27 points in argument type classification. Our experiments also illustrate the value
of joint setups for improved argument type classification. By focusing on models running locally on
students’ personal computers, our research promotes accessibility and resource efficiency, illustrating
the potential of open-source, small LLMs as a promising tool for educational applications.
      </p>
      <p>
        Our work is not exempt from limitations that we plan to address in future work. First, the performance
on the argument quality assessment task is low for all models, including state-of-the-art methods. We
hypothesize that the poor performance might be due to the quality of the annotated data, as suggested
by other authors [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Hence, we plan to improve the annotation quality and create a more reliable
dataset to be shared with the research community.
      </p>
      <p>Second, the fine-tuned Qwen 2.5 7B, Llama 3.1 8B, and GPT-4o mini in the individual setup were
unable to perform the argument type classification and quality assessment tasks due to repeated failures
during inference. While the exact cause remains to be investigated, potential reasons include the
hyperparameters used for the fine-tuning not being suitable for the individual setup or instability in
the fine-tuned models.</p>
      <p>Finally, we evaluated small LLMs using the PERSUADE 2.0 corpus dataset, which consists solely of
English high-school writing. As a result, it remains uncertain whether findings can be generalized to
other educational contexts and languages. Furthermore, a “clinical” evaluation in the classroom with
students would be necessary to assess the real-world effectiveness of the proposed approach beyond
what can be measured with benchmark datasets. We leave such an evaluation to future work.</p>
    </sec>
    <sec id="sec-7">
      <title>Ethical considerations</title>
      <p>
        The integration of AI-based argument mining for automatically evaluating student essays raises ethical
concerns related to data privacy, fairness, accountability and human autonomy [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. First, sensitive
educational data, which is often tied to the students’ identities, must be handled securely, with clear
consent and transparent data-sharing policies. Second, the presence of biases in the annotation and
model training processes can lead to inequitable outcomes that disadvantage specific student groups.
Third, blind trust in and over-reliance on automated assessments can undermine teachers’ professional
judgment and lead to a lack of human touch in the educational process. The proposed approach is part
of a larger project aiming to develop an educational chatbot that mitigates these concerns by leveraging
small, open-source LLMs that run locally on the students’ computers and by consciously measuring
and mitigating biases both in the training data and the models. In the long term, the use of these kinds
of systems may influence students’ writing practices and learning strategies, and could shift teachers’
roles toward facilitating deeper reasoning and metacognitive skills.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>L.F. and N.O. have been partially supported by a nominal grant received at the ELLIS Unit Alicante
Foundation from the Regional Government of Valencia in Spain (Resolución de la Conselleria de
Innovación, Industria, Comercio y Turismo, Dirección General de Innovación). L.F. has also been
partially funded by a grant from the Banc Sabadell Foundation.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Generative AI to check grammar, formal tone,
and spelling. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Appendix</title>
      <sec id="sec-10-1">
        <title>A.1. Detailed experimental setup and methodology</title>
        <p>The experimental setup consists of multiple configurations based on different task types, segmentation
methods, adaptation strategies, and model choices. Table 3 provides an overview of the experiment
variants. The tasks include argument segmentation, argument type classification, and argument quality
assessment. Two segmentation approaches are considered: gold segmentation (ground truth) and
inferred segmentation (automatically performed by the model). The setup can be either individual,
where argument type and quality are classified separately, or joint, where both are assessed together. We
explore two adaptation strategies: few-shot learning (ranging from zero to four-shot) and fine-tuning.
The models used in the experiments fall into three categories: encoder-based models, small open-source
LLMs, and a proprietary model, GPT-4o mini.</p>
        <p>Table 3 (overview of experiment variants):</p>
        <p>Task: argument segmentation; argument type classification; argument quality assessment.</p>
        <p>Segmentation: gold; inferred.</p>
        <p>Setup: individual (type or quality); joint (type and quality).</p>
        <p>Adaptation strategy: few-shot (zero to four-shot); fine-tuned.</p>
        <p>Models: encoder-based (BERT, BERT with two heads, Longformer, Longformer with two heads); small open-source LLMs (Llama 3.2 3B, OLMo 2 7B, Qwen 2.5 7B, DeepSeek R1 7B, Llama 3.1 8B, Gemma 2 9B); GPT-4o mini.</p>
      </sec>
      <sec id="sec-10-2">
        <title>A.2. Further details on adaptation strategies</title>
        <sec id="sec-10-2-1">
          <title>A.2.1. Encoder-based baseline</title>
          <p>
            The following encoders are used for different task variants, following the approach proposed by Ding
et al. [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]:
• BERT [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] for argument type classification and argument quality assessment (individual setup)
with segmentation given.
• BERT with two prediction heads [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] for joint argument type classification and argument
quality assessment (joint setup) with segmentation given.
• Longformer [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] for segmentation via token classification.
• Longformer with two prediction heads [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] for segmentation and argument type classification
and segmentation and argument quality assessment (joint setup).
          </p>
          <p>
            The training was conducted for 10 epochs using the same settings as in Ding et al. [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
The evaluation follows the same methodology used to assess the LLMs’ performance.
          </p>
        </sec>
        <sec id="sec-10-2-2">
          <title>A.2.2. Few-shot learning</title>
          <p>Few-shot prompting queries Complementing Section 3.2 in the main paper, Table 4 presents
the expected output format for different tasks, while Table 5 lists the queries used to prompt the LLMs.</p>
          <p>Table 4 (expected few-shot output format per task):
Type: {"TYPE": ["Position"]};
Quality: {"QUALITY": ["Adequate"]};
Type and Quality: {"TYPE AND QUALITY": ["Position", "Adequate"]}.</p>
          <p>Segmentation query: #TASK: Segment the following essay into distinct argument components. After each
argument component, insert the marker &lt;SEP&gt;. Keep the original text in the same
order without adding, removing, or altering any words (other than inserting the
&lt;SEP&gt; markers).
#GUIDELINES: Identify each coherent segment that forms a logical unit of the
argument (e.g., claims, premises, evidence, or conclusions).</p>
          <p>Argument type query: You are a strict AI evaluator specializing in detecting the type of argument
components in essays. The argument types are as follows:
- Lead: An introduction that begins with a statistic, quotation, description, or other
device to grab the reader’s attention and point toward the thesis.
- Position: An opinion or conclusion on the main question.
- Claim: A statement that supports the position.
- Counterclaim: A statement that opposes another claim or provides an opposing
reason to the position.
- Rebuttal: A statement that refutes a counterclaim.
- Evidence: Ideas or examples that support claims, counterclaims, or rebuttals.
- Concluding Statement: A statement that restates the claims and summarizes the
argument.</p>
          <p>Argument quality query: You are a strict AI evaluator specializing in assessing the quality of argument
components in essays. Each component should be rated as one of the following:
- Ineffective: The component is unclear, unconvincing, or poorly structured.
- Adequate: The component is understandable and somewhat convincing but lacks
strong support or clarity.
- Effective: The component is well-structured, clear, and strongly supports the
argument.</p>
        </sec>
        <sec id="sec-10-2-3">
          <title>A.2.3. Fine-tuning</title>
          <p>
            Following the dataset split used in [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ], we fine-tuned the small LLMs and GPT-4o-mini on a training
set of 3,353 essays (29,440 argument segments) and evaluated them on a validation set of 419 essays
(3,614 argument segments).
Input sequences and target formats Complementing Section 3.3 in the main paper, Table 6
provides the input sequences and target formats used during the fine-tuning for the different tasks. See
Tables 7, 8, 9 and 10 in Appendix A.3 for examples of these formats.
          </p>
          <p>Table 6 (input and target formats per task):
Segmentation: input = Essay; target = Essay + SEP separators.
Type (Indiv.): input = Essay + SEP; target = Essay + TYPE separators.
Quality (Indiv.): input = Essay + SEP; target = Essay + QUALITY separators.
Type + Quality: input = Essay + SEP; target = Essay + TYPE and QUALITY separators.</p>
          <p>Fine-tuning small, open-source LLMs Qwen 2.5 7B, Llama 3.1 8B, and Gemma 2 9B were
fine-tuned on a single GPU. We used the AdamW optimizer (β1 = 0.9, β2 = 0.999) in 8-bit precision. A
weight decay of 0.01 was applied to all weights except biases and normalization layer parameters. The
mini-batch size was 2, and we opted for 12 warmup steps. Regarding QLoRA, the rank of the LoRA modules
r is 16, and the LoRA scaling factor α is 16, with 0 dropout and 4-bit quantization. Gradient accumulation
is set to 4, with a learning rate of 1e-4 or 5e-4, depending on the model and setup configuration,
and a cosine learning rate schedule. Thanks to the early stopping method, training terminated after
approximately 100 to 400 steps.</p>
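          <p>For illustration, a fine-tuning run with these hyperparameters could be set up roughly as follows with Unsloth and TRL’s SFTTrainer. The model identifier, the placeholder datasets, and some argument names (which vary across TRL and Transformers versions) are assumptions; this is a sketch, not the exact training script.</p>
          <preformat>
from datasets import Dataset
from transformers import TrainingArguments, EarlyStoppingCallback
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Placeholder datasets: each row holds one "essay + SEP" input/target string.
train_ds = Dataset.from_list([{"text": "essay text ... &lt;SEP&gt; ..."}])
val_ds = Dataset.from_list([{"text": "essay text ... &lt;SEP&gt; ..."}])

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",  # placeholder for one of the three LLMs
    load_in_4bit=True,                           # 4-bit quantization (QLoRA)
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16, lora_dropout=0,         # LoRA rank, scaling factor, dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=12,
    learning_rate=1e-4,            # 5e-4 for some model/setup combinations
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    optim="adamw_8bit",            # AdamW in 8-bit precision
    eval_strategy="steps", eval_steps=50,
    save_strategy="steps", save_steps=50,
    load_best_model_at_end=True,   # needed for early stopping
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    dataset_text_field="text",
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
</preformat>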
          <p>Fine-tuning GPT-4o mini GPT-4o mini-2024-07-18 was fine-tuned using the OpenAI fine-tuning
platform. The hyperparameters (number of epochs, learning rate multiplier, and batch size) were automatically
determined, resulting in 3 epochs, a learning rate multiplier of 1.8, and a batch size of 6.</p>
        </sec>
      </sec>
      <sec id="sec-10-3">
        <title>A.3. Prompting and handling output format</title>
        <sec id="sec-10-3-1">
          <title>A.3.1. Examples of prompt and output format</title>
          <p>Hi, i’m Isaac, i’m going to be writing about how this face on Mars is a natural landform or if there is life on Mars
that made it. The story is about how NASA took a picture of Mars and a face was seen on the planet. NASA
doesn’t know if the landform was created by life on Mars, or if it is just a natural landform. &lt;SEP&gt;. On my
perspective, I think that the face is a natural landform because I dont think that there is any life on Mars. In
these next few paragraphs, I’ll be talking about how I think that is is a natural landform &lt;SEP&gt;I think that the
face is a natural landform because there is no life on Mars that we have descovered yet &lt;SEP&gt; [...] Though
people were not satified about how the landform was a natural landform, in all, we new that alieans did not
form the face. I would like to know how the landform was formed. we know now that life on Mars doesn’t exist.
&lt;SEP&gt;</p>
          <p>Table 7 presents an example of essay segmentation format used as few-shot and fine-tuned output
format for the segmentation task and input format for the argument type classification task. Table 8,
Table 9 and Table 10 present examples of the few-shot and fine-tuned output format used for the
argument type classification and quality assessment tasks in the individual and joint setups, respectively.
Note that the tables retain spelling errors present in the original essay, as they directly reflect the source
text.
Hi, i’m Isaac, i’m going to be writing about how this face on Mars is a natural landform or if there is life on Mars
that made it. The story is about how NASA took a picture of Mars and a face was seen on the planet. NASA
doesn’t know if the landform was created by life on Mars, or if it is just a natural landform. &lt;Lead&gt;. On my
perspective, I think that the face is a natural landform because I dont think that there is any life on Mars. In
these next few paragraphs, I’ll be talking about how I think that is is a natural landform &lt;Position&gt;I think that
the face is a natural landform because there is no life on Mars that we have descovered yet &lt;Claim&gt; [...] Though
people were not satified about how the landform was a natural landform, in all, we new that alieans did not
form the face. I would like to know how the landform was formed. we know now that life on Mars doesn’t exist.
&lt;Concluding Statement&gt;
Hi, i’m Isaac, i’m going to be writing about how this face on Mars is a natural landform or if there is life on Mars
that made it. The story is about how NASA took a picture of Mars and a face was seen on the planet. NASA
doesn’t know if the landform was created by life on Mars, or if it is just a natural landform. &lt;Adequate&gt;. On
my perspective, I think that the face is a natural landform because I dont think that there is any life on Mars. In
these next few paragraphs, I’ll be talking about how I think that is is a natural landform &lt;Adequate&gt;I think that
the face is a natural landform because there is no life on Mars that we have descovered yet &lt;Adequate&gt; [...]
Though people were not satified about how the landform was a natural landform, in all, we new that alieans did
not form the face. I would like to know how the landform was formed. we know now that life on Mars doesn’t
exist. &lt;Ineffective&gt;</p>
          <p>Hi, i’m Isaac, i’m going to be writing about how this face on Mars is a natural landform or if there is life on Mars
that made it. The story is about how NASA took a picture of Mars and a face was seen on the planet. NASA
doesn’t know if the landform was created by life on Mars, or if it is just a natural landform. &lt;Lead, Adequate
&gt;. On my perspective, I think that the face is a natural landform because I dont think that there is any life on
Mars. In these next few paragraphs, I’ll be talking about how I think that is is a natural landform &lt;Position,
Adequate&gt; I think that the face is a natural landform because there is no life on Mars that we have descovered
yet &lt;Claim, Adequate&gt; [...] Though people were not satified about how the landform was a natural landform,
in all, we new that alieans did not form the face. I would like to know how the landform was formed. we know
now that life on Mars doesn’t exist. &lt;Concluding Statement, Ineffective&gt;</p>
        </sec>
      </sec>
      <sec id="sec-10-4">
        <title>A.4. Incorrect outputs format during LLM inference</title>
        <p>Handling incorrect outputs During inference, the LLM’s responses did not always adhere to the
specified output format required to properly perform the argument mining tasks. In such cases, the
inference was repeated up to five times. If the issue persisted, the argument (or the essay) was discarded.
Examples of incorrect outputs Table 11 presents two examples of incorrect outputs generated with
Llama 3.1 8B for the argument type task, along with their corresponding prompt. Note how in the first
output, Llama 3.1 8B argues that it is not able to perform the task because it is too complex for it. In the
second output, a part of the essay is missing: “The Electoral College is also a modern sense", and Llama
3.1 8B has omitted the classifications of some arguments.</p>
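        <p>The retry logic described above can be sketched as follows; it takes the generation and parsing helpers as arguments (for example, the illustrative generate_locally and parse_type sketches in Sections 3.2 and 3.3), and the five-attempt limit follows the procedure described in this appendix.</p>
        <preformat>
def classify_with_retries(prompt, generate, parse, max_attempts=5):
    """Re-query the model when its answer does not follow the required format.

    After max_attempts malformed answers, the argument (or essay) is discarded.
    """
    for _ in range(max_attempts):
        label = parse(generate(prompt))
        if label is not None:
            return label
    return None  # discarded
</preformat>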
        <sec id="sec-10-4-1">
          <title>Frequency of incorrect output format across models and experiments</title>
          <p>Among all the tested LLMs, fine-tuned models required more inference retries and experienced higher failure rates compared
to their few-shot counterparts. Furthermore, increasing the number of shots in the few-shot setting led
to a higher number of retries and failures. Notably, Qwen 2.5 7B in the few-shot setting exhibited the
highest failure rate among all models.</p>
          <p>Table 11 (example prompt and two incorrect outputs from Llama 3.1 8B). Prompt: “#QUERY: You are a very strict AI evaluator specializing in detecting the type of argument in essays. The
argument types are as follows: - &lt;Lead &gt;: An introduction that begins with a statistic, quotation, description, or
other device to grab the reader’s attention and point toward the thesis. - &lt;Position &gt;: An opinion or conclusion
on the main question. - &lt;Claim &gt;: A statement that supports the position. - &lt;Counterclaim &gt;: A statement that
opposes another claim or provides an opposing reason to the position. - &lt;Rebuttal &gt;: A statement that refutes a
counterclaim. - &lt;Evidence &gt;: Ideas or examples that support claims, counterclaims, or rebuttals. - &lt;Concluding
Statement &gt;: A statement that restates the claims and summarizes the argument.
#TASK: Analyze the following essay and replace only each instance of &lt;SEP&gt;with the argument type of the
sentence immediately preceding it. Do not insert any additional markers or segmentations beyond replacing
&lt;SEP &gt;. If a sentence does not have &lt;SEP&gt;, leave it unmodified.
#ESSAY: Dear,Florida I think that we should keep the Electoral College becuase it keeps the number of votes and
it produces a clear winner. &lt;SEP&gt;The Electoral College restores some of the weight in the political balance that
large states (by population) lose by virtue of the senate decreed in the constitution. &lt;SEP&gt;The Electoral College
is a process that consists of 538 electors. Each state equals the number of members in the house of representives.
The Electoral also select what their responsibilities are. &lt;SEP&gt;The Electoral college is widely regarded as an
anachronism a non-democratic method. &lt;SEP&gt;The Electoral College is also a mordern sense. Also no region
(south,northeast,etc.) has enough electoral votes to elect a president. Residents also feel that their votes dont
get counted that the new president will have no regard for thier interest, that he really isnt their president.
&lt;SEP&gt;Electoral College alson avoids the problem of elections in which no candidate receives a majority of the
vote cast. &lt;SEP&gt;"
Output 1 “I can’t help you with this task. This task requires identifying and labeling specific argument types in
an essay, which is a complex task that may require human judgment and expertise. If you’d like to discuss this
further, I can try to provide more information on the topic of argument type identification."
Output 2 “After analyzing the essay, I will replace each instance of &lt;SEP&gt;with the argument type of the sentence
immediately preceding it. Here is the modified essay: Dear, Florida I think that we should keep the Electoral
College because it keeps the number of votes and it produces a clear winner. &lt;Claim &gt;The Electoral College
restores some of the weight in the political balance that large states (by population) lose by virtue of the senate
decreed in the constitution. &lt;Claim &gt;The Electoral College is a process that consists of 538 electors. Each state
equals the number of members in the house of representives. The Electoral also select what their responsibilities
are. &lt;Counterclaim &gt;The Electoral college is widely regarded as an anachronism a non-democratic method.
&lt;Position &gt;Also, no region (south,northeast,etc.) has enough electoral votes to elect a president. Residents also
feel that their votes dont get counted that the new president will have no regard for thier interest, that he really
isnt their president. &lt;Claim &gt;Electoral College alson avoids the problem of elections in which no candidate
receives a majority of the vote cast. Note: Some sentences don’t have &lt;SEP&gt;as they are not part of an argument
sequence or are standalone statements."
</p>
        </sec>
      </sec>
      <sec id="sec-10-5">
        <title>A.5. Further analyses</title>
        <sec id="sec-10-5-1">
          <title>A.5.1. Details about the performance metrics</title>
          <p>To assess the efficiency of the proposed methods, we report the following metrics that consider the
imbalance in the distribution of labels.</p>
          <p>Per-label metrics, namely precision, recall, and F1 score for each individual label, which give an
in-depth view of the classifier's performance on each label.</p>
          <p>Multi-label confusion matrix to provide a detailed breakdown of model performance across
argument categories, highlighting both accurately predicted cases (along the diagonal) and common
misclassifications. The values in the matrix give insights into which argument types or quality
assessment labels the proposed method distinguishes effectively and where it struggles.</p>
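          <p>As an illustration, the per-label metrics and the multi-label confusion matrix can be computed with
scikit-learn as in the following minimal sketch; the label list and the gold and predicted sequences are
placeholder examples, not data from our experiments.</p>
          <preformat>
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# Placeholder label set and toy gold/predicted argument types.
labels = ["Lead", "Position", "Claim", "Counterclaim",
          "Rebuttal", "Evidence", "Concluding Statement"]
y_true = ["Lead", "Claim", "Claim", "Evidence", "Counterclaim"]
y_pred = ["Lead", "Claim", "Position", "Evidence", "Claim"]

# Per-label precision, recall, and F1 (one entry per element of `labels`).
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)

# Confusion matrix: rows are gold labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=labels)
</preformat>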
        </sec>
        <sec id="sec-10-5-2">
          <title>A.5.2. Details about the macro-averaged F1 score</title>
          <p>The macro-averaged F1 score is the mean of the per-label F1 scores, treating all labels equally. It
therefore provides a measure of overall performance in which frequent and rare labels contribute equally,
irrespective of the label imbalance.</p>
          <p>For $K$ classes, the macro-averaged F1 score is
$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{F1}_{k},$$
where the F1 score of class $k$ is given by
$$\mathrm{F1}_{k} = \frac{2 \cdot \mathrm{Precision}_{k} \cdot \mathrm{Recall}_{k}}{\mathrm{Precision}_{k} + \mathrm{Recall}_{k}},$$
with $\mathrm{Precision}_{k}$ and $\mathrm{Recall}_{k}$ denoting the precision and recall of class $k$.</p>
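          <p>The same score can be computed directly from the per-class precision and recall values; the following
small sketch mirrors the formula above (the helper name and the use of NumPy are our own choices for
illustration).</p>
          <preformat>
import numpy as np

def macro_f1(precision, recall):
    """Macro-averaged F1 from per-class precision and recall (formula above)."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = precision + recall
    denom = np.where(denom == 0.0, 1.0, denom)  # avoid 0/0 for empty classes
    f1_per_class = 2.0 * precision * recall / denom
    # Unweighted mean: every class counts equally, regardless of its support.
    # Equivalent to sklearn.metrics.f1_score(y_true, y_pred, average="macro").
    return f1_per_class.mean()
</preformat>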
        </sec>
        <sec id="sec-10-5-3">
          <title>A.5.3. Segmentation analysis</title>
          <p>Table 12 depicts the precision, recall, and F1 score for the argument segmentation task, and Table 13 the
corresponding confusion matrix.</p>
          <p>Segm.   Precision   Recall   F1-score
B        66.01       88.32    75.55
I        99.79       99.18    99.48</p>
          <p>The fine-tuned Llama 3.1 8B model shows very strong performance in predicting the I label (precision,
recall, and F1-score all above 99%). For the B label (beginning of an argument), recall is high (88.32%)
but precision is noticeably lower (66.01%): roughly a third of the predicted beginnings fall on tokens that
are actually inside an argument, whereas only about 12% of the true beginnings are missed, indicating that
the model tends to over-predict beginnings. The confusion matrix in Table 13 details these errors. Overall,
this points to strong segmentation performance, with the main remaining challenge being the precise
identification of the beginning of an argument (B).</p>
          <p>Figure 4 shows the overlap between the gold segmentation and the predicted segmentation across models.</p>
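          <p>For reference, the following minimal sketch shows how a B/I tag sequence translates into argument
spans under our reading of the scheme (each B opens a new argument and subsequent I tokens extend it); it
also illustrates why an over-predicted B splits a single gold argument into two predicted ones. This is an
illustrative decoding, not necessarily the exact one used in our pipeline.</p>
          <preformat>
def bi_to_spans(tags):
    """Convert a B/I tag sequence into (start, end) token spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                    # a new argument begins here
            if start is not None:
                spans.append((start, i))  # close the previous argument
            start = i
        elif tag != "I" and start is not None:
            spans.append((start, i))      # a non-argument token ends the span
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

# A spurious extra B at position 3 splits one argument into two predicted spans.
print(bi_to_spans(["B", "I", "I", "B", "I"]))  # [(0, 3), (3, 5)]
</preformat>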
        </sec>
        <sec id="sec-10-5-4">
          <title>A.5.4. Argument type classification analysis</title>
          <p>As a complementary analysis of Table 1 in the main paper, Table 14 displays the confusion matrix using
the fine-tuned Llama 3.1 8B model on the joint setup with the segmentation inferred by the model. Note
that the entire Echec row consists of zeros, as this label is not present in the ground-truth dataset;
Echec is assigned only when a predicted argument does not match any ground-truth argument.</p>
          <p>Table 14 (only the Lead column is shown here):
Type     Lead
Lead     10.33
Pos       0.00
Clai      0.00
C Clai    0.00
Reb       0.00
Evid      0.33
Ccl       0.00
Ech       0.00</p>
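          <p>For clarity, the following minimal sketch illustrates one way the Echec label can arise when aligning
predicted arguments with gold arguments. The overlap-based matching rule and its 0.5 threshold are
assumptions made for illustration; the exact matching criterion used in our evaluation may differ.</p>
          <preformat>
def predicted_label_for_gold(gold_span, pred_spans, pred_types, min_overlap=0.5):
    """Predicted type aligned with a gold argument span, or "Echec" when no
    predicted span overlaps it sufficiently (illustrative matching rule only)."""
    gs, ge = gold_span
    best_type, best_ratio = "Echec", 0.0
    for (ps, pe), ptype in zip(pred_spans, pred_types):
        overlap = max(0, min(pe, ge) - max(ps, gs))
        ratio = overlap / max(pe - ps, ge - gs, 1)  # longest span as reference
        if ratio &gt;= min_overlap and ratio &gt; best_ratio:
            best_type, best_ratio = ptype, ratio
    return best_type
</preformat>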
        </sec>
        <sec id="sec-10-5-5">
          <title>A.5.5. Quality assessment analysis</title>
          <p>As a complementary analysis of Table 2 in the main paper, Table 15 displays the confusion matrix of
the fine-tuned Gemma 2 9B model on the joint setup with the segmentation inferred by the model.</p>
          <p>Quality        Ineffective   Adequate   Effective   Echec
Ineffective          32.33        16.00        2.00    26.67
Adequate             72.67        95.22       10.67    86.33
Effective             4.00        40.33       15.00    34.67
Echec                 0.00         0.00        0.00     0.00</p>
        </sec>
      </sec>
      <sec id="sec-10-6">
        <title>A.6. Discussion on the annotation quality</title>
        <p>Data quality issues, such as inconsistent annotations, missing data, or biased labeling, can skew model
performance and reduce the reliability of automatic essay assessments, leading to inflated or misleading
metrics and limiting the generalization of results. In our experiments, we identified noise in the annotations.
Hence, future research should aim to address these limitations by improving annotation guidelines,
enhancing inter-rater reliability, and expanding the available dataset to include a more diverse range of
essays.</p>
      </sec>
      <sec id="sec-10-7">
        <title>A.7. Additional small open source LLMs</title>
        <p>
          We evaluated additional recent small open-source LLMs on the argument type classification task, using
three-shot prompting (a minimal prompting sketch is provided after the list). Table 16 contains the
macro-averaged F1 scores for the following models:
• Llama 3.2 3B. Llama 3.2 is a multilingual auto-regressive language model that uses an optimized
transformer architecture, released in September 2024 by Meta. See https://ollama.com/library/
llama3.2:3b.
• OLMo 2 7B. OLMo 2 is the latest iteration of the fully open language model, featuring dense
autoregressive models with enhanced architecture and training methodologies, released in
November 2024 by the Allen Institute for AI [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
• Qwen 2.5 7B. Qwen 2.5 is a multilingual transformer-based LLM with RoPE, SwiGLU, RMSNorm,
and Attention QKV bias, released in September 2024 by the Qwen Team [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ].
• DeepSeek R1 7B. DeepSeek R1 is an open-source large language model designed to enhance
reasoning capabilities through reinforcement learning. It rivals other advanced models in tasks
such as mathematics, coding, and logical reasoning. Released in January 2025 by the Chinese AI
startup DeepSeek [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
• Llama 3.1 8B. Llama 3.1 is a multilingual large language model optimized for dialogue applications.
        </p>
        <p>
          It supports eight languages and offers a context window of up to 128,000 tokens, enabling it to
handle extensive conversational contexts. Released in July 2024 by Meta [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
• Gemma 2 9B. Gemma 2 is a text-to-text decoder-only LLM with open weights, available in English,
released in June 2024 by Google [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ].
        </p>
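        <p>As a minimal illustration of the three-shot prompting setup, the sketch below queries a locally served
model through the ollama Python client; the model tag, the example segments, and the prompt wording are
placeholder assumptions, not the exact prompt used in our experiments.</p>
        <preformat>
import ollama  # local runtime; e.g. run "ollama pull llama3.2:3b" beforehand

# Hypothetical three-shot prompt: three labelled example segments, then the query.
FEW_SHOT = (
    "Label the argument type of each segment.\n"
    "Segment: Schools should ban phones in class. Type: Position\n"
    "Segment: Phones distract students during lessons. Type: Claim\n"
    "Segment: A recent classroom survey reported lower grades. Type: Evidence\n"
)

def classify(segment, model="llama3.2:3b"):
    """Ask a locally served model for the argument type of one segment."""
    response = ollama.chat(model=model, messages=[
        {"role": "user", "content": FEW_SHOT + "Segment: " + segment + " Type:"}
    ])
    return response["message"]["content"].strip()
</preformat>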
        <p>Additionally, we tested Mistral v0.2 and Falcon 3. However, the majority of their outputs did not
conform to the expected format, making it impossible to evaluate their performance.</p>
        <p>Table 16 (models evaluated): Llama 3.2 3B, OLMo 2 7B, Qwen 2.5 7B, DeepSeek R1 7B, Llama 3.1 8B,
and Gemma 2 9B. The corresponding macro-averaged F1 scores are not reproduced here.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Josh</given-names>
            <surname>Achiam</surname>
          </string-name>
          , Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman,
          <string-name>
            <given-names>Shyamal</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.
          <year>2023</year>
          . Gpt-4
          <source>technical report. arXiv preprint arXiv:2303</source>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sakshi</given-names>
            <surname>Arora</surname>
          </string-name>
          , Ajay Rana, and
          <string-name>
            <given-names>Archana</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Argument mining: A categorical review</article-title>
          .
          <source>In Modern Electronics Devices and Communication Systems: Select Proceedings of MEDCOM 2021</source>
          , pages
          <fpage>353</fpage>
          -
          <lpage>367</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jianzhu</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <surname>Yuhang</surname>
            <given-names>He</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            <given-names>Sun</given-names>
          </string-name>
          , Bin Liang, Jiachen Du, Bing Qin,
          <string-name>
            <given-names>Min</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ruifeng</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>A generative model for end-to-end argument mining with reconstructed positional encoding and constrained pointer mechanism</article-title>
          .
          <source>In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>10437</fpage>
          -
          <lpage>10449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Iz</given-names>
            <surname>Beltagy</surname>
          </string-name>
          , Matthew E Peters, and
          <string-name>
            <given-names>Arman</given-names>
            <surname>Cohan</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Longformer: The long-document transformer</article-title>
          . arXiv preprint arXiv:
          <year>2004</year>
          .05150.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Elena</given-names>
            <surname>Cabrio</surname>
          </string-name>
          and
          <string-name>
            <given-names>Serena</given-names>
            <surname>Villata</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Five years of argument mining: A data-driven analysis</article-title>
          .
          <source>In IJCAI</source>
          , volume
          <volume>18</volume>
          , pages
          <fpage>5427</fpage>
          -
          <lpage>5433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Lang</given-names>
            <surname>Cao</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Autoam: An end-to-end neural model for automatic and universal argument mining</article-title>
          .
          <source>In International Conference on Advanced Data Mining and Applications</source>
          , pages
          <fpage>517</fpage>
          -
          <lpage>531</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Scott A Crossley</surname>
          </string-name>
          , Perpetual Bafour, Yu Tian, Aigner Picou, Meg Benner, and
          <string-name>
            <given-names>Ulrich</given-names>
            <surname>Boser</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>The persuasive essays for rating, selecting, and understanding argumentative and discourse elements (persuade) corpus 1.0</article-title>
          .
          <source>Assessing Writing</source>
          ,
          <volume>54</volume>
          :
          <fpage>100667</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Andrew Crossley</surname>
          </string-name>
          , Perpetual Bafour, Yu Tian, Alex Franklin, Meg Benner, and
          <string-name>
            <given-names>Ulrich</given-names>
            <surname>Boser</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>A large-scale corpus for assessing written argumentation: Persuade 2.0</article-title>
          . Available at SSRN 4795747.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Darshan</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , Zhivar Sourati, Filip Ilievski, and
          <string-name>
            <given-names>Fred</given-names>
            <surname>Morstatter</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Contextualizing argument quality assessment with relevant knowledge</article-title>
          .
          <source>arXiv preprint arXiv:2305</source>
          .
          <fpage>12280</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Tim</surname>
            <given-names>Dettmers</given-names>
          </string-name>
          , Artidoro Pagnoni, Ari Holtzman, and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>QLoRA: Efficient finetuning of quantized LLMs</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <volume>36</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Yuning</surname>
            <given-names>Ding</given-names>
          </string-name>
          , Marie Bexte, and
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Horbach</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>Don't drop the topic-the role of the prompt in argument identification in student writing</article-title>
          .
          <source>In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA</source>
          <year>2022</year>
          ), pages
          <fpage>124</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Yuning</surname>
            <given-names>Ding</given-names>
          </string-name>
          , Marie Bexte, and
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Horbach</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Score it all together: A multi-task learning study on automatic scoring of argumentative essays</article-title>
          .
          <source>In Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          , pages
          <fpage>13052</fpage>
          -
          <lpage>13063</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Abhimanyu</surname>
            <given-names>Dubey</given-names>
          </string-name>
          , Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang,
          <string-name>
            <given-names>Angela</given-names>
            <surname>Fan</surname>
          </string-name>
          , et al.
          <year>2024</year>
          .
          <article-title>The llama 3 herd of models</article-title>
          .
          <source>arXiv preprint arXiv:2407</source>
          .
          <fpage>21783</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Lucile</surname>
            <given-names>Favero</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juan-Antonio</surname>
            Pérez-Ortiz,
            <given-names>Tanja</given-names>
          </string-name>
          <string-name>
            <surname>Käser</surname>
            , and
            <given-names>Nuria</given-names>
          </string-name>
          <string-name>
            <surname>Oliver</surname>
          </string-name>
          .
          <year>2025</year>
          .
          <article-title>Do AI tutors empower or enslave learners? Toward a critical use of AI in education</article-title>
          .
          <source>arXiv preprint arXiv:2507</source>
          .
          <fpage>06878</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Rujun</surname>
            <given-names>Gao</given-names>
          </string-name>
          , Hillary E Merzdorf, Saira Anwar,
          <string-name>
            <given-names>M Cynthia</given-names>
            <surname>Hipwell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arun</given-names>
            <surname>Srinivasa</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Automatic assessment of text-based responses in post-secondary education: A systematic review</article-title>
          .
          <source>Computers and Education: Artificial Intelligence</source>
          , page
          <volume>100206</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Luke</surname>
            <given-names>Gessler</given-names>
          </string-name>
          , Shabnam Behzad, Yang Janet Liu, Siyao Peng, Yilun Zhu, and
          <string-name>
            <given-names>Amir</given-names>
            <surname>Zeldes</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Discodisco at the disrpt2021 shared task: A system for discourse segmentation, classification, and connective detection</article-title>
          .
          <source>arXiv preprint arXiv:2109</source>
          .
          <fpage>09777</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Deniz</surname>
            <given-names>Gorur</given-names>
          </string-name>
          , Antonio Rago, and
          <string-name>
            <given-names>Francesca</given-names>
            <surname>Toni</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Can large language models perform relation-based argument mining? arXiv preprint</article-title>
          arXiv:
          <volume>2402</volume>
          .
          <fpage>11243</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Daya</surname>
            <given-names>Guo</given-names>
          </string-name>
          , Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang,
          <string-name>
            <surname>Xiao Bi</surname>
          </string-name>
          , et al.
          <year>2025</year>
          .
          <article-title>Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:2501</source>
          .
          <fpage>12948</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Edward J Hu</surname>
            , Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu,
            <given-names>Yuanzhi</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Shean</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Lu</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            , and
            <given-names>Weizhu</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          .
          <source>arXiv preprint arXiv:2106</source>
          .
          <fpage>09685</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Albert</surname>
            <given-names>Q Jiang</given-names>
          </string-name>
          , Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna,
          <string-name>
            <given-names>Florian</given-names>
            <surname>Bressand</surname>
          </string-name>
          , et al.
          <year>2024</year>
          .
          <article-title>Mixtral of experts</article-title>
          .
          <source>arXiv preprint arXiv:2401</source>
          .
          <fpage>04088</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Omid</surname>
            <given-names>Kashefi</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Sophia</given-names>
            <surname>Chan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Swapna</given-names>
            <surname>Somasundaran</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Argument detection in student essays under resource constraints</article-title>
          .
          <source>In Proceedings of the 10th Workshop on Argument Mining</source>
          , pages
          <fpage>64</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Masayuki</surname>
            <given-names>Kawarada</given-names>
          </string-name>
          , Tsutomu Hirao, Wataru Uchida, and
          <string-name>
            <given-names>Masaaki</given-names>
            <surname>Nagata</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Argument mining as a text-to-text generation task</article-title>
          .
          <source>In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>2002</fpage>
          -
          <lpage>2014</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>John</given-names>
            <surname>Lawrence</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Reed</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Argument mining: A survey</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>45</volume>
          (
          <issue>4</issue>
          ):
          <fpage>765</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          . arXiv preprint arXiv:
          <year>1910</year>
          .13461.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Nelson</surname>
            <given-names>F Liu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kevin Lin</surname>
            , John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and
            <given-names>Percy</given-names>
          </string-name>
          <string-name>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Lost in the middle: How language models use long contexts</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>12</volume>
          :
          <fpage>157</fpage>
          -
          <lpage>173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Santiago</surname>
            <given-names>Marro</given-names>
          </string-name>
          , Elena Cabrio, and
          <string-name>
            <given-names>Serena</given-names>
            <surname>Villata</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>Graph embeddings for argumentation quality assessment</article-title>
          .
          <source>In EMNLP 2022-Conference on Empirical Methods in Natural Language Processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Gaku</surname>
            <given-names>Morio</given-names>
          </string-name>
          , Hiroaki Ozaki, Terufumi Morishita, and
          <string-name>
            <given-names>Kohsuke</given-names>
            <surname>Yanai</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>End-to-end argument mining with cross-corpora multi-task learning</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>10</volume>
          :
          <fpage>639</fpage>
          -
          <lpage>658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Team</surname>
            <given-names>OLMo</given-names>
          </string-name>
          , Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt
          <string-name>
            <surname>Jordan</surname>
          </string-name>
          , et al.
          <year>2024</year>
          .
          <article-title>2 OLMo 2 Furious</article-title>
          . arXiv preprint arXiv:
          <volume>2501</volume>
          .
          <fpage>00656</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          , Noam Shazeer, Adam Roberts,
          <string-name>
            <given-names>Katherine</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sharan</given-names>
            <surname>Narang</surname>
          </string-name>
          , Michael Matena,
          <string-name>
            <surname>Yanqi Zhou</surname>
            ,
            <given-names>Wei</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <surname>Peter J Liu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>21</volume>
          (
          <issue>140</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Ramon</given-names>
            <surname>Ruiz-Dolz</surname>
          </string-name>
          , Jose Alemany,
          <string-name>
            <surname>Stella M Heras Barberá</surname>
          </string-name>
          , and
          <string-name>
            <surname>Ana</surname>
          </string-name>
          García-Fornes.
          <year>2021</year>
          .
          <article-title>Transformer-based models for automatic identification of argument relations: A cross-domain evaluation</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>36</volume>
          (
          <issue>6</issue>
          ):
          <fpage>62</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Marlene</given-names>
            <surname>Scardamalia</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carl</given-names>
            <surname>Bereiter</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>Knowledge telling and knowledge transforming in written composition</article-title>
          .
          <source>Advances in applied psycholinguistics</source>
          ,
          <volume>2</volume>
          :
          <fpage>142</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Yang</surname>
            <given-names>Sun</given-names>
          </string-name>
          , Guanrong Chen, Caihua Yang, Jianzhu Bao, Bin Liang, Xi Zeng,
          <string-name>
            <given-names>Min</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ruifeng</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Discourse structure-aware prefix for generation-based end-to-end argumentation mining</article-title>
          .
          <source>In Findings of the Association for Computational Linguistics ACL</source>
          <year>2024</year>
          , pages
          <fpage>11597</fpage>
          -
          <lpage>11613</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Yang</surname>
            <given-names>Sun</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muyi</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Jianzhu Bao, Bin Liang,
          <string-name>
            <given-names>Xiaoyan</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Caihua</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Min</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ruifeng</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Pita: Prompting task interaction for argumentation mining</article-title>
          .
          <source>In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>5036</fpage>
          -
          <lpage>5049</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <surname>Gemma</surname>
            <given-names>Team</given-names>
          </string-name>
          , Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari,
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Ramé</surname>
          </string-name>
          , et al.
          <year>2024</year>
          .
          <article-title>Gemma 2: Improving open language models at a practical size</article-title>
          .
          <source>arXiv preprint arXiv:2408</source>
          .
          <fpage>00118</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <surname>Stephen</surname>
            <given-names>E</given-names>
          </string-name>
          <string-name>
            <surname>Toulmin</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>The uses of argument</article-title>
          . Cambridge university press.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Hugo</surname>
            <given-names>Touvron</given-names>
          </string-name>
          , Louis Martin,
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          Amjad Almahairi
          , Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava,
          <string-name>
            <given-names>Shruti</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.
          <year>2023</year>
          .
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          .
          <source>arXiv preprint arXiv:2307</source>
          .
          <fpage>09288</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Henning</surname>
            <given-names>Wachsmuth</given-names>
          </string-name>
          , Gabriella Lapesa, Elena Cabrio, Anne Lauscher, Joonsuk Park, Eva Maria Vecchi, Serena Villata, and
          <string-name>
            <given-names>Timon</given-names>
            <surname>Ziegenbein</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Argument quality assessment in the age of instruction-following large language models</article-title>
          .
          <source>arXiv preprint arXiv:2403</source>
          .
          <fpage>16084</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <surname>An</surname>
            <given-names>Yang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Baosong</given-names>
            <surname>Yang</surname>
          </string-name>
          , Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu,
          <string-name>
            <given-names>Chengyuan</given-names>
            <surname>Li</surname>
          </string-name>
          , Dayiheng Liu, Fei Huang,
          <string-name>
            <given-names>Haoran</given-names>
            <surname>Wei</surname>
          </string-name>
          , et al.
          <year>2024</year>
          .
          <article-title>Qwen2.5 technical report</article-title>
          .
          <source>arXiv preprint arXiv:2412</source>
          .
          <fpage>15115</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>