<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>End-to-End Argument Mining in Student Essays: A Comparison of Pipeline and Multi-Task Generative Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fahad M. Alzaidee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommy Yuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Nightingale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of York</institution>
          ,
          <addr-line>Heslington, York YO10 5DD</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>End-to-end argument mining aims to automatically identify argumentative components and their relationships within unstructured text. This paper investigates the effectiveness of generative large language models (LLMs) for this task on student essays from the Argument-annotated Essays Corpus (AAEC). We compare two strategies: a Pipeline approach, which separates component and relation identification into sequential stages, and a Multi-Task Learning (MTL) framework, which jointly models all subtasks. Both approaches employ the TANL tagging format to represent argumentative components and their relations. Experiments with LLaMA 3 and Mistral models show that the Pipeline consistently outperforms MTL, demonstrating its advantage in structuring argumentative discourse.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Argument Mining</kwd>
        <kwd>LLM</kwd>
        <kwd>Student essays</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Argumentation is a fundamental skill that empowers individuals to present and defend claims through
evidence and reasoning. For students, mastering this skill is essential for developing critical thinking
and effective communication. Argument Mining (AM), a specialized area within Natural Language
Processing (NLP), seeks to automate the identification and structuring of arguments in text,
approximating human-like reasoning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. By breaking down arguments into subtasks—such as identifying
argumentative spans, classifying components, and determining relationships (typically support or
attack)—AM constructs structured argumentative frameworks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In recent years, argument mining has gained significant traction across diverse domains. From
analyzing scientific literature [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and enhancing information retrieval systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to enabling automated
essay scoring [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and advancing dialogue analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], its applications are both broad and impactful.
As the field continues to evolve, the development of algorithms capable of recognizing and analyzing
argument structures with human-like precision remains a central goal, driving innovation in intelligent
systems and beyond.
      </p>
      <p>
        The Argument-annotated Essays Corpus (AAEC), introduced by Stab and Gurevych [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], is a widely
used dataset for evaluating argument mining approaches. It models student essays as hierarchical
trees of claims and premises. End-to-end argument mining (AM) aims to automatically transform
unstructured argumentative text into a structured representation by identifying both the argument
components (e.g., claims, premises) and the relationships between them. To address the complexity of
this task, two methodological paradigms have emerged. The Pipeline approach decomposes AM into
sequential subtasks, training specialized models for each step [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Conversely, Multi-Task Learning
(MTL) employs a unified neural architecture to jointly model all AM subtasks, enabling shared learning
across tasks.
      </p>
      <p>
        Recent advancements in Large Language Models (LLMs), such as LLaMA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Mistral [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and Falcon
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], have revolutionized natural language processing. These models, which are based on Transformer
frameworks, employ self-attention mechanisms to efficiently capture contextual relationships between
tokens. They excel in text comprehension, generation, and adaptability across self-supervised,
multitask, and few-shot learning settings.
      </p>
      <p>
End-to-end argument mining (AM) is a challenging task due to the non-linear and complex discourse
structure of argumentative texts, as well as the scarcity of annotated datasets. These challenges
necessitate innovative approaches to effectively identify argument components and their relationships.
In this study, we investigate the efficacy of generative large language models (LLMs) for end-to-end
AM by addressing the following research questions:
1. How effective are decoder-only architectures of LLMs in addressing the subtasks of argument
mining?
2. To what extent can pipeline and multi-task learning approaches enhance the performance of
pretrained LLMs for argument mining, and which strategy is more effective for capturing discourse
structure?
To address these questions, we apply and compare the following strategies for argument mining:
• Pipeline Approach: We divide the subtasks of argument mining into two consecutive steps. The
first step (component identification) involves segmentation detection and component classification,
while the second step (relation identification) focuses on relation detection and classification.
• MTL with TANL: We implement a unified framework using Translation between Augmented
Natural Languages (TANL) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which reframes structured prediction as a text-to-text task.
This approach has achieved state-of-the-art results in semantic parsing and is adapted here for
argument mining.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Argument Mining</title>
        <p>
          The process of argument mining involves several key subtasks:
• Text Segmentation: Identifying the boundaries of argumentative segments in a given text
[
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ].
• Component Classification: Classifying argumentative segments into categories such as claims
and premises [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ].
• Relation Detection: Identifying and classifying relationships between propositions, such as
support or attack [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ].
        </p>
        <p>While most works focus on individual subtasks, relatively few address the end-to-end AM problem,
which aims to construct argumentation structures directly from unstructured text. Two common
approaches are used for end-to-end argument mining:
Pipeline Approach: This approach decomposes the AM task into sequential subtasks. For instance,
Persing and Ng [30] proposed a pipeline method consisting of two stages: first identifying argument
components and then classifying the argumentative relations between them. Their approach used
Integer Linear Programming (ILP) to perform joint inference over the outputs of the argument
component identification and relation identification classifiers, incorporating global consistency constraints.
However, their results were moderate, achieving F1 scores of 47.1 for component identification and
classification and 12.9 for relation identification and classification.</p>
        <p>
          Stab and Gurevych [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] adopted a sequence labeling approach at the token level using a Conditional
Random Field (CRF) and two classifiers based on Support Vector Machines (SVM) for the component
identification and relation identification tasks. A joint model based on ILP is then used to optimize the
outputs of the two classifiers to detect argumentation structures in persuasive essays.
Multi-Task Learning (MTL): This approach mitigates the limitations of the pipeline method, such
as error propagation and the rigid constraints of Integer Linear Programming (ILP), by jointly learning
multiple subtasks within a unified framework.
        </p>
        <p>Eger et al. [19] proposed two models: BLCC and LSTM-ER. BLCC approaches argument mining
as a sequence tagging problem, whereas LSTM-ER integrates tree-structured and sequential LSTM
architectures for end-to-end relation extraction. On the AAEC dataset, LSTM-ER achieved an F1 score
of 66.21 for component identification, outperforming BLCC’s score of 63.23. For relation identification,
LSTM-ER achieved an F1 score of 29.59, while BLCC scored 34.82.</p>
        <p>Morio et al. [20] proposed a multi-task learning framework built on a biaffine architecture and a
Longformer model. The framework jointly identifies argument component spans, their types (e.g.,
Claim/Premise), and the relations between these components along with their relation types. They proposed
two models: the ST model, trained on a single dataset, and the MT model, trained on multiple auxiliary
corpora and fine-tuned on the target dataset. On the AAEC dataset, the ST model achieved F1 scores of
75.54 for argument component identification and classification and 55.17 for relation identification and
classification, while the MT model achieved comparable scores.</p>
        <p>Closely related to our work, there are several studies that eliminate the need for dependency parsing
to frame the complex argumentative structure. Bao et al. [21] proposed a novel generative framework
that frames end-to-end argument mining as a sequence-to-sequence generation task. The
framework employs the pre-trained sequence-to-sequence language model BART with two
mechanisms: a Constrained Pointer Mechanism (CPM), which acts as an auxiliary task during training
and as a constrained decoding method during inference, guiding generation toward valid outputs; and a
Reconstructed Positional Encoding (RPE) to mitigate the order biases introduced by the autoregressive
generation for modeling long-range dependencies. On the AAEC dataset, their approach achieved
F1 scores of 75.94 for argument component identification and classification and 50.08 for relation
identification and classification.</p>
        <p>
          Similarly, Kawarada et al. [22] redefined argument mining as a structured prediction task using
the TANL framework [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. They utilized FLAN-T5 XXL to generate structured outputs by annotating
argumentative spans, components, and relations. Their method achieved state-of-the-art results across
benchmark datasets, including the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the
Cornell eRulemaking Corpus (CDCP). Specifically, on the AAEC dataset, they reported F1 scores of
80.15 for component identification and 61.12 for relation identification.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Generative Models</title>
        <p>Large Language Models (LLMs) have garnered substantial attention within the artificial intelligence
community due to their advanced capabilities in processing, comprehending, and generating
human-like text. These models are trained on extensive textual corpora using autoregressive architectures,
which generate text by predicting one token at a time based on the preceding sequence. Built on the
Transformer framework [23], LLMs utilize self-attention mechanisms to efficiently capture dependencies
between tokens, regardless of their position in the input sequence. Tokens, the fundamental units of
text, may represent words or subword fragments depending on the tokenization strategy. By iteratively
predicting subsequent tokens, LLMs can uncover intricate linguistic patterns.</p>
        <p>Several studies have explored the application of generative models in computational argumentation.
Chen et al. [24] investigate the capabilities of large language models (LLMs), such as ChatGPT, Flan,
and LLaMA2, in identifying and extracting arguments from text, as well as in generating
counterarguments. Gilardi et al. [25] highlight the potential of employing ChatGPT as a tool for text-annotation
tasks by evaluating its performance across four tasks: assessing content relevance, detecting stances,
identifying topics, and recognizing general frames. Their study compares ChatGPT’s annotations
against those of crowd workers and expert annotators, demonstrating its competitive accuracy and
efficiency. Mirzakhmedova et al. [26] examined the use of large language models (LLMs), specifically
GPT-3 and PaLM 2, to assess argument quality using a zero-shot learning approach. Recent work
on Relation-based Argument Mining (RbAM) by Gorur et al. [27] evaluates few-shot prompting for
relation-type classification over argument pairs and shows that instruction-tuned LLMs can outperform
strong RoBERTa baselines on several benchmarks.</p>
        <p>Although prior work demonstrates the potential of generative models in specific argument mining
tasks, a unified approach that comprehensively addresses all subtasks of argument mining remains
underexplored. Developing such an integrated framework is crucial to advance the robustness and
applicability of large language models in computational argumentation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>
        This study explores end-to-end argument mining using two methodologies: Multi-Task Learning
(MTL) and a Pipeline approach (depicted in Figure 1). We employ open-source generative language
models—LLaMA 3 (8B), which is proficient in causal language modeling, and LLaMA 3-instruct (8B),
an instruction-following fine-tuned variant. Additionally, we use Mistral 7B-Instruct v0.3, a 7 billion
parameter pre-trained and fine-tuned model known for its computational efficiency and cost-effective
inference. We employ the tagging frameworks proposed by Paolini et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Specifically, we use their
entity augmentation format to annotate argumentative segments (e.g., Major Claim, Claim, Premise)
and their joint entity-relation format to label argumentative relations (Support/Attack). These formats
are used to fine-tune the base models. Both the Pipeline and Multi-Task Learning (MTL) approaches
use identical inputs and produce the same augmented essay output. In this section, we focus on the
Pipeline approach, which follows the traditional two-stage process for argument mining. First, the
system detects and categorizes argumentative segments within the raw essay as Major Claims, Claims,
or Premises. Then, it evaluates these segments to identify and classify the relationships between them
as either supporting or attacking.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Stage one</title>
        <p>Given a raw essay X consisting of n tokens, X = [x1, . . . , xn], the objective of this stage is to perform two
argument mining subtasks simultaneously: segment identification and component classification. Each
argumentative segment is represented as a 3-tuple (t, s, e), where t denotes the type of the segment
(Major Claim, Claim, or Premise), and s and e represent the indices of the first and last tokens of the
segment. The output of this stage is an essay with tagged components, following the format used
in TANL’s entity extraction (Figure 3). To ensure a structured representation, each segment type is
assigned a number based on its order of appearance. For example, the first detected claim is labeled
Claim 1, the second Claim 2, and so on. This numbering scheme facilitates tracking and referencing
argument components in the second stage.</p>
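        <p>As a concrete illustration of this stage-one output, the following sketch (our own illustration, not the code used in the experiments; the function name and tuple layout are assumptions) renders (type, start, end) segment annotations as a TANL-style entity-augmented essay with per-type numbering:</p>
        <p>
```python
def augment_components(tokens, segments):
    """Render a token list and (type, start, end) segment tuples as a
    TANL-style entity-augmented essay, numbering each segment type by
    order of appearance (Claim 1, Claim 2, ...)."""
    counters = {}
    out = []
    pos = 0
    for seg_type, start, end in sorted(segments, key=lambda s: s[1]):
        out.extend(tokens[pos:start])                   # untagged text
        counters[seg_type] = counters.get(seg_type, 0) + 1
        label = f"{seg_type} {counters[seg_type]}"      # e.g., "Claim 1"
        span = " ".join(tokens[start:end + 1])
        out.append(f"[ {span} | {label} ]")             # tagged segment
        pos = end + 1
    out.extend(tokens[pos:])
    return " ".join(out)
```
        </p>
        <p>For example, a claim spanning tokens 0–3 followed by a premise spanning tokens 5–6 produces “[ ... | Claim 1 ] ... [ ... | Premise 1 ] ...”, matching the numbering scheme described above.</p>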
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Stage two</title>
        <p>Given the augmented essay from the previous stage, each argumentative segment is identified in the
form</p>
        <p>[segment content | t],</p>
        <p>where the brackets indicate the boundaries of the argumentative segment, and t ∈ T = {Major Claim, Claim, Premise}.</p>
        <p>The objective of this stage is to identify argumentative relationships and classify them by type. Each
identified relationship is represented as a 3-tuple</p>
        <p>(src, tgt, r),</p>
        <p>where src denotes the source argument component, tgt represents the target argument component,
and r specifies the relationship type. The indices i and j of the source and target components are defined as</p>
        <p>i ∈ {1, 2, . . . , Nt}, j ∈ {1, 2, . . . , Nt}, i ≠ j,</p>
        <p>where Nt represents the total number of argument components of type t ∈ T, ensuring that the
source and target components are distinct. The relationship type is drawn from</p>
        <p>r ∈ R = {Support, Attack}.</p>
        <p>To structure the argumentative relationships within the original essay, we adapt the output format of
TANL’s joint entity and relation extraction task. The encoded format is expressed as</p>
        <p>[segment content | src | r = tgt],</p>
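        <p>To recover relation triples from this encoded format, a regular expression over the bracketed spans suffices. The sketch below is our own illustration of parsing the joint entity-relation format, not the code used in the experiments; the function and pattern names are assumptions:</p>
        <p>
```python
import re

# Matches "[ content | Label ]" and "[ content | Label | Relation = Target ]".
TAG = re.compile(r"\[([^|\[\]]+)\|([^|\[\]]+)(?:\|([^=\[\]]+)=([^\[\]]+))?\]")

def extract_relations(augmented_essay):
    """Parse a TANL-style joint entity-relation essay into
    (src, tgt, relation_type) 3-tuples."""
    triples = []
    for m in TAG.finditer(augmented_essay):
        _content, label, rel, target = (g.strip() if g else None for g in m.groups())
        if rel is not None:
            triples.append((label, target, rel))
    return triples
```
        </p>
        <p>Applied to a span such as “[ Cars pollute | Premise 1 | Support = Claim 1 ]”, the parser yields the triple (Premise 1, Claim 1, Support), while segments without a relation part are skipped.</p>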
        <p>
          After predicting the argumentative structure of an essay, we apply the alignment method proposed
by Paolini et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. In the Pipeline, alignment is applied at every stage, while in the MTL approach, it
is applied before generating the final output. This method reduces the impact of adding or removing
words or phrases between detected argumentative segments, which could otherwise shift their start
and end positions.
        </p>
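        <p>The alignment of Paolini et al. is a dynamic-programming procedure over token sequences. As a rough stand-in, Python’s difflib can map token positions in the generated text back to the source essay; this is a simplified sketch under our own assumptions, not the authors’ exact algorithm:</p>
        <p>
```python
from difflib import SequenceMatcher

def align_span(source_tokens, generated_tokens, gen_start, gen_end):
    """Map a segment's (start, end) token indices in the generated text
    back to indices in the source essay, absorbing words the model added
    or dropped around the detected segments."""
    matcher = SequenceMatcher(None, generated_tokens, source_tokens, autojunk=False)
    mapping = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":                    # identical runs of tokens
            for offset in range(i2 - i1):
                mapping[i1 + offset] = j1 + offset
    start = mapping.get(gen_start)
    end = mapping.get(gen_end)
    if start is None or end is None:
        return None                           # segment could not be aligned
    return start, end
```
        </p>
        <p>If the model drops a leading phrase, a segment at generated positions 0–3 is shifted to its true source positions rather than being counted as a boundary error.</p>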
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>We evaluate both approaches on the Argument-annotated Essays Corpus (AAEC), which was created to facilitate the study
of argumentation structures in persuasive essays. AAEC consists of 402 structured essays on various
controversial topics (see Table 1 for details). Of the 402 essays, 322 are set aside for the train set and
80 for the test set. Each essay is annotated with
argumentative discourse units (ADUs), which are segments of text that represent individual components
of an argument (Major Claim, Claim, or Premise) and the relationships between ADUs (Support, Attack).
The argumentative structure is represented as a hierarchical tree. We fine-tuned and evaluated all
models on the essay level rather than on the paragraph level.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Prompting and Input Preparation for Fine-Tuning</title>
        <p>We employ the open-source models LLaMA-3 (8B), LLaMA-3-Instruct (8B), and Mistral-7B-Instruct v0.3
as our base large language models to assess their performance in automating argument mining tasks
using both pipeline and MTL approaches. All models are run in an 8-bit quantized configuration, with
each weight stored in 8 bits on the GPU. For each approach, we report the average scores from three
runs on the test set.</p>
        <p>To guide the models in performing argument mining subtasks, we employ a prompt template adapted
from Stanford Alpaca [28], which comprises instruction, input, and output sections (see Appendix A).</p>
        <p>We evaluate the model’s performance on two types of input: (1) raw essay text and (2) text
augmented with special tokens indicating paragraph functions (see Figure 3). The special tokens used are:
&lt;intro&gt; and &lt;/intro&gt; to mark the introduction paragraph, &lt;body&gt; and &lt;/body&gt; to mark the body
paragraph(s), and &lt;conc&gt; and &lt;/conc&gt; to mark the conclusion paragraph. To address the computational
demands of fine-tuning large language models (LLMs), we employ Low-Rank Adaptation (LoRA) [29], a
parameter-efficient fine-tuning (PEFT) method. LoRA retains the weights of the original model and
integrates trainable low-rank matrices into the transformer layers to simulate weight adjustments. This
approximation, based on the principle that the adaptation process has a low “intrinsic rank,” reduces
the number of trainable parameters to under 1% while maintaining performance levels comparable to
full fine-tuning.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Criteria</title>
        <p>We assessed the performance of component and relation classification using micro F1 scores, following
the approach by Persing and Ng [30]. For component classification, a true positive occurs when a
predicted argumentative component matches both the type (e.g., claim, premise) and boundary of a
gold standard component. For relation classification, a true positive is counted when the predicted
source and target components correspond exactly to those in the gold annotations and share the same
relation type (e.g., Support, Attack).</p>
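        <p>Under this exact-match criterion, the micro F1 computation reduces to set operations over predicted and gold tuples. The following is a minimal sketch of our own (the function name is hypothetical), where a component tuple is (type, start, end) and a relation tuple is (src, tgt, relation type):</p>
        <p>
```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 under the exact-match criterion: a prediction
    counts as a true positive only if its whole tuple matches gold."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred.intersection(ref))
    fp = len(pred.difference(ref))
    fn = len(ref.difference(pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
        </p>
        <p>Note that a predicted component with the right type but a boundary off by one token counts as both a false positive and a false negative.</p>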
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Implementation Details</title>
        <p>For this study, both the Multi-Task Learning (MTL) framework and the pipeline approach utilized
a unified set of hyperparameters across all base language models (Table 6). Specifically, we applied a
batch size of 8, trained for 3 epochs, and enabled early stopping. Optimization was carried out using
paged_adamw_32bit, with a learning rate of 0.0002, scheduled via a cosine decay strategy, alongside a
weight decay of 0.006. Additionally, we employed a warmup ratio of 0.03 to stabilize training.
LoRA-specific settings were consistently maintained, with a rank of 32, an alpha value of 64, and a dropout
rate of 0.01.</p>
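        <p>These hyperparameters map onto a Hugging Face peft/transformers configuration roughly as follows. This is a sketch under our assumptions, not the authors’ training script; the argument names come from those libraries, and details the paper does not state (e.g., target modules, output directory) are omitted or hypothetical:</p>
        <p>
```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings reported above: rank 32, alpha 64, dropout 0.01.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.01,
    task_type="CAUSAL_LM",
)

# Optimizer and schedule settings reported above.
training_args = TrainingArguments(
    output_dir="out",                 # hypothetical path
    per_device_train_batch_size=8,
    num_train_epochs=3,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.006,
    warmup_ratio=0.03,
)
```
        </p>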
        <p>All experiments were conducted on an NVIDIA A40 GPU using the AdamW optimizer. Training
was conducted over 10 epochs with a maximum token length of 1024. Model checkpoints were saved
every 10 training steps.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Evaluation and Results</title>
        <p>We conducted an ablation study to assess the impact of adding special tokens that encode paragraph
roles. As shown in Table 5, this modification did not improve performance in predicting argumentative
component types, indicating that paragraph-level role encoding does not benefit generative models in
this task.</p>
        <p>Table 4 presents the performance of the pipeline and multi-task learning approaches on the AAEC
dataset. The pipeline approach outperforms multi-task learning across all evaluated generative models
in both component and relation identification. The LLaMA-3-8B-Instruct model achieves the highest
micro F1 scores: 80.02 for component identification and 60 for relation identification.</p>
        <p>Table 2 compares our best-performing pipeline model with prior studies that conducted end-to-end
argument mining on the AAEC dataset at the essay level. Existing methods include:
• T2T-Gen by Kawarada et al. [22], which frames argument mining as structured prediction using
the TANL framework and FLAN-T5 XXL.
• BLCC and LSTM-ER by Eger et al. [19], which introduce sequence tagging and tree-structured
LSTM architectures for component and relation extraction.
• MT Model by Morio et al. [20], employing a biaffine multi-task learning framework built on
Longformer for joint modeling of components and relations.
• BART-CPM by Bao et al. [21], proposing a generative sequence-to-sequence framework with
constrained decoding and positional encoding to model argumentative structure without dependency
parsing.</p>
        <sec id="sec-4-5-1">
          <title>Table 2: End-to-end results on the AAEC dataset (ACC: argument component identification and classification; ARC: argumentative relation identification and classification)</title>
          <p>Model | ACC | ARC
BLCC | 63.23 | 34.82
LSTM-ER | 66.21 | 29.56
ST Model | 76.55 | 54.66
T2T-Gen | 80.15 | 61.19
BART-CPM | 75.94 | 50.08
LLaMA 3 8B (8-bit) (ours) | 80.02 | 60</p>
          <p>As shown in Table 2, our model performs competitively with existing approaches. It achieves
strong ACC and ARC scores with the LLaMA-3-8B pipeline using 8-bit quantization, demonstrating the
eficiency and robustness of our approach.</p>
          <p>To evaluate the second stage of our pipeline for relation classification on the AAEC dataset, we
perform an oracle analysis using annotated essays. These annotations specify the boundaries
and types of argument components in the raw text, efectively removing errors from component
identification. F1 scores are computed separately for support and attack relations, as well as an overall
F1 score. We use 10-fold cross-validation, testing the model on 10% of the dataset in each fold.
True Positives (TP), False Positives (FP), and False Negatives (FN) are aggregated across folds to
compute dataset-level F1 scores. In this binary setting, a false negative for one class is treated as a false
positive for the other.</p>
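          <p>The fold aggregation described above can be sketched as follows (our own illustration; per-class TP/FP/FN counts are summed across folds before computing precision and recall, rather than averaging per-fold F1 scores):</p>
          <p>
```python
def dataset_f1(fold_counts):
    """Aggregate per-fold (TP, FP, FN) counts for one relation class and
    compute the dataset-level F1, as in the oracle analysis."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
          </p>
          <p>Because the setting is binary, the counts for the Attack class mirror those of Support: each Support false negative contributes an Attack false positive, and vice versa.</p>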
          <p>Table 3 compares our results with those of Gorur et al. [27], who employed an instruction-tuned
LLaMA 2 70B (70 billion parameters, 4-bit precision) model for few-shot relation-type classification.
In contrast, our best system uses the much smaller LLaMA-3-8B-Instruct (8 billion parameters, 8-bit
precision) model yet achieves closely comparable performance across relation types. This
comparison highlights that our relation classification framework approaches state-of-the-art few-shot
results while being significantly more parameter-efficient.</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>Table 3: Relation classification F1 per model (oracle setting)</title>
          <p>Model | Support
LLaMA 2 70B (4-bit) | 94
LLaMA 3 8B (8-bit) (ours) | 92</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Limitations</title>
      <p>This study evaluated the Pipeline and Multi-Task Learning (MTL) approaches for end-to-end argument
mining using three generative models on the AAEC dataset. The Pipeline approach consistently
outperformed MTL in both component and relation classification, highlighting the benefits of task
decomposition and sequential processing for decoder-only architectures. While these results are
promising, future work should assess the generalizability of the Pipeline framework across diverse
datasets and domains. By separating the two main subtasks, the Pipeline design not only improves
classification performance but also supports user involvement in refining argumentative structure. This
allows users to revise predicted components before relation generation. However, this performance gain
comes at the cost of lower inference efficiency. Adapter switching and dual generation steps introduce
computational overhead that may limit real-time deployment.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Declaration on Generative AI</title>
      <p>During the preparation of this work, ChatGPT was employed to assist with grammar and spelling
correction as well as paraphrasing. We thoroughly reviewed and edited the text generated with ChatGPT
and retain full responsibility for the content presented.</p>
    </sec>
    <sec id="sec-refs">
      <title>References</title>
      <p>[16] C. Stab, I. Gurevych, Identifying argumentative discourse structures in persuasive essays, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 46–56.</p>
      <p>[17] H. Nguyen, D. Litman, Context-aware argumentative relation mining, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1127–1137.</p>
      <p>[18] O. Cocarascu, F. Toni, Identifying attack and support argumentative relations using deep learning, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1374–1379.</p>
      <p>[19] S. Eger, J. Daxenberger, I. Gurevych, Neural end-to-end learning for computational argumentation mining, arXiv preprint arXiv:1704.06104 (2017).</p>
      <p>[20] G. Morio, H. Ozaki, T. Morishita, K. Yanai, End-to-end argument mining with cross-corpora multi-task learning, Transactions of the Association for Computational Linguistics 10 (2022) 639–658.</p>
      <p>[21] J. Bao, Y. He, Y. Sun, B. Liang, J. Du, B. Qin, M. Yang, R. Xu, A generative model for end-to-end argument mining with reconstructed positional encoding and constrained pointer mechanism, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 10437–10449.</p>
      <p>[22] M. Kawarada, T. Hirao, W. Uchida, M. Nagata, Argument mining as a text-to-text generation task, in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 2002–2014.</p>
      <p>[23] A. Vaswani, et al., Attention is all you need, Advances in Neural Information Processing Systems (2017).</p>
      <p>[24] G. Chen, L. Cheng, L. A. Tuan, L. Bing, Exploring the potential of large language models in computational argumentation, arXiv preprint arXiv:2311.09022 (2023).</p>
      <p>[25] F. Gilardi, M. Alizadeh, M. Kubli, ChatGPT outperforms crowd workers for text-annotation tasks, Proceedings of the National Academy of Sciences 120 (2023) e2305016120.</p>
      <p>[26] N. Mirzakhmedova, M. Gohsen, C. H. Chang, B. Stein, Are large language models reliable argument quality annotators?, in: Conference on Advances in Robust Argumentation Machines, Springer, 2024, pp. 129–146.</p>
      <p>[27] D. Gorur, A. Rago, F. Toni, Can large language models perform relation-based argument mining?, in: Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 8518–8534.</p>
      <p>[28] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca (2023).</p>
      <p>[29] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).</p>
      <p>[30] I. Persing, V. Ng, End-to-end argumentation mining in student essays, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1384–1394.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <p>ACC scores per model under the MTL approach:
LLAMA 3 (8B): 77
LLAMA 3-Instruct (8B): 78.2
Mistral 7B-Instruct v0.3: 72.1</p>
      <p>Prompt template for the MTL approach:</p>
      <sec id="sec-7-1">
        <title>You are an argument analysis expert. Follow these steps EXACTLY:</title>
        <p>Step 1: Identify Argument Components
- Major Claim: The essay's central argument. Label exactly once, or twice if repeated.
- Claims: Debatable assertions that support or attack other components. Claims require support from Premises and can be challenged or defended. Label as Claim 1, Claim 2, etc.
- Premises: Supporting reasons, evidence, or justification for a Claim or another Premise. Premises do not stand alone as arguments but provide necessary reasoning. They may contain factual evidence, logic-based reasoning, or expert opinions.</p>
        <p>Step 2: Determine Argumentative Relationships
- Support: The component strengthens or reinforces another.
- Refute: The component provides counterarguments that directly challenge another.
- Contradict: The component presents opposing ideas without direct refutation.
- Always reference the exact label of the target component (e.g., Major Claim, Claim 1, Premise 2).
- Recognize implicit relationships where direct connections are not explicitly stated but can be inferred from meaning, tone, and structure.</p>
        <p>Formatting Rules (DO NOT DEVIATE):
1. Wrap components in brackets: [Exact Text | Label].
2. For relationships: add '| Relationship = Target Component' after the label.
3. If a component spans multiple sentences, include ALL sentences inside one bracket.
4. Preserve original text order and wording; do NOT add or remove words.
5. Labels must follow this syntax: Major Claim, Claim X, Premise X (X = number).
6. DO NOT include transitional phrases, introductory/reinforcing phrases, or stance markers (e.g., 'I believe,') inside the brackets.
7. If a sentence contains multiple distinct ideas separated by semicolons, conjunctions ('which', 'but', etc.), or implicit contrasts, annotate them separately. Only annotate the core argument component itself.</p>
        <p>Additional Guidance:
- Some argumentative components do not function as supporters or refuters, yet they can still be targets of argumentation.
- Transitional words and phrases serve as clues for relationships:
  Support indicators: 'therefore,' 'as a result,' 'consequently,' 'this proves'
  Refutation indicators: 'however,' 'on the contrary,' 'while,' 'although'
  Contradiction indicators: 'despite,' 'in contrast,' 'alternatively,' 'whereas'
- If an argument contains both supportive and opposing elements, label it according to its dominant function.</p>
        <p>Input:
{essay}
Output:
{Augmented essay}</p>
        <p>Template prompt for the first stage of the pipeline approach:</p>
      </sec>
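<p>The bracket format defined by the rules above can be parsed back into structured records with a few lines of code. The sketch below is illustrative, not part of the paper's implementation; the helper name and the example sentence are hypothetical:</p>

```python
import re

# Matches "[text | Label]" or "[text | Label | Relation = Target]" spans,
# following the prompt's formatting rules. The relation part is optional.
SPAN_RE = re.compile(
    r"\[(?P<text>[^|\]]+)\|(?P<label>[^|\]]+)"
    r"(?:\|(?P<relation>[^=\]]+)=(?P<target>[^\]]+))?\]"
)

def parse_augmented_essay(augmented: str):
    """Extract component text, label, and optional relation/target."""
    components = []
    for m in SPAN_RE.finditer(augmented):
        components.append({
            "text": m.group("text").strip(),
            "label": m.group("label").strip(),
            "relation": (m.group("relation") or "").strip() or None,
            "target": (m.group("target") or "").strip() or None,
        })
    return components

# Hypothetical model output in the prompt's format.
example = ("[Cities should invest in cycling infrastructure | Major Claim] because "
           "[bike lanes reduce traffic congestion | Premise 1 | Support = Major Claim].")
print(parse_augmented_essay(example))
```

A parser like this is one way to recover the component and relation predictions needed for span-level evaluation against the AAEC gold annotations.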
      <sec id="sec-7-7">
        <title>Identify and annotate the argument components in the input, including Major Claims, Claims, and Premises.</title>
        <p>- Major Claim: The essay's central argument. Label exactly once, or twice if repeated.
- Claims: Statements that directly assert a position, opinion, or viewpoint regarding the Major Claim. A Claim should make a debatable assertion rather than just provide background context. Claims must require support from Premises and can be challenged or defended.
- Premises: Statements that provide justification, evidence, or reasoning for a Claim or another Premise. A Premise does not stand alone as an argument but instead supports a Claim. It may contain factual evidence, logic-based reasoning, or expert opinions.</p>
        <p>Formatting Rules (DO NOT DEVIATE):
- Wrap components in brackets: [Exact original text | Label].
- If a component spans multiple sentences, include ALL its sentences inside one bracket.
- Preserve the original essay's text and order. Do NOT add/remove words.
- Labels MUST follow this syntax: Major Claim, Claim X, Premise X (X = number).
- DO NOT include transitional phrases, introductory/reinforcing phrases, or stance markers (e.g., 'I believe,') inside the brackets.
- If a sentence contains multiple distinct ideas separated by semicolons, conjunctions ('which', 'but', etc.), or implicit contrasts, annotate them separately. Only annotate the core argument component itself.</p>
        <p>The essay is:
{input_text}
Output:
{Augmented essay}</p>
        <p>Template prompt for the second stage of the pipeline approach:</p>
      </sec>
      <sec id="sec-7-9">
        <title>You will analyze tagged text spans representing argumentative components.</title>
        <p>Task:
1. Identify the argumentative relationships between the tagged text spans.
2. Preserve the original content without modifying or removing any parts.
3. Explicitly state whether each tagged span SUPPORTS, REFUTES, or CONTRADICTS another span.
4. Recognize implicit relationships where direct connections are not clearly stated but can be inferred from meaning, tone, and structure.</p>
        <p>Guidelines:
- SUPPORT: A span strengthens or reinforces the claim of another component.
- ATTACK: A span challenges another component by either:
  • Refuting it with direct counterarguments that explicitly disprove or weaken the claim.
  • Contradicting it by subtly undermining or presenting an opposing idea without outright refutation.
- Some argumentative components do not function as supporters or refuters, yet they can still be the target of support, contradiction, or refutation.
- Transitional words and phrases serve as clues for relationships. These include:
  Support indicators: therefore, as a result, consequently, this proves
  Refutation indicators: however, on the contrary, while, although
  Contradiction indicators: despite, in contrast, alternatively, whereas
- Consider cases where grammar or phrasing is unconventional: recognize argumentative intent even when structure is not ideal.
- When an argument contains both supportive and opposing elements, identify the dominant argumentative function.</p>
        <p>Input:
{input_text}
Output:
{Augmented essay}</p>
      </sec>
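<p>The second-stage guidelines distinguish fine-grained relations (Support, Refute, Contradict) from the coarse SUPPORT/ATTACK scheme, where Refute and Contradict are both forms of attack. A minimal sketch of that mapping, assuming relation triples have already been parsed from the model output (the function and sample triples are illustrative):</p>

```python
# Map the fine-grained relation labels used in the prompts onto the coarse
# SUPPORT/ATTACK scheme described in the guidelines. Purely illustrative.
COARSE = {"Support": "SUPPORT", "Refute": "ATTACK", "Contradict": "ATTACK"}

def coarsen(triples):
    """Collapse (source, relation, target) triples to SUPPORT/ATTACK."""
    return [(src, COARSE.get(rel, rel), tgt) for (src, rel, tgt) in triples]

triples = [("Premise 1", "Support", "Claim 1"),
           ("Claim 2", "Contradict", "Major Claim")]
print(coarsen(triples))
```

Collapsing to two relation classes is a common choice when evaluating against corpora, such as the AAEC, that annotate only support and attack links.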
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mochales Palau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <article-title>Automatic detection of arguments in legal texts</article-title>
          ,
          <source>in: 19th Belgian-Dutch Conference on Artificial Intelligence</source>
          , Utrecht, The Netherlands,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <article-title>Argument mining: A survey</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>45</volume>
          (
          <year>2020</year>
          )
          <fpage>765</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>A structure-aware argument encoder for literature discourse analysis</article-title>
          ,
          <source>in: Proceedings of the 29th International Conference on Computational Linguistics</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>7093</fpage>
          -
          <lpage>7098</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Cross-topic argument mining from heterogeneous sources using attention-based neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1802.05758</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Automated essay scoring: A survey of the state of the art</article-title>
          ,
          <source>in: IJCAI</source>
          , volume
          <volume>19</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>6300</fpage>
          -
          <lpage>6308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Srihari</surname>
          </string-name>
          ,
          <article-title>Dialo-ap: A dependency parsing based argument parser for dialogues</article-title>
          ,
          <source>in: Proceedings of the 29th International Conference on Computational Linguistics</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>887</fpage>
          -
          <lpage>901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Parsing argumentation structures in persuasive essays</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>43</volume>
          (
          <year>2017</year>
          )
          <fpage>619</fpage>
          -
          <lpage>659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Niculae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <article-title>Argument mining with structured svms and rnns</article-title>
          ,
          <source>arXiv preprint arXiv:1704.06869</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          , et al.,
          <article-title>The Llama 3 herd of models</article-title>
          ,
          <source>arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lengyel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          ,
          <source>arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Almazrouei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Alobeidli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alshamsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cojocaru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Debbah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>Goffinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hesslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Launay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Malartic</surname>
          </string-name>
          , et al.,
          <article-title>The falcon series of open language models</article-title>
          ,
          <source>arXiv preprint arXiv:2311.16867</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Paolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Athiwaratkun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Achille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anubhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N. d.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soatto</surname>
          </string-name>
          ,
          <article-title>Structured prediction as translation between augmented natural languages</article-title>
          ,
          <source>arXiv preprint arXiv:2101.05779</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hirst</surname>
          </string-name>
          ,
          <article-title>Two-pass discourse segmentation with pairing and global features</article-title>
          ,
          <source>arXiv preprint arXiv:1407.8215</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McAlister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravenscroft</surname>
          </string-name>
          ,
          <article-title>Mining arguments from 19th century philosophical texts using topic based modelling</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Argumentation Mining</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Palau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <article-title>Argumentation mining: the detection, classification and structure of arguments in text</article-title>
          ,
          <source>in: Proceedings of the 12th International Conference on Artificial Intelligence and Law</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>98</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>