<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Pisa, Italy.
∗ Corresponding author.
† The authors contributed equally.
jeremie.cabessa@uvsq.fr (J. Cabessa); hugoh@playtika.com (H. Hernault); umer.mushtaq@univ-lr.fr (U. Mushtaq)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Argument Mining in BioMedicine: Zero-Shot, In-Context Learning and Fine-tuning with LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jérémie Cabessa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugo Hernault</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Umer Mushtaq</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>David Lab, University of Versailles Saint-Quentin (UVSQ) - University of Paris-Saclay, 78000 Versailles, France Institute of Computer Science of the Czech Academy of Sciences</institution>
          ,
          <addr-line>18207 Prague 8</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratoire Informatique</institution>
          ,
          <addr-line>Image, Interaction (L3i)</addr-line>
          ,
          <institution>University of La Rochelle</institution>
          ,
          <addr-line>17042 La Rochelle</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Playtika Ltd.</institution>
          ,
          <addr-line>CH-1003 Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Argument Mining (AM) aims to extract the complex argumentative structure of a text, and Argument Type Classification (ATC) is an essential sub-task of AM. Large Language Models (LLMs) have shown impressive capabilities in most NLP tasks and beyond. However, fine-tuning LLMs can be challenging. In-Context Learning (ICL) has been suggested as a bridging paradigm between training-free and fine-tuning settings for LLMs. In ICL, an LLM is conditioned to solve tasks using a few solved demonstration examples included in its prompt. We focus on AM in the biomedical AbstRCT dataset. We address ATC using quantized and unquantized LLaMA-3 models through zero-shot learning, in-context learning, and fine-tuning approaches. We introduce a novel ICL strategy that combines kNN-based example selection with majority vote ensembling, along with a well-designed fine-tuning strategy for ATC. In the zero-shot setting, we show that LLaMA-3 fails to achieve acceptable classification results, suggesting the need for additional training modalities. However, in our training-free ICL setting, LLaMA-3 can leverage relevant information from only a few demonstration examples to achieve very competitive results. Finally, in our fine-tuning setting, LLaMA-3 achieves state-of-the-art performance on the ATC task in the AbstRCT dataset.</p>
      </abstract>
      <kwd-group>
<kwd>Argument Mining</kwd>
        <kwd>NLP</kwd>
        <kwd>LLMs</kwd>
        <kwd>LLaMA-3</kwd>
        <kwd>Zero-Shot Learning</kwd>
        <kwd>In-Context Learning</kwd>
        <kwd>Fine-tuning</kwd>
        <kwd>Ensembling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>and fine-tuning approaches. Our contributions are as
follows:
• In the zero-shot learning setting, we show that
LLaMA-3 fails to achieve acceptable classification
results, suggesting the need for additional
training modalities.</p>
      <sec id="sec-1-1">
<title>Our work sits at the intersection of zero-shot learning, in-context learning and fine-tuning</title>
        <p>We implement and compare the performance of the latest openly
available LLMs using these three approaches for AM on the
AbstRCT dataset.</p>
<p>Our code is freely available on GitHub.</p>
        <p>• We introduce a novel ICL strategy that combines
kNN-based example selection with majority vote
ensembling. In this training-free setting,
LLaMA-3 can leverage relevant information from only
a few demonstration examples to achieve very
competitive results.</p>
      </sec>
      <sec id="sec-1-2">
<title>2. Related Works</title>
        <p>
          In early works, Argument Mining has been approached using both classical algorithms such as SVMs [15, 2, 16, 17] as well as recurrent neural network models such as BiLSTMs [
          <xref ref-type="bibr" rid="ref4">18, 19, 4</xref>
          ]. Transformer-based models, such as BERT [20], have also been utilized for AM, including multi-scale argument modelling and customized feature-injected BERT-based models [
          <xref ref-type="bibr" rid="ref5 ref6">21, 22, 23, 5, 6, 24, 25</xref>
          ]. AM in the biomedical AbstRCT dataset has been approached using LSTMs [26, 27], sequential transfer learning [28] as well as transformer-based models [
          <xref ref-type="bibr" rid="ref10 ref11 ref12">29, 30, 31</xref>
          ].
        </p>
        <p>Table 1. AbstRCT dataset statistics (dataset split / abstracts / ACs): Neo-train 350 / 2,291; Neo-test 100 / 691; Gla-test 100 / 615; Mix-test 100 / 609.</p>
        <p>
          More recently, AM sub-tasks have been modeled as text generation tasks using LLMs. For the Argument Type Classification (ATC) sub-task, this approach involves using a prompt template to generate the predicted class of an argument component. This method has been applied to various AM use-cases, such as podcast transcripts and legal documents [
          <xref ref-type="bibr" rid="ref13 ref14 ref15">32, 33, 34</xref>
          ]. The latest approach in this ‘AM using LLM text generation’ direction involves a prompt that includes the argument component as the query and the complete text as the context, to output the class of the argument component using a generative model [
          <xref ref-type="bibr" rid="ref16">35</xref>
          ]. In this study, the three AM sub-tasks are modeled using the Persuasive Essays (PE) and AbstRCT datasets.
        </p>
        <p>
          In contrast to the fine-tuning approach, a relevant training-free ICL prompting strategy for LLMs has been proposed [
          <xref ref-type="bibr" rid="ref9">9, 11</xref>
          ]. This strategy combines kNN-based example selection, generated chain-of-thought prompting, and majority vote ensembling for few-shot classification. Interestingly, the ICL strategy outperforms the fine-tuning approach on the datasets used in that study.
        </p>
        <p>
          A sample of the AbstRCT dataset is provided below. The argument components (ACs) and their corresponding classes are indicated by bold tags. &lt;AC1: Major Claim&gt;A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer.&lt;/AC1&gt; The purpose of this study was to assess the effects of these treatments on health-related quality of life (HQL). Men with metastatic prostate cancer (n = 161) were randomized to receive either daily prednisone alone or mitoxantrone (every 3 weeks) plus prednisone. Those who received prednisone alone could have mitoxantrone added after 6 weeks if there was no improvement in pain. HQL was assessed before treatment initiation and then every 3 weeks using the European Organization for Research and Treatment of Cancer Quality-of-Life Questionnaire C30 (EORTC QLQ-C30) and the Quality of Life Module-Prostate 14 (QOLM-P14), a trial-specific module developed for this study. An intent-to-treat analysis was used to determine the mean duration of HQL improvement and differences in improvement between groups of patients. &lt;AC2: Premise&gt;At 6 weeks, both groups showed improvement in several HQL domains&lt;/AC2&gt;, and &lt;AC3: Premise&gt;only physical functioning and pain were better in the mitoxantrone-plus-prednisone group than in the prednisone-alone group&lt;/AC3&gt;. &lt;AC4: Premise&gt;After 6 weeks, patients taking prednisone showed no improvement in HQL scores, whereas those taking mitoxantrone plus prednisone showed significant improvements in global quality of life (P = .009), four functioning domains, and nine symptoms (.001 &lt; P &lt; .01)&lt;/AC4&gt;, and &lt;AC5: Premise&gt;the improvement (&gt; 10 units on a scale of 0 to 100) lasted longer than in the prednisone-alone group (.004 &lt; P &lt; .05)&lt;/AC5&gt;. &lt;AC6: Premise&gt;The addition of mitoxantrone to prednisone after failure of prednisone alone was associated with improvements in pain, pain impact, pain relief, insomnia, and global quality of life (.001 &lt; P &lt; .003).&lt;/AC6&gt; &lt;AC7: Claim&gt;Treatment with mitoxantrone plus prednisone was associated with greater and longer-lasting improvement in several HQL domains and symptoms than treatment with prednisone alone.&lt;/AC7&gt;
        </p>
        <sec id="sec-1-2-1">
          <title>3.2. Zero-Shot Learning (ZSL) and In-Context Learning (ICL)</title>
          <p>(1) kNN-based example selection (k = 3, 5): First,
the 2k neighboring abstracts A1, . . . , A2k of A are
selected according to the following similarity measure.
For any abstract Ai, let the signature of Ai be
the embedding of the first sentence of Ai using the
BioBERT model. The abstracts A1, . . . , A2k are the
ones whose signatures are the closest, with respect
to cosine similarity, to the signature of A. Then, k
abstracts, Ai1, . . . , Aik, are randomly chosen from
A1, . . . , A2k. Afterwards, a prompt containing all
the ACs and their corresponding classes in these
k abstracts is constructed (kNN). Finally, the LLM
predicts the classes yˆ1, . . . , yˆm of c1, . . . , cm on
the basis of this prompt.
(2) n-Ensembling (n = 3, 5): The kNN-based
example selection step, which involves randomness, is
repeated n times (nEns), leading to a set of n
sequences of class predictions {(yˆi,1, . . . , yˆi,m) : i =
1, . . . , n}. The final class predictions yˆ1, . . . , yˆm of
c1, . . . , cm are obtained by applying a component-wise
majority vote to the n prediction sequences.</p>
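<p>For illustration, the two steps above can be sketched as follows. This is a minimal sketch with hypothetical helper names: in our setup, the signature embeddings would come from BioBERT and the class sequences from the LLM.</p>

```python
# Sketch of the 2-step ICL strategy: kNN-based example selection followed
# by component-wise majority vote ensembling (hypothetical helper names).
import random
from collections import Counter

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors (signatures).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def knn_select(query_sig, candidate_sigs, k):
    """Step 1: keep the 2k candidates closest to the query signature,
    then draw k of them at random (this randomness feeds the ensembling)."""
    scores = [cosine(query_sig, s) for s in candidate_sigs]
    top_2k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2 * k]
    return random.sample(top_2k, k)

def ensemble_vote(prediction_runs):
    """Step 2: component-wise majority vote over n prediction sequences."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*prediction_runs)]
```

<p>With n = 3 runs producing, e.g., three class sequences for the same ACs, ensemble_vote returns the per-component majority class.</p>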
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>Zero-shot learning (ZSL) is the paradigm where the LLM is asked to solve a downstream task without receiving any specific solved examples in the prompt</title>
        <p>
          By contrast, in-context learning (ICL) refers to the emergent ability of LLMs to solve a downstream task based on a few demonstration examples given in the prompt as contextual information [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. As the major advantage, the ZSL and ICL paradigms do not require any fine-tuning of the model’s parameters (i.e., they constitute a training-free framework).
        </p>
        <p>Formally, let x be a query input text and C = [I; t(xi1, yi1); . . . ; t(xik, yik)] be a context composed of instructions I concatenated with input-output pairs (xij, yij) in text format, where X = {x1, x2, . . . } and Y = {y1, y2, . . . } are the sets of possible inputs and outputs, respectively. The ZSL and ICL paradigms correspond to the cases where k = 0 and k &gt; 0, respectively. For input x, the LLM M predicts the output yˆ such that yˆ = arg max_{yi ∈ Y} PM(yi | C; x), where PM(yi | C; x) is the probability that M generates yi when C and x are given as prompt. The main rationale behind ZSL and ICL is that the consideration of a well-chosen context C increases the probability of M predicting the correct answer y for input x, i.e., that PM(y | C; x) &gt; PM(y | x).</p>
        <p>
          We consider a 2-step ICL strategy for argument type classification (ATC) inspired by a recent study [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] (see Figure 1). More precisely, let A be an abstract containing argument components (ACs) c1, . . . , cm with corresponding true classes y1, . . . , ym, where each yi ∈ {Claim, Premise}. Given the ACs c1, . . . , cm in the prompt, the LLM generates the corresponding class predictions yˆ1, . . . , yˆm as follows:
        </p>
        <p>The kNN-based example selection optimizes learning from few examples by selecting samples most similar to the current instance, rather than choosing them randomly. The ensembling step increases prediction robustness by selecting the most frequent predictions. Note that the relevance of the ensembling step relies on the random selection in the kNN step. This randomness ensures that the same predictions are not always produced, allowing for majority voting and thereby increasing robustness.</p>
        <p>To aid the LLM in generating predictions, additional task-specific information is typically included in the prompt. For example, definitions of the ‘Claim’ and ‘Premise’ classes, along with their statistics in the Neo train set, can be incorporated in the prompt (info). Moreover, in addition to the ACs c1, . . . , cm whose classes are to be predicted, the abstract text from which these ACs originate can be included in the prompt (abstract). According to this ICL strategy, the classes yˆ1, . . . , yˆm of c1, . . . , cm are predicted all-at-once (see Figure 1). Therefore, a prompt of the form ‘info + abstract + 3NN + 3Ens’ (see Table 3) indicates that the argument components (ACs) of the abstract are predicted all-at-once, by incorporating additional information and the entire abstract text as contextual cues in the prompt, and employing the ICL strategy with 3NN-based example selection and 3-ensembling. A similar ICL strategy, where the classes yˆ1, . . . , yˆm are inferred one-by-one (i.e., each model inference leads to a single prediction yˆj), has been considered but shown to be significantly less efficient. Due to space constraints, the latter results are omitted in this work.</p>
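<p>The construction of the context C can be sketched as a simple string template. This is an illustrative construction only; the actual prompts we use are given in Appendix A.</p>

```python
# Minimal sketch of ZSL vs. ICL prompt construction (hypothetical template).
def build_prompt(instructions, demonstrations, query):
    """Builds C = [I; t(x_i1, y_i1); ...; t(x_ik, y_ik)] followed by the
    query x. An empty demonstration list (k = 0) yields a zero-shot prompt;
    a non-empty one (k greater than 0) yields an in-context prompt."""
    parts = [instructions]
    for x, y in demonstrations:
        parts.append("Input: {}\nOutput: {}".format(x, y))
    parts.append("Input: {}\nOutput:".format(query))
    return "\n\n".join(parts)
```

<p>Calling build_prompt with no demonstrations corresponds to ZSL; passing the k selected (AC, class) pairs corresponds to ICL.</p>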
        <sec id="sec-1-3-1">
          <title>3.3. Fine-tuning</title>
          <p>
            Fine-tuning (FT) refers to the process of further training a pre-trained LLM on a downstream task. Previous studies indicate that relying solely on the text of an argument component is insufficient for predicting its argumentative class; additional contextual information is essential for achieving competitive classification accuracy [
            <xref ref-type="bibr" rid="ref2 ref5 ref6">2, 5, 6</xref>
            ]. Therefore, we propose a fine-tuning strategy that models the ATC task at the document level. Specifically, we incorporate task-specific information into each training
          </p>
          <p>[Figure 1: the prompt (info, abstract, examples 1 to k) is given to the LLM; the process is repeated n times, and the n prediction sequences are combined by a component-wise majority vote (nEns) into the final predictions.]</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>4.2. In-Context Learning</title>
          <p>sample and generate the class label predictions for the ACs of an abstract all-at-once.</p>
          <p>the entire text of the abstract (abstract) significantly improves the results. These expected observations serve as an ablation study and justify the usage of the additional information and full abstract text (prompt template ‘info + abstract’) in all subsequent experiments.</p>
          <p>3.4. Implementation Details</p>
          <p>
            As the embedding engine, we use dmis-lab’s BioBERT1. For zero-shot learning, ICL and fine-tuning, we experiment with the LLaMA-3-8B-Instruct and LLaMA-3-70B-Instruct models, as well as various GGML-quantized configurations of them2. For ICL, we set the generation temperature to 0.1. For fine-tuning, we use LoRA adapters with loraplus_lr_ratio of 16.0. We set a batch size of 2 and a learning rate of 5e-5. For implementation, we use the LLaMA-Factory3 framework [
            <xref ref-type="bibr" rid="ref17">36</xref>
            ]. Examples of the prompts we use for zero-shot learning, in-context learning and fine-tuning with LLaMA-3 are given in Appendix A.
          </p>
          <p>4. Results</p>
          <p>In all experiments, we observed that the models consistently generated the correct number of classes for each inference task. This observation remains valid for subsequent ICL and fine-tuning settings. It demonstrates the model’s capability to understand the correspondence between the number of input ACs and the number of classes to predict.</p>
          <p>In the ZSL training-free setting, across the Neo, Gla and Mix test sets, the performance of the LLMs strongly correlated with the complexity of these models, achieving maximal macro F1-scores of 0.698, 0.819 and 0.725, respectively. Overall, in ZSL, the LLMs fail to achieve acceptable results. These considerations underscore the need for implementing additional learning modalities to address the ATC task effectively.</p>
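<p>The JSON output format required by our task prompt (see Appendix A) can be post-processed as sketched below. The function name and error handling are illustrative only; the count check mirrors the observation that the model returns exactly one class per input AC.</p>

```python
# Sketch of post-processing the model's all-at-once JSON answer of the form
# {"Argument 1": class, ..., "Argument n": class} (hypothetical helper name).
import json

def parse_predictions(generated_text, num_acs):
    """Parse the generated JSON object and return the predicted classes
    in argument-component order."""
    data = json.loads(generated_text)
    if len(data) != num_acs:
        raise ValueError("expected one predicted class per argument component")
    return [data["Argument {}".format(i)] for i in range(1, num_acs + 1)]
```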
        </sec>
        <sec id="sec-1-3-3">
          <title>4.1. Zero-Shot Learning</title>
        </sec>
      </sec>
      <sec id="sec-1-4">
<title>Footnotes</title>
      <p>1 https://huggingface.co/dmis-lab; 2 https://github.com/ggerganov/ggml; 3 https://github.com/hiyouga/LLaMA-Factory</p>
      </sec>
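<p>The implementation settings listed above can be collected in plain Python dictionaries for reference. The key names are illustrative only and do not reflect LLaMA-Factory’s actual configuration schema.</p>

```python
# Hyperparameters stated in the implementation details (illustrative keys).
finetune_config = {
    "model": "LLaMA-3-8B-Instruct",        # also run with the 70B variant
    "adapter": "LoRA",
    "loraplus_lr_ratio": 16.0,
    "batch_size": 2,
    "learning_rate": 5e-5,
}

icl_config = {
    "temperature": 0.1,                    # generation temperature for ICL
    "embedding_model": "dmis-lab/BioBERT", # kNN signature embeddings
}
```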
      <sec id="sec-1-5">
      <title>The results for zero-shot learning (ZSL) on the ATC task</title>
        <p>
          are reported in Table 2. Recall that zero-shot learning corresponds to the prompting strategy where no nearest neighbors are included as demonstration examples, referred to as ‘info + abstract + 0NN’ in our notation. In an initial experimentation phase, we observed that adding complementary information (info) (definitions of ‘Claim’ and ‘Premise’ and dataset statistics) and including
        </p>
        <p>Table 2. ZSL results. Neo test: LLaMA-3-8b-Instruct-bnb-4bit 0.529, LLaMA-3-8b-Instruct 0.544, LLaMA-3-70b-Instruct-bnb-4bit 0.642. Gla test: LLaMA-3-8b-Instruct-bnb-4bit 0.553, LLaMA-3-8b-Instruct 0.569, LLaMA-3-70b-Instruct-bnb-4bit 0.755. Mix test: LLaMA-3-8b-Instruct-bnb-4bit 0.546, LLaMA-3-8b-Instruct 0.563, LLaMA-3-70b-Instruct-bnb-4bit 0.671.</p>
        <p>
          The results for in-context learning (ICL) on the ATC task are reported in Table 3. First, note that the transition from zero-shot learning (‘info + abstract + 0NN’, Table 2) to in-context learning (‘info + abstract + kNN’, Table 3) drastically improves the results. This validates the effectiveness of the kNN-based example selection method. In addition, except for the Mix test set, the 3NN strategy consistently outperforms the 5NN strategy, suggesting that three examples suffice for optimally learning the ATC task in an ICL setting. The inclusion of more demonstration examples correlates with a significant increase
        </p>
        <p>in prompt length, potentially hindering the performance
of the LLM or exceeding the maximum size of its
context. Furthermore, the ensembling strategy consistently
improves the results, even if only slightly, confirming that
the robustness of the results can indeed be strengthened
through ensembling predictions.</p>
        <p>Overall, the training-free ICL strategy achieves very
competitive F1-scores of 0.912, 0.910, and 0.929 on Neo,
Mix, and Gla test sets, respectively. However, these
results remain lower than those obtained by previous
training-dependent models (see Table 4, upper rows).</p>
        <sec id="sec-1-5-1">
          <title>4.3. Fine-Tuning</title>
          <p>Table 3 rows: for each test set (Neo, Gla, Mix) and each model (LLaMA-3-8b-Instruct, LLaMA-3-8b-Instruct-bnb-4bit, LLaMA-3-70b-Instruct-bnb-4bit), the prompt templates ‘info + abstract + 3NN’, ‘info + abstract + 5NN’ and ‘info + abstract + 3NN + 3Ens’ are evaluated.</p>
          <p>The results achieved by the fine-tuning (FT) strategy on
the ATC task are reported in Table 4. Our results show
that fine-tuning significantly outperforms ICL. These
findings suggest that the argumentative flow within
abstracts cannot be inferred solely from the knowledge
acquired during pre-training, and requires additional
parameter updates to be effectively learned.</p>
          <p>In this training-dependent context, we achieve maximal
F1-scores of 0.935, 0.913, and 0.951 on the Neo, Gla,
and Mix test sets, respectively, establishing new
state-of-the-art results for the Neo and Mix test sets. These
results suggest once again that the sequentiality of
arguments inside a specific corpus requires fine-tuning to be
optimally captured.</p>
          <p>Table 3. Results for ATC on the three test sets of the AbstRCT dataset with LLaMA-3 models using the 2-step ICL strategy described in the text: info + abstract + 3NN 0.859 0.926 0.893; info + abstract + 5NN 0.866 0.922 0.894; info + abstract + 3NN + 3Ens 0.885 0.940 0.913; LLaMA-3-70b-Instruct-bnb-4bit: info + abstract + 3NN 0.905 0.954 0.929; info + abstract + 5NN 0.906 0.952 0.929; info + abstract + 3NN + 3Ens 0.904 0.952 0.928.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <p>In this work, we address argument type classification
(ATC) in the biomedical AbstRCT dataset with openly
available LLaMA-3 models from the three-fold perspective of
zero-shot learning (ZSL), in-context learning (ICL) and
fine-tuning (FT). We show that ZSL fails to achieve
acceptable performance, ICL significantly improves the results,
and FT reaches state-of-the-art performance.</p>
      <p>
        These results support the fact that the ATC task cannot be solved in a zero-shot setting by relying solely on general-purpose language modalities acquired during pre-training. Additional learning is essential, either in the form of solved demonstration examples (ICL) or via parameters’ updates (FT). We conjecture that the sequential flow of arguments within a text is a corpus-specific feature that cannot be inferred through zero-shot methods.
      </p>
      <p>Table 4 (Neo test set): ResAttArg (Ensemble) [27], SeqMT [28], MRC_GEN [
        <xref ref-type="bibr" rid="ref16">35</xref>
        ], GIAM [25]; LLaMA-3-8B-Instruct 0.919, LLaMA-3-8B-Instruct-bnb-4bit 0.935, LLaMA-3-70B-Instruct 0.929, LLaMA-3-70B-Instruct-bnb-4bit 0.921.</p>
      <p>
        Previous works demonstrated that the text of argument
components alone does not suffice to infer their
argumentative roles [
        <xref ref-type="bibr" rid="ref2 ref4 ref6">2, 4, 6</xref>
        ]. Additional contextual, structural
and syntactic features are necessary. In our ICL and FT
settings, comprehensive contextual and structural
information is incorporated through task-specific information
and the complete abstract text provided in the prompt. This
information enables the model to discern the sequence
of arguments, their associated markers, and other
characteristics closely associated with their argumentative
roles.
      </p>
      <p>For future work, the design and implementation of a
full AM pipeline using LLMs represents a major
milestone. In this scenario, the LLM would take raw texts as
input and produce a detailed map of the argumentative
structure as output. We believe that LLMs will
substantially transform the landscape of AM and its practical
applications.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This work benefited from access to the computing resources of the L3i laboratory, operated and hosted by the University of La Rochelle</title>
        <p>It is financed by the French
government and the Region Nouvelle-Aquitaine. This
research also benefited from institutional support RVO:
67985807 and was partially supported by the grant of the
Czech Science Foundation No. GA22-02067S. Finally, we
are grateful to Playtika Ltd. for their support of this
research.</p>
        <p>ARXIV.2311.16452. arXiv:2311.16452.
[10] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. K. et al. (Ed.), Proceedings of NeurIPS 2022, volume 35, 2022, pp. 24824–24837. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[11] S. Lei, G. Dong, X. Wang, K. Wang, S. Wang, InstructERC: Reforming emotion recognition in conversation with a retrieval multi-task LLMs framework, CoRR abs/2309.11911 (2023). URL: https://doi.org/10.48550/arXiv.2309.11911. doi:10.48550/ARXIV.2309.11911. arXiv:2309.11911.
[12] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, 2023. arXiv:2203.11171.
[13] H. Nori, N. King, S. M. McKinney, D. Carignan, E. Horvitz, Capabilities of GPT-4 on medical challenge problems, CoRR abs/2303.13375 (2023). URL: https://doi.org/10.48550/arXiv.2303.13375. doi:10.48550/ARXIV.2303.13375. arXiv:2303.13375.
[14] T. Mayer, Argument Mining on Clinical Trials, Theses, Université Côte d’Azur, 2020. URL: https://theses.hal.science/tel-03209489.
[15] R. Mochales, M. Moens, Argumentation mining, Artificial Intelligence and Law 19 (2011) 1–22. doi:10.1007/s10506-010-9104-x.
[16] I. Habernal, I. Gurevych, Argumentation mining in user-generated web discourse, Computational Linguistics 43 (2017) 125–179. URL: https://aclanthology.org/J17-1004. doi:10.1162/COLI_a_00276.
[17] R. Levy, Y. Bilu, D. Hershcovich, E. Aharoni, N. Slonim, Context dependent claim detection, in: ICCL, 2014. URL: https://api.semanticscholar.org/CorpusID:18847466.
[18] S. Eger, J. Daxenberger, I. Gurevych, Neural end-to-end learning for computational argumentation mining, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of ACL 2017, ACL, Vancouver, Canada, 2017, pp. 11–22. URL: https://aclanthology.org/P17-1002. doi:10.18653/v1/P17-1002.
[19] V. Niculae, J. Park, C. Cardie, Argument mining with structured SVMs and RNNs, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of ACL 2017, ACL, Vancouver, Canada, 2017, pp. 985–995. URL: https://aclanthology.org/P17-1091. doi:10.18653/v1/P17-1091.
[20] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. B. et al. (Ed.), Proceedings of NAACL-HLT 2019, ACL, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
[21] G. Zhang, P. Nulty, D. Lillis, Enhancing legal argument mining with domain pre-training and neural networks, CoRR abs/2202.13457 (2022). URL: https://arxiv.org/abs/2202.13457. arXiv:2202.13457.
[22] H. Wang, Z. Huang, Y. Dou, Y. Hong, Argumentation mining on essays at multi scales, in: D. S. et al. (Ed.), Proceedings of COLING 2020, ICCL, Barcelona, Spain (Online), 2020, pp. 5480–5493. URL: https://aclanthology.org/2020.coling-main.478. doi:10.18653/v1/2020.coling-main.478.
[23] S. Fioravanti, A. Zugarini, F. Giannini, L. Rigutini, M. Maggini, M. Diligenti, Linguistic feature injection for efficient natural language processing, in: IJCNN 2023, June 18-23, 2023, IEEE, 2023, pp. 1–7. URL: https://doi.org/10.1109/IJCNN54540.2023.10191680. doi:10.1109/IJCNN54540.2023.10191680.
[24] J. Bao, C. Fan, J. Wu, Y. Dang, J. Du, R. Xu, A neural transition-based model for argumentation mining, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the ACL and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), ACL, Online, 2021, pp. 6354–6364. URL: https://aclanthology.org/2021.acl-long.497. doi:10.18653/v1/2021.acl-long.497.
[25] B. Liu, V. Schlegel, P. Thompson, R. T. Batista-Navarro, S. Ananiadou, Global information-aware argument mining based on a top-down multi-turn QA model, Information Processing &amp; Management 60 (2023) 103445. URL: https://www.sciencedirect.com/science/article/pii/S0306457323001826. doi:10.1016/j.ipm.2023.103445.
[26] A. Galassi, M. Lippi, P. Torroni, Argumentative link prediction using residual networks and multi-objective learning, in: N. Slonim, R. Aharonov (Eds.), Proceedings of the 5th Workshop on Argument Mining, ACL, Brussels, Belgium, 2018, pp. 1–10. URL: https://aclanthology.org/W18-5201. doi:10.18653/v1/W18-5201.
[27] A. Galassi, M. Lippi, P. Torroni, Multi-task attentive residual networks for argument mining, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023) 1877–1892. doi:10.1109/TASLP.2023.3275040.
[28] J. Si, L. Sun, D. Zhou, J. Ren, L. Li, Biomedical argument mining based on sequential multi-task learning, IEEE/ACM Trans. Comput. Biol. Bioinformatics 20 (2022) 864–874. URL: https://doi.org/</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Appendix</title>
      <sec id="sec-4-1">
        <title>Examples of prompts for LLaMA 3 for the zero-shot learning (ZSL), in-context learning (ICL) and fine-tuning (FT) settings are provided below.</title>
        <p>A.1. Zero-Shot Learning
### Task description: You are an expert biomedical assistant that takes 1) an abstract
text and 2) the list of all arguments from this abstract text, and must classify all
arguments into one of two classes: Claim or Premise. 68.0052% of examples are of
type Premise and 31.9948% of type Claim. You must absolutely not generate any text
or explanation other than the following JSON format {"Argument 1": &lt;predicted class
for Argument 1 (str)&gt;, ..., "Argument n": &lt;predicted class for Argument n (str)&gt;}
### Class definitions: Claim = A claim in the abstract of an RCT is a statement or
conclusion about the findings of the study. Premise = A premise in the abstract of an
RCT is a statement that provides an evidence or proof for a claim.
### Abstract: Few controlled clinical trials exist to support oral
combination therapy in pulmonary arterial hypertension (PAH). Patients with PAH
(idiopathic [IPAH] or associated with connective tissue disease [APAH-CTD])
taking bosentan (62.5 or 125 mg twice daily at a stable dose for ≥ 3 months) were
randomized (1:1) to sildenafil (20 mg, 3 times daily; n = 50) or placebo (n = 53).
The primary endpoint was change from baseline in 6-min walk distance (6MWD)
at week 12, assessed using analysis of covariance. Patients could continue in a
52-week extension study. An analysis of covariance main-effects model was used,
which included categorical terms for treatment, baseline 6MWD (&lt;325 m; ≥ 325
m), and baseline aetiology; sensitivity analyses were subsequently performed.
In sildenafil versus placebo arms, week-12 6MWD increases were similar (least
squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to 17.1 m]; P
= 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ± 57.4 m,
respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1 m in
APAH-CTD (35% of population). One-year survival was 96%; patients maintained
modest 6MWD improvements. Changes in WHO functional class and Borg
dyspnoea score and incidence of clinical worsening did not differ. Headache,
diarrhoea, and flushing were more common with sildenafil. Sildenafil, in addition
to stable (≥ 3 months) bosentan therapy, had no benefit over placebo for 12-week
change from baseline in 6MWD. The influence of PAH aetiology warrants future study.
### Arguments: Argument 1=In sildenafil versus placebo arms, week-12 6MWD
increases were similar (least squares mean difference [sildenafil-placebo], -2.4 m
[90% CI: -21.8 to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ±
45.7 versus 11.8 ± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0
versus 17.5 ± 59.1 m in APAH-CTD (35% of population).</p>
        <p>Argument 2=Changes in WHO functional class and Borg dyspnoea score and
incidence of clinical worsening did not differ.</p>
        <p>Argument 3=Headache, diarrhoea, and flushing were more common with sildenafil.
Argument 4=Sildenafil, in addition to stable ( ≥ 3 months) bosentan therapy, had no
benefit over placebo for 12-week change from baseline in 6MWD.
### Result:
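The zero-shot protocol above (task description, class definitions, abstract, numbered arguments, then a JSON answer keyed by "Argument i") can be sketched in code as follows. This is a minimal illustration; the helper names (build_zero_shot_prompt, parse_result) are our own, not from the paper, and only the section markers and the JSON contract mirror the prompt text.

```python
import json

def build_zero_shot_prompt(task_description, class_definitions, abstract, arguments):
    """Assemble a zero-shot prompt in the format shown above (illustrative helper)."""
    lines = [
        f"### Task description: {task_description}",
        f"### Class definitions: {class_definitions}",
        f"### Abstract: {abstract}",
        # Arguments are numbered from 1, as in the prompt: "Argument 1=...".
        "### Arguments: " + " ".join(
            f"Argument {i}={arg}" for i, arg in enumerate(arguments, start=1)),
        "### Result:",
    ]
    return "\n".join(lines)

def parse_result(model_output, n_arguments):
    """Parse the model's JSON answer into an ordered list of class labels."""
    predictions = json.loads(model_output)
    return [predictions[f"Argument {i}"] for i in range(1, n_arguments + 1)]
```

Constraining the model to emit only the JSON object keeps parsing deterministic: parse_result recovers one "Claim"/"Premise" label per argument, in order.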
A.2. In-Context Learning (ICL)
### Task description: You are an expert biomedical assistant that takes 1) an
abstract text, 2) the list of all arguments from this abstract text, and must classify all
arguments into one of two classes: Claim or Premise. 68.0052% of examples are of
type Premise and 31.9948% of type Claim. You must absolutely not generate any text
or explanation other than the following JSON format {"Argument 1": &lt;predicted class
for Argument 1 (str)&gt;, ..., "Argument n": &lt;predicted class for Argument n (str)&gt;}
### Class definitions: Claim = A claim in the abstract of an RCT is a statement or
conclusion about the findings of the study. Premise = A premise in the abstract of an
RCT is a statement that provides evidence or proof for a claim.
### Examples:
## Example 1
Treatment of patients with advanced or metastatic esophagogastric adenocarcinoma
should not only prolong life but also provide relief of symptoms and improve quality
of life (QOL). Esophagogastric adenocarcinoma mainly occurs in elderly patients, but
they are underrepresented in most clinical trials and often do not receive effective
combination chemotherapy, most probably for fear of intolerance. Using validated
instruments, we prospectively assessed QOL within the randomized FLOT65+
phase II trial. Within the FLOT65+ trial, a total of 143 patients aged ≥ 65 years
were randomly allocated to receive biweekly oxaliplatin plus 5-fluorouracil (5-FU)
continuous infusion and folinic acid (FLO) or the same regimen in combination
with docetaxel 50 mg/m(2) (FLOT). The European Organisation for Research and
Treatment of Cancer Quality of Life Questionnaire C30 (EORTC QLQ-C30) and
the gastric module STO22 were administered every 8 weeks until progression.
Time to definitive deterioration of QOL parameters was analyzed and compared
within the treatment arms. The median age of patients was 70 years. Patients
receiving FLOT exhibited higher response rates and had improved disease-free and
progression-free survival (PFS). The proportions of patients with evaluable baseline
EORTC QLQ-C30 and STO22 questionnaires were balanced (83 % in FLOT and 89 %
in FLO). Considering evaluable patients with assessable questionnaires (n = 123),
neither functioning nor symptom parameters differed significantly in favor of one of
the two treatment groups. Particularly, there was no significant difference regarding
time to definitive deterioration of global health status/quality of life from baseline
(primary endpoint). Notably, patients receiving FLO or FLOT as palliative treatment
(n = 98) achieved comparable QOL results. Although toxicity was higher in patients
receiving FLOT, no negative impact of the addition of docetaxel on QOL parameters
could be demonstrated. Thus, elderly patients in need of intensified chemotherapy
may receive FLOT without compromising patient-reported outcome parameters.
Argument 1=Patients receiving FLOT exhibited higher response rates and had
improved disease-free and progression-free survival (PFS).</p>
        <p>Argument 2=there was no significant difference regarding time to definitive
deterioration of global health status/quality of life from baseline (primary endpoint).
Argument 3=patients receiving FLO or FLOT as palliative treatment (n = 98) achieved
comparable QOL results.</p>
        <p>Argument 4=Although toxicity was higher in patients receiving FLOT,
Argument 5=no negative impact of the addition of docetaxel on QOL parameters
could be demonstrated.</p>
        <p>Argument 6=elderly patients in need of intensified chemotherapy may receive FLOT
without compromising patient-reported outcome parameters.
{"Argument 1": "Premise", "Argument 2": "Premise", "Argument 3": "Premise",
"Argument 4": "Premise", "Argument 5": "Premise", "Argument 6": "Claim"}
## Example 2
Chemotherapy prolongs survival and improves quality of life (QOL) for good
performance status (PS) patients with advanced non-small cell lung cancer (NSCLC).
Targeted therapies may improve chemotherapy effectiveness without worsening
toxicity. SGN-15 is an antibody-drug conjugate (ADC), consisting of a chimeric
murine monoclonal antibody recognizing the Lewis Y (Le(y)) antigen, conjugated
to doxorubicin. Le(y) is an attractive target since it is expressed by most NSCLC.
SGN-15 was active against Le(y)-positive tumors in early phase clinical trials and
was synergistic with docetaxel in preclinical experiments. This Phase II, open-label
study was conducted to confirm the activity of SGN-15 plus docetaxel in previously
treated NSCLC patients. Sixty-two patients with recurrent or metastatic NSCLC
expressing Le(y), one or two prior chemotherapy regimens, and PS&lt; or =2 were
randomized 2:1 to receive SGN-15 200 mg/m2/week with docetaxel 35 mg/m2/week
(Arm A) or docetaxel 35 mg/m2/week alone (Arm B) for 6 of 8 weeks. Intrapatient
dose-escalation of SGN-15 to 350 mg/m2 was permitted in the second half of the
study. Endpoints were survival, safety, efficacy, and quality of life. Forty patients on
Arm A and 19 on Arm B received at least one treatment. Patients on Arms A and B
had median survivals of 31.4 and 25.3 weeks, 12-month survivals of 29% and 24%,
and 18-month survivals of 18% and 8%, respectively. Toxicity was mild in both arms.
QOL analyses favored Arm A. SGN-15 plus docetaxel is a well-tolerated and active
second and third line treatment for NSCLC patients. Ongoing studies are exploring
alternate schedules to maximize synergy between these agents.
Argument 1=Chemotherapy prolongs survival and improves quality of life (QOL) for
good performance status (PS) patients with advanced non-small cell lung cancer
(NSCLC).</p>
        <p>Argument 2=Targeted therapies may improve chemotherapy effectiveness without
worsening toxicity.</p>
        <p>Argument 3=Le(y) is an attractive target since it is expressed by most NSCLC.
Argument 4=SGN-15 was active against Le(y)-positive tumors in early phase clinical
trials and was synergistic with docetaxel in preclinical experiments.
Argument 5=Patients on Arms A and B had median survivals of 31.4 and 25.3
weeks, 12-month survivals of 29% and 24%, and 18-month survivals of 18% and 8%,
respectively.
Argument 6=Toxicity was mild in both arms.</p>
        <p>Argument 7=QOL analyses favored Arm A.</p>
        <p>Argument 8=SGN-15 plus docetaxel is a well-tolerated and active second and third
line treatment for NSCLC patients.
## Example 3
The impact of treatment on health-related quality of life (HRQoL) is an important
consideration in the adjuvant treatment of operable breast cancer. Here we report
mature HRQoL outcomes from the ATAC trial, comparing anastrozole with tamoxifen
as primary adjuvant therapy for postmenopausal women with localized breast cancer.
Patients completed the Functional Assessment of Cancer Therapy-Breast (FACT-B)
questionnaire plus endocrine subscale (ES) at baseline, 3 and 6 months, and every 6
months thereafter. Baseline characteristics in the HRQoL sub-protocol were well
balanced between the anastrozole (n = 335) and tamoxifen (n = 347) groups in the
primary analysis population. As with previously published results at 2 years, there
was no statistically significant difference in the Trial Outcome Index of the FACT-B,
the primary endpoint of the study, between treatments at 5 years. There were no
statistically significant differences between treatment groups in ES total scores.
Consistent with the 2-year analysis, there were differences between treatment groups
in patient-reported side effects: diarrhea (anastrozole 3.1% vs. tamoxifen 1.3%),
vaginal dryness (18.5% vs. 9.1%), diminished libido (34.0% vs. 26.1%), and dyspareunia
(17.3% vs. 8.1%) were significantly more frequent with anastrozole compared to
tamoxifen. Dizziness (3.1% vs. 5.4%) and vaginal discharge (1.2% vs. 5.2%) were
significantly less frequent with anastrozole compared to tamoxifen. In this, the first
report of HRQoL over 5 years of initial adjuvant therapy with an aromatase inhibitor,
we conclude that anastrozole and tamoxifen had similar impacts on HRQoL, which
was maintained or slightly improved during the treatment period for both groups.
Argument 1=The impact of treatment on health-related quality of life (HRQoL) is an
important consideration in the adjuvant treatment of operable breast cancer.
Argument 2=As with previously published results at 2 years, there was no statistically
significant difference in the Trial Outcome Index of the FACT-B, the primary
endpoint of the study, between treatments at 5 years.</p>
        <p>Argument 3=There were no statistically significant differences between treatment
groups in ES total scores.</p>
        <p>Argument 4=there were differences between treatment groups in patient-reported
side effects:
Argument 5=diarrhea (anastrozole 3.1% vs. tamoxifen 1.3%), vaginal dryness (18.5%
vs. 9.1%), diminished libido (34.0% vs. 26.1%), and dyspareunia (17.3% vs. 8.1%) were
significantly more frequent with anastrozole compared to tamoxifen.
Argument 6=Dizziness (3.1% vs. 5.4%) and vaginal discharge (1.2% vs. 5.2%) were
significantly less frequent with anastrozole compared to tamoxifen.</p>
        <p>Argument 7=In this, the first report of HRQoL over 5 years of initial adjuvant therapy
with an aromatase inhibitor, we conclude that anastrozole and tamoxifen had similar
impacts on HRQoL, which was maintained or slightly improved during the treatment
period for both groups.
Few controlled clinical trials exist to support oral combination therapy in pulmonary
arterial hypertension (PAH). Patients with PAH (idiopathic [IPAH] or associated with
connective tissue disease [APAH-CTD]) taking bosentan (62.5 or 125 mg twice daily
at a stable dose for ≥ 3 months) were randomized (1:1) to sildenafil (20 mg, 3 times
daily; n = 50) or placebo (n = 53). The primary endpoint was change from baseline
in 6-min walk distance (6MWD) at week 12, assessed using analysis of covariance.
Patients could continue in a 52-week extension study. An analysis of covariance
main-effects model was used, which included categorical terms for treatment,
baseline 6MWD (&lt;325 m; ≥ 325 m), and baseline aetiology; sensitivity analyses were
subsequently performed. In sildenafil versus placebo arms, week-12 6MWD increases
were similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8
to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8
± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ±
59.1 m in APAH-CTD (35% of population). One-year survival was 96%; patients
maintained modest 6MWD improvements. Changes in WHO functional class and
Borg dyspnoea score and incidence of clinical worsening did not differ. Headache,
diarrhoea, and flushing were more common with sildenafil. Sildenafil, in addition
to stable (≥ 3 months) bosentan therapy, had no benefit over placebo for 12-week
change from baseline in 6MWD. The influence of PAH aetiology warrants future study.
Argument 1=In sildenafil versus placebo arms, week-12 6MWD increases were
similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to
17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ±
57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1
m in APAH-CTD (35% of population).</p>
        <p>Argument 2=Changes in WHO functional class and Borg dyspnoea score and
incidence of clinical worsening did not differ.</p>
        <p>Argument 3=Headache, diarrhoea, and flushing were more common with sildenafil.
Argument 4=Sildenafil, in addition to stable ( ≥ 3 months) bosentan therapy, had no
benefit over placebo for 12-week change from baseline in 6MWD.
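The in-context-learning prompt above concatenates the task description, a block of worked examples (abstract, numbered arguments, gold JSON answer), and finally the unlabelled query. A minimal sketch of that assembly, with illustrative helper and parameter names of our own choosing:

```python
def build_icl_prompt(task_description, examples, query_abstract, query_arguments):
    """Concatenate task description, labelled examples, and query (illustrative)."""
    parts = [f"### Task description: {task_description}", "### Examples:"]
    # Each example carries its abstract, its arguments, and the gold JSON answer.
    for k, (abstract, arguments, answer_json) in enumerate(examples, start=1):
        parts.append(f"## Example {k}")
        parts.append(abstract)
        parts.extend(f"Argument {i}={a}" for i, a in enumerate(arguments, start=1))
        parts.append(answer_json)
    # The query repeats the abstract+arguments layout but omits the answer,
    # leaving the model to produce the JSON object itself.
    parts.append(query_abstract)
    parts.extend(f"Argument {i}={a}" for i, a in enumerate(query_arguments, start=1))
    return "\n".join(parts)
```

Because every example ends with its gold JSON object, the model sees the exact output format it must reproduce for the query.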
### You are an expert in medical analysis. You are given the abstract of a randomized
controlled trial which contains numbered argument components enclosed by
&lt;AC&gt;&lt;/AC&gt; tags. Your task is to classify each argument component in the abstract as
either "Claim" or "Premise". You must return a list of argument component types in the
following JSON format: "component_types": [component_type (str), component_type
(str), ..., component_type (str)]
### Here is the abstract text: An open, randomized study was performed to assess
the effects of supportive pamidronate treatment on morbidity from bone metastases
in breast cancer patients. Eighty-one pamidronate patients and 80 control patients
were monitored for a median of 18 and 21 months, respectively, for events of skeletal
morbidity and the radiologic course of metastatic bone disease. The oral pamidronate
dose was 600 mg/d (high dose [HD]) during the earliest study years, then changed
to 300 mg/d (low dose [LD]) because of gastrointestinal toxicity. Twenty-nine of
81 pamidronate (HD/LD) patients first received 600 mg/d and were then changed
to 300 mg/d; 52 of 81 pamidronate LD patients received 300 mg/d throughout the
study. Tumor treatment was unrestricted. An overall intent-to-treat analysis was
performed.&lt;AC&gt; In the pamidronate group, the occurrence of hypercalcemia, severe
bone pain, and symptomatic impending fractures decreased by 65%, 30%, and 50%,
respectively; event-rates of systemic treatment and radiotherapy decreased by
35% (P &lt; or = .02). &lt;/AC&gt;&lt;AC&gt; The event-free period (EFP), radiologic course of
disease, and survival did not improve. &lt;/AC&gt;&lt;AC&gt; Subgroup analyses suggested
a dose-dependent treatment effect. &lt;/AC&gt;&lt;AC&gt; Compared with their controls,
in pamidronate HD/LD patients, events occurred 60% to 90% less frequently (P
&lt; or = .03) and the EFP was prolonged (P = .002). &lt;/AC&gt;&lt;AC&gt; In pamidronate
LD patients, event-rates decreased by 15% to 45% (P &lt; or = .04). &lt;/AC&gt;&lt;AC&gt;
Gastrointestinal toxicity of pamidronate caused a 23% drop-out rate, &lt;/AC&gt;&lt;AC&gt; but
other cancer-associated factors seemed to contribute to this toxicity. &lt;/AC&gt;&lt;AC&gt;
Pamidronate treatment of breast cancer patients efficaciously reduced skeletal
morbidity. &lt;/AC&gt;&lt;AC&gt; The effect appeared to be dose-dependent. &lt;/AC&gt;&lt;AC&gt;
Further research on dose and mode of treatment is mandatory. &lt;/AC&gt;
{"component_types": ["Premise", "Premise", "Claim", "Premise", "Premise", "Premise",
"Claim", "Claim", "Claim", "Claim"]}</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. M. Palau, M.-F. Moens, Argumentation mining: The detection, classification and structure of arguments in text, in: Proceedings of ICAIL 2009, ICAIL '09, ACM, New York, NY, USA, 2009, pp. 98-107. URL: https://doi.org/10.1145/1568234.1568246. doi:10.1145/1568234.1568246.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] C. Stab, I. Gurevych, Parsing argumentation structures in persuasive essays, Computational Linguistics 43 (2017) 619-659. URL: https://aclanthology.org/J17-3005. doi:10.1162/COLI_a_00295.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] P. Potash, A. Romanov, A. Rumshisky, Here's my point: Joint pointer architecture for argument mining, in: M. P. et al. (Ed.), Proceedings of EMNLP 2017, ACL, 2017, pp. 1364-1373. URL: https://doi.org/10.18653/v1/d17-1143. doi:10.18653/V1/D17-1143.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] T. Kuribayashi, H. Ouchi, N. Inoue, P. Reisert, T. Miyoshi, J. Suzuki, K. Inui, An empirical study of span representations in argumentation structure parsing, in: A. K. et al. (Ed.), Proceedings of ACL 2019, ACL, Florence, Italy, 2019, pp. 4691-4698. URL: https://aclanthology.org/P19-1464. doi:10.18653/v1/P19-1464.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] U. Mushtaq, J. Cabessa, Argument classification with BERT plus contextual, structural and syntactic features as text, in: M. T. et al. (Ed.), Proceedings of ICONIP 2022, volume 1791 of CCIS, Springer, 2022, pp. 622-633. URL: https://doi.org/10.1007/978-981-99-1639-9_52. doi:10.1007/978-981-99-1639-9_52.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] U. Mushtaq, J. Cabessa, Argument mining with modular BERT and transfer learning, in: Proceedings of IJCNN 2023, IEEE, 2023, pp. 1-8. URL: https://doi.org/10.1109/IJCNN54540.2023.10191968. doi:10.1109/IJCNN54540.2023.10191968.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, J. Wen, A survey of large language models, CoRR abs/2303.18223 (2023). URL: https://doi.org/10.48550/arXiv.2303.18223. doi:10.48550/ARXIV.2303.18223. arXiv:2303.18223.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, Z. Sui, A survey on in-context learning, CoRR abs/2301.00234 (2023). URL: https://doi.org/10.48550/arXiv.2301.00234. doi:10.48550/ARXIV.2301.00234. arXiv:2301.00234.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Nori, et al., Can generalist foundation models outcompete special-purpose tuning? Case study in medicine, CoRR abs/2311.16452 (2023). URL: https://doi.org/10.48550/arXiv.2311.16452. doi:10.48550/ARXIV.2311.16452.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[29] T. Mayer, E. Cabrio, S. Villata, Transformer-based argument mining for healthcare applications, in: G. D. G. et al. (Ed.), Proceedings of ECAI 2020, volume 325 of FAIA, IOS Press, 2020, pp. 2108-2115. URL: https://doi.org/10.3233/FAIA200334. doi:10.3233/FAIA200334.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[30] B. Molinet, S. Marro, E. Cabrio, S. Villata, T. Mayer, ACTA 2.0: A modular architecture for multi-layer argumentative analysis of clinical trials, in: L. D. Raedt (Ed.), Proceedings of IJCAI-22, International Joint Conferences on Artificial Intelligence Organization, 2022, pp. 5940-5943. URL: https://doi.org/10.24963/ijcai.2022/859. doi:10.24963/ijcai.2022/859. Demo Track.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[31] T. Mayer, S. Marro, E. Cabrio, S. Villata, Enhancing evidence-based medicine with natural language argumentative analysis of clinical trials, Artificial Intelligence in Medicine 118 (2021) 102098. URL: https://www.sciencedirect.com/science/article/pii/S0933365721000919. doi:10.1016/j.artmed.2021.102098.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>van der Meer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reuver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Khurana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Santamaría</surname>
          </string-name>
          ,
          <article-title>Will it blend? mixing training paradigms &amp; prompting for argument quality prediction</article-title>
          , in: G.
          <string-name>
            <surname>Lapesa</surname>
          </string-name>
          , et al. (Eds.),
          <source>ArgMining@COLING</source>
          <year>2022</year>
          , ICCL,
          <year>2022</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>103</lpage>
          . URL: https://aclanthology.org/2022.argmining-1.8.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pojoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dumani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schenkel</surname>
          </string-name>
          ,
          <article-title>Argument mining from podcasts using ChatGPT</article-title>
          , in: L.
          <string-name>
            <surname>Malburg</surname>
          </string-name>
          , D. Verma (Eds.),
          <source>Proceedings of ICCBR-WS</source>
          <year>2023</year>
          , volume
          <volume>3438</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2023</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>144</lpage>
          . URL: https://ceur-ws.org/Vol-3438/paper_10.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Al Zubaer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mitrović</surname>
          </string-name>
          ,
          <article-title>Performance analysis of large language models in the domain of legal argument mining</article-title>
          ,
          <source>Frontiers in Artificial Intelligence</source>
          <volume>6</volume>
          (
          <year>2023</year>
          ). URL: https://www.frontiersin.org/articles/10.3389/frai.2023.1278796. doi:10.3389/frai.2023.1278796.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Batista-Navarro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <article-title>Argument mining as a multi-hop generative machine reading comprehension task</article-title>
          , in:
          <source>The 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=KTFxOnrbvu.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>LlamaFactory: Unified efficient fine-tuning of 100+ language models</article-title>
          , in:
          <source>Proceedings of the 62nd Annual Meeting of the ACL (Volume 3: System Demonstrations)</source>
          , ACL, Bangkok, Thailand,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2403.13372.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>