Argument Mining in BioMedicine: Zero-Shot, In-Context Learning and Fine-tuning with LLMs

Jérémie Cabessa1,∗,†, Hugo Hernault2,† and Umer Mushtaq3,†

1 David Lab, University of Versailles Saint-Quentin (UVSQ) – University of Paris-Saclay, 78000 Versailles, France
  Institute of Computer Science of the Czech Academy of Sciences, 18207 Prague 8, Czech Republic
2 Playtika Ltd., CH-1003 Lausanne, Switzerland
3 Laboratoire Informatique, Image, Interaction (L3i), University of La Rochelle, 17042 La Rochelle, France


Abstract
Argument Mining (AM) aims to extract the complex argumentative structure of a text, and Argument Type Classification (ATC) is an essential sub-task of AM. Large Language Models (LLMs) have shown impressive capabilities in most NLP tasks and beyond. However, fine-tuning LLMs can be challenging. In-Context Learning (ICL) has been suggested as a bridging paradigm between the training-free and fine-tuning settings for LLMs. In ICL, an LLM is conditioned to solve tasks using a few solved demonstration examples included in its prompt. We focus on AM in the biomedical AbstRCT dataset. We address ATC using quantized and unquantized LLaMA-3 models through zero-shot learning, in-context learning, and fine-tuning approaches. We introduce a novel ICL strategy that combines kNN-based example selection with majority vote ensembling, along with a well-designed fine-tuning strategy for ATC. In the zero-shot setting, we show that LLaMA-3 fails to achieve acceptable classification results, suggesting the need for additional training modalities. However, in our training-free ICL setting, LLaMA-3 can leverage relevant information from only a few demonstration examples to achieve very competitive results. Finally, in our fine-tuning setting, LLaMA-3 achieves state-of-the-art performance on the ATC task on the AbstRCT dataset.

Keywords
Argument Mining, NLP, LLMs, LLaMA-3, Zero-Shot Learning, In-Context Learning, Fine-tuning, Ensembling



1. Introduction

Argument Mining (AM) focuses on extracting the underlying argumentative and discursive structure from raw text [1]. Argument Type Classification (ATC), which involves classifying argumentative units in text according to their argumentative roles, is a crucial sub-task of AM. Research has shown that the argumentative role of a unit cannot be inferred solely from its text: additional structural and contextual information is needed [2]. This additional information can be incorporated via feature engineering [2], memory-enabled neural architectures [3, 4] or LLM-based hybrid methods [5, 6].

Large Language Models (LLMs) have become ubiquitous in deep learning and have shown impressive capabilities in most NLP tasks [7]. In the main, LLMs are used in two distinct settings: (i) training-free, where the pre-trained LLM is used for inference without any parameter adjustment, and (ii) fine-tuning, where the parameters of the LLM are updated through supervised training to enable transfer learning on a downstream task. Zero-shot learning refers to the training-free approach where a pre-trained LLM is prompted to solve tasks on completely unseen data samples.

Recently, In-Context Learning (ICL) has been proposed as a bridging paradigm between the training-free and fine-tuning settings. ICL is a prompt engineering technique whereby an LLM is conditioned to solve tasks by means of a few solved demonstration examples included as part of its input prompt [8]. Generally, the input prompt includes task instructions and the current input sample to be solved, as well as several solved input-output pair examples. In this way, ICL maintains the training-free posture (parameters frozen) of the LLM while at the same time providing it with some supervision through demonstration examples. It also enables direct incorporation of selected features inside the prompt template, thereby obviating the need for architecture customization. Creative ICL strategies combining kNN-based example selection, generated chain-of-thought (CoT) prompting, and majority vote ensembling have been proposed and shown to outperform fine-tuning approaches [9, 10, 11, 12]. In the main, kNN-based example selection optimizes the process of learning from few examples, and ensembling increases the robustness of the predictions [13, 9, 11].

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04 — 06, 2024, Pisa, Italy
∗ Corresponding author.
† The authors contributed equally.
Email: jeremie.cabessa@uvsq.fr (J. Cabessa); hugoh@playtika.com (H. Hernault); umer.mushtaq@univ-lr.fr (U. Mushtaq)
ORCID: 0000-0002-5394-5249 (J. Cabessa); 0009-0003-4403-6048 (U. Mushtaq)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

This work focuses on AM in the biomedical AbstRCT dataset [14]. More specifically, we address the ATC task using quantized and unquantized LLaMA-3 models, among the most capable openly available LLMs (cf. leaderboard), through zero-shot learning, in-context learning,




and fine-tuning approaches. Our contributions are as follows:

     • In the zero-shot learning setting, we show that LLaMA-3 fails to achieve acceptable classification results, suggesting the need for implementing additional training modalities.

     • We introduce a novel ICL strategy that combines kNN-based example selection with majority vote ensembling. In this training-free setting, LLaMA-3 can leverage relevant information from only a few demonstration examples to achieve very competitive results.

     • We further experiment with a fine-tuning strategy for LLaMA-3. In this setting, we achieve state-of-the-art performance on the ATC task for the AbstRCT dataset.

Our code is freely available on GitHub.

2. Related Work

In early works, Argument Mining has been approached using both classical algorithms such as SVMs [15, 2, 16, 17] as well as recurrent neural network models such as BiLSTMs [18, 19, 4]. Transformer-based models, such as BERT [20], have also been utilized for AM, including multi-scale argument modelling and customized feature-injected BERT-based models [21, 22, 23, 5, 6, 24, 25]. AM in the biomedical AbstRCT dataset has been approached using LSTMs [26, 27], sequential transfer learning [28] as well as transformer-based models [29, 30, 31].

More recently, AM sub-tasks have been modeled as text generation tasks using LLMs. For the Argument Type Classification (ATC) sub-task, this approach involves using a prompt template to generate the corresponding class of an argument component. This method has been applied to various AM use-cases, such as podcast transcripts and legal documents [32, 33, 34]. The latest approach in this ‘AM using LLM text generation’ direction involves a prompt that includes the argument component as the query and the complete text as the context, to output the class of the argument component using a generative model [35]. In this study, the three AM sub-tasks are modeled using the Persuasive Essays (PE) and AbstRCT datasets.

In contrast to the fine-tuning approach, a relevant training-free ICL prompting strategy for LLMs has been proposed [9, 11]. This strategy combines kNN-based example selection, generated chain-of-thought prompting, and majority vote ensembling for few-shot classification. Interestingly, the ICL strategy outperforms the fine-tuning approach on the datasets used in the study.

Our work sits at the intersection of zero-shot learning, in-context learning and fine-tuning. We implement and compare the performance of the latest openly available LLMs using these three approaches for AM on the AbstRCT dataset.

3. Methodology

3.1. Datasets

We consider the AbstRCT dataset, which consists of abstracts of 650 Randomized Controlled Trials selected from the biomedical database PubMed [14]. For the AbstRCT dataset, the Neoplasm train set (Neo-train) consists of 350 abstracts, whereas the three Neoplasm, Glaucoma and Mixed test sets (Neo-test, Gla-test and Mix-test, respectively) consist of 100 abstracts each. The statistics of the AbstRCT dataset are given in Table 1. The argument type classification (ATC) task consists of predicting the type of each argument component (AC) as ‘Major Claim’, ‘Claim’ or ‘Premise’. Following previous approaches, we combine the ‘Major Claim’ and ‘Claim’ classes into a single class ‘Claim’.

    Dataset Split    Abstracts    ACs
    Neo-train        350          2,291
    Neo-test         100          691
    Gla-test         100          615
    Mix-test         100          609

Table 1
AbstRCT dataset statistics.

A sample of the AbstRCT dataset is provided below. The argument components (ACs) and their corresponding classes are indicated by bold tags.

A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer. The purpose of this study was to assess the effects of these treatments on health-related quality of life (HQL). Men with metastatic prostate cancer (n = 161) were randomized to receive either daily prednisone alone or mitoxantrone (every 3 weeks) plus prednisone. Those who received prednisone alone could have mitoxantrone added after 6 weeks if there was no improvement in pain. HQL was assessed before treatment initiation and then every 3 weeks using the European Organization for Research and Treatment of Cancer Quality-of-Life Questionnaire C30 (EORTC QLQ-C30) and the Quality of Life Module-Prostate 14 (QOLM-P14), a trial-specific module developed for this study. An intent-to-treat analysis was used to determine the mean duration of HQL improvement and differences in improvement duration between groups of patients. At 6 weeks, both groups showed improvement in several HQL domains, and only physical functioning and pain were better in the mitoxantrone-plus-prednisone group than in the prednisone-alone group. After 6 weeks, patients taking prednisone showed no improvement in HQL scores, whereas those taking mitoxantrone plus prednisone showed significant improvements in global quality of life (P = .009), four functioning domains, and nine symptoms (.001 < P < .01),
and the improvement (> 10 units on a scale of 0 to 100) lasted longer than in the prednisone-alone group (.004 < P < .05). The addition of mitoxantrone to prednisone after failure of prednisone alone was associated with improvements in pain, pain impact, pain relief, insomnia, and global quality of life (.001 < P < .003). Treatment with mitoxantrone plus prednisone was associated with greater and longer-lasting improvement in several HQL domains and symptoms than treatment with prednisone alone.

3.2. Zero-Shot Learning (ZSL) and In-Context Learning (ICL)

Zero-shot learning (ZSL) is the paradigm where the LLM is asked to solve a downstream task without receiving any specific solved examples in the prompt. By contrast, in-context learning (ICL) refers to the emergent ability of LLMs to solve a downstream task based on a few demonstration examples given in the prompt as contextual information [8]. As their major advantage, the ZSL and ICL paradigms do not require any fine-tuning of the model’s parameters (i.e., a training-free framework).

Formally, let x be a query input text and C = [I; t(x_i1, y_i1); …; t(x_ik, y_ik)] be a context composed of instructions I concatenated with input-output pairs (x_ij, y_ij) in text format, where X = {x1, x2, …} and Y = {y1, …, yk} are the sets of possible inputs and outputs, respectively. The ZSL and ICL paradigms correspond to the cases where k = 0 and k > 0, respectively. For input x, the LLM M predicts the output ŷ such that

    ŷ = arg max_{yi ∈ Y} P_M(yi | C; x),

where P_M(yi | C; x) is the probability that M generates yi when C and x are given as prompt. The main rationale behind ZSL and ICL is that the consideration of a well-chosen context C increases the probability of M predicting the correct answer y for input x, i.e., that P_M(y | C; x) > P_M(y | x).

We consider a 2-step ICL strategy for argument type classification (ATC) inspired by a recent study [9] (see Figure 1). More precisely, let A be an abstract containing argument components (ACs) c1, …, cm with corresponding true classes y1, …, ym, where each yi ∈ {Claim, Premise}. Given the ACs c1, …, cm in the prompt, the LLM generates the corresponding class predictions ŷ1, …, ŷm as follows:

 (1) kNN-based example selection (k = 3, 5): First, 2k neighboring abstracts A1, …, A2k of A are selected according to the following similarity measure. For any abstract Ai, let the signature of Ai be the embedding of the first sentence of Ai using the BioBERT model. The abstracts A1, …, A2k are the ones whose signatures are the closest, with respect to cosine similarity, to the signature of A. Then, k abstracts, Ai1, …, Aik, are randomly chosen from A1, …, A2k. Afterwards, a prompt containing all the ACs and their corresponding classes in these k abstracts is constructed (kNN). Finally, the LLM predicts the classes ŷ1, …, ŷm of c1, …, cm on the basis of this prompt.

 (2) n-Ensembling (n = 3, 5): The kNN-based example selection step, which involves randomness, is repeated n times (nEns), leading to a set of n sequences of class predictions {(ŷ_i,1, …, ŷ_i,m) : i = 1, …, n}. The final class predictions ŷ1, …, ŷm of c1, …, cm are obtained by applying a component-wise majority vote to the n prediction sequences.

The kNN-based example selection optimizes learning from few examples by selecting samples most similar to the current instance, rather than choosing them randomly. The ensembling step increases prediction robustness by selecting the most frequent predictions. Note that the relevance of the ensembling step relies on the random selection in the kNN step. This randomness ensures that the same predictions are not always produced, allowing for majority voting and thereby increasing robustness.

To aid the LLM in generating predictions, additional task-specific information is typically included in the prompt. For example, definitions of the ‘Claim’ and ‘Premise’ classes, along with their statistics in the Neo-train set, can be incorporated in the prompt (info). Moreover, in addition to the ACs c1, …, cm whose classes are to be predicted, the abstract text from which these ACs originate can be included in the prompt (abstract). According to this ICL strategy, the classes ŷ1, …, ŷm of c1, …, cm are predicted all-at-once (see Figure 1). Therefore, a prompt of the form ‘info + abstract + 3NN + 3Ens’ (see Table 3) indicates that the argument components (ACs) of the abstract are predicted all-at-once, by incorporating additional information and the entire abstract text as contextual cues in the prompt, and employing the ICL strategy with 3NN-based example selection and 3-ensembling. A similar ICL strategy, where the classes ŷ1, …, ŷm are inferred one-by-one (i.e., each model inference leads to a single prediction ŷj), has been considered but shown to be significantly less efficient. Due to space constraints, the latter results are omitted in this work.

3.3. Fine-tuning

Fine-tuning (FT) refers to the process of further training a pre-trained LLM on a downstream task. Previous studies indicate that relying solely on the text of an argument component is insufficient for predicting its argumentative class; additional contextual information is essential for achieving competitive classification accuracy [2, 5, 6]. Therefore, we propose a fine-tuning strategy that models the ATC task at the document level. Specifically, we incorporate task-specific information into each training
[Figure 1 diagram omitted; labels: info, abstract, ex. 1 … ex. k, LLM, preds, final preds, kNN, nENS.]

Figure 1: 2-step ICL approach: a kNN-based example selection (k = 3, 5) step followed by an n-Ensembling (n = 3) step (cf. text for further details). For each abstract A, the class predictions ŷ1, …, ŷm of all of its ACs x1, …, xm are generated in one inference step (all-at-once modality).




sample and generate the class label predictions for the ACs of an abstract all-at-once.

3.4. Implementation Details

As the embedding engine, we use dmis-lab’s BioBERT (https://huggingface.co/dmis-lab). For zero-shot learning, ICL and fine-tuning, we experiment with the LLaMA-3-8B-Instruct and LLaMA-3-70B-Instruct models, as well as various GGML-quantized configurations of them (https://github.com/ggerganov/ggml). For ICL, we set the generation temperature to 0.1. For fine-tuning, we use LoRA adapters with a loraplus_lr_ratio of 16.0. We set a batch size of 2 and a learning rate of 5e−5. For implementation, we use the LLaMA-Factory framework (https://github.com/hiyouga/LLaMA-Factory) [36]. Examples of the prompts we use for zero-shot learning, in-context learning and fine-tuning with LLaMA-3 are given in Appendix A.

4. Results

4.1. Zero-Shot Learning

The results for zero-shot learning (ZSL) on the ATC task are reported in Table 2. Recall that zero-shot learning corresponds to the prompting strategy where no nearest neighbors are included as demonstration examples, referred to as ‘info + abstract + 0NN’ in our notation. In an initial experimentation phase, we observed that adding complementary information (info) (definitions of ‘Claim’ and ‘Premise’ and dataset statistics) and including the entire text of the abstract (abstract) significantly improve the results. These expected observations serve as an ablation study and justify the usage of the additional information and full abstract text (prompt template ‘info + abstract’) in all subsequent experiments.

In all experiments, we observed that the models consistently generated the correct number of classes for each inference task. This observation remains valid for the subsequent ICL and fine-tuning settings. It demonstrates the model’s capability to understand the correspondence between the number of input ACs and the number of classes to predict.

In the ZSL training-free setting, across the Neo, Gla and Mix test sets, the performance of the LLMs strongly correlated with the complexity of these models, achieving maximal macro F1-scores of 0.698, 0.819 and 0.725, respectively. Overall, in ZSL, the LLMs fail to achieve acceptable results. These considerations underscore the need for implementing additional learning modalities to address the ATC task effectively.

4.2. In-Context Learning

The results for in-context learning (ICL) on the ATC task are reported in Table 3. First, note that the transition from zero-shot learning (‘info + abstract + 0NN’, Table 2) to in-context learning (‘info + abstract + kNN’, Table 3) drastically improves the results. This validates the effectiveness of the kNN-based example selection method.

In addition, except for the Mix test set, the 3NN strategy consistently outperforms the 5NN strategy, suggesting that three examples suffice for optimally learning the ATC task in an ICL setting. The inclusion of more demonstration examples correlates with a significant increase
   Model                             C      P      F1          Prompt                               C       P      F1
                         Neo test                                                     Neo test
   LLaMA-3-8b-Instruct-bnb-4bit 0.529 0.539 0.534              LLaMA-3-8b-Instruct
   LLaMA-3-8b-Instruct           0.544 0.558 0.551             info + abstract + 3NN               0.832 0.912   0.872
   LLaMA-3-70b-Instruct-bnb-4bit 0.642 0.753 0.698             info + abstract + 5NN               0.843 0.914   0.878
                                                               info + abstract + 3NN + 3Ens        0.844 0.917   0.880
                         Gla test
                                                               LLaMA-3-8b-Instruct-bnb-4bit
   LLaMA-3-8b-Instruct-bnb-4bit 0.553 0.635 0.594
                                                               info + abstract + 3NN               0.847 0.916   0.881
   LLaMA-3-8b-Instruct           0.569 0.692 0.631
                                                               info + abstract + 5NN               0.817 0.890   0.853
   LLaMA-3-70b-Instruct-bnb-4bit 0.755 0.882 0.819
                                                               info + abstract + 3NN + 3Ens        0.848 0.919   0.884
                         Mix test
                                                               LLaMA-3-70b-Instruct-bnb-4bit
   LLaMA-3-8b-Instruct-bnb-4bit 0.546 0.524 0.535              info + abstract + 3NN         0.870 0.935 0.903
   LLaMA-3-8b-Instruct           0.563 0.564 0.563             info + abstract + 5NN         0.863 0.930 0.896
   LLaMA-3-70b-Instruct-bnb-4bit 0.671 0.779 0.725             info + abstract + 3NN + 3Ens  0.884 0.941 0.912

Table 2                                                                                Gla test
Zero-shot results for ATC on three test sets of the AbstRTC    LLaMA-3-8b-Instruct
dataset using LLaMA-3.                                         info + abstract + 3NN               0.834 0.929 0.882
                                                               info + abstract + 5NN               0.836 0.925 0.881
                                                               info + abstract + 3NN + 3Ens        0.872 0.947 0.910
in prompt length, potentially hindering the performance        LLaMA-3-8b-Instruct-bnb-4bit
of the LLM or exceeding the maximum size of its con-           info + abstract + 3NN               0.827 0.924   0.875
text. Furthermore, the ensembling strategy consistently        info + abstract + 5NN               0.816 0.916   0.866
improves the results, even if only slightly, ensuring that     info + abstract + 3NN + 3Ens        0.832 0.928   0.880
the robustness of the results can indeed be strengthened
through ensembling predictions.
   Overall, the training-free ICL strategy achieves very
competitive F1-scores of 0.912, 0.910, and 0.929 on the
Neo, Mix, and Gla test sets, respectively. However, these
results remain lower than those obtained by previous
training-dependent models (see Table 4, upper rows).

4.3. Fine-Tuning

The results achieved by the fine-tuning (FT) strategy on
the ATC task are reported in Table 4. Our results show
that fine-tuning significantly outperforms ICL. These
findings suggest that the argumentative flow within ab-
stracts cannot be inferred solely from the knowledge
acquired during pre-training, and requires additional
parameter updates to be effectively learned.
   In this training-dependent context, we achieve maxi-
mal F1-scores of 0.935, 0.913, and 0.951 on the Neo, Gla,
and Mix test sets, respectively, establishing new state-
of-the-art results for the Neo and Mix test sets. These
results suggest once again that the sequentiality of argu-
ments inside a specific corpus requires fine-tuning to be
optimally captured.

  LLaMA-3-70b-Instruct-bnb-4bit
  info + abstract + 3NN          0.868   0.946   0.907
  info + abstract + 5NN          0.865   0.945   0.905
  info + abstract + 3NN + 3Ens   0.863   0.944   0.903

                         Mix test

  LLaMA-3-8b-Instruct
  info + abstract + 3NN          0.879   0.938   0.909
  info + abstract + 5NN          0.898   0.944   0.921
  info + abstract + 3NN + 3Ens   0.884   0.940   0.912

  LLaMA-3-8b-Instruct-bnb-4bit
  info + abstract + 3NN          0.859   0.926   0.893
  info + abstract + 5NN          0.866   0.922   0.894
  info + abstract + 3NN + 3Ens   0.885   0.940   0.913

  LLaMA-3-70b-Instruct-bnb-4bit
  info + abstract + 3NN          0.905   0.954   0.929
  info + abstract + 5NN          0.906   0.952   0.929
  info + abstract + 3NN + 3Ens   0.904   0.952   0.928

Table 3
Results for ATC on the three test sets of the AbstRCT
dataset with LLaMA-3 models using the 2-step ICL strategy
described in the text.
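The two steps of the ICL strategy above — retrieving the k solved abstracts most similar to the query as demonstrations (the 3NN/5NN settings), then aggregating the labels of repeated runs by majority vote (the 3Ens setting) — can be sketched as follows. This is a minimal illustration: the helper names and the toy list-based embeddings are our own simplifications, not the paper's actual retrieval model or implementation.

```python
from collections import Counter

def knn_demos(query_vec, pool, k=3):
    """Return the k solved examples whose embedding is most similar
    (by cosine similarity) to the query abstract's embedding.

    `pool` is a list of (embedding, solved_example) pairs; embeddings are
    plain lists of floats here, but in practice they would come from a
    sentence encoder (an assumption of this sketch).
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    ranked = sorted(pool, key=lambda p: cos(query_vec, p[0]), reverse=True)
    return [ex for _, ex in ranked[:k]]

def majority_vote(predictions):
    """Ensemble per-argument labels from several ICL runs: each argument
    keeps the Claim/Premise label predicted most often across runs."""
    n_args = len(predictions[0])
    return [Counter(run[i] for run in predictions).most_common(1)[0][0]
            for i in range(n_args)]
```

For instance, three runs predicting `["Claim", "Premise"]`, `["Claim", "Claim"]`, and `["Premise", "Claim"]` for a two-argument abstract would be ensembled to `["Claim", "Claim"]`.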
5. Conclusion

In this work, we address argument type classification
(ATC) in the biomedical AbstRCT dataset with openly
available LLaMA-3 from the three-fold perspective of
zero-shot learning (ZSL), in-context learning (ICL) and
fine-tuning (FT). We show that ZSL fails to achieve accept-
able performance, ICL significantly improves the results,
and FT reaches state-of-the-art performance.

  Model                          Neo     Gla     Mix

  ResAttArg(Ensemble) [27]       0.879   0.877   0.897
  SeqMT [28]                     0.919   0.924   0.922
  MRC_GEN [35]                   0.928   0.926   0.940
  GIAM [25]                      0.930   0.928   0.936

  LLaMA-3-8B-Instruct            0.919   0.908   0.939
  LLaMA-3-8B-Instruct-bnb-4bit   0.935   0.910   0.953
  LLaMA-3-70B-Instruct           0.929   0.913   0.940
  LLaMA-3-70B-Instruct-bnb-4bit  0.921   0.908   0.951

Table 4
Fine-tuning results for the ATC task on the three test sets
of the AbstRCT dataset using LLaMA-3.

   These results support the fact that the ATC task cannot
be solved in a zero-shot setting by relying solely on the
general-purpose language capabilities acquired during
pre-training. Additional learning is essential, either in
the form of solved demonstration examples (ICL) or via
parameter updates (FT). We conjecture that the sequen-
tial flow of arguments within a text is a corpus-specific
feature that cannot be inferred through zero-shot meth-
ods.
   Previous works demonstrated that the text of argument
components alone does not suffice to infer their argumen-
tative roles [2, 4, 6]. Additional contextual, structural
and syntactic features are necessary. In our ICL and FT
settings, comprehensive contextual and structural infor-
mation is incorporated through task-specific information
and the complete abstract text provided in the prompt.
This information enables the model to discern the se-
quence of arguments, their associated markers, and other
characteristics closely associated with their argumenta-
tive roles.
   For future work, the design and implementation of a
full AM pipeline using LLMs represents a major mile-
stone. In this scenario, the LLM would take raw texts as
input and produce a detailed map of the argumentative
structure as output. We believe that LLMs will substan-
tially transform the landscape of AM and its practical
applications.

Acknowledgments

This work benefited from access to the computing re-
sources of the L3i laboratory, operated and hosted by the
University of La Rochelle and financed by the French
government and the Region Nouvelle-Aquitaine. This
research also benefited from institutional support RVO:
67985807 and was partially supported by grant No.
GA22-02067S of the Czech Science Foundation. Finally,
we are grateful to Playtika Ltd. for their support of this
research.

References

 [1] R. M. Palau, M.-F. Moens, Argumentation mining: The
     detection, classification and structure of arguments in
     text, in: Proceedings of ICAIL 2009, ICAIL '09, ACM,
     New York, NY, USA, 2009, pp. 98–107. URL: https://
     doi.org/10.1145/1568234.1568246. doi:10.1145/1568234.
     1568246.
 [2] C. Stab, I. Gurevych, Parsing argumentation structures
     in persuasive essays, Computational Linguistics 43
     (2017) 619–659. URL: https://aclanthology.org/J17-3005.
     doi:10.1162/COLI_a_00295.
 [3] P. Potash, A. Romanov, A. Rumshisky, Here's my point:
     Joint pointer architecture for argument mining, in:
     M. P. et al. (Ed.), Proceedings of EMNLP 2017, ACL,
     2017, pp. 1364–1373. URL: https://doi.org/10.18653/v1/
     d17-1143. doi:10.18653/v1/d17-1143.
 [4] T. Kuribayashi, H. Ouchi, N. Inoue, P. Reisert,
     T. Miyoshi, J. Suzuki, K. Inui, An empirical study of
     span representations in argumentation structure parsing,
     in: A. K. et al. (Ed.), Proceedings of ACL 2019, ACL,
     Florence, Italy, 2019, pp. 4691–4698. URL: https://
     aclanthology.org/P19-1464. doi:10.18653/v1/P19-1464.
 [5] U. Mushtaq, J. Cabessa, Argument classification with
     BERT plus contextual, structural and syntactic features
     as text, in: M. T. et al. (Ed.), Proceedings of ICONIP
     2022, volume 1791 of CCIS, Springer, 2022, pp. 622–633.
     URL: https://doi.org/10.1007/978-981-99-1639-9_52.
     doi:10.1007/978-981-99-1639-9_52.
 [6] U. Mushtaq, J. Cabessa, Argument mining with modular
     BERT and transfer learning, in: Proceedings of IJCNN
     2023, IEEE, 2023, pp. 1–8. URL: https://doi.org/10.1109/
     IJCNN54540.2023.10191968. doi:10.1109/IJCNN54540.
     2023.10191968.
 [7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou,
     Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang,
     Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang,
     Z. Liu, P. Liu, J. Nie, J. Wen, A survey of large
     language models, CoRR abs/2303.18223 (2023). URL:
     https://doi.org/10.48550/arXiv.2303.18223. doi:10.48550/
     arXiv.2303.18223. arXiv:2303.18223.
 [8] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang,
     X. Sun, J. Xu, L. Li, Z. Sui, A survey on in-context
     learning, CoRR abs/2301.00234 (2023). URL: https://
     doi.org/10.48550/arXiv.2301.00234. doi:10.48550/arXiv.
     2301.00234. arXiv:2301.00234.
 [9] H. Nori, et al., Can generalist foundation models
     outcompete special-purpose tuning? Case study in
     medicine, CoRR abs/2311.16452 (2023). URL: https://
     doi.org/10.48550/arXiv.2311.16452. doi:10.48550/arXiv.
     2311.16452. arXiv:2311.16452.
[10] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter,
     F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought
     prompting elicits reasoning in large language models,
     in: S. K. et al. (Ed.), Proceedings of NeurIPS 2022,
     volume 35, 2022, pp. 24824–24837. URL: https://
     proceedings.neurips.cc/paper_files/paper/2022/file/
     9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[11] S. Lei, G. Dong, X. Wang, K. Wang, S. Wang,
     InstructERC: Reforming emotion recognition in
     conversation with a retrieval multi-task LLMs framework,
     CoRR abs/2309.11911 (2023). URL: https://doi.org/10.
     48550/arXiv.2309.11911. doi:10.48550/arXiv.2309.11911.
     arXiv:2309.11911.
[12] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi,
     S. Narang, A. Chowdhery, D. Zhou, Self-consistency
     improves chain of thought reasoning in language models,
     2023. arXiv:2203.11171.
[13] H. Nori, N. King, S. M. McKinney, D. Carignan,
     E. Horvitz, Capabilities of GPT-4 on medical challenge
     problems, CoRR abs/2303.13375 (2023). URL: https://
     doi.org/10.48550/arXiv.2303.13375. doi:10.48550/arXiv.
     2303.13375. arXiv:2303.13375.
[14] T. Mayer, Argument Mining on Clinical Trials, PhD
     thesis, Université Côte d'Azur, 2020. URL: https://
     theses.hal.science/tel-03209489.
[15] R. Mochales, M. Moens, Argumentation mining,
     Artificial Intelligence and Law 19 (2011) 1–22.
     doi:10.1007/s10506-010-9104-x.
[16] I. Habernal, I. Gurevych, Argumentation mining in
     user-generated web discourse, Computational Linguistics
     43 (2017) 125–179. URL: https://aclanthology.org/
     J17-1004. doi:10.1162/COLI_a_00276.
[17] R. Levy, Y. Bilu, D. Hershcovich, E. Aharoni, N. Slonim,
     Context dependent claim detection, in: ICCL, 2014. URL:
     https://api.semanticscholar.org/CorpusID:18847466.
[18] S. Eger, J. Daxenberger, I. Gurevych, Neural end-to-end
     learning for computational argumentation mining, in:
     R. Barzilay, M.-Y. Kan (Eds.), Proceedings of ACL 2017,
     ACL, Vancouver, Canada, 2017, pp. 11–22. URL: https://
     aclanthology.org/P17-1002. doi:10.18653/v1/P17-1002.
[19] V. Niculae, J. Park, C. Cardie, Argument mining with
     structured SVMs and RNNs, in: R. Barzilay, M.-Y. Kan
     (Eds.), Proceedings of ACL 2017, ACL, Vancouver,
     Canada, 2017, pp. 985–995. URL: https://aclanthology.
     org/P17-1091. doi:10.18653/v1/P17-1091.
[20] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT:
     Pre-training of deep bidirectional transformers for
     language understanding, in: J. B. et al. (Ed.),
     Proceedings of NAACL-HLT 2019, ACL, 2019, pp. 4171–
     4186. URL: https://doi.org/10.18653/v1/n19-1423.
     doi:10.18653/v1/n19-1423.
[21] G. Zhang, P. Nulty, D. Lillis, Enhancing legal argument
     mining with domain pre-training and neural networks,
     CoRR abs/2202.13457 (2022). URL: https://arxiv.org/
     abs/2202.13457. arXiv:2202.13457.
[22] H. Wang, Z. Huang, Y. Dou, Y. Hong, Argumentation
     mining on essays at multi scales, in: D. S. et al. (Ed.),
     Proceedings of COLING 2020, ICCL, Barcelona, Spain
     (Online), 2020, pp. 5480–5493. URL: https://
     aclanthology.org/2020.coling-main.478. doi:10.18653/v1/
     2020.coling-main.478.
[23] S. Fioravanti, A. Zugarini, F. Giannini, L. Rigutini,
     M. Maggini, M. Diligenti, Linguistic feature injection
     for efficient natural language processing, in: IJCNN
     2023, June 18-23, 2023, IEEE, 2023, pp. 1–7. URL:
     https://doi.org/10.1109/IJCNN54540.2023.10191680.
     doi:10.1109/IJCNN54540.2023.10191680.
[24] J. Bao, C. Fan, J. Wu, Y. Dang, J. Du, R. Xu, A neural
     transition-based model for argumentation mining, in:
     C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings
     of the 59th Annual Meeting of the ACL and the 11th
     International Joint Conference on Natural Language
     Processing (Volume 1: Long Papers), ACL, Online, 2021,
     pp. 6354–6364. URL: https://aclanthology.org/2021.
     acl-long.497. doi:10.18653/v1/2021.acl-long.497.
[25] B. Liu, V. Schlegel, P. Thompson, R. T. Batista-Navarro,
     S. Ananiadou, Global information-aware argument mining
     based on a top-down multi-turn QA model, Information
     Processing & Management 60 (2023) 103445. URL: https://
     www.sciencedirect.com/science/article/pii/
     S0306457323001826. doi:10.1016/j.ipm.2023.103445.
[26] A. Galassi, M. Lippi, P. Torroni, Argumentative link
     prediction using residual networks and multi-objective
     learning, in: N. Slonim, R. Aharonov (Eds.),
     Proceedings of the 5th Workshop on Argument Mining,
     ACL, Brussels, Belgium, 2018, pp. 1–10. URL: https://
     aclanthology.org/W18-5201. doi:10.18653/v1/W18-5201.
[27] A. Galassi, M. Lippi, P. Torroni, Multi-task attentive
     residual networks for argument mining, IEEE/ACM
     Transactions on Audio, Speech, and Language Processing
     31 (2023) 1877–1892. doi:10.1109/TASLP.2023.3275040.
[28] J. Si, L. Sun, D. Zhou, J. Ren, L. Li, Biomedical
     argument mining based on sequential multi-task learning,
     IEEE/ACM Trans. Comput. Biol. Bioinformatics 20 (2022)
     864–874. URL: https://doi.org/10.1109/TCBB.2022.
     3173447. doi:10.1109/TCBB.2022.3173447.
[29] T. Mayer, E. Cabrio, S. Villata, Transformer-based
     argument mining for healthcare applications, in:
     G. D. G. et al. (Ed.), Proceedings of ECAI 2020, vol-
     ume 325 of FAIA, IOS Press, 2020, pp. 2108–2115.
     URL: https://doi.org/10.3233/FAIA200334. doi:10.
     3233/FAIA200334.
[30] B. Molinet, S. Marro, E. Cabrio, S. Villata, T. Mayer,
     Acta 2.0: A modular architecture for multi-layer
     argumentative analysis of clinical trials, in: L. D.
     Raedt (Ed.), Proceedings of IJCAI-22, International
     Joint Conferences on Artificial Intelligence Orga-
     nization, 2022, pp. 5940–5943. URL: https://doi.
     org/10.24963/ijcai.2022/859. doi:10.24963/ijcai.
     2022/859, demo Track.
[31] T. Mayer, S. Marro, E. Cabrio, S. Villata, Enhancing
     evidence-based medicine with natural language
     argumentative analysis of clinical trials, Artificial
     Intelligence in Medicine 118 (2021) 102098. URL:
     https://www.sciencedirect.com/science/article/pii/
     S0933365721000919. doi:10.1016/j.artmed.2021.
     102098.
[32] M. van der Meer, M. Reuver, U. Khurana, L. Krause,
     S. B. Santamaría, Will it blend? mixing train-
     ing paradigms & prompting for argument quality
     prediction, in: G. Lapesa, et al. (Eds.), ArgMin-
     ing@COLING 2022, ICCL, 2022, pp. 95–103. URL:
     https://aclanthology.org/2022.argmining-1.8.
[33] M. Pojoni, L. Dumani, R. Schenkel, Argument-
     mining from podcasts using chatgpt, in: L. Malburg,
     D. Verma (Eds.), Proceedings of ICCBR-WS 2023,
     volume 3438 of CEUR Workshop Proceedings, CEUR-
     WS.org, 2023, pp. 129–144. URL: https://ceur-ws.
     org/Vol-3438/paper_10.pdf.
[34] A. Al Zubaer, M. Granitzer, J. Mitrović, Performance
     analysis of large language models in the domain of
     legal argument mining, Frontiers in Artificial Intel-
     ligence 6 (2023). URL: https://www.frontiersin.org/
     articles/10.3389/frai.2023.1278796. doi:10.3389/
     frai.2023.1278796.
[35] B. Liu, V. Schlegel, R. Batista-Navarro, S. Anani-
     adou, Argument mining as a multi-hop generative
     machine reading comprehension task, in: The 2023
     Conference on Empirical Methods in Natural Lan-
     guage Processing, 2023. URL: https://openreview.
     net/forum?id=KTFxOnrbvu.
[36] Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng,
     Y. Ma, LlamaFactory: Unified efficient fine-tuning of
     100+ language models, in: Proceedings of the 62nd
     Annual Meeting of the ACL (Volume 3: System
     Demonstrations), ACL, Bangkok, Thailand, 2024.
     URL: http://arxiv.org/abs/2403.13372.
A. Appendix

Examples of prompts for LLaMA 3 for the zero-shot learning
(ZSL), in-context learning (ICL) and fine-tuning (FT)
settings are provided below.

A.1. Zero-Shot Learning

### Task description: You are an expert biomedical assistant that takes 1) an abstract
text and 2) the list of all arguments from this abstract text, and must classify all
arguments into one of two classes: Claim or Premise. 68.0052% of examples are of
type Premise and 31.9948% of type Claim. You must absolutely not generate any text
or explanation other than the following JSON format {"Argument 1": , ..., "Argument n": }

### Class definitions: Claim = A claim in the abstract of an RCT is a statement or
conclusion about the findings of the study. Premise = A premise in the abstract of an
RCT is a statement that provides an evidence or proof for a claim.

### Abstract: Few controlled clinical trials exist to support oral combination
therapy in pulmonary arterial hypertension (PAH). Patients with PAH
(idiopathic [IPAH] or associated with connective tissue disease [APAH-CTD])
taking bosentan (62.5 or 125 mg twice daily at a stable dose for ≥3 months) were
randomized (1:1) to sildenafil (20 mg, 3 times daily; n = 50) or placebo (n = 53).
The primary endpoint was change from baseline in 6-min walk distance (6MWD)
at week 12, assessed using analysis of covariance. Patients could continue in a
52-week extension study. An analysis of covariance main-effects model was used,
which included categorical terms for treatment, baseline 6MWD (<325 m; ≥325
m), and baseline aetiology; sensitivity analyses were subsequently performed.
In sildenafil versus placebo arms, week-12 6MWD increases were similar (least
squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to 17.1 m]; P
= 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ± 57.4 m,
respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1 m in
APAH-CTD (35% of population). One-year survival was 96%; patients maintained
modest 6MWD improvements. Changes in WHO functional class and Borg
dyspnoea score and incidence of clinical worsening did not differ. Headache,
diarrhoea, and flushing were more common with sildenafil. Sildenafil, in addition
to stable (≥3 months) bosentan therapy, had no benefit over placebo for 12-week
change from baseline in 6MWD. The influence of PAH aetiology warrants future study.

### Arguments: Argument 1=In sildenafil versus placebo arms, week-12 6MWD
increases were similar (least squares mean difference [sildenafil-placebo], -2.4 m
[90% CI: -21.8 to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ±
45.7 versus 11.8 ± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0
versus 17.5 ± 59.1 m in APAH-CTD (35% of population).
Argument 2=Changes in WHO functional class and Borg dyspnoea score and
incidence of clinical worsening did not differ.
Argument 3=Headache, diarrhoea, and flushing were more common with sildenafil.
Argument 4=Sildenafil, in addition to stable (≥3 months) bosentan therapy, had no
benefit over placebo for 12-week change from baseline in 6MWD.

### Result:

A.2. In-Context Learning (ICL)

### Task description: You are an expert biomedical assistant that takes 1) an
abstract text, 2) the list of all arguments from this abstract text, and must classify all
arguments into one of two classes: Claim or Premise. 68.0052% of examples are of
type Premise and 31.9948% of type Claim. You must absolutely not generate any text
or explanation other than the following JSON format {"Argument 1": , ..., "Argument n": }

### Class definitions: Claim = A claim in the abstract of an RCT is a statement or
conclusion about the findings of the study. Premise = A premise in the abstract of an
RCT is a statement that provides an evidence or proof for a claim.

### Examples:

## Example 1

# Abstract:

Treatment of patients with advanced or metastatic esophagogastric adenocarcinoma
should not only prolong life but also provide relief of symptoms and improve quality
of life (QOL). Esophagogastric adenocarcinoma mainly occurs in elderly patients, but
they are underrepresented in most clinical trials and often do not receive effective
combination chemotherapy, most probably for fear of intolerance. Using validated
instruments, we prospectively assessed QOL within the randomized FLOT65+
phase II trial. Within the FLOT65+ trial, a total of 143 patients aged ≥65 years
were randomly allocated to receive biweekly oxaliplatin plus 5-fluorouracil (5-FU)
continuous infusion and folinic acid (FLO) or the same regimen in combination
with docetaxel 50 mg/m(2) (FLOT). The European Organisation for Research and
Treatment of Cancer Quality of Life Questionnaire C30 (EORTC QLQ-C30) and
the gastric module STO22 were administered every 8 weeks until progression.
Time to definitive deterioration of QOL parameters was analyzed and compared
within the treatment arms. The median age of patients was 70 years. Patients
receiving FLOT exhibited higher response rates and had improved disease-free and
progression-free survival (PFS). The proportions of patients with evaluable baseline
EORTC QLQ-C30 and STO22 questionnaires were balanced (83 % in FLOT and 89 %
in FLO). Considering evaluable patients with assessable questionnaires (n = 123),
neither functioning nor symptom parameters differed significantly in favor of one of
the two treatment groups. Particularly, there was no significant difference regarding
time to definitive deterioration of global health status/quality of life from baseline
(primary endpoint). Patients receiving FLO or FLOT as palliative treatment
(n = 98) achieved comparable QOL results. Although toxicity was higher in patients
receiving FLOT, no negative impact of the addition of docetaxel on QOL parameters
could be demonstrated. Thus, elderly patients in need of intensified chemotherapy
may receive FLOT without compromising patient-reported outcome parameters.

# Arguments:

Argument 1=Patients receiving FLOT exhibited higher response rates and had
improved disease-free and progression-free survival (PFS).
Argument 2=there was no significant difference regarding time to definitive
deterioration of global health status/quality of life from baseline (primary endpoint).
Argument 3=patients receiving FLO or FLOT as palliative treatment (n = 98) achieved
comparable QOL results.
Argument 4=Although toxicity was higher in patients receiving FLOT,
Argument 5=no negative impact of the addition of docetaxel on QOL parameters
could be demonstrated.
Argument 6=elderly patients in need of intensified chemotherapy may receive FLOT
without compromising patient-reported outcome parameters.

# Result:

{"Argument 1": "Premise", "Argument 2": "Premise", "Argument 3": "Premise",
"Argument 4": "Premise", "Argument 5": "Premise", "Argument 6": "Claim"}

## Example 2

# Abstract:

Chemotherapy prolongs survival and improves quality of life (QOL) for good
performance status (PS) patients with advanced non-small cell lung cancer (NSCLC).
Targeted therapies may improve chemotherapy effectiveness without worsening
toxicity. SGN-15 is an antibody-drug conjugate (ADC), consisting of a chimeric
murine monoclonal antibody recognizing the Lewis Y (Le(y)) antigen, conjugated
to doxorubicin. Le(y) is an attractive target since it is expressed by most NSCLC.
SGN-15 was active against Le(y)-positive tumors in early phase clinical trials and
was synergistic with docetaxel in preclinical experiments. This Phase II, open-label
study was conducted to confirm the activity of SGN-15 plus docetaxel in previously
treated NSCLC patients. Sixty-two patients with recurrent or metastatic NSCLC
expressing Le(y), one or two prior chemotherapy regimens, and PS< or =2 were
randomized 2:1 to receive SGN-15 200 mg/m2/week with docetaxel 35 mg/m2/week
(Arm A) or docetaxel 35 mg/m2/week alone (Arm B) for 6 of 8 weeks. Intrapatient
dose-escalation of SGN-15 to 350 mg/m2 was permitted in the second half of the
study. Endpoints were survival, safety, efficacy, and quality of life. Forty patients on
Arm A and 19 on Arm B received at least one treatment. Patients on Arms A and B
had median survivals of 31.4 and 25.3 weeks, 12-month survivals of 29% and 24%,
and 18-month survivals of 18% and 8%, respectively Toxicity was mild in both arms.
QOL analyses favored Arm A. SGN-15 plus docetaxel is a well-tolerated and active
alternate schedules to maximize synergy between these agents.

# Arguments:

Argument 1=Chemotherapy prolongs survival and improves quality of life (QOL) for
good performance status (PS) patients with advanced non-small cell lung cancer
(NSCLC).
Argument 2=Targeted therapies may improve chemotherapy effectiveness without
worsening toxicity.
Argument 3=Le(y) is an attractive target since it is expressed by most NSCLC.
Argument 4=SGN-15 was active against Le(y)-positive tumors in early phase clinical
trials and was synergistic with docetaxel in preclinical experiments.
Argument 5=Patients on Arms A and B had median survivals of 31.4 and 25.3
weeks, 12-month survivals of 29% and 24%, and 18-month survivals of 18% and 8%,
respectively
Argument 6=Toxicity was mild in both arms.
Argument 7=QOL analyses favored Arm A.
Argument 8=SGN-15 plus docetaxel is a well-tolerated and active second and third
line treatment for NSCLC patients

# Result:

{"Argument 1": "Claim", "Argument 2": "Claim", "Argument 3": "Claim", "Argument
4": "Premise", "Argument 5": "Premise", "Argument 6": "Premise", "Argument 7":
"Premise", "Argument 8": "Claim"}

## Example 3

# Abstract:

The impact of treatment on health-related quality of life (HRQoL) is an important
consideration in the adjuvant treatment of operable breast cancer. Here we report
mature HRQoL outcomes from the ATAC trial, comparing anastrozole with tamoxifen
as primary adjuvant therapy for postmenopausal women with localized breast cancer.
Patients completed the Functional Assessment of Cancer Therapy-Breast (FACT-B)
questionnaire plus endocrine subscale (ES) at baseline, 3 and 6 months, and every 6
months thereafter. Baseline characteristics in the HRQoL sub-protocol were well
balanced between the anastrozole (n = 335) and tamoxifen (n = 347) groups in the
primary analysis population. As with previously published results at 2 years, there
was no statistically significant difference in the Trial Outcome Index of the FACT-B,
the primary endpoint of the study, between treatments at 5 years. There were no
statistically significant differences between treatment groups in ES total scores.
Consistent with the 2-year analysis, there were differences between treatment groups
in patient-reported side effects: diarrhea (anastrozole 3.1% vs. tamoxifen 1.3%),

# Arguments:

Argument 1=In sildenafil versus placebo arms, week-12 6MWD increases were
similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to
17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ±
57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1
m in APAH-CTD (35% of population).
Argument 2=Changes in WHO functional class and Borg dyspnoea score and
incidence of clinical worsening did not differ.
Argument 3=Headache, diarrhoea, and flushing were more common with sildenafil.
Argument 4=Sildenafil, in addition to stable (≥3 months) bosentan therapy, had no
benefit over placebo for 12-week change from baseline in 6MWD.

# Result:

A.3. Fine-Tuning (FT)

### You are an expert in medical analysis. You are given the abstract of a random
controlled trial which contains numbered argument components enclosed by
tags. Your task is to classify each argument components in the essay as
either "Claim" or "Premise". You must return a list of argument component types in
following JSON format: "component_types": [component_type (str), component_type
(str), ..., component_type (str)]

### Here is the abstract text: An open, randomized study was performed to assess
                                                                                          the effects of supportive pamidronate treatment on morbidity from bone metastases
vaginal dryness (18.5% vs. 9.1%), diminished libido (34.0% vs. 26.1%), and dyspareunia
                                                                                          in breast cancer patients. Eighty-one pamidronate patients and 80 control patients
(17.3% vs. 8.1%) were significantly more frequent with anastrozole compared to
                                                                                          were monitored for a median of 18 and 21 months, respectively, for events of skeletal
tamoxifen. Dizziness (3.1% vs. 5.4%) and vaginal discharge (1.2% vs. 5.2%) were
                                                                                          morbidity and the radiologic course of metastatic bone disease. The oral pamidronate
significantly less frequent with anastrozole compared to tamoxifen. In this, the first
                                                                                          dose was 600 mg/d (high dose [HD]) during the earliest study years, then changed
report of HRQoL over 5 years of initial adjuvant therapy with an aromatase inhibitor,
                                                                                          to 300 mg/d (low dose [LD]) because of gastrointestinal toxicity. Twenty-nine of
we conclude that anastrozole and tamoxifen had similar impacts on HRQoL, which
                                                                                          81 pamidronate (HD/LD) patients first received 600 mg/d and were then changed
was maintained or slightly improved during the treatment period for both groups.
                                                                                          to 300 mg/d; 52 of 81 pamidronate LD patients received 300 mg/d throughout the
                                                                                          study. Tumor treatment was unrestricted. An overall intent-to-treat analysis was
# Arguments:
                                                                                          performed. In the pamidronate group, the occurrence of hypercalcemia, severe
                                                                                          bone pain, and symptomatic impending fractures decreased by 65%, 30%, and 50%,
Argument 1=The impact of treatment on health-related quality of life (HRQoL) is an
                                                                                          respectively; event-rates of systemic treatment and radiotherapy decreased by
important consideration in the adjuvant treatment of operable breast cancer.
                                                                                          35% (P < or = .02).  The event-free period (EFP), radiologic course of
Argument 2=As with previously published results at 2 years, there was no statistically
                                                                                          disease, and survival did not improve.  Subgroup analyses suggested
significant difference in the Trial Outcome Index of the FACT-B, the primary
                                                                                          a dose-dependent treatment effect.  Compared with their controls,
endpoint of the study, between treatments at 5 years.
                                                                                          in pamidronate HD/LD patients, events occurred 60% to 90% less frequently (P
Argument 3=There were no statistically significant differences between treatment
                                                                                          < or = .03) and the EFP was prolonged (P = .002).  In pamidronate
groups in ES total scores.
                                                                                          LD patients, event-rates decreased by 15% to 45% (P < or = .04). 
Argument 4=there were differences between treatment groups in patient-reported
                                                                                          Gastrointestinal toxicity of pamidronate caused a 23% drop-out rate,  but
side effects:
                                                                                          other cancer-associated factors seemed to contribute to this toxicity. 
Argument 5=diarrhea (anastrozole 3.1% vs. tamoxifen 1.3%), vaginal dryness (18.5%
                                                                                          Pamidronate treatment of breast cancer patients efficaciously reduced skeletal
vs. 9.1%), diminished libido (34.0% vs. 26.1%), and dyspareunia (17.3% vs. 8.1%) were
                                                                                          morbidity.  The effect appeared to be dose-dependent. 
significantly more frequent with anastrozole compared to tamoxifen.
                                                                                          Further research on dose and mode of treatment is mandatory. 
Argument 6=Dizziness (3.1% vs. 5.4%) and vaginal discharge (1.2% vs. 5.2%) were
significantly less frequent with anastrozole compared to tamoxifen.
                                                                                          {"component_types": ["Premise", "Premise", "Claim", "Premise", "Premise", "Premise",
Argument 7=In this, the first report of HRQoL over 5 years of initial adjuvant therapy
                                                                                          "Claim", "Claim", "Claim", "Claim"]}
with an aromatase inhibitor, we conclude that anastrozole and tamoxifen had similar
impacts on HRQoL, which was maintained or slightly improved during the treatment
period for both groups.

# Result:

{"Argument 1": "Claim", "Argument 2": "Premise", "Argument 3": "Premise", "Argument
4": "Claim", "Argument 5": "Premise", "Argument 6": "Premise", "Argument 7": "Claim"}

# Abstract:

Few controlled clinical trials exist to support oral combination therapy in pulmonary
arterial hypertension (PAH). Patients with PAH (idiopathic [IPAH] or associated with
connective tissue disease [APAH-CTD]) taking bosentan (62.5 or 125 mg twice daily
at a stable dose for ≥3 months) were randomized (1:1) to sildenafil (20 mg, 3 times
daily; n = 50) or placebo (n = 53). The primary endpoint was change from baseline
in 6-min walk distance (6MWD) at week 12, assessed using analysis of covariance.
Patients could continue in a 52-week extension study. An analysis of covariance
main-effects model was used, which included categorical terms for treatment,
baseline 6MWD (<325 m; ≥325 m), and baseline aetiology; sensitivity analyses were
subsequently performed. In sildenafil versus placebo arms, week-12 6MWD increases
were similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8
to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8
± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ±
59.1 m in APAH-CTD (35% of population). One-year survival was 96%; patients
maintained modest 6MWD improvements. Changes in WHO functional class and
Borg dyspnoea score and incidence of clinical worsening did not differ. Headache,
diarrhoea, and flushing were more common with sildenafil. Sildenafil, in addition
to stable (≥3 months) bosentan therapy, had no benefit over placebo for 12-week
change from baseline in 6MWD. The influence of PAH aetiology warrants future study.
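The "# Abstract: / # Arguments: / # Result:" layout used in the examples above is regular enough to generate and parse programmatically. The following is a minimal sketch of that round trip; the helper names and the default-label fallback for malformed completions are illustrative assumptions, not part of the paper's experimental code.

```python
import json

def build_prompt(abstract: str, arguments: list[str]) -> str:
    """Assemble an ATC query in the '# Abstract: / # Arguments: / # Result:'
    format shown in the examples above (hypothetical helper)."""
    lines = ["# Abstract:", "", abstract, "", "# Arguments:", ""]
    lines += [f"Argument {i}={arg}" for i, arg in enumerate(arguments, start=1)]
    lines += ["", "# Result:", ""]
    return "\n".join(lines)

def parse_result(completion: str, n_arguments: int) -> list[str]:
    """Read the model's {"Argument i": "Claim"/"Premise"} completion into an
    ordered label list, falling back to a default label if the output is not
    valid JSON (the fallback choice is our own assumption)."""
    try:
        mapping = json.loads(completion)
    except json.JSONDecodeError:
        return ["Premise"] * n_arguments
    return [mapping.get(f"Argument {i}", "Premise")
            for i in range(1, n_arguments + 1)]

labels = parse_result('{"Argument 1": "Claim", "Argument 2": "Premise"}', 2)
# labels == ["Claim", "Premise"]
```

The fine-tuning prompt in Section A.3 instead returns a flat {"component_types": [...]} list, which can be parsed the same way by indexing the list rather than the per-argument keys.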