=Paper=
{{Paper
|id=Vol-3878/15_main_long
|storemode=property
|title=Argument Mining in BioMedicine: Zero-Shot, In-Context Learning and Fine-tuning with LLMs
|pdfUrl=https://ceur-ws.org/Vol-3878/15_main_long.pdf
|volume=Vol-3878
|authors=Jérémie Cabessa,Hugo Hernault,Umer Mushtaq
|dblpUrl=https://dblp.org/rec/conf/clic-it/CabessaHM24
}}
==Argument Mining in BioMedicine: Zero-Shot, In-Context Learning and Fine-tuning with LLMs==
Jérémie Cabessa (1,∗,†), Hugo Hernault (2,†) and Umer Mushtaq (3,†)

(1) David Lab, University of Versailles Saint-Quentin (UVSQ) – University of Paris-Saclay, 78000 Versailles, France, and Institute of Computer Science of the Czech Academy of Sciences, 18207 Prague 8, Czech Republic
(2) Playtika Ltd., CH-1003 Lausanne, Switzerland
(3) Laboratoire Informatique, Image, Interaction (L3i), University of La Rochelle, 17042 La Rochelle, France
Abstract
Argument Mining (AM) aims to extract the complex argumentative structure of a text, and Argument Type Classification (ATC) is an essential sub-task of AM. Large Language Models (LLMs) have shown impressive capabilities in most NLP tasks and beyond. However, fine-tuning LLMs can be challenging. In-Context Learning (ICL) has been suggested as a bridging paradigm between the training-free and fine-tuning settings for LLMs. In ICL, an LLM is conditioned to solve tasks using a few solved demonstration examples included in its prompt. We focus on AM in the biomedical AbstRCT dataset. We address ATC using quantized and unquantized LLaMA-3 models through zero-shot learning, in-context learning, and fine-tuning approaches. We introduce a novel ICL strategy that combines kNN-based example selection with majority vote ensembling, along with a well-designed fine-tuning strategy for ATC. In the zero-shot setting, we show that LLaMA-3 fails to achieve acceptable classification results, suggesting the need for additional training modalities. However, in our training-free ICL setting, LLaMA-3 can leverage relevant information from only a few demonstration examples to achieve very competitive results. Finally, in our fine-tuning setting, LLaMA-3 achieves state-of-the-art performance on the ATC task on the AbstRCT dataset.
Keywords
Argument Mining, NLP, LLMs, LLaMA-3, Zero-Shot Learning, In-Context Learning, Fine-tuning, Ensembling
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author.
† The authors contributed equally.
Email: jeremie.cabessa@uvsq.fr (J. Cabessa); hugoh@playtika.com (H. Hernault); umer.mushtaq@univ-lr.fr (U. Mushtaq)
ORCID: 0000-0002-5394-5249 (J. Cabessa); 0009-0003-4403-6048 (U. Mushtaq)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Argument Mining (AM) focuses on extracting the underlying argumentative and discursive structure from raw text [1]. Argument Type Classification (ATC), which involves classifying argumentative units in a text according to their argumentative roles, is a crucial sub-task of AM. Research has shown that the argumentative role of a unit cannot be inferred solely from its text: additional structural and contextual information is needed [2]. This additional information can be incorporated via feature engineering [2], memory-enabled neural architectures [3, 4] or LLM-based hybrid methods [5, 6].

Large Language Models (LLMs) have become ubiquitous in deep learning and have shown impressive capabilities in most NLP tasks [7]. In the main, LLMs are used in two distinct settings: (i) training-free, where the pre-trained LLM is used for inference without any parameter adjustment, and (ii) fine-tuning, where the parameters of the LLM are updated through supervised training to enable transfer learning on a downstream task. Zero-shot learning refers to the training-free approach where a pre-trained LLM is prompted to solve tasks on completely unseen data samples.

Recently, In-Context Learning (ICL) has been proposed as a bridging paradigm between the training-free and fine-tuning settings. ICL is a prompt engineering technique whereby an LLM is conditioned to solve tasks by means of a few solved demonstration examples included as part of its input prompt [8]. Generally, the input prompt includes task instructions, the current input sample to be solved, as well as several solved input-output pair examples. In this way, ICL maintains the training-free posture (parameters frozen) of the LLM while at the same time providing it with some supervision through demonstration examples. It also enables direct incorporation of selected features inside the prompt template, thereby obviating the need for architecture customization. Creative ICL strategies combining kNN-based example selection, generated chain-of-thought (CoT) prompting, and majority vote ensembling have been proposed and shown to outperform fine-tuning approaches [9, 10, 11, 12]. In the main, kNN-based example selection optimizes the process of learning from few examples and ensembling increases the robustness of the predictions [13, 9, 11].

This work focuses on AM in the biomedical AbstRCT dataset [14]. More specifically, we address the ATC task using quantized and unquantized LLaMA-3 models, among the most capable openly available LLMs (cf. leaderboard), through zero-shot learning, in-context learning,
and fine-tuning approaches. Our contributions are as follows:

• In the zero-shot learning setting, we show that LLaMA-3 fails to achieve acceptable classification results, suggesting the need for implementing additional training modalities.

• We introduce a novel ICL strategy that combines kNN-based example selection with majority vote ensembling. In this training-free setting, LLaMA-3 can leverage relevant information from only a few demonstration examples to achieve very competitive results.

• We further experiment with a fine-tuning strategy for LLaMA-3. In this setting, we achieve state-of-the-art performance on the ATC task for the AbstRCT dataset.

Our code is freely available on GitHub.

2. Related Works

In early works, Argument Mining has been approached using both classical algorithms such as SVM [15, 2, 16, 17] as well as recurrent neural network models such as BiLSTMs [18, 19, 4]. Transformer-based models, such as BERT [20], have also been utilized for AM, including multi-scale argument modelling and customized feature-injected BERT-based models [21, 22, 23, 5, 6, 24, 25]. AM in the biomedical AbstRCT dataset has been approached using LSTMs [26, 27], sequential transfer learning [28] as well as transformer-based models [29, 30, 31].

More recently, AM sub-tasks have been modeled as text generation tasks using LLMs. For the Argument Type Classification (ATC) sub-task, this approach involves using a prompt template to generate the corresponding class of an argument component. This method has been applied to various AM use-cases, such as podcast transcripts and legal documents [32, 33, 34]. The latest approach in this 'AM using LLM text generation' direction involves a prompt that includes the argument component as the query and the complete text as the context, to output the class of the argument component using a generative model [35]. In this study, the three AM sub-tasks are modeled using the Persuasive Essays (PE) and AbstRCT datasets.

In contrast to the fine-tuning approach, a relevant training-free ICL prompting strategy for LLMs has been proposed [9, 11]. This strategy combines kNN-based example selection, generated chain-of-thought prompting, and majority vote ensembling for few-shot classification. Interestingly, the ICL strategy outperforms the fine-tuning approach on the datasets used in the study.

Our work sits at the intersection of zero-shot learning, in-context learning and fine-tuning. We implement and compare the performance of the latest openly available LLMs using these three approaches for AM on the AbstRCT dataset.

3. Methodology

3.1. Datasets

We consider the AbstRCT dataset, which consists of abstracts of 650 Randomized Controlled Trials selected from the biomedical database PubMed [14]. The Neoplasm train set (Neo-train) consists of 350 abstracts, whereas the three Neoplasm, Glaucoma and Mixed test sets (Neo-test, Gla-test and Mix-test, respectively) consist of 100 abstracts each. The statistics of the AbstRCT dataset are given in Table 1. The argument type classification (ATC) task consists of predicting the type of each argument component (AC) as 'Major Claim', 'Claim' or 'Premise'. Following previous approaches, we combine the 'Major Claim' and 'Claim' classes into a single class 'Claim'.

  Dataset split   Abstracts   ACs
  Neo-train       350         2,291
  Neo-test        100         691
  Gla-test        100         615
  Mix-test        100         609

Table 1: AbstRCT dataset statistics.

A sample of the AbstRCT dataset is provided below. The argument components (ACs) and their corresponding classes are indicated by bold tags.

A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer. The purpose of this study was to assess the effects of these treatments on health-related quality of life (HQL). Men with metastatic prostate cancer (n = 161) were randomized to receive either daily prednisone alone or mitoxantrone (every 3 weeks) plus prednisone. Those who received prednisone alone could have mitoxantrone added after 6 weeks if there was no improvement in pain. HQL was assessed before treatment initiation and then every 3 weeks using the European Organization for Research and Treatment of Cancer Quality-of-Life Questionnaire C30 (EORTC QLQ-C30) and the Quality of Life Module-Prostate 14 (QOLM-P14), a trial-specific module developed for this study. An intent-to-treat analysis was used to determine the mean duration of HQL improvement and differences in improvement duration between groups of patients. At 6 weeks, both groups showed improvement in several HQL domains, and only physical functioning and pain were better in the mitoxantrone-plus-prednisone group than in the prednisone-alone group. After 6 weeks, patients taking prednisone showed no improvement in HQL scores, whereas those taking mitoxantrone plus prednisone showed significant improvements in global quality of life (P =.009), four functioning domains, and nine symptoms (.001 < P < .01), and the improvement (> 10 units on a scale of 0 to 100) lasted longer than in the prednisone-alone group (.004 < P < .05). The addition of mitoxantrone to prednisone after failure of prednisone alone was associated with improvements in pain, pain impact, pain relief, insomnia, and global quality of life (.001 < P < .003). Treatment with mitoxantrone plus prednisone was associated with greater and longer-lasting improvement in several HQL domains and symptoms than treatment with prednisone alone.
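As a minimal illustration of the label scheme described above, the three-way AC typing can be collapsed into the binary Claim/Premise scheme as follows (a sketch; the label strings and function name are ours, not taken from the AbstRCT release):

```python
# Sketch of the binary label scheme used for ATC on AbstRCT.
# 'MajorClaim' and 'Claim' are collapsed into a single 'Claim' class;
# 'Premise' is kept as-is.
LABEL_MAP = {"MajorClaim": "Claim", "Claim": "Claim", "Premise": "Premise"}

def merge_labels(ac_types):
    """Collapse the 3-way AC typing into the binary Claim/Premise scheme."""
    return [LABEL_MAP[t] for t in ac_types]

print(merge_labels(["MajorClaim", "Premise", "Claim"]))
# -> ['Claim', 'Premise', 'Claim']
```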
3.2. Zero-Shot Learning (ZSL) and In-Context Learning (ICL)

Zero-shot learning (ZSL) is the paradigm where the LLM is asked to solve a downstream task without receiving any specific solved examples in the prompt. By contrast, in-context learning (ICL) refers to the emergent ability of LLMs to solve a downstream task based on a few demonstration examples given in the prompt as contextual information [8]. As a major advantage, the ZSL and ICL paradigms do not require any fine-tuning of the model's parameters (i.e., a training-free framework).

Formally, let x be a query input text and C = [I; t(x_{i1}, y_{i1}); . . . ; t(x_{ik}, y_{ik})] be a context composed of instructions I concatenated with input-output pairs (x_{ij}, y_{ij}) in text format, where X = {x1, x2, . . . } and Y = {y1, . . . , yk} are the sets of possible inputs and outputs, respectively. The ZSL and ICL paradigms correspond to the cases where k = 0 and k > 0, respectively. For input x, the LLM M predicts the output ŷ such that

ŷ = argmax_{yi ∈ Y} P_M(yi | C; x),

where P_M(yi | C; x) is the probability that M generates yi when C and x are given as prompt. The main rationale behind ZSL and ICL is that the consideration of a well-chosen context C increases the probability of M predicting the correct answer y for input x, i.e., that P_M(y | C; x) > P_M(y | x).

We consider a 2-step ICL strategy for argument type classification (ATC) inspired by a recent study [9] (see Figure 1). More precisely, let A be an abstract containing argument components (ACs) c1, . . . , cm with corresponding true classes y1, . . . , ym, where each yi ∈ {Claim, Premise}. Given the ACs c1, . . . , cm in the prompt, the LLM generates the corresponding class predictions ŷ1, . . . , ŷm as follows:

(1) kNN-based example selection (k = 3, 5): First, 2k neighboring abstracts A1, . . . , A2k of A are selected according to the following similarity measure. For any abstract Ai, let the signature of Ai be the embedding of the first sentence of Ai using the BioBERT model. The abstracts A1, . . . , A2k are the ones whose signatures are the closest, with respect to cosine similarity, to the signature of A. Then, k abstracts Ai1, . . . , Aik are randomly chosen from A1, . . . , A2k. Afterwards, a prompt containing all the ACs and their corresponding classes in these k abstracts is constructed (kNN). Finally, the LLM predicts the classes ŷ1, . . . , ŷm of c1, . . . , cm on the basis of this prompt.

(2) n-Ensembling (n = 3, 5): The kNN-based example selection step, which involves randomness, is repeated n times (nEns), leading to a set of n sequences of class predictions {(ŷ_{i,1}, . . . , ŷ_{i,m}) : i = 1, . . . , n}. The final class predictions ŷ1, . . . , ŷm of c1, . . . , cm are obtained by applying a componentwise majority vote to the n prediction sequences.

The kNN-based example selection optimizes learning from few examples by selecting samples most similar to the current instance, rather than choosing them randomly. The ensembling step increases prediction robustness by selecting the most frequent predictions. Note that the relevance of the ensembling step relies on the random selection in the kNN step. This randomness ensures that the same predictions are not always produced, allowing for majority voting and thereby increasing robustness.

To aid the LLM in generating predictions, additional task-specific information is typically included in the prompt. For example, definitions of the 'Claim' and 'Premise' classes, along with their statistics in the Neo-train set, can be incorporated in the prompt (info). Moreover, in addition to the ACs c1, . . . , cm whose classes are to be predicted, the abstract text from which these ACs originate can be included in the prompt (abstract). According to this ICL strategy, the classes ŷ1, . . . , ŷm of c1, . . . , cm are predicted all-at-once (see Figure 1). Therefore, a prompt of the form 'info + abstract + 3NN + 3Ens' (see Table 3) indicates that the argument components (ACs) of the abstract are predicted all-at-once, by incorporating additional information and the entire abstract text as contextual cues in the prompt, and employing the ICL strategy with 3NN-based example selection and 3-ensembling. A similar ICL strategy, where the classes ŷ1, . . . , ŷm are inferred one-by-one (i.e., each model inference leads to a single prediction ŷj), has been considered but shown to be significantly less efficient. Due to space constraints, the latter results are omitted in this work.
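The 2-step strategy above can be sketched as follows. This is a schematic reconstruction under stated assumptions, not the authors' code: abstract signatures are assumed to be precomputed BioBERT first-sentence embeddings, and `llm_predict` is a hypothetical stub standing in for one all-at-once LLaMA-3 inference.

```python
import random
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def knn_select(query_sig, train_abstracts, k):
    """Step (1): keep the 2k training abstracts whose signatures are
    closest to the query signature, then draw k of them at random
    (this randomness is what makes the ensembling step meaningful)."""
    ranked = sorted(train_abstracts,
                    key=lambda a: cosine(query_sig, a["signature"]),
                    reverse=True)
    return random.sample(ranked[:2 * k], k)

def icl_predict(query_sig, acs, train_abstracts, llm_predict, k=3, n=3):
    """Step (2): repeat the kNN prompt construction n times and take a
    componentwise majority vote over the n prediction sequences."""
    runs = []
    for _ in range(n):
        demos = knn_select(query_sig, train_abstracts, k)
        runs.append(llm_predict(demos, acs))  # one all-at-once inference
    return [Counter(col).most_common(1)[0][0] for col in zip(*runs)]
```

With binary Claim/Premise labels and an odd n, the componentwise majority vote is always well defined, which is one reason n = 3 ensembling is a natural choice here.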
3.3. Fine-tuning

Fine-tuning (FT) refers to the process of further training a pre-trained LLM on a downstream task. Previous studies indicate that relying solely on the text of an argument component is insufficient for predicting its argumentative class; additional contextual information is essential for achieving competitive classification accuracy [2, 5, 6]. Therefore, we propose a fine-tuning strategy that models the ATC task at the document level. Specifically, we incorporate task-specific information into each training sample and generate the class label predictions for the ACs of an abstract all-at-once.

Figure 1: 2-step ICL approach: a kNN-based example prediction (k = 3, 5) step followed by an n-Ensembling (n = 3) step (cf. text for further details). For each abstract A, the class predictions ŷ1, . . . , ŷm of all of its ACs x1, . . . , xm are generated in one inference step (all-at-once modality).
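The document-level training samples described above can be sketched as follows; the prompt wording here is illustrative only (the authors' actual templates are given in their Appendix A) and just mirrors the stated structure: task info, the full abstract and all of its ACs in the input, all class labels in the output.

```python
def build_ft_sample(info, abstract, acs, labels):
    """Build one document-level fine-tuning sample: the prompt carries the
    task info, the full abstract and all of its ACs; the target is the
    sequence of class labels, to be generated all-at-once."""
    ac_block = "\n".join(f"AC {i + 1}: {ac}" for i, ac in enumerate(acs))
    prompt = (f"{info}\n\nAbstract:\n{abstract}\n\n"
              f"Argument components:\n{ac_block}\n\n"
              f"Class of each argument component?")
    target = ", ".join(labels)  # e.g. "Claim, Premise, Premise"
    return {"instruction": prompt, "output": target}
```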
3.4. Implementation Details

As the embedding engine, we use dmis-lab's BioBERT¹. For zero-shot learning, ICL and fine-tuning, we experiment with the LLaMA-3-8B-Instruct and LLaMA-3-70B-Instruct models, as well as various GGML-quantized configurations of them². For ICL, we set the generation temperature to 0.1. For fine-tuning, we use LoRA adapters with a loraplus_lr_ratio of 16.0. We set a batch size of 2 and a learning rate of 5e−5. For implementation, we use the LLaMA-Factory³ framework [36]. Examples of the prompts we use for zero-shot learning, in-context learning and fine-tuning with LLaMA-3 are given in Appendix A.

¹ https://huggingface.co/dmis-lab
² https://github.com/ggerganov/ggml
³ https://github.com/hiyouga/LLaMA-Factory
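Gathered in one place, the stated hyper-parameters amount to a configuration along these lines (a sketch; the key names are ours and do not reproduce LLaMA-Factory's actual configuration schema):

```python
# Hyper-parameters reported in Section 3.4, gathered in one place.
# Key names are illustrative; they do not mirror LLaMA-Factory's schema.
CONFIG = {
    "embedding_model": "dmis-lab BioBERT",  # kNN signature embeddings
    "models": ["LLaMA-3-8B-Instruct", "LLaMA-3-70B-Instruct"],
    "icl_temperature": 0.1,                 # generation temperature for ICL
    "finetuning": {
        "adapter": "LoRA",
        "loraplus_lr_ratio": 16.0,
        "batch_size": 2,
        "learning_rate": 5e-5,
    },
}
```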
4. Results

4.1. Zero-Shot Learning

The results for zero-shot learning (ZSL) on the ATC task are reported in Table 2. Recall that zero-shot learning corresponds to the prompting strategy where no nearest neighbors are included as demonstration examples, referred to as 'info + abstract + 0NN' in our notation. In an initial experimentation phase, we observed that adding complementary information (info) (definitions of 'Claim' and 'Premise' and dataset statistics) and including the entire text of the abstract (abstract) significantly improve the results. These expected observations serve as an ablation study and justify the usage of the additional information and full abstract text (prompt template 'info + abstract') in all subsequent experiments.

In all experiments, we observed that the models consistently generated the correct number of classes for each inference task. This observation remains valid for the subsequent ICL and fine-tuning settings. It demonstrates the model's capability to understand the correspondence between the number of input ACs and the number of classes to predict.

In the ZSL training-free setting, across the Neo, Gla and Mix test sets, the performance of the LLMs strongly correlated with the complexity of these models, achieving maximal macro F1-scores of 0.698, 0.819 and 0.725, respectively. Overall, in ZSL, the LLMs fail to achieve acceptable results. These considerations underscore the need for implementing additional learning modalities to address the ATC task effectively.

4.2. In-Context Learning

The results for in-context learning (ICL) on the ATC task are reported in Table 3. First, note that the transition from zero-shot learning ('info + abstract + 0NN', Table 2) to in-context learning ('info + abstract + kNN', Table 3) drastically improves the results. This validates the effectiveness of the kNN-based example selection method. In addition, except for the Mix test set, the 3NN strategy consistently outperforms the 5NN strategy, suggesting that three examples suffice for optimally learning the ATC task in an ICL setting. The inclusion of more demonstration examples correlates with a significant increase in prompt length, potentially hindering the performance of the LLM or exceeding the maximum size of its context.
Furthermore, the ensembling strategy consistently improves the results, even if only slightly, ensuring that the robustness of the results can indeed be strengthened through ensembling predictions.

Overall, the training-free ICL strategy achieves very competitive F1-scores of 0.912, 0.910, and 0.929 on the Neo, Gla, and Mix test sets, respectively. However, these results remain lower than those obtained by previous training-dependent models (see Table 4, upper rows).

  Model                            C      P      F1
  Neo test
  LLaMA-3-8b-Instruct-bnb-4bit     0.529  0.539  0.534
  LLaMA-3-8b-Instruct              0.544  0.558  0.551
  LLaMA-3-70b-Instruct-bnb-4bit    0.642  0.753  0.698
  Gla test
  LLaMA-3-8b-Instruct-bnb-4bit     0.553  0.635  0.594
  LLaMA-3-8b-Instruct              0.569  0.692  0.631
  LLaMA-3-70b-Instruct-bnb-4bit    0.755  0.882  0.819
  Mix test
  LLaMA-3-8b-Instruct-bnb-4bit     0.546  0.524  0.535
  LLaMA-3-8b-Instruct              0.563  0.564  0.563
  LLaMA-3-70b-Instruct-bnb-4bit    0.671  0.779  0.725

Table 2: Zero-shot results for ATC on the three test sets of the AbstRCT dataset using LLaMA-3.

  Prompt                           C      P      F1
  Neo test
  LLaMA-3-8b-Instruct
    info + abstract + 3NN          0.832  0.912  0.872
    info + abstract + 5NN          0.843  0.914  0.878
    info + abstract + 3NN + 3Ens   0.844  0.917  0.880
  LLaMA-3-8b-Instruct-bnb-4bit
    info + abstract + 3NN          0.847  0.916  0.881
    info + abstract + 5NN          0.817  0.890  0.853
    info + abstract + 3NN + 3Ens   0.848  0.919  0.884
  LLaMA-3-70b-Instruct-bnb-4bit
    info + abstract + 3NN          0.870  0.935  0.903
    info + abstract + 5NN          0.863  0.930  0.896
    info + abstract + 3NN + 3Ens   0.884  0.941  0.912
  Gla test
  LLaMA-3-8b-Instruct
    info + abstract + 3NN          0.834  0.929  0.882
    info + abstract + 5NN          0.836  0.925  0.881
    info + abstract + 3NN + 3Ens   0.872  0.947  0.910
  LLaMA-3-8b-Instruct-bnb-4bit
    info + abstract + 3NN          0.827  0.924  0.875
    info + abstract + 5NN          0.816  0.916  0.866
    info + abstract + 3NN + 3Ens   0.832  0.928  0.880
  LLaMA-3-70b-Instruct-bnb-4bit
    info + abstract + 3NN          0.868  0.946  0.907
    info + abstract + 5NN          0.865  0.945  0.905
    info + abstract + 3NN + 3Ens   0.863  0.944  0.903
  Mix test
  LLaMA-3-8b-Instruct
    info + abstract + 3NN          0.879  0.938  0.909
    info + abstract + 5NN          0.898  0.944  0.921
    info + abstract + 3NN + 3Ens   0.884  0.940  0.912
  LLaMA-3-8b-Instruct-bnb-4bit
    info + abstract + 3NN          0.859  0.926  0.893
    info + abstract + 5NN          0.866  0.922  0.894
    info + abstract + 3NN + 3Ens   0.885  0.940  0.913
  LLaMA-3-70b-Instruct-bnb-4bit
    info + abstract + 3NN          0.905  0.954  0.929
    info + abstract + 5NN          0.906  0.952  0.929
    info + abstract + 3NN + 3Ens   0.904  0.952  0.928

Table 3: Results for ATC on the three test sets of the AbstRCT dataset with LLaMA-3 models using the 2-step ICL strategy described in the text.

4.3. Fine-Tuning

The results achieved by the fine-tuning (FT) strategy on the ATC task are reported in Table 4. Our results show that fine-tuning significantly outperforms ICL. These findings suggest that the argumentative flow within abstracts cannot be inferred solely from the knowledge acquired during pre-training, and requires additional parameter updates to be effectively learned.

In this training-dependent context, we achieve maximal F1-scores of 0.935, 0.913, and 0.951 on the Neo, Gla, and Mix test sets, respectively, establishing new state-of-the-art results for the Neo and Mix test sets. These results suggest once again that the sequentiality of arguments inside a specific corpus requires fine-tuning to be optimally captured.
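The C, P and F1 columns in Tables 2-4 report the per-class F1-scores for 'Claim' and 'Premise' together with their macro average. A minimal computation of these metrics (our own sketch, not the authors' evaluation code) is:

```python
def f1_per_class(y_true, y_pred, cls):
    """F1-score of a single class from exact true/false positive counts."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(y_true, y_pred, classes=("Claim", "Premise")):
    """Unweighted mean of the per-class F1-scores (macro averaging)."""
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)
```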
zero-shot learning (ZSL), in-context learning (ICL) and
fine-tuning (FT). We show that ZSL fails to achieve accept-
5. Conclusion able performance, ICL significantly improves the results,
and FT reaches state-of-the-art performance.
In this work, we address argument type classification
These results support the fact that ATC task cannot
(ATC) in the biomedical AbstRTC dataset with openly
be solved in a zero-shot setting by relying solely on
available LLaMA-3 from the three-fold perspective of
general-purpose language modalities acquired during
Model Neo Gla Mix References
ResAttArg(Ensemble) [27] 0.879 0.877 0.897
SeqMT [28] 0.919 0.924 0.922 [1] R. M. Palau, M.-F. Moens, Argumentation min-
MRC_GEN [35] 0.928 0.926 0.940 ing: The detection, classification and structure of
GIAM [25] 0.930 0.928 0.936 arguments in text, in: Proceedings of ICAIL 2019,
ICAIL ’09, ACM, New York, NY, USA, 2009, pp. 98–
LLaMA-3-8B-Instruct 0.919 0.908 0.939
LLaMA-3-8B-Instruct-bnb-4bit 0.935 0.910 0.953
107. URL: https://doi.org/10.1145/1568234.1568246.
LLaMA-3-70B-Instruct 0.929 0.913 0.940 doi:10.1145/1568234.1568246.
LLaMA-3-70B-Instruct-bnb-4bit 0.921 0.908 0.951 [2] C. Stab, I. Gurevych, Parsing argumentation struc-
tures in persuasive essays, Computational Linguis-
Table 4 tics 43 (2017) 619–659. URL: https://aclanthology.
Fine-tuning results for ATC task on the three test sets of Ab- org/J17-3005. doi:10.1162/COLI_a_00295.
stRCT dataset using LLaMA-3. [3] P. Potash, A. Romanov, A. Rumshisky, Here’s
my point: Joint pointer architecture for argu-
ment mining, in: M. P. et al. (Ed.), Proceed-
pre-training. Additional learning is essential, either in ings of EMNLP 2017, ACL, 2017, pp. 1364–1373.
the form of solved demonstration examples (ICL) or via URL: https://doi.org/10.18653/v1/d17-1143. doi:10.
parameters’ updates (FT). We conjecture that the sequen- 18653/V1/D17-1143.
tial flow of arguments within a text is a corpus-specific [4] T. Kuribayashi, H. Ouchi, N. Inoue, P. Reisert,
feature that cannot be inferred through zero-shot meth- T. Miyoshi, J. Suzuki, K. Inui, An empirical study
ods. of span representations in argumentation structure
Previous works demonstrated that the text of argument parsing, in: A. K. et al. (Ed.), Proceedings of ACL
components alone do not suffice to infer their argumen- 2019, ACL, Florence, Italy, 2019, pp. 4691–4698. URL:
tative roles [2, 4, 6]. Additional contextual, structural https://aclanthology.org/P19-1464. doi:10.18653/
and syntactic features are necessary. In our ICL and FT v1/P19-1464.
settings, comprehensive contextual and structural infor- [5] U. Mushtaq, J. Cabessa, Argument classification
mation is incorporated through task-specific information with BERT plus contextual, structural and syn-
and complete abstract text provided in the prompt. This tactic features as text, in: M. T. et al. (Ed.),
information enables the model to discern the sequence Proceedings of ICONIP 2022, volume 1791 of
of arguments, their associated markers, and other char- CCIS, Springer, 2022, pp. 622–633. URL: https://doi.
acteristics closely associated with their argumentative org/10.1007/978-981-99-1639-9_52. doi:10.1007/
roles. 978-981-99-1639-9\_52.
For future work, the design and implementation of a [6] U. Mushtaq, J. Cabessa, Argument min-
full AM pipeline using LLMs represents a major mile- ing with modular BERT and transfer learning,
stone. In this scenario, the LLM would take raw texts as in: Proceedings of IJCNN 2023, IEEE, 2023,
input and produce a detailed map of the argumentative pp. 1–8. URL: https://doi.org/10.1109/IJCNN54540.
structure as output. We believe that LLMs will substan- 2023.10191968. doi:10.1109/IJCNN54540.2023.
tially transform the landscape of AM and its practical 10191968.
applications. [7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang,
Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong,
Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang,
Acknowledgments R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie,
J. Wen, A survey of large language models,
This work benefited from access to the computing re-
CoRR abs/2303.18223 (2023). URL: https://doi.org/
sources of the L3i laboratory, operated and hosted by the
10.48550/arXiv.2303.18223. doi:10.48550/ARXIV.
University of La Rochelle. It is financed by the French
2303.18223. arXiv:2303.18223.
government and the Region Nouvelle-Acquitaine. This
[8] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang,
research also benefited from institutional support RVO:
X. Sun, J. Xu, L. Li, Z. Sui, A survey on in-context
67985807 and partially supported by the grant of the
learning, CoRR abs/2301.00234 (2023). URL: https://
Czech Science Foundation No. GA22-02067S. Finally, we
doi.org/10.48550/arXiv.2301.00234. doi:10.48550/
are grateful to Playtika Ltd. for their support for this
ARXIV.2301.00234. arXiv:2301.00234.
research.
[9] H. Nori, et al., Can generalist foundation models
outcompete special-purpose tuning? case study in
medicine, CoRR abs/2311.16452 (2023). URL: https://
doi.org/10.48550/arXiv.2311.16452. doi:10.48550/
ARXIV.2311.16452. arXiv:2311.16452. language understanding, in: J. B. et al. (Ed.), Pro-
[10] J. Wei, X. Wang, D. Schuurmans, M. Bosma, ceedings of NAACL-HLT 2019, ACL, 2019, pp. 4171–
B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, 4186. URL: https://doi.org/10.18653/v1/n19-1423.
Chain-of-thought prompting elicits reasoning doi:10.18653/V1/N19-1423.
in large language models, in: S. K. et al. (Ed.), [21] G. Zhang, P. Nulty, D. Lillis, Enhancing legal argu-
Proceedings of NeurIPS 2022, volume 35, 2022, ment mining with domain pre-training and neural
pp. 24824–24837. URL: https://proceedings. networks, CoRR abs/2202.13457 (2022). URL: https:
neurips.cc/paper_files/paper/2022/file/ //arxiv.org/abs/2202.13457. arXiv:2202.13457.
9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.[22] H. Wang, Z. Huang, Y. Dou, Y. Hong, Ar-
pdf. gumentation mining on essays at multi
[11] S. Lei, G. Dong, X. Wang, K. Wang, S. Wang, Instruc- scales, in: D. S. et al. (Ed.), Proceed-
terc: Reforming emotion recognition in conversa- ings of COLING 2020, ICCL, Barcelona,
tion with a retrieval multi-task llms framework, Spain (Online), 2020, pp. 5480–5493. URL:
CoRR abs/2309.11911 (2023). URL: https://doi.org/ https://aclanthology.org/2020.coling-main.478.
10.48550/arXiv.2309.11911. doi:10.48550/ARXIV. doi:10.18653/v1/2020.coling-main.478.
2309.11911. arXiv:2309.11911. [23] S. Fioravanti, A. Zugarini, F. Giannini, L. Rigutini,
[12] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, M. Maggini, M. Diligenti, Linguistic feature in-
S. Narang, A. Chowdhery, D. Zhou, Self-consistency jection for efficient natural language processing,
improves chain of thought reasoning in language in: IJCNN 2023, June 18-23, 2023, IEEE, 2023,
models, 2023. arXiv:2203.11171. pp. 1–7. URL: https://doi.org/10.1109/IJCNN54540.
[13] H. Nori, N. King, S. M. McKinney, D. Carignan, 2023.10191680. doi:10.1109/IJCNN54540.2023.
E. Horvitz, Capabilities of GPT-4 on medical 10191680.
challenge problems, CoRR abs/2303.13375 (2023). URL: https://doi.org/10.48550/arXiv.2303.13375. doi:10.48550/ARXIV.2303.13375. arXiv:2303.13375.
[14] T. Mayer, Argument Mining on Clinical Trials, Theses, Université Côte d'Azur, 2020. URL: https://theses.hal.science/tel-03209489.
[15] R. Mochales, M. Moens, Argumentation mining, Artificial Intelligence and Law 19 (2011) 1–22. doi:10.1007/s10506-010-9104-x.
[16] I. Habernal, I. Gurevych, Argumentation mining in user-generated web discourse, Computational Linguistics 43 (2017) 125–179. URL: https://aclanthology.org/J17-1004. doi:10.1162/COLI_a_00276.
[17] R. Levy, Y. Bilu, D. Hershcovich, E. Aharoni, N. Slonim, Context dependent claim detection, in: ICCL, 2014. URL: https://api.semanticscholar.org/CorpusID:18847466.
[18] S. Eger, J. Daxenberger, I. Gurevych, Neural end-to-end learning for computational argumentation mining, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of ACL 2017, ACL, Vancouver, Canada, 2017, pp. 11–22. URL: https://aclanthology.org/P17-1002. doi:10.18653/v1/P17-1002.
[19] V. Niculae, J. Park, C. Cardie, Argument mining with structured SVMs and RNNs, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of ACL 2017, ACL, Vancouver, Canada, 2017, pp. 985–995. URL: https://aclanthology.org/P17-1091. doi:10.18653/v1/P17-1091.
[20] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT 2019, ACL, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[24] J. Bao, C. Fan, J. Wu, Y. Dang, J. Du, R. Xu, A neural transition-based model for argumentation mining, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the ACL and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), ACL, Online, 2021, pp. 6354–6364. URL: https://aclanthology.org/2021.acl-long.497. doi:10.18653/v1/2021.acl-long.497.
[25] B. Liu, V. Schlegel, P. Thompson, R. T. Batista-Navarro, S. Ananiadou, Global information-aware argument mining based on a top-down multi-turn qa model, Information Processing & Management 60 (2023) 103445. URL: https://www.sciencedirect.com/science/article/pii/S0306457323001826. doi:10.1016/j.ipm.2023.103445.
[26] A. Galassi, M. Lippi, P. Torroni, Argumentative link prediction using residual networks and multi-objective learning, in: N. Slonim, R. Aharonov (Eds.), Proceedings of the 5th Workshop on Argument Mining, ACL, Brussels, Belgium, 2018, pp. 1–10. URL: https://aclanthology.org/W18-5201. doi:10.18653/v1/W18-5201.
[27] A. Galassi, M. Lippi, P. Torroni, Multi-task attentive residual networks for argument mining, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023) 1877–1892. doi:10.1109/TASLP.2023.3275040.
[28] J. Si, L. Sun, D. Zhou, J. Ren, L. Li, Biomedical argument mining based on sequential multi-task learning, IEEE/ACM Trans. Comput. Biol. Bioinformatics 20 (2022) 864–874. URL: https://doi.org/10.1109/TCBB.2022.3173447. doi:10.1109/TCBB.2022.3173447.
[29] T. Mayer, E. Cabrio, S. Villata, Transformer-based
argument mining for healthcare applications, in:
G. D. G. et al. (Ed.), Proceedings of ECAI 2020, vol-
ume 325 of FAIA, IOS Press, 2020, pp. 2108–2115.
URL: https://doi.org/10.3233/FAIA200334. doi:10.
3233/FAIA200334.
[30] B. Molinet, S. Marro, E. Cabrio, S. Villata, T. Mayer,
Acta 2.0: A modular architecture for multi-layer
argumentative analysis of clinical trials, in: L. D.
Raedt (Ed.), Proceedings of IJCAI-22, International
Joint Conferences on Artificial Intelligence Orga-
nization, 2022, pp. 5940–5943. URL: https://doi.
org/10.24963/ijcai.2022/859. doi:10.24963/ijcai.
2022/859, demo Track.
[31] T. Mayer, S. Marro, E. Cabrio, S. Villata, Enhancing
evidence-based medicine with natural language
argumentative analysis of clinical trials, Artificial
Intelligence in Medicine 118 (2021) 102098. URL:
https://www.sciencedirect.com/science/article/pii/
S0933365721000919. doi:10.1016/j.artmed.2021.102098.
[32] M. van der Meer, M. Reuver, U. Khurana, L. Krause,
S. B. Santamaría, Will it blend? mixing train-
ing paradigms & prompting for argument quality
prediction, in: G. Lapesa, et al. (Eds.), ArgMin-
ing@COLING 2022, ICCL, 2022, pp. 95–103. URL:
https://aclanthology.org/2022.argmining-1.8.
[33] M. Pojoni, L. Dumani, R. Schenkel, Argument-
mining from podcasts using chatgpt, in: L. Malburg,
D. Verma (Eds.), Proceedings of ICCBR-WS 2023,
volume 3438 of CEUR Workshop Proceedings, CEUR-
WS.org, 2023, pp. 129–144. URL: https://ceur-ws.
org/Vol-3438/paper_10.pdf.
[34] A. Al Zubaer, M. Granitzer, J. Mitrović, Performance
analysis of large language models in the domain of
legal argument mining, Frontiers in Artificial Intel-
ligence 6 (2023). URL: https://www.frontiersin.org/
articles/10.3389/frai.2023.1278796. doi:10.3389/
frai.2023.1278796.
[35] B. Liu, V. Schlegel, R. Batista-Navarro, S. Anani-
adou, Argument mining as a multi-hop generative
machine reading comprehension task, in: The 2023
Conference on Empirical Methods in Natural Lan-
guage Processing, 2023. URL: https://openreview.
net/forum?id=KTFxOnrbvu.
[36] Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng,
Y. Ma, Llamafactory: Unified efficient fine-tuning of
100+ language models, in: Proceedings of the 62nd
Annual Meeting of the ACL (Volume 3: System
Demonstrations), ACL, Bangkok, Thailand, 2024.
URL: http://arxiv.org/abs/2403.13372.
A. Appendix

Examples of prompts for LLaMA 3 for the zero-shot learning (ZSL), in-context learning (ICL) and fine-tuning (FT) settings are provided below.

A.1. Zero-Shot Learning

### Task description: You are an expert biomedical assistant that takes 1) an abstract text and 2) the list of all arguments from this abstract text, and must classify all arguments into one of two classes: Claim or Premise. 68.0052% of examples are of type Premise and 31.9948% of type Claim. You must absolutely not generate any text or explanation other than the following JSON format {"Argument 1": , ..., "Argument n": }

### Class definitions: Claim = A claim in the abstract of an RCT is a statement or conclusion about the findings of the study. Premise = A premise in the abstract of an RCT is a statement that provides an evidence or proof for a claim.

### Abstract: Few controlled clinical trials exist to support oral combination therapy in pulmonary arterial hypertension (PAH). Patients with PAH (idiopathic [IPAH] or associated with connective tissue disease [APAH-CTD]) taking bosentan (62.5 or 125 mg twice daily at a stable dose for ≥3 months) were randomized (1:1) to sildenafil (20 mg, 3 times daily; n = 50) or placebo (n = 53). The primary endpoint was change from baseline in 6-min walk distance (6MWD) at week 12, assessed using analysis of covariance. Patients could continue in a 52-week extension study. An analysis of covariance main-effects model was used, which included categorical terms for treatment, baseline 6MWD (<325 m; ≥325 m), and baseline aetiology; sensitivity analyses were subsequently performed. In sildenafil versus placebo arms, week-12 6MWD increases were similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1 m in APAH-CTD (35% of population). One-year survival was 96%; patients maintained modest 6MWD improvements. Changes in WHO functional class and Borg dyspnoea score and incidence of clinical worsening did not differ. Headache, diarrhoea, and flushing were more common with sildenafil. Sildenafil, in addition to stable (≥3 months) bosentan therapy, had no benefit over placebo for 12-week change from baseline in 6MWD. The influence of PAH aetiology warrants future study.

### Arguments:

Argument 1=In sildenafil versus placebo arms, week-12 6MWD increases were similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1 m in APAH-CTD (35% of population).
Argument 2=Changes in WHO functional class and Borg dyspnoea score and incidence of clinical worsening did not differ.
Argument 3=Headache, diarrhoea, and flushing were more common with sildenafil.
Argument 4=Sildenafil, in addition to stable (≥3 months) bosentan therapy, had no benefit over placebo for 12-week change from baseline in 6MWD.

### Result:

A.2. In-Context Learning (ICL)

### Task description: You are an expert biomedical assistant that takes 1) an abstract text, 2) the list of all arguments from this abstract text, and must classify all arguments into one of two classes: Claim or Premise. 68.0052% of examples are of type Premise and 31.9948% of type Claim. You must absolutely not generate any text or explanation other than the following JSON format {"Argument 1": , ..., "Argument n": }

### Class definitions: Claim = A claim in the abstract of an RCT is a statement or conclusion about the findings of the study. Premise = A premise in the abstract of an RCT is a statement that provides an evidence or proof for a claim.

### Examples:

## Example 1

# Abstract:

Treatment of patients with advanced or metastatic esophagogastric adenocarcinoma should not only prolong life but also provide relief of symptoms and improve quality of life (QOL). Esophagogastric adenocarcinoma mainly occurs in elderly patients, but they are underrepresented in most clinical trials and often do not receive effective combination chemotherapy, most probably for fear of intolerance. Using validated instruments, we prospectively assessed QOL within the randomized FLOT65+ phase II trial. Within the FLOT65+ trial, a total of 143 patients aged ≥65 years were randomly allocated to receive biweekly oxaliplatin plus 5-fluorouracil (5-FU) continuous infusion and folinic acid (FLO) or the same regimen in combination with docetaxel 50 mg/m(2) (FLOT). The European Organisation for Research and Treatment of Cancer Quality of Life Questionnaire C30 (EORTC QLQ-C30) and the gastric module STO22 were administered every 8 weeks until progression. Time to definitive deterioration of QOL parameters was analyzed and compared within the treatment arms. The median age of patients was 70 years. Patients receiving FLOT exhibited higher response rates and had improved disease-free and progression-free survival (PFS). The proportions of patients with evaluable baseline EORTC QLQ-C30 and STO22 questionnaires were balanced (83 % in FLOT and 89 % in FLO). Considering evaluable patients with assessable questionnaires (n = 123), neither functioning nor symptom parameters differed significantly in favor of one of the two treatment groups. Particularly, there was no significant difference regarding time to definitive deterioration of global health status/quality of life from baseline (primary endpoint). Patients receiving FLO or FLOT as palliative treatment (n = 98) achieved comparable QOL results. Although toxicity was higher in patients receiving FLOT, no negative impact of the addition of docetaxel on QOL parameters could be demonstrated. Thus, elderly patients in need of intensified chemotherapy may receive FLOT without compromising patient-reported outcome parameters.

# Arguments:

Argument 1=Patients receiving FLOT exhibited higher response rates and had improved disease-free and progression-free survival (PFS).
Argument 2=there was no significant difference regarding time to definitive deterioration of global health status/quality of life from baseline (primary endpoint).
Argument 3=patients receiving FLO or FLOT as palliative treatment (n = 98) achieved comparable QOL results.
Argument 4=Although toxicity was higher in patients receiving FLOT,
Argument 5=no negative impact of the addition of docetaxel on QOL parameters could be demonstrated.
Argument 6=elderly patients in need of intensified chemotherapy may receive FLOT without compromising patient-reported outcome parameters.

# Result:

{"Argument 1": "Premise", "Argument 2": "Premise", "Argument 3": "Premise", "Argument 4": "Premise", "Argument 5": "Premise", "Argument 6": "Claim"}

## Example 2

# Abstract:

Chemotherapy prolongs survival and improves quality of life (QOL) for good performance status (PS) patients with advanced non-small cell lung cancer (NSCLC). Targeted therapies may improve chemotherapy effectiveness without worsening toxicity. SGN-15 is an antibody-drug conjugate (ADC), consisting of a chimeric murine monoclonal antibody recognizing the Lewis Y (Le(y)) antigen, conjugated to doxorubicin. Le(y) is an attractive target since it is expressed by most NSCLC. SGN-15 was active against Le(y)-positive tumors in early phase clinical trials and was synergistic with docetaxel in preclinical experiments. This Phase II, open-label study was conducted to confirm the activity of SGN-15 plus docetaxel in previously treated NSCLC patients. Sixty-two patients with recurrent or metastatic NSCLC expressing Le(y), one or two prior chemotherapy regimens, and PS< or =2 were randomized 2:1 to receive SGN-15 200 mg/m2/week with docetaxel 35 mg/m2/week (Arm A) or docetaxel 35 mg/m2/week alone (Arm B) for 6 of 8 weeks. Intrapatient dose-escalation of SGN-15 to 350 mg/m2 was permitted in the second half of the study. Endpoints were survival, safety, efficacy, and quality of life. Forty patients on Arm A and 19 on Arm B received at least one treatment. Patients on Arms A and B had median survivals of 31.4 and 25.3 weeks, 12-month survivals of 29% and 24%, and 18-month survivals of 18% and 8%, respectively. Toxicity was mild in both arms. QOL analyses favored Arm A. SGN-15 plus docetaxel is a well-tolerated and active second and third line treatment for NSCLC patients; further study of alternate schedules to maximize synergy between these agents is warranted.

# Arguments:

Argument 1=Chemotherapy prolongs survival and improves quality of life (QOL) for good performance status (PS) patients with advanced non-small cell lung cancer (NSCLC).
Argument 2=Targeted therapies may improve chemotherapy effectiveness without worsening toxicity.
Argument 3=Le(y) is an attractive target since it is expressed by most NSCLC.
Argument 4=SGN-15 was active against Le(y)-positive tumors in early phase clinical trials and was synergistic with docetaxel in preclinical experiments.
Argument 5=Patients on Arms A and B had median survivals of 31.4 and 25.3 weeks, 12-month survivals of 29% and 24%, and 18-month survivals of 18% and 8%, respectively.
Argument 6=Toxicity was mild in both arms.
Argument 7=QOL analyses favored Arm A.
Argument 8=SGN-15 plus docetaxel is a well-tolerated and active second and third line treatment for NSCLC patients

# Result:

{"Argument 1": "Claim", "Argument 2": "Claim", "Argument 3": "Claim", "Argument 4": "Premise", "Argument 5": "Premise", "Argument 6": "Premise", "Argument 7": "Premise", "Argument 8": "Claim"}

## Example 3

# Abstract:

The impact of treatment on health-related quality of life (HRQoL) is an important consideration in the adjuvant treatment of operable breast cancer. Here we report mature HRQoL outcomes from the ATAC trial, comparing anastrozole with tamoxifen as primary adjuvant therapy for postmenopausal women with localized breast cancer. Patients completed the Functional Assessment of Cancer Therapy-Breast (FACT-B) questionnaire plus endocrine subscale (ES) at baseline, 3 and 6 months, and every 6 months thereafter. Baseline characteristics in the HRQoL sub-protocol were well balanced between the anastrozole (n = 335) and tamoxifen (n = 347) groups in the primary analysis population. As with previously published results at 2 years, there was no statistically significant difference in the Trial Outcome Index of the FACT-B, the primary endpoint of the study, between treatments at 5 years. There were no statistically significant differences between treatment groups in ES total scores. Consistent with the 2-year analysis, there were differences between treatment groups in patient-reported side effects: diarrhea (anastrozole 3.1% vs. tamoxifen 1.3%), vaginal dryness (18.5% vs. 9.1%), diminished libido (34.0% vs. 26.1%), and dyspareunia (17.3% vs. 8.1%) were significantly more frequent with anastrozole compared to tamoxifen. Dizziness (3.1% vs. 5.4%) and vaginal discharge (1.2% vs. 5.2%) were significantly less frequent with anastrozole compared to tamoxifen. In this, the first report of HRQoL over 5 years of initial adjuvant therapy with an aromatase inhibitor, we conclude that anastrozole and tamoxifen had similar impacts on HRQoL, which was maintained or slightly improved during the treatment period for both groups.

# Arguments:

Argument 1=The impact of treatment on health-related quality of life (HRQoL) is an important consideration in the adjuvant treatment of operable breast cancer.
Argument 2=As with previously published results at 2 years, there was no statistically significant difference in the Trial Outcome Index of the FACT-B, the primary endpoint of the study, between treatments at 5 years.
Argument 3=There were no statistically significant differences between treatment groups in ES total scores.
Argument 4=there were differences between treatment groups in patient-reported side effects:
Argument 5=diarrhea (anastrozole 3.1% vs. tamoxifen 1.3%), vaginal dryness (18.5% vs. 9.1%), diminished libido (34.0% vs. 26.1%), and dyspareunia (17.3% vs. 8.1%) were significantly more frequent with anastrozole compared to tamoxifen.
Argument 6=Dizziness (3.1% vs. 5.4%) and vaginal discharge (1.2% vs. 5.2%) were significantly less frequent with anastrozole compared to tamoxifen.
Argument 7=In this, the first report of HRQoL over 5 years of initial adjuvant therapy with an aromatase inhibitor, we conclude that anastrozole and tamoxifen had similar impacts on HRQoL, which was maintained or slightly improved during the treatment period for both groups.

# Result:

{"Argument 1": "Claim", "Argument 2": "Premise", "Argument 3": "Premise", "Argument 4": "Claim", "Argument 5": "Premise", "Argument 6": "Premise", "Argument 7": "Claim"}

# Abstract:

Few controlled clinical trials exist to support oral combination therapy in pulmonary arterial hypertension (PAH). Patients with PAH (idiopathic [IPAH] or associated with connective tissue disease [APAH-CTD]) taking bosentan (62.5 or 125 mg twice daily at a stable dose for ≥3 months) were randomized (1:1) to sildenafil (20 mg, 3 times daily; n = 50) or placebo (n = 53). The primary endpoint was change from baseline in 6-min walk distance (6MWD) at week 12, assessed using analysis of covariance. Patients could continue in a 52-week extension study. An analysis of covariance main-effects model was used, which included categorical terms for treatment, baseline 6MWD (<325 m; ≥325 m), and baseline aetiology; sensitivity analyses were subsequently performed. In sildenafil versus placebo arms, week-12 6MWD increases were similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1 m in APAH-CTD (35% of population). One-year survival was 96%; patients maintained modest 6MWD improvements. Changes in WHO functional class and Borg dyspnoea score and incidence of clinical worsening did not differ. Headache, diarrhoea, and flushing were more common with sildenafil. Sildenafil, in addition to stable (≥3 months) bosentan therapy, had no benefit over placebo for 12-week change from baseline in 6MWD. The influence of PAH aetiology warrants future study.

# Arguments:

Argument 1=In sildenafil versus placebo arms, week-12 6MWD increases were similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1 m in APAH-CTD (35% of population).
Argument 2=Changes in WHO functional class and Borg dyspnoea score and incidence of clinical worsening did not differ.
Argument 3=Headache, diarrhoea, and flushing were more common with sildenafil.
Argument 4=Sildenafil, in addition to stable (≥3 months) bosentan therapy, had no benefit over placebo for 12-week change from baseline in 6MWD.

# Result:

A.3. Fine-Tuning (FT)

### You are an expert in medical analysis. You are given the abstract of a random controlled trial which contains numbered argument components enclosed by tags. Your task is to classify each argument components in the essay as either "Claim" or "Premise". You must return a list of argument component types in following JSON format: "component_types": [component_type (str), component_type (str), ..., component_type (str)]

### Here is the abstract text: An open, randomized study was performed to assess the effects of supportive pamidronate treatment on morbidity from bone metastases in breast cancer patients. Eighty-one pamidronate patients and 80 control patients were monitored for a median of 18 and 21 months, respectively, for events of skeletal morbidity and the radiologic course of metastatic bone disease. The oral pamidronate dose was 600 mg/d (high dose [HD]) during the earliest study years, then changed to 300 mg/d (low dose [LD]) because of gastrointestinal toxicity. Twenty-nine of 81 pamidronate (HD/LD) patients first received 600 mg/d and were then changed to 300 mg/d; 52 of 81 pamidronate LD patients received 300 mg/d throughout the study. Tumor treatment was unrestricted. An overall intent-to-treat analysis was performed. In the pamidronate group, the occurrence of hypercalcemia, severe bone pain, and symptomatic impending fractures decreased by 65%, 30%, and 50%, respectively; event-rates of systemic treatment and radiotherapy decreased by 35% (P < or = .02). The event-free period (EFP), radiologic course of disease, and survival did not improve. Subgroup analyses suggested a dose-dependent treatment effect. Compared with their controls, in pamidronate HD/LD patients, events occurred 60% to 90% less frequently (P < or = .03) and the EFP was prolonged (P = .002). In pamidronate LD patients, event-rates decreased by 15% to 45% (P < or = .04). Gastrointestinal toxicity of pamidronate caused a 23% drop-out rate, but other cancer-associated factors seemed to contribute to this toxicity. Pamidronate treatment of breast cancer patients efficaciously reduced skeletal morbidity. The effect appeared to be dose-dependent. Further research on dose and mode of treatment is mandatory.

{"component_types": ["Premise", "Premise", "Claim", "Premise", "Premise", "Premise", "Claim", "Claim", "Claim", "Claim"]}
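The ZSL and ICL prompts above constrain the model to emit a JSON object mapping argument indices to classes. The following sketch illustrates one way such a completion could be parsed into an ordered label list; the function name and the fallback to the majority class "Premise" are our own assumptions, not part of the paper's pipeline.

```python
import json
import re


def parse_argument_labels(completion: str, n_args: int) -> list:
    """Parse a completion of the form
    {"Argument 1": "Premise", ..., "Argument n": "Claim"}
    into an ordered list of labels. Malformed output or missing
    entries fall back to the majority class "Premise"."""
    # Extract the first {...} block in case the model added stray text.
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    try:
        result = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        result = {}
    return [result.get(f"Argument {i}", "Premise") for i in range(1, n_args + 1)]
```

For instance, `parse_argument_labels('{"Argument 1": "Premise", "Argument 2": "Claim"}', 2)` yields `["Premise", "Claim"]`.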
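The ICL strategy combines kNN-based example selection with majority vote ensembling: several prompts, each built from a different set of retrieved demonstrations, are answered independently and their per-argument predictions are aggregated. A minimal sketch of the aggregation step, assuming each prompt's output has already been parsed into an equal-length label list (the function name is our own):

```python
from collections import Counter


def majority_vote(predictions: list) -> list:
    """Aggregate per-argument label lists from several ICL prompts
    (e.g., one per kNN-selected demonstration set) by majority vote.
    `predictions` is a list of equal-length label lists."""
    ensembled = []
    for labels in zip(*predictions):
        # most_common(1) returns the most frequent label; ties are
        # broken in favor of the label encountered first (CPython).
        ensembled.append(Counter(labels).most_common(1)[0][0])
    return ensembled
```

With three prompts predicting `["Claim", "Premise"]`, `["Claim", "Claim"]`, and `["Premise", "Claim"]`, the ensembled result is `["Claim", "Claim"]`.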