<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Pisa, Italy.
∗ Corresponding author.
† The authors contributed equally.
jeremie.cabessa@uvsq.fr (J. Cabessa); hugoh@playtika.com (H. Hernault); umer.mushtaq@univ-lr.fr (U. Mushtaq)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Argument Mining in BioMedicine: Zero-Shot, In-Context Learning and Fine-tuning with LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jérémie Cabessa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugo Hernault</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Umer Mushtaq</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>David Lab, University of Versailles Saint-Quentin (UVSQ) - University of Paris-Saclay, 78000 Versailles, France Institute of Computer Science of the Czech Academy of Sciences</institution>
          ,
          <addr-line>18207 Prague 8</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratoire Informatique</institution>
          ,
          <addr-line>Image, Interaction (L3i)</addr-line>
          ,
          <institution>University of La Rochelle</institution>
          ,
          <addr-line>17042 La Rochelle</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Playtika Ltd.</institution>
          ,
          <addr-line>CH-1003 Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Argument Mining (AM) aims to extract the complex argumentative structure of a text, and Argument Type Classification (ATC) is an essential sub-task of AM. Large Language Models (LLMs) have shown impressive capabilities in most NLP tasks and beyond. However, fine-tuning LLMs can be challenging. In-Context Learning (ICL) has been suggested as a bridging paradigm between training-free and fine-tuning settings for LLMs. In ICL, an LLM is conditioned to solve tasks using a few solved demonstration examples included in its prompt. We focus on AM in the biomedical AbstRCT dataset. We address ATC using quantized and unquantized LLaMA-3 models through zero-shot learning, in-context learning, and fine-tuning approaches. We introduce a novel ICL strategy that combines kNN-based example selection with majority vote ensembling, along with a well-designed fine-tuning strategy for ATC. In the zero-shot setting, we show that LLaMA-3 fails to achieve acceptable classification results, suggesting the need for additional training modalities. However, in our training-free ICL setting, LLaMA-3 can leverage relevant information from only a few demonstration examples to achieve very competitive results. Finally, in our fine-tuning setting, LLaMA-3 achieves state-of-the-art performance on the ATC task in the AbstRCT dataset.</p>
      </abstract>
      <kwd-group>
<kwd>Argument Mining</kwd>
        <kwd>NLP</kwd>
        <kwd>LLMs</kwd>
        <kwd>LLaMA-3</kwd>
        <kwd>Zero-Shot Learning</kwd>
        <kwd>In-Context Learning</kwd>
        <kwd>Fine-tuning</kwd>
        <kwd>Ensembling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>and fine-tuning approaches. Our contributions are as
follows:
• In the zero-shot learning setting, we show that
LLaMA-3 fails to achieve acceptable classification
results, suggesting the need for additional
training modalities.</p>
      <sec id="sec-1-1">
<title>Our work sits at the intersection of zero-shot learning, in-context learning and fine-tuning</title>
        <p>We implement and compare the performance of the latest openly
available LLMs using these three approaches for AM on the
AbstRCT dataset.</p>
<p>Our code is freely available on GitHub.</p>
        <p>• We introduce a novel ICL strategy that combines
kNN-based example selection with majority vote
ensembling. In this training-free setting,
LLaMA-3 can leverage relevant information from only
a few demonstration examples to achieve very
competitive results.</p>
      </sec>
      <sec id="sec-1-2">
<title>2. Related Works</title>
        <p>
          In early works, Argument Mining has been approached using both classical algorithms such as SVMs [15, 2, 16, 17] as well as recurrent neural network models such as BiLSTMs [
          <xref ref-type="bibr" rid="ref4">18, 19, 4</xref>
          ]. Transformer-based models, such as BERT [20], have also been utilized for AM, including multi-scale argument modelling and customized feature-injected BERT-based models [
          <xref ref-type="bibr" rid="ref5 ref6">21, 22, 23, 5, 6, 24, 25</xref>
          ]. AM in the biomedical AbstRCT dataset has been approached using LSTMs [26, 27], sequential transfer learning [28] as well as transformer-based models [
          <xref ref-type="bibr" rid="ref10 ref11 ref12">29, 30, 31</xref>
          ].
        </p>
        <p>Table 1. AbstRCT dataset statistics (dataset split / abstracts / ACs): Neo-train 350 / 2,291; Neo-test 100 / 691; Gla-test 100 / 615; Mix-test 100 / 609.</p>
        <p>
          More recently, AM sub-tasks have been modeled as text generation tasks using LLMs. For the Argument Type Classification (ATC) sub-task, this approach involves using a prompt template to generate the predicted class of an argument component. This method has been applied to various AM use-cases, such as podcast transcripts and legal documents [
          <xref ref-type="bibr" rid="ref13 ref14 ref15">32, 33, 34</xref>
          ]. The latest approach in this ‘AM using LLM text generation’ direction involves a prompt that includes the argument component as the query and the complete text as the context, to output the class of the argument component using a generative model [
          <xref ref-type="bibr" rid="ref16">35</xref>
          ]. In this study, the three AM sub-tasks are modeled using the Persuasive Essays (PE) and AbstRCT datasets.
        </p>
        <p>
          In contrast to the fine-tuning approach, a relevant training-free ICL prompting strategy for LLMs has been proposed [
          <xref ref-type="bibr" rid="ref9">9, 11</xref>
          ]. This strategy combines kNN-based example selection, generated chain-of-thought prompting, and majority vote ensembling for few-shot classification. Interestingly, the ICL strategy outperforms the fine-tuning approach on the datasets used in that study.
        </p>
        <p>
          A sample of the AbstRCT dataset is provided below. The argument components (ACs) and their corresponding classes are indicated by bold tags. &lt;AC1: Major Claim&gt;A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer.&lt;/AC1&gt; The purpose of this study was to assess the effects of these treatments on health-related quality of life (HQL). Men with metastatic prostate cancer (n = 161) were randomized to receive either daily prednisone alone or mitoxantrone (every 3 weeks) plus prednisone. Those who received prednisone alone could have mitoxantrone added after 6 weeks if there was no improvement in pain. HQL was assessed before treatment initiation and then every 3 weeks using the European Organization for Research and Treatment of Cancer Quality-of-Life Questionnaire C30 (EORTC QLQ-C30) and the Quality of Life Module-Prostate 14 (QOLM-P14), a trial-specific module developed for this study. An intent-to-treat analysis was used to determine the mean duration of HQL improvement and differences in improvement between groups of patients. &lt;AC2: Premise&gt;At 6 weeks, both groups showed improvement in several HQL domains&lt;/AC2&gt;, and &lt;AC3: Premise&gt;only physical functioning and pain were better in the mitoxantrone-plus-prednisone group than in the prednisone-alone group&lt;/AC3&gt;. &lt;AC4: Premise&gt;After 6 weeks, patients taking prednisone showed no improvement in HQL scores, whereas those taking mitoxantrone plus prednisone showed significant improvements in global quality of life (P = .009), four functioning domains, and nine symptoms (.001 &lt; P &lt; .01)&lt;/AC4&gt;, and &lt;AC5: Premise&gt;the improvement (&gt; 10 units on a scale of 0 to 100) lasted longer than in the prednisone-alone group (.004 &lt; P &lt; .05)&lt;/AC5&gt;. &lt;AC6: Premise&gt;The addition of mitoxantrone to prednisone after failure of prednisone alone was associated with improvements in pain, pain impact, pain relief, insomnia, and global quality of life (.001 &lt; P &lt; .003).&lt;/AC6&gt; &lt;AC7: Claim&gt;Treatment with mitoxantrone plus prednisone was associated with greater and longer-lasting improvement in several HQL domains and symptoms than treatment with prednisone alone.&lt;/AC7&gt;
        </p>
        <sec id="sec-1-2-1">
          <title>3.2. Zero-Shot Learning (ZSL) and In-Context Learning (ICL)</title>
          <p>(1) kNN-based example selection (k = 3, 5): First,
the 2k neighboring abstracts A1, . . . , A2k of A are
selected according to the following similarity measure.
For any abstract Ai, let the signature of Ai be
the embedding of the first sentence of Ai using the
BioBERT model. The abstracts A1, . . . , A2k are the
ones whose signatures are the closest, with respect
to cosine similarity, to the signature of A. Then, k
abstracts, Ai1, . . . , Aik, are randomly chosen from
A1, . . . , A2k. Afterwards, a prompt containing all
the ACs and their corresponding classes in these
k abstracts is constructed (kNN). Finally, the LLM
predicts the classes yˆ1, . . . , yˆm of c1, . . . , cm on
the basis of this prompt.
(2) n-Ensembling (n = 3, 5): The kNN-based
example selection step, which involves randomness, is
repeated n times (nEns), leading to a set of n
sequences of class predictions {(yˆi,1, . . . , yˆi,m) : i =
1, . . . , n}. The final class predictions yˆ1, . . . , yˆm of
c1, . . . , cm are obtained by applying a component-wise
majority vote to the n prediction sequences.</p>
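<p>For illustration, the two steps above can be sketched as follows. This is a minimal sketch with hypothetical helper names: in our setup, the signature embeddings would come from BioBERT and the class sequences from the LLM.</p>

```python
# Sketch of the 2-step ICL strategy: kNN-based example selection followed
# by component-wise majority vote ensembling (hypothetical helper names).
import random
from collections import Counter

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors (signatures).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def knn_select(query_sig, candidate_sigs, k):
    """Step 1: keep the 2k candidates closest to the query signature,
    then draw k of them at random (this randomness feeds the ensembling)."""
    scores = [cosine(query_sig, s) for s in candidate_sigs]
    top_2k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2 * k]
    return random.sample(top_2k, k)

def ensemble_vote(prediction_runs):
    """Step 2: component-wise majority vote over n prediction sequences."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*prediction_runs)]
```

<p>With n = 3 runs producing, e.g., three class sequences for the same ACs, ensemble_vote returns the per-component majority class.</p>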
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>Zero-shot learning (ZSL) is the paradigm where the LLM is asked to solve a downstream task without receiving any specific solved examples in the prompt</title>
        <p>
          By contrast, in-context learning (ICL) refers to the emergent ability of LLMs to solve a downstream task based on a few demonstration examples given in the prompt as contextual information [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. As the major advantage, the ZSL and ICL paradigms do not require any fine-tuning of the model’s parameters (i.e., they constitute a training-free framework).
        </p>
        <p>Formally, let x be a query input text and C = [I; t(xi1, yi1); . . . ; t(xik, yik)] be a context composed of instructions I concatenated with input-output pairs (xij, yij) in text format, where X = {x1, x2, . . . } and Y = {y1, y2, . . . } are the sets of possible inputs and outputs, respectively. The ZSL and ICL paradigms correspond to the cases where k = 0 and k &gt; 0, respectively. For input x, the LLM M predicts the output yˆ such that yˆ = arg max_{yi ∈ Y} PM(yi | C; x), where PM(yi | C; x) is the probability that M generates yi when C and x are given as prompt. The main rationale behind ZSL and ICL is that the consideration of a well-chosen context C increases the probability of M predicting the correct answer y for input x, i.e., that PM(y | C; x) &gt; PM(y | x).</p>
        <p>
          We consider a 2-step ICL strategy for argument type classification (ATC) inspired by a recent study [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] (see Figure 1). More precisely, let A be an abstract containing argument components (ACs) c1, . . . , cm with corresponding true classes y1, . . . , ym, where each yi ∈ {Claim, Premise}. Given the ACs c1, . . . , cm in the prompt, the LLM generates the corresponding class predictions yˆ1, . . . , yˆm as follows:
        </p>
        <p>The kNN-based example selection optimizes learning from few examples by selecting samples most similar to the current instance, rather than choosing them randomly. The ensembling step increases prediction robustness by selecting the most frequent predictions. Note that the relevance of the ensembling step relies on the random selection in the kNN step. This randomness ensures that the same predictions are not always produced, allowing for majority voting and thereby increasing robustness.</p>
        <p>To aid the LLM in generating predictions, additional task-specific information is typically included in the prompt. For example, definitions of the ‘Claim’ and ‘Premise’ classes, along with their statistics in the Neo train set, can be incorporated in the prompt (info). Moreover, in addition to the ACs c1, . . . , cm whose classes are to be predicted, the abstract text from which these ACs originate can be included in the prompt (abstract). According to this ICL strategy, the classes yˆ1, . . . , yˆm of c1, . . . , cm are predicted all-at-once (see Figure 1). Therefore, a prompt of the form ‘info + abstract + 3NN + 3Ens’ (see Table 3) indicates that the argument components (ACs) of the abstract are predicted all-at-once, by incorporating additional information and the entire abstract text as contextual cues in the prompt, and employing the ICL strategy with 3NN-based example selection and 3-ensembling. A similar ICL strategy, where the classes yˆ1, . . . , yˆm are inferred one-by-one (i.e., each model inference leads to a single prediction yˆj), has been considered but shown to be significantly less efficient. Due to space constraints, the latter results are omitted in this work.</p>
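<p>The construction of the context C can be sketched as a simple string template. This is an illustrative construction only; the actual prompts we use are given in Appendix A.</p>

```python
# Minimal sketch of ZSL vs. ICL prompt construction (hypothetical template).
def build_prompt(instructions, demonstrations, query):
    """Builds C = [I; t(x_i1, y_i1); ...; t(x_ik, y_ik)] followed by the
    query x. An empty demonstration list (k = 0) yields a zero-shot prompt;
    a non-empty one (k greater than 0) yields an in-context prompt."""
    parts = [instructions]
    for x, y in demonstrations:
        parts.append("Input: {}\nOutput: {}".format(x, y))
    parts.append("Input: {}\nOutput:".format(query))
    return "\n\n".join(parts)
```

<p>Calling build_prompt with no demonstrations corresponds to ZSL; passing the k selected (AC, class) pairs corresponds to ICL.</p>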
        <sec id="sec-1-3-1">
          <title>3.3. Fine-tuning</title>
          <p>
            Fine-tuning (FT) refers to the process of further training a pre-trained LLM on a downstream task. Previous studies indicate that relying solely on the text of an argument component is insufficient for predicting its argumentative class; additional contextual information is essential for achieving competitive classification accuracy [
            <xref ref-type="bibr" rid="ref2 ref5 ref6">2, 5, 6</xref>
            ]. Therefore, we propose a fine-tuning strategy that models the ATC task at the document level. Specifically, we incorporate task-specific information into each training
          </p>
          <p>[Figure 1: the prompt (info, abstract, examples 1 to k) is given to the LLM; the process is repeated n times, and the n prediction sequences are combined by a component-wise majority vote (nEns) into the final predictions.]</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>4.2. In-Context Learning</title>
          <p>sample and generate the class label predictions for the ACs of an abstract all-at-once.</p>
          <p>the entire text of the abstract (abstract) significantly improves the results. These expected observations serve as an ablation study and justify the usage of the additional information and full abstract text (prompt template ‘info + abstract’) in all subsequent experiments.</p>
          <p>3.4. Implementation Details</p>
          <p>
            As the embedding engine, we use dmis-lab’s BioBERT1. For zero-shot learning, ICL and fine-tuning, we experiment with the LLaMA-3-8B-Instruct and LLaMA-3-70B-Instruct models, as well as various GGML-quantized configurations of them2. For ICL, we set the generation temperature to 0.1. For fine-tuning, we use LoRA adapters with loraplus_lr_ratio of 16.0. We set a batch size of 2 and a learning rate of 5e-5. For implementation, we use the LLaMA-Factory3 framework [
            <xref ref-type="bibr" rid="ref17">36</xref>
            ]. Examples of the prompts we use for zero-shot learning, in-context learning and fine-tuning with LLaMA-3 are given in Appendix A.
          </p>
          <p>4. Results</p>
          <p>In all experiments, we observed that the models consistently generated the correct number of classes for each inference task. This observation remains valid for subsequent ICL and fine-tuning settings. It demonstrates the model’s capability to understand the correspondence between the number of input ACs and the number of classes to predict.</p>
          <p>In the ZSL training-free setting, across the Neo, Gla and Mix test sets, the performance of the LLMs strongly correlated with the complexity of these models, achieving maximal macro F1-scores of 0.698, 0.819 and 0.725, respectively. Overall, in ZSL, the LLMs fail to achieve acceptable results. These considerations underscore the need for implementing additional learning modalities to address the ATC task effectively.</p>
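<p>The JSON output format required by our task prompt (see Appendix A) can be post-processed as sketched below. The function name and error handling are illustrative only; the count check mirrors the observation that the model returns exactly one class per input AC.</p>

```python
# Sketch of post-processing the model's all-at-once JSON answer of the form
# {"Argument 1": class, ..., "Argument n": class} (hypothetical helper name).
import json

def parse_predictions(generated_text, num_acs):
    """Parse the generated JSON object and return the predicted classes
    in argument-component order."""
    data = json.loads(generated_text)
    if len(data) != num_acs:
        raise ValueError("expected one predicted class per argument component")
    return [data["Argument {}".format(i)] for i in range(1, num_acs + 1)]
```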
        </sec>
        <sec id="sec-1-3-3">
          <title>4.1. Zero-Shot Learning</title>
        </sec>
      </sec>
      <sec id="sec-1-4">
<title>Footnotes</title>
      <p>1 https://huggingface.co/dmis-lab; 2 https://github.com/ggerganov/ggml; 3 https://github.com/hiyouga/LLaMA-Factory</p>
      </sec>
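<p>The implementation settings listed above can be collected in plain Python dictionaries for reference. The key names are illustrative only and do not reflect LLaMA-Factory’s actual configuration schema.</p>

```python
# Hyperparameters stated in the implementation details (illustrative keys).
finetune_config = {
    "model": "LLaMA-3-8B-Instruct",        # also run with the 70B variant
    "adapter": "LoRA",
    "loraplus_lr_ratio": 16.0,
    "batch_size": 2,
    "learning_rate": 5e-5,
}

icl_config = {
    "temperature": 0.1,                    # generation temperature for ICL
    "embedding_model": "dmis-lab/BioBERT", # kNN signature embeddings
}
```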
      <sec id="sec-1-5">
      <title>The results for zero-shot learning (ZSL) on the ATC task</title>
        <p>
          are reported in Table 2. Recall that zero-shot learning corresponds to the prompting strategy where no nearest neighbors are included as demonstration examples, referred to as ‘info + abstract + 0NN’ in our notation. In an initial experimentation phase, we observed that adding complementary information (info) (definitions of ‘Claim’ and ‘Premise’ and dataset statistics) and including
        </p>
        <p>Table 2. ZSL results. Neo test: LLaMA-3-8b-Instruct-bnb-4bit 0.529, LLaMA-3-8b-Instruct 0.544, LLaMA-3-70b-Instruct-bnb-4bit 0.642. Gla test: LLaMA-3-8b-Instruct-bnb-4bit 0.553, LLaMA-3-8b-Instruct 0.569, LLaMA-3-70b-Instruct-bnb-4bit 0.755. Mix test: LLaMA-3-8b-Instruct-bnb-4bit 0.546, LLaMA-3-8b-Instruct 0.563, LLaMA-3-70b-Instruct-bnb-4bit 0.671.</p>
        <p>
          The results for in-context learning (ICL) on the ATC task are reported in Table 3. First, note that the transition from zero-shot learning (‘info + abstract + 0NN’, Table 2) to in-context learning (‘info + abstract + kNN’, Table 3) drastically improves the results. This validates the effectiveness of the kNN-based example selection method. In addition, except for the Mix test set, the 3NN strategy consistently outperforms the 5NN strategy, suggesting that three examples suffice for optimally learning the ATC task in an ICL setting. The inclusion of more demonstration examples correlates with a significant increase
        </p>
        <p>in prompt length, potentially hindering the performance
of the LLM or exceeding the maximum size of its
context. Furthermore, the ensembling strategy consistently
improves the results, even if only slightly, confirming that
the robustness of the results can indeed be strengthened
through ensembling predictions.</p>
        <p>Overall, the training-free ICL strategy achieves very
competitive F1-scores of 0.912, 0.910, and 0.929 on Neo,
Mix, and Gla test sets, respectively. However, these
results remain lower than those obtained by previous
training-dependent models (see Table 4, upper rows).</p>
        <sec id="sec-1-5-1">
          <title>4.3. Fine-Tuning</title>
          <p>Table 3 rows: for each test set (Neo, Gla, Mix) and each model (LLaMA-3-8b-Instruct, LLaMA-3-8b-Instruct-bnb-4bit, LLaMA-3-70b-Instruct-bnb-4bit), the prompt templates ‘info + abstract + 3NN’, ‘info + abstract + 5NN’ and ‘info + abstract + 3NN + 3Ens’ are evaluated.</p>
          <p>The results achieved by the fine-tuning (FT) strategy on
the ATC task are reported in Table 4. Our results show
that fine-tuning significantly outperforms ICL. These
findings suggest that the argumentative flow within
abstracts cannot be inferred solely from the knowledge
acquired during pre-training, and requires additional
parameter updates to be effectively learned.</p>
          <p>In this training-dependent context, we achieve maximal
F1-scores of 0.935, 0.913, and 0.951 on the Neo, Gla,
and Mix test sets, respectively, establishing new
state-of-the-art results for the Neo and Mix test sets. These
results suggest once again that the sequentiality of
arguments inside a specific corpus requires fine-tuning to be
optimally captured.</p>
          <p>Table 3. Results for ATC on the three test sets of the AbstRCT dataset with LLaMA-3 models using the 2-step ICL strategy described in the text: info + abstract + 3NN 0.859 0.926 0.893; info + abstract + 5NN 0.866 0.922 0.894; info + abstract + 3NN + 3Ens 0.885 0.940 0.913; LLaMA-3-70b-Instruct-bnb-4bit: info + abstract + 3NN 0.905 0.954 0.929; info + abstract + 5NN 0.906 0.952 0.929; info + abstract + 3NN + 3Ens 0.904 0.952 0.928.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <p>In this work, we address argument type classification
(ATC) in the biomedical AbstRCT dataset with openly
available LLaMA-3 models from the three-fold perspective of
zero-shot learning (ZSL), in-context learning (ICL) and
fine-tuning (FT). We show that ZSL fails to achieve
acceptable performance, ICL significantly improves the results,
and FT reaches state-of-the-art performance.</p>
      <p>
        These results support the fact that the ATC task cannot be solved in a zero-shot setting by relying solely on general-purpose language modalities acquired during pre-training. Additional learning is essential, either in the form of solved demonstration examples (ICL) or via parameters’ updates (FT). We conjecture that the sequential flow of arguments within a text is a corpus-specific feature that cannot be inferred through zero-shot methods.
      </p>
      <p>Table 4 (Neo test set): ResAttArg (Ensemble) [27], SeqMT [28], MRC_GEN [
        <xref ref-type="bibr" rid="ref16">35</xref>
        ], GIAM [25]; LLaMA-3-8B-Instruct 0.919, LLaMA-3-8B-Instruct-bnb-4bit 0.935, LLaMA-3-70B-Instruct 0.929, LLaMA-3-70B-Instruct-bnb-4bit 0.921.</p>
      <p>
        Previous works demonstrated that the text of argument
components alone does not suffice to infer their
argumentative roles [
        <xref ref-type="bibr" rid="ref2 ref4 ref6">2, 4, 6</xref>
        ]. Additional contextual, structural
and syntactic features are necessary. In our ICL and FT
settings, comprehensive contextual and structural
information is incorporated through task-specific information
and the complete abstract text provided in the prompt. This
information enables the model to discern the sequence
of arguments, their associated markers, and other
characteristics closely associated with their argumentative
roles.
      </p>
      <p>For future work, the design and implementation of a
full AM pipeline using LLMs represents a major
milestone. In this scenario, the LLM would take raw texts as
input and produce a detailed map of the argumentative
structure as output. We believe that LLMs will
substantially transform the landscape of AM and its practical
applications.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This work benefited from access to the computing resources of the L3i laboratory, operated and hosted by the University of La Rochelle</title>
        <p>It is financed by the French
government and the Region Nouvelle-Aquitaine. This
research also benefited from institutional support RVO:
67985807 and was partially supported by the grant of the
Czech Science Foundation No. GA22-02067S. Finally, we
are grateful to Playtika Ltd. for their support of this
research.</p>
        <p>ARXIV.2311.16452. arXiv:2311.16452.
[10] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. K. et al. (Ed.), Proceedings of NeurIPS 2022, volume 35, 2022, pp. 24824–24837. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[11] S. Lei, G. Dong, X. Wang, K. Wang, S. Wang, InstructERC: Reforming emotion recognition in conversation with a retrieval multi-task LLMs framework, CoRR abs/2309.11911 (2023). URL: https://doi.org/10.48550/arXiv.2309.11911. doi:10.48550/ARXIV.2309.11911. arXiv:2309.11911.
[12] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, 2023. arXiv:2203.11171.
[13] H. Nori, N. King, S. M. McKinney, D. Carignan, E. Horvitz, Capabilities of GPT-4 on medical challenge problems, CoRR abs/2303.13375 (2023). URL: https://doi.org/10.48550/arXiv.2303.13375. doi:10.48550/ARXIV.2303.13375. arXiv:2303.13375.
[14] T. Mayer, Argument Mining on Clinical Trials, Theses, Université Côte d’Azur, 2020. URL: https://theses.hal.science/tel-03209489.
[15] R. Mochales, M. Moens, Argumentation mining, Artificial Intelligence and Law 19 (2011) 1–22. doi:10.1007/s10506-010-9104-x.
[16] I. Habernal, I. Gurevych, Argumentation mining in user-generated web discourse, Computational Linguistics 43 (2017) 125–179. URL: https://aclanthology.org/J17-1004. doi:10.1162/COLI_a_00276.
[17] R. Levy, Y. Bilu, D. Hershcovich, E. Aharoni, N. Slonim, Context dependent claim detection, in: ICCL, 2014. URL: https://api.semanticscholar.org/CorpusID:18847466.
[18] S. Eger, J. Daxenberger, I. Gurevych, Neural end-to-end learning for computational argumentation mining, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of ACL 2017, ACL, Vancouver, Canada, 2017, pp. 11–22. URL: https://aclanthology.org/P17-1002. doi:10.18653/v1/P17-1002.
[19] V. Niculae, J. Park, C. Cardie, Argument mining with structured SVMs and RNNs, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of ACL 2017, ACL, Vancouver, Canada, 2017, pp. 985–995. URL: https://aclanthology.org/P17-1091. doi:10.18653/v1/P17-1091.
[20] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. B. et al. (Ed.), Proceedings of NAACL-HLT 2019, ACL, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
[21] G. Zhang, P. Nulty, D. Lillis, Enhancing legal argument mining with domain pre-training and neural networks, CoRR abs/2202.13457 (2022). URL: https://arxiv.org/abs/2202.13457. arXiv:2202.13457.
[22] H. Wang, Z. Huang, Y. Dou, Y. Hong, Argumentation mining on essays at multi scales, in: D. S. et al. (Ed.), Proceedings of COLING 2020, ICCL, Barcelona, Spain (Online), 2020, pp. 5480–5493. URL: https://aclanthology.org/2020.coling-main.478. doi:10.18653/v1/2020.coling-main.478.
[23] S. Fioravanti, A. Zugarini, F. Giannini, L. Rigutini, M. Maggini, M. Diligenti, Linguistic feature injection for efficient natural language processing, in: IJCNN 2023, June 18-23, 2023, IEEE, 2023, pp. 1–7. URL: https://doi.org/10.1109/IJCNN54540.2023.10191680. doi:10.1109/IJCNN54540.2023.10191680.
[24] J. Bao, C. Fan, J. Wu, Y. Dang, J. Du, R. Xu, A neural transition-based model for argumentation mining, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the ACL and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), ACL, Online, 2021, pp. 6354–6364. URL: https://aclanthology.org/2021.acl-long.497. doi:10.18653/v1/2021.acl-long.497.
[25] B. Liu, V. Schlegel, P. Thompson, R. T. Batista-Navarro, S. Ananiadou, Global information-aware argument mining based on a top-down multi-turn QA model, Information Processing &amp; Management 60 (2023) 103445. URL: https://www.sciencedirect.com/science/article/pii/S0306457323001826. doi:10.1016/j.ipm.2023.103445.
[26] A. Galassi, M. Lippi, P. Torroni, Argumentative link prediction using residual networks and multi-objective learning, in: N. Slonim, R. Aharonov (Eds.), Proceedings of the 5th Workshop on Argument Mining, ACL, Brussels, Belgium, 2018, pp. 1–10. URL: https://aclanthology.org/W18-5201. doi:10.18653/v1/W18-5201.
[27] A. Galassi, M. Lippi, P. Torroni, Multi-task attentive residual networks for argument mining, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023) 1877–1892. doi:10.1109/TASLP.2023.3275040.
[28] J. Si, L. Sun, D. Zhou, J. Ren, L. Li, Biomedical argument mining based on sequential multi-task learning, IEEE/ACM Trans. Comput. Biol. Bioinformatics 20 (2022) 864–874. URL: https://doi.org/</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Appendix</title>
      <sec id="sec-4-1">
        <title>Examples of prompts for LLaMA 3 for the zero-shot learning (ZSL), in-context learning (ICL) and fine-tuning (FT) settings are provided below.</title>
        <p>A.1. Zero-Shot Learning
### Task description: You are an expert biomedical assistant that takes 1) an abstract
text and 2) the list of all arguments from this abstract text, and must classify all
arguments into one of two classes: Claim or Premise. 68.0052% of examples are of
type Premise and 31.9948% of type Claim. You must absolutely not generate any text
or explanation other than the following JSON format {"Argument 1": &lt;predicted class
for Argument 1 (str)&gt;, ..., "Argument n": &lt;predicted class for Argument n (str)&gt;}
### Class definitions: Claim = A claim in the abstract of an RCT is a statement or
conclusion about the findings of the study. Premise = A premise in the abstract of an
RCT is a statement that provides an evidence or proof for a claim.
### Abstract: Few controlled clinical trials exist to support oral
combination therapy in pulmonary arterial hypertension (PAH). Patients with PAH
(idiopathic [IPAH] or associated with connective tissue disease [APAH-CTD])
taking bosentan (62.5 or 125 mg twice daily at a stable dose for ≥ 3 months) were
randomized (1:1) to sildenafil (20 mg, 3 times daily; n = 50) or placebo (n = 53).
The primary endpoint was change from baseline in 6-min walk distance (6MWD)
at week 12, assessed using analysis of covariance. Patients could continue in a
52-week extension study. An analysis of covariance main-effects model was used,
which included categorical terms for treatment, baseline 6MWD (&lt;325 m; ≥ 325
m), and baseline aetiology; sensitivity analyses were subsequently performed.
In sildenafil versus placebo arms, week-12 6MWD increases were similar (least
squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to 17.1 m]; P
= 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ± 57.4 m,
respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1 m in
APAH-CTD (35% of population). One-year survival was 96%; patients maintained
modest 6MWD improvements. Changes in WHO functional class and Borg
dyspnoea score and incidence of clinical worsening did not differ. Headache,
diarrhoea, and flushing were more common with sildenafil. Sildenafil, in addition
to stable (≥ 3 months) bosentan therapy, had no benefit over placebo for 12-week
change from baseline in 6MWD. The influence of PAH aetiology warrants future study.
### Arguments: Argument 1=In sildenafil versus placebo arms, week-12 6MWD
increases were similar (least squares mean difference [sildenafil-placebo], -2.4 m
[90% CI: -21.8 to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ±
45.7 versus 11.8 ± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0
versus 17.5 ± 59.1 m in APAH-CTD (35% of population).</p>
        <p>Argument 2=Changes in WHO functional class and Borg dyspnoea score and
incidence of clinical worsening did not differ.</p>
        <p>Argument 3=Headache, diarrhoea, and flushing were more common with sildenafil.
Argument 4=Sildenafil, in addition to stable ( ≥ 3 months) bosentan therapy, had no
benefit over placebo for 12-week change from baseline in 6MWD.
### Result:
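The zero-shot protocol above (task description, class definitions, abstract, numbered arguments, then a JSON answer keyed by "Argument i") can be sketched in code as follows. This is a minimal illustration; the helper names (build_zero_shot_prompt, parse_result) are our own, not from the paper, and only the section markers and the JSON contract mirror the prompt text.

```python
import json

def build_zero_shot_prompt(task_description, class_definitions, abstract, arguments):
    """Assemble a zero-shot prompt in the format shown above (illustrative helper)."""
    lines = [
        f"### Task description: {task_description}",
        f"### Class definitions: {class_definitions}",
        f"### Abstract: {abstract}",
        # Arguments are numbered from 1, as in the prompt: "Argument 1=...".
        "### Arguments: " + " ".join(
            f"Argument {i}={arg}" for i, arg in enumerate(arguments, start=1)),
        "### Result:",
    ]
    return "\n".join(lines)

def parse_result(model_output, n_arguments):
    """Parse the model's JSON answer into an ordered list of class labels."""
    predictions = json.loads(model_output)
    return [predictions[f"Argument {i}"] for i in range(1, n_arguments + 1)]
```

Constraining the model to emit only the JSON object keeps parsing deterministic: parse_result recovers one "Claim"/"Premise" label per argument, in order.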
A.2. In-Context Learning (ICL)
### Task description: You are an expert biomedical assistant that takes 1) an
abstract text, 2) the list of all arguments from this abstract text, and must classify all
arguments into one of two classes: Claim or Premise. 68.0052% of examples are of
type Premise and 31.9948% of type Claim. You must absolutely not generate any text
or explanation other than the following JSON format {"Argument 1": &lt;predicted class
for Argument 1 (str)&gt;, ..., "Argument n": &lt;predicted class for Argument n (str)&gt;}
### Class definitions: Claim = A claim in the abstract of an RCT is a statement or
conclusion about the findings of the study. Premise = A premise in the abstract of an
RCT is a statement that provides evidence or proof for a claim.
### Examples:
## Example 1
Treatment of patients with advanced or metastatic esophagogastric adenocarcinoma
should not only prolong life but also provide relief of symptoms and improve quality
of life (QOL). Esophagogastric adenocarcinoma mainly occurs in elderly patients, but
they are underrepresented in most clinical trials and often do not receive effective
combination chemotherapy, most probably for fear of intolerance. Using validated
instruments, we prospectively assessed QOL within the randomized FLOT65+
phase II trial. Within the FLOT65+ trial, a total of 143 patients aged ≥ 65 years
were randomly allocated to receive biweekly oxaliplatin plus 5-fluorouracil (5-FU)
continuous infusion and folinic acid (FLO) or the same regimen in combination
with docetaxel 50 mg/m(2) (FLOT). The European Organisation for Research and
Treatment of Cancer Quality of Life Questionnaire C30 (EORTC QLQ-C30) and
the gastric module STO22 were administered every 8 weeks until progression.
Time to definitive deterioration of QOL parameters was analyzed and compared
within the treatment arms. The median age of patients was 70 years. Patients
receiving FLOT exhibited higher response rates and had improved disease-free and
progression-free survival (PFS). The proportions of patients with evaluable baseline
EORTC QLQ-C30 and STO22 questionnaires were balanced (83 % in FLOT and 89 %
in FLO). Considering evaluable patients with assessable questionnaires (n = 123),
neither functioning nor symptom parameters differed significantly in favor of one of
the two treatment groups. Particularly, there was no significant difference regarding
time to definitive deterioration of global health status/quality of life from baseline
(primary endpoint). Notably, patients receiving FLO or FLOT as palliative treatment
(n = 98) achieved comparable QOL results. Although toxicity was higher in patients
receiving FLOT, no negative impact of the addition of docetaxel on QOL parameters
could be demonstrated. Thus, elderly patients in need of intensified chemotherapy
may receive FLOT without compromising patient-reported outcome parameters.
Argument 1=Patients receiving FLOT exhibited higher response rates and had
improved disease-free and progression-free survival (PFS).</p>
        <p>Argument 2=there was no significant difference regarding time to definitive
deterioration of global health status/quality of life from baseline (primary endpoint).
Argument 3=patients receiving FLO or FLOT as palliative treatment (n = 98) achieved
comparable QOL results.</p>
        <p>Argument 4=Although toxicity was higher in patients receiving FLOT,
Argument 5=no negative impact of the addition of docetaxel on QOL parameters
could be demonstrated.</p>
        <p>Argument 6=elderly patients in need of intensified chemotherapy may receive FLOT
without compromising patient-reported outcome parameters.
{"Argument 1": "Premise", "Argument 2": "Premise", "Argument 3": "Premise",
"Argument 4": "Premise", "Argument 5": "Premise", "Argument 6": "Claim"}
## Example 2
Chemotherapy prolongs survival and improves quality of life (QOL) for good
performance status (PS) patients with advanced non-small cell lung cancer (NSCLC).
Targeted therapies may improve chemotherapy effectiveness without worsening
toxicity. SGN-15 is an antibody-drug conjugate (ADC), consisting of a chimeric
murine monoclonal antibody recognizing the Lewis Y (Le(y)) antigen, conjugated
to doxorubicin. Le(y) is an attractive target since it is expressed by most NSCLC.
SGN-15 was active against Le(y)-positive tumors in early phase clinical trials and
was synergistic with docetaxel in preclinical experiments. This Phase II, open-label
study was conducted to confirm the activity of SGN-15 plus docetaxel in previously
treated NSCLC patients. Sixty-two patients with recurrent or metastatic NSCLC
expressing Le(y), one or two prior chemotherapy regimens, and PS&lt; or =2 were
randomized 2:1 to receive SGN-15 200 mg/m2/week with docetaxel 35 mg/m2/week
(Arm A) or docetaxel 35 mg/m2/week alone (Arm B) for 6 of 8 weeks. Intrapatient
dose-escalation of SGN-15 to 350 mg/m2 was permitted in the second half of the
study. Endpoints were survival, safety, efficacy, and quality of life. Forty patients on
Arm A and 19 on Arm B received at least one treatment. Patients on Arms A and B
had median survivals of 31.4 and 25.3 weeks, 12-month survivals of 29% and 24%,
and 18-month survivals of 18% and 8%, respectively. Toxicity was mild in both arms.
QOL analyses favored Arm A. SGN-15 plus docetaxel is a well-tolerated and active
second and third line treatment for NSCLC patients. Ongoing studies are exploring
alternate schedules to maximize synergy between these agents.
Argument 1=Chemotherapy prolongs survival and improves quality of life (QOL) for
good performance status (PS) patients with advanced non-small cell lung cancer
(NSCLC).</p>
        <p>Argument 2=Targeted therapies may improve chemotherapy effectiveness without
worsening toxicity.</p>
        <p>Argument 3=Le(y) is an attractive target since it is expressed by most NSCLC.
Argument 4=SGN-15 was active against Le(y)-positive tumors in early phase clinical
trials and was synergistic with docetaxel in preclinical experiments.
Argument 5=Patients on Arms A and B had median survivals of 31.4 and 25.3
weeks, 12-month survivals of 29% and 24%, and 18-month survivals of 18% and 8%,
respectively.
Argument 6=Toxicity was mild in both arms.</p>
        <p>Argument 7=QOL analyses favored Arm A.</p>
        <p>Argument 8=SGN-15 plus docetaxel is a well-tolerated and active second and third
line treatment for NSCLC patients.
## Example 3
The impact of treatment on health-related quality of life (HRQoL) is an important
consideration in the adjuvant treatment of operable breast cancer. Here we report
mature HRQoL outcomes from the ATAC trial, comparing anastrozole with tamoxifen
as primary adjuvant therapy for postmenopausal women with localized breast cancer.
Patients completed the Functional Assessment of Cancer Therapy-Breast (FACT-B)
questionnaire plus endocrine subscale (ES) at baseline, 3 and 6 months, and every 6
months thereafter. Baseline characteristics in the HRQoL sub-protocol were well
balanced between the anastrozole (n = 335) and tamoxifen (n = 347) groups in the
primary analysis population. As with previously published results at 2 years, there
was no statistically significant difference in the Trial Outcome Index of the FACT-B,
the primary endpoint of the study, between treatments at 5 years. There were no
statistically significant differences between treatment groups in ES total scores.
Consistent with the 2-year analysis, there were differences between treatment groups
in patient-reported side effects: diarrhea (anastrozole 3.1% vs. tamoxifen 1.3%),
vaginal dryness (18.5% vs. 9.1%), diminished libido (34.0% vs. 26.1%), and dyspareunia
(17.3% vs. 8.1%) were significantly more frequent with anastrozole compared to
tamoxifen. Dizziness (3.1% vs. 5.4%) and vaginal discharge (1.2% vs. 5.2%) were
significantly less frequent with anastrozole compared to tamoxifen. In this, the first
report of HRQoL over 5 years of initial adjuvant therapy with an aromatase inhibitor,
we conclude that anastrozole and tamoxifen had similar impacts on HRQoL, which
was maintained or slightly improved during the treatment period for both groups.
Argument 1=The impact of treatment on health-related quality of life (HRQoL) is an
important consideration in the adjuvant treatment of operable breast cancer.
Argument 2=As with previously published results at 2 years, there was no statistically
significant difference in the Trial Outcome Index of the FACT-B, the primary
endpoint of the study, between treatments at 5 years.</p>
        <p>Argument 3=There were no statistically significant differences between treatment
groups in ES total scores.</p>
        <p>Argument 4=there were differences between treatment groups in patient-reported
side effects:
Argument 5=diarrhea (anastrozole 3.1% vs. tamoxifen 1.3%), vaginal dryness (18.5%
vs. 9.1%), diminished libido (34.0% vs. 26.1%), and dyspareunia (17.3% vs. 8.1%) were
significantly more frequent with anastrozole compared to tamoxifen.
Argument 6=Dizziness (3.1% vs. 5.4%) and vaginal discharge (1.2% vs. 5.2%) were
significantly less frequent with anastrozole compared to tamoxifen.</p>
        <p>Argument 7=In this, the first report of HRQoL over 5 years of initial adjuvant therapy
with an aromatase inhibitor, we conclude that anastrozole and tamoxifen had similar
impacts on HRQoL, which was maintained or slightly improved during the treatment
period for both groups.
Few controlled clinical trials exist to support oral combination therapy in pulmonary
arterial hypertension (PAH). Patients with PAH (idiopathic [IPAH] or associated with
connective tissue disease [APAH-CTD]) taking bosentan (62.5 or 125 mg twice daily
at a stable dose for ≥ 3 months) were randomized (1:1) to sildenafil (20 mg, 3 times
daily; n = 50) or placebo (n = 53). The primary endpoint was change from baseline
in 6-min walk distance (6MWD) at week 12, assessed using analysis of covariance.
Patients could continue in a 52-week extension study. An analysis of covariance
main-effects model was used, which included categorical terms for treatment,
baseline 6MWD (&lt;325 m; ≥ 325 m), and baseline aetiology; sensitivity analyses were
subsequently performed. In sildenafil versus placebo arms, week-12 6MWD increases
were similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8
to 17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8
± 57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ±
59.1 m in APAH-CTD (35% of population). One-year survival was 96%; patients
maintained modest 6MWD improvements. Changes in WHO functional class and
Borg dyspnoea score and incidence of clinical worsening did not differ. Headache,
diarrhoea, and flushing were more common with sildenafil. Sildenafil, in addition
to stable (≥ 3 months) bosentan therapy, had no benefit over placebo for 12-week
change from baseline in 6MWD. The influence of PAH aetiology warrants future study.
Argument 1=In sildenafil versus placebo arms, week-12 6MWD increases were
similar (least squares mean difference [sildenafil-placebo], -2.4 m [90% CI: -21.8 to
17.1 m]; P = 0.6); mean ± SD changes from baseline were 26.4 ± 45.7 versus 11.8 ±
57.4 m, respectively, in IPAH (65% of population) and -18.3 ± 82.0 versus 17.5 ± 59.1
m in APAH-CTD (35% of population).</p>
        <p>Argument 2=Changes in WHO functional class and Borg dyspnoea score and
incidence of clinical worsening did not differ.</p>
        <p>Argument 3=Headache, diarrhoea, and flushing were more common with sildenafil.
Argument 4=Sildenafil, in addition to stable ( ≥ 3 months) bosentan therapy, had no
benefit over placebo for 12-week change from baseline in 6MWD.
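The in-context-learning prompt above concatenates the task description, a block of worked examples (abstract, numbered arguments, gold JSON answer), and finally the unlabelled query. A minimal sketch of that assembly, with illustrative helper and parameter names of our own choosing:

```python
def build_icl_prompt(task_description, examples, query_abstract, query_arguments):
    """Concatenate task description, labelled examples, and query (illustrative)."""
    parts = [f"### Task description: {task_description}", "### Examples:"]
    # Each example carries its abstract, its arguments, and the gold JSON answer.
    for k, (abstract, arguments, answer_json) in enumerate(examples, start=1):
        parts.append(f"## Example {k}")
        parts.append(abstract)
        parts.extend(f"Argument {i}={a}" for i, a in enumerate(arguments, start=1))
        parts.append(answer_json)
    # The query repeats the abstract+arguments layout but omits the answer,
    # leaving the model to produce the JSON object itself.
    parts.append(query_abstract)
    parts.extend(f"Argument {i}={a}" for i, a in enumerate(query_arguments, start=1))
    return "\n".join(parts)
```

Because every example ends with its gold JSON object, the model sees the exact output format it must reproduce for the query.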
### You are an expert in medical analysis. You are given the abstract of a randomized
controlled trial which contains numbered argument components enclosed by
&lt;AC&gt;&lt;/AC&gt; tags. Your task is to classify each argument component in the abstract as
either "Claim" or "Premise". You must return a list of argument component types in the
following JSON format: "component_types": [component_type (str), component_type
(str), ..., component_type (str)]
### Here is the abstract text: An open, randomized study was performed to assess
the effects of supportive pamidronate treatment on morbidity from bone metastases
in breast cancer patients. Eighty-one pamidronate patients and 80 control patients
were monitored for a median of 18 and 21 months, respectively, for events of skeletal
morbidity and the radiologic course of metastatic bone disease. The oral pamidronate
dose was 600 mg/d (high dose [HD]) during the earliest study years, then changed
to 300 mg/d (low dose [LD]) because of gastrointestinal toxicity. Twenty-nine of
81 pamidronate (HD/LD) patients first received 600 mg/d and were then changed
to 300 mg/d; 52 of 81 pamidronate LD patients received 300 mg/d throughout the
study. Tumor treatment was unrestricted. An overall intent-to-treat analysis was
performed.&lt;AC&gt; In the pamidronate group, the occurrence of hypercalcemia, severe
bone pain, and symptomatic impending fractures decreased by 65%, 30%, and 50%,
respectively; event-rates of systemic treatment and radiotherapy decreased by
35% (P &lt; or = .02). &lt;/AC&gt;&lt;AC&gt; The event-free period (EFP), radiologic course of
disease, and survival did not improve. &lt;/AC&gt;&lt;AC&gt; Subgroup analyses suggested
a dose-dependent treatment effect. &lt;/AC&gt;&lt;AC&gt; Compared with their controls,
in pamidronate HD/LD patients, events occurred 60% to 90% less frequently (P
&lt; or = .03) and the EFP was prolonged (P = .002). &lt;/AC&gt;&lt;AC&gt; In pamidronate
LD patients, event-rates decreased by 15% to 45% (P &lt; or = .04). &lt;/AC&gt;&lt;AC&gt;
Gastrointestinal toxicity of pamidronate caused a 23% drop-out rate, &lt;/AC&gt;&lt;AC&gt; but
other cancer-associated factors seemed to contribute to this toxicity. &lt;/AC&gt;&lt;AC&gt;
Pamidronate treatment of breast cancer patients efficaciously reduced skeletal
morbidity. &lt;/AC&gt;&lt;AC&gt; The effect appeared to be dose-dependent. &lt;/AC&gt;&lt;AC&gt;
Further research on dose and mode of treatment is mandatory. &lt;/AC&gt;
{"component_types": ["Premise", "Premise", "Claim", "Premise", "Premise", "Premise",
"Claim", "Claim", "Claim", "Claim"]}</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. M. Palau, M.-F. Moens, Argumentation mining: The detection, classification and structure of arguments in text, in: Proceedings of ICAIL 2009, ICAIL '09, ACM, New York, NY, USA, 2009, pp. 98-107. URL: https://doi.org/10.1145/1568234.1568246. doi:10.1145/1568234.1568246.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] C. Stab, I. Gurevych, Parsing argumentation structures in persuasive essays, Computational Linguistics 43 (2017) 619-659. URL: https://aclanthology.org/J17-3005. doi:10.1162/COLI_a_00295.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] P. Potash, A. Romanov, A. Rumshisky, Here's my point: Joint pointer architecture for argument mining, in: M. P. et al. (Ed.), Proceedings of EMNLP 2017, ACL, 2017, pp. 1364-1373. URL: https://doi.org/10.18653/v1/d17-1143. doi:10.18653/V1/D17-1143.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] T. Kuribayashi, H. Ouchi, N. Inoue, P. Reisert, T. Miyoshi, J. Suzuki, K. Inui, An empirical study of span representations in argumentation structure parsing, in: A. K. et al. (Ed.), Proceedings of ACL 2019, ACL, Florence, Italy, 2019, pp. 4691-4698. URL: https://aclanthology.org/P19-1464. doi:10.18653/v1/P19-1464.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] U. Mushtaq, J. Cabessa, Argument classification with BERT plus contextual, structural and syntactic features as text, in: M. T. et al. (Ed.), Proceedings of ICONIP 2022, volume 1791 of CCIS, Springer, 2022, pp. 622-633. URL: https://doi.org/10.1007/978-981-99-1639-9_52. doi:10.1007/978-981-99-1639-9_52.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] U. Mushtaq, J. Cabessa, Argument mining with modular BERT and transfer learning, in: Proceedings of IJCNN 2023, IEEE, 2023, pp. 1-8. URL: https://doi.org/10.1109/IJCNN54540.2023.10191968. doi:10.1109/IJCNN54540.2023.10191968.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, J. Wen, A survey of large language models, CoRR abs/2303.18223 (2023). URL: https://doi.org/10.48550/arXiv.2303.18223. doi:10.48550/ARXIV.2303.18223. arXiv:2303.18223.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, Z. Sui, A survey on in-context learning, CoRR abs/2301.00234 (2023). URL: https://doi.org/10.48550/arXiv.2301.00234. doi:10.48550/ARXIV.2301.00234. arXiv:2301.00234.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Nori, et al., Can generalist foundation models outcompete special-purpose tuning? Case study in medicine, CoRR abs/2311.16452 (2023). URL: https://doi.org/10.48550/arXiv.2311.16452. doi:10.48550/ARXIV.2311.16452.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[29] T. Mayer, E. Cabrio, S. Villata, Transformer-based argument mining for healthcare applications, in: G. D. G. et al. (Ed.), Proceedings of ECAI 2020, volume 325 of FAIA, IOS Press, 2020, pp. 2108-2115. URL: https://doi.org/10.3233/FAIA200334. doi:10.3233/FAIA200334.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[30] B. Molinet, S. Marro, E. Cabrio, S. Villata, T. Mayer, ACTA 2.0: A modular architecture for multi-layer argumentative analysis of clinical trials, in: L. D. Raedt (Ed.), Proceedings of IJCAI-22, International Joint Conferences on Artificial Intelligence Organization, 2022, pp. 5940-5943. URL: https://doi.org/10.24963/ijcai.2022/859. doi:10.24963/ijcai.2022/859. Demo Track.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[31] T. Mayer, S. Marro, E. Cabrio, S. Villata, Enhancing evidence-based medicine with natural language argumentative analysis of clinical trials, Artificial Intelligence in Medicine 118 (2021) 102098. URL: https://www.sciencedirect.com/science/article/pii/S0933365721000919. doi:10.1016/j.artmed.2021.102098.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>van der Meer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reuver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Khurana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Santamaría</surname>
          </string-name>
          ,
          <article-title>Will it blend? mixing training paradigms &amp; prompting for argument quality prediction</article-title>
          , in: G.
          <string-name>
            <surname>Lapesa</surname>
          </string-name>
          , et al. (Eds.),
          <source>ArgMining@COLING</source>
          <year>2022</year>
          , ICCL,
          <year>2022</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>103</lpage>
          . URL: https://aclanthology.org/2022.argmining-1.8.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pojoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dumani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schenkel</surname>
          </string-name>
          ,
          <article-title>Argument mining from podcasts using ChatGPT</article-title>
          , in: L.
          <string-name>
            <surname>Malburg</surname>
          </string-name>
          , D. Verma (Eds.),
          <source>Proceedings of ICCBR-WS</source>
          <year>2023</year>
          , volume
          <volume>3438</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2023</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>144</lpage>
          . URL: https://ceur-ws.org/Vol-3438/paper_10.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Al Zubaer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mitrović</surname>
          </string-name>
          ,
          <article-title>Performance analysis of large language models in the domain of legal argument mining</article-title>
          ,
          <source>Frontiers in Artificial Intelligence</source>
          <volume>6</volume>
          (
          <year>2023</year>
          ). URL: https://www.frontiersin.org/articles/10.3389/frai.2023.1278796. doi:10.3389/frai.2023.1278796.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Batista-Navarro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <article-title>Argument mining as a multi-hop generative machine reading comprehension task</article-title>
          , in:
          <source>The 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=KTFxOnrbvu.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>LlamaFactory: Unified efficient fine-tuning of 100+ language models</article-title>
          , in:
          <source>Proceedings of the 62nd Annual Meeting of the ACL (Volume 3: System Demonstrations)</source>
          , ACL, Bangkok, Thailand,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2403.13372.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>