Task-Incremental Learning on Long Text Sequences

Natalia Graziuso¹, Andrea Zugarini²,* and Stefano Melacci¹
¹ Department of Information Engineering and Mathematics, University of Siena, Italy
² expert.ai, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
natalia.graziuso@student.unisi.it (N. Graziuso); azugarini@expert.ai (A. Zugarini); stefano.melacci@unisi.it (S. Melacci)

Abstract
The extraordinary results achieved by Large Language Models are paired with issues that are critical in real-world applications. The costs of inference and, in particular, training are extremely large, both in terms of time and computational resources, and they become prohibitive when working in dynamic environments, where data and tasks are progressively provided over time. The model must be able to adapt to new knowledge, new domains, and new settings, without forgetting the previously learned skills. Retraining from scratch easily becomes too costly, thus Continual Learning strategies are of crucial importance. This is even more evident when data consist of "long" documents that require several resources to be processed by modern neural models, leading to very long prompts. This paper investigates LLM-based Task-Incremental Learning in the case of tasks exploiting long sequences of text, as is typical in summarization, question-answering on long documents, reviewing long contracts, and several others. We show how adapting the model by Task Arithmetic with LoRA, which was proposed for visual data, yields promising results also in the case of such "long" text data. To the best of our knowledge, this is the first work along this challenging direction. The outcome of the investigation of this paper is generic enough to represent an important starting point for further research in processing linguistic data in every language.

Keywords
Continual Learning, Task-Incremental Learning, Long Sequences of Text, Large Language Models

1. Introduction

The quality of Language Models (LMs) has rapidly improved in the last decade, showing outstanding skills when scaled to large data and networks [1], leading to the nowadays popular Large Language Models (LLMs). Solving more complex tasks with LLMs often requires processing "long" documents and articulated, long instructions. However, handling lengthy prompts can be a significant obstacle for real-world applications, raising the costs and resources required during both inference and, in particular, training. This issue can become critical when the LLM needs to be specialized to many different tasks and domains and, more generally, when it is applied to dynamic settings that require multiple adaptations. For instance, in real-world applications, models need to be re-trained from time to time, as new data/tasks become available. In such scenarios, the need for Continual Learning (CL) [2, 3] strategies becomes imperative. From a very generic perspective, CL focuses on the development of algorithms capable of sequentially learning from a stream of data, while preserving what was learnt in past experiences and avoiding catastrophic forgetting [4].

In this work, motivated by the aforementioned issues, we study the problem of Continual Learning from "long" sequences of text, exploiting LLMs. We investigate several strategies based on LoRA [5] to adapt an LLM to multiple tasks that are sequentially proposed over time. In particular, we first follow the route of training a single adapter in a sequential manner, then we explore Task Arithmetic to fuse multiple adapters trained independently [6]. We consider the possibility of assigning different weights to each task, and we shed some light on the factors that contribute the most to catastrophic forgetting and to effective task adaptation. The outcomes of such an investigation reveal that: (1) there is limited sensitivity to task order, i.e., regardless of the sequence in which tasks are presented, the overall average performance remains relatively stable, a property that, to the best of our knowledge, was never evaluated in the
case of tasks composed of long documents; (2) despite its simplicity, Task Arithmetic demonstrates effectiveness in addressing forgetting phenomena when learning from long texts, strongly reducing the gap from multiple models independently adapted to the task data. Moreover, (3) we are the first to evaluate a recently proposed benchmark (SCROLLS [7]) in a CL setting, offering reference results for further activity in processing long sequences of text. We remark that while our experiments are based on data in the English language, the generic issues we explore about handling long sequences of text are intrinsically shared by every language.

2. Related Work

In the last few years, a variety of approaches were proposed by the scientific community in the context of CL (see [3] and references therein). The main goal is that of learning from newly provided information, with models that are capable of acquiring new knowledge without forgetting the previously learned one and, more importantly, without storing the full dataset and retraining from scratch every time [8]. Several efforts are dedicated to the case of lifelong Reinforcement Learning [9] and of Supervised Learning [10], distinguishing among scenarios and categories of approaches [11], ranging from parameter isolation to regularization methods and replays [12]. Unsupervised or Self-Supervised Learning approaches are also becoming popular [13, 14, 15], as is the case of adaptation of pre-trained backbones [16].

Of course, neural models for processing language are also a subject of study in the context of CL [17]. We mention the case of language modeling in LAMOL [18], which is trained to concurrently solve a task and mimic training examples, thereby preserving the distribution of previous tasks. Sun et al. [12] introduce Distill and Replay, which learns to solve the task, to generate training examples formatted as context-question-answer, and to distill knowledge from a model trained on the previous task(s). Differently, Reasoning-augmented Continual Learning [19] focuses on creating reasoning pathways to preserve and improve LLMs' reasoning abilities and information transfer.

Together with works that learn new models from scratch, several approaches devise fine-tuning strategies for pre-trained Transformers in language processing, which turn out to be efficiently adaptable to a downstream task by learning only a small number of task-specific parameters. It is the case of models that tune the input prompt [20] or of generic Adapters [21], such as the popular LoRA [5], which introduces new weight matrices, parametrized by the product of low-rank ones. Evaluating these models with long contexts [22] is not frequent in the scientific literature, especially in the case in which multiple fine-tunings are sequentially applied, as is typical of CL, which is the main focus of this paper. In particular, LoRA and Task Arithmetic [23] have been jointly studied to handle CL problems in vision [6], which is what this paper extends to the case of language and long sequences. We also mention works that focus on instruction-based models for CL, such as ConTinTin [24], where each task is modelled by a specific instruction that directly defines the target concept, along with a few instances that illustrate it. Scialom et al. [25] and Luo et al. [4] investigate natural language instructions paired with memory buffers and replays.

3. Task-Incremental Learning on Long Sequences of Text

Task-Incremental Learning (TIL) is a continual learning scenario where the same model is trained on tasks that are presented in a sequential manner. The main challenge consists in profitably learning from the last-presented task without forgetting the previous ones [3]. In order to cope with TIL on Long Sequences of Text, specifically focusing on LLMs, we consider different learning strategies. In this section we describe each of them in detail, after having formally introduced the TIL problem.

Problem. We are given a model parameterized by $\theta$, which is a vector collecting the learnable variables. In TIL, a set $\mathcal{T}$ of $k$ tasks is sequentially presented to the model, i.e., one at a time. Each task $t \in \mathcal{T}$ features data sampled from a task-specific distribution, collected into a dataset $\mathcal{D}_t := (\mathcal{X}_t, \mathcal{Y}_t)$, composed of raw samples and labeling information, respectively. The model is not only expected to learn from $\mathcal{D}_t$, but also to not forget knowledge already acquired from the past tasks. In the following, to keep the notation simple, we indicate each task by a numerical index, thus $t \in \mathcal{T} = \{1, \ldots, k\}$. In this case of study, the model is a pre-trained LLM with billions of parameters, and all the TIL tasks are characterized by long input sequences. Such a combination constitutes a computationally demanding mix, making offline/joint training potentially very expensive, which is where CL solutions are very convenient. We consider the case in which LLMs are fine-tuned exploiting adapters [26]. In particular, we focus on LoRA [5], which introduces additional learnable parameters while keeping the rest of the network frozen. This is both less resource demanding and it also alleviates catastrophic forgetting, since the LoRA weights $\theta^l$ are usually a small fraction of the total model parameters, i.e., $|\theta^l| \ll |\theta|$. Hence, it is a perfect candidate for the experience of this paper.
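To make the parameter saving concrete, the following PyTorch sketch (our own illustration, not the code used in the paper) implements a minimal LoRA-style linear layer: the pre-trained weight is frozen and only the low-rank factors A and B are trained, so the trainable fraction is tiny.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update, in the spirit of LoRA [5]."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                         # theta stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # zero init: no drift at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scaling * B A x: only A and B (i.e., theta_l) receive gradients.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
n_lora = layer.A.numel() + layer.B.numel()
n_base = layer.base.weight.numel()
print(f"|theta_l| / |theta| = {n_lora / n_base:.2%}")  # ~0.39%: |theta_l| << |theta|
```

For a 4096x4096 weight matrix and rank 8, the adapter adds only 2 x 8 x 4096 parameters, which is where the resource savings come from.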
Single-model TIL with LoRA (S-TIL). In the straightforward implementation of a TIL problem, tasks are presented to the model sequentially, starting from the first one up to the $k$-th one. The order may be given a priori, or established according to some criteria, such as task similarity or difficulty (curriculum-like learning [27]). At the beginning, when considering the first task, $t = 1$, we start from a model with frozen parameters $\theta$ and additional trainable weights $\theta^l_1$ initialized as described in [5]. At task $t$, with $t > 1$ instead, the LoRA weights are initialized with the LoRA parameters from the previous step, i.e., $\theta^l_{t-1}$. It is worth noticing that, in such a way, at the end of the $k$ tasks, the final model parameters will be constituted by the original $\theta$, still unchanged, and a single set of adapter parameters $\theta^l_k$, which was sequentially trained over all the tasks.
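A toy sketch of the S-TIL loop follows; synthetic regression tasks stand in for the SCROLLS tasks and all hyper-parameters are placeholders of ours. The key point is that a single adapter ($A$, $B$) is carried across tasks, i.e., $\theta^l_t$ starts from $\theta^l_{t-1}$, while the frozen $\theta$ never changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, rank, k = 32, 4, 3

base = nn.Linear(d, d)                  # stands in for the frozen pre-trained model theta
for p in base.parameters():
    p.requires_grad_(False)

A = nn.Parameter(torch.randn(rank, d) * 0.01)  # theta_l: the single LoRA adapter...
B = nn.Parameter(torch.zeros(d, rank))         # ...carried across all k tasks

def forward(x: torch.Tensor) -> torch.Tensor:
    return base(x) + x @ A.T @ B.T      # frozen path + low-rank update

for t in range(1, k + 1):
    # Placeholder task t: random linear regression standing in for a long-text task.
    g = torch.Generator().manual_seed(t)
    w = torch.randn(d, d, generator=g)
    x = torch.randn(256, d, generator=g)
    y = x @ w.T
    # S-TIL: theta_l_t is initialized with theta_l_{t-1} simply by NOT resetting A, B.
    opt = torch.optim.AdamW([A, B], lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(forward(x), y)
        loss.backward()
        opt.step()
    print(f"task {t}: final loss {loss.item():.3f}")
```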
Multi-model TIL with LoRA (M-TIL). Another way to face the problem of learning multiple tasks in TIL is to build a specialized model per task, independently of the other ones. This usually yields strong performance on each sub-problem, guaranteeing no catastrophic forgetting issues, since the model to use is simply retrieved as a function of the task to solve. At the same time, such a strategy requires the storage, deployment and maintenance of $k$ independent models, which is unsustainable with billion-sized models like current LLMs. Even when using adapters such as LoRA, maintaining many of them can still be hard to handle.

Task Arithmetic TIL with LoRA (TA). Based on the concept of "task vectors", Task Arithmetic (TA) [23] was proposed to combine together the weights learned in a multi-model continual learning scenario. A task vector represents the direction in the weight space of a pre-trained model toward a certain task. In TA, multiple directions are fused together via a simple linear combination of them. Similarly, LoRA adapters steer the model behavior to improve performance on a specific task. Therefore, LoRA weights trained separately (multi-model) can be updated with task arithmetic [6]:

$$\theta^l_{\mathrm{final}} = \sum_{t \in \mathcal{T}} \lambda_t\, \theta^l_t, \qquad (1)$$

where $\lambda_t$ is a scalar weighting the importance of task $t$.

Fine-tuning by Memory Buffer (FTB). In principle, TA can be applied as it is, without requiring further fine-tuning. However, we also consider refining the parameters using a memory buffer with examples from all the tasks. Indeed, experience replay is a well-known and effective strategy in Reinforcement Learning and Continual Learning problems. Examples were chosen randomly, evenly distributed across the given tasks. Since we are dealing with long documents, we keep the buffer small. Both the merge of Eq. (1) and the buffer construction are illustrated in the sketch below.
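The two operations admit a very compact implementation. The following is a minimal sketch under our own assumptions that each adapter is stored as a PyTorch state dict and each task dataset is a Python list; the paper does not report these implementation details.

```python
import random
import torch

def merge_adapters(adapters: list[dict], lambdas: list[float]) -> dict:
    """Eq. (1): theta_l_final = sum_t lambda_t * theta_l_t, applied key by key."""
    merged = {}
    for key in adapters[0]:
        merged[key] = sum(lam * sd[key] for lam, sd in zip(lambdas, adapters))
    return merged

def build_memory_buffer(datasets: list[list], per_task: int, seed: int = 0) -> list:
    """FTB: a small buffer with the same number of randomly drawn examples per task."""
    rng = random.Random(seed)
    buffer = []
    for data in datasets:
        buffer.extend(rng.sample(data, min(per_task, len(data))))
    rng.shuffle(buffer)
    return buffer

# Toy usage: three per-task adapters with one weight tensor each, merged evenly.
adapters = [{"lora.A": torch.full((2, 2), float(t))} for t in range(1, 4)]
merged = merge_adapters(adapters, lambdas=[1 / 3] * 3)
print(merged["lora.A"])  # every entry equals (1 + 2 + 3) / 3 = 2.0
```

The merged adapter can then be fine-tuned on the (small) buffer, which is the FTB refinement evaluated in Section 4.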
4. Experiments

We experimented with LLMs in TIL, exploiting sequences of long texts from a benchmark made public to the scientific community in the last few years [7]. Notice that these benchmarks are not designed for TIL. Thus, using them in TIL is indeed a novel experience, off the beaten track.

4.1. Datasets

We consider five out of the seven datasets of SCROLLS [7], which is the reference benchmark for tasks composed of long documents. The datasets belong to different domains, and they are about different tasks, which we adapted to TIL by means of instruction tuning (an illustrative prompt format is sketched after the dataset descriptions). An overview of the benchmark is provided in Table 1, and here we briefly describe each dataset.

Table 1
Selected datasets from the SCROLLS benchmark and their main features.

Dataset        Task                        Domain            Metric    #Train   #Validation
Contract NLI   Natural Language Inference  Legal             EM          7191          1097
Qasper         QA                          Science           F1          2567          1726
QuALITY        Multi Choice QA             Literature, Misc  EM          2523          2086
QMSum          Query-based Summarization   Meetings          ROUGE-L     1257           272
SummScreenFD   Summarization               TV                ROUGE-L     3673           338

Qasper. Qasper [28] (QSPR) is a Question Answering (QA) dataset on academic papers. Crafted by NLP experts, it contains questions based on the title and abstract of the paper. There are different kinds of inquiries: abstractive, extractive, and yes/no questions, including unanswerable ones. To answer the question, the entire paper must be read.

QuALITY. QuALITY [29] (QALT) is a multiple-choice QA dataset, drawing upon English source articles with an average length of about 5,000 tokens. Original texts are provided in HTML format, retaining paragraph breaks and basic formatting such as italics, but with images removed. Questions are designed to require details from different parts of the text to properly answer them.

QMSum. QMSum, presented in [30], is a query-based document summarization benchmark. The dataset is characterized by long meeting transcripts, collecting 1,808 query-summary pairs from 232 different meetings.

ContractNLI. Contract NLI [31] (CNLI) is the first dataset for Natural Language Inference in contracts. Given a premise and a contract, a model has to classify whether the premise is entailed by, contradicting, or not mentioned by the contract. There are 607 contracts and 17 unique hypotheses, combined to get 10,319 examples.

SummScreenFD. SummScreen [32] (SumScr) is a summarization dataset of TV series transcripts and human-written recaps. Examples come from two different sources, but in SCROLLS the authors only kept ForeverDreaming (FD), due to its greater variety of shows.
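The paper states that each dataset is cast to TIL via instruction tuning, but it does not report the exact templates; the following formatting function and template are purely illustrative assumptions of ours.

```python
from typing import Optional

def format_example(instruction: str, document: str, query: Optional[str] = None) -> str:
    """Hypothetical instruction-tuning prompt: a task-specific instruction,
    the (long) document, an optional query, and the answer slot to be generated."""
    parts = [f"### Instruction:\n{instruction}", f"### Document:\n{document}"]
    if query is not None:
        parts.append(f"### Question:\n{query}")
    parts.append("### Answer:\n")
    return "\n\n".join(parts)

# Example of a QMSum-like input (placeholders, not actual dataset content).
prompt = format_example(
    instruction="Summarize the part of the meeting relevant to the question.",
    document="<long meeting transcript>",
    query="What did the group decide about the remote control design?",
)
```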
4.2. Experimental Setup and Results

We consider Mistral-7B-v0.1 [33] as the backbone LLM for all the fine-tuned models in our TIL experiments. Albeit trained with a restricted context length of at most 8,192 tokens, it supports longer inputs, of size up to 32,768. The LLM was quantized via 4-bit quantization in order to fit long sequences on a single A6000 GPU. During training, the micro batch size was set to 1, with 32 gradient accumulation steps. LoRA adapters were updated with AdamW for 3 epochs in all the experiments, regardless of the dataset. At inference time, outputs were generated using Beam Search with beam size set to 2.
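A configuration sketch of this setup follows, assuming the Hugging Face transformers/peft stack; the paper does not name its training framework, and the choice of target modules below is our assumption.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization so that long sequences fit on a single A6000 GPU.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config, device_map="auto"
)

# LoRA hyper-parameters as reported later in this section (rank 8, alpha 16,
# dropout 0.05); the target modules are our assumption, not stated in the paper.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Inference with Beam Search, beam size 2 (standard generation arguments):
# outputs = model.generate(**inputs, num_beams=2, max_new_tokens=256)
```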
We compared: (i) Mistral-7B-v0.1-Instruct, the instruction-tuned version of Mistral, referred to as Mistral-7b-instruct; (ii) the case of multiple independent LoRA adapters, each of them trained on a single dataset, i.e., M-TIL (Section 3); (iii) classic TIL with a single model, progressively updated on the sequence of tasks, i.e., S-TIL (Section 3), considering both the case in which tasks are provided in a certain order (S-TIL↓) and the one in which they are provided in the opposite order (S-TIL↑); (iv) Task Arithmetic (Section 3) with even values of the $\lambda$'s (TA) or with task-specific $\lambda$'s based on prior knowledge (WTA).

Evaluation. Due to the different nature of each task in SCROLLS, there are different metrics to take into account for each of them. In particular, summarization-like tasks (QMSum and SummScreenFD) are evaluated with the ROUGE score [34] (1, 2 and L), whereas ContractNLI and QuALITY are assessed with Exact Match (EM). Finally, results on Qasper are measured by F1. A global overview of the metrics can be found in Table 1. We indicate with $S_i$ the score yielded by the associated metric for task $i$. Following the way the SCROLLS benchmark was proposed, scores are averaged to provide a unique index of Overall Performance ($OP$). Since we focus on TIL, we evaluate $OP$ after each task $t$, and we also compute the Overall Forgetting at task $t$ ($OF_t$), also known as the index of negative backward transfer [35], which tells how strongly the previously considered tasks have been negatively affected by learning from the current task $t$, i.e., a measure of catastrophic forgetting [4]. Formally,

$$OP_t = \frac{1}{t} \sum_{i=1}^{t} S_{t,i}, \qquad OF_t = \left[ \frac{1}{t-1} \sum_{i=1}^{t-1} \left( S_{i,i} - S_{t,i} \right) \right]_{+},$$

where $[\cdot]_+$ keeps the positive part, and $S_{t,i}$ is the score of task $i$ after having learned from task $t \in \mathcal{T}$.
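As a self-contained sketch, both indices can be computed from the (lower-triangular) matrix of scores $S_{t,i}$. The numbers below are the S-TIL↓ scores of Table 2 (right), and the printed values match the reported ones up to the rounding of the table entries.

```python
def overall_performance(S: list[list[float]], t: int) -> float:
    """OP_t: average score over tasks 1..t after having learned task t (1-indexed)."""
    return sum(S[t - 1][i] for i in range(t)) / t

def overall_forgetting(S: list[list[float]], t: int) -> float:
    """OF_t: positive part of the average drop on tasks 1..t-1 after learning task t."""
    drop = sum(S[i][i] - S[t - 1][i] for i in range(t - 1)) / (t - 1)
    return max(drop, 0.0)

# S[t-1][i-1] = score of task i after learning task t (S-TIL-down, Table 2, right).
S = [
    [18.2],
    [16.1, 22.2],
    [0.04, 0.45, 37.4],
    [13.6, 13.3, 35.8, 47.7],
    [11.8, 7.0, 32.0, 44.2, 88.2],
]
print(f"OF_3 = {overall_forgetting(S, 3):.2f}")   # ~19.95 (Table 2 reports 19.94)
print(f"OF_4 = {overall_forgetting(S, 4):.2f}")   # ~5.03  (Table 2 reports 5.00)
print(f"OP_5 = {overall_performance(S, 5):.1f}")  # ~36.6  (Table 3 reports 36.7)
```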
Since the test set of SCROLLS is not public, we used the SCROLLS validation set as test set, and sampled a sub-portion of the training data to build a validation set. After cross-validation, we set the rank of LoRA to 8, the dropout rate to 0.05, and $\alpha$ to 16 (see [5] for a description of these parameters), with learning rate $3 \cdot 10^{-4}$ (linearly decaying).

Investigating S-TIL. Dealing with long sequences of text might affect the TIL procedure as a function of the order in which tasks are presented. We study different task orderings based on the average length of the sequences of text in each task, from tasks involving shorter output sequences to the ones involving longer sequences, and vice-versa. As anticipated, we named them S-TIL↑ and S-TIL↓, respectively. The results of this experience are presented in detail in Table 2. The training order does strongly affect the final performance on single tasks, promoting higher scores on more recently seen datasets. On the one hand, this is expected, since the older ones are more likely affected by catastrophic forgetting. Catastrophic forgetting (last columns of Table 2) at $t = k = 5$ is below 10% in both cases. On the other hand, there is an evident peak of forgetting in S-TIL↓ at $t = 3$, which is then reduced when learning from the following tasks. The peak is due to a strong reduction of performance on the first two tasks after having learned from Qasper (QSPR). We investigated this aspect, and found that the model fails in generating the perfectly-formatted output string that is then exploited in the EM metric. When moving to the following task, this skill is partially recovered. We hypothesize that the presence of unanswerable questions in Qasper negatively biases the types of answers in SummScreenFD (SumScr) and QMSum, where all the questions have an answer instead.

Table 2
Evaluation score (%) on test data, for each task, after having learned from task t (i.e., S_{t,i}) in S-TIL↑ (left) and S-TIL↓ (right). The order of the columns (dataset names) reflects the task order followed during training. Tasks become available in order, thus "−" indicates that the value cannot be computed yet. The OF_t column is about catastrophic forgetting (the lower the better).

S-TIL↑
t↓  1.CNLI  2.QALT  3.QSPR  4.QMSum  5.SumScr   OF_t
1     88.0       −       −        −         −      −
2     85.7    49.5       −        −         −   2.31
3     79.7    43.2    37.1        −         −   7.31
4     82.9    40.7    27.6     21.9         −   7.82
5     75.7    39.1    30.2     15.5      18.6   8.99

S-TIL↓
t↓  1.SumScr  2.QMSum  3.QSPR  4.QALT  5.CNLI    OF_t
1       18.2        −       −       −       −       −
2       16.1     22.2       −       −       −    2.06
3       0.04     0.45    37.4       −       −   19.94
4       13.6     13.3    35.8    47.7       −    5.00
5       11.8      7.0    32.0    44.2    88.2    7.60

Comparing S-TIL and M-TIL. Figure 1 compares the models of Table 2 (for $t = k$) with M-TIL, which is composed of multiple adapters, each of them specifically trained on a task, and thus forgetting-free. The performance of both S-TIL variants is lower than that of M-TIL, as expected, but sometimes not far from it. Comparing S-TIL↑ and S-TIL↓, we see that they achieve similar overall performance, but the latter yields better results in three out of five tasks. The quality of S-TIL↑ (w.r.t. S-TIL↓) improves going right-to-left and, symmetrically, the one of S-TIL↓ increases going left-to-right, as expected, since they were trained in opposite orders (the relative gain is > 1 in SumScr due to forward transfer).

[Figure 1: Test results in TIL: overall performance at t = k = 5, i.e., OP_k, on SumScr, QMSum, QSPR, QALT, CNLI. We compare the cases of S-TIL↑ and S-TIL↓ (see Table 2) with the one of multiple independently-trained adapters, i.e., M-TIL. The Relative Gain is indicated on the bars.]

The Role of TA. We compared all the introduced models with the case of merging independently-trained adapters with TA. Table 3 shows that TA turns out to be a simple yet competitive solution, with average performance on par with S-TIL↓. Actually, observing task-wise performance, we can see how TA outperforms S-TIL↓ across all the datasets, with the exception of ContractNLI (CNLI), the last task in which S-TIL↓ was specialized. In WTA, the $\lambda$'s for non-QA datasets were halved, since these tasks involve the generation of longer outputs that more strongly condition the behaviour of the LLM, as already discussed for Qasper. WTA yielded evident improvements on the last two datasets, while keeping similar performance on the others, despite their halved weights. This suggests that appropriately weighing the task vectors in Eq. (1) is a viable road to improve the model.

Table 3
Results involving all the competitors. In ROUGE-based evaluations, we also report unigram overlap (ROUGE-1) and bigram overlap (ROUGE-2), together with the longest overlapping subsequence (ROUGE-L); the last one is what is considered when computing OP_k. Reference results (baseline and "upper bound") are in italic.

Method                      SumScr          QMSum           QSPR   QALT   CNLI   OP_k
                            ROUGE-1/2/L     ROUGE-1/2/L     F1     EM     EM
Ref1: Mistral-7b-instruct   18.1/2.3/10.8   16.2/2.7/11.8    5.4    0.0    0.0    5.6
Ref2: M-TIL                 29.2/7.1/18.2   29.6/8.5/21.1   38.7   56.7   88.0   44.5
S-TIL↑                      30.0/7.8/18.6   20.6/5.7/15.5   30.2   39.1   75.7   35.8
S-TIL↓                      15.6/3.6/11.8    8.7/2.3/7.0    32.0   44.2   88.2   36.7
TA                          20.7/4.56/13.9  18.8/5.6/14.2   36.0   45.6   72.6   36.5
WTA                         19.4/4.26/13.4  18.5/5.5/14.1   34.7   47.9   74.7   36.9
TA-FTB                      28.6/6.21/17.5  28.0/8.1/20.1   38.3   47.8   75.1   39.8
WTA-FTB                     28.6/6.09/17.2  26.9/7.6/19.7   35.6   50.5   78.5   40.3

Impact of FTB. We also investigated the impact of refreshing the memory of the TA/WTA model via fine-tuning it on just 50 samples per task (memory buffer). Despite being a simple refinement stage, the results presented in Table 3 show a consistent boost of performance when using the memory buffer (FTB), reaching an averaged score of about 39.0 when using the weighted TA version, significantly reducing the gap from the $k$-independent-adapters solution of M-TIL. Figure 2 provides a quick view of the already presented results of all the TA methods we considered, reporting also the Relative Gain w.r.t. M-TIL. Indeed, we can observe that the relative drop in performance is always below 11%.

[Figure 2: Test results in TIL with Task Arithmetic (TA). TA is explored with or without Fine-tuning by Memory Buffer (FTB), and also in the case of task-specific weights provided in advance (WTA). Same setting as Figure 1. (Bar chart comparing TA, WTA, TA-FTB and WTA-FTB against M-TIL on SumScr, QMSum, QSPR, QALT, CNLI.)]

5. Conclusions

We investigated Large Language Models in progressively learning from tasks involving long sequences of text. A pre-trained model was paired with one or more adapters (LoRA), and we analyzed the role of Task Arithmetic, showing that it yields performance that is not far from that of multiple models independently trained to solve each task. Our results suggest a viable road to mitigate the need for large computational resources when learning from tasks based on "long" documents. While we exploited data in the English language, the experiences of this paper can be interpreted as generic attempts to leverage long sequences in Continual Learning, in a sense going beyond the language barrier. Future work will consider schemes to automatically tune the Task Arithmetic combination [36].

Acknowledgments

The work was partially funded by:
• "ReSpiRA - REplicabilità, SPIegabilità e Ragionamento", a project financed by FAIR, affiliated to spoke no. 2, falling within the PNRR MUR programme, Mission 4, Component 2, Investment 1.3, D.D. No. 341 of 03/15/2022, Project PE0000013, CUP B43D22000900004 (https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/B43D22000900004);
• "enRichMyData - Enabling Data Enrichment Pipelines for AI-driven Business Products and Services", a Horizon Europe (HE) project, grant agreement ID: 101070284 (https://doi.org/10.3030/101070284).

References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[2] R. Hadsell, D. Rao, A. A. Rusu, R. Pascanu, Embracing change: Continual learning in deep neural networks, Trends in Cognitive Sciences 24 (2020) 1028–1040.
[3] L. Wang, X. Zhang, H. Su, J. Zhu, A comprehensive survey of continual learning: Theory, method and application, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2024) 5362–5383. doi:10.1109/TPAMI.2024.3367329.
[4] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, Y. Zhang, An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2023. arXiv:2308.08747v2 [cs.CL].
[5] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[6] R. Chitale, A. Vaidya, A. M. Kane, A. Ghotkar, Task arithmetic with LoRA for continual learning, in: Workshop on Advancing Neural Network Training at the 37th Conference on Neural Information Processing Systems (WANT@NeurIPS 2023), 2023.
[7] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, O. Levy, SCROLLS: Standardized comparison over long language sequences, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 12007–12021.
[8] M. Gori, S. Melacci, Collectionless artificial intelligence, arXiv preprint arXiv:2309.06938 (2023).
[9] K. Khetarpal, M. Riemer, I. Rish, D. Precup, Towards continual reinforcement learning: A review and perspectives, Journal of Artificial Intelligence Research 75 (2022) 1401–1476.
[10] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, T. Tuytelaars, A continual learning survey: Defying forgetting in classification tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 3366–3385.
[11] G. M. van de Ven, A. S. Tolias, Three continual learning scenarios, in: NeurIPS Continual Learning Workshop, volume 1, 2018.
[12] J. Sun, S. Wang, J. Zhang, C. Zong, Distill and replay for continual language learning, in: Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 2020, pp. 3569–3579.
[13] S. Marullo, M. Tiezzi, A. Betti, L. Faggi, E. Meloni, S. Melacci, Continual unsupervised learning for optical flow estimation with deep networks, in: Conference on Lifelong Learning Agents, PMLR, 2022, pp. 183–200.
[14] S. Paul, L.-J. Frey, R. Kamath, K. Kersting, M. Mundt, Masked autoencoders are efficient continual federated learners, arXiv preprint arXiv:2306.03542 (2023).
[15] M. Tiezzi, S. Marullo, L. Faggi, E. Meloni, A. Betti, S. Melacci, Stochastic coherence over attention trajectory for continuous learning in video streams, in: L. D. Raedt (Ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 2022, pp. 3480–3486. doi:10.24963/ijcai.2022/483.
[16] S. Marullo, M. Tiezzi, M. Gori, S. Melacci, T. Tuytelaars, Continual learning with pretrained backbones by tuning in the input space, in: 2023 International Joint Conference on Neural Networks (IJCNN), IEEE, 2023, pp. 1–9.
[17] T. Wu, L. Luo, Y.-F. Li, S. Pan, T.-T. Vu, G. Haffari, Continual learning for large language models: A survey, arXiv preprint arXiv:2402.01364 (2024).
[18] F.-K. Sun, C.-H. Ho, H.-Y. Lee, LAMOL: Language modeling for lifelong language learning, arXiv preprint arXiv:1909.03329 (2019).
[19] X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, X. Huang, TRACE: A comprehensive benchmark for continual learning in large language models, 2023. arXiv:2310.06762v1.
[20] Q. Zhu, B. Li, F. Mi, X. Zhu, M. Huang, Continual prompt tuning for dialog state tracking, 2022. arXiv:2203.06654.
[21] R. He, L. Liu, H. Ye, Q. Tan, B. Ding, L. Cheng, J.-W. Low, L. Bing, L. Si, On the effectiveness of adapter-based tuning for pretrained language model adaptation, arXiv preprint arXiv:2106.03164 (2021).
[22] Y. Chen, S. Qian, Z. Liu, H. Tang, X. Lai, S. Han, J. Jia, LongLoRA: Efficient fine-tuning of long-context large language models, 2023. arXiv:2309.12307v2.
[23] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, A. Farhadi, Editing models with task arithmetic, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
[24] W. Yin, J. Li, C. Xiong, ConTinTin: Continual learning from task instructions, arXiv preprint arXiv:2203.08512 (2022).
[25] T. Scialom, T. Chakrabarty, S. Muresan, Fine-tuned language models are continual learners, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 6107–6122.
[26] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for NLP, in: International Conference on Machine Learning, PMLR, 2019, pp. 2790–2799.
[27] X. Wang, Y. Chen, W. Zhu, A survey on curriculum learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 4555–4576.
[28] P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, M. Gardner, A dataset of information-seeking questions and answers anchored in research papers, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4599–4610.
[29] R. Y. Pang, A. Parrish, N. Joshi, N. Nangia, J. Phang, A. Chen, V. Padmakumar, J. Ma, J. Thompson, H. He, et al., QuALITY: Question answering with long input texts, yes!, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022.
[30] M. Zhong, D. Yin, T. Yu, A. Zaidi, M. Mutuma, R. Jha, A. H. Awadallah, A. Celikyilmaz, Y. Liu, X. Qiu, D. Radev, QMSum: A new benchmark for query-based multi-domain meeting summarization, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 5905–5921.
[31] Y. Koreeda, C. D. Manning, ContractNLI: A dataset for document-level natural language inference for contracts, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 1907–1919.
[32] M. Chen, Z. Chu, S. Wiseman, K. Gimpel, SummScreen: A dataset for abstractive screenplay summarization, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8602–8615.
[33] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825. URL: https://arxiv.org/abs/2310.06825.
[34] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, 2004, pp. 74–81.
[35] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, Advances in Neural Information Processing Systems 30 (2017).
[36] M. Tiezzi, S. Marullo, F. Becattini, S. Melacci, Continual neural computation, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2024, pp. 340–356.