<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Task-Incremental Learning on Long Text Sequences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Natalia Graziuso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Zugarini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Melacci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering and Mathematics, University of Siena</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>expert.ai</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The extraordinary results achieved by Large Language Models are paired with issues that are critical in real-world applications. The costs of inference and, in particular, training are extremely large, both in terms of time and computational resources, and they become prohibitive when working in dynamic environments, where data and tasks are progressively provided over time. The model must be able to adapt to new knowledge, new domains and new settings, without forgetting the previously learned skills. Retraining from scratch easily becomes too costly, thus Continual Learning strategies are of crucial importance. This is even more evident when data consist of “long” documents, which require significant resources to be processed by modern neural models, leading to very long prompts. This paper investigates LLM-based Task-Incremental Learning in the case of tasks exploiting long sequences of text, as is typical in summarization, question-answering on long documents, reviewing long contracts, and several other applications. We show how adapting the model by Task Arithmetic with LoRA, which was proposed for visual data, yields promising results also in the case of such “long” text data. To the best of our knowledge, this is the first work along this challenging direction. The outcome of the investigation of this paper is generic enough to represent an important starting point for further research in processing linguistic data in every language.</p>
      </abstract>
      <kwd-group>
        <kwd>Continual Learning</kwd>
        <kwd>Task-Incremental Learning</kwd>
        <kwd>Long Sequences of Text</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The quality of Language Models (LMs) has been rapidly improving in the last decade, showing outstanding skills when scaled to large data and networks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], leading to the nowadays popular Large Language Models (LLMs). Solving more complex tasks with LLMs often requires processing “long” documents and articulated long instructions. However, handling lengthy prompts can be a significant obstacle for real-world applications, raising the costs and resources required during both inference and, in particular, training. This issue can become critical when the LLM needs to be specialized to many different tasks, domains, and, more generally, when it is applied to dynamic settings that require multiple adaptations. For instance, in real-world applications, models need to be re-trained from time to time, as new data/tasks become available. In such scenarios, the need for Continual Learning (CL) [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] strategies becomes imperative. From a very generic perspective, CL focuses on the development of algorithms capable of sequentially learning from a stream of data, while preserving what was learnt in past experiences and avoiding catastrophic forgetting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this work, motivated by the aforementioned issues, we study the problem of Continual Learning from “long” sequences of text, exploiting LLMs. We investigate several strategies based on LoRA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to adapt an LLM to multiple tasks that are sequentially proposed over time. In particular, we first follow the route of training a single adapter in a sequential manner, then we explore Task Arithmetic to fuse multiple adapters trained independently [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We consider the possibility of assigning different weights to each task, and we shed some light on the factors that contribute the most to catastrophic forgetting and to effective task adaptation. The outcomes of such an investigation reveal that: (1) there is limited sensitivity to task order, i.e., regardless of the sequence in which tasks are presented, the overall average performance remains relatively stable, a property that, to the best of our knowledge, was never evaluated in the case of tasks composed of long documents; (2) despite its simplicity, Task Arithmetic demonstrates effectiveness in addressing forgetting phenomena when learning from long texts, strongly reducing the gap from multiple models independently adapted to the task data. Moreover, (3) we are the first to evaluate a recently proposed benchmark (SCROLLS [7]) in a CL setting, offering reference results for further activity in processing long sequences of text. We remark that while our experiments are based on data in the English language, the generic issues we explore about handling long sequences of text are intrinsically shared by every language.
      </p>
      <p>
        Continual Learning addresses the problem of learning from newly provided information, with models that are capable of acquiring new knowledge without forgetting the previously learned one and, more importantly, without storing the full dataset and retraining from scratch every time [8]. Several efforts are dedicated to the case of lifelong Reinforcement Learning [9] and of Supervised Learning [10], distinguishing among scenarios and categories of approaches [11], ranging from parameter isolation to regularization methods and replays [12]. Unsupervised or Self-Supervised Learning approaches are also becoming popular [13, 14, 15], as is the case of the adaptation of pre-trained backbones [16].
      </p>
      <p>
        Of course, neural models for processing language are a subject of study in the context of CL [17]. We mention the case of language modeling in Lamol [18], which is trained to concurrently solve a task and mimic training examples, thereby preserving the distribution of previous tasks. Sun et al. [12] introduce Distill and Replay, which learns to solve the task, to generate training examples formatted as context-question-answer, and to distill knowledge from a model trained on the previous task(s). Differently, Reasoning-augmented Continual Learning [19] focuses on creating reasoning pathways to preserve and improve LLMs’ reasoning abilities and information transfer.
      </p>
      <p>
        Together with works that learn new models from scratch, several approaches devise fine-tuning strategies for pre-trained Transformers in language processing, which turn out to be efficiently adaptable to a downstream task by learning only a small number of task-specific parameters [20], or by tuning the input prompt. It is the case of Adapters [21], such as the popular LoRA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which introduces new weight matrices, parametrized by the product of low-rank ones. Evaluating these models with long contexts [22] is not frequent in the scientific literature, especially in the case in which multiple fine-tunings are sequentially applied, typical of CL, which is the main focus of this paper. In particular, LoRA and Task Arithmetic [23] have been jointly studied to handle CL problems in vision [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is what this paper extends to the case of language and long sequences. We also mention works that focus on instruction-based models for CL, such as ConTinTin [24], where each task is modelled by a specific instruction that directly defines the target concept, along with a few instances that illustrate it. Scialom et al. [25] and Luo et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] investigate natural language instructions paired with memory buffers and replays.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Task-Incremental Learning on Long Sequences of Text</title>
      <p>
        Task-Incremental Learning (TIL) is a continual learning scenario where the same model is trained on tasks that are presented in a sequential manner. The main challenge consists in profitably learning from the last-presented task without forgetting the previous ones [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In order to cope with TIL on Long Sequences of Text, specifically focusing on LLMs, we consider different learning strategies. In this Section we describe each of them in detail, after having formally introduced the TIL problem.
      </p>
      <p>
        Problem. We are given a model parameterized by θ, which is a vector collecting the learnable variables. In TIL, a set T of N tasks is sequentially presented to the model, i.e., one at a time. Each task t ∈ T features data sampled from a task-specific distribution, collected into a dataset D_t := (X_t, Y_t), composed of raw samples and labeling information, respectively. The model is not only expected to learn from D_t, but also to not forget knowledge already acquired from the past tasks. In the following, to keep the notation simple, we indicate each task by a numerical index, thus t ∈ T = {1, . . . , N}. In this case of study, the model is a pre-trained LLM with billions of parameters, and all the TIL tasks are characterized by long input sequences. Such a combination constitutes a computationally demanding mix, making offline/joint training potentially very expensive, which is where CL solutions are very convenient. We consider the case in which LLMs are fine-tuned exploiting adapters [26]. In particular, we focus on LoRA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which introduces additional learnable parameters while keeping the rest of the network frozen. This is both less resource demanding and it also alleviates catastrophic forgetting, since the LoRA weights θ_A are usually a small fraction of the total model parameters, i.e., |θ_A| ≪ |θ|. Hence, it is a perfect candidate for the experience of this paper.
      </p>
      <sec id="sec-3-1">
        <p>
          Single-model TIL with LoRA (S-TIL). In the straightforward implementation of a TIL problem, tasks are presented to the model sequentially, starting from the first one up to the N-th one. The order may be given a priori, or established according to some criteria, such as task similarity or difficulty (curriculum-like learning [27]). At the beginning, when considering the first task, t = 1, we start from a model with frozen parameters θ and additional trainable weights θ_A,1, initialized as described in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. At task t, with t &gt; 1, instead, the LoRA weights are initialized with the LoRA parameters from the previous step, i.e., θ_A,t−1. It is worth noticing that, in such a way, at the end of the N tasks, the final model parameters will be constituted by the original θ, still unchanged, and a single set of adapter parameters θ_A, that was sequentially trained over all the tasks.
        </p>
        <p>Multi-model TIL with LoRA (M-TIL). Another way to face the problem of learning the multiple tasks in TIL is to build a specialized model per task, independently of the other ones. This usually yields strong performance on each sub-problem, guaranteeing no catastrophic forgetting issues, since the model to use is simply retrieved as a function of the task to solve. At the same time, such a strategy requires the storage, deployment and maintenance of N independent models, which is unsustainable with billion-sized models like current LLMs. Even when using adapters such as LoRA, maintaining many of them can still be hard to handle.</p>
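        <p>The control flow of the two strategies can be sketched as follows (a minimal sketch with toy stand-ins: init_adapter and train_adapter are hypothetical placeholders for LoRA initialization and fine-tuning on a single task, not actual library calls):</p>
        <preformat>
```python
def s_til(tasks, init_adapter, train_adapter):
    # Single-model TIL: one adapter, warm-started from the previous task.
    adapter = init_adapter()
    for data in tasks:  # tasks arrive sequentially, one at a time
        adapter = train_adapter(adapter, data)
    return adapter  # a single adapter, sequentially trained over all tasks

def m_til(tasks, init_adapter, train_adapter):
    # Multi-model TIL: one independently trained adapter per task.
    return [train_adapter(init_adapter(), data) for data in tasks]
```
        </preformat>
        <p>With toy stand-ins (e.g., an integer adapter and additive training), s_til returns one accumulated adapter, while m_til returns a list of N independent ones.</p>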
        <p>
          Task Arithmetic TIL with LoRA (TA). Based on the concept of “task vectors”, Task Arithmetic (TA) [23] was proposed to combine together the weights learned in a multi-model continual learning scenario. A task vector represents the direction, in the weight space of a pre-trained model, toward a certain task. In TA, multiple directions are fused together via a simple linear combination. Similarly, LoRA adapters steer the model behavior to improve performance on a specific task. Therefore, LoRA weights trained separately (multi-model) can be combined with task arithmetic [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]: θ_A,final = ∑_{t ∈ T} λ_t · θ_A,t, (1) where λ_t is a scalar weighting the importance of task t.
        </p>
        <p>
          Fine-tuning by Memory Buffer (FTB). In principle, TA can be applied as it is, without requiring further fine-tuning. However, we also consider refining the parameters using a memory buffer with examples from all the tasks. Indeed, experience replay is a well-known and effective strategy in Reinforcement Learning and Continual Learning problems. Examples were chosen randomly, evenly distributed across the given tasks. Since we are dealing with long documents, we keep the buffer small.
        </p>
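        <p>Eq. 1 amounts to a weighted sum of the per-task adapter weights. A minimal sketch of this fusion (toy NumPy arrays standing in for LoRA matrices; the function name and dictionary layout are illustrative, not taken from an actual adapter library):</p>
        <preformat>
```python
import numpy as np

def task_arithmetic_merge(adapters, lambdas):
    # Eq. 1: theta_final = sum over tasks of lambda_t * theta_t.
    # adapters: one dict per task, mapping parameter name to np.ndarray.
    # lambdas: one scalar importance weight per task.
    merged = {}
    for weights, lam in zip(adapters, lambdas):
        for name, value in weights.items():
            merged[name] = merged.get(name, 0.0) + lam * value
    return merged

# Toy usage: two tasks, one adapter matrix each, evenly weighted (TA).
t1 = {"lora_A": np.ones((2, 2))}
t2 = {"lora_A": 3.0 * np.ones((2, 2))}
fused = task_arithmetic_merge([t1, t2], [0.5, 0.5])
```
        </preformat>
        <p>Task-specific λ_t's (as in WTA) are obtained by simply passing uneven weights instead of 0.5 each.</p>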
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>We experimented LLMs in TIL exploiting sequences of</title>
        <p>long texts from a benchmark made public to the scientific
community in the last few years [7]. Notice that these
benchmarks are not designed for TIL. Thus, using them
in TIL is indeed a novel experience of the beaten track.</p>
        <sec id="sec-4-1-1">
          <title>4.1. Datasets</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>We consider five out of seven datasets of SCROLLS [ 7],</title>
        <p>that is the reference benchmark for tasks composed of
long documents. Datasets belong to diferent domains,
and they are about diferent tasks, that we adapted to
TIL by means of instruction tuning. An overview of the
benchmark is provided in Table 1, and here we briefly
describe each dataset.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Qasper</title>
        <p>Qasper [28] (QSPR) is a Question Answering (QA) dataset on academic papers. Crafted by NLP experts, it contains questions based on the title and abstract of the paper. There are different kinds of inquiries: abstractive, extractive and yes/no questions, including unanswerable ones. To answer the questions, the entire paper must be read.</p>
      </sec>
      <sec id="sec-4-4">
        <title>QuALITY</title>
        <p>QuALITY [29] (QALT) is a multiple-choice QA dataset, drawing upon English source articles with an average length of about 5,000 tokens. Original texts are provided in HTML format, retaining paragraph breaks and basic formatting such as italics, but with images removed. Questions are designed to require details from different parts of the text to properly answer them.</p>
      </sec>
      <sec id="sec-4-5">
        <title>QMSum</title>
        <p>QMSum, presented in [30], is a question-based document summarization benchmark. The dataset is characterized by long meeting transcripts, collecting 1,808 query-summary pairs from 232 different meetings.</p>
      </sec>
      <sec id="sec-4-6">
        <title>ContractNLI</title>
        <p>ContractNLI [31] (CNLI) is the first dataset for Natural Language Inference in contracts. Given a premise and a contract, a model has to classify whether the premise is entailed by, contradicting, or not mentioned by the contract. There are 607 contracts and 17 unique hypotheses, combined to get 10,319 examples.</p>
      </sec>
      <sec id="sec-4-7">
        <title>SummScreenFD</title>
        <p>SummScreen [32] (SumScr) is a summarization dataset of TV series transcripts and human-written recaps. Examples come from two different sources, but in SCROLLS the authors only kept ForeverDreaming (FD), due to its greater variety of shows.</p>
        <sec id="sec-4-7-1">
          <title>4.2. Experimental Setup and Results</title>
        </sec>
      </sec>
      <sec id="sec-4-8">
        <p>We consider Mistral-7B-v0.1 [33] as the backbone LLM for all the fine-tuned models in our TIL experiments. Albeit trained on a restricted context length of at most 8,192 tokens, it supports longer inputs of size up to 32,768. The LLM was quantized via 4-bit quantization in order to fit long sequences on a single A6000 GPU. During training, the micro batch size was set to 1, with 32 gradient accumulation steps. LoRA adapters were updated with AdamW for 3 epochs in all the experiments, regardless of the dataset. At inference time, outputs were generated using Beam Search with beam size set to 2. We compared: (i) Mistral-7B-v0.1-Instruct, the instruction-tuned version of Mistral, referred to as Mistral-7b-instruct; (ii) the case of multiple independent LoRA adapters, each of them trained on a single dataset, i.e., M-TIL (Section 3); (iii) classic TIL with a single model, progressively updated on the sequence of tasks, i.e., S-TIL (Section 3), considering both the case in which tasks are provided in a certain order (S-TIL↓) or in the opposite one (S-TIL↑); (iv) Task Arithmetic (Section 3) with even values of the λ_t's (TA) or with task-specific λ_t's based on prior knowledge (WTA).</p>
      </sec>
      <sec id="sec-4-9">
        <title>Evaluation</title>
        <p>
          Due to the different nature of each task in SCROLLS, there are different metrics to take into account for each of them. In particular, summarization-like tasks (QMSum and SummScreenFD) are evaluated with the ROUGE score [34] (1, 2 and L), whereas ContractNLI and QuaLITY are assessed with Exact Match (EM). Finally, results on Qasper are measured by F1. A global overview of the metrics can be found in Table 1. We indicate with s_t the score yielded by the associated metric for task t. Following the way the SCROLLS benchmark was proposed, scores are averaged to provide a unique index of Overall Performance, S_t. Since we focus on TIL, we evaluate S_t after each task t, and we also compute the Overall Forgetting at task t (F_t), also known as the index of negative backward transfer [35], which tells how strongly the previously considered tasks have been negatively affected by learning from the current task t, i.e., a measure of catastrophic forgetting [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Formally, S_t = (1/t) ∑_{j=1}^{t} s_{j,t}, and F_t = [ (1/(t−1)) ∑_{j=1}^{t−1} (s_{j,j} − s_{j,t}) ]_+, where [·]_+ keeps the positive part, and s_{j,t} is the score of task j after having learned from task t ∈ T.
        </p>
        <p>
          Since the test set of SCROLLS is not public, we used the SCROLLS validation set as test set, and sampled a sub-portion of the training data to build a validation set. After cross-validation, we set the rank of LoRA to 8, the dropout-rate to 0.05, and α to 16 (see [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for a description of the parameters), with learning rate 3 · 10−4 (linearly decaying).
        </p>
        <p>Investigating S-TIL. Dealing with long sequences of text might affect the TIL procedure as a function of the order in which tasks are presented. We study different task orderings based on the average length of the sequences of text in each task, from tasks involving shorter output sequences to the ones involving longer sequences, and vice-versa. As anticipated, we named them S-TIL↑ and S-TIL↓, respectively. Results of this experience are presented in detail in Table 2. The training order does strongly affect the final performance on single tasks, promoting higher scores on more recently seen datasets. On one hand, this is expected, since the older ones are more likely affected by catastrophic forgetting. Catastrophic forgetting (last columns of Table 2) at t = N = 5 is below 10% in both cases. On the other hand, there is an evident peak of forgetting in S-TIL↓ at t = 3, which is then reduced when learning from the following tasks. The peak is due to a strong reduction of performance in the first two tasks after having learned from Qasper (QSPR). We investigated this aspect, and found that the model fails in generating the perfectly-formatted output string that is then exploited in the EM metric. When moving to the following task, this skill is partially recovered. We hypothesize that the presence of unanswerable questions in Qasper negatively biases the types of answers in SummScreenFD (SumScr) and QMSum, where all the questions have an answer instead.</p>
        <p>Comparing S-TIL and M-TIL. Figure 1 compares the models of Table 2 (for t = N) with M-TIL, which is composed of multiple adapters, each of them specifically trained on a task, and thus forgetting-free. Performances of both S-TIL variants are lower than those of M-TIL, as expected, but sometimes not far from them. Comparing S-TIL↑ and S-TIL↓, we see that they reach similar overall performances, but the latter yields better results in three out of five tasks. The quality of S-TIL↑ (w.r.t. S-TIL↓) improves going right-to-left and, symmetrically, the one of S-TIL↓ increases going left-to-right, as expected, since they were trained in opposite order (the relative gain is &gt; 1 in SumScr due to forward transfer).</p>
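        <p>The two indices used in this section can be computed directly from the matrix of scores s_{j,t}. A sketch with a toy, made-up score matrix (the values below are illustrative only, not results from the paper):</p>
        <preformat>
```python
def overall_performance(s, t):
    # S_t: average score over the first t tasks, measured after learning task t.
    # s[j][k] holds the score of task j+1 evaluated after training on task k+1.
    return sum(s[j][t - 1] for j in range(t)) / t

def overall_forgetting(s, t):
    # F_t: positive part of the average drop of past tasks after task t
    # (index of negative backward transfer).
    if t == 1:
        return 0.0
    drop = sum(s[j][j] - s[j][t - 1] for j in range(t - 1)) / (t - 1)
    return max(drop, 0.0)

# Toy score matrix for 3 tasks (rows: tasks, columns: training steps;
# entries before a task is learned are unused).
s = [[40.0, 35.0, 30.0],
     [None, 50.0, 45.0],
     [None, None, 60.0]]
```
        </preformat>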
        <p>The Role of TA. We compared all the introduced models with the case of merging independently-trained adapters with TA. Table 3 shows that TA turns out to be a simple yet competitive solution, with average performance on par with S-TIL↓. Actually, observing task-wise performance, we can see how TA outperforms S-TIL↓ across all the datasets, with the exception of ContractNLI (CNLI), the last task on which S-TIL↓ was specialized. In WTA, the λ_t's for non-QA datasets were halved, since those tasks involve the generation of longer outputs that more strongly condition the behaviour of the LLM, as already discussed for Qasper. WTA yielded evident improvements on the last two datasets, despite their being less weighed, keeping similar performance on the others. This suggests that appropriately weighing the task-vectors in Eq. 1 is a viable road to improve the model.</p>
        <p>We also investigate the impact of rehashing the memory of the TA/WTA model via fine-tuning it on just 50 samples per task (memory buffer). Despite being a simple refinement stage, results presented in Table 3 show a consistent boost of performance when using the memory buffer (FTB), reaching an averaged score of about 39.0 when using the weighted TA version, significantly reducing the gap from the independent adapters solution of M-TIL. Figure 2 provides a quick view on the already presented results of all the TA methods we considered, reporting also the Relative Gain w.r.t. M-TIL. Indeed, we can observe that the relative drop in performance is always below 11%.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We investigated Large Language Models in progressively learning from tasks involving long sequences of text. A pre-trained model was paired with one or more adapters (LoRA), and we analyzed the role of Task Arithmetic, showing that it yields performances that are not far from the ones of multiple models independently trained to solve each task. Our results suggest a viable road to mitigate the need of large computational resources when learning from tasks based on “long” documents. While we exploited data in the English language, the experiences of this paper can be interpreted as generic attempts to leverage long sequences in Continual Learning, in a sense going beyond the language barrier. Future work will consider schemes to automatically tune the Task Arithmetic [36].</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The work was partially funded by:</p>
      <p>• “ReSpiRA - REplicabilità, SPIegabilità e Ragionamento”, a project financed by FAIR, affiliated to spoke no. 2, falling within the PNRR MUR programme, Mission 4, Component 2, Investment 1.3, D.D. No. 341 of 03/15/2022, Project PE0000013, CUP B43D22000900004;</p>
      <p>• “enRichMyData - Enabling Data Enrichment Pipelines for AI-driven Business Products and Services”, a Horizon Europe (HE) project, grant agreement ID: 101070284.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[7] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, O. Levy, Scrolls: Standardized comparison over long language sequences, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, 2022, pp. 12007–12021.</p>
      <p>[8] M. Gori, S. Melacci, Collectionless artificial intelligence, arXiv preprint arXiv:2309.06938 (2023).</p>
      <p>[9] K. Khetarpal, M. Riemer, I. Rish, D. Precup, Towards continual reinforcement learning: A review and perspectives, Journal of Artificial Intelligence Research 75 (2022) 1401–1476.</p>
      <p>[10] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, T. Tuytelaars, A continual learning survey: Defying forgetting in classification tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 3366–3385.</p>
      <p>[11] G. M. van de Ven, A. S. Tolias, Three continual learning scenarios, in: NeurIPS Continual Learning Workshop, volume 1, 2018.</p>
      <p>[12] J. Sun, S. Wang, J. Zhang, C. Zong, Distill and replay for continual language learning, in: Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, December 8-13, 2020, pp. 3569–3579.</p>
      <p>[13] S. Marullo, M. Tiezzi, A. Betti, L. Faggi, E. Meloni, S. Melacci, Continual unsupervised learning for optical flow estimation with deep networks, in: Conference on Lifelong Learning Agents, PMLR, 2022, pp. 183–200.</p>
      <p>[14] S. Paul, L.-J. Frey, R. Kamath, K. Kersting, M. Mundt, Masked autoencoders are efficient continual federated learners, arXiv preprint arXiv:2306.03542 (2023).</p>
      <p>[15] M. Tiezzi, S. Marullo, L. Faggi, E. Meloni, A. Betti, S. Melacci, Stochastic coherence over attention trajectory for continuous learning in video streams, in: L. D. Raedt (Ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, International Joint Conferences on Artificial Intelligence Organization, 2022, pp. 3480–3486. URL: https://doi.org/10.24963/ijcai.2022/483. doi:10.24963/ijcai.2022/483, main track.</p>
      <p>[16] S. Marullo, M. Tiezzi, M. Gori, S. Melacci, T. Tuytelaars, Continual learning with pretrained backbones by tuning in the input space, in: 2023 International Joint Conference on Neural Networks (IJCNN), IEEE, 2023, pp. 1–9.</p>
      <p>[17] T. Wu, L. Luo, Y.-F. Li, S. Pan, T.-T. Vu, G. Haffari, Continual learning for large language models: A survey, arXiv preprint arXiv:2402.01364 (2024).</p>
      <p>[18] F.-K. Sun, C.-H. Ho, H.-Y. Lee, Lamol: Language modeling for lifelong language learning, arXiv preprint arXiv:1909.03329 (2019).</p>
      <p>[19] X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, X. Huang, Trace: A comprehensive benchmark for continual learning in large language models, 2023. arXiv:2310.06762v1.</p>
      <p>[20] Q. Zhu, B. Li, F. Mi, X. Zhu, M. Huang, Continual prompt tuning for dialog state tracking, 2022. arXiv:2203.06654.</p>
      <p>[21] R. He, L. Liu, H. Ye, Q. Tan, B. Ding, L. Cheng, J.-W. Low, L. Bing, L. Si, On the effectiveness of adapter-based tuning for pretrained language model adaptation, arXiv preprint arXiv:2106.03164 (2021).</p>
      <p>[22] Y. Chen, S. Qian, Z. Liu, H. Tang, X. Lai, S. Han, J. Jia, Longlora: Efficient fine-tuning of long context large language models, 2023. arXiv:2309.12307v2.</p>
      <p>[23] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, A. Farhadi, Editing models with task arithmetic, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.</p>
      <p>[24] W. Yin, J. Li, C. Xiong, Contintin: Continual learning from task instructions, arXiv preprint arXiv:2203.08512 (2022).</p>
      <p>[25] T. Scialom, T. Chakrabarty, S. Muresan, Fine-tuned language models are continual learners, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 6107–6122.</p>
      <p>[26] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for nlp, in: International Conference on Machine Learning, PMLR, 2019, pp. 2790–2799.</p>
      <p>[27] X. Wang, Y. Chen, W. Zhu, A survey on curriculum learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 4555–4576.</p>
      <p>[28] P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, M. Gardner, A dataset of information-seeking questions and answers anchored in research papers, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online. Association for Computational Linguistics, 2021, pp. 4599–4610.</p>
      <p>[29] R. Y. Pang, A. Parrish, N. Joshi, N. Nangia, J. Phang, A. Chen, V. Padmakumar, J. Ma, J. Thompson, H. He, et al., Quality: Question answering with long input texts, yes!, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022.</p>
      <p>[30] M. Zhong, D. Yin, T. Yu, A. Zaidi, M. Mutuma, R. Jha, A. H. Awadallah, A. Celikyilmaz, Y. Liu, X. Qiu, D. Radev, Qmsum: A new benchmark for query-based multi-domain meeting summarization, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online. Association for Computational Linguistics, 2021, pp. 5905–5921.</p>
      <p>[31] Y. Koreeda, C. D. Manning, Contractnli: A dataset for document-level natural language inference for contracts, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 1907–1919.</p>
      <p>[32] M. Chen, Z. Chu, S. Wiseman, K. Gimpel, Summscreen: A dataset for abstractive screenplay summarization, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8602–8615.</p>
      <p>[33] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.</p>
      <p>[34] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, 2004, pp. 74–81.</p>
      <p>[35] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[36] M. Tiezzi, S. Marullo, F. Becattini, S. Melacci, Continual neural computation, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2024, pp. 340–356.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
<string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <article-title>Embracing change: Continual learning in deep neural networks</article-title>
          ,
          <source>Trends in Cognitive Sciences</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>1028</fpage>
          -
          <lpage>1040</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of continual learning: Theory, method and application</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>46</volume>
          (
          <year>2024</year>
          )
          <fpage>5362</fpage>
          -
          <lpage>5383</lpage>
. doi:10.1109/TPAMI.2024.3367329.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
<string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>An empirical study of catastrophic forgetting in large language models during continual fine-tuning</article-title>
          ,
<year>2023</year>
          . arXiv:2308.08747v2 [cs.CL].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2106.09685</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chitale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaidya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghotkar</surname>
          </string-name>
          ,
          <article-title>Task Arithmetic with LoRA for Continual Learning</article-title>
          ,
<source>in: Workshop on Advancing Neural Network Training at the 37th Conference on Neural Information Processing Systems (WANT@NeurIPS 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>