<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Task-Incremental Learning on Long Text Sequences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Natalia Graziuso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Zugarini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Melacci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering and Mathematics, University of Siena</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>expert.ai</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The extraordinary results achieved by Large Language Models are paired with issues that are critical in real-world applications. The costs of inference and, in particular, training are extremely large, both in terms of time and computational resources, and they become prohibitive when working in dynamic environments, where data and tasks are progressively provided over time. The model must be able to adapt to new knowledge, new domains and new settings, without forgetting the previously learned skills. Retraining from scratch easily becomes too costly, thus Continual Learning strategies are of crucial importance. This is even more evident when data consist of “long” documents, which require significant resources to be processed by modern neural models, leading to very long prompts. This paper investigates LLM-based Task-Incremental Learning in the case of tasks exploiting long sequences of text, as is typical in summarization, question-answering on long documents, reviewing long contracts, and several other applications. We show how adapting the model by Task Arithmetic with LoRA, which was proposed for visual data, yields promising results also in the case of such “long” text data. To the best of our knowledge, this is the first work along this challenging direction. The outcome of the investigation of this paper is generic enough to represent an important starting point for further research in processing linguistic data in every language.</p>
      </abstract>
      <kwd-group>
        <kwd>Continual Learning</kwd>
        <kwd>Task-Incremental Learning</kwd>
        <kwd>Long Sequences of Text</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The quality of Language Models (LMs) has been rapidly improving in the last decade, showing outstanding skills when scaled to large data and networks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], leading to the nowadays popular Large Language Models (LLMs). Solving more complex tasks with LLMs often requires processing “long” documents and articulated long instructions. However, handling lengthy prompts can be a significant obstacle for real-world applications, raising the costs and resources required during both inference and, in particular, training. This issue can become critical when the LLM needs to be specialized to many different tasks, domains, and, more generally, when it is applied to dynamic settings that require multiple adaptations. For instance, in real-world applications, models need to be re-trained from time to time, as new data/tasks become available. In such scenarios, the need for Continual Learning (CL) [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] strategies becomes imperative. From a very generic perspective, CL focuses on the development of algorithms capable of sequentially learning from a stream of data, while preserving what was learnt in past experiences and avoiding catastrophic forgetting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this work, motivated by the aforementioned issues, we study the problem of Continual Learning from “long” sequences of text, exploiting LLMs. We investigate several strategies based on LoRA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to adapt an LLM to multiple tasks that are sequentially proposed over time. In particular, we first follow the route of training a single adapter in a sequential manner, then we explore Task Arithmetic to fuse multiple adapters trained independently [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We consider the possibility of assigning different weights to each task, and we shed some light on the factors that contribute the most to catastrophic forgetting and to effective task adaptation. The outcomes of such an investigation reveal that: (1) there is limited sensitivity to task order, i.e., regardless of the sequence in which tasks are presented, the overall average performance remains relatively stable, a property that, to the best of our knowledge, was never evaluated in the case of tasks composed of long documents; (2) despite its simplicity, Task Arithmetic demonstrates effectiveness in addressing forgetting phenomena when learning from long texts, strongly reducing the gap from multiple models independently adapted to the task data. Moreover, (3) we are the first to evaluate a recently proposed benchmark (SCROLLS [7]) in a CL setting, offering reference results for further activity in processing long sequences of text. We remark that while our experiments are based on data in the English language, the generic issues we explore about handling long sequences of text are intrinsically shared by every language.
      </p>
      <p>
        Continual Learning addresses the problem of learning from newly provided information, with models that are capable of acquiring new knowledge without forgetting the previously learned one and, more importantly, without storing the full dataset and retraining from scratch every time [8]. Several efforts are dedicated to the case of lifelong Reinforcement Learning [9] and of Supervised Learning [10], distinguishing among scenarios and categories of approaches [11], ranging from parameter isolation to regularization methods and replays [12]. Unsupervised or Self-Supervised Learning approaches are also becoming popular [13, 14, 15], as is the case of the adaptation of pre-trained backbones [16].
      </p>
      <p>
        Of course, neural models for processing language are a subject of study in the context of CL [17]. We mention the case of language modeling in Lamol [18], which is trained to concurrently solve a task and mimic training examples, thereby preserving the distribution of previous tasks. Sun et al. [12] introduce Distill and Replay, which learns to solve the task, to generate training examples formatted as context-question-answer, and to distill knowledge from a model trained on the previous task(s). Differently, Reasoning-augmented Continual Learning [19] focuses on creating reasoning pathways to preserve and improve LLMs’ reasoning abilities and information transfer.
      </p>
      <p>
        Together with works that learn new models from scratch, several approaches devise fine-tuning strategies for pre-trained Transformers in language processing, which turn out to be efficiently adaptable to a downstream task by learning only a small number of task-specific parameters [20], or by tuning the input prompt. It is the case of Adapters [21], such as the popular LoRA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which introduces new weight matrices, parametrized by the product of low-rank ones. Evaluating these models with long contexts [22] is not frequent in the scientific literature, especially in the case in which multiple fine-tunings are sequentially applied, typical of CL, which is the main focus of this paper. In particular, LoRA and Task Arithmetic [23] have been jointly studied to handle CL problems in vision [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is what this paper extends to the case of language and long sequences. We also mention works that focus on instruction-based models for CL, such as ConTinTin [24], where each task is modelled by a specific instruction that directly defines the target concept, along with a few instances that illustrate it. Scialom et al. [25] and Luo et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] investigate natural language instructions paired with memory buffers and replays.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Task-Incremental Learning on Long Sequences of Text</title>
      <p>
        Task-Incremental Learning (TIL) is a continual learning scenario where the same model is trained on tasks that are presented in a sequential manner. The main challenge consists in profitably learning from the last-presented task without forgetting the previous ones [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In order to cope with TIL on Long Sequences of Text, specifically focusing on LLMs, we consider different learning strategies. In this Section we describe each of them in detail, after having formally introduced the TIL problem.
      </p>
      <p>
        Problem. We are given a model parameterized by θ, which is a vector collecting the learnable variables. In TIL, a set T of N tasks is sequentially presented to the model, i.e., one at a time. Each task t ∈ T features data sampled from a task-specific distribution, collected into a dataset D_t := (X_t, Y_t), composed of raw samples and labeling information, respectively. The model is not only expected to learn from D_t, but also to not forget knowledge already acquired from the past tasks. In the following, to keep the notation simple, we indicate each task by a numerical index, thus t ∈ T = {1, . . . , N}. In this case of study, the model is a pre-trained LLM with billions of parameters, and all the TIL tasks are characterized by long input sequences. Such a combination constitutes a computationally demanding mix, making offline/joint training potentially very expensive, which is where CL solutions are very convenient. We consider the case in which LLMs are fine-tuned exploiting adapters [26]. In particular, we focus on LoRA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which introduces additional learnable parameters while keeping the rest of the network frozen. This is both less resource demanding and it also alleviates catastrophic forgetting, since the LoRA weights θ_A are usually a small fraction of the total model parameters, i.e., |θ_A| ≪ |θ|. Hence, it is a perfect candidate for the experience of this paper.
      </p>
      <sec id="sec-3-1">
        <p>
          Single-model TIL with LoRA (S-TIL). In the straightforward implementation of a TIL problem, tasks are presented to the model sequentially, starting from the first one up to the N-th one. The order may be given a priori, or established according to some criteria, such as task similarity or difficulty (curriculum-like learning [27]). At the beginning, when considering the first task, t = 1, we start from a model with frozen parameters θ and additional trainable weights θ_A,1, initialized as described in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. At task t, with t &gt; 1, instead, the LoRA weights are initialized with the LoRA parameters from the previous step, i.e., θ_A,t−1. It is worth noticing that, in such a way, at the end of the N tasks, the final model parameters will be constituted by the original θ, still unchanged, and a single set of adapter parameters θ_A, that was sequentially trained over all the tasks.
        </p>
        <p>Multi-model TIL with LoRA (M-TIL). Another way to face the problem of learning the multiple tasks in TIL is to build a specialized model per task, independently of the other ones. This usually yields strong performance on each sub-problem, guaranteeing no catastrophic forgetting issues, since the model to use is simply retrieved as a function of the task to solve. At the same time, such a strategy requires the storage, deployment and maintenance of N independent models, which is unsustainable with billion-sized models like current LLMs. Even when using adapters such as LoRA, maintaining many of them can still be hard to handle.</p>
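        <p>The control flow of the two strategies can be sketched as follows (a minimal sketch with toy stand-ins: init_adapter and train_adapter are hypothetical placeholders for LoRA initialization and fine-tuning on a single task, not actual library calls):</p>
        <preformat>
```python
def s_til(tasks, init_adapter, train_adapter):
    # Single-model TIL: one adapter, warm-started from the previous task.
    adapter = init_adapter()
    for data in tasks:  # tasks arrive sequentially, one at a time
        adapter = train_adapter(adapter, data)
    return adapter  # a single adapter, sequentially trained over all tasks

def m_til(tasks, init_adapter, train_adapter):
    # Multi-model TIL: one independently trained adapter per task.
    return [train_adapter(init_adapter(), data) for data in tasks]
```
        </preformat>
        <p>With toy stand-ins (e.g., an integer adapter and additive training), s_til returns one accumulated adapter, while m_til returns a list of N independent ones.</p>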
        <p>
          Task Arithmetic TIL with LoRA (TA). Based on the concept of “task vectors”, Task Arithmetic (TA) [23] was proposed to combine together the weights learned in a multi-model continual learning scenario. A task vector represents the direction, in the weight space of a pre-trained model, toward a certain task. In TA, multiple directions are fused together via a simple linear combination. Similarly, LoRA adapters steer the model behavior to improve performance on a specific task. Therefore, LoRA weights trained separately (multi-model) can be combined with task arithmetic [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]: θ_A,final = ∑_{t ∈ T} λ_t · θ_A,t, (1) where λ_t is a scalar weighting the importance of task t.
        </p>
        <p>
          Fine-tuning by Memory Buffer (FTB). In principle, TA can be applied as it is, without requiring further fine-tuning. However, we also consider refining the parameters using a memory buffer with examples from all the tasks. Indeed, experience replay is a well-known and effective strategy in Reinforcement Learning and Continual Learning problems. Examples were chosen randomly, evenly distributed across the given tasks. Since we are dealing with long documents, we keep the buffer small.
        </p>
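        <p>Eq. 1 amounts to a weighted sum of the per-task adapter weights. A minimal sketch of this fusion (toy NumPy arrays standing in for LoRA matrices; the function name and dictionary layout are illustrative, not taken from an actual adapter library):</p>
        <preformat>
```python
import numpy as np

def task_arithmetic_merge(adapters, lambdas):
    # Eq. 1: theta_final = sum over tasks of lambda_t * theta_t.
    # adapters: one dict per task, mapping parameter name to np.ndarray.
    # lambdas: one scalar importance weight per task.
    merged = {}
    for weights, lam in zip(adapters, lambdas):
        for name, value in weights.items():
            merged[name] = merged.get(name, 0.0) + lam * value
    return merged

# Toy usage: two tasks, one adapter matrix each, evenly weighted (TA).
t1 = {"lora_A": np.ones((2, 2))}
t2 = {"lora_A": 3.0 * np.ones((2, 2))}
fused = task_arithmetic_merge([t1, t2], [0.5, 0.5])
```
        </preformat>
        <p>Task-specific λ_t's (as in WTA) are obtained by simply passing uneven weights instead of 0.5 each.</p>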
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>We experimented LLMs in TIL exploiting sequences of</title>
        <p>long texts from a benchmark made public to the scientific
community in the last few years [7]. Notice that these
benchmarks are not designed for TIL. Thus, using them
in TIL is indeed a novel experience of the beaten track.</p>
        <sec id="sec-4-1-1">
          <title>4.1. Datasets</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>We consider five out of seven datasets of SCROLLS [ 7],</title>
        <p>that is the reference benchmark for tasks composed of
long documents. Datasets belong to diferent domains,
and they are about diferent tasks, that we adapted to
TIL by means of instruction tuning. An overview of the
benchmark is provided in Table 1, and here we briefly
describe each dataset.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Qasper</title>
        <p>Qasper [28] (QSPR) is a Question Answering (QA) dataset on academic papers. Crafted by NLP experts, it contains questions based on the title and abstract of the paper. There are different kinds of inquiries: abstractive, extractive and yes/no questions, including unanswerable ones. To answer the questions, the entire paper must be read.</p>
      </sec>
      <sec id="sec-4-4">
        <title>QuALITY</title>
        <p>QuALITY [29] (QALT) is a multiple-choice QA dataset, drawing upon English source articles with an average length of about 5,000 tokens. Original texts are provided in HTML format, retaining paragraph breaks and basic formatting such as italics, but with images removed. Questions are designed to require details from different parts of the text to properly answer them.</p>
      </sec>
      <sec id="sec-4-5">
        <title>QMSum</title>
        <p>QMSum, presented in [30], is a question-based document summarization benchmark. The dataset is characterized by long meeting transcripts, collecting 1,808 query-summary pairs from 232 different meetings.</p>
      </sec>
      <sec id="sec-4-6">
        <title>ContractNLI</title>
        <p>ContractNLI [31] (CNLI) is the first dataset for Natural Language Inference in contracts. Given a premise and a contract, a model has to classify whether the premise is entailed by, contradicting, or not mentioned by the contract. There are 607 contracts and 17 unique hypotheses, combined to get 10,319 examples.</p>
      </sec>
      <sec id="sec-4-7">
        <title>SummScreenFD</title>
        <p>SummScreen [32] (SumScr) is a summarization dataset of TV series transcripts and human-written recaps. Examples come from two different sources, but in SCROLLS the authors only kept ForeverDreaming (FD), due to its greater variety of shows.</p>
        <sec id="sec-4-7-1">
          <title>4.2. Experimental Setup and Results</title>
        </sec>
      </sec>
      <sec id="sec-4-8">
        <p>We consider Mistral-7B-v0.1 [33] as the backbone LLM for all the fine-tuned models in our TIL experiments. Albeit trained on a restricted context length of at most 8,192 tokens, it supports longer inputs of size up to 32,768. The LLM was quantized via 4-bit quantization in order to fit long sequences on a single A6000 GPU. During training, the micro batch size was set to 1, with 32 gradient accumulation steps. LoRA adapters were updated with AdamW for 3 epochs in all the experiments, regardless of the dataset. At inference time, outputs were generated using Beam Search with beam size set to 2. We compared: (i) Mistral-7B-v0.1-Instruct, the instruction-tuned version of Mistral, referred to as Mistral-7b-instruct; (ii) the case of multiple independent LoRA adapters, each of them trained on a single dataset, i.e., M-TIL (Section 3); (iii) classic TIL with a single model, progressively updated on the sequence of tasks, i.e., S-TIL (Section 3), considering both the case in which tasks are provided in a certain order (S-TIL↓) or in the opposite one (S-TIL↑); (iv) Task Arithmetic (Section 3) with even values of the λ_t's (TA) or with task-specific λ_t's based on prior knowledge (WTA).</p>
      </sec>
      <sec id="sec-4-9">
        <title>Evaluation</title>
        <p>
          Due to the different nature of each task in SCROLLS, there are different metrics to take into account for each of them. In particular, summarization-like tasks (QMSum and SummScreenFD) are evaluated with the ROUGE score [34] (1, 2 and L), whereas ContractNLI and QuaLITY are assessed with Exact Match (EM). Finally, results on Qasper are measured by F1. A global overview of the metrics can be found in Table 1. We indicate with s_t the score yielded by the associated metric for task t. Following the way the SCROLLS benchmark was proposed, scores are averaged to provide a unique index of Overall Performance, S_t. Since we focus on TIL, we evaluate S_t after each task t, and we also compute the Overall Forgetting at task t (F_t), also known as the index of negative backward transfer [35], which tells how strongly the previously considered tasks have been negatively affected by learning from the current task t, i.e., a measure of catastrophic forgetting [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Formally, S_t = (1/t) ∑_{j=1}^{t} s_{j,t}, and F_t = [ (1/(t−1)) ∑_{j=1}^{t−1} (s_{j,j} − s_{j,t}) ]_+, where [·]_+ keeps the positive part, and s_{j,t} is the score of task j after having learned from task t ∈ T.
        </p>
        <p>
          Since the test set of SCROLLS is not public, we used the SCROLLS validation set as test set, and sampled a sub-portion of the training data to build a validation set. After cross-validation, we set the rank of LoRA to 8, the dropout-rate to 0.05, and α to 16 (see [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for a description of the parameters), with learning rate 3 · 10−4 (linearly decaying).
        </p>
        <p>Investigating S-TIL. Dealing with long sequences of text might affect the TIL procedure as a function of the order in which tasks are presented. We study different task orderings based on the average length of the sequences of text in each task, from tasks involving shorter output sequences to the ones involving longer sequences, and vice-versa. As anticipated, we named them S-TIL↑ and S-TIL↓, respectively. Results of this experience are presented in detail in Table 2. The training order does strongly affect the final performance on single tasks, promoting higher scores on more recently seen datasets. On one hand, this is expected, since the older ones are more likely affected by catastrophic forgetting. Catastrophic forgetting (last columns of Table 2) at t = N = 5 is below 10% in both cases. On the other hand, there is an evident peak of forgetting in S-TIL↓ at t = 3, which is then reduced when learning from the following tasks. The peak is due to a strong reduction of performance in the first two tasks after having learned from Qasper (QSPR). We investigated this aspect, and found that the model fails in generating the perfectly-formatted output string that is then exploited in the EM metric. When moving to the following task, this skill is partially recovered. We hypothesize that the presence of unanswerable questions in Qasper negatively biases the types of answers in SummScreenFD (SumScr) and QMSum, where all the questions have an answer instead.</p>
        <p>Comparing S-TIL and M-TIL. Figure 1 compares the models of Table 2 (for t = N) with M-TIL, which is composed of multiple adapters, each of them specifically trained on a task, and thus forgetting-free. Performances of both S-TIL variants are lower than those of M-TIL, as expected, but sometimes not far from them. Comparing S-TIL↑ and S-TIL↓, we see that they reach similar overall performances, but the latter yields better results in three out of five tasks. The quality of S-TIL↑ (w.r.t. S-TIL↓) improves going right-to-left and, symmetrically, the one of S-TIL↓ increases going left-to-right, as expected, since they were trained in opposite order (the relative gain is &gt; 1 in SumScr due to forward transfer).</p>
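        <p>The two indices used in this section can be computed directly from the matrix of scores s_{j,t}. A sketch with a toy, made-up score matrix (the values below are illustrative only, not results from the paper):</p>
        <preformat>
```python
def overall_performance(s, t):
    # S_t: average score over the first t tasks, measured after learning task t.
    # s[j][k] holds the score of task j+1 evaluated after training on task k+1.
    return sum(s[j][t - 1] for j in range(t)) / t

def overall_forgetting(s, t):
    # F_t: positive part of the average drop of past tasks after task t
    # (index of negative backward transfer).
    if t == 1:
        return 0.0
    drop = sum(s[j][j] - s[j][t - 1] for j in range(t - 1)) / (t - 1)
    return max(drop, 0.0)

# Toy score matrix for 3 tasks (rows: tasks, columns: training steps;
# entries before a task is learned are unused).
s = [[40.0, 35.0, 30.0],
     [None, 50.0, 45.0],
     [None, None, 60.0]]
```
        </preformat>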
        <p>The Role of TA. We compared all the introduced models with the case of merging independently-trained adapters with TA. Table 3 shows that TA turns out to be a simple yet competitive solution, with average performance on par with S-TIL↓. Actually, observing task-wise performance, we can see how TA outperforms S-TIL↓ across all the datasets, with the exception of ContractNLI (CNLI), the last task on which S-TIL↓ was specialized. In WTA, the λ_t's for non-QA datasets were halved, since those tasks involve the generation of longer outputs that more strongly condition the behaviour of the LLM, as already discussed for Qasper. WTA yielded evident improvements on the last two datasets, despite their being less weighed, keeping similar performance on the others. This suggests that appropriately weighing the task-vectors in Eq. 1 is a viable road to improve the model.</p>
        <p>We also investigate the impact of rehashing the memory of the TA/WTA model via fine-tuning it on just 50 samples per task (memory buffer). Despite being a simple refinement stage, results presented in Table 3 show a consistent boost of performance when using the memory buffer (FTB), reaching an averaged score of about 39.0 when using the weighted TA version, significantly reducing the gap from the independent adapters solution of M-TIL. Figure 2 provides a quick view on the already presented results of all the TA methods we considered, reporting also the Relative Gain w.r.t. M-TIL. Indeed, we can observe that the relative drop in performance is always below 11%.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We investigated Large Language Models in progressively learning from tasks involving long sequences of text. A pre-trained model was paired with one or more adapters (LoRA), and we analyzed the role of Task Arithmetic, showing that it yields performances that are not far from the ones of multiple models independently trained to solve each task. Our results suggest a viable road to mitigate the need of large computational resources when learning from tasks based on “long” documents. While we exploited data in the English language, the experiences of this paper can be interpreted as generic attempts to leverage long sequences in Continual Learning, in a sense going beyond the language barrier. Future work will consider schemes to automatically tune the Task Arithmetic [36].</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The work was partially funded by:</p>
      <p>• “ReSpiRA - REplicabilità, SPIegabilità e Ragionamento”, a project financed by FAIR, affiliated to spoke no. 2, falling within the PNRR MUR programme, Mission 4, Component 2, Investment 1.3, D.D. No. 341 of 03/15/2022, Project PE0000013, CUP B43D22000900004;</p>
      <p>• “enRichMyData - Enabling Data Enrichment Pipelines for AI-driven Business Products and Services”, a Horizon Europe (HE) project, grant agreement ID: 101070284.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[7] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, O. Levy, Scrolls: Standardized comparison over long language sequences, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, 2022, pp. 12007–12021.</p>
      <p>[8] M. Gori, S. Melacci, Collectionless artificial intelligence, arXiv preprint arXiv:2309.06938 (2023).</p>
      <p>[9] K. Khetarpal, M. Riemer, I. Rish, D. Precup, Towards continual reinforcement learning: A review and perspectives, Journal of Artificial Intelligence Research 75 (2022) 1401–1476.</p>
      <p>[10] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, T. Tuytelaars, A continual learning survey: Defying forgetting in classification tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 3366–3385.</p>
      <p>[11] G. M. van de Ven, A. S. Tolias, Three continual learning scenarios, in: NeurIPS Continual Learning Workshop, volume 1, 2018.</p>
      <p>[12] J. Sun, S. Wang, J. Zhang, C. Zong, Distill and replay for continual language learning, in: Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, December 8-13, 2020, pp. 3569–3579.</p>
      <p>[13] S. Marullo, M. Tiezzi, A. Betti, L. Faggi, E. Meloni, S. Melacci, Continual unsupervised learning for optical flow estimation with deep networks, in: Conference on Lifelong Learning Agents, PMLR, 2022, pp. 183–200.</p>
      <p>[14] S. Paul, L.-J. Frey, R. Kamath, K. Kersting, M. Mundt, Masked autoencoders are efficient continual federated learners, arXiv preprint arXiv:2306.03542 (2023).</p>
      <p>[15] M. Tiezzi, S. Marullo, L. Faggi, E. Meloni, A. Betti, S. Melacci, Stochastic coherence over attention trajectory for continuous learning in video streams, in: L. D. Raedt (Ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, International Joint Conferences on Artificial Intelligence Organization, 2022, pp. 3480–3486. URL: https://doi.org/10.24963/ijcai.2022/483. doi:10.24963/ijcai.2022/483, main track.</p>
      <p>[16] S. Marullo, M. Tiezzi, M. Gori, S. Melacci, T. Tuytelaars, Continual learning with pretrained backbones by tuning in the input space, in: 2023 International Joint Conference on Neural Networks (IJCNN), IEEE, 2023, pp. 1–9.</p>
      <p>[17] T. Wu, L. Luo, Y.-F. Li, S. Pan, T.-T. Vu, G. Haffari, Continual learning for large language models: A survey, arXiv preprint arXiv:2402.01364 (2024).</p>
      <p>[18] F.-K. Sun, C.-H. Ho, H.-Y. Lee, Lamol: Language modeling for lifelong language learning, arXiv preprint arXiv:1909.03329 (2019).</p>
      <p>[19] X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, X. Huang, Trace: A comprehensive benchmark for continual learning in large language models, 2023. arXiv:2310.06762v1.</p>
      <p>[20] Q. Zhu, B. Li, F. Mi, X. Zhu, M. Huang, Continual prompt tuning for dialog state tracking, 2022. arXiv:2203.06654.</p>
      <p>[21] R. He, L. Liu, H. Ye, Q. Tan, B. Ding, L. Cheng, J.-W. Low, L. Bing, L. Si, On the effectiveness of adapter-based tuning for pretrained language model adaptation, arXiv preprint arXiv:2106.03164 (2021).</p>
      <p>[22] Y. Chen, S. Qian, Z. Liu, H. Tang, X. Lai, S. Han, J. Jia, Longlora: Efficient fine-tuning of long context large language models, 2023. arXiv:2309.12307v2.</p>
      <p>[23] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, A. Farhadi, Editing models with task arithmetic, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.</p>
      <p>[24] W. Yin, J. Li, C. Xiong, Contintin: Continual learning from task instructions, arXiv preprint arXiv:2203.08512 (2022).</p>
      <p>[25] T. Scialom, T. Chakrabarty, S. Muresan, Fine-tuned language models are continual learners, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 6107–6122.</p>
      <p>[26] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for nlp, in: International Conference on Machine Learning, PMLR, 2019, pp. 2790–2799.</p>
      <p>[27] X. Wang, Y. Chen, W. Zhu, A survey on curriculum learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 4555–4576.</p>
      <p>[28] P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, M. Gardner, A dataset of information-seeking questions and answers anchored in research papers, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online. Association for Computational Linguistics, 2021, pp. 4599–4610.</p>
      <p>[29] R. Y. Pang, A. Parrish, N. Joshi, N. Nangia, J. Phang, A. Chen, V. Padmakumar, J. Ma, J. Thompson, H. He, et al., Quality: Question answering with long input texts, yes!, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022.</p>
      <p>[30] M. Zhong, D. Yin, T. Yu, A. Zaidi, M. Mutuma, R. Jha, A. H. Awadallah, A. Celikyilmaz, Y. Liu, X. Qiu, D. Radev, Qmsum: A new benchmark for query-based multi-domain meeting summarization, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online. Association for Computational Linguistics, 2021, pp. 5905–5921.</p>
      <p>[31] Y. Koreeda, C. D. Manning, Contractnli: A dataset for document-level natural language inference for contracts, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 1907–1919.</p>
      <p>[32] M. Chen, Z. Chu, S. Wiseman, K. Gimpel, Summscreen: A dataset for abstractive screenplay summarization, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8602–8615.</p>
      <p>[33] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.</p>
      <p>[34] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, 2004, pp. 74–81.</p>
      <p>[35] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[36] M. Tiezzi, S. Marullo, F. Becattini, S. Melacci, Continual neural computation, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2024, pp. 340–356.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
<string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <article-title>Embracing change: Continual learning in deep neural networks</article-title>
          ,
          <source>Trends in Cognitive Sciences</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>1028</fpage>
          -
          <lpage>1040</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of continual learning: Theory, method and application</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>46</volume>
          (
          <year>2024</year>
          )
          <fpage>5362</fpage>
          -
          <lpage>5383</lpage>
. doi:10.1109/TPAMI.2024.3367329.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
<string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>An empirical study of catastrophic forgetting in large language models during continual fine-tuning</article-title>
          ,
<year>2023</year>
          . arXiv:2308.08747v2 [cs.CL].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2106.09685</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chitale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaidya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghotkar</surname>
          </string-name>
          ,
          <article-title>Task Arithmetic with LoRA for Continual Learning</article-title>
          ,
<source>in: Workshop on Advancing Neural Network Training at the 37th Conference on Neural Information Processing Systems (WANT@NeurIPS 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>