Task-Incremental Learning on Long Text Sequences

Natalia Graziuso¹, Andrea Zugarini²,* and Stefano Melacci¹
¹ Department of Information Engineering and Mathematics, University of Siena, Italy
² expert.ai, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
natalia.graziuso@student.unisi.it (N. Graziuso); azugarini@expert.ai (A. Zugarini); stefano.melacci@unisi.it (S. Melacci)

Abstract
The extraordinary results achieved by Large Language Models are paired with issues that are critical in real-world applications. The costs of inference and, in particular, training are extremely large, both in terms of time and computational resources, and they become prohibitive when working in dynamic environments, where data and tasks are progressively provided over time. The model must be able to adapt to new knowledge, new domains, and new settings, without forgetting the previously learned skills. Retraining from scratch easily becomes too costly, thus Continual Learning strategies are of crucial importance. This is even more evident when data consist of "long" documents that require several resources to be processed by modern neural models, leading to very long prompts. This paper investigates LLM-based Task-Incremental Learning in the case of tasks exploiting long sequences of text, as is typical in summarization, question-answering on long documents, reviewing long contracts, and several others. We show how adapting the model by Task Arithmetic with LoRA, which was proposed for visual data, yields promising results also in the case of such "long" text data. To the best of our knowledge, this is the first work along this challenging direction. The outcome of the investigation of this paper is generic enough to represent an important starting point for further research in processing linguistic data in every language.

Keywords
Continual Learning, Task-Incremental Learning, Long Sequences of Text, Large Language Models

1. Introduction

The quality of Language Models (LMs) has rapidly improved in the last decade, showing outstanding skills when scaled to large data and networks [1], leading to the nowadays popular Large Language Models (LLMs). Solving more complex tasks with LLMs often requires processing "long" documents and articulated, long instructions. However, handling lengthy prompts can be a significant obstacle for real-world applications, raising the costs and resources required during both inference and, in particular, training. This issue can become critical when the LLM needs to be specialized to many different tasks and domains and, more generally, when it is applied to dynamic settings that require multiple adaptations. For instance, in real-world applications, models need to be re-trained from time to time, as new data/tasks become available. In such scenarios, the need for Continual Learning (CL) [2, 3] strategies becomes imperative. From a very generic perspective, CL focuses on the development of algorithms capable of sequentially learning from a stream of data, while preserving what was learnt in past experiences and avoiding catastrophic forgetting [4].

In this work, motivated by the aforementioned issues, we study the problem of Continual Learning from "long" sequences of text, exploiting LLMs. We investigate several strategies based on LoRA [5] to adapt an LLM to multiple tasks that are sequentially proposed over time. In particular, we first follow the route of training a single adapter in a sequential manner, then we explore Task Arithmetic to fuse multiple adapters trained independently [6]. We consider the possibility of assigning different weights to each task, and we shed some light on the factors that contribute the most to catastrophic forgetting and to effective task adaptation. The outcomes of such an investigation reveal that: (1) there is limited sensitivity to task order, i.e., regardless of the sequence in which tasks are presented, the overall average performance remains relatively stable, a property that, to the best of our knowledge, was never evaluated in the
case of tasks composed of long documents; (2) despite its simplicity, Task Arithmetic demonstrates effectiveness in addressing forgetting phenomena when learning from long texts, strongly reducing the gap from multiple models independently adapted to the task data. Moreover, (3) we are the first to evaluate a recently proposed benchmark (SCROLLS [7]) in a CL setting, offering reference results for further activity in processing long sequences of text. We remark that while our experiments are based on data in the English language, the generic issues we explore about handling long sequences of text are intrinsically shared by every language.

2. Related Work

In the last few years, a variety of approaches were proposed by the scientific community in the context of CL (see [3] and references therein). The main goal is that of learning from newly provided information, with models that are capable of acquiring new knowledge without forgetting the previously learned one and, more importantly, without storing the full dataset and retraining from scratch every time [8]. Several efforts are dedicated to the case of lifelong Reinforcement Learning [9] and of Supervised Learning [10], distinguishing among scenarios and categories of approaches [11], ranging from parameter isolation to regularization methods and replays [12]. Unsupervised or Self-Supervised Learning approaches are also becoming popular [13, 14, 15], as is the case of adaptation of pre-trained backbones [16].

Of course, neural models for processing language are also a subject of study in the context of CL [17]. We mention the case of language modeling in LAMOL [18], which is trained to concurrently solve a task and mimic training examples, thereby preserving the distribution of previous tasks. Sun et al. [12] introduce Distill and Replay, which learns to solve the task, to generate training examples formatted as context-question-answer, and to distill knowledge from a model trained on the previous task(s). Differently, Reasoning-augmented Continual Learning [19] focuses on creating reasoning pathways to preserve and improve LLMs' reasoning abilities and information transfer.

Together with works that learn new models from scratch, several approaches devise fine-tuning strategies for pre-trained Transformers in language processing, which turn out to be efficiently adaptable to a downstream task by learning only a small number of task-specific parameters. It is the case of models that tune the input prompt [20] or of generic Adapters [21], such as the popular LoRA [5], which introduces new weight matrices, parametrized by the product of low-rank ones. Evaluating these models with long contexts [22] is not frequent in the scientific literature, especially in the case in which multiple fine-tunings are sequentially applied, as is typical of CL, which is the main focus of this paper. In particular, LoRA and Task Arithmetic [23] have been jointly studied to handle CL problems in vision [6], which is what this paper extends to the case of language and long sequences. We also mention works that focus on instruction-based models for CL, such as ConTinTin [24], where each task is modelled by a specific instruction that directly defines the target concept, along with a few instances that illustrate it. Scialom et al. [25] and Luo et al. [4] investigate natural language instructions paired with memory buffers and replays.

3. Task-Incremental Learning on Long Sequences of Text

Task-Incremental Learning (TIL) is a continual learning scenario where the same model is trained on tasks that are presented in a sequential manner. The main challenge consists in profitably learning from the last-presented task without forgetting the previous ones [3]. In order to cope with TIL on Long Sequences of Text, specifically focusing on LLMs, we consider different learning strategies. In this section we describe each of them in detail, after having formally introduced the TIL problem.

Problem. We are given a model parameterized by $\theta$, which is a vector collecting the learnable variables. In TIL, a set $\mathcal{T}$ of $k$ tasks is sequentially presented to the model, i.e., one at a time. Each task $t \in \mathcal{T}$ features data sampled from a task-specific distribution, collected into a dataset $\mathcal{D}_t := (\mathcal{X}_t, \mathcal{Y}_t)$, composed of raw samples and labeling information, respectively. The model is not only expected to learn from $\mathcal{D}_t$, but also to not forget knowledge already acquired from the past tasks. In the following, to keep the notation simple, we indicate each task by a numerical index, thus $t \in \mathcal{T} = \{1, \ldots, k\}$. In this case of study, the model is a pre-trained LLM with billions of parameters, and all the TIL tasks are characterized by long input sequences. Such a combination constitutes a computationally demanding mix, making offline/joint training potentially very expensive, which is where CL solutions are very convenient. We consider the case in which LLMs are fine-tuned exploiting adapters [26]. In particular, we focus on LoRA [5], which introduces additional learnable parameters while keeping the rest of the network frozen. This is both less resource demanding and it also alleviates catastrophic forgetting, since the LoRA weights $\theta^l$ are usually a small fraction of the total model parameters, i.e., $|\theta^l| \ll |\theta|$. Hence, it is a perfect candidate for the experience of this paper.
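To make the parameter saving concrete, the following PyTorch sketch (our own illustration, not the code used in the paper) implements a minimal LoRA-style linear layer: the pre-trained weight is frozen and only the low-rank factors A and B are trained, so the trainable fraction is tiny.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update, in the spirit of LoRA [5]."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                         # theta stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # zero init: no drift at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scaling * B A x: only A and B (i.e., theta_l) receive gradients.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
n_lora = layer.A.numel() + layer.B.numel()
n_base = layer.base.weight.numel()
print(f"|theta_l| / |theta| = {n_lora / n_base:.2%}")  # ~0.39%: |theta_l| << |theta|
```

For a 4096x4096 weight matrix and rank 8, the adapter adds only 2 x 8 x 4096 parameters, which is where the resource savings come from.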
Single-model TIL with LoRA (S-TIL). In the straightforward implementation of a TIL problem, tasks are presented to the model sequentially, starting from the first one up to the $k$-th one. The order may be given a priori, or established according to some criteria, such as task similarity or difficulty (curriculum-like learning [27]). At the beginning, when considering the first task, $t = 1$, we start from a model with frozen parameters $\theta$ and additional trainable weights $\theta^l_1$ initialized as described in [5]. At task $t$, with $t > 1$ instead, the LoRA weights are initialized with the LoRA parameters from the previous step, i.e., $\theta^l_{t-1}$. It is worth noticing that, in such a way, at the end of the $k$ tasks, the final model parameters will be constituted by the original $\theta$, still unchanged, and a single set of adapter parameters $\theta^l_k$, which was sequentially trained over all the tasks.
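A toy sketch of the S-TIL loop follows; synthetic regression tasks stand in for the SCROLLS tasks and all hyper-parameters are placeholders of ours. The key point is that a single adapter ($A$, $B$) is carried across tasks, i.e., $\theta^l_t$ starts from $\theta^l_{t-1}$, while the frozen $\theta$ never changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, rank, k = 32, 4, 3

base = nn.Linear(d, d)                  # stands in for the frozen pre-trained model theta
for p in base.parameters():
    p.requires_grad_(False)

A = nn.Parameter(torch.randn(rank, d) * 0.01)  # theta_l: the single LoRA adapter...
B = nn.Parameter(torch.zeros(d, rank))         # ...carried across all k tasks

def forward(x: torch.Tensor) -> torch.Tensor:
    return base(x) + x @ A.T @ B.T      # frozen path + low-rank update

for t in range(1, k + 1):
    # Placeholder task t: random linear regression standing in for a long-text task.
    g = torch.Generator().manual_seed(t)
    w = torch.randn(d, d, generator=g)
    x = torch.randn(256, d, generator=g)
    y = x @ w.T
    # S-TIL: theta_l_t is initialized with theta_l_{t-1} simply by NOT resetting A, B.
    opt = torch.optim.AdamW([A, B], lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(forward(x), y)
        loss.backward()
        opt.step()
    print(f"task {t}: final loss {loss.item():.3f}")
```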
Multi-model TIL with LoRA (M-TIL). Another way to face the problem of learning multiple tasks in TIL is to build a specialized model per task, independently of the other ones. This usually yields strong performance on each sub-problem, guaranteeing no catastrophic forgetting issues, since the model to use is simply retrieved as a function of the task to solve. At the same time, such a strategy requires the storage, deployment and maintenance of $k$ independent models, which is unsustainable with billion-sized models like current LLMs. Even when using adapters such as LoRA, maintaining many of them can still be hard to handle.

Task Arithmetic TIL with LoRA (TA). Based on the concept of "task vectors", Task Arithmetic (TA) [23] was proposed to combine together the weights learned in a multi-model continual learning scenario. A task vector represents the direction in the weight space of a pre-trained model toward a certain task. In TA, multiple directions are fused together via a simple linear combination of them. Similarly, LoRA adapters steer the model behavior to improve performance on a specific task. Therefore, LoRA weights trained separately (multi-model) can be updated with task arithmetic [6]:

$$\theta^l_{\mathrm{final}} = \sum_{t \in \mathcal{T}} \lambda_t\, \theta^l_t, \qquad (1)$$

where $\lambda_t$ is a scalar weighting the importance of task $t$.

Fine-tuning by Memory Buffer (FTB). In principle, TA can be applied as it is, without requiring further fine-tuning. However, we also consider refining the parameters using a memory buffer with examples from all the tasks. Indeed, experience replay is a well-known and effective strategy in Reinforcement Learning and Continual Learning problems. Examples were chosen randomly, evenly distributed across the given tasks. Since we are dealing with long documents, we keep the buffer small. Both the merge of Eq. (1) and the buffer construction are illustrated in the sketch below.
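The two operations admit a very compact implementation. The following is a minimal sketch under our own assumptions that each adapter is stored as a PyTorch state dict and each task dataset is a Python list; the paper does not report these implementation details.

```python
import random
import torch

def merge_adapters(adapters: list[dict], lambdas: list[float]) -> dict:
    """Eq. (1): theta_l_final = sum_t lambda_t * theta_l_t, applied key by key."""
    merged = {}
    for key in adapters[0]:
        merged[key] = sum(lam * sd[key] for lam, sd in zip(lambdas, adapters))
    return merged

def build_memory_buffer(datasets: list[list], per_task: int, seed: int = 0) -> list:
    """FTB: a small buffer with the same number of randomly drawn examples per task."""
    rng = random.Random(seed)
    buffer = []
    for data in datasets:
        buffer.extend(rng.sample(data, min(per_task, len(data))))
    rng.shuffle(buffer)
    return buffer

# Toy usage: three per-task adapters with one weight tensor each, merged evenly.
adapters = [{"lora.A": torch.full((2, 2), float(t))} for t in range(1, 4)]
merged = merge_adapters(adapters, lambdas=[1 / 3] * 3)
print(merged["lora.A"])  # every entry equals (1 + 2 + 3) / 3 = 2.0
```

The merged adapter can then be fine-tuned on the (small) buffer, which is the FTB refinement evaluated in Section 4.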
4. Experiments

We experimented with LLMs in TIL, exploiting sequences of long texts from a benchmark made public to the scientific community in the last few years [7]. Notice that these benchmarks are not designed for TIL. Thus, using them in TIL is indeed a novel experience, off the beaten track.

4.1. Datasets

We consider five out of the seven datasets of SCROLLS [7], which is the reference benchmark for tasks composed of long documents. The datasets belong to different domains, and they are about different tasks, which we adapted to TIL by means of instruction tuning (an illustrative prompt format is sketched after the dataset descriptions). An overview of the benchmark is provided in Table 1, and here we briefly describe each dataset.

Table 1
Selected datasets from the SCROLLS benchmark and their main features.

Dataset        Task                        Domain            Metric    #Train   #Validation
Contract NLI   Natural Language Inference  Legal             EM          7191          1097
Qasper         QA                          Science           F1          2567          1726
QuALITY        Multi Choice QA             Literature, Misc  EM          2523          2086
QMSum          Query-based Summarization   Meetings          ROUGE-L     1257           272
SummScreenFD   Summarization               TV                ROUGE-L     3673           338

Qasper. Qasper [28] (QSPR) is a Question Answering (QA) dataset on academic papers. Crafted by NLP experts, it contains questions based on the title and abstract of the paper. There are different kinds of inquiries: abstractive, extractive, and yes/no questions, including unanswerable ones. To answer the question, the entire paper must be read.

QuALITY. QuALITY [29] (QALT) is a multiple-choice QA dataset, drawing upon English source articles with an average length of about 5,000 tokens. Original texts are provided in HTML format, retaining paragraph breaks and basic formatting such as italics, but with images removed. Questions are designed to require details from different parts of the text to properly answer them.

QMSum. QMSum, presented in [30], is a query-based document summarization benchmark. The dataset is characterized by long meeting transcripts, collecting 1,808 query-summary pairs from 232 different meetings.

ContractNLI. Contract NLI [31] (CNLI) is the first dataset for Natural Language Inference in contracts. Given a premise and a contract, a model has to classify whether the premise is entailed by, contradicting, or not mentioned by the contract. There are 607 contracts and 17 unique hypotheses, combined to get 10,319 examples.

SummScreenFD. SummScreen [32] (SumScr) is a summarization dataset of TV series transcripts and human-written recaps. Examples come from two different sources, but in SCROLLS the authors only kept ForeverDreaming (FD), due to its greater variety of shows.
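The paper states that each dataset is cast to TIL via instruction tuning, but it does not report the exact templates; the following formatting function and template are purely illustrative assumptions of ours.

```python
from typing import Optional

def format_example(instruction: str, document: str, query: Optional[str] = None) -> str:
    """Hypothetical instruction-tuning prompt: a task-specific instruction,
    the (long) document, an optional query, and the answer slot to be generated."""
    parts = [f"### Instruction:\n{instruction}", f"### Document:\n{document}"]
    if query is not None:
        parts.append(f"### Question:\n{query}")
    parts.append("### Answer:\n")
    return "\n\n".join(parts)

# Example of a QMSum-like input (placeholders, not actual dataset content).
prompt = format_example(
    instruction="Summarize the part of the meeting relevant to the question.",
    document="<long meeting transcript>",
    query="What did the group decide about the remote control design?",
)
```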
4.2. Experimental Setup and Results

We consider Mistral-7B-v0.1 [33] as the backbone LLM for all the fine-tuned models in our TIL experiments. Albeit trained with a restricted context length of at most 8,192 tokens, it supports longer inputs, of size up to 32,768. The LLM was quantized via 4-bit quantization in order to fit long sequences on a single A6000 GPU. During training, the micro batch size was set to 1, with 32 gradient accumulation steps. LoRA adapters were updated with AdamW for 3 epochs in all the experiments, regardless of the dataset. At inference time, outputs were generated using Beam Search with beam size set to 2.
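A configuration sketch of this setup follows, assuming the Hugging Face transformers/peft stack; the paper does not name its training framework, and the choice of target modules below is our assumption.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization so that long sequences fit on a single A6000 GPU.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config, device_map="auto"
)

# LoRA hyper-parameters as reported later in this section (rank 8, alpha 16,
# dropout 0.05); the target modules are our assumption, not stated in the paper.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Inference with Beam Search, beam size 2 (standard generation arguments):
# outputs = model.generate(**inputs, num_beams=2, max_new_tokens=256)
```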
We compared: (i) Mistral-7B-v0.1-Instruct, the instruction-tuned version of Mistral, referred to as Mistral-7b-instruct; (ii) the case of multiple independent LoRA adapters, each of them trained on a single dataset, i.e., M-TIL (Section 3); (iii) classic TIL with a single model, progressively updated on the sequence of tasks, i.e., S-TIL (Section 3), considering both the case in which tasks are provided in a certain order (S-TIL↓) and the one in which they are provided in the opposite order (S-TIL↑); (iv) Task Arithmetic (Section 3) with even values of the $\lambda$'s (TA) or with task-specific $\lambda$'s based on prior knowledge (WTA).

Evaluation. Due to the different nature of each task in SCROLLS, there are different metrics to take into account for each of them. In particular, summarization-like tasks (QMSum and SummScreenFD) are evaluated with the ROUGE score [34] (1, 2 and L), whereas ContractNLI and QuALITY are assessed with Exact Match (EM). Finally, results on Qasper are measured by F1. A global overview of the metrics can be found in Table 1. We indicate with $S_i$ the score yielded by the associated metric for task $i$. Following the way the SCROLLS benchmark was proposed, scores are averaged to provide a unique index of Overall Performance ($OP$). Since we focus on TIL, we evaluate $OP$ after each task $t$, and we also compute the Overall Forgetting at task $t$ ($OF_t$), also known as the index of negative backward transfer [35], which tells how strongly the previously considered tasks have been negatively affected by learning from the current task $t$, i.e., a measure of catastrophic forgetting [4]. Formally,

$$OP_t = \frac{1}{t} \sum_{i=1}^{t} S_{t,i}, \qquad OF_t = \left[ \frac{1}{t-1} \sum_{i=1}^{t-1} \left( S_{i,i} - S_{t,i} \right) \right]_{+},$$

where $[\cdot]_+$ keeps the positive part, and $S_{t,i}$ is the score of task $i$ after having learned from task $t \in \mathcal{T}$.
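As a self-contained sketch, both indices can be computed from the (lower-triangular) matrix of scores $S_{t,i}$. The numbers below are the S-TIL↓ scores of Table 2 (right), and the printed values match the reported ones up to the rounding of the table entries.

```python
def overall_performance(S: list[list[float]], t: int) -> float:
    """OP_t: average score over tasks 1..t after having learned task t (1-indexed)."""
    return sum(S[t - 1][i] for i in range(t)) / t

def overall_forgetting(S: list[list[float]], t: int) -> float:
    """OF_t: positive part of the average drop on tasks 1..t-1 after learning task t."""
    drop = sum(S[i][i] - S[t - 1][i] for i in range(t - 1)) / (t - 1)
    return max(drop, 0.0)

# S[t-1][i-1] = score of task i after learning task t (S-TIL-down, Table 2, right).
S = [
    [18.2],
    [16.1, 22.2],
    [0.04, 0.45, 37.4],
    [13.6, 13.3, 35.8, 47.7],
    [11.8, 7.0, 32.0, 44.2, 88.2],
]
print(f"OF_3 = {overall_forgetting(S, 3):.2f}")   # ~19.95 (Table 2 reports 19.94)
print(f"OF_4 = {overall_forgetting(S, 4):.2f}")   # ~5.03  (Table 2 reports 5.00)
print(f"OP_5 = {overall_performance(S, 5):.1f}")  # ~36.6  (Table 3 reports 36.7)
```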
Since the test set of SCROLLS is not public, we used the SCROLLS validation set as test set, and sampled a sub-portion of the training data to build a validation set. After cross-validation, we set the rank of LoRA to 8, the dropout rate to 0.05, and $\alpha$ to 16 (see [5] for a description of these parameters), with learning rate $3 \cdot 10^{-4}$ (linearly decaying).

Investigating S-TIL. Dealing with long sequences of text might affect the TIL procedure as a function of the order in which tasks are presented. We study different task orderings based on the average length of the sequences of text in each task, from tasks involving shorter output sequences to the ones involving longer sequences, and vice-versa. As anticipated, we named them S-TIL↑ and S-TIL↓, respectively. The results of this experience are presented in detail in Table 2. The training order does strongly affect the final performance on single tasks, promoting higher scores on more recently seen datasets. On the one hand, this is expected, since the older ones are more likely affected by catastrophic forgetting. Catastrophic forgetting (last columns of Table 2) at $t = k = 5$ is below 10% in both cases. On the other hand, there is an evident peak of forgetting in S-TIL↓ at $t = 3$, which is then reduced when learning from the following tasks. The peak is due to a strong reduction of performance on the first two tasks after having learned from Qasper (QSPR). We investigated this aspect, and found that the model fails in generating the perfectly-formatted output string that is then exploited in the EM metric. When moving to the following task, this skill is partially recovered. We hypothesize that the presence of unanswerable questions in Qasper negatively biases the types of answers in SummScreenFD (SumScr) and QMSum, where all the questions have an answer instead.

Table 2
Evaluation score (%) on test data, for each task, after having learned from task t (i.e., S_{t,i}) in S-TIL↑ (left) and S-TIL↓ (right). The order of the columns (dataset names) reflects the task order followed during training. Tasks become available in order, thus "−" indicates that the value cannot be computed yet. The OF_t column is about catastrophic forgetting (the lower the better).

S-TIL↑
t↓  1.CNLI  2.QALT  3.QSPR  4.QMSum  5.SumScr   OF_t
1     88.0       −       −        −         −      −
2     85.7    49.5       −        −         −   2.31
3     79.7    43.2    37.1        −         −   7.31
4     82.9    40.7    27.6     21.9         −   7.82
5     75.7    39.1    30.2     15.5      18.6   8.99

S-TIL↓
t↓  1.SumScr  2.QMSum  3.QSPR  4.QALT  5.CNLI    OF_t
1       18.2        −       −       −       −       −
2       16.1     22.2       −       −       −    2.06
3       0.04     0.45    37.4       −       −   19.94
4       13.6     13.3    35.8    47.7       −    5.00
5       11.8      7.0    32.0    44.2    88.2    7.60

Comparing S-TIL and M-TIL. Figure 1 compares the models of Table 2 (for $t = k$) with M-TIL, which is composed of multiple adapters, each of them specifically trained on a task, and thus forgetting-free. The performance of both S-TIL variants is lower than that of M-TIL, as expected, but sometimes not far from it. Comparing S-TIL↑ and S-TIL↓, we see that they achieve similar overall performance, but the latter yields better results in three out of five tasks. The quality of S-TIL↑ (w.r.t. S-TIL↓) improves going right-to-left and, symmetrically, the one of S-TIL↓ increases going left-to-right, as expected, since they were trained in opposite orders (the relative gain is > 1 in SumScr due to forward transfer).

[Figure 1: Test results in TIL: overall performance at t = k = 5, i.e., OP_k, on SumScr, QMSum, QSPR, QALT, CNLI. We compare the cases of S-TIL↑ and S-TIL↓ (see Table 2) with the one of multiple independently-trained adapters, i.e., M-TIL. The Relative Gain is indicated on the bars.]

The Role of TA. We compared all the introduced models with the case of merging independently-trained adapters with TA. Table 3 shows that TA turns out to be a simple yet competitive solution, with average performance on par with S-TIL↓. Actually, observing task-wise performance, we can see how TA outperforms S-TIL↓ across all the datasets, with the exception of ContractNLI (CNLI), the last task in which S-TIL↓ was specialized. In WTA, the $\lambda$'s for non-QA datasets were halved, since these tasks involve the generation of longer outputs that more strongly condition the behaviour of the LLM, as already discussed for Qasper. WTA yielded evident improvements on the last two datasets, while keeping similar performance on the others, despite their halved weights. This suggests that appropriately weighing the task vectors in Eq. (1) is a viable road to improve the model.

Table 3
Results involving all the competitors. In ROUGE-based evaluations, we also report unigram overlap (ROUGE-1) and bigram overlap (ROUGE-2), together with the longest overlapping subsequence (ROUGE-L); the last one is what is considered when computing OP_k. Reference results (baseline and "upper bound") are in italic.

Method                      SumScr          QMSum           QSPR   QALT   CNLI   OP_k
                            ROUGE-1/2/L     ROUGE-1/2/L     F1     EM     EM
Ref1: Mistral-7b-instruct   18.1/2.3/10.8   16.2/2.7/11.8    5.4    0.0    0.0    5.6
Ref2: M-TIL                 29.2/7.1/18.2   29.6/8.5/21.1   38.7   56.7   88.0   44.5
S-TIL↑                      30.0/7.8/18.6   20.6/5.7/15.5   30.2   39.1   75.7   35.8
S-TIL↓                      15.6/3.6/11.8    8.7/2.3/7.0    32.0   44.2   88.2   36.7
TA                          20.7/4.56/13.9  18.8/5.6/14.2   36.0   45.6   72.6   36.5
WTA                         19.4/4.26/13.4  18.5/5.5/14.1   34.7   47.9   74.7   36.9
TA-FTB                      28.6/6.21/17.5  28.0/8.1/20.1   38.3   47.8   75.1   39.8
WTA-FTB                     28.6/6.09/17.2  26.9/7.6/19.7   35.6   50.5   78.5   40.3

Impact of FTB. We also investigated the impact of refreshing the memory of the TA/WTA model via fine-tuning it on just 50 samples per task (memory buffer). Despite being a simple refinement stage, the results presented in Table 3 show a consistent boost of performance when using the memory buffer (FTB), reaching an averaged score of about 39.0 when using the weighted TA version, significantly reducing the gap from the $k$-independent-adapters solution of M-TIL. Figure 2 provides a quick view of the already presented results of all the TA methods we considered, reporting also the Relative Gain w.r.t. M-TIL. Indeed, we can observe that the relative drop in performance is always below 11%.

[Figure 2: Test results in TIL with Task Arithmetic (TA). TA is explored with or without Fine-tuning by Memory Buffer (FTB), and also in the case of task-specific weights provided in advance (WTA). Same setting as Figure 1. (Bar chart comparing TA, WTA, TA-FTB and WTA-FTB against M-TIL on SumScr, QMSum, QSPR, QALT, CNLI.)]

5. Conclusions

We investigated Large Language Models in progressively learning from tasks involving long sequences of text. A pre-trained model was paired with one or more adapters (LoRA), and we analyzed the role of Task Arithmetic, showing that it yields performance that is not far from that of multiple models independently trained to solve each task. Our results suggest a viable road to mitigate the need for large computational resources when learning from tasks based on "long" documents. While we exploited data in the English language, the experiences of this paper can be interpreted as generic attempts to leverage long sequences in Continual Learning, in a sense going beyond the language barrier. Future work will consider schemes to automatically tune the Task Arithmetic combination [36].

Acknowledgments

The work was partially funded by:
• "ReSpiRA - REplicabilità, SPIegabilità e Ragionamento", a project financed by FAIR, affiliated to spoke no. 2, falling within the PNRR MUR programme, Mission 4, Component 2, Investment 1.3, D.D. No. 341 of 03/15/2022, Project PE0000013, CUP B43D22000900004 (https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/B43D22000900004);
• "enRichMyData - Enabling Data Enrichment Pipelines for AI-driven Business Products and Services", a Horizon Europe (HE) project, grant agreement ID: 101070284 (https://doi.org/10.3030/101070284).

References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[2] R. Hadsell, D. Rao, A. A. Rusu, R. Pascanu, Embracing change: Continual learning in deep neural networks, Trends in Cognitive Sciences 24 (2020) 1028–1040.
[3] L. Wang, X. Zhang, H. Su, J. Zhu, A comprehensive survey of continual learning: Theory, method and application, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2024) 5362–5383. doi:10.1109/TPAMI.2024.3367329.
[4] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, Y. Zhang, An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2023. arXiv:2308.08747v2 [cs.CL].
[5] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[6] R. Chitale, A. Vaidya, A. M. Kane, A. Ghotkar, Task arithmetic with LoRA for continual learning, in: Workshop on Advancing Neural Network Training at the 37th Conference on Neural Information Processing Systems (WANT@NeurIPS 2023), 2023.
[7] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, O. Levy, SCROLLS: Standardized comparison over long language sequences, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 12007–12021.
[8] M. Gori, S. Melacci, Collectionless artificial intelligence, arXiv preprint arXiv:2309.06938 (2023).
[9] K. Khetarpal, M. Riemer, I. Rish, D. Precup, Towards continual reinforcement learning: A review and perspectives, Journal of Artificial Intelligence Research 75 (2022) 1401–1476.
[10] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, T. Tuytelaars, A continual learning survey: Defying forgetting in classification tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 3366–3385.
[11] G. M. van de Ven, A. S. Tolias, Three continual learning scenarios, in: NeurIPS Continual Learning Workshop, volume 1, 2018.
[12] J. Sun, S. Wang, J. Zhang, C. Zong, Distill and replay for continual language learning, in: Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 2020, pp. 3569–3579.
[13] S. Marullo, M. Tiezzi, A. Betti, L. Faggi, E. Meloni, S. Melacci, Continual unsupervised learning for optical flow estimation with deep networks, in: Conference on Lifelong Learning Agents, PMLR, 2022, pp. 183–200.
[14] S. Paul, L.-J. Frey, R. Kamath, K. Kersting, M. Mundt, Masked autoencoders are efficient continual federated learners, arXiv preprint arXiv:2306.03542 (2023).
[15] M. Tiezzi, S. Marullo, L. Faggi, E. Meloni, A. Betti, S. Melacci, Stochastic coherence over attention trajectory for continuous learning in video streams, in: L. D. Raedt (Ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 2022, pp. 3480–3486. doi:10.24963/ijcai.2022/483.
[16] S. Marullo, M. Tiezzi, M. Gori, S. Melacci, T. Tuytelaars, Continual learning with pretrained backbones by tuning in the input space, in: 2023 International Joint Conference on Neural Networks (IJCNN), IEEE, 2023, pp. 1–9.
[17] T. Wu, L. Luo, Y.-F. Li, S. Pan, T.-T. Vu, G. Haffari, Continual learning for large language models: A survey, arXiv preprint arXiv:2402.01364 (2024).
[18] F.-K. Sun, C.-H. Ho, H.-Y. Lee, LAMOL: Language modeling for lifelong language learning, arXiv preprint arXiv:1909.03329 (2019).
[19] X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, X. Huang, TRACE: A comprehensive benchmark for continual learning in large language models, 2023. arXiv:2310.06762v1.
[20] Q. Zhu, B. Li, F. Mi, X. Zhu, M. Huang, Continual prompt tuning for dialog state tracking, 2022. arXiv:2203.06654.
[21] R. He, L. Liu, H. Ye, Q. Tan, B. Ding, L. Cheng, J.-W. Low, L. Bing, L. Si, On the effectiveness of adapter-based tuning for pretrained language model adaptation, arXiv preprint arXiv:2106.03164 (2021).
[22] Y. Chen, S. Qian, Z. Liu, H. Tang, X. Lai, S. Han, J. Jia, LongLoRA: Efficient fine-tuning of long-context large language models, 2023. arXiv:2309.12307v2.
[23] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, A. Farhadi, Editing models with task arithmetic, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
[24] W. Yin, J. Li, C. Xiong, ConTinTin: Continual learning from task instructions, arXiv preprint arXiv:2203.08512 (2022).
[25] T. Scialom, T. Chakrabarty, S. Muresan, Fine-tuned language models are continual learners, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 6107–6122.
[26] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for NLP, in: International Conference on Machine Learning, PMLR, 2019, pp. 2790–2799.
[27] X. Wang, Y. Chen, W. Zhu, A survey on curriculum learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 4555–4576.
[28] P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, M. Gardner, A dataset of information-seeking questions and answers anchored in research papers, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4599–4610.
[29] R. Y. Pang, A. Parrish, N. Joshi, N. Nangia, J. Phang, A. Chen, V. Padmakumar, J. Ma, J. Thompson, H. He, et al., QuALITY: Question answering with long input texts, yes!, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022.
[30] M. Zhong, D. Yin, T. Yu, A. Zaidi, M. Mutuma, R. Jha, A. H. Awadallah, A. Celikyilmaz, Y. Liu, X. Qiu, D. Radev, QMSum: A new benchmark for query-based multi-domain meeting summarization, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 5905–5921.
[31] Y. Koreeda, C. D. Manning, ContractNLI: A dataset for document-level natural language inference for contracts, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 1907–1919.
[32] M. Chen, Z. Chu, S. Wiseman, K. Gimpel, SummScreen: A dataset for abstractive screenplay summarization, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8602–8615.
[33] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825. URL: https://arxiv.org/abs/2310.06825.
[34] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, 2004, pp. 74–81.
[35] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, Advances in Neural Information Processing Systems 30 (2017).
[36] M. Tiezzi, S. Marullo, F. Becattini, S. Melacci, Continual neural computation, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2024, pp. 340–356.