<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Task-Agnostic Experts Composition for Continual Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luigi Quarantiello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Cossu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Lomonaco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Pisa</institution>
          ,
          <addr-line>Largo Bruno Pontecorvo 3, 56127, Pisa -</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Compositionality is one of the fundamental abilities of the human reasoning process, allowing a complex problem to be decomposed into simpler elements. This property is also crucial for neural networks, especially when aiming for a more efficient and sustainable AI framework. We propose a compositional approach that ensembles a set of expert models zero-shot, assessing our methodology on a challenging benchmark designed to test compositional capabilities. We show that our Expert Composition method is able to achieve much higher accuracy than baseline algorithms while requiring fewer computational resources, hence being more efficient.</p>
      </abstract>
      <kwd-group>
        <kwd>Compositionality</kwd>
        <kwd>Continual Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Figure: overview of the proposed pipeline, from the experts training phase to the experts composition (e1, ..., e7) and a final prediction such as [helmet, shoe].]</p>
      <p>Such architectures are able to fit seamlessly to different applications and data distributions; this makes
them much more versatile and robust than their “monolithic” counterparts.</p>
      <p>In this work, we mainly explore the composition of knowledge coming from different expert models.
We show that, by enforcing compositionality, we are able to largely surpass the performance of
known approaches, using a compositional benchmark to assess our method.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experts Composition</title>
      <p>To show the benefits of enforcing compositionality in neural models, our proposed methodology consists
of two main steps: first, the training of expert models, and then their application in a compositional
scenario. Ideally, especially considering that the number of pretrained models available online keeps
increasing, it could be possible to simply download the needed networks, thus avoiding the training
process altogether. Unfortunately, for our work we could not find adequate ready-to-use models.
Hence, we opted to train our own networks, nonetheless keeping them as small as possible to maintain
an efficient approach.</p>
      <p>
        To train the experts, we employed cropped images from the GQA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] dataset. We selected 21 objects
from the dataset, used the information about their bounding boxes to crop the original images, and
resized them to a resolution of 98x98 pixels (Figure 1, left). We assigned 3 unique classes to each
model, thus resulting in 7 experts. Each model is trained to specialize on its classes, while a generic
“other” class label is used as a container for the remaining classes. In this way, an expert can both
provide an answer when prompted with samples from its classes and detect when an image is out
of its scope.
      </p>
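      <p>To make the class assignment concrete, the following sketch (Python, with placeholder class names and an illustrative helper, not the authors' code) shows how each expert's three assigned classes plus the shared “other” container could be mapped to training labels.</p>
      <preformat>
# A minimal sketch of the 3-classes-plus-"other" label mapping used to train each expert.
from typing import Dict, List

def make_expert_label_map(all_classes: List[str], expert_classes: List[str]) -> Dict[str, int]:
    """Map the expert's 3 assigned classes to ids 0-2; every other class shares the 'other' id."""
    other_id = len(expert_classes)                      # id 3 acts as the "other" container class
    label_map = {name: other_id for name in all_classes}
    label_map.update({name: i for i, name in enumerate(expert_classes)})
    return label_map

# Example: 21 GQA object classes split into 7 disjoint groups of 3 (placeholder names).
all_classes = ["class_%d" % i for i in range(21)]
expert_maps = [make_expert_label_map(all_classes, all_classes[3 * k : 3 * k + 3]) for k in range(7)]
      </preformat>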
      <p>
        These trained models were then tested on the highly compositional CGQA benchmark [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which
contains samples made of a 2x2 grid, in which 2 cropped object images taken from GQA are inserted (Figure
1, right). CGQA was originally designed for Continual Learning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a Machine Learning paradigm
in which models are trained on streams of different tasks in dynamic environments; nonetheless, the
benchmark can be easily adapted for offline testing. In the original paper, two main scenarios are presented:
the task-incremental and class-incremental learning cases. We compare our method against the
second one, which is the more challenging setting, since no task identifier is provided and a single-head
classifier must be employed.
      </p>
      <p>Our approach can be formalized as follows: for each query image, we extract its composing quadrants;
then, the prediction for each quadrant is computed as:</p>
      <p>∀ expert,  = arg max expert()
where  is a quadrant from an image. Lastly, the entire prediction for a compositional image is given by
the concatenation of the predictions on its quadrants:</p>
      <p>comp = 1 ∪ 2</p>
      <p>In this fashion, no additional training procedure is needed to recognize the object composition, since
the experts are employed zero-shot. This results in a substantial saving in training effort with respect to
related approaches, which instead need to be fine-tuned for each experience, also incurring a loss of
accuracy due to the catastrophic forgetting phenomenon.</p>
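      <p>As a minimal sketch of this zero-shot composition (PyTorch-style Python; the quadrant extraction and the way agreement between experts is resolved are illustrative assumptions, not necessarily the authors' exact implementation):</p>
      <preformat>
import torch

def compose_prediction(image, experts, other_id=3):
    """Predict each quadrant independently with every expert, discard "other" answers,
    and return the union of the quadrant predictions (e.g. {"helmet", "shoe"})."""
    quadrants = extract_quadrants(image)       # hypothetical helper: splits the 2x2 CGQA grid
    prediction = set()
    for q in quadrants:
        best_score, best_class = float("-inf"), None
        for expert in experts:
            with torch.no_grad():
                logits = expert(q.unsqueeze(0)).squeeze(0)   # 4 logits: 3 classes + "other"
            score, cls = logits.max(dim=0)
            if cls.item() != other_id and score.item() > best_score:
                best_score = score.item()
                best_class = expert.class_names[cls.item()]  # hypothetical attribute
        if best_class is not None:
            prediction.add(best_class)
    return prediction
      </preformat>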
      <p>With the same models, we also explored the use of a fine-tuned approach to enforce compositionality.
Specifically, we used the few-shot sys stream provided by CGQA, defining 300 different tasks, called
experiences. For each experience, we select the top-performing experts, freeze them, and train a
single-layer linear classifier on top of their concatenated features. To determine which experts to use,
we perform a forward pass over all models using the training data, collecting the sum of the top logits
per model. This value represents how much the current data are in-distribution with respect to the
training data of a given model.</p>
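      <p>A minimal sketch of this expert-selection step (Python, with hypothetical loader and helper names; not the authors' code) is given below; a single linear layer is then trained on the concatenated features of the selected, frozen experts.</p>
      <preformat>
import torch

def select_experts(experts, train_loader, k=3):
    """Score each frozen expert by the sum of its maximum logit over the experience's
    training data, then keep the k highest-scoring experts."""
    scores = []
    for expert in experts:
        expert.eval()
        total = 0.0
        with torch.no_grad():
            for images, _ in train_loader:
                total += expert(images).max(dim=1).values.sum().item()
        scores.append(total)
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    return [experts[i] for i in ranked[:k]]
      </preformat>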
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <sec id="sec-3-1">
        <title>3.1. Training of the Experts</title>
        <p>
          Firstly, we trained a set of neural networks to define the expert models. Our main objective was to
design an efficient and sustainable method; therefore, we selected the ResNet-18 architecture [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the
smallest version of the well-known, state-of-the-art computer vision model. As discussed in the previous
section, we trained our models on images of objects from the GQA dataset.
        </p>
        <p>The resulting dataset contains 21 different classes; on average, each class has about 11,000 samples. It
was then divided so that each expert is trained to specialize on 3 classes, thus resulting in a set of 7
models. Following the dataset splitting of GQA, on average, each ResNet was trained on about 38,000
images, with about 2,000 images as a validation set and 3,800 test images.</p>
        <p>In Figure 2, the accuracy of these models on their respective test sets is plotted. The employed
architectures have 18 layers, with 3x3 convolutional kernels; we used Adam as the optimizer, with a
learning rate of 1e-3. The resulting models show similar behavior, with an average accuracy of 88%.
In terms of computational demand, we used an NVIDIA Tesla T4 GPU, with each training run taking
about 13 minutes. Therefore, we were able to achieve strong performance over a large
set of images, while using few resources and in a short time.</p>
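        <p>For reference, a minimal sketch of the per-expert training setup described above (PyTorch-style Python; the number of epochs and the data loader are placeholders, not values reported here):</p>
        <preformat>
import torch
from torch import nn, optim
from torchvision.models import resnet18

def train_expert(train_loader, num_classes=4, epochs=10, device="cuda"):
    """Train one expert: a ResNet-18 with a 4-way head (3 assigned classes + "other"),
    optimized with Adam at a learning rate of 1e-3."""
    model = resnet18(num_classes=num_classes).to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
        </preformat>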
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experts Composition</title>
        <p>
          Subsequently, we tested the composition of our pretrained experts on the CGQA benchmark. In its
original paper, the authors used the dataset with Continual Learning techniques, namely ER [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], EWC
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], LwF [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], GEM [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and RPSnet [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], as well as the Multitask and Finetune baselines.
For our experiments, we used the data from the con stream, which contains combinations of objects
from the 21 different classes. The dataset contains 100 different combinations, i.e., labels, with 100 test
instances for each combination.
        </p>
        <p>In Figure 3, we report the results we obtained with our EC method, together with those from
the baseline approaches. For some of the baselines, to compare the approaches more fairly, instead of
using randomly initialized networks, we pretrained a ResNet on the entire dataset extracted from GQA,
yielding a single expert trained on all 21 object classes. For the other approaches, we took the accuracy
results directly from the original paper.</p>
        <p>
          As is clear from the plot, our approach clearly surpasses all the other algorithms, obtaining
an accuracy over the test set of 58%. A second important observation is that, since it is an un-trained
composition, the performance of our approach does not change over time, and it is already at its best
from the first experience. Conversely, the other methods need to be trained across the experiences,
reaching their top accuracy only at the end. For these experiments, we used the Avalanche
library [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>The best baseline performance is given by the ER algorithm, which achieves 26% accuracy; in addition, on average,
each of the four baselines we reproduced took 3 hours and 45 minutes to complete training, while
our experts do not require any additional fine-tuning phase. In other words, our approach turns out to
be more efficient than the other ones from a computational viewpoint, and also more robust to shifts in
the data distribution, which are common especially in real-world scenarios.</p>
        <p>We also tried composing the expert models using a custom classification head, trained using a few
samples per class. We set the number of experiences to 300, as in the CGQA paper, and took the
average accuracy over the test set. We experimented with compositions of 3, 5 and 7 experts, reporting
the obtained results in Table 1. This method performs much worse, especially when employing
few experts. This is probably due to the fact that the training samples are too few to train an effective
classifier. We plan to investigate this problem further, since we believe that, in some settings, zero-shot
composition is not adequate and a certain degree of fine-tuning is needed to adapt to the problem.</p>
        <p>[Table 1: average test set accuracy (%) for compositions of 3, 5 and 7 experts.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this work, we presented a study on enforcing compositionality in neural models. This property
plays a crucial role in the context of a sustainable AI framework, since it allows pretrained
models to be reused for different applications, saving the computational resources needed for additional training
procedures. We trained different expert models on a set of objects taken from the GQA dataset, and then
tested their composition on CGQA, a highly compositional benchmark. We assessed our methodology
against several baseline approaches, obtaining better results both in terms of performance and efficiency.
Specifically, we managed to more than double the accuracy achieved by other methods, while requiring
far fewer computational resources and much less training time.</p>
      <p>For this work, we employed a challenging benchmark for compositional approaches, but it is
nonetheless a synthetic and artificial dataset. As future work, we plan to extend our method to more realistic
and complex settings.</p>
      <p>Moreover, we consider this work an initial step, and we believe that additional effort can be
spent to explore such approaches more deeply, in different directions. For example, it could be interesting
to study how to obtain expert systems from pretrained architectures, to further enhance the efficiency
of this kind of solution. Lastly, we want to experiment more with fine-tuned compositions, to adapt the
method to settings in which zero-shot composition is not enough.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI ChatGPT for grammar and spelling
check, paraphrase and reword. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hudson</surname>
          </string-name>
          , E. Adeli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          , S. von Arx,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brunskill</surname>
          </string-name>
          , et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2108.07258</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Munguia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rothchild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Texier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Carbon emissions and large neural network training</article-title>
          ,
          <source>arXiv preprint arXiv:2104.10350</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Werning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hinzen</surname>
          </string-name>
          , E. Machery,
          <source>The Oxford handbook of compositionality, OUP Oxford</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Braham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hamblen</surname>
          </string-name>
          ,
          <article-title>The design of a neural network with a biologically motivated architecture</article-title>
          ,
          <source>IEEE transactions on neural networks 1</source>
          (
          <year>1990</year>
          )
          <fpage>251</fpage>
          -
          <lpage>262</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hussain</surname>
          </string-name>
          ,
          <article-title>Modularity within neural networks</article-title>
          , Queens University Kingston, Ontario, Canada (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Gqa: A new dataset for real-world visual reasoning and compositional question answering</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6700</fpage>
          -
          <lpage>6709</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Ishibuchi,
          <article-title>Does continual learning meet compositionality? new benchmarks and an evaluation framework</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Parisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kemker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Part</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wermter</surname>
          </string-name>
          ,
          <article-title>Continual lifelong learning with neural networks: A review</article-title>
          ,
          <source>Neural networks 113</source>
          (
          <year>2019</year>
          )
          <fpage>54</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaudhry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ajanthan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dokania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <article-title>On tiny episodic memories in continual learning</article-title>
          ,
          <source>arXiv preprint arXiv:1902.10486</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kirkpatrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rabinowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Desjardins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Milan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ramalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grabska-Barwinska</surname>
          </string-name>
          , et al.,
          <article-title>Overcoming catastrophic forgetting in neural networks</article-title>
          ,
          <source>Proceedings of the national academy of sciences 114</source>
          (
          <year>2017</year>
          )
          <fpage>3521</fpage>
          -
          <lpage>3526</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoiem</surname>
          </string-name>
          ,
          <article-title>Learning without forgetting</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>40</volume>
          (
          <year>2017</year>
          )
          <fpage>2935</fpage>
          -
          <lpage>2947</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lopez-Paz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <article-title>Gradient episodic memory for continual learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rajasegaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hayat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khan</surname>
          </string-name>
          , L. Shao,
          <article-title>Random path selection for continual learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lomonaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pellegrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cossu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carta</surname>
          </string-name>
          , G. Grafieti,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>De Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Masana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pomponi</surname>
          </string-name>
          , G. Van de Ven, et al.,
          <article-title>Avalanche: an end-to-end library for continual learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3600</fpage>
          -
          <lpage>3610</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>