<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Continual learning: an approach via feature maps extrapolation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edoardo De Rose</string-name>
          <email>edoardo.derose@unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The semantic image segmentation task consists of classifying each pixel of an image into an instance, where each instance corresponds to a class. This task is part of the broader concept of scene understanding, that is, explaining the global context of an image. In the medical image analysis domain, image segmentation can be used for image-guided interventions, morphology characterization, and diagnostics. The state-of-the-art methods for this task use deep convolutional neural networks, which can learn from data how to perform the segmentation. Recent advances in deep learning allow training networks on small datasets, which is a critical issue for bio-medical images. However, these methods are designed for a static learning scenario, where the data distribution does not change over time. This is not realistic for the dynamic medical imaging environment, where new tasks and data may appear continuously. Therefore, we propose a new continual learning approach, which aims to enable neural networks to learn new tasks sequentially without forgetting the previous ones. The main challenge in this learning scenario is to prevent catastrophic forgetting, which occurs when the network overwrites the knowledge of previous tasks with the knowledge of the current task. The goal is to develop a method that can overcome this challenge and allow models to learn continuously and accumulate knowledge from different tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical imaging</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Semantic segmentation</kwd>
        <kwd>Continual learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Medical imaging techniques are used to create visual representations of the internal anatomy of
the human body for clinical analysis and medical intervention. Medical imaging techniques
include X-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron
emission tomography (PET), and ultrasound imaging. Extrapolating information from these
images requires, for example, accurate semantic segmentation of the anatomical parts. This
is a key process where a given input signal is divided into constituent regions or partitions
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The pixels of a partition should share some local properties such as intensities, continuity
and regularity of the signal, variance, texture information, and others. With the advancement
of technology and the availability of data, Artificial Intelligence (AI) has become an essential
tool for medical professionals to improve patient outcomes, optimize treatment plans, and
facilitate the diagnosis of various medical conditions. Recently, Deep Learning (DL) techniques
have gained significant attention for handling a wide range of computer vision problems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Specifically, the
Convolutional Neural Networks (CNN), the Fully CNNs (FCNs) and Transformers-based [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
architectures achieved significant success and rapidly became state-of-the-art methodologies in
medical image segmentation and classification [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] due to their ability to automatically learn
relevant features from images and provide accurate results. However, this remarkable success is
achieved in a static learning paradigm where the model is trained using large training data of a
specific task and deployed for testing on data with a similar distribution to the training data. This
paradigm contradicts the real dynamic world medical environment which changes very rapidly.
Standard retraining of the neural network model on new data leads to significant performance
degradation on previously learned knowledge, a phenomenon known as catastrophic forgetting
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Continual learning (CL) approaches come to address this dynamic learning paradigm. It
aims at building neural network models capable of learning sequential tasks while accumulating
and maintaining the knowledge from previous tasks without forgetting. In general, the main
components of a continual learning problem are:
• a sequence of tasks T_1, T_2, ..., T_t, ..., T_N, where N is the total number of tasks;
• each task T_t refers to its own dataset D_t;
• the neural network model faces tasks one by one;
• the capacity of the model should be utilized to learn the sequence of the tasks without
forgetting any of them;
• all samples from the current task are observed before switching to the next task;
• the data across the tasks is not assumed to be identically and independently distributed.
      </p>
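      <p>As a minimal sketch (ours, not the paper's code), the protocol described by the points above can be expressed as a loop in which each dataset is fully consumed before the next one arrives; the function names are illustrative:</p>
      <preformat>
```python
def train_continually(model, tasks, train_one_task):
    """tasks: the sequence of datasets D_1 ... D_N, presented one by one."""
    for dataset in tasks:
        # All samples of the current task are observed before switching tasks;
        # after this call the dataset is assumed to be inaccessible (no replay).
        train_one_task(model, dataset)
    return model
```
      </preformat>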
      <p>In this paper, we propose a new method to solve continual learning problems, in the field of
semantic segmentation, that is based on two main intuitions:
• selecting the model weights that are relevant for each task during the training phase;
• updating, during backpropagation, only the weights that learn new features, without
forgetting the previous ones.</p>
      <p>
        Our method is motivated by the observation that different tasks may have different data
distributions, but they may also share common features that are learned by the model. We
tested our method on a case study of computed tomography (CT). It is a 3D x-ray imaging
technique that generates 3D digital gray-scale images of the organs’ internal structures. These
images can be semantically segmented to identify specific morphological components, such as
tissues, vessels, or tumors. Continual learning methods are useful for CT image segmentation
because they allow the system to segment different organs or diseases as they become available
or relevant, without losing the ability to segment the ones that were learned before. This can
lead to more efficient and general segmentation of CT images of various organs and diseases.
      </p>
    </sec>
    <sec id="sec-2-1">
      <title>1.1. Related Works</title>
      <p>
        Several continual learning methods have been proposed to tackle catastrophic forgetting.
Following De Lange et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] continual learning algorithms can be divided into three general
groups. The first group consists of replay-based methods that build and store a memory of the
knowledge learned from old tasks. iCaRL [7] learns in a class-incremental way by having a
fixed memory that stores samples that are close to the center of each class, while ER-Reservoir
[8] uses a Reservoir sampling method as its selection strategy. The methods in the second
group use explicit regularization techniques to supervise the learning algorithm such that the
network parameters remain consistent during the learning process. As a notable work, Elastic
Weight Consolidation (EWC) [9] uses the Fisher information matrix as a proxy for weights’
importance and guides the gradient updates. Other regularization-based methods have
utilized gradient information to protect previous knowledge; for example, Orthogonal Gradient
Descent (OGD) [10] projects the gradients of new tasks onto the directions orthogonal to the
subspace of previous tasks’ gradients to maintain the learned knowledge. Finally, in parameter
isolation methods, different subsets of the model parameters, in addition to a potentially shared
part, are dedicated to each task [11, 12, 13]. This approach can be viewed as a flexible gating
mechanism, which enhances stability and controls plasticity by activating different gates for
each task.
      </p>
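      <p>For illustration, the reservoir sampling strategy used by ER-Reservoir [8] to keep an unbiased fixed-size memory of the sample stream can be sketched as follows; this is a generic textbook version, not the authors' implementation:</p>
      <preformat>
```python
import random

def reservoir_update(memory, capacity, item, n_seen):
    """n_seen: number of items observed so far, including `item` (1-based)."""
    if len(memory) >= capacity:
        # Replace a random slot with probability capacity / n_seen.
        j = random.randrange(n_seen)  # uniform over [0, n_seen)
        if capacity > j:
            memory[j] = item
    else:
        # Memory not yet full: always keep the new sample.
        memory.append(item)
    return memory
```
      </preformat>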
    </sec>
    <sec id="sec-3">
      <title>2. Experimental design and tasks</title>
      <p>
        We evaluated and validated our continual learning method on a dataset of CT images, but it can
be generalized to any dataset. The tasks involve the semantic segmentation of CT images of
insects, with an emphasis on their internal organs, such as testicles and glands. We treated each
organ segmentation as a different task; to deal with the continual learning scenarios, the model
observes all samples from the current task before moving to the next task, and we assumed
that the data from the previous tasks is inaccessible after learning them. As baseline models
to perform semantic segmentation we used two state-of-the-art deep learning models: SegNet
[14] and U-Net [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. U-Net and SegNet are both convolutional networks for biomedical image
segmentation that use an encoder-decoder structure. U-Net uses skip connections to transfer
features from the encoder to the decoder, while SegNet uses non-linear upsampling with pooling
indices to do the same. In general, U-Net also has more feature channels in the upsampling
path than SegNet.
      </p>
      <p>For the specific task and dataset, we fine-tuned and optimized the models. The dataset consisted
of 30 insect samples, differentiated by age and type, with a volume of approximately 4000×2048×2048 voxels
reconstructed for each sample. We measured the models’ performance on various metrics, such
as intersection over union. We contrasted the models’ results on training all organs jointly or
training them on each organ individually. We employed these results to compare and assess
our proposed continual learning method and how much knowledge is retained by the models
with this strategy.</p>
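      <p>The intersection over union metric mentioned above can be sketched, for binary segmentation masks, as follows (an illustrative implementation, not the exact evaluation code used in the experiments):</p>
      <preformat>
```python
import numpy as np

def iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Two empty masks agree perfectly by convention.
    return float(inter / union) if union > 0 else 1.0
```
      </preformat>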
    </sec>
    <sec id="sec-4">
      <title>3. Continual learning strategy</title>
      <p>Once the continual learning problem was defined, we performed training and optimization of
the parameters and weights of the model for the first task T_1. Next, we designed the training
strategy for the following task T_2, without the model forgetting the previous one. The first step
in devising the continual learning training strategy was to examine the similarities and
differences among the features learned by the models for the old and new tasks. We aimed to
identify which parts of the models could be shared and which parts required fine-tuning for the
next tasks. To do this we used the cosine similarity between feature tensors:</p>
      <p>S(F_{i,l}, F_{j,l}) = (F_{i,l} ⋅ F_{j,l}) / (||F_{i,l}|| ||F_{j,l}||) (1)</p>
      <p>where i and j indicate the different tasks, l indicates the layer, and F the image features. By
comparing these similarities for different tasks, we identified which features were shared and
which were task-specific. This allowed us to optimize our models and reduce the computational
cost of continual learning training. Once the model parameters that need to change in order to
learn new tasks were determined, we defined a strategy to update the weights while preventing
catastrophic forgetting. In the high-dimensional parameter space of the model just trained, there
could be update directions causing large changes in the predictions for x ∈ T_1, while there also
exist updates that minimally affect such predictions. In particular, moving locally along the
direction of the gradient of the model, ±∇_w f(x; w), where w are the weights, leads to the
biggest change in the model prediction f(x; w). Inspired by M. Farajtabar et al. [15], we updated
the weights by moving orthogonally to ∇_w f(x; w). This leads to the least change (or, locally, no
change) in the predictions for x ∈ T_1, while providing a learning direction for the predictions
for x ∈ T_2. As mentioned above, although tasks T_1 and T_2 are different and have different
distributions, they may share some similar features that could interfere with finding the
orthogonal directions. From this intuition, we deduced that, before computing the candidate
orthogonal directions, the features of the new task that are similar to the old task should be
removed, because they were already part of the model’s knowledge. This is done using the
Grad-CAM [16] approach, which highlights the weights that are important for the model’s old
predictions. Finally, to perform the steps just described, in particular the orthogonal gradient
descent, the only things that need to be stored from the previous task are its gradients. These
are necessary to calculate the new orthogonal directions for the new task.</p>
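      <p>As an illustration of Eq. (1), a minimal sketch of the layer-wise cosine similarity between two tasks' feature maps, assuming each map is flattened to a vector (the function name is ours, not the authors'):</p>
      <preformat>
```python
import numpy as np

def layer_similarity(feat_i, feat_j):
    """Cosine similarity between feature maps of two tasks at the same layer."""
    a, b = feat_i.ravel(), feat_j.ravel()  # flatten spatial/channel dims
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0
```
      </preformat>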
      <p>The continual learning strategy just proposed is shown in Figure 1 and can be summarized in
the following steps:
1. The model is trained on the current task and its gradients are stored;
2. The layers that need to be updated for the next task are identified using the cosine similarity;
3. The weights of the layers that are common or “similar” to both tasks are frozen;
4. The features of the new task that are similar to the old task are removed using Grad-CAM;
5. The model is trained on the new task using the orthogonal gradient directions.</p>
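      <p>Step 5 can be sketched in the spirit of orthogonal gradient descent [15]: the new task's gradient is stripped of its components along the stored old-task gradients. This is an illustrative sketch under our own simplifications, not the authors' implementation:</p>
      <preformat>
```python
import numpy as np

def orthogonal_direction(new_grad, old_grads):
    """Project new_grad onto directions orthogonal to the stored old gradients."""
    g = new_grad.astype(float).copy()
    basis = []
    for v in old_grads:  # Gram-Schmidt over the stored old-task gradients
        v = v.astype(float).copy()
        for u in basis:
            v -= np.dot(v, u) * u
        n = np.linalg.norm(v)
        if n > 1e-12:
            basis.append(v / n)
    for u in basis:  # remove components that would change old-task predictions
        g -= np.dot(g, u) * u
    return g
```
      </preformat>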
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions</title>
      <p>We proposed a new method for the continual learning of image segmentation, inspired by M.
Farajtabar et al. [15]. Our method uses feature map extrapolation, orthogonal gradient descent,
and similarity measures to identify which layers and weights need to change for each new task.
We freeze the weights of the shared layers and use only the old gradients to avoid forgetting
the previous task while learning the new one. Our method is also faster and cheaper than
replay-based methods, because we optimize the parameters for each task and store only the old
gradients instead of raw samples.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>I am thankful to my supervisors, Francesco Calimeri and Pierangela Bruno, for introducing and
guiding me into this exciting and challenging research field. I also appreciate the University of
Calabria for offering this engaging doctoral program and for the high-quality education that I
received during my studies.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[7] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, C. H. Lampert, iCaRL: Incremental classifier and
representation learning, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 2001–2010.
[8] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, M. Ranzato,
On tiny episodic memories in continual learning, arXiv preprint arXiv:1902.10486 (2019).
[9] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan,
J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in
neural networks, Proceedings of the national academy of sciences 114 (2017) 3521–3526.
[10] M. Farajtabar, N. Azizan, A. Mott, A. Li, Orthogonal gradient descent for continual
learning, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2020,
pp. 3762–3773.
[11] J. Yoon, E. Yang, J. Lee, S. J. Hwang, Lifelong learning with dynamically expandable
networks, arXiv preprint arXiv:1708.01547 (2017).
[12] G. Jerfel, E. Grant, T. Griffiths, K. A. Heller, Reconciling meta-learning and continual
learning with online mixtures of tasks, Advances in neural information processing systems
32 (2019).
[13] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu,
R. Pascanu, R. Hadsell, Progressive neural networks, arXiv preprint arXiv:1606.04671
(2016).
[14] V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: A deep convolutional encoder-decoder
architecture for image segmentation, IEEE transactions on pattern analysis and machine
intelligence 39 (2017) 2481–2495.
[15] M. Farajtabar, N. Azizan, A. Mott, A. Li, Orthogonal gradient descent for continual
learning, in: S. Chiappa, R. Calandra (Eds.), Proceedings of the Twenty Third International
Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine
Learning Research, PMLR, 2020, pp. 3762–3773. URL: https://proceedings.mlr.press/v108/
farajtabar20a.html.
[16] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual
explanations from deep networks via gradient-based localization, in: Proceedings of the
IEEE international conference on computer vision, 2017, pp. 618–626.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Sobieranski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>von Wangenheim</surname>
          </string-name>
          ,
          <article-title>3D segmentation algorithms for computerized tomographic imaging: a systematic literature review</article-title>
          ,
          <source>Journal of digital imaging 31</source>
          (
          <year>2018</year>
          )
          <fpage>799</fpage>
          -
          <lpage>850</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net:
          <article-title>Convolutional networks for biomedical image segmentation, in: Medical Image Computing</article-title>
          and Computer-Assisted InterventionMICCAI
          <year>2015</year>
          : 18th International Conference, Munich, Germany, October 5-
          <issue>9</issue>
          ,
          <year>2015</year>
          , Proceedings,
          <source>Part III 18</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naseer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hayat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Transformers in vision: A survey</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 54</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Litjens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Bejnordi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Setio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciompi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghafoorian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Van Der Laak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Van</given-names>
            <surname>Ginneken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. I.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <article-title>A survey on deep learning in medical image analysis</article-title>
          ,
          <source>Medical image analysis 42</source>
          (
          <year>2017</year>
          )
          <fpage>60</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>McCloskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of learning and motivation</article-title>
          , volume
          <volume>24</volume>
          ,
          <publisher-name>Elsevier</publisher-name>
          ,
          <year>1989</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>De Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aljundi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Masana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parisot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Leonardis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Slabaugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          ,
          <article-title>A continual learning survey: Defying forgetting in classification tasks</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>44</volume>
          (
          <year>2021</year>
          )
          <fpage>3366</fpage>
          -
          <lpage>3385</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>