Continual Learning for medical image classification

Alessandro Quarta1,2,3, Pierangela Bruno1 and Francesco Calimeri1,3

1 Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
2 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, Italy
3 DLVSystem Srl, Rende, Italy

Abstract
Continual Learning (CL) is a novel paradigm in which a model is trained on a stream of data, where tasks and data become available only over time. Such approaches are able to learn new skills and knowledge without forgetting the previous ones: they require no access to previously encountered data and mitigate catastrophic forgetting. In this work, we propose a comparison of different CL algorithms for the classification of medical images. In particular, we aim to highlight the potential and ability of current methods to prevent catastrophic forgetting of previous tasks when a new one is learned. CL-based methods have been tested on the classification of medical images, showing the viability and effectiveness of these approaches.

Keywords
Continual Learning, Deep Learning, Medical Imaging, Lifelong Learning, Incremental Learning, Online Learning

1. Introduction

Artificial Intelligence, and especially techniques based on Machine Learning, has sped up the automation of many processes, achieving performance comparable to humans in some specific tasks. In particular, in the last decades, supervised neural networks have shown a great deal of potential in dealing with numerous tasks such as natural language processing [1], object detection [2] and medical imaging [3].

In this context, in order to properly deploy a predictive model, huge amounts of data complying with high-quality standards are needed. However, such a requirement cannot always be satisfied, depending on several factors.
For instance, data frequently come from heterogeneous sources, might feature a large number of missing or null values, and may also be subject to usage restrictions because of particular agreements or privacy concerns. For example, the data could be stored on predefined servers without the possibility of extraction, thus making it difficult to implement integration mechanisms between sources, or the data may be available for the training phases of the various models only for a very limited period of time. In recent years, these problems have attracted a lot of attention from the scientific community, which has proposed and studied different techniques to overcome such limitations.

1st AIxIA Workshop on Artificial Intelligence For Healthcare. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

Continual Learning (CL), also referred to as continuous learning, incremental learning, lifelong learning, and online learning, aims at defining models that continuously learn and evolve as new data arrive, while retaining previously learned knowledge [4]. In this way, the model is able to learn incrementally and autonomously change its behaviour without forgetting the original task [4]. In general, humans learn concepts sequentially, and while humans may gradually forget old information, a complete loss of previous knowledge is rarely attested. By contrast, artificial neural networks work differently: they suffer from catastrophic forgetting of old concepts as new ones are learned [5]. This is a direct result of a more widespread issue with neural networks known as the stability-plasticity dilemma [6]. Stability refers to the ability to preserve prior knowledge while encoding it, while plasticity refers to the capacity to integrate new knowledge.
Furthermore, other challenges can arise, especially in healthcare. For example, it is not easy to train specialized models from scratch so that they can handle all possible tasks and activities. This would require a huge annotated dataset covering, for example, all possible diseases or image modalities, which makes such approaches non-scalable. Instead, in more realistic scenarios, physicians receive a model already trained on typical tasks; when a (completely) new activity occurs, the model can be trained again to solve the new problem. In particular, the model should first swiftly master the new task with little guidance by drawing on prior knowledge from completed tasks. Additionally, it must be able to combine new information with existing knowledge to improve on both the current task and any upcoming ones. Finally, it should be possible to learn a new task without using training data for previous tasks, which might no longer be accessible.

In this work, we propose the use of CL approaches to support Deep Learning (DL) architectures in classifying medical images; furthermore, we aim to show the feasibility and viability of such approaches via a careful experimental analysis.

The remainder of the paper is structured as follows: in Section 2 we describe the related work; then, we report on a careful experimental activity in Section 3, which is discussed in Section 4; finally, we draw our conclusions in Section 5.

2. Related works

Early studies discovered the catastrophic forgetting problem when learning samples of various input patterns sequentially. Several avenues have been investigated, and the majority of approaches to continual learning do not rely on a single method to tackle catastrophic forgetting. Each strategy has benefits and drawbacks [7], but in many cases combining methods leads to the best solutions. Lesort et al.
[7] describe several continual learning methods, classifying them into four different classes, as shown in Figure 1.

Figure 1: The four different classes defined by [7]. Figure extracted from [7].

Baweja et al. [8] investigate the potential of the Elastic Weight Consolidation (EwC) [9] algorithm in biomedical imaging to prevent catastrophic forgetting in a neural network that learns two different tasks sequentially. The tasks are the multi-class segmentation of cerebrospinal fluid (CSF), grey matter (GM), and white matter (WM), and the segmentation of white matter lesions (WML) in MRI images. This investigation on biomedical data shows that the technique is promising for alleviating catastrophic forgetting [8].

Wachinger et al. [10] suggest a continual learning method for segmenting the brain in which a single network is successively trained on samples from various domains. The authors extend an importance-driven methodology to the segmentation of medical images. Specifically, they introduce learning rate regularization to prevent the loss of the knowledge stored in the network. In particular, using Memory Aware Synapses (MAS) [11], each network parameter is given a weight that reflects its importance for a particular task. MAS computes the importance weights in an unsupervised manner, where each importance weight represents the sensitivity of the network's output to changes in the corresponding parameter. Wachinger et al. [10] implemented MAS-LR, a parameter-specific learning rate based on the parameter's importance. In addition, the authors offered MAS-Fix, which only updates unimportant parameters while freezing important parameters throughout the training of a new task. As a result, the learning rate for important parameters is reduced, while it remains constant for unimportant parameters.

Gonzalez et al. [12] proposed a fair multi-model benchmark.
They distinguished between the cases in which domain identity information is or is not available during inference, since its availability is uncertain in the domain-incremental scenario. In the more straightforward scenario, which we refer to as Domain Knowledge, test inputs have the form (x, i), where i declares that x belongs to X_i. The capacity to keep domain-specific parameters that are not shared across domains is the core advantage of having domain identity information. As a result, only the shared parameters of the model need to be trained repeatedly. In contrast to the feature extraction part of the model, the final network layers in a classification model are frequently domain-specific and are set during inference based on the domain of origin of the test instance. Therefore, the authors proposed a fair evaluation method. A comparison with the benchmark is possible if domain identity data can be employed during inference. If domain information is not available, they suggested the use of an oracle to determine the closest domain for an incoming image. If the suggested approach needs domain knowledge, then both the continual learning method and the benchmark should be compared using the domain knowledge extrapolated from the oracle. If the method does not employ domain knowledge, only the benchmark would use the oracle. They make use of two different CL algorithms: EwC [9] and LwF [13].

A multi-domain learner can incorporate new domains in a lifelong learning setting with just a few labeled instances, while maintaining performance on earlier domains [14]. In this study, Karani et al. provided strategies for segmentation across scanning protocols in a lifelong learning environment based on adaptive Batch Normalization (BN) layers. For each protocol/scanner, they train a CNN with shared convolutional filters and protocol-specific BN parameters. To teach the network the proper convolutional filters, images from a few scanners are used as training data at first.
By fine-tuning the BN parameters with a small number of labeled images, it can then be adapted to work with different protocols and scanners.

Perkonigg et al. [15] proposed the Dynamic Memory (DM) approach and the DM method with pseudo-domain detection (DM-PD), which operate without domain knowledge, a more realistic assumption in clinical practice. The DM technique keeps the samples in the memory M diversified and representative of the visual variance across all domains. Selecting which image-target pairs to preserve in memory, without explicit domain knowledge, is a crucial stage in this method. Indeed, the pseudo-domain module distinguishes between multiple domains, acting as a proxy for the unknown, actual domains. During continual training, these pseudo-domains are used to balance the memory M and the training mini-batch T.

3. Methodologies and Experimental Activities

In this work, we proposed the use of different CL algorithms to support and improve the performance of a neural network in medical image classification. In particular, we compared the Naive, Replay, CWR*, ICaRL, and Cumulative approaches. The Naive and Cumulative methods are used as the lower and upper bound of our comparison, respectively. These approaches are tested on a multi-class classification task using a medical dataset that covers several pathologies, anatomical regions, and image modalities.

3.1. Dataset

In this work, we used a collection of standardized biomedical images, MedMNIST v2 [16]. MedMNIST v2 has 12 datasets for 2D images and 6 datasets for 3D images. It includes several data types (binary/multi-class, multi-label, and ordinal regression), dataset sizes (ranging from 100 to 100,000), and tasks. An example of the dataset is shown in Figure 2. All datasets in MedMNIST v2 are subdivided into train-validation-test folders.

Figure 2: Example of images in MedMNIST dataset. Source: https://medmnist.com/

3.2. CL architecture

In our study, continual learning algorithms are used to handle both task-incremental problems (since we manage various types of images acquired with various protocols) and class-incremental problems (because we have many new classes for each experience). Nevertheless, for a first evaluation, even though each dataset relates to a different task, it is possible to treat the study as a class-incremental scenario, since practically all tasks are multi-class problems. Hence, it is as though we observed a single dataset with multiple classes, each corresponding to a different pathology. The compared strategies are the following:

• Naive: it consists in applying the backpropagation algorithm every time a new stream of data is available [17]. It represents the lower bound from a CL point of view.
• Replay: it randomly selects samples from the data of past experiences; the Random Memory (RM) size is fixed, and the memory is filled with previous data. It keeps roughly an equal number of instances per experience, and replaces examples randomly [18].
• CWR*: it is designed for the fully connected linear classifier (and can possibly be extended to several layers); it uses two distinct memory systems, one for memory consolidation and the other for improved plasticity; it is a very straightforward and effective approach, regardless of the scenario or experience content (NI, NC, NIC) [19].
• ICaRL: a hybrid continual learning strategy that combines the replay and regularization methods: distillation for the regularization of representation learning (feature extractor); template matching with the closest prototype (as a classifier); and more sophisticated example management through herding. However, this method is difficult to scale and has inefficient example management with large memory sizes [20].
• Cumulative: for every experience, it stores all data and re-trains from scratch [17]. It represents the upper bound from a CL point of view.

3.3. Training phase

For an objective assessment of the CL approaches employed throughout the model's training phase, we solely used 2D images. Following the work in [16], we made use of ResNet-18 [21] to identify the various pathologies of each dataset because, on average, it performs well on 28x28 images compared with the other models examined in that work. In particular, we focused on the following datasets: Colon Pathology, Dermatoscope, Retinal OCT, Blood Cell Microscope and Kidney Cortex Microscope [16]. We applied the same model to all datasets, as we first aimed at assessing whether an approach based on continual learning techniques can be applied to the context of medical images. The images are first pre-processed with standard procedures, and then converted to RGB and normalized so that all images can be processed by the same neural network. All experiments have been performed on a GNU/Linux machine equipped with an NVIDIA Quadro P6000 GPU with 24 GB of memory and a power draw of up to 250 W.

The training was performed with the same hyper-parameters for all approaches, except for ICaRL (which follows the parameters proposed in the original paper [20]). Specifically, in order to subdivide each dataset into training, validation and test sets, we followed the split ratio provided in [16], which differs for each dataset; for example, for the Dermatoscope dataset, the authors suggested a 7:1:2 (training:validation:testing) split ratio (please refer to [16] for more detail on the other dataset distributions). Then, we set the learning rate (lr) to 0.0001, the mini-batch size to 64 images, and the number of epochs to 20; indeed, considering that the images are small (28x28), the model was able to achieve good performance (over 90% for every single experience/stream of data) in a small number of epochs. According to the CL principles and strategies, we used a small set of pathologies for each experience.
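As an illustration of the class-incremental view described above, the following minimal sketch maps the labels of the five datasets into a single global label space and groups the resulting classes into experiences. It is only a sketch under stated assumptions: the per-dataset class counts are assumed from the MedMNIST v2 description [16] rather than stated in this paper, and the helper names and the number of classes per experience (4) are hypothetical choices made for the example, not the configuration used in the experiments.

```python
from itertools import accumulate

# Number of classes per dataset. ASSUMPTION: values taken from the
# MedMNIST v2 description [16]; they are not stated in this paper.
DATASET_CLASSES = {
    "Colon Pathology": 9,
    "Dermatoscope": 7,
    "Retinal OCT": 4,
    "Blood Cell Microscope": 8,
    "Kidney Cortex Microscope": 8,
}


def label_offsets(class_counts):
    """Map each dataset name to the first global label assigned to it."""
    names = list(class_counts)
    # Running totals of the class counts, shifted by one position.
    starts = [0] + list(accumulate(class_counts.values()))[:-1]
    return dict(zip(names, starts))


def to_global_label(dataset, local_label, offsets):
    """Convert a dataset-local label into the shared label space."""
    return offsets[dataset] + local_label


def split_into_experiences(n_classes, classes_per_experience):
    """Group consecutive global labels into class-incremental experiences."""
    labels = list(range(n_classes))
    return [
        labels[i:i + classes_per_experience]
        for i in range(0, n_classes, classes_per_experience)
    ]


offsets = label_offsets(DATASET_CLASSES)
total_classes = sum(DATASET_CLASSES.values())
# HYPOTHETICAL choice: 4 classes per experience, for illustration only.
experiences = split_into_experiences(total_classes, classes_per_experience=4)
```

With this mapping, every pathology receives a unique label, so the five datasets can be presented to a single classifier as one stream of experiences, each introducing a few new classes.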
The CL algorithms were implemented using Avalanche, a novel framework released by ContinualAI [22].

3.4. Evaluation metrics

We assessed the ability of continual learning to improve performance on previously seen domains when new domains are added by means of backward transfer (BWT) [23]. BWT measures how learning a new domain affects performance on prior tasks. Avoiding negative BWT is especially crucial for CL, since negative BWT values signify catastrophic forgetting. Mathematically, BWT is defined as

$$\mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( R_{T,i} - R_{i,i} \right)$$

where $R_{i,i}$ is the test classification accuracy of the model on task $t_i$ after observing the last sample from task $t_i$, and $R_{T,i}$ is the test classification accuracy on task $t_i$ after the model has been trained on all $T$ tasks. By definition, it is possible to evaluate the forgetting of a neural model as $\mathrm{Forgetting} = -\mathrm{BWT}$; therefore, the lower the measure, the better the performance. In fact, this measure shows the capability of a neural network to maintain knowledge about previous experiences. Additionally, we also used the top-1 accuracy metric, in order to assess the overall performance of the model.

Table 1: Results obtained using different CL strategies. The approaches are evaluated in terms of BWT and Top-1 Accuracy.

                  Naive     Replay    CWR*      ICaRL     Cumulative
BWT               -0.9247   -0.7305   -0.1928   -0.2465   0
Top-1 Acc. avg.    0.2334    0.3790    0.5431    0.5154   0.6285

4. Results and discussion

In this section, we discuss the experimental results and the effectiveness of the continual learning methodologies considered herein. One of the crucial issues is how to establish boundaries for continual learning methodologies. To the best of our knowledge, and according to the survey on CL approaches provided in [7], current continual learning approaches still fall short of matching the performance of a group of models (or a single model, depending on the problem to be solved) trained with all data. The cumulative approach refers to training a model with all the data at once.
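As an illustration of the measures defined in Section 3.4, the following minimal sketch computes BWT and forgetting from an accuracy matrix. The function names and the toy accuracy values are hypothetical and are not taken from the experiments reported here; the matrix convention (zero-based, `acc[j][i]` being the test accuracy on task i after training on tasks up to j) is an assumption of the sketch.

```python
def backward_transfer(acc):
    """Backward transfer (BWT) from a T x T accuracy matrix.

    acc[j][i] is the test accuracy on task i measured after the model
    has been trained on tasks 0..j (zero-based indices).
    """
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("BWT needs at least two tasks")
    last = n_tasks - 1
    # Average, over all tasks but the last, of the difference between the
    # final accuracy (R_{T,i}) and the accuracy right after learning the
    # task (R_{i,i}).
    total = sum(acc[last][i] - acc[i][i] for i in range(last))
    return total / (n_tasks - 1)


def forgetting(acc):
    """Forgetting defined as the negative of BWT (lower is better)."""
    return -backward_transfer(acc)


# HYPOTHETICAL accuracies for three tasks, for illustration only.
toy_acc = [
    [0.9, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.5, 0.6, 0.7],
]
toy_bwt = backward_transfer(toy_acc)
toy_forgetting = forgetting(toy_acc)
```

In this toy matrix the accuracy on the first two tasks drops after later tasks are learned, so BWT is negative and forgetting is positive, which is the pattern that signals catastrophic forgetting.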
One of the goals of a CL method is to match and, if feasible, even outperform the performance of a model that has been trained using all the data. However, it is also possible to establish a lower bound, which is given by the so-called Naive technique: it simply applies back-propagation on each new stream of data, without any mechanism to counter forgetting.

The BWT and accuracy of the models, which are two helpful measures for a first assessment of CL performance, are shown in Table 1. We started the performance evaluation of the models from the Cumulative approach, pointing out that, compared to the work in [16], we have obtained lower accuracy (i.e., 0.63 Top-1 Accuracy average): this is mainly due to the fact that the performance of a single model responsible for the classification of all the different pathologies is lower than the performance of multiple models, each focusing on the classification of the pathologies of one specific dataset. However, as previously mentioned, the main goal of this work was to compare the CL models with a possible (upper) limit case in order to identify the best approach usable in medical imaging; therefore, our primary objective is not to obtain a model that could outperform the state of the art.

CL approaches can be evaluated according to their accuracy and forgetting values. In the following, we propose a detailed analysis of the different CL strategies. Although random replay is widely used in the literature, in our experiments its accuracy and BWT results look inadequate, scoring 0.38 and −0.73, respectively. This could be caused by the unbalanced distribution of images across classes and by the high number of images. Another reason could be the memory capacity limit for the stored data, which was set to 1500; in [18], this value appears to be the limit above which there is no real advantage when weighed against the problems of excessive storage cost and privacy preservation.
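To make the role of the memory capacity concrete, the following minimal sketch shows a fixed-size random replay memory in the spirit of the Replay strategy described in Section 3.2: after each experience, a quota of roughly capacity divided by the number of experiences seen so far is filled with randomly chosen new samples, and stored samples are evicted at random to make room. The class name, its methods, and the stream sizes are hypothetical; only the capacity of 1500 comes from the text. This is not the Avalanche implementation used in the experiments.

```python
import random


class RandomReplayMemory:
    """Fixed-capacity memory with random insertion and random eviction."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.samples = []  # list of (experience_id, sample) pairs
        self.n_experiences = 0

    def update(self, new_samples):
        """Call once after training on an experience (new_samples: a list)."""
        experience_id = self.n_experiences
        self.n_experiences += 1
        # Quota for the new experience: an equal share of the capacity.
        quota = self.capacity // self.n_experiences
        incoming = self.rng.sample(new_samples, min(quota, len(new_samples)))
        # Randomly keep as many stored samples as still fit in memory.
        n_keep = min(len(self.samples), self.capacity - len(incoming))
        self.samples = self.rng.sample(self.samples, n_keep)
        self.samples += [(experience_id, s) for s in incoming]


# Capacity of 1500 as in the text; three HYPOTHETICAL experiences of
# 5000 samples each (integers stand in for image-label pairs).
memory = RandomReplayMemory(capacity=1500)
for _ in range(3):
    memory.update(list(range(5000)))
```

After three experiences of 5000 samples each, the memory holds 1500 samples in total, so each past experience is represented by only about 500 of them; this illustrates why a small memory may be insufficient when the datasets are large.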
Thus, it becomes evident that remembering only a few examples of previous experiences is insufficient in a scenario where the number of images per dataset is large.

Instead, we found promising results for both CWR* and ICaRL according to each metric taken into account. Indeed, CWR* achieved the best results (i.e., a Top-1 Accuracy average of 0.54 and a BWT of −0.19), even if slightly below the Cumulative strategy. Figure 3 illustrates how the model performance degrades over experiences due to catastrophic forgetting, which is nonetheless mitigated by the CL strategies.

Figure 3: The figure shows the decrease in performance in terms of accuracy of different CL approaches against the increasing number of experiences. This is due to the fact that for every new experience the model has to recognize both the new and the older classes. The figure does not show the Cumulative strategy since, in our study, it is not evaluated in terms of different experiences (all datasets are available at the same time).

It is worth noting that the Cumulative strategy was intentionally omitted from the plot, given that, in this study, it was evaluated as a model trained on all available datasets simultaneously, instead of storing data from previous experiences and re-training the model from scratch. In a nutshell, with the Cumulative strategy the model received all data together, in a sort of "single big experience"; as shown in Table 1, it outperforms the other strategies in accuracy. This was made possible since the problem can be traced back to a class-incremental problem.

5. Conclusion

In this work, we presented a comparison of different CL strategies used to support neural networks in the classification of medical images. We reported the results of an ad-hoc experimental analysis showing that the best results are obtained using CWR* and ICaRL. Although the results are slightly lower than the upper bound (i.e.,
cumulative strategy), the overall performance results are promising. Hence, our approach, featuring a comparative evaluation of CL methods, can help support the proper choice and use of these methods in medical imaging. As far as future work is concerned, we plan to include other metrics in our experimental comparison (e.g., forward transfer) and to evaluate the variation of the training time of the different approaches.

Acknowledgements

This work has been partially funded by PON "Ricerca e Innovazione" 2014-2020, CUP: H25F21001230004.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, International journal of computer vision 115 (2015) 211–252.
[3] M. H. Hesamian, W. Jia, X. He, P. Kennedy, Deep learning techniques for medical image segmentation: achievements and challenges, Journal of digital imaging 32 (2019) 582–596.
[4] C. S. Lee, A. Y. Lee, Clinical applications of continual learning machine learning, The Lancet Digital Health 2 (2020) e279–e281.
[5] R. M. French, Catastrophic forgetting in connectionist networks, Trends in cognitive sciences 3 (1999) 128–135.
[6] S. T. Grossberg, Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control, volume 70, Springer Science & Business Media, 2012.
[7] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, N. Díaz-Rodríguez, Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges, Information fusion 58 (2020) 52–68.
[8] C. Baweja, B. Glocker, K. Kamnitsas, Towards continual learning in medical imaging, arXiv preprint arXiv:1811.02496 (2018).
[9] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the national academy of sciences 114 (2017) 3521–3526.
[10] C. Wachinger, Importance driven continual learning for segmentation across domains, in: Machine Learning in Medical Imaging: 11th International Workshop, MLMI 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Proceedings, volume 12436, Springer Nature, 2020, p. 423.
[11] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, T. Tuytelaars, Memory aware synapses: Learning what (not) to forget, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 139–154.
[12] C. Gonzalez, G. Sakas, A. Mukhopadhyay, What is wrong with continual learning in medical image segmentation?, arXiv preprint arXiv:2010.11008 (2020).
[13] Z. Li, D. Hoiem, Learning without forgetting, IEEE transactions on pattern analysis and machine intelligence 40 (2017) 2935–2947.
[14] N. Karani, K. Chaitanya, C. Baumgartner, E. Konukoglu, A lifelong learning approach to brain MR segmentation across scanners and protocols, 2018.
[15] M. Perkonigg, J. Hofmanninger, C. J. Herold, J. A. Brink, O. Pianykh, H. Prosch, G. Langs, Dynamic memory to alleviate catastrophic forgetting in continual learning with medical imaging, Nature Communications 12 (2021) 1–12.
[16] J. Yang, R. Shi, D. Wei, Z. Liu, L. Zhao, B. Ke, H. Pfister, B. Ni, MedMNIST v2: A large-scale lightweight benchmark for 2D and 3D biomedical image classification, arXiv preprint arXiv:2110.14795 (2021).
[17] K.-H. Thung, C.-Y. Wee, A brief review on multi-task learning, Multimedia Tools and Applications 77 (2018) 29705–29725.
[18] L. Pellegrini, G. Graffieti, V. Lomonaco, D. Maltoni, Latent replay for real-time continual learning, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 10203–10209.
[19] V. Lomonaco, D. Maltoni, L. Pellegrini, Rehearsal-free continual learning over small non-iid batches, in: CVPR Workshops, volume 1, 2020, p. 3.
[20] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, C. H. Lampert, iCaRL: Incremental classifier and representation learning, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 2001–2010.
[21] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[22] V. Lomonaco, L. Pellegrini, A. Cossu, A. Carta, G. Graffieti, T. L. Hayes, M. De Lange, M. Masana, J. Pomponi, G. M. Van de Ven, et al., Avalanche: an end-to-end library for continual learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3600–3610.
[23] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, Advances in neural information processing systems 30 (2017).