=Paper= {{Paper |id=Vol-3831/paper2 |storemode=property |title=ProtoAL: Interpretable deep active learning with prototypes for medical imaging |pdfUrl=https://ceur-ws.org/Vol-3831/paper2.pdf |volume=Vol-3831 |authors=Iury Santos,André Carvalho |dblpUrl=https://dblp.org/rec/conf/explimed/SantosC24 }} ==ProtoAL: Interpretable deep active learning with prototypes for medical imaging== https://ceur-ws.org/Vol-3831/paper2.pdf
                                ProtoAL: Interpretable Deep Active Learning with
                                Prototypes for Medical Imaging
                                Iury B de A Santos1 , André C P L F Carvalho1,*
                                1
                                    Instituto de Ciências Matemáticas e de Computação (ICMC), University of São Paulo (USP), São Carlos, Brazil


                                              Abstract
                                              The adoption of deep learning algorithms in the medical imaging area is a prominent research issue, with
                                              high potential for advancing AI-based Computer-aided diagnosis solutions. However, current solutions
                                              face challenges due to a lack of interpretability features and high data demands, prompting recent efforts
                                              to address these issues. In this study, we propose the ProtoAL model, where we integrate an interpretable
                                              deep network into the deep active learning framework. This approach aims to address both challenges
                                              by focusing on the medical imaging context and utilizing an inherently interpretable model based on
                                              prototypes. We evaluated ProtoAL on the Messidor dataset, achieving results as 0.81 in F1-score and
                                              accuracy, and an area under the precision-recall curve of 0.79 while utilizing only 76.54% of the available
                                              labeled data. The yielded results were inline with baselines investigated, while providing interpretability
                                              prototypes requiring less training data. These capabilities can enhances the practical usability of a deep
                                              learning model in the medical field, providing a means of trust calibration in domain experts and a
                                              suitable solution for learning in the data scarcity context often found.

                                              Keywords
                                              Deep Active Learning, Interpretability, Medical Imaging




                                1. Introduction
                                Deep learning (DL) is an area of machine learning (ML) that has been rapidly growing in recent
                                years. Employed for training of deep artificial neural network architectures, its use has been
                                extensively explored, due to its impressive results in various areas, such as bioinformatics [1],
                                natural language processing [2, 3], and image processing [4, 5]. Some of the main applications
                                in image processing are in medicine, where AI-based Computer-aided diagnosis (AI-CAD)
                                solutions often use DL. In these cases, DL models support medical diagnoses based on images
                                such as magnetic resonance imaging (MRI), X-rays, computed tomography and conventional
                                photographs.
                                   Despite considerable community interest, practical application of AI-CAD solutions encoun-
                                ters obstacles, including the lack of interpretability features in models. These models are often
                                perceived as black boxes, making it challenging for humans to understand their internal rea-
                                soning, which raises trust issues among experts and regulatory concerns [6, 7, 8]. Current
                                AI-CAD solutions often lack robustness, being susceptible to biases during training and failing

                                EXPLIMED - First Workshop on Explainable Artificial Intelligence for the medical domain - 19-20 October 2024, Santiago
                                de Compostela, Spain
                                *
                                  Corresponding author.
                                $ iuryandrade@usp.br (I. B. d. A. Santos); andre@icmc.usp.br (A. C. P. L. F. Carvalho)
                                 0000-0001-7234-6877 (I. B. d. A. Santos); 0000-0002-4765-6459 (A. C. P. L. F. Carvalho)
                                            © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




to provide experts with confidence estimations or limitations regarding the results. This poses
challenges, especially in healthcare settings, where less experienced professionals may overly
rely on computational models [6, 9, 10].
   ProtoPNet, introduced by Chen et al. [11], is a deep neural network (DNN) architecture
aiming to enhance interpretability features within DNN models. During inference, ProtoPNet
showcases prototypes that share similar features with the input image in a “bag-of-features”
format. Several studies have applied the ProtoPNet architecture in medical image analysis. For
instance, Mohammadjafari et al. [12] used ProtoPNet to classify MRI brain scans as either healthy
or indicative of Alzheimer’s disease. Vaseli et al. [13] introduced ProtoASNet, a modified version
of ProtoPNet tailored for handling spatio-temporal data and integrating aleatory uncertainty
estimation into prototypes in the context of aortic stenosis. Wei et al. [14] presented the
MProtoNet model, specifically designed for brain tumor classification tasks in multi-parametric
MRI (mpMRI) analysis.
   The use of DL is also affected by the limited availability of large labeled datasets, especially in
supervised learning. Although raw data are often abundant, labeling requires expert input, resulting
in high costs and time demands. To address these issues, deep active learning (DAL) emerges as a
feasible approach, extending active learning (AL) [15] concepts to work with DNNs. DAL assumes
that a model with comparable results can be achieved using fewer, carefully selected training
instances. Several works have investigated DAL in medicine, with satisfactory outcomes with
less data compared with models trained on the entire dataset [16, 17, 18, 19, 20]. Smailagic et al.
[21] proposed the O-medal method, which employs online training and eliminates the necessity
of the complete model re-training at each DAL cycle. This method presents a more feasible
scheme within the context of DNNs.
   Limited research exists on the intersection of DAL and interpretable models. Phillips et al.
[22] used LIME interpretability to aid experts in understanding query batches and tracking
uncertainty bias. Das et al. [23] employed AL for anomaly detection, providing experts with
explanations of model decisions using tree-based ensembles. They introduced a novel formalism
for compact descriptions to enhance diversity and generate interpretable rule sets. Liu et al.
[24] proposed a DAL approach based on model interpretability, aiming to maximize repre-
sentativeness in unlabeled data by selecting examples from different linear separable regions
of an interpretable DNN. Furthermore, Mondal and Ganguly [25] trained an explainer model
alongside a classifier to select new instances based on dissimilarity of explanations compared to
already labeled instances. Except for Das et al. [23], none of these works applied their approaches
to medical data, and even that work explored only tabular data.
   The main contribution of our work is ProtoAL 1 , a novel model capable of integrating an
interpretable DNN into the DAL framework, specifically designed for medical image analysis.
ProtoAL achieves interpretability through prototypes, providing explanations based on image
patches from the training dataset, aligning closely with clinical practices in a visually intuitive
manner. Additionally, we investigated how the DAL framework impacts the objective of reducing
the required training instances to achieve results comparable to conventionally trained models.
To our knowledge, ProtoAL is the first model to integrate an intrinsically interpretable DNN
through visual prototypes into the DAL framework. To assess its effectiveness, ProtoAL was

1
    ProtoAL source code is available at https://github.com/IuryBAS/ProtoAL_paper
compared against baselines using robust predictive performance measures, such as the area
under the precision-recall curve (AUPRC).
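As a concrete reference for the headline metric, the area under the precision-recall curve can be computed as average precision over the ranked predictions. The sketch below is illustrative and not part of the paper's pipeline; the labels and scores are invented toy values.

```python
def average_precision(y_true, y_prob):
    """Area under the precision-recall curve, computed as average precision:
    precision averaged over the ranks at which the true positives appear."""
    order = sorted(range(len(y_prob)), key=lambda i: -y_prob[i])  # rank by score, descending
    total_pos = sum(y_true)
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank        # precision at this recall point
    return ap / total_pos

# Toy binary labels and predicted disease probabilities (illustrative only)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6]
print(round(average_precision(y_true, y_prob), 4))  # → 0.9167
```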


2. Methods


                     DAL cycle                  Model train cycle
                                                Interpretable learning
                                                        model M

                     Train model M                                                            Evaluate U
                         with L                                                                using M




                                                                                           Unlabeled dataset
                                                                                                   U     Inference
                                                                         Selected Instances
                                                                                                            mode
                              Labeled dataset                                          ?
                                    L                                         ?
                                                                                      ?
                                                                                          Search/Query
                                                                                            strategy
                                                                                                Q
                                                       +
                                                                +
                                                           +
                     Include new labeled                                            Send instances to
                        instances in L                                               oracle labeling

                                                       Oracle


Figure 1: Schematic view illustrating the DAL and model training cycles. In the DAL cycle, labeled
instances are added to ℒ by selecting unlabeled instances from 𝒰 using a search strategy. Meanwhile,
the learning model 𝑀 (ProtoPNet) undergoes training iterations within each DAL cycle.


   Our model, ProtoAL, aims to integrate interpretable DL into the DAL framework, with a focus
on AI-CAD for medical imaging. The DAL framework trains a learning model, 𝑀 , using a
training set, ℒ, consisting of instances selected by a search strategy 𝑄, whose criteria
typically target uncertain instances or aim to enhance diversity. The selected instances
come from a large, unlabeled dataset 𝒰, and are labeled by an oracle before being added to
ℒ. The oracle can operate online, where domain experts label the selected instances, or in a
“simulated” manner, where ground-truth labels are hidden and only revealed for the selected
instances. The DAL framework assumes that the selected instances offer pertinent information
about the problem, which enables a model to achieve performance comparable to one trained on
a fully random instance dataset while needing fewer examples. This is particularly beneficial
in the medical context, where large unlabeled datasets exist but labeled datasets suitable
for DL training are scarce, reducing the expenses associated with expert labeling. The
DAL cycle is repeated until a stop condition is reached, such as satisfactory performance, budget
constraints, or depletion of the 𝒰 dataset.
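The cycle described above can be sketched as a generic loop. All callables here (`train`, `query`, `oracle`, `stop_condition`) are hypothetical placeholders for the real components, not the authors' implementation.

```python
def dal_loop(model, L, U, query, oracle, train, stop_condition):
    """Generic deep active learning cycle as described in the text.
    L: labeled set of (instance, label) pairs; U: pool of unlabeled instances."""
    while not stop_condition(model, U):
        train(model, L)                  # fit M on the current labeled set L
        selected = query(model, U)       # search strategy Q picks the next instances
        labeled = [(x, oracle(x)) for x in selected]  # oracle provides the labels
        L.extend(labeled)                # grow L with the newly labeled instances
        for x in selected:
            U.remove(x)                  # selected instances leave the pool
    return model
```

With a "simulated" oracle, `oracle(x)` simply reveals the hidden ground-truth label of the selected instance.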
   Figure 1 illustrates our workflow, divided into two cycles: the outer DAL cycle, corresponding
to the process described above, and the training of the interpretable model itself, with its inner
workings (detailed in Subsection 2.2). The proposed ProtoAL model uses ProtoPNet
as the base model 𝑀 , integrating an intrinsically interpretable DNN into the DAL framework.
This allows the interpretable capabilities of ProtoPNet to be incorporated into the DAL
framework, requiring fewer training instances and taking advantage of the more informative
and significant ones. With ProtoAL, the adoption of models like ProtoPNet becomes more
feasible in the medical context, where large labeled datasets are scarce.

2.1. Deep Active Learning
We use the O-medal method [21] to define our DAL framework. O-medal operates similarly
to the framework described above, but differs by avoiding the need to retrain the model from
scratch at each DAL iteration, offering a better-fitting and more computationally efficient
approach. Also, ℒ does not include all previously labeled instances; instead, it consists
of newly labeled instances and a partition 𝑝 of the previously labeled data.
   The 𝑄 strategy was based on uncertainty estimation. A common approach involves Monte
Carlo Dropout (MC Dropout) [26], with the Dropout technique [27] acting as a Bayesian
approximation. During inference on the 𝒰 instances, 𝑇 forward passes are performed, and
their outputs are averaged for each instance. The 𝑛 most uncertain instances are labeled and added
to the ℒ set [26]. Throughout this procedure, all layers except the dropout layers remain frozen.
As the oracle, we adopted the simulated strategy, revealing instance labels only when they are
selected for labeling.
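A minimal sketch of this uncertainty query, assuming a PyTorch classifier. The paper specifies MC Dropout averaging but not the exact uncertainty score, so the function name `mc_dropout_query`, the predictive-entropy criterion, and the fixed-batch-size indexing are illustrative assumptions.

```python
import torch

def mc_dropout_query(model, unlabeled_loader, T=10, n=30):
    """Rank unlabeled instances by MC Dropout uncertainty (illustrative sketch).
    The model is kept in train() mode so dropout stays active, while no_grad()
    keeps all parameters frozen; T stochastic forward passes are averaged."""
    model.train()                        # keep dropout layers active at inference
    scores = []
    with torch.no_grad():
        for b, (x, _) in enumerate(unlabeled_loader):
            probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
            mean = probs.mean(dim=0)     # average over the T forward passes
            # predictive entropy of the averaged distribution as the score
            entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
            # global index assumes a constant batch size (sketch only)
            scores.extend((e.item(), b * x.shape[0] + j) for j, e in enumerate(entropy))
    scores.sort(reverse=True)            # most uncertain first
    return [i for _, i in scores[:n]]    # indices of the n instances to label
```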

2.2. Prototypical Neural Network
ProtoPNet [11] is an interpretability-oriented DNN that uses prototypes to explain the
reasoning of the learning model. Unlike post-hoc explainability approaches, ProtoPNet is an
intrinsically interpretable method: constraints imposed on the learning process make the
model consider its explanations during training.
   In summary, the ProtoPNet model comprises a backbone neural architecture 𝑓 , such
as VGGNet [28], ResNet [29], or any other selected architecture, extended by a prototypical
layer 𝑔p and a fully connected layer ℎ. Input feature extraction is performed by 𝑓 , while 𝑔p
learns 𝑚 prototypes P = {p𝑗 }, 𝑗 = 1, . . . , 𝑚, of shape 𝐻1 × 𝑊1 × 𝐷, with 𝐻1 , 𝑊1 , and 𝐷
representing the height, width, and depth of the prototypes [11]. Each prototype activation
pattern acts as a patch representing some prototypical image patch in the original pixel space,
so p𝑗 can be understood as a latent representation of some prototypical part. The squared 𝐿2
distances between the input patches and the prototypes are inverted into similarity scores,
forming an activation map whose values indicate how strongly a prototypical part is present
in the image. The activation map is reduced by global max-pooling to a single similarity score,
indicating how strongly a prototypical part is present in some patch of the input image. The 𝑚
similarity scores produced by 𝑔p are fed to the fully connected layer ℎ and normalized using
softmax, yielding the predicted class probabilities of the image [11].
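The distance-to-similarity computation of 𝑔p can be sketched as follows, using the log-activation from the original ProtoPNet paper and 1 × 1 prototypes for simplicity. The helper name, tensor shapes, and the `eps` value are assumptions for illustration.

```python
import torch

def prototype_similarity(features, prototypes, eps=1e-4):
    """Squared L2 distances between feature patches and prototypes, inverted
    into similarity scores and global-max-pooled to one score per prototype.
    features:   (B, D, H, W) conv feature map from the backbone f
    prototypes: (m, D, 1, 1) prototype vectors (H1 = W1 = 1 for simplicity;
                ProtoPNet also allows larger prototype patches)."""
    B, D, H, W = features.shape
    m = prototypes.shape[0]
    patches = features.permute(0, 2, 3, 1).reshape(B, H * W, D)   # (B, HW, D)
    protos = prototypes.reshape(m, D).expand(B, m, D)             # (B, m, D)
    d2 = torch.cdist(patches, protos) ** 2                        # (B, HW, m)
    sim = torch.log((d2 + 1) / (d2 + eps))   # distance -> similarity (ProtoPNet log-activation)
    return sim.max(dim=1).values             # global max-pool over patches: (B, m)
```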
   The main advantage of ProtoPNet is that it provides visual explanations of its predictions,
which are optimized and learned alongside the classifier. These prototypes correspond to
instances from the training set, representing real cases and attributes.
3. Experiments
3.1. Dataset
The experiments were carried out using the Messidor dataset [30]2 of diabetic retinopathy (DR). It
comprises 1200 color images of the eye fundus, obtained from three different ophthalmologic
departments. Experts evaluated each image and classified it based on the retinopathy grade
and the risk of macular edema. The retinopathy grade ranges from 0 (normal) to 3, considering the
number of microaneurysms, hemorrhages, and the presence of neovascularization. Corrections
were made by fixing mislabeled and removing duplicate image files according to the
instructions on the dataset download page3 .
  Following the preprocessing outlined in Smailagic et al. [21], we grouped the retinopathy
grades as healthy (DR = 0) or diseased (DR ≥ 1). The risk of macular edema feature was not
used in the experiments. The images were resized to 512 × 512, and data were augmented by
randomly applying rotations up to 15 degrees, horizontal flips and scaling in the range [0.9, 1].
The train, validation and test sets were composed of 759, 190 and 238 instances, respectively.

3.2. Baselines
We compared the ProtoAL model from two perspectives, to observe both the interpretability and
the DAL framework factors. For this, we adopted three baselines, targeting distinct contexts: (i) a
vanilla ResNet-18 model, trained conventionally and without interpretability features, with
access to the entire training dataset from the beginning; (ii) a standalone ProtoPNet baseline,
with the aim of evaluating the performance of the interpretability model used in ProtoAL without
incorporating it into the DAL framework; and (iii) ProtoAL with a random search strategy,
selecting random instances from 𝒰. This baseline allows observing the impact of MC Dropout
as the search strategy on the performance of the ProtoAL model. The ResNet-18 and ProtoPNet
were pretrained on the ImageNet dataset [31], both when used as baselines and as the backbone
in ProtoAL.

3.3. Implementation details
As mentioned earlier, the ProtoAL model follows the structure of the O-medal method, while
employing a ProtoPNet as the DNN model. The model hyper-parameters were optimized using
grid search [32], deemed the most appropriate approach given the computational constraints
associated with exhaustive search. Runs were executed by varying seed values
(0, 1, 2, 5, 10, 12, 42, 123, 1234, 12345) and hyper-parameters for both the DAL method and
the ProtoPNet model, namely MC Dropout steps (10, 30, 50), instances to label per DAL iteration
(10, 20, 30), batch size (32, 64), and epochs per DAL iteration (5, 10, 20).
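Enumerated as a Cartesian product, this grid yields 10 × 3 × 3 × 2 × 3 = 540 configurations; a sketch (the key names are illustrative):

```python
from itertools import product

# Hyper-parameter grid with the values listed in the text
grid = {
    "seed": [0, 1, 2, 5, 10, 12, 42, 123, 1234, 12345],
    "mc_dropout_steps": [10, 30, 50],
    "labels_per_iter": [10, 20, 30],
    "batch_size": [32, 64],
    "epochs_per_iter": [5, 10, 20],
}

# One dict per run configuration, covering the full Cartesian product
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # → 540
```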
   We conducted runs using both random (ProtoAL-Random) and MC Dropout (ProtoAL-MC) as the
search strategy. The stop condition for the DAL cycle was the depletion of instances to be
labeled in 𝒰. The number of DAL iterations was dynamically adjusted based on the number of
instances labeled per DAL iteration. The fixed hyper-parameters follow Smailagic et al. [21].
2
    Kindly provided by the Messidor program partners (see https://www.adcis.net/en/third-party/messidor/).
3
    https://www.adcis.net/en/third-party/messidor/
The ℒ set initially consisted of 100 randomly selected instances. The percentage of previously
labeled examples forming ℒ was set to 0.875.
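The composition of the per-iteration training set, with a fraction 𝑝 = 0.875 of previously labeled instances plus the newly labeled ones, can be sketched as follows; the function name and sampling details are illustrative, not the authors' implementation.

```python
import random

def omedal_train_subset(previous_labeled, newly_labeled, p=0.875, rng=random):
    """Compose the per-iteration training set in the O-medal style: all newly
    labeled instances plus a random fraction p of the previously labeled ones."""
    k = int(p * len(previous_labeled))          # size of the partition of old data
    subset = rng.sample(previous_labeled, k)    # random partition of previous labels
    return subset + list(newly_labeled)         # old partition + new instances
```

This is what avoids full retraining: each cycle fine-tunes on a manageable mix of old and new labels instead of the whole labeled history.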
   A ResNet-18 was used as the backbone DNN of ProtoAL, with prototype layers featuring
256 channels. We set 12 prototypes, with 6 allocated to each class (healthy and
diseased). During training, ProtoPNet undergoes a warm-up of 5 epochs during the first DAL
iteration. After the warm-up, joint training is performed for 𝑒 epochs (the epochs-per-DAL-iteration
hyper-parameter), followed by a projection (push) step and last-layer optimization for 15 steps.
Except for the warm-up, this cycle repeats for all DAL iterations. We used the Adam optimizer,
with learning rates as outlined in Chen et al. [11], decreased exponentially per epoch. All
experiments were run on an Nvidia Tesla V100 GPU with 32 GB of VRAM. The DNN
models were implemented with the PyTorch 2.0.1 framework and Python 3.10.


4. Experimental Results
During the grid search, we trained a total of 540 ProtoAL models. We selected the best model
configuration using AUPRC, based on validation set results. The best run was achieved by a
model trained with a batch size of 32, employing 10 iterations of MC Dropout per instance,
and selecting 30 new examples at each DAL iteration. For the joint optimization phase, the
ProtoPNet model underwent 10 training epochs. Table 1 presents a comparison of results
between ProtoAL with MC dropout (ProtoAL-MC) and the baselines evaluated on the test set.
   The ProtoAL-MC achieved an AUPRC of 0.79, along with an F1-Score and accuracy of 0.81.
The ProtoAL-Random, utilizing random search strategy, achieved 0.77 in AUPRC and 0.79 in
both F1-Score and accuracy. These results demonstrate that the MC Dropout search strategy
yielded superior results, particularly evident in the AUPRC metric.
   ProtoPNet and ResNet-18 obtained results comparable or superior to ProtoAL in specific
metrics. It is worth noting that both baselines had access to the entire training dataset from the
start, and were thus exposed to more instances. It can be observed that ProtoPNet exhibits a decline




Table 1
Evaluation test results of the ProtoAL-MC and baselines
in relation to AUPRC, F1-Score, Precision, Recall and
Accuracy
      Model       AUPRC    F1-Score   Precision   Recall   Accuracy
 ResNet-18        0.8462    0.8699     0.9067     0.8359    0.8655
 ProtoPNet        0.7900    0.8273     0.8512     0.8046    0.8193
 ProtoAL-Random   0.7773    0.7999     0.8571      0.75     0.7983
 ProtoAL-MC       0.7935    0.8181     0.8684     0.7734    0.8151

Figure 2: Comparisons of the ProtoAL-MC model and the baselines, evaluated on the validation set.
in performance compared to ResNet-18, possibly due to the added complexity of optimization
from the inclusion of interpretability features. Consequently, ProtoAL’s performance remains
consistent and similar to that of the ProtoPNet baseline.
   The ProtoAL-MC model achieved results comparable to the ProtoPNet baseline in the 17th
DAL iteration out of 23 total, with the ℒ dataset consisting of 512 instances, having labeled
581 instances and leaving 178 instances in the 𝒰 set. By this stage, ProtoAL-MC required only
76.54% of the training instances, out of the initial total of 759 examples, to achieve comparable
results to models trained with all available examples. Figure 2 presents a comparison between
ProtoAL-MC and the baseline models across DAL cycle steps when evaluated on the validation
set, with ProtoAL-MC achieving its best performance at approximately step 400 out of 600.
The steps refer to the total number of model training steps (joint, push, and last-layer-only
phases) across all DAL iterations. Figure 3 displays prototypes demonstrating the correct
classification of an image with DR for qualitative evaluation. It shows the three most similar
prototypes for both the healthy and diseased cases, along with prototype weights, similarity
scores, and the prediction value.
   Through the prototypes, a domain expert can observe the most similar instances for both diseased
and healthy cases, along with their respective relevant regions in the input image. As interpretability
criteria, experts can rely on visual inspection and on the prototype similarity scores and
weights, assessing whether the model’s inferences are meaningful and relevant. For instance, an
expert can verify whether a highly relevant prototype corresponds to a meaningful region of the
[Figure 3: each row pairs the test image with its activation map and patch similar to a prototype, next to the prototype patch and its source image from the training set. Diseased class: prototype weights 1.081, 1.083, 1.065 with similarity scores 4.540, 3.516, 3.310, giving 1.081 × 4.540 + 1.083 × 3.516 + 1.065 × 3.310 = 12.240. Healthy class: prototype weights 0.927, 0.910, 0.910 with similarity scores 3.246, 2.994, 0.872, giving 6.527.]
Figure 3: The most similar prototypes for both diseased and healthy conditions for an accurately
classified image with DR. It includes similarity scores, prototype weights, and prediction values. For
each prototype, the relevant activated region from the input image and its corresponding prototype
from a training image are shown.
input image, assess if it contains features consistent with the predicted output, and observe the
similarity score it achieves. With such information, an expert has the tools to more accurately
calibrate their confidence and trust in the model’s prediction.


5. Discussion and conclusion
Our model, ProtoAL, integrates an interpretable DNN model with prototypes into a DAL
framework. Specifically tailored to the AI-CAD context of medical imaging, it aims to enhance
the reliability of AI-CAD solutions in practical medical settings while enabling comprehension
of the model’s decisions and training on limited labeled datasets. The quantitative results
presented in Table 1 demonstrate the success of providing an interpretable model while utilizing
a reduced amount of training data, addressing two key challenges in the adoption of AI in
medical contexts: lack of interpretability and scarcity of labeled datasets. The qualitative results
in Figure 3 illustrate prototypes relevant to the network’s inference reasoning for a diseased
input image, providing visual explanations that domain experts can interpret.
   ProtoAL offers interpretability features lacking in the ResNet-18 baseline, which enhance the
practical usability of ProtoAL as an AI-CAD solution while maintaining a performance level
similar to that of the ProtoPNet model, albeit with reduced training data demands. However,
preliminary experiments with ProtoPNet using a greater number of prototypes resulted in
the repetition of similar or identical prototypes, which is undesirable as it does not promote
prototype diversity. Furthermore, our work is limited to quantitative evaluation using standard
metrics such as AUPRC and F1-Score, which do not assess the interpretability characteristics of
the model.
   Future work could explore how to enhance the integration of interpretability features within the
DAL framework, including leveraging information from prototypes to refine search strategies.
Moreover, investigating how to promote prototype diversity and integrating domain experts into
the evaluation of the results would provide valuable insights and an important qualitative analysis.


Acknowledgments
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível
Superior – Brasil (CAPES) – Finance Code 001 and grant #2022/05788-4, São Paulo Research
Foundation (FAPESP).


References
 [1] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W.
     Nelson, A. Bridgland, et al., Improved protein structure prediction using potentials from
     deep learning, Nature 577 (2020) 706–710.
 [2] D. W. Otter, J. R. Medina, J. K. Kalita, A survey of the usages of deep learning for natural
     language processing, IEEE trans. on neural networks and learning systems 32 (2020)
     604–624.
 [3] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, J. Gao, Deep learning–
     based text classification: a comprehensive review, ACM Computing Surveys (CSUR) 54
     (2021) 1–40.
 [4] S. Minaee, Y. Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, D. Terzopoulos, Image
     segmentation using deep learning: A survey, IEEE trans. on pattern analysis and machine
     intelligence (2021).
 [5] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, R. Qu, A survey of deep learning-based
     object detection, IEEE access 7 (2019) 128837–128868.
 [6] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der
     Laak, B. van Ginneken, C. I. Sánchez, A survey on deep learning in medical image analysis,
     Medical Image Analysis 42 (2017) 60–88. doi:10.1016/j.media.2017.07.005.
 [7] A. Esteva, K. Chou, S. Yeung, N. Naik, A. Madani, A. Mottaghi, Y. Liu, E. Topol, J. Dean,
     R. Socher, Deep learning-enabled medical computer vision, npj Digital Medicine 4 (2021)
     5. doi:10.1038/s41746-020-00376-2.
 [8] A. Vellido, The importance of interpretability and visualization in machine learning for
     applications in medicine and health care, Neural Computing and Applications 32 (2020)
     18069–18083. doi:10.1007/s00521-019-04051-w.
 [9] S. Gaube, H. Suresh, M. Raue, A. Merritt, S. J. Berkowitz, E. Lermer, J. F. Coughlin, J. V.
     Guttag, E. Colak, M. Ghassemi, Do as ai say: susceptibility in deployment of clinical
     decision-aids, NPJ digital medicine 4 (2021) 1–8.
[10] D. W. Bates, D. Levine, A. Syrowatka, M. Kuznetsova, K. J. T. Craig, A. Rui, G. P. Jackson,
     K. Rhee, The potential of artificial intelligence to improve patient safety: a scoping review,
     NPJ digital medicine 4 (2021) 1–8.
[11] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, J. K. Su, This looks like that: deep learning for
     interpretable image recognition, Advances in neural information processing systems 32
     (2019).
[12] S. Mohammadjafari, M. Cevik, M. Thanabalasingam, A. Basar, Using protopnet for inter-
     pretable alzheimer’s disease classification., in: Canadian Conference on AI, 2021.
[13] H. Vaseli, A. N. Gu, S. N. Ahmadi Amiri, M. Y. Tsang, A. Fung, N. Kondori, A. Saadat,
     P. Abolmaesumi, T. S. Tsang, Protoasnet: Dynamic prototypes for inherently interpretable
     and uncertainty-aware aortic stenosis classification in echocardiography, in: International
     Conference on Medical Image Computing and Computer-Assisted Intervention, Springer,
     2023, pp. 368–378.
[14] Y. Wei, R. Tam, X. Tang, Mprotonet: A case-based interpretable model for brain tumor
     classification with 3d multi-parametric magnetic resonance imaging, in: Medical Imaging
     with Deep Learning, PMLR, 2024, pp. 1798–1812.
[15] B. Settles, Active learning literature survey (2009).
[16] Z. Zhao, Z. Zeng, K. Xu, C. Chen, C. Guan, Dsal: Deeply supervised active learning from
     strong and weak labelers for biomedical image segmentation, IEEE journal of biomedical
     and health informatics 25 (2021) 3744–3751.
[17] X. Wu, C. Chen, M. Zhong, J. Wang, J. Shi, Covid-al: The diagnosis of covid-19 with deep
     active learning, Medical Image Analysis 68 (2021) 101913.
[18] V. Nath, D. Yang, H. R. Roth, D. Xu, Warm start active learning with proxy labels and
     selection via semi-supervised fine-tuning, in: International Conference on Medical Image
     Computing and Computer-Assisted Intervention, Springer, 2022, pp. 297–308.
[19] S. Belharbi, I. Ben Ayed, L. McCaffrey, E. Granger, Deep active learning for joint classi-
     fication & segmentation with weak annotator, in: Proceedings of the IEEE/CVF Winter
     Conference on Applications of Computer Vision, 2021, pp. 3338–3347.
[20] M. L. Di Scandalea, C. S. Perone, M. Boudreau, J. Cohen-Adad, Deep active learning for
     axon-myelin segmentation on histology data, arXiv preprint arXiv:1907.05143 (2019).
[21] A. Smailagic, P. Costa, A. Gaudio, K. Khandelwal, M. Mirshekari, J. Fagert, D. Walawalkar,
     S. Xu, A. Galdran, P. Zhang, et al., O-medal: Online active deep learning for medical image
     analysis, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10
     (2020) e1353.
[22] R. Phillips, K. H. Chang, S. A. Friedler, Interpretable active learning, in: S. A. Friedler,
     C. Wilson (Eds.), Proceedings of the 1st Conference on Fairness, Accountability and
     Transparency, volume 81 of Proceedings of Machine Learning Research, PMLR, 2018, pp.
     49–61. URL: https://proceedings.mlr.press/v81/phillips18a.html.
[23] S. Das, M. R. Islam, N. K. Jayakodi, J. R. Doppa, Active anomaly detection via ensembles:
     Insights, algorithms, and interpretability, arXiv preprint arXiv:1901.08930 (2019).
[24] Q. Liu, Z. Liu, X. Zhu, Y. Xiu, Deep active learning by model interpretability, arXiv preprint
     arXiv:2007.12100 (2020).
[25] I. Mondal, D. Ganguly, Alex: Active learning based enhancement of a classification model’s
     explainability, in: Proceedings of the 29th ACM International Conference on Information
     & Knowledge Management, 2020, pp. 3309–3312.
[26] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncer-
     tainty in deep learning, in: international conference on machine learning, PMLR, 2016, pp.
     1050–1059.
[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple
     way to prevent neural networks from overfitting, The journal of machine learning research
     15 (2014) 1929–1958.
[28] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image
     recognition, arXiv preprint arXiv:1409.1556 (2014).
[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Pro-
     ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.
     770–778.
[30] E. Decencière, X. Zhang, G. Cazuguel, B. Lay, B. Cochener, C. Trone, P. Gain, R. Ordonez,
     P. Massin, A. Erginay, et al., Feedback on a publicly distributed image database: the
     messidor database, Image Analysis & Stereology 33 (2014) 231–234.
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
     A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, Interna-
     tional journal of computer vision 115 (2015) 211–252.
[32] S. M. LaValle, M. S. Branicky, S. R. Lindemann, On the relationship between classical grid
     search and probabilistic roadmaps, The International Journal of Robotics Research 23
     (2004) 673–692.