<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Tracin in Semantic Segmentation of Tumor</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Torda</string-name>
          <email>tommaso.torda@uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simona Gargiulo</string-name>
          <email>simona.gargiulo@uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Greta Grillo</string-name>
          <email>greta.grillo@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Ciardiello</string-name>
          <email>andrea.ciardiello@uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cecilia Voena</string-name>
          <email>cecilia.voena@roma1.infn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Giagu</string-name>
          <email>stefano.giagu@uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Scardapane</string-name>
          <email>simone.scardapane@uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute for Nuclear Physics Rome Division</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, thanks to improved computational power and the availability of big data, AI has become a fundamental tool in basic research and industry. Despite this very rapid development, deep neural networks remain black boxes that are difficult to explain. While a multitude of explainability (xAI) methods have been developed, their effectiveness and usefulness in realistic use cases are understudied. This is a major limitation in the application of these algorithms in sensitive fields such as clinical diagnosis, where the robustness, transparency and reliability of the algorithm are indispensable for its use. In addition, the majority of works have focused on feature attribution techniques (e.g., saliency maps), neglecting other interesting families of xAI methods such as data influence methods. The aim of this work is to implement, extend and test, for the first time, data influence functions in a challenging clinical problem, namely the segmentation of brain tumors in Magnetic Resonance Images (MRI). We present a new methodology to calculate an influence score that generalizes to all semantic segmentation tasks where the different labels are mutually exclusive, which is the standard framework for these tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The implementation of Artificial Intelligence (AI) algorithms in the medical domain is
continuously increasing, driven by advances in the AI field both from the algorithm and the
computational power side, in particular in medical image analysis. Several tasks can nowadays
be performed by AI models on medical images, like classification, registration and segmentation.</p>
      <p>
        Medical image segmentation is an essential task for diagnosis, treatment planning and
monitoring in the clinical management of many diseases. This task, which consists of
outlining an organ or a lesion in a medical image, is often performed by a clinician (radiologist),
but Deep Neural Networks (DNN), in particular Convolutional Neural Networks
(CNN) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], have proved their effectiveness. While high-performance AI algorithms can be
developed, the adoption of AI solutions in clinical practice is currently strongly limited by
a lack of trustworthiness, due to the limited transparency of the decision processes and validation
mechanisms of such complex models.
      </p>
      <p>In the medical image analysis domain, there is a wide literature about explainability (xAI)
methods, with most work related to classification tasks (see [2, 3] for recent and comprehensive
reviews). A relevant point that needs to be addressed is how to evaluate the quality of an
explanation. Figures of merit have been identified, such as ease of use, plausibility
(correctness of the explanation and correspondence to what the user expects), faithfulness
(how accurately the explanation reflects the model’s true decision process) and robustness (the effect
of changing some aspects of the DNN model), but a standard does not exist yet. One of the
challenges here is that the explanation must be helpful for end users, in this case radiologists
and clinicians, which makes the problem interdisciplinary. Some guidelines have been proposed, such as
the INTRPRT guidelines [4].</p>
      <p>In this paper we want to address the issue of xAI in the segmentation task performed by a
DNN on multimodal Magnetic Resonance Images (MRI) of brain tumors, one of the leading
causes of death worldwide [5].</p>
      <p>Since segmentation is a localization problem, the application of visual xAI methods is not
straightforward: the saliency maps they generate, which show the importance of each pixel for the
segmentation, are not useful information on their own. An interesting xAI algorithm that has become
very popular over the past years is TracIn [6], which has so far never been applied to segmentation tasks
in medical imaging. TracIn belongs to the class of xAI techniques that approximate the influence
that an example used in the training of the network has on the
predictions made by the model.</p>
      <p>The aim of our work is to implement this technique in a specific clinical problem, the
segmentation of brain tumors in multimodal MRI, and to provide information regarding the
robustness of the algorithm with respect to different training strategies. To this purpose, the
original algorithm is modified, since it was developed for classification tasks.
We consider as reference the BraTS19 dataset<sup>1</sup> and a standard 2D UNet [7].</p>
    </sec>
    <sec id="sec-3">
      <title>2. Material and Methods</title>
      <sec id="sec-3-1">
        <title>2.1. Image dataset</title>
        <p>The brain tumor segmentation challenge, BraTS<sup>2</sup>, is aimed at evaluating state-of-the-art methods
for the segmentation of brain tumors in multimodal MRI. The training dataset for BraTS2019
is composed of 259 cases of high-grade gliomas (HGG) and 76 cases of low-grade gliomas
(LGG), manually annotated by both clinicians and board-certified radiologists. For each patient,
four MRI scans taken with different modalities are provided (T1, T1Gd, T2 and T2-FLAIR), each with a
volume of 240 × 240 × 155 voxels. We focused only on HGG patients, dividing the dataset
into 207 train patients and 52 validation patients. The manual annotation of BraTS19 provides four
classes: the GD-enhancing tumor (ET - label 4), the peritumoral edema (ED - label 2),
the necrotic and non-enhancing tumor core (NCR/NET - label 1) and the background (BKG -
label 0). In the following, we will refer to the GD-enhancing tumor (ET) as label 3 instead of
label 4. Each pixel is exclusively assigned to one of these classes.</p>
        <p><sup>1</sup> https://www.med.upenn.edu/cbica/brats2019/data.html</p>
        <p><sup>2</sup> http://braintumorsegmentation.org/</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Segmentation algorithm</title>
        <p>
          To solve the segmentation task, we chose a popular and well-established neural network, the
UNet for 2D segmentation [7], which is easy to implement and has an extensive literature.
For the 2D UNet we considered the 155 slices along the longitudinal axis separately. We then
took only the central volume of the brain, reducing the number of slices to 10, because influence-based
xAI methods are computationally expensive and we wanted to reduce the computation
time. We also cropped the images in the x-y plane from 240 to 192 pixels. We applied the following
data augmentation transformations: elastic transformation, random crop and mirroring with
a probability of 50%, and then we normalized the intensity to the range [−1, 1].
        </p>
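        <p>The slice selection, cropping and intensity normalization described above can be sketched as follows. This is only an illustration in plain Python: the helper names central_window and rescale_to_unit_range are hypothetical, and both the centering of the 10-slice and 192-pixel windows and the min-max form of the normalization are assumptions, since the text does not specify them.</p>
        <preformat>
```python
def central_window(length, size):
    """Return (start, stop) of a centered window of `size` items on an axis.

    Assumption: the retained slices and pixels are centered on the axis.
    """
    if size > length:
        raise ValueError("window larger than axis")
    start = (length - size) // 2
    return start, start + size


def rescale_to_unit_range(values):
    """Min-max rescale a flat list of intensities to the range [-1, 1].

    Assumption: normalization is a min-max rescaling of each image.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant image: map every pixel to the middle of the range.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# 10 central slices out of the 155 along the longitudinal axis.
slice_window = central_window(155, 10)
# 192 central pixels out of 240 along x and along y.
crop_window = central_window(240, 192)
```
        </preformat>
        <p>With these assumptions each patient contributes 10 slices of 192 × 192 pixels per modality, which is what keeps the later influence computation tractable.</p>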
        <p>As activation function, we used the softmax function. Furthermore, we chose as loss function the mean over the classes of the
Dice coefficient (D). For each class <inline-formula><tex-math>i</tex-math></inline-formula>, D is defined as
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>D_i = \frac{2\,|P_i \cap GT_i|}{|P_i| + |GT_i|}.</tex-math>
          </disp-formula>
This metric measures the overlap between the prediction mask (P) and the ground truth (GT), and it is
necessary when the region of interest is smaller than the background area. From the computational
point of view, the implementation uses the Soft Dice, defined as
          <disp-formula id="eq2">
            <label>(2)</label>
            <tex-math>D_i = \frac{2 \sum P_i * GT_i}{\sum P_i^2 + \sum GT_i^2 + \epsilon},</tex-math>
          </disp-formula>
where the sums run over all the pixels of the mask, the product is taken pixel by pixel, and <inline-formula><tex-math>\epsilon</tex-math></inline-formula> is
a small arbitrary parameter that avoids NaN values. The Dice score for each separate label
is also used as a metric to evaluate the performance of the trained network. The complete
loss function then becomes
          <disp-formula id="eq3">
            <label>(3)</label>
            <tex-math>\mathcal{L} = 1 - \frac{1}{3} \sum_{i \in [1,3]} D_i,</tex-math>
          </disp-formula>
where <inline-formula><tex-math>i</tex-math></inline-formula> is the class index, and the background is excluded from the loss function calculation.</p>
        <p>We used the Adam optimizer with a learning rate of 1e-4 and a weight decay of 1e-5 for the first 5
epochs, and then switched to stochastic gradient descent (SGD) for the remaining epochs. In total we
trained the model for 30 epochs, with a train batch size of 10. Results of the training
process are provided in Table 1.</p>
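        <p>The Soft Dice and the loss defined above can be sketched in a few lines of plain Python. This is an illustration rather than the training code: the function names soft_dice and dice_loss are hypothetical, masks are assumed to be flattened to lists with one value per pixel, and the value of eps is an arbitrary choice.</p>
        <preformat>
```python
def soft_dice(pred, target, eps=1e-6):
    """Soft Dice for one class.

    pred   : flat list of softmax probabilities for the class, one per pixel
    target : flat list of 0/1 ground-truth values for the same pixels
    eps    : small constant (assumed value) keeping the ratio finite
    """
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(g * g for g in target) + eps
    return 2.0 * intersection / denominator


def dice_loss(pred_by_class, target_by_class, eps=1e-6):
    """One minus the mean Soft Dice over the foreground classes.

    The background class is expected to be left out of both arguments.
    """
    scores = [soft_dice(p, g, eps) for p, g in zip(pred_by_class, target_by_class)]
    return 1.0 - sum(scores) / len(scores)


# Toy example with 4 pixels and the 3 foreground classes.
prediction = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.8, 0.1, 0.0], [0.1, 0.1, 0.9, 0.0]]
ground_truth = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
loss = dice_loss(prediction, ground_truth)
```
        </preformat>
        <p>A perfect prediction gives a Soft Dice very close to 1 for every class and therefore a loss close to 0, while disjoint masks give a Soft Dice of 0.</p>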
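        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Dice score for each class on the train (Train) and validation (Val) sets.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Class</th><th>Train</th><th>Val</th></tr>
            </thead>
            <tbody>
              <tr><td>NCR/NET</td><td>0.90</td><td>0.77</td></tr>
              <tr><td>ED</td><td>0.93</td><td>0.87</td></tr>
              <tr><td>ET</td><td>0.93</td><td>0.87</td></tr>
            </tbody>
          </table>
        </table-wrap>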
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Explainability algorithms</title>
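        <sec id="sec-3-3-1">
          <title>2.3.1. TracIn</title>
          <p>The idea of influence-based xAI methods is to estimate the effect on the loss function <inline-formula><tex-math>\mathcal{L}</tex-math></inline-formula> of removing a train example
<inline-formula><tex-math>\bar{z}</tex-math></inline-formula> from the dataset. Pruthi et al. [6] implemented an influence function
that monitors loss changes during training and involves only first-order derivatives of <inline-formula><tex-math>\mathcal{L}</tex-math></inline-formula>. The
final equation they obtained, assuming stochastic gradient descent, i.e. <inline-formula><tex-math>w_{t+1} = w_t - \eta_t \nabla\mathcal{L}(w_t, z)</tex-math></inline-formula>,
where <inline-formula><tex-math>w_t</tex-math></inline-formula> are the parameters of the network at the epoch <inline-formula><tex-math>t</tex-math></inline-formula>, is
            <disp-formula id="eq4">
              <label>(4)</label>
              <tex-math>\mathrm{TracIn}(z, z') = \frac{1}{b} \sum_{k \in \mathcal{K}} \eta_k \, \nabla\mathcal{L}(w_k, z') \cdot \nabla\mathcal{L}(w_k, z),</tex-math>
            </disp-formula>
where <inline-formula><tex-math>k \in \mathcal{K}</tex-math></inline-formula> are the checkpoints, <inline-formula><tex-math>z</tex-math></inline-formula> and <inline-formula><tex-math>z'</tex-math></inline-formula> are a train and a test example respectively, <inline-formula><tex-math>b</tex-math></inline-formula> is the batch size and <inline-formula><tex-math>\eta_k</tex-math></inline-formula>
is the learning rate. We call opponent an example that has a negative influence score, and
proponent an example that has a positive influence score. We chose the checkpoints
after the transition to SGD, and considered the first 10 epochs for the calculation of TracIn.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>2.3.2. Extended methodology for segmentation</title>
          <p>TracIn was originally proposed for classification tasks; we now generalize it to segmentation
tasks.</p>
          <p>We first consider <inline-formula><tex-math>\mathcal{L}</tex-math></inline-formula> as defined in (3). When we evaluate the scalar product between the gradients
of <inline-formula><tex-math>\mathcal{L}</tex-math></inline-formula> as in (4), we obtain
            <disp-formula id="eq5">
              <label>(5)</label>
              <tex-math>\mathrm{TracIn}(z, z') \approx \sum_{(i,j) \in [1,3]} \nabla D_i(z) \cdot \nabla D_j(z'),</tex-math>
            </disp-formula>
where <inline-formula><tex-math>D_i</tex-math></inline-formula> is the Dice coefficient corresponding to the class <inline-formula><tex-math>i</tex-math></inline-formula>.</p>
          <p>Since we do not want to mix influence contributions from pixels belonging to different classes,
we decided to compute TracIn for individual labels,
            <disp-formula id="eq6">
              <label>(6)</label>
              <tex-math>\mathrm{TracIn}_{ij}(z, z') \approx \nabla D_i(z) \cdot \nabla D_j(z').</tex-math>
            </disp-formula>
In this way we obtain 3 × 3 matrices of TracIn scores: for <inline-formula><tex-math>i = j</tex-math></inline-formula> we compute the influence score between
the same class of the two examples, and for <inline-formula><tex-math>i \neq j</tex-math></inline-formula> the influence between different classes. A
precondition that must be met is that the segmentation classes are mutually exclusive.
However, this is not a strong constraint, because it is the standard framework for this type of
task.</p>
          <p>In addition, for each class we consider only the regions where the NN output is above a certain threshold (0.8),
in order to further reduce the effect of averaging heterogeneous pixels together.</p>
        </sec>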
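        <p>As an illustration of the class-resolved score of (6), the sketch below builds the 3 × 3 matrix for one pair of examples. It is not the implementation used in this work: the name tracin_class_matrix is hypothetical, the per-class gradients of the Dice coefficients are assumed to be precomputed and flattened by an automatic differentiation framework, and combining (6) with the checkpoint sum, learning rate and batch size of (4) is an assumption.</p>
        <preformat>
```python
def dot(u, v):
    """Scalar product of two flattened gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def tracin_class_matrix(train_grads, test_grads, learning_rates, batch_size):
    """Class-resolved TracIn between one train and one test example.

    train_grads[k][i] : gradient of the Dice of class i at checkpoint k,
                        evaluated on the train example
    test_grads[k][j]  : gradient of the Dice of class j at checkpoint k,
                        evaluated on the test example
    Returns matrix[i][j] = (1 / b) * sum over k of eta_k * grad_i . grad_j
    (assumed combination of the per-class score with the checkpoint sum).
    """
    n_classes = len(train_grads[0])
    matrix = [[0.0] * n_classes for _ in range(n_classes)]
    for eta, g_train, g_test in zip(learning_rates, train_grads, test_grads):
        for i in range(n_classes):
            for j in range(n_classes):
                matrix[i][j] += eta * dot(g_train[i], g_test[j]) / batch_size
    return matrix


# Toy call: one checkpoint, three classes, two network parameters.
scores = tracin_class_matrix(
    train_grads=[[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]],
    test_grads=[[[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]],
    learning_rates=[0.5],
    batch_size=2,
)
```
        </preformat>
        <p>The diagonal entries of the returned matrix compare the same class in the two examples, while the off-diagonal entries give the mixed influences; a positive entry marks a proponent and a negative entry an opponent.</p>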
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Consistency tests</title>
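        <p>After defining a methodology for calculating TracIn for segmentation tasks, some consistency
tests are carried out to verify the goodness of the algorithm.</p>
        <sec id="sec-3-4-1">
          <title>2.4.1. Self-influence</title>
          <p>One of the main checks in the case of TracIn is the self-influence matrix of the train
dataset against itself. Self-influence is a matrix in which the TracIn scores of train examples against
themselves are reported. We defined the normalized self-influence as
            <disp-formula id="eq7">
              <label>(7)</label>
              <tex-math>\mathrm{SI}_{ij}(z, z') = \frac{\mathrm{TracIn}_{ij}(z, z')}{\sqrt{\mathrm{TracIn}_{ij}(z, z) \times \mathrm{TracIn}_{ij}(z', z')}},</tex-math>
            </disp-formula>
where <inline-formula><tex-math>\mathrm{SI}_{ij}(z, z')</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{TracIn}_{ij}(z, z')</tex-math></inline-formula> are respectively the self-influence and the TracIn score
between a pair of train examples <inline-formula><tex-math>z</tex-math></inline-formula> and <inline-formula><tex-math>z'</tex-math></inline-formula> for the Dice coefficients <inline-formula><tex-math>D_i</tex-math></inline-formula> and <inline-formula><tex-math>D_j</tex-math></inline-formula>.</p>
          <p>We expect each train example to be the main proponent of itself, and thus higher values of SI on the
diagonal of the self-influence matrix.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>2.4.2. Robustness test</title>
          <p>Robustness tests are proposed to check the stability of the xAI method with respect to small
variations in both statistics and training strategy [8].</p>
          <p>Several metrics can be adopted to study the robustness of the algorithm. In
general, let us call <inline-formula><tex-math>E_i(z)</tex-math></inline-formula> the i-th explanation for the example <inline-formula><tex-math>z</tex-math></inline-formula>; then we can test the Robustness
(R) of the explanation as
            <disp-formula id="eq8">
              <label>(8)</label>
              <tex-math>R = \frac{1}{N} \sum_{i \neq j} s\left[E_i(z), E_j(z)\right],</tex-math>
            </disp-formula>
where <inline-formula><tex-math>N</tex-math></inline-formula> is a normalization and <inline-formula><tex-math>s</tex-math></inline-formula> can be any similarity metric; in our case we chose the cosine
similarity <inline-formula><tex-math>s(a, b) = \frac{a^{\top} b}{\lVert a \rVert_2 \cdot \lVert b \rVert_2}</tex-math></inline-formula>. In our case <inline-formula><tex-math>E_i(z) = \mathrm{TracIn}_i(z, \mathcal{D})</tex-math></inline-formula>,
where <inline-formula><tex-math>\mathcal{D}</tex-math></inline-formula> is the train dataset.</p>
          <p><bold>Statistical robustness.</bold> We repeat the training of the network 3 times, leaving parameters and hyperparameters
unchanged. We produce 3 explanation vectors <inline-formula><tex-math>E_i(z)</tex-math></inline-formula> for the same dataset and, using (8), we
evaluate the statistical robustness of TracIn.</p>
          <p><bold>Small dataset variation robustness.</bold> We use the method of k-fold cross-validation. We divide the dataset into 3 groups, as shown
in Figure 1. Then we produce the explanation vector <inline-formula><tex-math>E_i(z)</tex-math></inline-formula> for each of the 3 different trainings.
Applying (8) only on the intersection between pairs of explanations <inline-formula><tex-math>E_i(z), E_j(z)</tex-math></inline-formula>, we evaluate the
robustness of TracIn with respect to small variations of the dataset.</p>
          <p><bold>Transformations robustness.</bold> Take a neural network that is invariant with respect to the
symmetry group <inline-formula><tex-math>\mathcal{G}</tex-math></inline-formula> and an explanation <inline-formula><tex-math>E(z)</tex-math></inline-formula>. We can measure the invariance of <inline-formula><tex-math>E(z)</tex-math></inline-formula>
under <inline-formula><tex-math>\mathcal{G}</tex-math></inline-formula>
using (8). Assume that an element <inline-formula><tex-math>z</tex-math></inline-formula> of the dataset transforms under the application of <inline-formula><tex-math>g</tex-math></inline-formula> as
<inline-formula><tex-math>z' = \rho(g) z</tex-math></inline-formula>, where <inline-formula><tex-math>\rho(g)</tex-math></inline-formula>
is a representation of the element <inline-formula><tex-math>g \in \mathcal{G}</tex-math></inline-formula>. If <inline-formula><tex-math>E(\rho(g) z) \approx E(z)</tex-math></inline-formula>, then <inline-formula><tex-math>E</tex-math></inline-formula> is
invariant.</p>
          <p>We tested the invariance of TracIn using 2 symmetries that are commonly used in training neural
networks in biomedical tasks: elastic deformation and mirroring.
After training the neural network, we produce an explanation for each transformation by taking
the train dataset and applying that transformation to it. Then we evaluate the robustness
using (8).</p>
        </sec>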
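        <p>The two quantities used in these consistency tests, the normalized self-influence of (7) and the robustness score of (8) with cosine similarity, can be sketched in plain Python as follows. The function names are hypothetical, explanation vectors are assumed to be non-zero lists of TracIn scores of equal length, and taking the normalization N to be the number of unordered pairs is an assumption, since the text only calls it a normalization.</p>
        <preformat>
```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """s(a, b) = a . b / (||a||_2 * ||b||_2) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def robustness(explanations):
    """Mean pairwise cosine similarity between explanation vectors.

    Each unordered pair of explanations is counted once, so N is the
    number of pairs (assumed normalization).
    """
    pairs = list(combinations(explanations, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def normalized_self_influence(scores):
    """Divide scores[a][b] by sqrt(scores[a][a] * scores[b][b]).

    `scores` is a square matrix of TracIn scores among train examples for
    a fixed pair of classes; its diagonal is assumed to be positive.
    """
    n = len(scores)
    return [
        [scores[a][b] / math.sqrt(scores[a][a] * scores[b][b]) for b in range(n)]
        for a in range(n)
    ]


# Three explanation vectors, e.g. from three repeated trainings.
runs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
r = robustness(runs)
si = normalized_self_influence([[4.0, 2.0], [2.0, 9.0]])
```
        </preformat>
        <p>For the k-fold test, the explanation vectors would first be restricted to the train examples shared by the two trainings being compared, so that the vectors passed to cosine_similarity have the same length and ordering.</p>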
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Results</title>
      <p>Figure 1: Splitting of the dataset into Train, Val and Test subsets, used to test the robustness of the explanation with respect to a small variation of 10% on the dataset.</p>
      <p><bold>Self-influence.</bold> The normalized self-influence defined in (7) collects a lot of important information. First, we
expect the matrix to have a bright diagonal, as each example must be a strong proponent for
the prediction of itself.</p>
      <p>The two panels of Figure 2 are obtained by ordering the train examples so that the
slices of the same patient are adjacent. The left panel is obtained using the
original TracIn calculation (Eq. 5), where the full loss function is considered, i.e., the sum over
all classes. What we expect to see is a block matrix with 10 × 10 blocks (10 being the number of slices we consider for
each patient). The influence matrix we observe presents a bright diagonal, but there are
contributions of the same order of magnitude among all other patients. This means that, on
average, every patient contributes equally to the prediction of the others.</p>
      <p>The right panel is instead obtained by implementing our methodology (Eq. 6). In
this case the first block of 2070 × 2070 influences corresponds to pixels belonging to label
1 of the patients sorted as described above, the second block contains the mixed influences
between labels 1 and 2 of the same list of patients, and so on. The clustering that emerges
is not sensitive to local differences between patients, but only to global differences between
classes. In fact, in the blocks along the diagonal, which correspond to
TracIn evaluated among the same tumor classes, we have stereotyped matrices, which
exhibit an influence that is homogeneous in modulus and sign regardless of the example chosen.
Moving off the diagonal, the modulus of the influence decreases, indicating that different
regions have a lower influence on each other. However, we observe noise between labels 1 and 3,
which means that the predictions on these regions are not completely decorrelated.</p>
      <p><bold>Robustness test.</bold> We first checked the statistical correlation between the influence vectors
produced by repeating the training while keeping the same parameters and hyperparameters.
The average correlation for the individual Dice coefficients is then used as normalization for the k-fold
cross-validation.</p>
      <p>Significant statistical variation is observed for label 3 in each test in Figure 3. A small variation
of 10% on the dataset has comparable effects on the robustness of the explanation. However,
what we notice is that TracIn is not robust to repeated trainings (Table 3).</p>
      <p>This might seem surprising, but it is related to what was seen above in the case of the
self-influence matrix. Under the assumption that TracIn is a faithful explanation of the network
decision model, the fact that the explanation vector is not robust to repeated trainings has a direct
interpretation. At each training the network is initialized at a random point in parameter space, the
training proceeds stochastically (the batches and the optimization path change each time), and,
at the end of each training cycle, we end up in a different, and hopefully equivalent, minimum of
the loss landscape. As we saw for the self-influence matrix, examples
belonging to the same class have a stereotypical influence. This means that each prediction of the
network is homogeneously influenced by any other example we have used, since they carry a
comparable amount of information that can be extracted. The non-robustness of the explanation
means that we are looking at contour lines in the loss landscape that are essentially
flat, where each trajectory is analogous to all others. Thus, the explanation
vector changes with each iteration, since no train example is more important than the others
for reaching the optimal minimum during the minimization process. What remains truly robust
is the influence that the different classes have on each other, as can be seen in Figure 4.</p>
      <p>Regarding the invariance of TracIn, we observe that when the neural network is invariant
with respect to a symmetry group <inline-formula><tex-math>\mathcal{G}</tex-math></inline-formula>, TracIn also respects such invariance (Figure 5 and Table 3). The
results are in agreement with what was previously observed in [8].</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Robustness of TracIn under popular transformations. Mean and standard deviation of the cosine similarity.</p>
        </caption>
        <table>
          <thead>
            <tr><th></th><th>Original vs Elastic deformation</th><th>Original vs Mirroring</th></tr>
          </thead>
          <tbody>
            <tr><td>Cosine similarity</td><td>0.93 ± 0.18</td><td>0.95 ± 0.12</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>In this paper we proposed an extended methodology to implement TracIn in a specific problem,
multiclass brain tumor segmentation in MRI. The methodology is, however, generalizable to all segmentation
problems where classes are mutually exclusive. We found that an explanation cannot be given
between examples (slices) of the same class, but only by separating the analysis by distinct
classes and reducing the number of pixels in each example. This is similar to the information
one can gather from a saliency map, where the most influential pixels for predicting a class are
those belonging to the same class.</p>
      <p>We then analyzed the robustness of the algorithm in several frameworks: first by re-training
the network while leaving the parameters and hyperparameters unchanged, then by changing 10%
of the dataset. In this case we concluded that TracIn is not robust to different trainings. Under
the assumption that TracIn is a faithful explanation of the network decision model, what we
are observing is a flat loss landscape, where several trajectories turn out to be equivalent for
reaching the minimum. What really remains informative, then, are the differences between
different classes, whereas, because of the nature of segmentation, differences within the same class
not only become irrelevant but, in general and even for different tasks, are not robust. Leaving the
network unchanged, and instead studying the robustness toward certain transformations, we
verified the invariance of TracIn with respect to them.</p>
      <p>Like other post-hoc visual explainability techniques, TracIn also suffers from a difficulty in
interpreting its results.</p>
      <p>Future work should focus not only on giving an explanation on the basis of influences but also
on giving information regarding the faithfulness of the algorithm.</p>
      <ref-list>
        <ref id="ref2">
          <mixed-citation>[2] B. H. van der Velden, H. J. Kuijf, K. G. Gilhuijs, M. A. Viergever, Explainable artificial intelligence (XAI) in deep learning-based medical image analysis, Medical Image Analysis 79 (2022) 102470. URL: https://doi.org/10.1016%2Fj.media.2022.102470. doi:10.1016/j.media.2022.102470.</mixed-citation>
        </ref>
        <ref id="ref3">
          <mixed-citation>[3] K. Borys, Y. A. Schmitt, M. Nauta, C. Seifert, N. Krämer, C. M. Friedrich, F. Nensa, Explainable AI in medical imaging: An overview for clinical practitioners – beyond saliency-based XAI approaches, European Journal of Radiology 162 (2023) 110786. URL: https://www.sciencedirect.com/science/article/pii/S0720048X23001006. doi:10.1016/j.ejrad.2023.110786.</mixed-citation>
        </ref>
        <ref id="ref4">
          <mixed-citation>[4] H. Chen, C. Gomez, C.-M. Huang, M. Unberath, Explainable medical imaging AI needs human-centered design: Guidelines and evidence from a systematic review, 2022. arXiv:2112.12596.</mixed-citation>
        </ref>
        <ref id="ref5">
          <mixed-citation>[5] A. Wadhwa, A. Bhardwaj, V. Singh Verma, A review on brain tumor segmentation of MRI images, Magnetic Resonance Imaging 61 (2019) 247–259. URL: https://www.sciencedirect.com/science/article/pii/S0730725X19300347. doi:10.1016/j.mri.2019.05.043.</mixed-citation>
        </ref>
        <ref id="ref6">
          <mixed-citation>[6] G. Pruthi, F. Liu, M. Sundararajan, S. Kale, Estimating training data influence by tracing gradient descent, NeurIPS 2020 (2020). arXiv:2002.08484.</mixed-citation>
        </ref>
        <ref id="ref7">
          <mixed-citation>[7] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, 2015. arXiv:1505.04597.</mixed-citation>
        </ref>
        <ref id="ref8">
          <mixed-citation>[8] J. Crabbé, M. van der Schaar, Evaluating the robustness of interpretability methods through explanation invariance and equivariance, 2023. arXiv:2304.06715.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          , J. Malik,
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          ,
          <year>2014</year>
          . arXiv:1311.2524.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>