<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Validation of ML Models from the Field of XAI for Computer Vision in Autonomous Driving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Mastroianni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sibylle D. Sager-Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lucerne University of Applied Sciences and Arts (Hochschule Luzern)</institution>
          ,
          <addr-line>Suurstoffi 1, 6343 Rotkreuz</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We evaluate Explainable Artificial Intelligence (XAI) methods for image segmentation in autonomous driving applications. The analysis is conducted using metrics covering efficiency, robustness, localization, and complexity. Four XAI methods, namely Gradient-weighted Class Activation Mapping (GradCAM), Local Interpretable Model Agnostic Explanations (LIME), Feature Ablation, and Saliency, are applied and assessed on a dataset of street images.</p>
      </abstract>
      <kwd-group>
        <kwd>Validation</kwd>
        <kwd>Metrics</kwd>
        <kwd>Captum</kwd>
        <kwd>XAI Methods</kwd>
        <kwd>Segmentation</kwd>
        <kwd>Computer Vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The dataset providing street images with ground truth masks is Cityscapes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Images from Zurich, Switzerland, were selected from this
dataset. For the segmentation, pre-trained models from PyTorch were used [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Segmentation Models</title>
      <p>
        To evaluate the segmentation of street elements, the pre-trained models DeepLabV3 and the
Fully Convolutional Network (FCN), both with a ResNet101 backbone, from PyTorch were
analyzed. These models were trained on a subset of COCO, using the 20 segmentation classes
that are present in Pascal VOC. Pre-trained models were chosen because these 20 classes
include relevant street elements such as bicycles, buses, cars, motorbikes, people, and
trains, and because of their high average Intersection over Union (IoU) scores, which are
listed on the PyTorch website [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
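      <p>As a minimal sketch (the exact loading code is not given in the paper), the two models can be obtained from torchvision as follows; the variable img (a PIL street image) and the choice of the "DEFAULT" weights are assumptions:</p>
      <preformat>
import torch
from torchvision import models, transforms

# Load both pre-trained segmentation models in evaluation mode.
deeplab = models.segmentation.deeplabv3_resnet101(weights="DEFAULT").eval()
fcn = models.segmentation.fcn_resnet101(weights="DEFAULT").eval()

# Standard ImageNet preprocessing documented by torchvision for these weights.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

input_tensor = preprocess(img).unsqueeze(0)  # (1, 3, H, W)

# Both models return a dict; 'out' holds per-class logits of shape
# (1, 21, H, W): 20 Pascal VOC classes plus background.
with torch.no_grad():
    logits = deeplab(input_tensor)["out"]
pred_classes = logits.argmax(dim=1)  # (1, H, W) predicted class mask
      </preformat>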
      <p>For the evaluation of the segmentation models, two metrics were employed: the IoU
value and the Matching Area. The evaluation, shown in Table 1, involved calculating the
average values of 122 images from Zurich for both metrics.</p>
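      <p>A minimal sketch of both metrics on boolean per-class masks follows. The IoU is standard; the Matching Area shown here (intersection divided by the ground truth area) is only an assumption, since the paper does not define the metric formally:</p>
      <preformat>
import torch

def iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # Intersection over Union of two boolean masks for one class.
    inter = torch.logical_and(pred, gt).sum().item()
    union = torch.logical_or(pred, gt).sum().item()
    return inter / union if union > 0 else 0.0

def matching_area(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # Hypothetical definition: fraction of the ground truth mask covered
    # by the predicted segmentation.
    inter = torch.logical_and(pred, gt).sum().item()
    area = gt.sum().item()
    return inter / area if area > 0 else 0.0
      </preformat>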
    </sec>
    <sec id="sec-3">
      <title>3. XAI Methods</title>
      <p>
        Various XAI methods can be utilized. These methods are classified into two categories:
model-specific and model-agnostic. Model-specific methods rely on the internals of a particular
model class to analyze how changes in the input features change the output, whereas
model-agnostic methods treat the model as a black box, manipulating the input data and
analyzing the resulting predictions. Within both categories, a further distinction is made
between local and global methods: local methods explain individual predictions of a model,
while global methods explain its overall behavior [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this evaluation, a total of four XAI methods
are applied. The selection of methods was partially based on Munn and Pitman [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
authors dedicate a chapter of their book to explainability for image data, in which the
methods GradCAM and LIME are introduced; this was one reason for choosing these two
methods. Two further methods were selected: Feature Ablation and Saliency.
      </p>
      <p>
        Gradient-weighted Class Activation Mapping (GradCAM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a technique that
analyzes gradient information for any convolutional layer of a model and generates a
heatmap that highlights important regions in the image.
      </p>
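      <p>A hedged sketch of GradCAM with Captum's LayerGradCam, following the pattern of Captum's semantic segmentation tutorial; the wrapper that sums the per-class logits over all pixels to obtain one scalar score per class is an assumption about how the segmentation output was reduced:</p>
      <preformat>
from captum.attr import LayerGradCam, LayerAttribution

def class_scores(inp):
    out = deeplab(inp)["out"]   # (N, 21, H, W) logits
    return out.sum(dim=(2, 3))  # (N, 21): one scalar score per class

CAR, PERSON = 7, 15  # Pascal VOC class indices for "car" and "person"

gradcam = LayerGradCam(class_scores, deeplab.backbone.layer4)
attr = gradcam.attribute(input_tensor, target=CAR)

# The layer attribution is coarse; upsample it to the input resolution.
attr = LayerAttribution.interpolate(attr, input_tensor.shape[2:])
      </preformat>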
      <p>
        Local Interpretable Model Agnostic Explanations (LIME) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] involves multiple
iterations of removing specific regions of an image to determine which specific areas are
more or less important. It is a model-agnostic, perturbation-based method.
      </p>
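      <p>A sketch of LIME with Captum's Lime class; the superpixel segmentation via skimage's slic, the variable img_np (the image as an (H, W, 3) array), and the sample count are assumptions:</p>
      <preformat>
import torch
from captum.attr import Lime
from skimage.segmentation import slic

# Group pixels into superpixels so that LIME perturbs interpretable regions.
seg = slic(img_np, n_segments=100)
segments = torch.from_numpy(seg).unsqueeze(0).unsqueeze(0)  # (1, 1, H, W)

lime = Lime(class_scores)
attr = lime.attribute(
    input_tensor,
    target=CAR,
    feature_mask=segments,  # remove whole superpixels, not single pixels
    n_samples=200,          # perturbed samples used to fit the surrogate
)
      </preformat>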
      <p>
        Feature Ablation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is also a perturbation-based method. It computes the attribution of each
feature as the difference in the model output when the feature is present and when it is
replaced by a baseline. Multiple features can be ablated together instead of one at a time.
      </p>
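      <p>A sketch with Captum's FeatureAblation, reusing the superpixel mask from the LIME snippet; grouping pixels keeps the number of forward passes manageable (one per superpixel rather than one per pixel):</p>
      <preformat>
from captum.attr import FeatureAblation

ablation = FeatureAblation(class_scores)
attr = ablation.attribute(
    input_tensor,
    target=CAR,
    feature_mask=segments,     # turn off whole superpixels together
    perturbations_per_eval=8,  # batch several ablations per forward pass
)
      </preformat>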
      <p>
        Saliency [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is a method that computes the gradient of a class score with respect to the input
using backpropagation. The gradient indicates how a minimal change in each pixel would
change the predicted class score.
      </p>
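      <p>A sketch with Captum's Saliency class, reusing the class_scores wrapper defined above:</p>
      <preformat>
from captum.attr import Saliency

saliency = Saliency(class_scores)
# The attribution of each pixel is the absolute gradient of the target
# class score with respect to that pixel.
attr = saliency.attribute(input_tensor.requires_grad_(), target=CAR, abs=True)
      </preformat>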
      <p>In the implementation of XAI methods, an image representing a street scene was
selected. The following two segmentation classes were examined: cars and persons. The
calculated attribution values from the methods were normalized and a heatmap was
generated and overlaid on the original image, as shown in Figure 1. While the original image
had dimensions of 2048 x 1024 pixels, it was resized to 512 x 256 pixels to enhance
computational speed.</p>
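      <p>A minimal sketch of the normalization and overlay step with matplotlib; the variable resized_img (the 512 x 256 street image) and the colormap are assumptions:</p>
      <preformat>
import matplotlib.pyplot as plt

heat = attr.squeeze(0).sum(dim=0).detach()  # collapse channels to (H, W)
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # to [0, 1]

plt.imshow(resized_img)                          # 512 x 256 street scene
plt.imshow(heat.numpy(), cmap="jet", alpha=0.5)  # heatmap overlay
plt.axis("off")
plt.show()
      </preformat>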
    </sec>
    <sec id="sec-4">
      <title>4. Metrics</title>
      <p>
        Hedström et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] compare various libraries of evaluation metrics for XAI methods. The
metrics are divided into six categories: Faithfulness, Robustness, Localisation, Complexity,
Axiomatic, and Randomisation. Three metrics based on these categories were used to
evaluate the XAI methods, and an additional efficiency metric was employed. The selection
of these metrics was guided by the idea of testing and understanding metrics from different
categories but was otherwise somewhat arbitrary.
      </p>
      <p>To measure the efficiency of a method, the runtime was recorded. The time taken to
compute the attributions for the "Car" and "Person" classes was measured.</p>
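      <p>A minimal sketch of this measurement; wall-clock timing with time.perf_counter is an assumption, as the paper does not state how the runtime was recorded:</p>
      <preformat>
import time

start = time.perf_counter()
gradcam.attribute(input_tensor, target=CAR)
gradcam.attribute(input_tensor, target=PERSON)
runtime_seconds = time.perf_counter() - start
      </preformat>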
      <p>
        To calculate robustness, the average sensitivity [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] was applied. Zhou et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
describe that models usually do not adapt well to new environments when new factors such
as weather or illumination conditions are introduced. For this reason, the original images
were modified to various brightness levels while preserving the objects in the image. For
calculating the average sensitivity, the following formula (1) was used:
      </p>
      <p>AvgSens = (1/N) Σ_{i=1}^{N} | ā_orig − ā_i |   (1)

where ā_orig is the average attribution of the original image, ā_i the average attribution at
brightness level i, and N the number of brightness levels.</p>
      <p>To obtain comparable values, the average sensitivity was calculated using the same
method for the "Car" and "Person" classes. A low value of average sensitivity indicates good
robustness, where 0 represents the lowest and 1 the highest possible value. Figure 2
illustrates an example of how the average attribution may appear at different brightness
levels. Good robustness is ensured when the average attribution remains equal to or only
slightly deviates from that of the original image.</p>
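      <p>A sketch of formula (1) under stated assumptions: the brightness factors and the use of torchvision's adjust_brightness on the unnormalized image are our choices, and preprocess_norm denotes the Normalize transform applied before the model input:</p>
      <preformat>
import torch
import torchvision.transforms.functional as TF

def _avg_attr(method, x, target):
    a = method.attribute(x, target=target)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)  # normalize to [0, 1]
    return a.mean()

def avg_sensitivity(method, img01, target, factors=(0.6, 0.8, 1.2, 1.4)):
    # img01: unnormalized (3, H, W) tensor with values in [0, 1].
    base = _avg_attr(method, preprocess_norm(img01).unsqueeze(0), target)
    diffs = []
    for f in factors:
        bright = TF.adjust_brightness(img01, f)  # preserves the objects
        a = _avg_attr(method, preprocess_norm(bright).unsqueeze(0), target)
        diffs.append((base - a).abs())
    return torch.stack(diffs).mean().item()
      </preformat>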
      <p>
        Next, we consider the metric Relevance Rank Accuracy (RRA) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which falls in the
category of Localization. This metric measures how many of the highly attributed pixels are
located within the ground truth mask. In our evaluation, highly attributed pixels are defined
as those falling within the top 20% range. This means when attributions range from 0 to 1,
values from 0.8 to 1 are marked as highly attributed. An example of calculating the RRA is
illustrated in Figure 3. Besides the RRA itself, the False Positive (FP) rate of the RRA is an
important indicator: a method might mark every pixel in the image as highly important,
resulting in an RRA of 1, the highest possible value. Therefore, to achieve more comparable
results, the FP RRA was additionally examined.
      </p>
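      <p>A sketch of the RRA and its FP counterpart as described above; the FP RRA shown here (the fraction of highly attributed pixels outside the ground truth mask) is our reading of the text:</p>
      <preformat>
import torch

def rra_and_fp(attr_map: torch.Tensor, gt_mask: torch.Tensor, top=0.2):
    # Normalize attributions to [0, 1] and mark the top 20% of the range.
    a = (attr_map - attr_map.min()) / (attr_map.max() - attr_map.min() + 1e-8)
    high = a >= 1.0 - top  # attributions in [0.8, 1.0]
    n_high = high.sum().item()
    if n_high == 0:
        return 0.0, 0.0
    rra = torch.logical_and(high, gt_mask).sum().item() / n_high
    fp = torch.logical_and(high, torch.logical_not(gt_mask)).sum().item() / n_high
    return rra, fp
      </preformat>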
      <p>
        The sparseness [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] was measured as the last metric, falling in the category of
Complexity. Here, the Gini index (G) is used for calculation. In our case, G indicates how
scattered or concentrated the attributions are distributed across an image. A value of 0 (the
smallest value) indicates maximal dispersion, where every pixel is equally important for the
segmentation. A value of 1 (the largest value) indicates that the important attributions are
concentrated on few pixels. For calculating the Gini index, the following formula (2) was used [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]:
G = ( 2 Σ_{k=1}^{n} k · f(k) ) / ( n Σ_{k=1}^{n} f(k) ) − (n + 1)/n   (2)
      </p>
      <p>Let k be the index of an attribution in the array sorted in ascending order, f(k) be the
value of the attribution at position k, and n be the number of pixels in the image. The higher
the Gini index, the better: a lower value indicates that the importance of all pixels is equal.
Our images consist of distinct objects, and only the pixels representing an object should be
marked as important, which favors a higher G value.</p>
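      <p>A sketch of formula (2); note that the attributions must be sorted in ascending order before the sums are taken:</p>
      <preformat>
import torch

def gini(attr_map: torch.Tensor) -> float:
    v, _ = attr_map.flatten().abs().sort()  # f(k), ascending order
    n = v.numel()
    k = torch.arange(1, n + 1, dtype=v.dtype, device=v.device)
    return (2 * (k * v).sum() / (n * v.sum()) - (n + 1) / n).item()
      </preformat>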
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation of XAI methods</title>
      <p>In the evaluation, the metrics of average sensitivity, RRA and sparseness were calculated
for the targets Car and Person. The target is a parameter that can be selected in every
method. The first number in Table 2 represents the Car class, while the second number
represents the Person class.</p>
      <p>In terms of efficiency, GradCAM clearly outperforms all other XAI methods, followed by
Saliency, then Feature Ablation and LIME. In terms of robustness, measured by average
sensitivity, both GradCAM and Saliency show remarkable performance. In particular,
Saliency shows almost no change in attribution values despite brightness variations. In the
RRA and FP RRA category, Feature Ablation and LIME show identical top performance when
ranked by max(0, RRA − FP RRA). The performance of Saliency in this category is considered
significantly off target: its low, almost zero RRA means that Saliency finds only very few
highly attributed pixels inside the ground truth mask. In terms of the sparseness metric,
the Saliency method emerges as the top performer. The other three methods show similar
values that are significantly different from that of Saliency. Furthermore, it is observed that
LIME and Feature Ablation exhibit similar or even identical values. The similarity may arise
from the fact that both LIME and Feature Ablation are perturbation-based methods.</p>
      <p>Additionally, it is notable that in most cases, the values from the category "Car" are better
than those from the category "Person." One possible reason for the lower performance
values in the "Person" class could be the downsizing of the image, as described in Chapter
3. Table 3 shows how the values of the Matching Area, which indicates how well the
segmentation aligns with the ground truth mask, get worse when the image is resized.
Another reason could be the general size of the objects in the image. A car, in comparison,
has a significantly larger area than a person and is therefore easier to detect.
In Figure 4, the measured metrics are visualized in a radar chart with best scores on the
outer circle and worse scores in the inner area.</p>
      <p>To evaluate the methods by means of the metrics used, we need to weigh their relative
importance. Two of the most relevant metrics are shown in the lower part of the radar chart,
namely RRA and FP RRA. They are central to human interpretability, as they indicate in the
heatmap (Figure 1) what is considered important. Due to its poor interpretability, we decided
not to pursue Saliency further. Setting Saliency aside, the three remaining methods can be
ordered according to their performance: GradCAM demonstrates superiority, followed by
Feature Ablation and LIME.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In summary, GradCAM emerges in this study as the first choice among the evaluated XAI
methods for image segmentation, not only because of its favorable metrics but also because
of the interpretability of its heatmap. Figure 1 illustrates how GradCAM considers not only
the objects themselves but also contextual features such as the road. The heatmap shows a
gradient from orange (important) to red (very important), which is missing in other
methods. The situation for Saliency was different: while it performed well on several metrics,
its heatmap provided almost no information, making interpretation difficult. This highlights
the importance of understanding and selecting a comprehensive set of metrics that can
provide a clear understanding of the reliability of the methods. We acknowledge that the
results are specifically tailored to this case and cannot be directly applied to other fields,
such as medicine. Nevertheless, they represent a crucial first step toward future
autonomous driving applications.</p>
      <p>The current study is, to the best of our knowledge, the first to evaluate a subset of XAI
methods for image segmentation, an application area of computer vision that has not
received as much attention as, e.g., image classification. Future research should involve a
variety of data sets and segmentation models and prioritize the evaluation of a broader
range of XAI methods for image segmentation. This research should include the
development and application of various evaluation metrics, with a particular focus on
interpretability. By optimizing these evaluation criteria, an understanding of the specific
strengths and weaknesses of different XAI methods can be achieved, which eventually leads
to the identification of the most suitable methods for specific applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] 'Algorithm Descriptions',
          <source>Captum. Accessed: Nov. 25</source>
          ,
          <year>2023</year>
          . [Online]. Available: https://captum.ai/docs/attribution_algorithms
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hedström</surname>
          </string-name>
          et al.,
          <article-title>'Quantus: An Explainable AI Toolkit for Responsible Evaluation of Neural Network Explanations and Beyond'</article-title>
          . arXiv, Apr.
          <volume>27</volume>
          ,
          <year>2023</year>
          . Accessed: Mar.
          <volume>20</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/2202.06861
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] 'Dataset Overview',
          <string-name>
            <given-names>Cityscapes</given-names>
            <surname>Dataset</surname>
          </string-name>
          .
          <source>Accessed: Dec. 15</source>
          ,
          <year>2023</year>
          . [Online]. Available: https://www.cityscapes-dataset.com/dataset-overview/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>[4] 'Models and pre-trained weights'</article-title>
          ,
          <source>PyTorch. Accessed: Nov. 28</source>
          ,
          <year>2023</year>
          . [Online]. Available: https://pytorch.org/vision/stable/models.html#semantic-segmentation
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Molnar</surname>
          </string-name>
          , G. Casalicchio, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Bischl</surname>
          </string-name>
          , '
          <string-name>
            <surname>Interpretable Machine Learning -- A Brief History</surname>
          </string-name>
          ,
          <article-title>State-of-the-Art and Challenges'</article-title>
          , vol.
          <volume>1323</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>417</fpage>
          -
          <lpage>431</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -65965-3_
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Munn</surname>
          </string-name>
          and D. Pitman,
          <article-title>Explainable AI for practitioners: designing and implementing explainable ML solutions</article-title>
          . Beijing, Sebastopol, CA:
          <string-name>
            <given-names>O</given-names>
            <surname>'Reilly</surname>
          </string-name>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vedantam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Parikh</surname>
            , and
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Batra</surname>
          </string-name>
          , '
          <article-title>Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization'</article-title>
          ,
          <source>Int. J. Comput. Vis.</source>
          , vol.
          <volume>128</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>336</fpage>
          -
          <lpage>359</lpage>
          , Feb.
          <year>2020</year>
          , doi: 10.1007/s11263-019-01228- 7.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          , '
          <article-title>"Why Should I Trust You?": Explaining the Predictions of Any Classifier'</article-title>
          . arXiv, Feb.
          <volume>26</volume>
          ,
          <year>2016</year>
          . Accessed: May 13,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/1602.04938.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ismail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gunady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Bravo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Feizi</surname>
          </string-name>
          , '
          <article-title>Benchmarking Deep Learning Interpretability in Time Series Predictions'</article-title>
          . arXiv, Oct.
          <volume>26</volume>
          ,
          <year>2020</year>
          . Accessed: Mar.
          <volume>20</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/
          <year>2010</year>
          .13924
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , '
          <article-title>Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps'</article-title>
          . arXiv, Apr.
          <volume>19</volume>
          ,
          <year>2014</year>
          . Accessed: Mar.
          <volume>20</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/1312.6034
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>C.-K. Yeh</surname>
            ,
            <given-names>C.-Y.</given-names>
          </string-name>
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Suggala</surname>
            ,
            <given-names>D. I.</given-names>
          </string-name>
          <string-name>
            <surname>Inouye</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Ravikumar</surname>
          </string-name>
          , '
          <article-title>On the (In)fidelity and Sensitivity for Explanations'</article-title>
          . arXiv, Nov.
          <volume>03</volume>
          ,
          <year>2019</year>
          . Accessed: Mar.
          <volume>20</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/
          <year>1901</year>
          .09392
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Berrio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Worrall</surname>
          </string-name>
          , and E. Nebot, '
          <article-title>Automated Evaluation of Semantic Segmentation Robustness for Autonomous Driving'</article-title>
          ,
          <source>IEEE Trans. Intell. Transp. Syst.</source>
          , vol.
          <volume>21</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>1951</fpage>
          -
          <lpage>1963</lpage>
          , May
          <year>2020</year>
          , doi: 10.1109/TITS.
          <year>2019</year>
          .
          <volume>2909066</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Arras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Osman</surname>
          </string-name>
          , and W. Samek, '
          <article-title>Ground Truth Evaluation of Neural Network Explanations with CLEVR-XAI'</article-title>
          ,
          <source>Inf. Fusion</source>
          , vol.
          <volume>81</volume>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>40</lpage>
          , May
          <year>2022</year>
          , doi: 10.1016/j.inffus.
          <year>2021</year>
          .
          <volume>11</volume>
          .008.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chalasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          , '
          <article-title>Concise Explanations of Neural Networks using Adversarial Training'</article-title>
          . arXiv, Jul.
          <volume>04</volume>
          ,
          <year>2020</year>
          . Accessed: Mar.
          <volume>20</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/
          <year>1810</year>
          .06583
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] 'Gini Koeffizient Definition und Berechnung',
          <source>Studyflix. Accessed: Mar. 31</source>
          ,
          <year>2024</year>
          . [Online]. Available: https://studyflix.de/wirtschaft/gini-koeffizient-898
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>