<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>World Conference on eXplainable Artificial Intelligence:
July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Investigating the Relationship Between Debiasing and Artifact Removal using Saliency Maps</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lukasz Sztukiewicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ignacy Stępka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michał Wiliński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jerzy Stefanowski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computing Science, Poznan University of Technology</institution>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>The widespread adoption of machine learning systems has raised critical concerns about fairness and bias, making the mitigation of harmful biases essential for AI development. In this paper, we investigate the relationship between debiasing and removing artifacts in neural networks for computer vision tasks. First, we introduce a set of novel XAI-based metrics that analyze saliency maps to assess shifts in a model's decision-making process. Then, we demonstrate that successful debiasing methods systematically redirect model focus away from protected attributes. Finally, we show that techniques originally developed for artifact removal can be effectively repurposed for improving fairness. These findings provide evidence for the existence of a bidirectional connection between ensuring fairness and removing artifacts corresponding to protected attributes.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep learning</kwd>
        <kwd>Fairness</kwd>
        <kwd>Debiasing</kwd>
        <kwd>Saliency maps</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Finally, we show that techniques originally developed for artifact removal, such as the family of ClArC methods [8], also
improve fairness even though their explicit goal is to remove the designated artifact. These findings
point to the existence of an inherent relationship between improving fairness and steering the saliency
away from the protected attributes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Debiasing methods are an active area of research, usually in the context of tabular data, with a vast
landscape of methods applied at various stages of model development [7]. The methods employed in
our study debias a model post hoc, that is, after it has been trained,
within a binary classification setup. In our work, we consider three groups of methods. The first group
consists of simple threshold optimizers, represented in our experiments by ThrOpt [9]. The second
group focuses on approaches that optimize fairness with adversarial fine-tuning and is represented
by ZhangAL [10] and SavaniAFT [11]. Finally, the third group focuses on concept-based interventions
(artifact removal), exemplified by ClArC variants [12, 8], which use Concept Activation Vectors (CAVs)
to intervene directly on the model’s internal representations in activation space.</p>
      <p>Saliency maps are explainable AI methods that provide insights into a model’s decision-making process
by highlighting regions of input data that influence predictions. These techniques can generally be
categorized into gradient-based [13, 14] and relevance-based methods [15]. Integrated Gradients (IG)
[13] attributes predictions to input features by integrating gradients along a path from a baseline to the
input, satisfying important axioms, including sensitivity and implementation invariance. Layer-wise
Relevance Propagation (LRP) [15] employs a different approach based on a conservation principle, where
relevance scores are propagated backward through the network layers while maintaining a constant
sum. To improve the faithfulness of our study, we conducted experiments with multiple saliency map
methods, each providing a different perspective on model predictions and associated limitations [16].</p>
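      <p>As a rough illustration of the path integration behind IG, the sketch below approximates the integral with a midpoint Riemann sum for a toy function. The names <monospace>toy_model</monospace>, <monospace>toy_gradient</monospace>, and <monospace>integrated_gradients</monospace>, the zero baseline, and the step count are assumptions made for this example and are not part of our experimental pipeline, where gradients would come from automatic differentiation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def toy_model(x):
    """Hypothetical scalar model: F(x) = sum(x_i^2) + x_0 * x_1."""
    return float(np.sum(x ** 2) + x[0] * x[1])


def toy_gradient(x):
    """Analytic gradient of toy_model (stands in for autodiff)."""
    grad = 2.0 * x
    grad[0] += x[1]
    grad[1] += x[0]
    return grad


def integrated_gradients(x, baseline, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight path."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x)
    for alpha in alphas:
        avg_grad += toy_gradient(baseline + alpha * (x - baseline))
    avg_grad /= steps
    # Scale the path-averaged gradient by the displacement from the baseline.
    return (x - baseline) * avg_grad


x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to F(x) - F(baseline).
completeness_gap = abs(attributions.sum() - (toy_model(x) - toy_model(baseline)))
print(attributions, completeness_gap)
```
]]></preformat>
      <p>For this quadratic toy function, the attributions sum to the difference between the model output at the input and at the baseline, which is the completeness property of IG [13].</p>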
      <p>Quantitative evaluation of saliency maps is crucial for assessing whether models make decisions
based on appropriate features rather than biased artifacts or protected attributes. Early approaches,
such as the inside-outside ratio [17, 18], established a foundation by quantifying the relevance contained
within a bounding box relative to the relevance outside it. This concept has been further developed
as part of the Quantus toolbox [19], which provides a framework for evaluating explanations through
various localization metrics. Motzkus et al. [20] advanced this approach by adapting the inside-outside
metric to compute the ratio of positively attributed relevance within a binary class mask to the overall
positive relevance, specifically focusing on the context of individual concepts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Metrics for Saliency Maps</title>
      <p>In this section, we present metrics designed to quantify the importance of protected attributes in the
model’s decision-making process. Our focus is specifically on localized features that can be roughly
bounded by rectangular regions of interest (ROIs). These metrics evaluate whether an ROI plays an
important role in the model’s reasoning by analyzing saliency maps. In principle, they can be used with
any standard saliency map generation method that suits the practical needs of an application.</p>
      <p>To establish our framework, we define several key components. Image <inline-formula><tex-math>I</tex-math></inline-formula> is a 2D array with <inline-formula><tex-math>I_{i,j}</tex-math></inline-formula>
representing the intensity (or relevance) of the pixel <inline-formula><tex-math>(i, j)</tex-math></inline-formula>. Within this image, we consider a 2D array
(ROI) <inline-formula><tex-math>R</tex-math></inline-formula> such that <inline-formula><tex-math>|R| &lt; |I|</tex-math></inline-formula>.</p>
      <p>Rectangle Relevance Fraction (RRF) provides a direct measure of the ROI’s importance in the
context of the model’s prediction by calculating what percentage of the total relevance falls within the
region:
        <disp-formula><label>(1)</label><tex-math>\mathrm{RRF} = \frac{\sum_{(i,j) \in R} I_{i,j}}{\sum_{(i,j) \in I} I_{i,j}}</tex-math></disp-formula>
It aids in understanding the ROI’s relative contribution to the overall decision-making process of the
model.</p>
      <p>Average Difference in Region (ADR) provides a direct measure of how the saliency values within
the ROI change after debiasing. It is defined as:
        <disp-formula><label>(2)</label><tex-math>\mathrm{ADR} = \frac{1}{|R|} \sum_{(i,j) \in R} \left( v_{i,j} - d_{i,j} \right)</tex-math></disp-formula>
where <inline-formula><tex-math>v_{i,j}</tex-math></inline-formula> and <inline-formula><tex-math>d_{i,j}</tex-math></inline-formula> represent pixel intensities in Vanilla (corresponding to the base model) and debiased
saliency maps, respectively. A positive ADR value indicates that Vanilla generally assigned higher
importance to pixels within the ROI compared to the debiased model, suggesting a successful reduction
in the model’s reliance on these features.</p>
      <p>Decreased Intensity Fraction (DIF) quantifies the proportion of pixels within the ROI that show
reduced importance after debiasing. Specifically, it calculates the fraction of pixels where the debiased
model shows lower saliency values compared to the Vanilla model. It is defined as:
        <disp-formula><label>(3)</label><tex-math>\mathrm{DIF} = \frac{1}{|R|} \sum_{(i,j) \in R} \mathbb{1}\{ d_{i,j} &lt; v_{i,j} \}</tex-math></disp-formula>
DIF provides insight into how widespread the changes are within the ROI, complementing the ADR’s
measurement of average change magnitude.</p>
      <p>Rectangle Difference Distribution Testing (RDDT) assesses whether Vanilla assigns higher
importance to pixels within the ROI compared to the debiased model. For each image, we compute the
difference between the mean intensities of the Vanilla and debiased saliency maps within the ROI:
        <disp-formula><label>(4)</label><tex-math>\delta = \mu_{\mathrm{vanilla}} - \mu_{\mathrm{debiased}}</tex-math></disp-formula>
where <inline-formula><tex-math>\mu_{\mathrm{vanilla}}</tex-math></inline-formula> and <inline-formula><tex-math>\mu_{\mathrm{debiased}}</tex-math></inline-formula> represent the mean pixel intensities within the ROI for the Vanilla and
debiased models, respectively. We then perform a one-sample t-test on these differences across images with
<inline-formula><tex-math>H_0: \mu_{\delta} = 0</tex-math></inline-formula> and <inline-formula><tex-math>H_1: \mu_{\delta} &gt; 0</tex-math></inline-formula>. The test returns 1 if <inline-formula><tex-math>p &lt; 0.01</tex-math></inline-formula>, indicating statistically significant evidence
that the Vanilla model assigns a higher importance to the ROI than the debiased model, and 0 otherwise.</p>
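      <p>The four metrics can be sketched with NumPy and SciPy as follows. This is a minimal illustration rather than the implementation used in our experiments: the function names, the (top, left, height, width) ROI convention, and the synthetic saliency maps are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np
from scipy import stats


def roi_slice(box):
    """Turn a hypothetical (top, left, height, width) box into array slices."""
    top, left, height, width = box
    return slice(top, top + height), slice(left, left + width)


def rrf(saliency, box):
    """Rectangle Relevance Fraction: share of total relevance inside the ROI."""
    return float(saliency[roi_slice(box)].sum() / saliency.sum())


def adr(vanilla, debiased, box):
    """Average Difference in Region: mean of (vanilla - debiased) over the ROI."""
    roi = roi_slice(box)
    return float(np.mean(vanilla[roi] - debiased[roi]))


def dif(vanilla, debiased, box):
    """Decreased Intensity Fraction: share of ROI pixels with lower debiased saliency."""
    roi = roi_slice(box)
    return float(np.mean(debiased[roi] < vanilla[roi]))


def rddt(vanilla_maps, debiased_maps, box, alpha=0.01):
    """One-sided, one-sample t-test on per-image differences of ROI means."""
    roi = roi_slice(box)
    deltas = np.array([v[roi].mean() - d[roi].mean()
                       for v, d in zip(vanilla_maps, debiased_maps)])
    result = stats.ttest_1samp(deltas, popmean=0.0, alternative="greater")
    return int(result.pvalue < alpha)


# Synthetic example: the "debiased" maps halve the relevance inside the ROI.
rng = np.random.default_rng(0)
box = (2, 2, 3, 3)
vanilla_maps = [rng.random((8, 8)) + 1.0 for _ in range(20)]
debiased_maps = [v.copy() for v in vanilla_maps]
for d in debiased_maps:
    d[roi_slice(box)] *= 0.5

print("RRF vanilla :", rrf(vanilla_maps[0], box))
print("RRF debiased:", rrf(debiased_maps[0], box))
print("ADR:", adr(vanilla_maps[0], debiased_maps[0], box))
print("DIF:", dif(vanilla_maps[0], debiased_maps[0], box))
print("RDDT:", rddt(vanilla_maps, debiased_maps, box))
```
]]></preformat>
      <p>The sketch sums signed values as they are; for saliency methods that produce negative relevance, such as LRP, a practical implementation would need to decide explicitly how negative values enter the RRF.</p>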
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In the experiments below, we aim to explore the following two research questions. RQ1: Is there a
bidirectional relationship between shifting the importance of pixels in the saliency map out of the ROI
and optimizing fairness metrics? RQ2: Are debiasing methods capable of decreasing the saliency within
ROI w.r.t. a standard end-to-end trained Vanilla model?</p>
      <p>For our experiments, we utilize the methods detailed in Sec. 2, implemented within the DetoxAI library
[21]. We compute metrics and generate visualizations using LRP and Integrated Gradients. To ensure
reproducibility, we have open-sourced the relevant implementations in a GitHub repository
(<ext-link ext-link-type="uri" xlink:href="https://github.com/DetoxAI/saliency-fairness-metrics">https://github.com/DetoxAI/saliency-fairness-metrics</ext-link>).</p>
      <p>The experimental procedure begins by fine-tuning a pre-trained ResNet-50 [22] on the target task’s
training set, yielding our Vanilla model. This fine-tuning uses a batch size of 128, the Adam optimizer,
and a learning rate of <inline-formula><tex-math>3 \cdot 10^{-4}</tex-math></inline-formula> for a single epoch. Subsequently, we apply the considered debiasing
methods using a disjoint hold-out (debias) set. Finally, we evaluate the resulting models on a test set,
calculating prediction performance, fairness, and our proposed metrics. Notably, both the training and
debias datasets maintain the same protected attribute-target (PA-T) correlation, reflecting a common
practical scenario where the split strategy is fixed. In contrast, the test set intentionally balances the PA-T
correlation to systematically assess predictive performance (Accuracy) and fairness (EqualizedOdds) [23].</p>
      <sec id="sec-4-1">
        <title>4.1. Qualitative assessment</title>
        <p>We perform a qualitative assessment of the debiasing by inspecting the relevancy maps before and after
applying different debiasing methods. Fig. 2 presents LRP saliency maps for images aggregated by PA-T
combinations, where the protected attribute is WearingNecktie and the target attribute is Smiling. The
black rectangles highlight the ROI roughly corresponding to the necktie area (see Fig. 1).</p>
        <p>Several key observations can be made from these visualizations. The Vanilla model (second column)
shows considerable attention to the necktie region, particularly for the (PA=1, T=0) combination,
indicating that the model has learned to associate the necktie area with its predictions. Interestingly,
for the (PA=1, T=1) combination (bottom row), the necktie area shows strong negative relevance (blue),
suggesting the model uses this feature to make negative predictions about smiling.</p>
        <p>Simple threshold optimization (ThrOpt) does not substantially alter the saliency patterns compared
to Vanilla, maintaining similar attention to the necktie area. This suggests that merely adjusting
classification thresholds does not change the underlying reasoning of the model. Adversarial
fine-tuning methods (SavaniAFT and ZhangAL) show modest reductions in the attention to the ROI but
largely preserve the overall saliency patterns of the Vanilla model. The ClArC-based methods show the
most noticeable shifts. A-ClArC reduces the saliency in the necktie region across all PA-T combinations,
redirecting attention to facial features relevant to the Smiling attribute. RR-ClArC shows the most
visible improvements: excluding the second row, it almost completely eliminates the relevance from the ROI.
These observations suggest that, while all debiasing methods may improve fairness metrics, they differ
in how they alter the model’s underlying decision-making process. Methods from the ClArC family
most effectively redirect the model’s attention away from the protected attribute region.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Quantitative experiments</title>
        <p>While the CelebA dataset exhibits inherent attribute correlations, we artificially enforced specific PA-T
correlations in our experimental framework to amplify the biases. This was done by rebalancing the
dataset by undersampling attribute combinations to control their correlation with the target, as captured
by Yule’s correlation coefficient <inline-formula><tex-math>\phi</tex-math></inline-formula>.</p>
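        <p>To make the controlled correlation concrete, the sketch below computes the phi coefficient of a 2x2 PA-T contingency table and derives per-cell sample counts that reach a requested correlation. We assume here that Yule’s correlation coefficient corresponds to this phi coefficient and that both marginals are balanced; the function names and the balanced design are illustrative assumptions and do not describe the exact undersampling procedure used in our experiments.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def phi_coefficient(pa, target):
    """Phi coefficient of two binary arrays (2x2 contingency table)."""
    pa = np.asarray(pa, dtype=bool)
    target = np.asarray(target, dtype=bool)
    n11 = float(np.sum(pa & target))
    n10 = float(np.sum(pa & ~target))
    n01 = float(np.sum(~pa & target))
    n00 = float(np.sum(~pa & ~target))
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return float((n11 * n00 - n10 * n01) / denom)


def balanced_cell_counts(phi, group_size):
    """Hypothetical helper: counts per (PA, T) cell for a target phi.

    Assumes each PA group and each target class holds group_size samples.
    """
    agree = int(round(group_size * (1 + phi) / 2))
    disagree = group_size - agree
    return {(1, 1): agree, (0, 0): agree, (1, 0): disagree, (0, 1): disagree}


# Build label arrays from the cell counts and verify the resulting correlation.
counts = balanced_cell_counts(0.6, group_size=100)
pa = np.concatenate([np.full(n, a) for (a, t), n in counts.items()])
target = np.concatenate([np.full(n, t) for (a, t), n in counts.items()])
phi = phi_coefficient(pa, target)
print(counts, phi)
```
]]></preformat>
        <p>Under these assumptions, undersampling each (PA, T) cell to the returned counts would yield a subset with the requested correlation.</p>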
        <p>In this experiment, we considered two PA-T combinations: WearingHat–Smiling and WearingNecktie–Smiling,
using saliency maps generated with LRP [15] and Integrated Gradients [13]. However, in the
following, we only report the results for LRP and the WearingNecktie–Smiling combination (in Fig. 3),
while we move the rest to the Appendix, because the conclusions from all experiment variants are
the same. In these plots, we report metrics from Sec. 3 along with EqualizedOdds, calculated as
          <disp-formula><tex-math>\mathrm{EqualizedOdds} = \max\left( \left| \mathrm{TPR}_{\mathrm{PA}=1} - \mathrm{TPR}_{\mathrm{PA}=0} \right|, \left| \mathrm{FPR}_{\mathrm{PA}=1} - \mathrm{FPR}_{\mathrm{PA}=0} \right| \right)</tex-math></disp-formula>
where TPR and FPR stand for true and false positive rates, respectively, and PA = 0 and PA = 1 denote the protected attribute value assignments.</p>
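        <p>As a small worked example, EqualizedOdds (the larger of the absolute TPR gap and the absolute FPR gap between the two protected groups) can be computed from binary labels and predictions as sketched below. The function names and the toy data are assumptions made for illustration, and the sketch assumes that each protected group contains both positive and negative examples.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def rates(y_true, y_pred):
    """Return (TPR, FPR) for boolean label and prediction arrays."""
    tpr = float(np.mean(y_pred[y_true]))    # predicted positive among actual positives
    fpr = float(np.mean(y_pred[~y_true]))   # predicted positive among actual negatives
    return tpr, fpr


def equalized_odds(y_true, y_pred, pa):
    """Max of the absolute TPR and FPR gaps between PA=1 and PA=0."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    pa = np.asarray(pa, dtype=bool)
    tpr1, fpr1 = rates(y_true[pa], y_pred[pa])
    tpr0, fpr0 = rates(y_true[~pa], y_pred[~pa])
    return max(abs(tpr1 - tpr0), abs(fpr1 - fpr0))


# Toy data: group PA=1 has TPR 1.0 and FPR 0.5; group PA=0 has TPR 0.5 and FPR 0.0.
pa = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
eo = equalized_odds(y_true, y_pred, pa)
print(eo)
```
]]></preformat>
        <p>A value of 0 means that both groups have identical true and false positive rates, so lower values indicate less bias.</p>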
        <p>First, it is clear that as the correlation <inline-formula><tex-math>\phi</tex-math></inline-formula> increases, all methods achieve a higher EqualizedOdds value, which indicates
more bias in their predictions. The best performing method for this metric is ZhangAL, which
optimizes it directly. Nevertheless, most methods decrease the EqualizedOdds score relative to Vanilla,
confirming that they are effective.</p>
        <p>ThrOpt, a post-hoc classification threshold optimization method, does not shift the relevancy in or
out of the ROI. Its bars are empty for ADR and RDDT and equal to Vanilla on DIF and RRF, indicating
that no change in the saliency maps was recorded. This is expected since ThrOpt does not intervene
in the reasoning process. Its accuracy decreases as the correlation grows.</p>
        <p>SavaniAFT and ZhangAL both perform well across most metrics. ZhangAL scores remarkably well in
saliency map-based metrics. It lowers all but one metric value in the first row of the plot, showing that
it moves the saliency out of the ROI. As the correlation grows, the accuracy of the model also grows. In addition,
it scores visibly well on the metrics in the lower row, which measure the improvement over the
Vanilla model within the ROI. This provides evidence that optimizing with a fairness-oriented objective
as a fine-tuning step can significantly shift the model’s reasoning process.</p>
        <p>RR-ClArC and A-ClArC do not optimize any fairness objective. Yet, they effectively debias the model
(as captured by EqualizedOdds) and significantly shift model relevancy within the ROI. Both score high
on DIF and ADR, and often appear on RDDT (the more bars the better). Regarding attention outside the
ROI, they tend to lower the RRF with respect to Vanilla, which suggests that more attention is given to
features outside the ROI, which is the desired outcome. Both methods cause a decrease in accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Experiments show that effective debiasing methods decrease saliency within the ROI compared to the
Vanilla model, which answers RQ2 positively. Both qualitative and quantitative analyses reveal that
while threshold optimization (ThrOpt) produces no changes in saliency maps, fine-tuning-based
approaches yield significant improvements. Notably, ZhangAL, SavaniAFT, and the ClArC-based methods
(A-ClArC and RR-ClArC) redirect attention away from protected features towards task-relevant
features, such as facial expressions for smile detection. For the ClArC-based methods, the saliency redirection is stronger
while achieving competitive EqualizedOdds, despite not directly optimizing any fairness objective.</p>
      <p>These findings provide evidence for a bidirectional relationship between shifting pixel importance in
saliency maps away from regions of interest and optimizing fairness metrics, validating the premise
of RQ1. They confirm that methods that effectively redirect model attention away from protected
attributes tend to score better on EqualizedOdds, and vice versa.</p>
      <p>We believe that this research provides useful evidence for further work on fairness methods, which
could adapt concept removal methods directly in the field of fair machine learning.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Authors Ignacy Stępka, Lukasz Sztukiewicz, and Michał Wiliński were supported by a Ministry of Science and
Higher Education grant No. MNiSW/2025/DPI/56 under the FERS program, co-financed by the European
Union. Jerzy Stefanowski was supported by a National Science Centre grant (No. 2022/47/D/ST6/01770).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT, LeChat, and Grammarly for grammar
and spelling checking, paraphrasing, and rewording. After using these tools/services, the authors
reviewed and edited the content as needed and assume full responsibility for the content of the publication.</p>
      <p>[4] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, Man is to computer programmer as woman is to homemaker? debiasing word embeddings, Advances in Neural Information Processing Systems 29 (2016).</p>
      <p>[5] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision (ICCV), 2015, pp. 3730–3738.</p>
      <p>[6] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Computing Surveys 54 (2021).</p>
      <p>[7] S. Caton, C. Haas, Fairness in machine learning: A survey, ACM Comput. Surv. 56 (2024).</p>
      <p>[8] M. Dreyer, F. Pahde, C. J. Anders, W. Samek, S. Lapuschkin, From hope to safety: Unlearning biases of deep models via gradient penalization in latent space, Proceedings of the AAAI Conference on Artificial Intelligence 38 (2024) 21046–21054.</p>
      <p>[9] M. Hardt, E. Price, N. Srebro, Equality of opportunity in supervised learning, in: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Curran Associates Inc., Red Hook, NY, USA, 2016, pp. 3323–3331.</p>
      <p>[10] B. H. Zhang, B. Lemoine, M. Mitchell, Mitigating unwanted biases with adversarial learning, in: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 335–340.</p>
      <p>[11] Y. Savani, C. White, N. S. Govindarajulu, Intra-processing methods for debiasing neural networks, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 2798–2810.</p>
      <p>[12] C. J. Anders, L. Weber, D. Neumann, W. Samek, K.-R. Müller, S. Lapuschkin, Finding and removing clever hans: Using explanation methods to debug and improve deep models, Information Fusion 77 (2022) 261–295.</p>
      <p>[13] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 3319–3328.</p>
      <p>[14] C. Molnar, Interpretable Machine Learning, 2 ed., 2022.</p>
      <p>[15] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLOS ONE 10 (2015) 1–46.</p>
      <p>[16] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence 1 (2019) 206–215.</p>
      <p>[17] M. Kohlbrenner, A. Bauer, S. Nakajima, A. Binder, W. Samek, S. Lapuschkin, Towards best practice in explaining neural network decisions with lrp, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–7.</p>
      <p>[18] S. Bach, A. Binder, G. Montavon, K.-R. Müller, W. Samek, Analyzing classifiers: Fisher vectors and deep neural networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 2912–2920.</p>
      <p>[19] A. Hedström, L. Weber, D. Krakowczyk, D. Bareeva, F. Motzkus, W. Samek, S. Lapuschkin, M. M. C. Höhne, Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond, Journal of Machine Learning Research 24 (2023) 1–11.</p>
      <p>[20] F. Motzkus, G. Mikriukov, C. Hellert, U. Schmid, Locally testing model detections for semantic global concepts, in: World Conference on Explainable Artificial Intelligence, Springer, 2024, pp. 137–159.</p>
      <p>[21] I. Stępka, L. Sztukiewicz, M. Wiliński, J. Stefanowski, DetoxAI: a Python toolkit for debiasing deep learning models in computer vision, 2025. URL: https://arxiv.org/abs/2505.05492. arXiv:2505.05492.</p>
      <p>[22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</p>
      <p>[23] D. Brzezinski, J. Stachowiak, J. Stefanowski, I. Szczech, R. Susmaga, S. Aksenyuk, U. Ivashka, O. Yasinskyi, Properties of fairness measures in the context of varying class imbalance and protected group ratios, ACM Transactions on Knowledge Discovery from Data 18 (2024) 1–18.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Buyl</surname>
          </string-name>
          , T. De Bie,
          <article-title>Inherent limitations of ai fairness</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>67</volume>
          (
          <year>2024</year>
          )
          <fpage>48</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wachter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mittelstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Counterfactual explanations without opening the black box: automated decisions and the gdpr</article-title>
          ,
          <source>Harvard Journal of Law and Technology</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Image fairness in deep learning: Problems, models, and challenges</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>12875</fpage>
          -
          <lpage>12893</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>