<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explainable Artificial Intelligence Beyond Feature Attributions: The Validity and Reliability of Feature Selection Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raphael Wallsberger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Knauer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephan Matzka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Applied Sciences Berlin, School of Engineering II - Technology and Life, KI-Werkstatt</institution>
          ,
          <addr-line>12459 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Explainable artificial intelligence (XAI) offers powerful tools to increase the transparency of opaque machine learning models. In contrast to feature attribution methods, XAI-based feature selections provide practitioners with a simple, but often more easily interpretable subset of a model's most influential features. In this work, we systematically evaluate feature selection explanations based on Shapley effects and Shapley Additive Global importancE values (SAGE values) across different machine learning algorithms and tabular datasets, and find that they can offer valid and reliable explanations. We identify the conditions under which global post-hoc explainers can likely be trusted, laying the groundwork for future research into the validity and reliability of feature selection explanations across a broader range of settings.</p>
      </abstract>
      <kwd-group>
        <kwd>XAI</kwd>
        <kwd>Validity</kwd>
        <kwd>Reliability</kwd>
        <kwd>Shapley Effects</kwd>
        <kwd>SAGE</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Explainable artificial intelligence (XAI) systems have found increasing adoption across industries
in recent years. A major driving force has been the call for transparency to not only foster
trust among users, but also to comply with regulatory standards [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Practitioners frequently
use XAI methods to describe a feature’s influence on a predictive model via feature attribution
or selection, i.e., via assigning a numerical or binary score to each feature. Feature selection
approaches are of particular importance in practice because a small subset of the most influential
features is often easier to interpret than a list of numerical scores, especially for non-technical
users [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Out of the variety of feature attribution and selection approaches, Shapley effects [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] or
Shapley Additive Global importancE values (SAGE values [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) have gained increasing popularity
for (arguably) three reasons. First, they are based on the Shapley value, a unique solution
concept from cooperative game theory that fulfills well-defined desiderata [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ]. Second, they
are model-agnostic, i.e., they can be applied to any predictive model post-hoc. Third, they offer
global explanations across the entire dataset, whereas local methods such as SHapley Additive
exPlanations (SHAP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) exclusively explain single instances. Despite their popularity, there is
only limited evidence on whether feature selection explanations based on Shapley effects or SAGE
values are valid [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ], and no evidence whether the selection process is reliable, i.e.,
whether the selected feature subsets are stable or robust to slight perturbations in the input
(via bootstrapping). Intuitively, we expect similar inputs to produce similar explanations. It
therefore remains challenging for practitioners to decide when and how these approaches can
be trusted and effectively applied.
      </p>
      <p>
        Our contributions are as follows:
1. We evaluate feature selection explanations based on Shapley effects and SAGE
values with two common machine learning baselines for small- and medium-sized tabular
data: L2-regularized logistic regression and XGBoost [
        <xref ref-type="bibr" rid="ref9">9, 10</xref>
        ]. To the best of our knowledge,
we are the first to assess the selection reliability in addition to the selection validity for these
global explanation methods.
2. We highlight the conditions under which Shapley effects and SAGE values can offer
valid and reliable explanations, and show that our conclusions appear to be relatively
robust to predictive model choices and input data changes (Sect. 3.2).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The Shapley value has served as a useful solution concept for feature attribution or selection in
XAI. It can be understood as a weighted average over the marginal contributions of a feature to
each feature subset. The weights can be axiomatically derived to uniquely define the Shapley
value for each feature [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ]. Shapley effects approximate this weighted average with respect
to the model output [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], SAGE values with respect to the model performance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In terms
of the validity, feature selection explanations based on Shapley values are not guaranteed
to yield optimal feature subsets [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. SAGE’s global explanation approach, for example, does
not necessarily return the best subset, but has been shown to perform better than SHAP’s
local explanation method for feature selection [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Although the reliability of explanations is
considered a key open challenge in XAI research [11], its assessment has so far been limited to
local explanation approaches such as SHAP [12, 13, 14, 15]. Given that both Shapley effects and
SAGE use Monte Carlo simulations to approximate Shapley values, though, the evaluation of
their selection reliability is of central importance for transparency and, ultimately, for building
trust.
      </p>
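      <p>To make the weighted average over marginal contributions concrete, the following minimal sketch computes exact Shapley values for a small toy payoff function. It is only an illustration under stated assumptions: the helper shapley_values, the payoff toy_value, and the feature names x1 to x3 are hypothetical and not taken from the evaluated methods, which approximate such values via Monte Carlo simulations rather than enumerating all subsets.</p>
      <preformat>
```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` (hypothetical helper)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight shared by every subset S of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when it joins subset S
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(subset):
    """Toy payoff (assumption): x1 and x2 only help together, x3 helps alone."""
    payoff = 0.0
    if "x1" in subset and "x2" in subset:
        payoff += 0.6
    if "x3" in subset:
        payoff += 0.2
    return payoff


phi = shapley_values(["x1", "x2", "x3"], toy_value)
print(phi)
```
      </preformat>
      <p>In this toy game, x1 and x2 each receive 0.3 and x3 receives 0.2, so the attributions sum to the payoff of the full feature set (0.8). Exact enumeration scales exponentially with the number of features, which is why sampling-based approximations are used in practice.</p>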
      <p>In the next section, we therefore extend prior research by systematically assessing Shapley
effects and SAGE not only in terms of the selection validity, but also in terms of the selection
reliability by evaluating how stable or robust these global explanation methods are to small
perturbations in the input data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>In the following, we first provide details on our experimental setup, including the employed
datasets and methods as well as evaluation metrics to assess Shapley effects and SAGE in terms
of their selection validity and reliability. We then present our experimental results. Overall, we
demonstrate that feature selection explanations can be valid and reliable if the number of labels
in the smaller class per feature is sufficiently large for a given predictive model.</p>
      <sec id="sec-3-1">
        <title>3.1. Experimental Setup</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Datasets and methods</title>
          <p>
            We evaluated Shapley effects and SAGE as feature selectors with two common machine learning
baselines for small- and medium-sized tabular data: L2-regularized logistic regression and
XGBoost [
            <xref ref-type="bibr" rid="ref9">9, 10</xref>
            ], using the PermutationEstimator class.
          </p>
          <p>We expected feature selection with logistic regression to perform reasonably well when the
number of labels in the smaller class per feature, or outcome events per variable (EPV), was at
least 10 to 15 [16, 17]. Therefore, we leveraged two synthetic binary classification datasets from
our prior work - one with 1090 instances, 53 numerical features, and an EPV of about 5; the other
with a smaller number of 14 features and thus an EPV of about 18 [18]. Our L2-regularization
hyperparameter was tuned using a nested, stratified, 3-fold cross-validation procedure. The
number of relevant features per dataset ranged from k = 2 to k = 5 [18], and we selected the
top-k most influential features according to their Shapley effects or SAGE values. As a reference,
we compare our feature selection explanations to greedy or optimal feature selection strategies
that were recently employed on the same datasets [18].</p>
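          <p>The EPV used above is simply the number of labels in the smaller class divided by the number of features. The following sketch illustrates the arithmetic; the function name events_per_variable and the class counts (250 of 1090 instances in the smaller class) are assumptions chosen for illustration, since the exact class counts are not stated here.</p>
          <preformat>
```python
from collections import Counter


def events_per_variable(labels, n_features):
    """EPV = size of the smaller class divided by the number of features."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected a binary label vector")
    return min(counts.values()) / n_features


# Hypothetical label vector: 1090 instances, 250 of them in the smaller class.
labels = [1] * 250 + [0] * 840

epv_53 = events_per_variable(labels, n_features=53)  # many features, low EPV
epv_14 = events_per_variable(labels, n_features=14)  # fewer features, higher EPV
print(round(epv_53, 2), round(epv_14, 2))
```
          </preformat>
          <p>With these assumed counts, reducing the number of features from 53 to 14 raises the EPV from roughly 5 to roughly 18, which mirrors how the two logistic regression settings differ.</p>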
          <p>For XGBoost, we hypothesized that an EPV of at least 200 was needed for valid and reliable
selections [19]. We therefore used two different synthetic binary classification datasets from
our prior work - one with 20,000 instances, 10 numerical and categorical features, and an EPV
of 100 [20]; the other with a decreased class imbalance and thus an EPV of 200. We numerically
encoded ordinal features, one-hot encoded nominal features, and used XGBoost with its default
settings [20]. The number of relevant features was fixed at k = 6 after one-hot encoding, and
we again selected the top-k most influential features according to their Shapley effects or SAGE
values. To put our feature selection explanations into context, we compare them to explanations
based on mean absolute SHAP values as recently investigated in [20].</p>
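          <p>Turning global importance scores into a feature selection explanation amounts to keeping the k highest-scoring features. A minimal sketch, assuming the scores are available as a mapping from feature name to value; the function select_top_k, the feature names, and the numbers are hypothetical.</p>
          <preformat>
```python
def select_top_k(scores, k):
    """Return the set of the k features with the largest importance scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


# Hypothetical global importance scores (e.g., Shapley effects or SAGE values).
scores = {"f1": 0.31, "f2": 0.02, "f3": 0.27, "f4": -0.01, "f5": 0.12}
relevant = {"f1", "f3", "f5"}  # assumed ground-truth relevant features

selected = select_top_k(scores, k=len(relevant))
is_valid = selected == relevant  # True only if the subsets match exactly
print(selected, is_valid)
```
          </preformat>
          <p>Only the ranking of the scores matters for the selection, so two explainers with different score magnitudes can still yield the same subset.</p>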
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Evaluation metrics</title>
          <p>To assess the selection validity and reliability of our global explanation methods, we used 100
nonparametric bootstrap samples, i.e., 100 random samples from our datasets with replacement.
In terms of the validity, we were interested in whether the selected feature subsets were true to
the data, i.e., how often the top-k most influential features according to Shapley effects or SAGE
values matched the k relevant features in our bootstrap samples. With respect to the reliability,
we assessed how stable or robust these global explanation methods were to small perturbations
in the input (via bootstrapping) with the stability measure proposed by Nogueira et al. (2018)
[21]. We regarded stability scores of &lt;0.40 as poor, 0.40 to 0.75 as intermediate to good, and
&gt;0.75 as excellent [21]. Finally, we evaluated the discriminative performance by computing
the mean test area under the receiver operating characteristic curve (AUC) using a (nested)
stratified, 3-fold cross-validation procedure [18].</p>
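          <p>The two evaluation metrics can be sketched as follows, assuming each bootstrap sample yields one selected subset of feature indices. The stability function follows the estimator of Nogueira et al. (2018): one minus the average per-feature selection variance, normalized by the variance expected for random subsets of the same average size. The function names and the example subsets are hypothetical, and the bootstrap helper only draws indices; it does not fit any model.</p>
          <preformat>
```python
import random


def bootstrap_indices(n, n_samples, seed=0):
    """Draw `n_samples` index lists of length n, sampled with replacement."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(n_samples)]


def validity_rate(selections, relevant):
    """Fraction of bootstrap samples whose selected subset equals the relevant one."""
    hits = sum(1 for s in selections if set(s) == set(relevant))
    return hits / len(selections)


def nogueira_stability(selections, n_features):
    """Stability of M feature subsets over d features (Nogueira et al., 2018).

    Undefined (division by zero) if no or all features are always selected.
    """
    m = len(selections)
    d = n_features
    # Selection frequency of each feature index across the M subsets
    freq = [sum(1 for s in selections if f in s) / m for f in range(d)]
    # Unbiased sample variance of each feature's selection indicator
    variances = [m / (m - 1) * p * (1 - p) for p in freq]
    k_bar = sum(len(s) for s in selections) / m
    denom = (k_bar / d) * (1 - k_bar / d)
    return 1 - (sum(variances) / d) / denom


# Hypothetical subsets selected on four bootstrap samples of a 5-feature dataset.
selections = [{0, 1}, {0, 1}, {0, 2}, {0, 1}]
stability = nogueira_stability(selections, n_features=5)
validity = validity_rate(selections, relevant={0, 1})
print(round(stability, 3), validity)
```
          </preformat>
          <p>For these four hypothetical subsets, the stability is about 0.58 and the validity rate is 0.75; identical subsets across all samples would give a stability of 1.</p>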
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Results</title>
        <p>Fig. 1 shows the validity for logistic regression, with k = 5 as an illustrative example, and Fig. 3 the
reliability for logistic regression; Fig. 2 and Fig. 4 depict the validity and reliability for XGBoost.</p>
        <p>For logistic regression, we observe that both Shapley effects and SAGE perform relatively
well, given sufficiently large EPVs. At EPV = 5, no selection strategy consistently identifies
the correct feature subset across different levels of k. The selection stability is best for greedy
selection at 0.88 (95% confidence interval (CI) [0.88, 0.89]). Feature selections based on Shapley
effects and best subset selection still yield excellent stabilities at 0.81 (95% CI [0.80, 0.83]) and
0.77 (95% CI [0.75, 0.78]), whereas SAGE achieves the worst stability at 0.74 (95% CI [0.73, 0.76]).
At EPV = 18, the correct features are most frequently found by optimal selection, followed
by SAGE and Shapley effects. With respect to the reliability, the optimal selection strategy
performs best with a stability score of 0.94 (95% CI [0.93, 0.96]). Shapley effects and SAGE still
achieve excellent stabilities at 0.85 (95% CI [0.83, 0.87]) and 0.82 (95% CI [0.80, 0.84]). Greedy
selection performs worst at 0.67 (95% CI [0.66, 0.68]). In terms of the discriminative performance,
all strategies reach mean test AUCs between 0.97 and 1.0 at k = 2, k = 4, and k = 5, and
between 0.83 and 0.87 at k = 3.</p>
        <p>For XGBoost, Shapley effects and SAGE also perform relatively well, albeit at much higher
EPVs. At EPV = 100, the correct feature subset is almost never selected. In terms of the
reliability, SAGE, Shapley effects, and SHAP perform similarly. With stabilities of only 0.74
(95% CI [0.73, 0.76]), 0.73 (95% CI [0.72, 0.75]), and 0.73 (95% CI [0.71, 0.75]), no strategy reaches
excellent scores. At EPV = 200, though, Shapley effects and SAGE select the correct feature
subset most of the time, whereas SHAP rarely recovers the correct features. With respect to the
reliability, both Shapley effects and SAGE reach excellent stabilities at 0.83 (95% CI [0.81, 0.86])
and 0.81 (95% CI [0.79, 0.83]), whereas SHAP does not at 0.69 (95% CI [0.67, 0.71]). Regarding
the discrimination, all methods achieve a mean test AUC of 1.0.
        </p>
        <p>[Figure placeholder: two plots with y-axis "Stability Score" ranging from 0.5 to 1.0; legend entries "5/6" and "6/6".]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Feature selection explanations are useful tools to understand which features are most influential
for a given predictive model. In this work, we find that Shapley effects and SAGE values can offer
valid and reliable explanations given sufficiently large (algorithm-specific) EPVs. Increasing
EPVs from 5 to 18 and 100 to 200 not only enhances validity but also improves the
stability from intermediate or good to excellent for these XAI methods. Nevertheless, this
conclusion is based on only two common machine learning models for small- and
medium-sized tabular data, L2-regularized logistic regression and XGBoost, on four synthetic datasets
with two distinct data generating processes. Although these datasets provide clear ground
truth explanations to evaluate XAI methods, further experiments are necessary to study the
broader applicability of these methods for feature selection - for instance with respect to varying
levels of correlations between features, missing values, or noise [18] and throughout various
datasets and XAI benchmarks such as [22]. Additionally, it would be interesting to investigate
feature selection explanations on even smaller sample sizes, where data-driven feature selection
strategies may need to be complemented with prior knowledge in the form of causal graphs [23]
or large language models [24]. We hope that future work will corroborate that XAI approaches
are sufficiently valid and reliable to be applied across a variety of settings, can be trusted
by practitioners, and can comply with regulatory standards that demand increasing levels of
explainability and transparency for machine learning services.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was funded by the Bundesministerium für Bildung und Forschung (16DHBKI071).</p>
      <p>[9] R. Knauer, E. Rodner, Squeezing lemons with hammers: An evaluation of AutoML and tabular deep learning for data-scarce classification applications, arXiv preprint arXiv:2405.07662 (2024).</p>
      <p>[10] L. Grinsztajn, E. Oyallon, G. Varoquaux, Why do tree-based models still outperform deep learning on typical tabular data?, Advances in Neural Information Processing Systems 35 (2022) 507–520.</p>
      <p>[11] L. Longo, M. Brcic, F. Cabitza, J. Choi, R. Confalonieri, J. Del Ser, R. Guidotti, Y. Hayashi, F. Herrera, A. Holzinger, et al., Explainable artificial intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions, Information Fusion (2024) 102301.</p>
      <p>[12] C. Agarwal, N. Johnson, M. Pawelczyk, S. Krishna, E. Saxena, M. Zitnik, H. Lakkaraju, Rethinking stability for attribution-based explanations, arXiv preprint arXiv:2203.06877 (2022).</p>
      <p>[13] C. Agarwal, S. Krishna, E. Saxena, M. Pawelczyk, N. Johnson, I. Puri, M. Zitnik, H. Lakkaraju, OpenXAI: Towards a transparent evaluation of model explanations, Advances in Neural Information Processing Systems 35 (2022) 15784–15799.</p>
      <p>[14] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of interpretability methods, arXiv preprint arXiv:1806.08049 (2018).</p>
      <p>[15] H. Baniecki, P. Biecek, Manipulating SHAP via adversarial data perturbations (student abstract), Proceedings of the AAAI Conference on Artificial Intelligence (2022).</p>
      <p>[16] F. E. Harrell, Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis, Springer, 2015.</p>
      <p>[17] G. Heinze, C. Wallisch, D. Dunkler, Variable selection – a review and recommendations for the practicing statistician, Biometrical Journal 60 (2018) 431–449.</p>
      <p>[18] R. Knauer, E. Rodner, Cost-sensitive best subset selection for logistic regression: A mixed-integer conic optimization perspective, in: German Conference on Artificial Intelligence (Künstliche Intelligenz), Springer, 2023, pp. 114–129.</p>
      <p>[19] T. Van Der Ploeg, P. C. Austin, E. W. Steyerberg, Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints, BMC Medical Research Methodology 14 (2014) 1–13.</p>
      <p>[20] R. Wallsberger, R. Knauer, S. Matzka, Explainable artificial intelligence in mechanical engineering: A synthetic dataset for comprehensive failure mode analysis, in: 2023 Fifth International Conference on Transdisciplinary AI (TransAI), IEEE, 2023, pp. 249–252.</p>
      <p>[21] S. Nogueira, K. Sechidis, G. Brown, On the stability of feature selection algorithms, Journal of Machine Learning Research 18 (2018) 1–54.</p>
      <p>[22] Y. Liu, S. Khandagale, C. White, W. Neiswanger, Synthetic benchmarks for scientific research in explainable machine learning, Advances in Neural Information Processing Systems (2021).</p>
      <p>[23] T. Heskes, E. Sijben, I. G. Bucur, T. Claassen, Causal Shapley values: Exploiting causal knowledge to explain individual predictions of complex models, Advances in Neural Information Processing Systems 33 (2020) 4778–4789.</p>
      <p>[24] N. Kroeger, D. Ley, S. Krishna, C. Agarwal, H. Lakkaraju, Are large language models post hoc explainers?, arXiv preprint arXiv:2310.05797 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bunte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Großmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kleen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Markert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Meeß</surname>
          </string-name>
          , et al.,
          <source>Deutsche Normungsroadmap Künstliche Intelligenz</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>European Commission</collab>
          ,
          <article-title>Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL LAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE (ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION LEGISLATIVE ACTS</article-title>
          , https://artificialintelligenceact.eu/the-act/,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Covert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Explaining by removing: A unified framework for model explanation</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Owen</surname>
          </string-name>
          ,
          <article-title>Sobol' indices and Shapley value</article-title>
          ,
          <source>SIAM/ASA Journal on Uncertainty Quantification</source>
          <volume>2</volume>
          (
          <year>2014</year>
          )
          <fpage>245</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Nelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Staum</surname>
          </string-name>
          ,
          <article-title>Shapley effects for global sensitivity analysis: Theory and computation</article-title>
          ,
          <source>SIAM/ASA Journal on Uncertainty Quantification</source>
          <volume>4</volume>
          (
          <year>2016</year>
          )
          <fpage>1060</fpage>
          -
          <lpage>1083</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Covert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Understanding global feature contributions with additive importance measures</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>17212</fpage>
          -
          <lpage>17223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fryer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Strümke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Shapley values for feature selection: The good, the bad, and the axioms</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>144352</fpage>
          -
          <lpage>144360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Knauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodner</surname>
          </string-name>
          ,
          <article-title>Squeezing lemons with hammers: An evaluation of automl and tabular</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>