=Paper=
{{Paper
|id=Vol-3761/extendedabstract1
|storemode=property
|title=Explaining Change in Models and Data with Global Feature Importance and Effects
|pdfUrl=https://ceur-ws.org/Vol-3761/extendedabstract1.pdf
|volume=Vol-3761
|authors=Maximilian Muschalik,Fabian Fumagalli,Barbara Hammer,Eyke Hüllermeier
|dblpUrl=https://dblp.org/rec/conf/tempxai/MuschalikFHH24
}}
==Explaining Change in Models and Data with Global Feature Importance and Effects==
Maximilian Muschalik¹,²,∗,†, Fabian Fumagalli³,∗,†, Barbara Hammer³ and Eyke Hüllermeier¹,²
¹ LMU Munich, D-80539 Munich, Germany
² MCML, Munich
³ Bielefeld University, D-33619 Bielefeld, Germany
Abstract
In dynamic machine learning environments, where data streams continuously evolve, traditional explanation
methods struggle to remain faithful to the underlying model or data distribution. Therefore, this work presents
a unified framework for efficiently computing incremental model-agnostic global explanations tailored for
time-dependent models. By extending static model-agnostic methods such as Permutation Feature Importance,
SAGE, and Partial Dependence Plots into the online learning context, the proposed framework enables the
continuous updating of explanations as new data becomes available. These incremental variants ensure that
global explanations remain relevant while minimizing computational overhead. The framework also addresses
key challenges related to data distribution maintenance and perturbation generation in online learning, offering
time and memory efficient solutions like geometric reservoir-based sampling for data replacement.
Keywords
Explainable Artificial Intelligence, Interpretable Machine Learning, Online Learning, Concept Drift
1. Introduction
In applied machine learning, data often evolves over time, which necessitates changes to prediction
models. Ensuring the reliability of such time-dependent models is increasingly important in high-stakes
applications, such as financial services [1], sensor [2, 3] and network [4] analysis. In recent years,
eXplainable Artificial Intelligence (XAI) has targeted such time-dependent explanations of predictions
that react to changes in the underlying data distributions and prediction models [5]. In extreme cases,
where data is observed sequentially over time from a data stream, models are updated incrementally
with each new observation, known as online learning or incremental learning [6]. In this context,
re-computing XAI methods from scratch can become computationally infeasible, which is why incremental
variants have been proposed [7, 8, 9, 10, 11].
In this work, we present a unified framework for efficiently computing incremental variants of
model-agnostic global explanations (MAGEs). We demonstrate that existing incremental XAI techniques
can be summarized in the incremental MAGE framework. Furthermore, static MAGEs cover a wide range
of existing model-agnostic XAI methods, including Shapley interactions [12], which expands the range
of efficient incremental XAI techniques for the interpretability of black-box online learning models.
2. Background
We first introduce background on model-agnostic global explanations (Section 2.1), as well as online
learning from data streams (Section 2.2). We consider a trained black-box model 𝑓 ∶ 𝒳 → 𝒴 with
input domain 𝒳 equipped with a 𝑑-dimensional feature representation 𝒟 = {1, … , 𝑑}, e.g. 𝒳 = ℝ^𝑑, and
output domain 𝒴. We do not make any further assumptions about the model architecture and only
access the model by querying predictions for instances. This is known as model-agnostic explanation
[13].
TempXAI@ECML-PKDD’24: Explainable AI for Time Series and Data Streams Tutorial-Workshop, Sep. 9th, 2024, Vilnius, Lithuania
∗ Corresponding author.
† These authors contributed equally.
Email: Maximilian.Muschalik@lmu.de (M. Muschalik); ffumagalli@techfak.uni-bielefeld.de (F. Fumagalli)
ORCID: 0000-0002-6921-0204 (M. Muschalik); 0000-0003-3955-3510 (F. Fumagalli); 0000-0002-0935-5591 (B. Hammer); 0000-0002-9944-4108 (E. Hüllermeier)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2.1. Model-Agnostic Global Explanations
A global explanation of a black-box model considers the behavior of 𝑓 across a whole labeled dataset
(𝑥𝑗 , 𝑦𝑗 ) ∈ 𝒳 × 𝒴 with 𝑗 = 1, … , 𝑛. Global feature importance (FI) is an instance of global explanations that
outputs an importance score 𝜙 FI ∶ 𝒟 → ℝ for every feature 𝑖 ∈ 𝒟 [14]. Global FI measures a change in
a model’s performance, if the model’s access to this feature’s information is restricted. Permutation FI
(PFI) [15, 16] is computed by permuting the values of the target feature and measuring the change in
performance across a dataset. By permuting the feature’s value, the model’s access to this information
is limited and, thus, PFI yields an efficient way to compute global FI. However, the information that a
feature contributes to the model’s performance might strongly depend on other features. Therefore, perturbing a
single feature’s value in the presence of all remaining features is a limitation of PFI. Shapley additive
global importance (SAGE) [14, 17] accounts for this limitation by computing the increase in loss across
sampled permutations 𝜋 ∶ 𝒟 → 𝒟 over the feature space. For such a permutation 𝜋, each feature
𝑖 ∈ 𝒟 appears at a certain position. SAGE proposes to measure the average increase in loss for the
preceding features of 𝑖 with and without 𝑖. By sampling over several permutations, an approximation of
the Shapley Value (SV) [18] is obtained, a concept from cooperative game theory that guarantees that
the SAGE values fairly decompose the overall loss. While global FI quantifies the impact of individual
features, it is limited in its expressivity.
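The PFI procedure described above can be illustrated with a minimal sketch. It assumes a squared-error loss, a toy model that only uses the first feature, and hypothetical helper names (`mse`, `permutation_feature_importance`); none of these come from the cited works.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long lists."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_feature_importance(predict, loss, X, y, n_repeats=5, seed=0):
    """Mean increase in loss after permuting one feature column at a time."""
    rng = random.Random(seed)
    baseline = loss(y, [predict(x) for x in X])
    importances = []
    for i in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle the values of feature i across all observations.
            column = [x[i] for x in X]
            rng.shuffle(column)
            X_perm = [x[:i] + [v] + x[i + 1:] for x, v in zip(X, column)]
            permuted_loss = loss(y, [predict(x) for x in X_perm])
            increases.append(permuted_loss - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data (assumption): the target and the model depend on feature 0 only.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
y = [2.0 * x[0] for x in X]
model = lambda x: 2.0 * x[0]

pfi = permutation_feature_importance(model, mse, X, y)
```

In this toy setting only feature 0 receives a positive score, because permuting the unused features leaves every prediction unchanged.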
To understand Feature Effects (FE), Partial Dependence Plots (PDPs) [19] visualize the effect of
setting a specified feature to a fixed value across all observations and computing the average prediction
over these modified observations. The PDP visualizes this average prediction across a
range of different values, which allows a global interpretation of the average effect of changing this
feature’s value [20]. Besides PDP, there exist other FE methods [20, 21, 22] with extensions to regional
explanations [20, 23]. Another way of quantifying FEs is by using interaction indices that distribute
contributions to all individuals and groups of features up to a maximum group size 𝑘. In recent work,
several Shapley-based interaction indices have been proposed [24, 25, 26] as well as their efficient
computation in a model-agnostic setting [12, 24, 25, 27, 28, 29]. Model-agnostic global explanations
have been widely applied in static environments [30]. In practice, however, data is often dynamic,
and explanations become outdated when models are adapted over time.
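As a small illustration of the PDP computation described above, the following sketch averages the predictions obtained when one feature is set to each value of a grid. The function name `partial_dependence`, the toy model, and the grid are assumptions made for this example.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction over all observations with `feature` set to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for x in X:
            x_mod = list(x)          # copy so the original observation stays intact
            x_mod[feature] = value   # impute the grid value for the chosen feature
            total += predict(x_mod)
        curve.append(total / len(X))
    return curve


# Toy setup (assumption): a linear model on two features.
X = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
model = lambda x: 3.0 * x[0] + x[1]
grid = [0.0, 1.0, 2.0]

pdp = partial_dependence(model, X, feature=0, grid=grid)
```

For this linear toy model the curve is a straight line with slope 3, shifted by the mean of the second feature.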
2.2. Online Learning From Data Streams
In many real-world applications [2] data is observed sequentially over time. In an extreme setting, we
observe a data stream (𝑥0 , 𝑦0 ), … , (𝑥𝑡 , 𝑦𝑡 ), where at time 𝑡 the data point (𝑥𝑡 , 𝑦𝑡 ) is observed. The goal of
online learning [6] is to train a time-dependent model 𝑓𝑡 by using the current observation (𝑥𝑡 , 𝑦𝑡 ) once to
obtain an updated model 𝑓𝑡+1 , i.e.
IncrementalUpdate (𝑓𝑡 , 𝑥𝑡 , 𝑦𝑡 ) ⟶ 𝑓𝑡+1 .
Prominent instances of online learning algorithms include Hoeffding adaptive trees [31] and adaptive
random forests [32], where splits and tree structures are replaced if they become outdated. Other
training schemes, such as stochastic gradient descent, inherently allow for incremental updates [6]. Online
learning is especially important if the underlying data distribution changes over time. This phenomenon
is known as concept drift and occurs in many applications [33]. Detecting concept drift and reacting
adequately by updating the model accordingly is one of the major applications of incremental learning
[6]. A common approach to detect concept drift is via accuracy-based drift detectors, where a sudden
change in accuracy of the model indicates a change of distributions [33]. Recently, it was proposed
to enhance such detection schemes using global FI methods [5]. However, the computation of such
methods is a challenging problem that has been mainly considered in static scenarios.
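The incremental update scheme can be sketched with a linear model trained by stochastic gradient descent, where each observation is used exactly once. The class `OnlineLinearRegression`, the learning rate, and the synthetic stream with an abrupt concept drift at t = 2000 are assumptions made for illustration.

```python
import random


class OnlineLinearRegression:
    """Linear model updated with one SGD step per observation."""

    def __init__(self, n_features, learning_rate=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def learn_one(self, x, y):
        """IncrementalUpdate(f_t, x_t, y_t) -> f_{t+1} for the loss 0.5 * (f(x) - y)^2."""
        error = self.predict(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


rng = random.Random(0)
model = OnlineLinearRegression(n_features=2)
for t in range(4000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    # Synthetic concept drift (assumption): the target switches from feature 0 to feature 1.
    y = 2.0 * x[0] if t < 2000 else 2.0 * x[1]
    model.learn_one(x, y)
```

After the drift, the weights move from roughly (2, 0) towards (0, 2), which is the kind of model change that time-dependent explanations are meant to reveal.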
3. A Unified Framework for Explaining Change in Models and Data
We now present a unified framework for efficiently exploring incremental variants of model-
agnostic global explanations in an online learning setting. In a static setting, a global explanation
is typically computed for individual features (global FI) or groups of features (global FE), which we
summarize in the following definition.
Definition 1 (ℰ). Global explanations are computed for every element in the explanation domain ℰ,
which is a collection of features and interactions ℰ ⊆ 2^𝒟.
Given an explanation domain, the explanation can be computed for each element in a static setting.
Definition 2 (Static MAGE). A static Model-Agnostic Global Explanation (MAGE) 𝜙𝑓 ∶ ℰ → ℝ for
a set of features 𝑆 is
\[ \phi_f(S) := \frac{1}{n} \sum_{j=1}^{n} \lambda_f\left(x_j, y_j, S, \mathcal{P}_{x_j,S}\right). \]
Here, 𝒫𝑥𝑗 ,𝑆 is a set of data points and 𝜆𝑓 is a method-specific explanation function.
Typically, the perturbation data 𝒫𝑥,𝑆 is constructed by combining the data point 𝑥 with
another sampled data point 𝑥̃, where the values of the features in 𝑆 and −𝑆 ∶= 𝒟 ∖ 𝑆 are taken from either
𝑥 or 𝑥̃ [14]. The sampling of 𝑥̃ may be done dependently or independently of 𝑥. Instantiations
of static MAGEs include PFI [15], where ℰ contains individual features and 𝜆𝑓 measures the increase in
loss. Therein, 𝒫𝑥𝑗 ,𝑆 includes a single data point constructed from the values of 𝑥𝑗 for features in −𝑆 and
the values of 𝑆 from another data point obtained from the dataset using a permutation. SAGE [14] is
also covered in this framework by choosing 𝜆𝑓 as the average over sampled permutations over 𝒟, as
described in Section 2.1. Lastly, PDPs [19] are contained in this framework, where 𝜆𝑓 is chosen as the
prediction of a combination of 𝑥𝑗 and 𝑥̃ ∈ 𝒫𝑥𝑗 ,𝑆 , where 𝒫𝑥,𝑆 contains the data points for which the PDP
is visualized.
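To make the roles of ℰ, 𝜆𝑓 and 𝒫𝑥,𝑆 concrete, the following sketch implements the average from Definition 2 for a PFI-style explanation function. All names (`static_mage`, `pfi_lambda`, `make_marginal_sampler`) are hypothetical. As a simplification, the perturbation point is drawn at random from the dataset instead of being obtained through a permutation.

```python
import random


def static_mage(explain_fn, predict, data, domain, sample_perturbations):
    """phi_f(S) = 1/n * sum_j lambda_f(x_j, y_j, S, P_{x_j,S}) for every S in the domain."""
    n = len(data)
    return {
        S: sum(explain_fn(predict, x, y, S, sample_perturbations(x, S)) for x, y in data) / n
        for S in domain
    }


def pfi_lambda(predict, x, y, S, perturbations):
    """PFI-style lambda: increase in squared loss when the features in S are replaced."""
    base = (predict(x) - y) ** 2
    perturbed = [(predict(p) - y) ** 2 for p in perturbations]
    return sum(perturbed) / len(perturbed) - base


def make_marginal_sampler(data, rng):
    """Build P_{x,S}: one point with features in S taken from a randomly drawn observation."""
    def sample(x, S):
        x_tilde, _ = rng.choice(data)
        return [[x_tilde[i] if i in S else x[i] for i in range(len(x))]]
    return sample


# Toy setup (assumption): only feature 0 influences the model and the target.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
data = [(x, 2.0 * x[0]) for x in X]
model = lambda x: 2.0 * x[0]
domain = [(0,), (1,)]  # explanation domain with two singleton feature sets

phi = static_mage(pfi_lambda, model, data, domain, make_marginal_sampler(data, rng))
```

Swapping `pfi_lambda` for a function that returns the prediction on the perturbed points would turn the same skeleton into a PDP-style explanation.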
Having established a unified view on static MAGEs, we now turn our focus to an online learning
setting as described in Section 2.2. Using the observed data points at time 𝑡 a naive way to compute
MAGEs is via Definition 2 as
\[ \phi_t(S) := \frac{1}{t} \sum_{s=0}^{t-1} \lambda_{f_t}\left(x_s, y_s, S, \mathcal{P}_{x_s,S}\right). \tag{1} \]
Re-computing Eq. 1 at every time step 𝑡 is an expensive operation, since static MAGEs are already
time-consuming when computed once [14]. Moreover, Eq. 1 requires storing the full data stream,
which is typically considered infeasible. As a remedy, practitioners might restrict the computation to
a time window of fixed size [5]. However, reducing the number of observations increases the variance
of the estimate and thus lowers the quality of the explanation. In the following, we propose a framework for an
incremental computation of 𝜙𝑡 , similar to the incremental update of the model 𝑓𝑡 . Our goal is to leverage
the previously calculated MAGE and update this explanation using the currently available data point, i.e.
IncrementalUpdate (𝜙𝑡 , 𝑓𝑡 , 𝑥𝑡 , 𝑦𝑡 ) ⟶ 𝜙𝑡+1
By introducing a smoothing parameter 0 < 𝛼 < 1, we define the incremental MAGE.
Definition 3 (Incremental MAGE). Let 0 < 𝛼 < 1. We define an incremental MAGE as
𝜙𝑡 (𝑆) ∶= (1 − 𝛼) ⋅ 𝜙𝑡−1 (𝑆) + 𝛼 ⋅ 𝜆𝑓𝑡 (𝑥𝑡 , 𝑦𝑡 , 𝑆, 𝒫𝑥𝑡 ,𝑆 ) .
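Unrolling this recursion shows how past observations are weighted. The following derivation assumes an initial value \phi_0(S), which is not specified in the text, and writes \lambda_s as shorthand for the term computed at time s.

```latex
\begin{align*}
\phi_t(S) &= (1-\alpha)\,\phi_{t-1}(S) + \alpha\,\lambda_t,
  \qquad \lambda_s := \lambda_{f_s}\left(x_s, y_s, S, \mathcal{P}_{x_s,S}\right) \\
          &= (1-\alpha)^2\,\phi_{t-2}(S) + \alpha\,(1-\alpha)\,\lambda_{t-1} + \alpha\,\lambda_t \\
          &= (1-\alpha)^t\,\phi_0(S) + \alpha \sum_{s=1}^{t} (1-\alpha)^{t-s}\,\lambda_s .
\end{align*}
```

The weights sum to one, since (1-\alpha)^t + \alpha \sum_{s=1}^{t} (1-\alpha)^{t-s} = 1, and the contribution of an observation decays geometrically with its age. In contrast to Eq. 1, each term is evaluated with the model f_s that was available when the observation arrived.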
The incremental MAGE computes a single term of the sum in Eq. 1 at each time step and reuses the
previously computed MAGE values. This drastically reduces the computational complexity: the total cost
accumulated up to time 𝑡 equals that of computing the MAGE once with Eq. 1, yet the incremental MAGE provides 𝜙𝑡
for every time step 𝑡 at no additional computational cost. Incremental variants of PFI [7],
SAGE [8], and PDP [9] have recently been proposed. They can be viewed as instantiations of
incremental MAGEs.
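The update rule of Definition 3 can be sketched as an exponentially smoothed estimate per element of the explanation domain. The class `IncrementalMAGE`, the initial value of zero, the loss-increase explanation function, and the sliding buffer used to draw perturbation points are all assumptions made for this illustration and are not taken from the cited implementations.

```python
import random
from collections import deque


class IncrementalMAGE:
    """phi_t(S) = (1 - alpha) * phi_{t-1}(S) + alpha * lambda_{f_t}(x_t, y_t, S, P_{x_t,S})."""

    def __init__(self, explain_fn, domain, alpha=0.01):
        assert 0.0 < alpha < 1.0
        self.explain_fn = explain_fn
        self.alpha = alpha
        self.phi = {S: 0.0 for S in domain}  # phi_0 = 0 is an assumption

    def update(self, predict, x, y, sample_perturbations):
        for S in self.phi:
            term = self.explain_fn(predict, x, y, S, sample_perturbations(x, S))
            self.phi[S] = (1 - self.alpha) * self.phi[S] + self.alpha * term
        return self.phi


def loss_increase(predict, x, y, S, perturbations):
    """Increase in squared loss when the features in S are replaced."""
    base = (predict(x) - y) ** 2
    return sum((predict(p) - y) ** 2 for p in perturbations) / len(perturbations) - base


rng = random.Random(0)
buffer = deque(maxlen=100)  # simple sliding storage of recent inputs (assumption)


def sample_perturbations(x, S):
    x_tilde = rng.choice(buffer)
    return [[x_tilde[i] if i in S else x[i] for i in range(len(x))]]


explainer = IncrementalMAGE(loss_increase, domain=[(0,), (1,)], alpha=0.01)
for t in range(3000):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    k = 0 if t < 1500 else 1                 # synthetic model change at t = 1500
    predict = lambda z, k=k: 2.0 * z[k]      # stands in for the current model f_t
    y = predict(x)
    buffer.append(x)
    explainer.update(predict, x, y, sample_perturbations)
```

In this toy stream the importance of feature 0 decays towards zero after the change, while the importance of feature 1 rises, without revisiting earlier observations.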
A major challenge in computing incremental MAGEs is the maintenance of the perturbation dataset
𝒫𝑥,𝑆 over time, i.e. efficiently constructing perturbed data points that adhere to the data distribution.
Reservoir sampling [34] has been adapted to efficiently store the data distribution with minimum
resources [7]. Geometric sampling [7] stores a reservoir of fixed length, in which data points
are replaced over time such that more recent observations have a higher probability of being present in the
reservoir than older observations. This mechanism makes it possible to maintain a time-dependent
marginal data distribution with limited resources. More advanced techniques maintain conditional
distributions using online decision trees and allow for conditional sampling as required for instance in
conditional SAGE [8]. It has been shown that both sampling techniques yield substantially different
explanations [35]. Geometric sampling with marginal distributions highlights the structure of the
model, whereas observational approaches via conditional sampling include the data distribution in the
explanation [35].
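One simple replacement rule that matches the description of geometric sampling is sketched below: once the reservoir is full, every new observation overwrites a uniformly chosen slot. Under this rule, an observation that entered the reservoir r steps ago survives with probability (1 − 1/m)^r for reservoir size m. The class `GeometricReservoir` is a hypothetical name, and details of the procedure in [7] may differ.

```python
import random


class GeometricReservoir:
    """Fixed-size storage in which recent observations are more likely to be present."""

    def __init__(self, size, seed=0):
        self.size = size
        self.items = []
        self.rng = random.Random(seed)

    def update(self, x):
        if len(self.items) < self.size:
            # Fill phase: store observations until the reservoir is full.
            self.items.append(x)
        else:
            # Replacement phase: the new observation overwrites a uniformly chosen slot.
            self.items[self.rng.randrange(self.size)] = x

    def sample(self):
        """Draw one stored observation, e.g. as the point x-tilde for a perturbation."""
        return self.rng.choice(self.items)


reservoir = GeometricReservoir(size=50)
for t in range(10_000):
    reservoir.update(t)  # time stamps stand in for observations

mean_age = sum(10_000 - 1 - t for t in reservoir.items) / len(reservoir.items)
```

Memory stays constant at the reservoir size, and the stored points are concentrated on the recent past, which is what allows the sampled perturbations to follow a drifting marginal distribution.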
4. Conclusion and Future Work
We summarized popular model-agnostic global explanation techniques, such as FI-based PFI and SAGE,
as well as FE-based PDPs, into the MAGE framework for static learning environments. We then proposed
the incremental MAGE framework to directly compute these explanations for online learning on data
streams. Incremental MAGE allows previous estimates of MAGEs to be updated incrementally at each time
step using minimal resources. We have shown that incremental variants, such as iPFI, iSAGE and iPDP,
can be summarized in the incremental MAGE framework. Incremental MAGE offers opportunities to
expand the range of incremental variants of MAGE techniques. For instance, recently proposed methods
to estimate Shapley interactions [12, 25, 29] may be placed in the incremental MAGE framework to
discover complex interactions beyond isolated FE. Moreover, with an increasing variety of explanations
using different complexity levels, human-centered presentations and visualizations are important future
work.
Acknowledgments
We gratefully acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research
Foundation): TRR 318/1 2021 – 438445824.
References
[1] J. M. Clements, D. Xu, N. Yousefi, D. Efimov, Sequential Deep Learning for Credit Risk Monitoring
with Tabular Financial Data, CoRR abs/2012.15330 (2020). arXiv:2012.15330 .
[2] M. Bahri, A. Bifet, J. Gama, H. M. Gomes, S. Maniu, Data stream analysis: Foundations, major
tasks and tools, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 11 (2021)
e1405. doi:10.1002/widm.1405 .
[3] N. Davari, B. Veloso, R. P. Ribeiro, P. M. Pereira, J. Gama, Predictive maintenance based on
anomaly detection using deep learning for air production unit in the railway industry, in: 8th
IEEE International Conference on Data Science and Advanced Analytics (DSAA 2021), IEEE, 2021,
pp. 1–10. doi:10.1109/DSAA53316.2021.9564181 .
[4] B. G. Atli, A. Jung, Online Feature Ranking for Intrusion Detection Systems, CoRR abs/1803.00530
(2018). arXiv:1803.00530 .
[5] M. Muschalik, F. Fumagalli, B. Hammer, E. Hüllermeier, Agnostic explanation of model change
based on feature importance, Künstliche Intell. 36 (2022) 211–224. URL: https://doi.org/10.1007/s13218-022-00766-6.
doi:10.1007/s13218-022-00766-6.
[6] V. Losing, B. Hammer, H. Wersing, Incremental On-line Learning: A Review and Comparison
of State of the Art Algorithms, Neurocomputing 275 (2018) 1261–1274. doi:10.1016/j.neucom.2017.06.084.
[7] F. Fumagalli, M. Muschalik, E. Hüllermeier, B. Hammer, Incremental permutation feature importance
(ipfi): towards online explanations on data streams, Mach. Learn. 112 (2023) 4863–4903.
URL: https://doi.org/10.1007/s10994-023-06385-y. doi:10.1007/s10994-023-06385-y.
[8] M. Muschalik, F. Fumagalli, B. Hammer, E. Hüllermeier, isage: An incremental version of SAGE
for online explanation on data streams, in: D. Koutra, C. Plant, M. G. Rodriguez, E. Baralis,
F. Bonchi (Eds.), Machine Learning and Knowledge Discovery in Databases: Research Track
- European Conference, ECML PKDD 2023, Turin, Italy, September 18-22, 2023, Proceedings,
Part III, volume 14171 of Lecture Notes in Computer Science, Springer, 2023, pp. 428–445. URL:
https://doi.org/10.1007/978-3-031-43418-1_26. doi:10.1007/978-3-031-43418-1_26.
[9] M. Muschalik, F. Fumagalli, R. Jagtani, B. Hammer, E. Hüllermeier, ipdp: On partial dependence
plots in dynamic modeling scenarios, in: L. Longo (Ed.), Explainable Artificial Intelligence -
First World Conference, xAI 2023, Lisbon, Portugal, July 26-28, 2023, Proceedings, Part I, volume
1901 of Communications in Computer and Information Science, Springer, 2023, pp. 177–194. URL:
https://doi.org/10.1007/978-3-031-44064-9_11. doi:10.1007/978-3-031-44064-9_11.
[10] A. P. Cassidy, F. A. Deviney, Calculating feature importance in data streams with concept drift
using online random forest, in: 2014 IEEE International Conference on Big Data (Big Data 2014),
2014, pp. 23–28. doi:10.1109/BigData.2014.7004352 .
[11] H. M. Gomes, R. F. d. Mello, B. Pfahringer, A. Bifet, Feature scoring using tree-based ensembles for
evolving data streams, in: 2019 IEEE International Conference on Big Data (Big Data 2019), 2019,
p. 761–769.
[12] F. Fumagalli, M. Muschalik, P. Kolpaczki, E. Hüllermeier, B. E. Hammer, SHAP-IQ: Unified
approximation of any-order shapley interactions, in: Thirty-seventh Conference on Neural
Information Processing Systems (NeurIPS 2023), 2023.
[13] A. Adadi, M. Berrada, Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence
(XAI), IEEE Access 6 (2018) 52138–52160. doi:10.1109/ACCESS.2018.2870052 .
[14] I. Covert, S. M. Lundberg, S.-I. Lee, Understanding Global Feature Contributions With Additive Im-
portance Measures, in: Proceedings of International Conference on Neural Information Processing
Systems (NeurIPS 2020), 2020, p. 17212–17223.
[15] L. Breiman, Random Forests, Machine Learning 45 (2001) 5–32.
[16] A. Fisher, C. Rudin, F. Dominici, All Models are Wrong, but Many are Useful: Learning a Variable’s
Importance by Studying an Entire Class of Prediction Models Simultaneously, Journal of Machine
Learning Research 20 (2019) 1–81.
[17] G. Casalicchio, C. Molnar, B. Bischl, Visualizing the Feature Importance for Black Box Models,
volume 11051 of Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019,
pp. 655–670. doi:10.1007/978-3-030-10925-7_40.
[18] L. S. Shapley, A Value for n-Person Games, in: Contributions to the Theory of Games (AM-28),
Volume II, Princeton University Press, New Jersey, USA, 1953, pp. 307–318.
[19] J. H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of
Statistics 29 (2001) 1189–1232. URL: http://www.jstor.org/stable/2699986.
[20] J. Herbinger, B. Bischl, G. Casalicchio, REPID: regional effect plots with implicit interac-
tion detection, in: G. Camps-Valls, F. J. R. Ruiz, I. Valera (Eds.), International Conference
on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, vol-
ume 151 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 10209–10233. URL:
https://proceedings.mlr.press/v151/herbinger22a.html.
[21] D. W. Apley, J. Zhu, Visualizing the effects of predictor variables in black box supervised learning
models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (2016). URL:
https://api.semanticscholar.org/CorpusID:88522102.
[22] S. M. Lundberg, G. G. Erion, H. Chen, A. J. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb,
N. Bansal, S. Lee, From local explanations to global understanding with explainable AI for trees,
Nature Machine Intelligence 2 (2020) 56–67. doi:10.1038/s42256-019-0138-9.
[23] J. Herbinger, B. Bischl, G. Casalicchio, Decomposing global feature effects based on feature
interactions, CoRR abs/2306.00541 (2023). URL: https://doi.org/10.48550/arXiv.2306.00541.
doi:10.48550/arXiv.2306.00541. arXiv:2306.00541.
[24] M. Sundararajan, K. Dhamdhere, A. Agarwal, The Shapley Taylor Interaction Index, in: Proceedings
of the 37th International Conference on Machine Learning, (ICML 2020), volume 119 of Proceedings
of Machine Learning Research, PMLR, 2020, pp. 9259–9268.
[25] C. Tsai, C. Yeh, P. Ravikumar, Faith-Shap: The Faithful Shapley Interaction Index, Journal of
Machine Learning Research 24 (2023) 1–42.
[26] S. Bordt, U. von Luxburg, From Shapley Values to Generalized Additive Models and back, in:
International Conference on Artificial Intelligence and Statistics (AISTATS 2023), volume 206 of
Proceedings of Machine Learning Research, PMLR, 2023, pp. 709–745.
[27] P. Kolpaczki, M. Muschalik, F. Fumagalli, B. Hammer, E. Hüllermeier, SVARM-IQ: Efficient
approximation of any-order Shapley interactions through stratification, in: Proceedings of The
27th International Conference on Artificial Intelligence and Statistics, (AISTATS 2024), volume
238 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 3520–3528.
[28] F. Fumagalli, M. Muschalik, P. Kolpaczki, E. Hüllermeier, B. Hammer, KernelSHAP-IQ: Weighted
least square optimization for shapley interactions, in: Forty-first International Conference on
Machine Learning, 2024. URL: https://openreview.net/forum?id=d5jXW2H4gg.
[29] M. Muschalik, F. Fumagalli, B. Hammer, E. Hüllermeier, Beyond treeshap: Efficient computation of
any-order shapley interactions for tree ensembles, in: Thirty-Eighth AAAI Conference on Artificial
Intelligence, (AAAI 2024), AAAI Press, 2024, pp. 14388–14396. doi:10.1609/AAAI.V38I13.29352 .
[30] I. Covert, S. Lundberg, S.-I. Lee, Explaining by Removing: A Unified Framework for Model
Explanation, Journal of Machine Learning Research 22 (2021) 1–90.
[31] G. Hulten, L. Spencer, P. Domingos, Mining time-changing data streams, in: Proceedings of the
seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD
2001), ACM Press, 2001, p. 97–106. doi:10.1145/502512.502529 .
[32] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes, T. Abdessalem,
Adaptive random forests for evolving data stream classification, Machine Learning 106 (2017)
1469–1495. doi:10.1007/s10994-017-5642-8.
[33] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, G. Zhang, Learning under Concept Drift: A Review, IEEE
Transactions on Knowledge and Data Engineering (2018) 2346–2363. doi:10.1109/TKDE.2018.2876857.
[34] J. S. Vitter, Random Sampling with a Reservoir, ACM Transactions on Mathematical Software 11
(1985) 37–57. doi:10.1016/j.ipl.2005.11.003 .
[35] F. Fumagalli, M. Muschalik, E. Hüllermeier, B. Hammer, On feature removal for explainability in
dynamic environments, in: 31st European Symposium on Artificial Neural Networks, Computa-
tional Intelligence and Machine Learning, ESANN 2023, Bruges, Belgium, October 4-6, 2023, 2023.
URL: https://doi.org/10.14428/esann/2023.ES2023-148. doi:10.14428/esann/2023.ES2023-148.