<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1038/s42256</article-id>
      <title-group>
        <article-title>Feature Importance and Effects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maximilian Muschalik</string-name>
          <email>Maximilian.Muschalik@lmu.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Fumagalli</string-name>
          <email>ffumagalli@techfak.uni-bielefeld.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Hammer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eyke Hüllermeier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bielefeld University</institution>
          ,
          <addr-line>D-33619 Bielefeld</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LMU Munich</institution>
          ,
          <addr-line>D-80539 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>2</volume>
      <issue>2020</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In dynamic machine learning environments, where data streams continuously evolve, traditional explanation methods struggle to remain faithful to the underlying model or data distribution. Therefore, this work presents a unified framework for efficiently computing incremental model-agnostic global explanations tailored for time-dependent models. By extending static model-agnostic methods such as Permutation Feature Importance, SAGE, and Partial Dependence Plots into the online learning context, the proposed framework enables the continuous updating of explanations as new data becomes available. These incremental variants ensure that global explanations remain relevant while minimizing computational overhead. The framework also addresses key challenges related to data distribution maintenance and perturbation generation in online learning, offering time- and memory-efficient solutions such as geometric reservoir-based sampling for data replacement.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable Artificial Intelligence</kwd>
        <kwd>Interpretable Machine Learning</kwd>
        <kwd>Online Learning</kwd>
        <kwd>Concept Drift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In applied machine learning, data often evolves over time, which necessitates changes to prediction
models. Ensuring the reliability of such time-dependent models is increasingly important in high-stakes
applications, such as financial services [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], sensor [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and network [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] analysis. In recent years,
eXplainable Artificial Intelligence (XAI) has targeted time-dependent explanations of predictions
that react to changes in the underlying data distributions and prediction models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In extreme cases,
where data is observed sequentially over time from a data stream, models are updated incrementally
with each new observation, which is known as online learning or incremental learning [6]. In this context,
re-computing XAI methods from scratch can become computationally infeasible; hence, incremental
variants have been proposed [7, 8, 9, 10, 11].
      </p>
      <p>In this work, we present a unified framework for efficiently computing incremental variants of
model-agnostic global explanations (MAGEs). We demonstrate that existing incremental XAI techniques
are subsumed by the incremental MAGE framework. Furthermore, static MAGEs cover a wide range
of existing model-agnostic XAI methods, including Shapley interactions [12], which expands the range
of efficient incremental XAI techniques for interpreting black-box online learning models.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>We first introduce background on model-agnostic global explanations (Section 2.1), as well as online
learning from data streams (Section 2.2). We consider a trained black-box model f : X → Y with
input domain X equipped with a d-dimensional feature representation D = {1, …, d}, e.g. X = ℝ^d, and
output domain Y. We do not make any further assumptions on the model architecture and instead
only allow access to the model by predicting instances. This is known as model-agnostic explanation
[13].</p>
      <sec id="sec-2-1">
        <title>2.1. Model-Agnostic Global Explanations</title>
        <p>A global explanation of a black-box model considers the behavior of f across a whole labeled dataset
(x_i, y_i) ∈ X × Y with i = 1, …, n. Global feature importance (FI) is an instance of global explanations that
outputs an importance score φ^FI : D → ℝ for every feature j ∈ D [14]. Global FI measures the change in
a model's performance if the model's access to this feature's information is restricted. Permutation FI
(PFI) [15, 16] is computed by permuting the values of the target feature and measuring the change in
performance across a dataset. By permuting the feature's values, the model's access to this information
is limited and, thus, PFI yields an efficient way to compute global FI. However, a feature's contribution
to the model's performance might strongly depend on other features. Therefore, perturbing a
single feature's value in the presence of all remaining features is a limitation of PFI. Shapley additive
global importance (SAGE) [14, 17] accounts for this limitation by computing the increase in loss across
sampled permutations π : D → D over the feature space. For such a permutation π, each feature
j ∈ D appears at a certain position. SAGE measures the average change in loss when j is added to the
features preceding it in π. By sampling over several permutations, an approximation of
the Shapley Value (SV) [18] is obtained, a concept from cooperative game theory that guarantees that
the SAGE values fairly decompose the overall loss. While global FI quantifies the impact of individual
features, it is limited in its expressivity.</p>
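As an illustration of PFI, the following sketch permutes one feature column and measures the resulting change in average loss. This is a minimal marginal-perturbation example under our own hypothetical toy model and squared loss, not the implementation from [15, 16].

```python
import random

def permutation_feature_importance(model, X, y, loss, feature, n_repeats=10, seed=0):
    """PFI for one feature: average increase in loss after permuting its column."""
    rng = random.Random(seed)
    base = sum(loss(yi, model(xi)) for xi, yi in zip(X, y)) / len(X)
    increases = []
    for _ in range(n_repeats):
        col = [xi[feature] for xi in X]
        rng.shuffle(col)  # break the association between this feature and the target
        X_perm = [xi[:feature] + (v,) + xi[feature + 1:] for xi, v in zip(X, col)]
        perm_loss = sum(loss(yi, model(xi)) for xi, yi in zip(X_perm, y)) / len(X)
        increases.append(perm_loss - base)
    return sum(increases) / n_repeats

# toy setting: the model (and the target) depend only on feature 0
model = lambda x: x[0]
loss = lambda y_true, y_pred: (y_true - y_pred) ** 2
X = [(float(i), float(i % 3)) for i in range(50)]
y = [x[0] for x in X]
pfi_used = permutation_feature_importance(model, X, y, loss, feature=0)
pfi_unused = permutation_feature_importance(model, X, y, loss, feature=1)
```

Permuting the used feature raises the loss, while permuting the ignored feature leaves it unchanged, which is exactly the restriction-of-information idea described above.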
        <p>To understand Feature Effects (FE), Partial Dependence Plots (PDPs) [19] visualize the effect of
imputing a specified feature's value across all observations and compute the average prediction over all
observations when this feature's value is set. The PDP visualizes this average prediction across a
range of different values, which allows one to globally interpret the effect of changing this feature's value
on average [20]. Besides PDPs, there exist other FE methods [20, 21, 22] with extensions to regional
explanations [20, 23]. Another way of quantifying FEs is by using interaction indices that distribute
contributions to all individuals and groups of features up to a maximum group size k. In recent work,
several Shapley-based interaction indices have been proposed [24, 25, 26], as well as methods for their
efficient computation in a model-agnostic setting [12, 24, 25, 27, 28, 29]. Model-agnostic global explanations
have been widely applied in static environments [30]; in practice, however, data is often of a dynamic nature,
and explanations become outdated when models are adapted over time.</p>
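The PDP computation described above can be sketched in a few lines; the linear toy model and grid are our own hypothetical choices for illustration.

```python
def partial_dependence(model, X, feature, grid):
    """Marginal PDP: for each grid value, impute it into every observation
    and average the model's predictions."""
    curve = []
    for value in grid:
        preds = [model(x[:feature] + (value,) + x[feature + 1:]) for x in X]
        curve.append(sum(preds) / len(preds))
    return curve

# additive toy model: 2 * x0 + x1
model = lambda x: 2.0 * x[0] + x[1]
X = [(float(i), float(i % 5)) for i in range(20)]
curve = partial_dependence(model, X, feature=0, grid=[0.0, 1.0, 2.0])
```

For this additive model, the curve recovers the slope of feature 0 shifted by the average contribution of the remaining feature, which is the "average prediction across a range of values" the text describes.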
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Online Learning From Data Streams</title>
        <p>
          In many real-world applications [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], data is observed sequentially over time. In an extreme setting, we
observe a data stream (x_0, y_0), …, (x_t, y_t), where at time t the data point (x_t, y_t) is observed. The goal of
online learning [6] is to train a time-dependent model f_t by using the current observation (x_t, y_t) once to
obtain an updated model f_{t+1}, i.e.
        </p>
        <p>IncrementalUpdate(f_t, x_t, y_t) ⟶ f_{t+1} .</p>
        <p>
          Prominent instances of online learning algorithms include Hoeffding adaptive trees [31] and adaptive
random forests [32], where splits and tree structures are replaced if they become outdated. Other
training schemes, such as stochastic gradient descent, inherently allow for incremental updates [6]. Online
learning is especially important if the underlying data distribution changes over time. This phenomenon
is known as concept drift and occurs in many applications [33]. Detecting concept drift and reacting
adequately by updating the model accordingly is one of the major applications of incremental learning
[6]. A common approach to detecting concept drift is via accuracy-based drift detectors, where a sudden
change in the model's accuracy indicates a change of distributions [33]. Recently, it was proposed
to enhance such detection schemes using global FI methods [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. However, the computation of such
methods is a challenging problem that has mainly been considered in static scenarios.
        </p>
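A minimal sketch of such an incremental update, using online stochastic gradient descent on a linear model with squared loss (our own illustrative setting; the stream and learning rate are hypothetical):

```python
import random

def incremental_update(weights, x, y, lr):
    """One online SGD step on squared loss: maps f_t and (x_t, y_t) to f_{t+1}."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    err = pred - y
    return [w - lr * err * xi for w, xi in zip(weights, x)]

# stream drawn from y = 3 * x0: each observation is used exactly once
rng = random.Random(0)
weights = [0.0]
for _ in range(2000):
    x = [rng.uniform(-1.0, 1.0)]
    weights = incremental_update(weights, x, 3.0 * x[0], lr=0.1)
```

Each observation is consumed once and then discarded, matching the data-stream constraint; after sufficiently many steps the weight approaches the true parameter.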
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. A Unified Framework for Explaining Change in Models and Data</title>
      <p>We now present a unified framework for efficiently computing incremental variants of
model-agnostic global explanations in an online learning setting. In a static setting, a global explanation
is typically computed for individual features (global FI) or groups of features (global FE), which we
summarize in the following definition.</p>
      <p>Definition 1 (Explanation Domain ℰ). Global explanations are computed for every element in the explanation domain ℰ,
which is a collection of features and interactions ℰ ⊆ 2^D.</p>
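To make Definition 1 concrete, the explanation domain can be enumerated directly; the following sketch is our own illustrative example (the feature indices and order bound are hypothetical), not code from the referenced methods.

```python
from itertools import combinations

def explanation_domain(features, max_order):
    """Enumerate an explanation domain: all feature subsets of size 1..max_order."""
    domain = []
    for k in range(1, max_order + 1):
        domain.extend(frozenset(s) for s in combinations(features, k))
    return domain

features = [0, 1, 2]
fi_domain = explanation_domain(features, 1)    # singletons only, as in global FI
int_domain = explanation_domain(features, 2)   # singletons plus pairwise interactions
```

With max_order = 1 the domain matches global FI; raising the bound adds the interaction terms discussed in Section 2.1.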
      <p>Given an explanation domain, the explanation can be computed for each element in a static setting.</p>
      <p>Definition 2 (Static MAGE). A static Model-Agnostic Global Explanation (MAGE) Φ : ℰ → ℝ for
a set of features S ∈ ℰ is</p>
      <p>Φ(S) := (1/n) ∑_{i=1}^{n} φ_S(x_i, y_i, f, B_{i,S}) .</p>
      <p>Here, B_{i,S} is a set of data points and φ_S is a method-specific explanation function.</p>
      <p>Typically, the perturbation data B_{i,S} is constructed by using a combination of the data point x_i and
another sampled data point x̃, where the values of the features in S and S̄ := D ∖ S are taken from either
x_i or x̃ [14]. Thereby, the sampling of x̃ may be done dependently or independently of x_i. Instantiations
of static MAGEs include PFI [15], where ℰ contains individual features and φ_S measures the increase in
loss. Therein, B_{i,S} includes a single data point constructed from the values of x_i for the features in S̄ and
the values for S from another data point obtained from the dataset using a permutation. SAGE [14] is
also covered by this framework by choosing φ_S as the average over sampled permutations of D, as
described in Section 2.1. Lastly, PDPs [19] are contained in this framework, where φ_S is chosen as the
prediction of a combination of x_i and x̃ ∈ B_{i,S}, where B_{i,S} contains the data points for which the PDP
is visualized.</p>
      <p>Having established a unified view on static MAGEs, we now turn our focus to an online learning
setting as described in Section 2.2. Using the data points observed up to time t, a naive way to compute
MAGEs is via Definition 2 as</p>
      <p>Φ_t(S) := (1/t) ∑_{i=0}^{t−1} φ_S(x_i, y_i, f_t, B_{i,S}) . (1)</p>
      <p>
        Re-computing Eq. 1 at every time step t is an exhaustive operation, since static MAGEs are already
time-consuming when computed once [14]. Moreover, Eq. 1 requires storing the full data stream,
which is typically considered infeasible. As a remedy, practitioners might restrict the computation to
a time window of fixed size [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, reducing the number of observations lowers the quality
of the explanation by increasing its variance. In the following, we propose a framework for an
incremental computation of Φ_t, similar to the incremental update of the model f_t. Our goal is to leverage
the previously calculated MAGE and update this explanation using the currently available data point, i.e.
      </p>
      <p>IncrementalUpdate(Φ_t, x_t, y_t, f_t) ⟶ Φ_{t+1} .</p>
      <p>By introducing a smoothing parameter 0 &lt; α &lt; 1, we define the incremental MAGE.</p>
      <p>Definition 3 (Incremental MAGE). Let 0 &lt; α &lt; 1. We define an incremental MAGE as</p>
      <p>Φ_t(S) := (1 − α) ⋅ Φ_{t−1}(S) + α ⋅ φ_S(x_t, y_t, f_t, B_{t,S}) .</p>
      <p>The incremental MAGE computes a single term of the sum in Eq. 1 at each time step and exploits the
previously computed MAGE values. This drastically reduces the computational complexity: the cumulative
effort up to time t equals that of computing the MAGE once with Eq. 1, yet the incremental MAGE yields
an estimate Φ_t at every time step t without requiring additional computational resources. Incremental
variants of PFI [7] and SAGE [8], as well as PDP [9], have recently been proposed; they can be viewed as
instantiations of incremental MAGEs.</p>
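The recursion in Definition 3 is a simple exponential smoothing of per-step explanation scores; a sketch with a hypothetical drift in the score stream (our own illustrative values) shows how the estimate tracks change:

```python
def incremental_mage_update(phi_prev, phi_step, alpha):
    """Definition 3: Phi_t = (1 - alpha) * Phi_{t-1} + alpha * phi_t."""
    return (1.0 - alpha) * phi_prev + alpha * phi_step

# per-step scores jump from 1.0 to 5.0 halfway through the stream (concept drift);
# the smoothed estimate forgets the old regime at a rate set by alpha
estimate, alpha = 0.0, 0.05
for t in range(1000):
    score = 5.0 if t >= 500 else 1.0
    estimate = incremental_mage_update(estimate, score, alpha)
```

A larger alpha reacts faster to drift but yields a noisier estimate; a smaller alpha averages over more history, mirroring the variance trade-off of fixed time windows discussed above.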
      <p>A major challenge in computing incremental MAGEs is the maintenance of the perturbation dataset
B_{t,S} over time, i.e. efficiently constructing perturbed data points that adhere to the data distribution.
Reservoir sampling [34] has been adapted to efficiently store the data distribution with minimal
resources [7]. Geometric sampling [7] proposes to store a reservoir of fixed length, where data points
are replaced over time and more recent observations have a higher probability of being present in the
reservoir than older observations. This mechanism allows a time-dependent marginal data distribution
to be maintained with limited resources. More advanced techniques maintain conditional
distributions using online decision trees and allow for conditional sampling, as required, for instance, in
conditional SAGE [8]. It has been shown that the two sampling techniques yield substantially different
explanations [35]. Geometric sampling with marginal distributions highlights the structure of the
model, whereas observational approaches via conditional sampling include the data distribution in the
explanation [35].</p>
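The geometric reservoir mechanism can be sketched as follows; the constant swap probability and reservoir size are our own illustrative assumptions rather than the exact procedure of [7].

```python
import random

def geometric_reservoir_update(reservoir, x, swap_prob, rng):
    """With fixed probability, overwrite a uniformly chosen slot with the new
    observation; a stored point's survival time is then geometrically
    distributed, so recent observations dominate the reservoir."""
    if rng.random() >= 1.0 - swap_prob:  # equivalent to swapping with probability swap_prob
        reservoir[rng.randrange(len(reservoir))] = x
    return reservoir

rng = random.Random(42)
reservoir = list(range(10))  # warm start with the first ten stream elements
for t in range(10, 10_000):
    reservoir = geometric_reservoir_update(reservoir, t, swap_prob=0.5, rng=rng)
```

Memory stays fixed at the reservoir size regardless of the stream length, which is what makes the perturbation dataset B_{t,S} maintainable online.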
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>We summarized popular model-agnostic global explanation techniques, such as the FI-based PFI and SAGE,
as well as the FE-based PDP, into the MAGE framework for static learning environments. We then proposed
the incremental MAGE framework to directly compute these explanations for online learning on data
streams. Incremental MAGE allows previous estimates of MAGEs to be updated incrementally at each time
step using minimal resources. We have shown that incremental variants, such as iPFI, iSAGE, and iPDP,
can be subsumed under the incremental MAGE framework. Incremental MAGE also offers opportunities to
expand the range of incremental MAGE techniques. For instance, recently proposed methods
to estimate Shapley interactions [12, 25, 29] may be placed in the incremental MAGE framework to
discover complex interactions beyond isolated FEs. Moreover, with an increasing variety of explanations
at different complexity levels, human-centered presentations and visualizations are important future
work.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We gratefully acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research
Foundation): TRR 318/1 2021 – 438445824.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[6] V. Losing, B. Hammer, H. Wersing, Incremental On-line Learning: A Review and Comparison of State of the Art Algorithms, Neurocomputing 275 (2018) 1261–1274. doi:10.1016/j.neucom.2017.06.084.</p>
      <p>[7] F. Fumagalli, M. Muschalik, E. Hüllermeier, B. Hammer, Incremental Permutation Feature Importance (iPFI): Towards Online Explanations on Data Streams, Machine Learning 112 (2023) 4863–4903. doi:10.1007/s10994-023-06385-y.</p>
      <p>[8] M. Muschalik, F. Fumagalli, B. Hammer, E. Hüllermeier, iSAGE: An Incremental Version of SAGE for Online Explanation on Data Streams, in: Machine Learning and Knowledge Discovery in Databases: Research Track - European Conference, ECML PKDD 2023, Turin, Italy, September 18-22, 2023, Proceedings, Part III, volume 14171 of Lecture Notes in Computer Science, Springer, 2023, pp. 428–445. doi:10.1007/978-3-031-43418-1_26.</p>
      <p>[9] M. Muschalik, F. Fumagalli, R. Jagtani, B. Hammer, E. Hüllermeier, iPDP: On Partial Dependence Plots in Dynamic Modeling Scenarios, in: Explainable Artificial Intelligence - First World Conference, xAI 2023, Lisbon, Portugal, July 26-28, 2023, Proceedings, Part I, volume 1901 of Communications in Computer and Information Science, Springer, 2023, pp. 177–194. doi:10.1007/978-3-031-44064-9_11.</p>
      <p>[10] A. P. Cassidy, F. A. Deviney, Calculating Feature Importance in Data Streams with Concept Drift Using Online Random Forest, in: 2014 IEEE International Conference on Big Data (Big Data 2014), 2014, pp. 23–28. doi:10.1109/BigData.2014.7004352.</p>
      <p>[11] H. M. Gomes, R. F. d. Mello, B. Pfahringer, A. Bifet, Feature Scoring Using Tree-Based Ensembles for Evolving Data Streams, in: 2019 IEEE International Conference on Big Data (Big Data 2019), 2019, pp. 761–769.</p>
      <p>[12] F. Fumagalli, M. Muschalik, P. Kolpaczki, E. Hüllermeier, B. Hammer, SHAP-IQ: Unified Approximation of Any-Order Shapley Interactions, in: Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023), 2023.</p>
      <p>[13] A. Adadi, M. Berrada, Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI), IEEE Access 6 (2018) 52138–52160. doi:10.1109/ACCESS.2018.2870052.</p>
      <p>[14] I. Covert, S. M. Lundberg, S.-I. Lee, Understanding Global Feature Contributions With Additive Importance Measures, in: Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS 2020), 2020, pp. 17212–17223.</p>
      <p>[15] L. Breiman, Random Forests, Machine Learning 45 (2001) 5–32.</p>
      <p>[16] A. Fisher, C. Rudin, F. Dominici, All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously, Journal of Machine Learning Research 20 (2019) 1–81.</p>
      <p>[17] G. Casalicchio, C. Molnar, B. Bischl, Visualizing the Feature Importance for Black Box Models, volume 11051 of Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 655–670. doi:10.1007/978-3-030-10925-7_40.</p>
      <p>[18] L. S. Shapley, A Value for n-Person Games, in: Contributions to the Theory of Games (AM-28), Volume II, Princeton University Press, New Jersey, USA, 1953, pp. 307–318.</p>
      <p>[19] J. H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics 29 (2001) 1189–1232. URL: http://www.jstor.org/stable/2699986.</p>
      <p>[20] J. Herbinger, B. Bischl, G. Casalicchio, REPID: Regional Effect Plots with Implicit Interaction Detection, in: International Conference on Artificial Intelligence and Statistics, AISTATS 2022, volume 151 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 10209–10233. URL: https://proceedings.mlr.press/v151/herbinger22a.html.</p>
      <p>[21] D. W. Apley, J. Zhu, Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (2016). URL: https://api.semanticscholar.org/CorpusID:88522102.</p>
      <p>[22] S. M. Lundberg, G. G. Erion, H. Chen, A. J. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, S. Lee, From Local Explanations to Global Understanding with Explainable AI for Trees, Nature Machine Intelligence 2 (2020).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Clements</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yousefi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Efimov</surname>
          </string-name>
          ,
          <article-title>Sequential Deep Learning for Credit Risk Monitoring with Tabular Financial Data</article-title>
          , CoRR abs/2012.15330 (
          <year>2020</year>
          ). arXiv:2012.15330.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maniu</surname>
          </string-name>
          ,
          <article-title>Data stream analysis: Foundations, major tasks and tools</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          <article-title>e1405</article-title>
          . doi:10.1002/widm.1405.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Davari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Veloso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <article-title>Predictive maintenance based on anomaly detection using deep learning for air production unit in the railway industry</article-title>
          ,
          <source>in: 8th IEEE International Conference on Data Science and Advanced Analytics (DSAA</source>
          <year>2021</year>
          ), IEEE,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . doi:10.1109/DSAA53316.2021.9564181.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Atli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>Online Feature Ranking for Intrusion Detection Systems</article-title>
          , CoRR abs/1803.00530 (
          <year>2018</year>
          ). arXiv:1803.00530.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Muschalik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fumagalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          ,
          <article-title>Agnostic explanation of model change based on feature importance</article-title>
          ,
          <source>Künstliche Intell</source>
          .
          <volume>36</volume>
          (
          <year>2022</year>
          )
          <fpage>211</fpage>
          -
          <lpage>224</lpage>
          . URL: https://doi.org/10.1007/s13218-022-00766-6. doi:10.1007/s13218-022-00766-6.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>