<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Beyond Static Importance: Quantifying Stability and Distribution Drift</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcin Ostrowski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olgierd Hryniewicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Systems Research Institute Polish Academy of Sciences</institution>
          ,
          <addr-line>Newelska 6, 01-147 Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Feature importance plays a fundamental role in machine learning and serves as a cornerstone of explainable machine learning. In temporal settings, where data accumulates sequentially, the relevance of features may evolve, introducing challenges for interpretation. While temporal variation in feature importance is increasingly relevant for applications such as clinical monitoring and time-series prediction, it remains underexplored in the literature. In this paper, we propose a novel methodology for quantifying the temporal stability of local feature attributions. Our approach combines an exponentially weighted moving average (EWMA) model with performance metrics. The goal is to compute a feature-wise stability metric that reflects how consistently a feature contributes to model predictions over time. To complement this, we introduce a distributional drift score based on the Wasserstein distance, capturing shifts in the underlying feature distributions. Together, these two signals form a diagnostic framework that distinguishes between shifts due to data dynamics and those arising from model behavior. We evaluate our approach on a simulated dataset reflecting a mental health monitoring scenario, as well as a publicly available benchmark time-series dataset. In both cases, the proposed metrics uncover nuanced patterns of feature behavior, enabling practitioners to identify features that are not only important but also temporally reliable. Our results demonstrate that assessing both the stability of explanations and the drift of features provides a more robust foundation for trustworthy model interpretation in dynamic environments.</p>
      </abstract>
      <kwd-group>
<kwd>Explainable AI</kwd>
        <kwd>Feature Importance</kwd>
        <kwd>Time Series Analysis</kwd>
        <kwd>Shapley Values</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Although methods of explainable AI (XAI) have been advancing significantly in recent years [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], substantial challenges persist, particularly in the context of temporal data streams. While explanation
methods such as SHAP or LIME offer insights into model behavior at a given point in time, they
often neglect how explanations evolve as models undergo retraining or are exposed to new data. This
oversight is of critical importance, particularly in the healthcare domain, where temporal consistency
of model reasoning is imperative for establishing trust and ensuring usability [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        Recent efforts have begun to explore explanation dynamics in time-dependent settings [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ], but
much of the work either focuses on specific time series models or offers descriptive analyses without
actionable insights. In this work, we address a particular question of how stable feature attributions are
over time and what instabilities might signal.
      </p>
      <p>
        Feature attribution stability is interpreted as the temporal consistency of a feature’s importance, as
measured by an explanation method. Capturing these fluctuations is expected to yield new insights
into model robustness, data drift, or redundant feature use. For instance, a feature whose importance
fluctuates erratically over time may signify a model that is excessively sensitive to noise or evolving
distributions. While the notion of stability has been examined in static contexts, such as in feature
selection [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], few methods exist for assessing and interpreting explanation stability over evolving data.
      </p>
      <p>
        To address this gap, we propose a novel approach that quantifies fluctuations in feature importance
over time, accounting for the performance and changes of predictions as more data becomes available,
thereby improving the interpretability of temporal machine learning models. Moreover, this approach
takes into account the drift of features to enhance the understanding of changes in the features
themselves over time. We validate our approach using a diverse set of datasets, including a case study
in mental health monitoring and a benchmark dataset from the UCI Machine Learning Repository [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Our results demonstrate that the proposed method provides valuable insights into the dynamics of
feature importance over time, considering both the temporal nature of the data and model performance.
The proposed metric captures the magnitude and direction of fluctuations in feature values, which,
when incorporated into a broader framework, can offer a more comprehensive understanding of feature
behavior and improve feature selection.
      </p>
      <p>The structure of the paper is as follows. In the next section, we present a brief description of related
works. This is followed by a presentation of the proposed approach in Section 3. The experimental
results using simulated data and benchmark datasets are presented in Section 4. Finally, Section 5
outlines the conclusions and discusses directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Comprehension of the role that input features play in machine learning models’ predictions constitutes
a fundamental principle of explainable AI [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These approaches can be broadly categorised into two
distinct categories: model-specific and model-agnostic. The efficacy of model-specific methods
is contingent upon the utilization of internal model structures for the determination of importance.
To illustrate, decision trees and ensemble models, including random forests and gradient boosting
machines, are known to provide feature importance based on criteria such as Gini gain or information
gain [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In a similar vein, linear models employ a ranking system that prioritizes features based
on the magnitude of their coefficients [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, the capacity for cross-model comparisons or
generalisations is limited.
      </p>
      <p>
        On the contrary, model-agnostic methods are characterised by their ability to offer greater flexibility.
Permutation importance is a method of assessing a feature’s relevance. It does so by evaluating the
impact on model performance when feature values are randomly shuffled [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Despite its extensive
utilization, the method is susceptible to collinearity, a factor that frequently results in an underestimation
of the significance of correlated features [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Shapley values [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] are predicated on the principles of
cooperative game theory. These values function to distribute prediction contributions among features
in an equitable manner. The SHAP framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] has been instrumental in facilitating the calculation
of Shapley values for intricate models, thereby establishing itself as a prevalent instrument for both
local and global explanations.
      </p>
      <p>
        While these approaches have been the focus of extensive research for static datasets, their application
to temporal or sequential data remains limited. The majority of explanation methods treat time steps
independently, neglecting to consider how the importance of features evolves or fluctuates over time
[
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Recent works have begun to address this issue. For instance, Rojat et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] propose
temporally-aware feature attribution for time series forecasting models. Arsenault et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] investigated the impact
of temporal windowing on explanation variance. The extant literature indicates that explanation stability
is influenced by a number of factors, including data drift, modeling choices, retraining frequency, and
attribution methods. However, there is a paucity of methods that attempt to quantify these dynamics
comprehensively, especially in model-agnostic settings.
      </p>
      <p>
        Concurrently, the evaluation of explanation methods has emerged as a prominent research subject
in the field [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. A variety of metrics have been proposed to evaluate the quality of explanations,
including stability, fidelity, consistency, and sensitivity. However, these metrics frequently operate
within a static framework and cannot effectively address the temporal characteristics inherent in
data or the continual refinement of models over extended periods. Moreover, as noted by recent
critiques [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ], explanation reliability can be compromised by issues such as feature redundancy,
model non-identifiability, and attribution instability. For example, in the presence of collinear features,
shifts in attribution do not necessarily reflect genuine model drift but may simply reflect equivalent
representations – a core challenge our framework aims to accommodate.
      </p>
      <p>In this work, we quantify the evolution of feature attributions through two complementary metrics:
a stability metric and a distribution drift score. Unlike static assessments, our approach decomposes
explanation dynamics into interpretable signals that reflect both model behavior and data characteristics
over time. The stability metric captures deviations of feature importance from a smoothed historical
baseline, weighted by model performance, thereby accounting for fluctuations in predictive reliability.
In parallel, the distribution drift score quantifies changes in the underlying feature distributions using
the Wasserstein distance.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Assessing Variations in Temporal Datasets</title>
      <p>In this section, we will present two complementary metrics developed to capture the temporal dynamics
inherent in the feature importances of multivariate time-series data. These metrics include a stability
metric based on Exponentially Weighted Moving Average (EWMA) and a drift score based on Wasserstein
distance. Collectively, these measures facilitate comprehensive monitoring of the consistency and
distributional evolution of explanatory signals over time.</p>
      <p>Let X ∈ ℝ^{p×T} denote a sample of multivariate time series with p features and T time points.
Let also x_t := X_{·,t} ∈ ℝ^p be the vector of all feature observations at a time point 1 ≤ t ≤ T,
where T &gt; 1. Feature importance scores are computed based on the realizations x_t of the time series.
Let φ_j(t) ∈ ℝ denote the importance of the feature j ∈ {1, 2, . . . , p} at the time point t, derived
from a model-specific feature attribution method, with the full sequence over time represented by
Φ_j = {φ_j(1), φ_j(2), . . . , φ_j(T)}.</p>
      <sec id="sec-3-1">
        <title>3.1. Stability Metric</title>
        <p>We propose a methodology for deriving a feature-wise stability score, a metric that quantifies the
discrepancy between a feature’s perceived importance and its exponentially weighted historical trend.</p>
        <p>The stability metric is then defined as:</p>
        <p>S_j = 1 − (1 / (T − 1)) · Σ_{t=2}^{T} w(t) · (φ_j(t) − φ̂_j(t))² / (|φ_j(t)| + |φ̂_j(t)| + ε),   (1)</p>
        <p>where:
• w(t) ≥ 0 is a weight function assigned at time point t = 2, . . . , T, incorporating the performance of
the model at time t. If w(t) = 1, ∀t, no weighting is applied.
• φ̂_j(t) is the EWMA of past feature importance scores, recursively defined as
φ̂_j(t) = λ φ̂_j(t − 1) + (1 − λ) φ_j(t),   (2)
where λ ∈ [0, 1] is the smoothing factor controlling the influence of past values. The process is
initialized as φ̂_j(1) = φ_j(1).
• ε &gt; 0 is a small constant preventing division by zero.</p>
        <p>The proposed approach ensures that importance deviations are scaled in relation to their respective
magnitudes, thus preventing excessive penalization of low-importance features. A high S_j indicates
that the importance of feature j exhibits a close correspondence with its historical trend, thereby
suggesting temporal reliability.</p>
        <p>The parameter λ governs the rate at which past observations decay. When λ is set to a larger value,
it places more emphasis on older observations, yielding a smoother baseline. Conversely, when λ is set
to a smaller value, it gives greater weight to the most recent data points, making the baseline more
responsive to change.</p>
        <p>The well-known EWMA is a time series analysis technique that assigns exponentially decreasing
weights to past observations. This property renders the EWMA particularly useful for detecting trends
and anomalies in noisy data. In contrast to the simple moving average, which assigns equal weight to
all past observations, the EWMA assigns greater weight to more recent data points. This characteristic
renders the EWMA more responsive to changes in the time series.</p>
        <p>The squared difference between the feature importance value and the EWMA of past feature
importance scores aims to capture the variability in feature importance over time. This formulation
gives more weight to recent time points and models with higher predictive performance, making
it particularly responsive to evolving patterns. Additionally, the weight function adjusts for model
performance, assigning greater importance to better-performing models while diminishing the influence
of models with lower accuracy.</p>
        <p>The stability metric is generalizable and can be applied with any feature importance method as
defined in Eq. (1). In our experiments, we selected Shapley values, which provide an attribution-based
measure of feature importance.</p>
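        <p>For illustration, the stability computation in Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch under our reading of the formulas; the function and argument names are illustrative, not taken from any released implementation.</p>

```python
import numpy as np

def ewma(values, lam=0.7):
    """EWMA baseline of Eq. (2): s_hat(t) = lam * s_hat(t-1) + (1 - lam) * s(t),
    initialized with the first observation."""
    baseline = np.empty(len(values))
    baseline[0] = values[0]
    for t in range(1, len(values)):
        baseline[t] = lam * baseline[t - 1] + (1.0 - lam) * values[t]
    return baseline

def stability(importances, weights=None, lam=0.7, eps=1e-8):
    """Feature-wise stability score of Eq. (1).

    importances: importance scores phi_j(t) for t = 1..T
    weights:     optional weights w(t), e.g. per-window model AUC;
                 None reproduces the unweighted case w(t) = 1
    """
    phi = np.asarray(importances, dtype=float)
    T = len(phi)
    w = np.ones(T) if weights is None else np.asarray(weights, dtype=float)
    phi_hat = ewma(phi, lam)
    # relative squared deviation from the smoothed historical trend, t = 2..T
    dev = (phi[1:] - phi_hat[1:]) ** 2 / (np.abs(phi[1:]) + np.abs(phi_hat[1:]) + eps)
    return 1.0 - float(np.sum(w[1:] * dev)) / (T - 1)
```

        <p>A perfectly constant importance sequence yields a score of exactly 1, while erratic sign changes drive the score down, mirroring the intended reading of S_j.</p>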
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Measuring the Distributional Drift</title>
        <p>While the stability metric is capable of capturing fluctuations in importance, it does not have the capacity
to detect whether the feature itself is undergoing a distributional shift. To address this limitation, we
propose a drift score, computed as the Wasserstein-1 distance between the empirical distribution
of a feature at the current time point and a reference distribution estimated from a kernel density
estimate over a rolling window of past values, capturing recent trends in the feature’s distribution. Both
distributions are approximated using kernel density estimation.</p>
        <p>Let P_j(t) denote the empirical distribution of the feature j at a time point t, and let P̂_j(t) denote a
smoothed estimate of its past distribution. Then the drift score is defined as:</p>
        <p>Drift_j = (1 / (T − 1)) · Σ_{t=2}^{T} W_1(P_j(t), P̂_j(t)).   (3)</p>
        <p>
          In this context, W_1(·, ·) denotes the first-order Wasserstein distance originally introduced in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] in
the context of optimal transport problems. In practice, the estimation of P̂_j(t) is achieved through the
utilization of kernel density estimation over a designated sliding window or by computing a smoothed
histogram over the designated time period. This signal captures the evolution of the underlying values
of a feature, independent of its importance to the model.
        </p>
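        <p>As an empirical illustration, the drift score of Eq. (3) can be approximated directly from samples with SciPy's Wasserstein distance, pooling all past windows in place of the KDE-smoothed reference described above. The function name and the pooling choice are simplifications of our own, not the exact estimator used in the experiments.</p>

```python
import numpy as np
from scipy.stats import wasserstein_distance

def drift_score(windows):
    """Average W1 distance between each window's empirical distribution and
    the pooled distribution of all preceding windows (cf. Eq. (3)).

    windows: list of 1-D arrays, one per time point/segment, holding the
             observed values of a single feature
    """
    past = np.asarray(windows[0], dtype=float)
    distances = []
    for t in range(1, len(windows)):
        current = np.asarray(windows[t], dtype=float)
        distances.append(wasserstein_distance(current, past))
        past = np.concatenate([past, current])  # grow the reference sample
    return float(np.mean(distances))
```

        <p>A stationary feature yields a score near zero, while a shift in the feature's mean contributes roughly the size of that shift for every affected window.</p>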
        <p>Employing the stability metric S_j and the drift score Drift_j together facilitates diagnosing the origin of
temporal variations. For example, when high S_j and low Drift_j are observed, the feature
exhibits both stable importance and a stationary distribution over time. This suggests that the feature is
robust and consistently relevant for the model across time, making it a strong candidate for long-term
interpretability and reliable decision-making. On the contrary, when low S_j and high Drift_j are
reported, the feature's importance fluctuates over time in conjunction with significant distributional
changes. This indicates that the model's reliance on the feature is adapting to underlying shifts in the
data-generating process, potentially reflecting concept drift or context-sensitive importance. Another
scenario would be to observe low S_j and low Drift_j. The feature's distribution remains largely
stable, yet its attributed importance varies. This scenario may signal model instability, such as overfitting
or sensitivity to transient patterns, since fluctuations in explanatory power are not explained by input
variation. Alternatively, high S_j and high Drift_j would mean that despite the feature undergoing
distributional changes, its importance remains stable. This indicates model robustness to input drift,
suggesting that the feature maintains its predictive role under varying conditions, a desirable trait in
non-stationary environments.</p>
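        <p>The four cases above can be summarized as a simple decision rule. The thresholds below are purely illustrative placeholders; as discussed in Section 4, the metrics are interpreted relative to one another rather than against fixed cut-offs.</p>

```python
def diagnose(stability_score, drift, s_thresh=0.9, d_thresh=0.05):
    """Map a (stability, drift) pair to one of the four diagnostic cases.
    s_thresh and d_thresh are hypothetical cut-offs for illustration only."""
    stable = stability_score >= s_thresh
    drifting = drift >= d_thresh
    if stable and not drifting:
        return "robust"          # stable importance, stationary distribution
    if not stable and drifting:
        return "adapting"        # importance tracks distributional change
    if not stable and not drifting:
        return "model-unstable"  # importance varies without input variation
    return "drift-robust"        # stable importance despite input drift
```

        <p>For instance, a feature with stability 0.96 and drift 0.26 (the values reported for F0 in Section 4) would fall into the drift-robust case under these illustrative thresholds.</p>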
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Numerical Results</title>
      <p>
        We begin by validating the proposed stability and drift metrics using a simulated dataset derived from
real-world clinical observations, followed by experiments on a publicly available benchmark dataset
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In both cases, the input consists of multivariate time series with associated labels, where each
time point corresponds to a set of feature observations and an outcome. The goal of the experiments is
to assess how feature importance evolves over time, and to evaluate whether the proposed metrics —
stability and distributional drift — can meaningfully capture changes in model attribution dynamics
under temporal shifts in data. To this end, we apply a rolling-window training procedure using XGBoost
classifiers and compute both feature importance and drift signals across time. Full implementation
details and data are available in the publicly accessible repository [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>4.1. Simulation study</title>
        <p>Our evaluation is motivated by a real-world clinical problem concerning the remote sensor-based
monitoring of bipolar disorder patients. In this particular applied problem, considering the feature’s
importance for a specific time point does not provide comprehensive insights into the temporal influence
of the feature. If a feature's importance changes significantly over time, ranging from the most
important to the least important without a clear pattern, inference based on such a feature might not
be as reliable as assumed. Our approach aims to identify features that demonstrate temporal stability in
both importance and distributional behavior, thus enhancing interpretability and trust in model-based
inference.</p>
        <p>
          The real-world dataset comprises acoustic and psychiatric data collected from a patient diagnosed
with bipolar disorder. For further details, see [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], which describes the protocol of this clinical study. Acoustic
features describe the manner of speaking with physical descriptors such as shimmer, jitter, and energy
extracted from patients’ speech. Additionally, each feature vector is associated with the patient’s
mental state at the time of the recording. Bipolar disorder is a serious mental illness characterized
by fluctuations from depressive through euthymic to manic states. Previous research confirms that
acoustic features extracted from speech serve as valid markers for assessing the severity of manic and
depressive symptoms [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. In this work, we aim to assess the stability of feature importance over
time.
        </p>
        <p>To reduce complexity and enable stable classification under limited data, the original multiclass
labeling was binarized by mapping states into two categories: euthymia, which is considered the healthy
state, and non-euthymia.</p>
        <p>
          Considering the limited size of the real data and its complex nature, we simulated a larger, controlled
dataset that permits more rigorous and repeatable validation of the proposed metrics. The resulting
dataset is publicly available for further investigation [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Fig. 1 shows the simulated changes in mental
state over time. We selected four voice characteristics – jitter, shimmer, energy, and voice pitch –
and used their means and standard deviations from the original dataset to model two distributions:
a base distribution representing prior knowledge of mental state classification and a patient-specific
distribution incorporating slight variations to reflect individual voice characteristics. Fig. 2 illustrates
distributions for two exemplary simulated variables. More figures and details can be found in the
publicly available repository [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>We simulated observations from the patient-specific distribution over a five-year period. Each observation
corresponds to a specific day, and the dataset was divided into two categories of periods: ground truth
periods, which spanned nine days around a psychiatrist appointment where the patient’s mental state
was known, and inter-appointment periods, representing the intervals between psychiatrist visits where
the patient’s mental state was inferred. Each inter-appointment period consisted of 64 days, resulting
in five meetings per year.</p>
        <p>Then, we simulated 180 observations from the base distribution, with 60 observations from each
state. This dataset was used as the prior knowledge dataset, while the five-year dataset included 1 825
observations with varying label distributions.</p>
        <p>
          As the next step, we trained XGBoost classifiers [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] from the XGBoost package in a sequential
manner. The first model was trained on the first ground truth period (9 observations) and tested on
the subsequent inter-appointment period (64 observations). The second model was trained on the first
two ground truth periods combined with the first inter-appointment period (a total of 82 observations),
while the test set consisted of the second inter-appointment period (64 observations). Subsequent models
followed the same pattern, each incorporating more prior data.
        </p>
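        <p>The expanding-window scheme described above can be sketched as index generation over the day grid. The function below is our illustrative reconstruction, with the period lengths taken from the text (9-day ground truth periods, 64-day inter-appointment periods); it is not the authors' released code.</p>

```python
def expanding_window_splits(n_models, gt_len=9, inter_len=64):
    """Generate (train, test) index ranges for the sequential training scheme:
    model k trains on all data up to and including ground truth period k+1
    and is tested on the following inter-appointment period."""
    splits = []
    train_end = gt_len                     # first ground truth period
    for _ in range(n_models):
        test_start, test_end = train_end, train_end + inter_len
        splits.append((range(0, train_end), range(test_start, test_end)))
        train_end = test_end + gt_len      # absorb test window plus next GT period
    return splits
```

        <p>Consistent with the text, the first split trains on 9 observations and tests on 64, while the second trains on 82 observations (two ground truth periods plus one inter-appointment period).</p>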
        <p>For each window, model performance was evaluated using the area under the receiver operating
characteristic (ROC) curve (AUC), or the F1 score if ROC AUC was unavailable, from the Scikit-learn
package. Distribution drift was calculated for each window using Kernel Density Estimation. The drift
between the current window and a smoothed estimate of the past distribution was then computed using
the Wasserstein distance. Shapley values, computed using the SHAP package’s Explainer interface,
were calculated for the prediction of the class representing the non-euthymic state. To calculate the
stability metric, a value of λ = 0.7 was used in all experiments (see Eq. (2)).</p>
        <p>For the simulated medical dataset, Fig. 3 presents Shapley values over time for five-year simulations,
along with stability metrics incorporating model weighting by performance. Most stability metric
values exceed 0.875, suggesting either minimal fluctuations in Shapley values or that the corresponding
features have little influence on the model’s predictions. However, a closer inspection reveals that jitter,
despite having the highest absolute Shapley values, exhibits significant variability in both magnitude
and direction over time. This instability is captured in the stability metric, where jitter consistently
shows the lowest stability values across both simulation periods.</p>
        <p>Fig. 4 illustrates temporal changes in Shapley values for jitter and energy, alongside model
performance at each time point. Notably, energy shows considerable fluctuations in its Shapley values,
but these gradually stabilize at low positive values, with occasional negative spikes. This leads to an
approximately 0.9 stability metric. On the other hand, jitter experiences fluctuations in both magnitude
and direction before stabilizing at positive values. However, due to more significant fluctuations and a
later stabilization, jitter achieves a lower stability metric of around 0.85, before eventually reaching
0.9. The performance-based weighting smooths some Shapley value changes, particularly between
time points 2, 3, and 4 where energy’s Shapley values show noticeable changes. Notably, feature F0
maintains a high stability metric, reflecting its consistently low Shapley values relative to the other
features throughout the time period.</p>
        <p>Table 1 presents the stability metric and distribution drift values for the features considered in the
simulated dataset. Since no universal thresholds exist for interpreting these metrics, we analyze them
relative to one another. The cell color intensity reflects these internal comparisons and helps surface
outliers and patterns.</p>
        <p>We observe that jitter, shimmer and energy exhibit relatively low distribution drift (all below 0.03)
and moderate-to-high stability scores, ranging from 0.89 to 0.96. These features maintain consistent
importance over time and stem from relatively stable data distributions – indicating strong potential as
interpretable and dependable predictors in a temporal context.</p>
        <p>In contrast, the F0 feature stands out with the highest distribution drift (0.26), substantially exceeding
that of the other features. Despite this, it shares the highest stability score (0.96), indicating that although
the underlying distribution of this feature changes substantially, its relevance to model predictions
remains stable – suggesting model robustness or invariance to feature drift.</p>
        <p>A particularly instructive case is jitter. Although it frequently appears as one of the most influential
features (based on raw Shapley values), its lower stability metric indicates substantial temporal variability.
This highlights the risk of over-interpreting raw importance scores without considering temporal
consistency. Features like energy and shimmer, with slightly lower but more stable contributions, may
offer more reliable insight for longitudinal interpretation or downstream decision-making.</p>
        <p>This example illustrates the core utility of the proposed approach: it enables a nuanced decomposition
of explanation quality, capturing both temporal stability and drift sensitivity. Such decompositions are
essential when local explanations are used to guide clinical insight or intervention strategy. Traditional
feature importance analysis cannot offer this level of granularity – particularly when dealing with
sequential or drifting data.</p>
        <p>A current limitation of the analysis is the omission of long-term average Shapley trends, which could
offer additional insights into persistent feature relevance. This omission could result in incomplete
inferences about the features, as the evolution of the overall importance of features across the entire
time period is not taken into account. The incorporation of this information could facilitate a more
comprehensive understanding of feature behavior and stability.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiments for Benchmark Dataset</title>
        <p>
          To assess the generalizability of our proposed metrics beyond the medical domain, we extend our
experimental framework to a publicly available time-series dataset, namely the Rocket League dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
This dataset consisted of 7 189 observations with 16 explanatory features and was divided into 44
contiguous gameplay segments, each corresponding to a distinct phase within a Rocket League match.
The dataset was designed for binary classification.
        </p>
        <p>The temporal division strategy and the model architecture (XGBoost) utilized in the medical
simulation were employed in this investigation. The performance-weighted stability metric and distribution
drift computations were also adopted.</p>
        <p>Fig. 5 illustrates temporal changes in mean Shapley values and the corresponding stability metrics for
selected features in the Rocket League dataset. We highlight three features that reflect diverse dynamic
behaviors over time.</p>
        <p>First, consider BallAcceleration. While its Shapley values oscillate in both magnitude and
sign during the early intervals, the feature eventually stabilizes with increasingly consistent positive
contributions. Despite initial fluctuations, the combination of late-stage consistency and strong positive
attribution boosts its overall stability metric to around 0.92, placing it in the relatively high-stability
category.</p>
        <p>In contrast, accelerate shows highly variable Shapley values with no consistent trend – swinging
from strong positive to negative contributions across different intervals. This behavior results in a low
stability metric of approximately 0.69, even though its distribution drift is minimal. Such a combination
– low stability with low drift – may suggest overfitting or excessive context dependence, limiting the
interpretability and reliability of this feature.</p>
        <p>Finally, PlayerSpeed provides a third example, showing persistent relevance across the entire
timeline. While its SHAP values exhibit some fluctuation in magnitude, their sign and overall importance
remain stable. This yields a relatively high stability score (above 0.93) despite the feature undergoing
substantial distributional shift, as reflected in one of the highest drift values in the dataset. This resilience
implies that the model reliably incorporates PlayerSpeed despite changes in its input distribution – a
sign of potential robustness or invariance.</p>
        <p>Figure 5: (a) Changes over time in mean Shapley values for features in the Rocket League dataset and (b) values of
the stability metric for the Rocket League dataset, with weighting by performance in Eq. (1) and λ = 0.7 in Eq. (2).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This study introduces a novel framework for evaluating the temporal stability of feature importance,
while jointly accounting for distributional drift in the underlying data. We operationalize this idea
through a performance-weighted stability metric and a drift score based on the Wasserstein distance,
applied to local explanations derived from Shapley values. By incorporating exponentially weighted
moving averages and performance-based weights, the proposed stability metric captures deviations in
feature importance over time in a calibrated manner. The drift score complements this by quantifying
changes in the empirical distribution of each feature, thereby disentangling model instability from data
drift.</p>
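        <p>The idea behind the performance-weighted stability metric can be sketched in a few lines. The code below is our minimal interpretation, not the paper's exact formulas: the smoothing symbol alpha, the use of per-interval performance as averaging weights, and the 1/(1 + deviation) normalization are illustrative assumptions.</p>

```python
import numpy as np

def stability(shap_series, perf_weights, alpha=0.7):
    """Illustrative stability score for a single feature.

    shap_series: mean Shapley value of the feature per time interval.
    perf_weights: model performance per interval (higher = more trusted).
    alpha: EWMA smoothing parameter (assumed symbol).
    """
    shap_series = np.asarray(shap_series, dtype=float)
    w = np.asarray(perf_weights, dtype=float)

    # EWMA baseline of the attribution signal.
    ewma = np.empty_like(shap_series)
    ewma[0] = shap_series[0]
    for t in range(1, len(shap_series)):
        ewma[t] = alpha * shap_series[t] + (1 - alpha) * ewma[t - 1]

    # Performance-weighted mean absolute deviation from the smoothed path.
    dev = np.average(np.abs(shap_series - ewma), weights=w)

    # Map to (0, 1]: small deviations yield scores near 1.
    return 1.0 / (1.0 + dev)

# A consistently attributed feature scores higher than a sign-flipping one.
steady = stability([0.30, 0.31, 0.29, 0.30], [0.9, 0.9, 0.8, 0.9])
erratic = stability([0.40, -0.35, 0.25, -0.30], [0.9, 0.9, 0.8, 0.9])
```

Pairing this score with the Wasserstein drift score then distinguishes instability caused by the model from change driven by the data.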
      <p>Our empirical validation on both a simulated clinical dataset and a publicly available benchmark
time series demonstrates the utility of the framework in diagnosing and interpreting model behavior
over time. The results affirm that reliable model interpretation cannot rely solely on feature importance
magnitudes; instead, it should incorporate both temporal consistency and feature dynamics.</p>
      <p>While the proposed metrics ofer valuable diagnostic insights, their full utility emerges when
integrated with complementary analytical signals. For instance, temporal autocorrelation or variance
decomposition could provide further context about the persistence and volatility of individual features.
Additionally, a formal framework for aggregating multiple explanation signals – stability, drift, rank
variance, and uncertainty – could further improve understanding of the modeling process under
temporal shifts and its outputs. Future work will focus on refining the weighting schemes in the
stability metric, extending the approach to feature attribution methods beyond Shapley values and
examining their impact on stability evaluation, evaluating model robustness under adversarial or
synthetic distribution shifts, and applying the framework to other domains. We also plan to release an
open-source toolkit implementing these metrics to facilitate adoption and further experimentation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The project "ExplainMe: Explainable Artificial Intelligence for Monitoring Acoustic Features extracted
from Speech" (FENG.02.02-IP.05-0302/23) is carried out within the First Team programme of the
Foundation for Polish Science co-financed by the European Union under the European Funds for Smart
Economy 2021-2027 (FENG).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 for grammar and spelling checking.
After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, D. Pedreschi, F. Giannotti, A survey of methods for explaining black box models (2018). doi:10.48550/ARXIV.1802.01933. arXiv:1802.01933.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] P. Gohel, P. Singh, M. Mohanty, Explainable AI: current status and future directions (2021). doi:10.48550/ARXIV.2107.07045. arXiv:2107.07045.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] C. Molnar, G. Casalicchio, B. Bischl, Interpretable machine learning – a brief history, state-of-the-art and challenges, in: I. Koprinska et al. (eds.), ECML PKDD 2020 Workshops, Communications in Computer and Information Science, vol. 1323, Springer, Cham (2020) 417–431. doi:10.1007/978-3-030-65965-3_28. arXiv:2010.09337.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] T. Rojat, R. Puget, D. Filliat, J. Del Ser, R. Gelin, N. Díaz-Rodríguez, Explainable artificial intelligence (XAI) on time series data: A survey (2021). doi:10.48550/ARXIV.2104.00950. arXiv:2104.00950.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] P.-D. Arsenault, S. Wang, J.-M. Patenande, A survey of explainable artificial intelligence (XAI) in financial time series forecasting (2024). doi:10.48550/ARXIV.2407.15909. arXiv:2407.15909.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. T. Nguyen, T. Le Nguyen, G. Ifrim, Robust explainer recommendation for time series classification, Data Mining and Knowledge Discovery 38 (2024) 3372–3413. doi:10.1007/s10618-024-01045-8.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. C. Barbieri, B. I. Grisci, M. Dorn, Analysis and comparison of feature selection methods towards performance and stability, Expert Systems with Applications 249 (2024) 123667. doi:10.1016/j.eswa.2024.123667.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] R. Mathonat, Rocket League Skillshots, UCI Machine Learning Repository, 2020. doi:10.24432/C5S035.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Z. C. Lipton, The mythos of model interpretability (2016). doi:10.48550/ARXIV.1606.03490. arXiv:1606.03490.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. H. Friedman, Greedy function approximation: A gradient boosting machine, The Annals of Statistics 29 (2001). doi:10.1214/aos/1013203451.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer New York, 2009. doi:10.1007/978-0-387-84858-7.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Fisher, C. Rudin, F. Dominici, All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously, Journal of Machine Learning Research 20 (177) (2019) 1–81. doi:10.48550/ARXIV.1801.01489. arXiv:1801.01489.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C. Strobl, A.-L. Boulesteix, A. Zeileis, T. Hothorn, Bias in random forest variable importance measures: Illustrations, sources and a solution, BMC Bioinformatics 8 (2007). doi:10.1186/1471-2105-8-25.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] L. S. Shapley, 17. A Value for n-Person Games, Princeton University Press, 1953, pp. 307–318. doi:10.1515/9781400881970-018.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions (2017). doi:10.48550/ARXIV.1705.07874. arXiv:1705.07874.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Villani, J. Lockhart, D. Magazzeni, Feature importance for time series data: Improving KernelSHAP (2022). doi:10.48550/ARXIV.2210.02176. arXiv:2210.02176.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] K. K. Leung, C. Rooke, J. Smith, S. Zuberi, M. Volkovs, Temporal dependencies in feature importance for time series predictions (2021). doi:10.48550/ARXIV.2107.14317. arXiv:2107.14317.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] M. Pawlicki, A. Pawlicka, F. Uccello, S. Szelest, S. D'Antonio, R. Kozik, M. Choraś, Evaluating the necessity of the multiple metrics for assessing explainable AI: A critical examination, Neurocomputing 602 (2024) 128282. doi:10.1016/j.neucom.2024.128282.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] E. Mariotti, J. M. Alonso-Moral, A. Gatt, Measuring model understandability by means of Shapley additive explanations, in: 2022 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2022, pp. 1–8. doi:10.1109/fuzz-ieee55066.2022.9882773.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of interpretability methods (2018). doi:10.48550/ARXIV.1806.08049. arXiv:1806.08049.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] D. Slack, S. Hilgard, E. Jia, S. Singh, H. Lakkaraju, Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods, in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES '20, ACM, 2020, pp. 180–186. doi:10.1145/3375627.3375830.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] L. V. Kantorovich, Mathematical methods of organizing and planning production, Management Science 6 (1960) 366–422. doi:10.1287/mnsc.6.4.366.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Ostrowski, Stability metric GitHub repository, 2025. URL: https://github.com/Zylaz/StabilityMetric.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] M. Sokół-Szawłowska, O. Kamińska, M. Sochacka, Moodmon: novel optimization of bipolar disorder monitoring through patient-driven voice parameter submission and AI technology, Advances in Psychiatry and Neurology/Postępy Psychiatrii i Neurologii 33 (2024) 230–240. doi:10.5114/ppn.2024.147100.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] K. Kaczmarek-Majer, M. Dominiak, A. Z. Antosik, O. Hryniewicz, O. Kamińska, K. Opara, J. Owsiński, W. Radziszewska, M. Sochacka, L. Święcicki, Acoustic features from speech as markers of depressive and manic symptoms in bipolar disorder: A prospective study, Acta Psychiatrica Scandinavica 151 (2024) 358–374. doi:10.1111/acps.13735.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system (2016) 785–794. doi:10.1145/2939672.2939785. arXiv:1603.02754.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>