<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tracking Adaptation Time: Metrics for Temporal Distribution Shift</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Iovine</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giacomo Zifer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Della Valle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano, DEIB</institution>
          ,
          <addr-line>Via Giuseppe Ponzio, 34, 20133 Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Evaluating robustness under temporal distribution shift remains an open challenge. Existing metrics quantify the average decline in performance, but fail to capture how models adapt to evolving data. As a result, temporal degradation is often misinterpreted: when accuracy declines, it is unclear whether the model is failing to adapt or whether the data itself has become inherently more challenging to learn. In this work, we propose three complementary metrics to distinguish adaptation from intrinsic difficulty in the data. Together, these metrics provide a dynamic and interpretable view of model behavior under temporal distribution shift. Results show that our metrics uncover adaptation patterns hidden by existing analysis, offering a richer understanding of temporal robustness in evolving environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Temporal Distribution Shift</kwd>
        <kwd>Metrics</kwd>
        <kwd>Concept Drift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        To illustrate this phenomenon, Figure 1 shows a two-dimensional UMAP [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] projection of feature
representations extracted from the Yearbook dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], grouped in 10-year intervals. Initially (1955–1964),
classes remain largely separable, with clear boundaries between male and female portraits. However,
as time progresses, the clusters gradually overlap, indicating that visual features from different years
become increasingly intertwined, a possible consequence of emerging visual diversity, with previously
uncommon facial traits and aesthetics becoming more frequent. This suggests that the drop in model
performance observed in later years does not necessarily arise from poor generalization, but from a
genuine increase in data difficulty: the classes themselves become less separable as the underlying
generative process evolves. In other words, temporal degradation often reflects complex relations
between data and the problem, rather than a failure to adapt.
      </p>
      <p>In this work, we contribute by introducing three complementary metrics to address this limitation.
Together, these metrics isolate the model’s temporal adaptation dynamics from the intrinsic evolution
of data difficulty. We demonstrate the effectiveness of our metrics on datasets from Wild-Time. Our results
suggest that what appears as a persistent ID-OOD gap in prior work often reflects a temporal lag in
adaptation rather than a complete failure to generalize, motivating a reevaluation of how we measure
and compare models under temporal distribution shift. To facilitate adoption and reproducibility, we
release the code to compute the proposed metrics (https://github.com/lorenzoiovine99/Metrics-for-Temporal-Distribution-Shift).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Temporal distribution shift refers to changes in the data distribution that occur as time progresses.
Unlike domain shifts, where different datasets represent distinct environments or acquisition sources,
temporal shifts emerge naturally as the world evolves: sensor properties change, user behavior drifts,
visual or textual styles shift, and contextual conditions evolve. Large-scale benchmarks such as WILDS [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and Wild-Time have made this setting explicit by organizing data along temporal axes. Datasets like
FMoW-Time (a subset of the Functional Map of the World dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and Yearbook expose temporal
evolution through gradual changes in content and context, making them ideal testbeds for studying
model robustness to time-dependent drift. These works have shown that even high-capacity models
exhibit substantial degradation when evaluated on future samples, underscoring the challenge of
temporal generalization. However, most temporal benchmarks measure only static performance snapshots
at different time points. Such evaluations quantify accuracy decay but fail to reveal how performance
changes unfold, whether due to intrinsic task difficulty or to a lack of temporal adaptation. This
limitation motivates the need for metrics that explicitly capture the temporal dynamics of adaptation
rather than static robustness summaries.
      </p>
      <p>
        The broader concept of concept drift originates from the data stream mining literature [
        <xref ref-type="bibr" rid="ref6">6, 7</xref>
        ]. It
denotes any change in the joint distribution P(X, Y) that affects a model’s predictive behavior over time or
across contexts. Such changes can stem from variations in the data-generating process, user populations,
environmental conditions, or even shifts in the underlying concept being learned. In other words, while
temporal distribution shift specifically refers to distributional changes induced by time, concept drift
can occur for many other reasons — temporal, environmental, behavioral, or stochastic. Research on
concept drift has developed adaptive methods capable of detecting and responding to such changes
online. Classical algorithms, like ADWIN [8], monitor error rates or distributional statistics to detect
significant drift, while streaming and Continual Learning approaches such as EWC [9], SI [10], and
A-GEM [11] introduce mechanisms to balance plasticity and stability in non-stationary environments.
Despite this, most evaluations in these settings focus on aggregate measures, such as average accuracy,
forgetting, or backward transfer, that summarize long-term performance rather than capturing
fine-grained temporal adaptation behavior. Consequently, even in the context of streaming and continual
learning, the community lacks metrics that quantify how effectively a model adapts to evolving data
distributions over time.
      </p>
      <p>Evaluating models under non-stationarity has traditionally relied on metrics designed for discrete
tasks or static shifts. In Continual Learning [12], forward and backward transfer measures were
introduced to quantify interference and knowledge retention across tasks [13]. While effective for
task-structured evaluation, these metrics are ill-suited for continuous temporal evolution, where boundaries
between distributions are not well-defined. Temporal benchmarks, on the other hand, often report
aggregated accuracy or ID–OOD comparisons across years. Although informative, these metrics cannot
distinguish between degradation due to intrinsic difficulty (e.g., increased noise or reduced separability)
and degradation caused by insufficient adaptation. Both scenarios can produce similar accuracy curves
but reflect fundamentally different model behaviors. Similar limitations have been noted in domain
generalization and robustness research [14], where performance gaps may conflate distribution hardness
with model capacity. In temporal settings, this ambiguity obscures our understanding of whether a
model fails because the data becomes harder, or because it cannot adjust quickly enough to a changing
environment. Hence, there remains a need for evaluation metrics that explicitly capture temporal
adaptation dynamics, describing how rapidly, stably, and persistently a model adapts as data distributions
evolve. Addressing this gap is the central focus of this work.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Statement</title>
      <p>We consider a supervised learning setting where data distributions evolve over time. Let
D_t = {(x_i, y_i)}_{i=1}^{n_t},
(1)
denote the dataset available at time t, sampled from an underlying joint distribution P_t(X, Y). In
the presence of temporal distribution shift, these distributions can change as t increases. This setting
reflects many real-world scenarios in which models are trained on past data and then deployed in the
future: the world changes, and the model gradually becomes misaligned with the data it encounters.</p>
      <p>Existing benchmarks, such as Wild-Time, summarize this phenomenon through In-Distribution
(ID) and Out-of-Distribution (OOD) accuracy: performance on data from the training period versus
future periods. While intuitive, this pair of metrics provides only a static picture of what is inherently
a dynamic process. A gap between ID and OOD accuracy can be observed even when the model
adapts correctly, simply because future data is intrinsically harder (less separable, noisier, or more
ambiguous). Conversely, two models with identical ID–OOD gaps might exhibit entirely different
temporal behaviors: one may degrade immediately and never recover, while another might adapt
rapidly to new conditions. Therefore, these static measures conflate intrinsic difficulty with adaptation
capability, making it impossible to determine why performance changes through time.</p>
      <p>What is missing is a principled way to evaluate how models adapt, not only how much their
performance drops. We argue that measuring robustness under temporal shift should involve characterizing
the temporal adaptation dynamics. Understanding these dynamics is essential for comparing algorithms
designed for continual, adaptive, or streaming learning, and for identifying the true limitations of
current approaches.</p>
      <p>This leads to our central research question: How can we quantitatively distinguish between
degradation caused by intrinsic data difficulty and degradation caused by insufficient temporal adaptation in
machine learning models? Addressing this question requires moving beyond static ID/OOD metrics
toward a framework that explicitly captures adaptation over time.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Metrics</title>
      <p>To characterize how models evolve over time, we introduce three complementary post-hoc metrics that
quantify distinct aspects of temporal adaptation. They capture how long a model remains valid after
training, when degradation becomes evident, and how effectively it adapts to future data. All metrics
are computed after the full temporal sequence has been observed, allowing a retrospective evaluation of
how well a model trained at time t would have matched the performance of an "oracle" model retrained
on the data distribution of each target time τ. Formally, the oracle represents the maximum achievable
performance for a given period under ideal adaptation. By normalizing a model’s temporal performance
against this reference, our metrics isolate the degradation due to lack of adaptation from that caused by
increasing data difficulty.</p>
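      <p>As a concrete illustration, a minimal Python sketch of how the temporal accuracy matrix underlying all of the proposed metrics could be assembled is shown below; the helpers train and evaluate and the per-period list datasets are hypothetical placeholders, not part of the released implementation.</p>
      <preformat>
import numpy as np

def accuracy_matrix(datasets, train, evaluate):
    """Build acc[t, tau]: accuracy of a model trained on datasets[t] and
    evaluated on datasets[tau]; the diagonal holds the oracle accuracies."""
    T = len(datasets)
    acc = np.zeros((T, T))
    for t in range(T):
        model = train(datasets[t])  # model frozen at training time t
        for tau in range(T):
            acc[t, tau] = evaluate(model, datasets[tau])
    return acc
      </preformat>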
      <sec id="sec-4-1">
        <title>4.1. Temporal Transfer Ratio (TTR)</title>
        <p>Let A(t, τ) denote the accuracy obtained when a model is trained on data from time t and evaluated on
data from time τ. We define the Temporal Transfer Ratio (TTR) as:
TTR(t, τ) = A(t, τ) / A(τ, τ).
(2)
The denominator A(τ, τ) represents the "oracle" performance level, the accuracy that would be
achievable if the model were trained directly on the target time. To ensure interpretability, we clip TTR(t, τ) to 1
in cases where a model trained at time t outperforms the oracle trained at τ, enforcing TTR(t, τ) ∈ [0, 1].
It quantifies how much of the realistic maximum at time τ is preserved by a model trained in the past.
A value close to 1 indicates strong temporal transfer, while lower values indicate growing misalignment
between the model and the evolving data distribution. This function serves as the foundation for all
three proposed metrics.
        </p>
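        <p>The following sketch computes Equation (2) from such an accuracy matrix; it is illustrative rather than the released code, and assumes acc[t, tau] holds A(t, τ).</p>
        <preformat>
import numpy as np

def temporal_transfer_ratio(acc: np.ndarray) -&gt; np.ndarray:
    """TTR(t, tau) = A(t, tau) / A(tau, tau), clipped to [0, 1]."""
    oracle = np.diag(acc)               # A(tau, tau): oracle accuracy per period
    ttr = acc / oracle[np.newaxis, :]   # normalize each column by its oracle
    return np.clip(ttr, 0.0, 1.0)       # clip models that beat the oracle
        </preformat>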
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Stability Horizon (SH)</title>
        <p>
          The Stability Horizon (SH) captures how long a model trained at time t remains reliable before its
performance drops below an acceptable level. Formally, for a chosen tolerance threshold θ ∈ [0, 1]:
SH_θ(t) = max{h ≥ 0 : TTR(t, t + h) ≥ θ}.
(3)
Intuitively, SH_θ(t) measures the number of future time steps for which a model trained at t maintains at
least a fraction θ of the oracle accuracy. This provides a direct estimate of the temporal validity window
of the model, the time span during which it can be deployed without retraining. The threshold θ can be
adapted to the application: for instance, θ = 0.9 corresponds to a 10% acceptable loss in accuracy.
        </p>
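        <p>A possible implementation of Equation (3), as an illustrative sketch under one reading (the last horizon before the first violation of the threshold): ttr_row[h] is assumed to hold TTR(t, t + h), and the horizon is truncated to the observable window when the threshold is never crossed, matching the sentinel convention of Section 5.3.</p>
        <preformat>
import numpy as np

def stability_horizon(ttr_row: np.ndarray, theta: float = 0.9) -&gt; int:
    """Largest h before TTR(t, t + h) first drops below theta."""
    below = np.flatnonzero(ttr_row &lt; theta)  # horizons violating the threshold
    if below.size == 0:
        return len(ttr_row) - 1              # never violated: truncate to window
    return max(int(below[0]) - 1, 0)         # last horizon before the first drop
        </preformat>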
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Drift Horizon (DH)</title>
        <p>While the Stability Horizon identifies when performance falls below a target level, the Drift Horizon
(DH) detects when that drop becomes statistically significant. We define a cumulative drift statistic as:
S_0 = 0,
S_h = max(0, S_{h−1} + (|TTR(t, t + h) − TTR(t, t)| − ε)),
(4)
where ε is a small tolerance parameter that filters out random fluctuations. The DH is then the smallest
temporal distance h for which the cumulative deviation exceeds a significance threshold δ:
DH(t) = min{h ∈ [1, H] : S_h &gt; δ},
where H is the maximum evaluation horizon (number of future steps available). Intuitively, DH(t)
answers the question: if I stop updating the model after a given time step, after how long will performance
degradation become statistically evident? This metric captures the onset of observable drift and provides
a temporal notion of performance stability that is less dependent on arbitrary accuracy thresholds.</p>
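        <p>The cumulative statistic of Equation (4) reduces to a short CUSUM-style loop; the sketch below follows the same conventions as above, with the tolerance ε = 0.05 chosen purely for illustration (the paper does not fix its value here) and H + 1 returned as a sentinel when no drift is detected.</p>
        <preformat>
def drift_horizon(ttr_row, eps: float = 0.05, delta: float = 0.15) -&gt; int:
    """Smallest h in [1, H] with cumulative deviation S_h &gt; delta."""
    H = len(ttr_row) - 1
    s = 0.0
    for h in range(1, H + 1):
        # Eq. (4): accumulate deviation from TTR(t, t) beyond the tolerance eps
        s = max(0.0, s + abs(ttr_row[h] - ttr_row[0]) - eps)
        if s &gt; delta:
            return h
    return H + 1  # sentinel: threshold never crossed within the window
        </preformat>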
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Temporal Adaptation Score (TAS)</title>
        <p>The Temporal Adaptation Score (TAS) measures how effectively a model trained at time t generalizes to
future periods relative to their intrinsic difficulty. For each training time t, we compute the average
OOD accuracy over the next n time steps and normalize it by the average oracle accuracy on those
same time steps:
Acc_model(t) = (1/n) Σ_{i=1}^{n} A(t, t + i),
(5)
Acc_oracle(t) = (1/n) Σ_{i=1}^{n} A(t + i, t + i),
(6)
TAS(t) = Acc_model(t) / Acc_oracle(t).</p>
        <p>As TAS is derived from averaged TTR values, it inherits the same clipping strategy. A TAS close to
1 indicates that the model achieves nearly the same performance as an oracle retrained for all future
time steps, suggesting strong adaptation across time. Lower TAS values signal limited adaptability or
growing temporal misalignment. Unlike the ID–OOD gap, TAS captures relative adaptation: how well a
model follows the evolving oracle rather than its absolute performance drop.</p>
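        <p>Equations (5) and (6) amount to two averages over the accuracy matrix; the sketch below mirrors them, clipping the ratio as TTR does.</p>
        <preformat>
import numpy as np

def temporal_adaptation_score(acc: np.ndarray, t: int, n: int) -&gt; float:
    """TAS(t): mean future accuracy normalized by mean oracle accuracy."""
    future = np.arange(t + 1, t + n + 1)
    acc_model = acc[t, future].mean()         # Eq. (5): average of A(t, t + i)
    acc_oracle = np.diag(acc)[future].mean()  # Eq. (6): average of A(t + i, t + i)
    return min(acc_model / acc_oracle, 1.0)   # inherit the TTR clipping
        </preformat>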
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Interpretation and Comparative Insights</title>
        <p>Each proposed metric provides a complementary view of temporal adaptation:
• TAS measures how well a model follows the oracle, i.e., its relative adaptation to future data.
• SH measures for how long the model remains above an acceptable performance threshold, i.e., its
temporal validity.
• DH measures after how many time steps performance degradation becomes statistically evident,
i.e., its sensitivity to drift.</p>
        <p>When comparing models, these metrics enable nuanced interpretations. Two models with similar
average OOD accuracy may exhibit very different TAS values, revealing which model better maintains
relative performance to the oracle across time. Conversely, models with similar TAS may differ in
SH or DH, distinguishing those that remain stable for longer periods from those that degrade more
abruptly. Together, TAS, SH, and DH form a coherent metric suite that separates adaptation from
difficulty, providing interpretable temporal diagnostics that static metrics such as ID-OOD accuracy
cannot capture.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup and Results</title>
      <p>We empirically evaluate the proposed metrics to assess their ability to disentangle temporal adaptation
from intrinsic data difficulty. Our evaluation is designed to answer two main questions: (i) whether the
metrics provide additional insight beyond standard ID–OOD accuracy comparisons, and (ii) whether
they consistently characterize temporal adaptation dynamics across datasets with different types of
temporal shifts.</p>
      <p>To this end, we conduct experiments on two benchmarks from the Wild-Time suite, Yearbook
and FMoW-Time, which exhibit complementary temporal behaviors. We compare multiple learning
paradigms and analyze the proposed metrics both quantitatively and qualitatively, highlighting how
they capture stability, drift onset, and relative adaptation over time.</p>
      <sec id="sec-5-1">
        <title>5.1. Datasets</title>
        <p>We evaluate the proposed metrics on two temporal benchmarks from the Wild-Time suite: Yearbook
and Functional Map of the World-Time (FMoW-Time). Both datasets are explicitly organized along a
temporal axis, allowing a controlled analysis of model behavior under real temporal distribution shifts.
Yearbook consists of grayscale portraits of American high school students from 1930 to 2013. Each
year defines a distinct data subset, and the task is gender classification. Temporal shifts reflect gradual
changes in photographic style, lighting, and fashion trends rather than abrupt domain changes.
FMoW-Time contains satellite images of land-use scenes captured over multiple years and geographic regions.
The task is to classify the functional category of each scene (e.g., airport, hospital, residential area).
Temporal shifts arise from environmental changes, sensor updates, and evolving land use, producing a
rich and realistic testbed for long-term adaptation analysis.</p>
        <p>While Yearbook exhibits smooth, visually interpretable drifts dominated by the evolution of P(X),
FMoW features complex multimodal shifts combining temporal and spatial factors. Together, they
provide complementary perspectives on temporal robustness: gradual aesthetic drift versus heterogeneous
real-world dynamics.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Models and Evaluated Metrics</title>
        <p>We evaluate several models representative of different learning paradigms: Empirical Risk Minimization
(ERM), the baseline model trained independently on each year’s data; CORAL, a domain alignment
method minimizing feature-level covariance shift; Elastic Weight Consolidation (EWC), a continual
learning regularization approach that penalizes deviation from previous parameters; Synaptic
Intelligence (SI), an alternative regularization-based continual learner; Fine-Tuning (FT), which sequentially
updates the model using new data without explicit drift control; and, finally, a pipeline that combines
Momentum Contrastive Learning techniques [15] with Streaming Machine Learning [16] models,
designed to improve temporal adaptability [17].</p>
        <p>For FMoW, we computed ID and OOD accuracy, TAS, SH, and DH. For Yearbook, we focused on ID,
OOD, and TAS. All experiments were re-run from scratch using the official Wild-Time implementation.
The combination of the metrics enables both global and dynamic analysis: ID-OOD scores reveal
absolute robustness, whereas TAS, SH, and DH expose temporal adaptation dynamics.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results on FMoW-Time</title>
        <p>To illustrate how the proposed metrics are derived, Figure 2 shows the temporal accuracy matrices
A(t, τ) and their normalized counterparts TTR(t, τ) for the Fine-Tuning (FT) model on FMoW-Time. Each
row corresponds to a model trained at time t and evaluated across future years τ &gt; t. The normalized
matrix TTR(t, τ) captures how well the model maintains its relative performance over time, with darker
cells indicating a smaller deviation from the oracle.</p>
        <p>In our experiments with this dataset, we configured the parameters of the proposed metrics to reflect
realistic temporal dynamics in the dataset. Specifically, we set the tolerance threshold for the Stability
Horizon to θ = 0.6, and the deviation threshold for the Drift Horizon to δ = 0.15. These values were
chosen to balance sensitivity and robustness, ensuring that minor fluctuations in yearly performance
do not dominate the evaluation while still capturing substantial temporal drifts.</p>
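        <p>With the sketches from Section 4, per-year horizons of the kind reported below could be obtained roughly as follows; the matrix acc here is synthetic, purely to make the snippet self-contained, and is not FMoW-Time data.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
T = 10
# Synthetic accuracy matrix: performance decays with temporal distance.
gap = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
acc = np.clip(0.7 - 0.03 * gap + 0.02 * rng.standard_normal((T, T)), 0.0, 1.0)

ttr = temporal_transfer_ratio(acc)
sh = [stability_horizon(ttr[t, t:], theta=0.6) for t in range(T)]
dh = [drift_horizon(ttr[t, t:], delta=0.15) for t in range(T)]
print(sh, dh)
        </preformat>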
        <p>Figure 2: (a) A(t, τ): Accuracy; (b) TTR(t, τ): TTR.</p>
        <p>For Fine-Tuning, the computed Stability and Drift Horizons are reported per year below, showing
how long the model remains reliable after training and when performance degradation becomes evident:
SH = [4, 6, 5, 5, 5, 4, 4, 6, 6, 6] (average 5.1 years)
DH = [2, 7, 7, 7, 6, 7, 7, 7, 7, 7] (average 6.4 years)
We use 7 as a sentinel value that indicates that the threshold was not crossed within the observable
window (H = 6). In these cases, the value is truncated to the maximum observable window, meaning that
the model maintained acceptable performance or did not show statistically significant drift throughout
the evaluation period. This behavior highlights temporal persistence rather than an absolute duration
beyond seven years. Overall, this pattern indicates that FT maintains acceptable performance for
approximately five years on average, while significant drift is only detected after about six years,
suggesting relatively stable behavior over time.
        </p>
        <p>Table 1 compares the six evaluated models, MoCo (with SML), Synaptic Intelligence (SI), CORAL,
Empirical Risk Minimization (ERM), A-GEM, and Fine-Tuning (FT), across all proposed metrics.
Traditional ID and OOD accuracies show overall consistency with the Wild-Time benchmark, while TAS, SH,
and DH provide deeper insights into temporal adaptation. For instance, although MoCo+SML achieves
a higher ID accuracy, its TAS and SH values are significantly lower, revealing poor retention of relative
performance over time. In contrast, methods such as SI and FT exhibit stronger temporal robustness,
maintaining higher TAS and SH despite similar absolute accuracy levels. Interestingly, the Drift Horizon
remains consistently around six years for most models, indicating a common temporal limit beyond
which adaptation becomes ineffective.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Results on Yearbook</title>
        <p>The Yearbook dataset provides a long temporal span, making it a suitable benchmark to analyze gradual
visual and distributional changes over decades. Table 2 reports the average and worst-case results for
the evaluated models across standard metrics (ID, OOD) and the proposed ones (TAS). As observed
in prior work, all models achieve strong ID and OOD accuracy, yet the proposed TAS metric provides
additional insight into how well each model adapts to the temporal evolution of the data.</p>
        <p>To better understand this distinction, Figure 3 analyzes the temporal behavior of MoCo+SML. The
plot shows the yearly evolution of In-Distribution (ID) accuracy, Out-of-Distribution (OOD) accuracy,
and the corresponding Temporal Adaptation Score (TAS). A clear example of how TAS complements
ID–OOD analysis appears around 1970: while ID and OOD accuracies exhibit a sharp gap, approximately
from 99% to 77%, the TAS value at that point remains close to 90%. This indicates that, although the
apparent drop suggests poor generalization, the model still retains 90% of its oracle performance. In
other words, the degradation is not purely due to lack of temporal adaptation, but rather to an intrinsic
increase in data difficulty, as also reflected by the decline in ID accuracy observed in subsequent years.
TAS thus allows distinguishing between a genuine adaptation failure and a natural evolution of the task
itself, revealing that models may remain relatively well-aligned with the underlying temporal dynamics
even when absolute performance decreases.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>This work introduced a framework for evaluating temporal adaptation in machine learning models.
We argued that existing metrics—such as In-Distribution (ID) and Out-of-Distribution (OOD)
accuracy—confuse intrinsic data difficulty with a model’s actual capacity to adapt over time. To address
this ambiguity, we proposed three complementary post-hoc metrics: the Temporal Adaptation Score
(TAS), the Stability Horizon (SH), and the Drift Horizon (DH). Together, they provide a dynamic and
interpretable view of model robustness under temporal distribution shift.</p>
      <p>Experiments on the Yearbook and FMoW benchmarks demonstrated that these metrics uncover
adaptation patterns that remain hidden under static ID–OOD comparisons. In particular, TAS captures
relative adaptation rather than absolute accuracy, distinguishing genuine temporal misalignment from
intrinsic degradation in data quality. The Stability Horizon quantifies how long a model remains reliable,
while the Drift Horizon identifies when degradation becomes statistically significant. Our results
suggest that what has often been interpreted as a persistent failure to generalize over time may, in
many cases, reflect a slower but still effective adaptation process.</p>
      <p>Several extensions are worth pursuing. First, while our metrics are post-hoc and require full temporal
supervision, future research could investigate online approximations that estimate adaptation dynamics
during deployment. Second, integrating these metrics with temporal model selection or active retraining
strategies could enable automatic detection of retraining points, reducing computational costs while
maintaining accuracy. Third, exploring their applicability beyond classification (e.g., to regression,
forecasting, or multimodal temporal learning) would help assess their generality across domains. Finally,
a deeper theoretical analysis of the relationship between temporal shift magnitude, adaptation speed,
and stability horizons could lead to more formal guarantees of temporal robustness.</p>
      <p>Overall, our findings highlight that evaluating adaptation under temporal distribution shifts requires
going beyond static accuracy metrics. By explicitly modeling how performance evolves over time, we
can move toward a more faithful understanding of model behavior in truly dynamic environments.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT (by OpenAI) for grammar and spelling
checks. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Wild-time: A benchmark of in-the-wild distribution shift over time</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>10309</fpage>
          -
          <lpage>10324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Melville</surname>
          </string-name>
          ,
          <article-title>UMAP: Uniform manifold approximation and projection for dimension reduction</article-title>
          , arXiv preprint arXiv:1802.03426 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ginosar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rakelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sachs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efros</surname>
          </string-name>
          ,
          <article-title>A century of portraits: A visual historical record of American high school yearbooks</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision Workshops</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Marklund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balsubramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gao</surname>
          </string-name>
          , et al.,
          <article-title>Wilds: A benchmark of in-the-wild distribution shifts</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5637</fpage>
          -
          <lpage>5664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Christie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fendley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <article-title>Functional map of the world</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on CVPR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Žliobaitė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pechenizkiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouchachia</surname>
          </string-name>
          ,
          <article-title>A survey on concept drift adaptation</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>46</volume>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] I. Žliobaitė, M. Pechenizkiy, J. Gama, An overview of concept drift applications, Big Data Analysis: New Algorithms for a New Society (2015) 91–114.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Bifet, R. Gavaldà, Learning from time-changing data with adaptive windowing, in: Proceedings of the 2007 SIAM International Conference on Data Mining, SIAM, 2007, pp. 443–448.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences 114 (2017) 3521–3526.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, in: International Conference on Machine Learning, PMLR, 2017, pp. 3987–3995.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] A. Chaudhry, M. Ranzato, M. Rohrbach, M. Elhoseiny, Efficient lifelong learning with A-GEM, arXiv preprint arXiv:1812.00420 (2018).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y. Hsu, Y. Liu, Z. Kira, Re-evaluating continual learning scenarios: A categorization and case for strong baselines, CoRR abs/1810.12488 (2018). arXiv:1810.12488.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, L. Schmidt, Measuring robustness to natural distribution shifts in image classification, Advances in Neural Information Processing Systems 33 (2020) 18583–18599.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: CVPR, 2020.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] H. M. Gomes, J. Read, A. Bifet, J. P. Barddal, J. Gama, Machine learning for streaming data: state of the art, challenges, and opportunities, ACM SIGKDD Explorations Newsletter 21 (2019) 6–22.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Iovine, G. Zifer, A. Proia, E. Della Valle, Towards streaming land use classification of images with temporal distribution shifts, ESANN Proceedings (2025).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>