<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tracking Adaptation Time: Metrics for Temporal Distribution Shift</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Iovine</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giacomo Zifer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Della Valle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano, DEIB</institution>
          ,
          <addr-line>Via Giuseppe Ponzio, 34, 20133 Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Evaluating robustness under temporal distribution shift remains an open challenge. Existing metrics quantify the average decline in performance, but fail to capture how models adapt to evolving data. As a result, temporal degradation is often misinterpreted: when accuracy declines, it is unclear whether the model is failing to adapt or whether the data itself has become inherently more challenging to learn. In this work, we propose three complementary metrics to distinguish adaptation from intrinsic difficulty in the data. Together, these metrics provide a dynamic and interpretable view of model behavior under temporal distribution shift. Results show that our metrics uncover adaptation patterns hidden by existing analysis, offering a richer understanding of temporal robustness in evolving environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Temporal Distribution Shift</kwd>
        <kwd>Metrics</kwd>
        <kwd>Concept Drift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        To illustrate this phenomenon, Figure 1 shows a two-dimensional UMAP [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] projection of feature
representations extracted from the Yearbook dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], grouped in 10-year intervals. Initially (1955–1964),
classes remain largely separable, with clear boundaries between male and female portraits. However,
as time progresses, the clusters gradually overlap, indicating that visual features from different years
become increasingly intertwined, a possible consequence of emerging visual diversity, with previously
uncommon facial traits and aesthetics becoming more frequent. This suggests that the drop in model
performance observed in later years does not necessarily arise from poor generalization, but from a
genuine increase in data difficulty: the classes themselves become less separable as the underlying
generative process evolves. In other words, temporal degradation often reflects complex relations
between data and the problem, rather than a failure to adapt.
      </p>
      <p>In this work, we contribute by introducing three complementary metrics to address this limitation.
Together, these metrics isolate the model’s temporal adaptation dynamics from the intrinsic evolution
of data difficulty. We demonstrate the effectiveness of our metrics on datasets from Wild-Time. Our results
suggest that what appears as a persistent ID-OOD gap in prior work often reflects a temporal lag in
adaptation rather than a complete failure to generalize, motivating a reevaluation of how we measure
and compare models under temporal distribution shift. To facilitate adoption and reproducibility, we
release the code to compute the proposed metrics (https://github.com/lorenzoiovine99/Metrics-for-Temporal-Distribution-Shift).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Temporal distribution shift refers to changes in the data distribution that occur as time progresses.
Unlike domain shifts, where different datasets represent distinct environments or acquisition sources,
temporal shifts emerge naturally as the world evolves: sensor properties change, user behavior drifts,
visual or textual styles shift, and contextual conditions evolve. Large-scale benchmarks such as WILDS [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and Wild-Time have made this setting explicit by organizing data along temporal axes. Datasets like
FMoW-Time (a subset of the Functional Map of the World dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and Yearbook expose temporal
evolution through gradual changes in content and context, making them ideal testbeds for studying
model robustness to time-dependent drift. These works have shown that even high-capacity models
exhibit substantial degradation when evaluated on future samples, underscoring the challenge of
temporal generalization. However, most temporal benchmarks measure only static performance snapshots
at different time points. Such evaluations quantify accuracy decay but fail to reveal how performance
changes unfold, whether due to intrinsic task difficulty or to a lack of temporal adaptation. This
limitation motivates the need for metrics that explicitly capture the temporal dynamics of adaptation
rather than static robustness summaries.
      </p>
      <p>
        The broader concept of concept drift originates from the data stream mining literature [
        <xref ref-type="bibr" rid="ref6">6, 7</xref>
        ]. It
denotes any change in the joint distribution P(X, Y) that affects a model’s predictive behavior over time or
across contexts. Such changes can stem from variations in the data-generating process, user populations,
environmental conditions, or even shifts in the underlying concept being learned. In other words, while
temporal distribution shift specifically refers to distributional changes induced by time, concept drift
can occur for many other reasons — temporal, environmental, behavioral, or stochastic. Research on
concept drift has developed adaptive methods capable of detecting and responding to such changes
online. Classical algorithms, like ADWIN [8], monitor error rates or distributional statistics to detect
significant drift, while streaming and Continual Learning approaches such as EWC [9], SI [10], and
A-GEM [11] introduce mechanisms to balance plasticity and stability in non-stationary environments.
Despite this, most evaluations in these settings focus on aggregate measures, such as average accuracy,
forgetting, or backward transfer, that summarize long-term performance rather than capturing
fine-grained temporal adaptation behavior. Consequently, even in the context of streaming and continual
learning, the community lacks metrics that quantify how effectively a model adapts to evolving data
distributions over time.
      </p>
      <p>Evaluating models under non-stationarity has traditionally relied on metrics designed for discrete
tasks or static shifts. In Continual Learning [12], forward and backward transfer measures were
introduced to quantify interference and knowledge retention across tasks [13]. While effective for
task-structured evaluation, these metrics are ill-suited for continuous temporal evolution, where boundaries
between distributions are not well-defined. Temporal benchmarks, on the other hand, often report
aggregated accuracy or ID–OOD comparisons across years. Although informative, these metrics cannot
distinguish between degradation due to intrinsic difficulty (e.g., increased noise or reduced separability)
and degradation caused by insufficient adaptation. Both scenarios can produce similar accuracy curves
but reflect fundamentally different model behaviors. Similar limitations have been noted in domain
generalization and robustness research [14], where performance gaps may conflate distribution hardness
with model capacity. In temporal settings, this ambiguity obscures our understanding of whether a
model fails because the data becomes harder, or because it cannot adjust quickly enough to a changing
environment. Hence, there remains a need for evaluation metrics that explicitly capture temporal
adaptation dynamics, describing how rapidly, stably, and persistently a model adapts as data distributions
evolve. Addressing this gap is the central focus of this work.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Statement</title>
      <p>We consider a supervised learning setting where data distributions evolve over time. Let
D_t = {(x_i, y_i)}_{i=1}^{n_t},
(1)
denote the dataset available at time t, sampled from an underlying joint distribution P_t(X, Y). In
the presence of temporal distribution shift, these distributions can change as t increases. This setting
reflects many real-world scenarios in which models are trained on past data and then deployed in the
future: the world changes, and the model gradually becomes misaligned with the data it encounters.</p>
      <p>Existing benchmarks, such as Wild-Time, summarize this phenomenon through In-Distribution
(ID) and Out-of-Distribution (OOD) accuracy: performance on data from the training period versus
future periods. While intuitive, this pair of metrics provides only a static picture of what is inherently
a dynamic process. A gap between ID and OOD accuracy can be observed even when the model
adapts correctly, simply because future data is intrinsically harder (less separable, noisier, or more
ambiguous). Conversely, two models with identical ID–OOD gaps might exhibit entirely different
temporal behaviors: one may degrade immediately and never recover, while another might adapt
rapidly to new conditions. Therefore, these static measures conflate intrinsic difficulty with adaptation
capability, making it impossible to determine why performance changes through time.</p>
      <p>What is missing is a principled way to evaluate how models adapt, not only how much their
performance drops. We argue that measuring robustness under temporal shift should involve characterizing
the temporal adaptation dynamics. Understanding these dynamics is essential for comparing algorithms
designed for continual, adaptive, or streaming learning, and for identifying the true limitations of
current approaches.</p>
      <p>This leads to our central research question: How can we quantitatively distinguish between
degradation caused by intrinsic data difficulty and degradation caused by insufficient temporal adaptation in
machine learning models? Addressing this question requires moving beyond static ID/OOD metrics
toward a framework that explicitly captures adaptation over time.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Metrics</title>
      <p>To characterize how models evolve over time, we introduce three complementary post-hoc metrics that
quantify distinct aspects of temporal adaptation. They capture how long a model remains valid after
training, when degradation becomes evident, and how effectively it adapts to future data. All metrics
are computed after the full temporal sequence has been observed, allowing a retrospective evaluation of
how well a model trained at time t would have matched the performance of an "oracle" model retrained
on the data distribution of each target time τ. Formally, the oracle represents the maximum achievable
performance for a given period under ideal adaptation. By normalizing a model’s temporal performance
against this reference, our metrics isolate the degradation due to lack of adaptation from that caused by
increasing data difficulty.</p>
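      <p>As a concrete illustration, a minimal Python sketch of how the temporal accuracy matrix underlying all of the proposed metrics could be assembled is shown below; the helpers train and evaluate and the per-period list datasets are hypothetical placeholders, not part of the released implementation.</p>
      <preformat>
import numpy as np

def accuracy_matrix(datasets, train, evaluate):
    """Build acc[t, tau]: accuracy of a model trained on datasets[t] and
    evaluated on datasets[tau]; the diagonal holds the oracle accuracies."""
    T = len(datasets)
    acc = np.zeros((T, T))
    for t in range(T):
        model = train(datasets[t])  # model frozen at training time t
        for tau in range(T):
            acc[t, tau] = evaluate(model, datasets[tau])
    return acc
      </preformat>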
      <sec id="sec-4-1">
        <title>4.1. Temporal Transfer Ratio (TTR)</title>
        <p>Let A(t, τ) denote the accuracy obtained when a model is trained on data from time t and evaluated on
data from time τ. We define the Temporal Transfer Ratio (TTR) as:
TTR(t, τ) = A(t, τ) / A(τ, τ).
(2)
The denominator A(τ, τ) represents the "oracle" performance level, the accuracy that would be
achievable if the model were trained directly on the target time. To ensure interpretability, we clip TTR(t, τ) to 1
in cases where a model trained at time t outperforms the oracle trained at τ, enforcing TTR(t, τ) ∈ [0, 1].
It quantifies how much of the realistic maximum at time τ is preserved by a model trained in the past.
A value close to 1 indicates strong temporal transfer, while lower values indicate growing misalignment
between the model and the evolving data distribution. This function serves as the foundation for all
three proposed metrics.
        </p>
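        <p>The following sketch computes Equation (2) from such an accuracy matrix; it is illustrative rather than the released code, and assumes acc[t, tau] holds A(t, τ).</p>
        <preformat>
import numpy as np

def temporal_transfer_ratio(acc: np.ndarray) -&gt; np.ndarray:
    """TTR(t, tau) = A(t, tau) / A(tau, tau), clipped to [0, 1]."""
    oracle = np.diag(acc)               # A(tau, tau): oracle accuracy per period
    ttr = acc / oracle[np.newaxis, :]   # normalize each column by its oracle
    return np.clip(ttr, 0.0, 1.0)       # clip models that beat the oracle
        </preformat>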
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Stability Horizon (SH)</title>
        <p>
          The Stability Horizon (SH) captures how long a model trained at time t remains reliable before its
performance drops below an acceptable level. Formally, for a chosen tolerance threshold θ ∈ [0, 1]:
SH_θ(t) = max{h ≥ 0 : TTR(t, t + h) ≥ θ}.
(3)
Intuitively, SH_θ(t) measures the number of future time steps for which a model trained at t maintains at
least a fraction θ of the oracle accuracy. This provides a direct estimate of the temporal validity window
of the model, the time span during which it can be deployed without retraining. The threshold θ can be
adapted to the application: for instance, θ = 0.9 corresponds to a 10% acceptable loss in accuracy.
        </p>
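        <p>A possible implementation of Equation (3), as an illustrative sketch under one reading (the last horizon before the first violation of the threshold): ttr_row[h] is assumed to hold TTR(t, t + h), and the horizon is truncated to the observable window when the threshold is never crossed, matching the sentinel convention of Section 5.3.</p>
        <preformat>
import numpy as np

def stability_horizon(ttr_row: np.ndarray, theta: float = 0.9) -&gt; int:
    """Largest h before TTR(t, t + h) first drops below theta."""
    below = np.flatnonzero(ttr_row &lt; theta)  # horizons violating the threshold
    if below.size == 0:
        return len(ttr_row) - 1              # never violated: truncate to window
    return max(int(below[0]) - 1, 0)         # last horizon before the first drop
        </preformat>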
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Drift Horizon (DH)</title>
        <p>While the Stability Horizon identifies when performance falls below a target level, the Drift Horizon
(DH) detects when that drop becomes statistically significant. We define a cumulative drift statistic as:
S_0 = 0,
S_h = max(0, S_{h−1} + (|TTR(t, t + h) − TTR(t, t)| − ε)),
(4)
where ε is a small tolerance parameter that filters out random fluctuations. The DH is then the smallest
temporal distance h for which the cumulative deviation exceeds a significance threshold δ:
DH(t) = min{h ∈ [1, H] : S_h &gt; δ},
where H is the maximum evaluation horizon (number of future steps available). Intuitively, DH(t)
answers the question: if I stop updating the model after a given time step, after how long will performance
degradation become statistically evident? This metric captures the onset of observable drift and provides
a temporal notion of performance stability that is less dependent on arbitrary accuracy thresholds.</p>
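        <p>The cumulative statistic of Equation (4) reduces to a short CUSUM-style loop; the sketch below follows the same conventions as above, with the tolerance ε = 0.05 chosen purely for illustration (the paper does not fix its value here) and H + 1 returned as a sentinel when no drift is detected.</p>
        <preformat>
def drift_horizon(ttr_row, eps: float = 0.05, delta: float = 0.15) -&gt; int:
    """Smallest h in [1, H] with cumulative deviation S_h &gt; delta."""
    H = len(ttr_row) - 1
    s = 0.0
    for h in range(1, H + 1):
        # Eq. (4): accumulate deviation from TTR(t, t) beyond the tolerance eps
        s = max(0.0, s + abs(ttr_row[h] - ttr_row[0]) - eps)
        if s &gt; delta:
            return h
    return H + 1  # sentinel: threshold never crossed within the window
        </preformat>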
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Temporal Adaptation Score (TAS)</title>
        <p>The Temporal Adaptation Score (TAS) measures how effectively a model trained at time t generalizes to
future periods relative to their intrinsic difficulty. For each training time t, we compute the average
OOD accuracy over the next n time steps and normalize it by the average oracle accuracy on those
same time steps:
Acc_model(t) = (1/n) Σ_{i=1}^{n} A(t, t + i),
(5)
Acc_oracle(t) = (1/n) Σ_{i=1}^{n} A(t + i, t + i),
(6)
TAS(t) = Acc_model(t) / Acc_oracle(t).</p>
        <p>As TAS is derived from averaged TTR values, it inherits the same clipping strategy. A TAS close to
1 indicates that the model achieves nearly the same performance as an oracle retrained for all future
time steps, suggesting strong adaptation across time. Lower TAS values signal limited adaptability or
growing temporal misalignment. Unlike the ID–OOD gap, TAS captures relative adaptation: how well a
model follows the evolving oracle rather than its absolute performance drop.</p>
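        <p>Equations (5) and (6) amount to two averages over the accuracy matrix; the sketch below mirrors them, clipping the ratio as TTR does.</p>
        <preformat>
import numpy as np

def temporal_adaptation_score(acc: np.ndarray, t: int, n: int) -&gt; float:
    """TAS(t): mean future accuracy normalized by mean oracle accuracy."""
    future = np.arange(t + 1, t + n + 1)
    acc_model = acc[t, future].mean()         # Eq. (5): average of A(t, t + i)
    acc_oracle = np.diag(acc)[future].mean()  # Eq. (6): average of A(t + i, t + i)
    return min(acc_model / acc_oracle, 1.0)   # inherit the TTR clipping
        </preformat>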
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Interpretation and Comparative Insights</title>
        <p>Each proposed metric provides a complementary view of temporal adaptation:
• TAS measures how well a model follows the oracle, i.e., its relative adaptation to future data.
• SH measures for how long the model remains above an acceptable performance threshold, i.e., its
temporal validity.
• DH measures after how many time steps performance degradation becomes statistically evident,
i.e., its sensitivity to drift.</p>
        <p>When comparing models, these metrics enable nuanced interpretations. Two models with similar
average OOD accuracy may exhibit very different TAS values, revealing which model better maintains
relative performance to the oracle across time. Conversely, models with similar TAS may differ in
SH or DH, distinguishing those that remain stable for longer periods from those that degrade more
abruptly. Together, TAS, SH, and DH form a coherent metric suite that separates adaptation from
difficulty, providing interpretable temporal diagnostics that static metrics such as ID-OOD accuracy
cannot capture.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup and Results</title>
      <p>We empirically evaluate the proposed metrics to assess their ability to disentangle temporal adaptation
from intrinsic data difficulty. Our evaluation is designed to answer two main questions: (i) whether the
metrics provide additional insight beyond standard ID–OOD accuracy comparisons, and (ii) whether
they consistently characterize temporal adaptation dynamics across datasets with different types of
temporal shifts.</p>
      <p>To this end, we conduct experiments on two benchmarks from the Wild-Time suite, Yearbook
and FMoW-Time, which exhibit complementary temporal behaviors. We compare multiple learning
paradigms and analyze the proposed metrics both quantitatively and qualitatively, highlighting how
they capture stability, drift onset, and relative adaptation over time.</p>
      <sec id="sec-5-1">
        <title>5.1. Datasets</title>
        <p>We evaluate the proposed metrics on two temporal benchmarks from the Wild-Time suite: Yearbook
and Functional Map of the World-Time (FMoW-Time). Both datasets are explicitly organized along a
temporal axis, allowing a controlled analysis of model behavior under real temporal distribution shifts.
Yearbook consists of grayscale portraits of American high school students from 1930 to 2013. Each
year defines a distinct data subset, and the task is gender classification. Temporal shifts reflect gradual
changes in photographic style, lighting, and fashion trends rather than abrupt domain changes.
FMoW-Time contains satellite images of land-use scenes captured over multiple years and geographic regions.
The task is to classify the functional category of each scene (e.g., airport, hospital, residential area).
Temporal shifts arise from environmental changes, sensor updates, and evolving land use, producing a
rich and realistic testbed for long-term adaptation analysis.</p>
        <p>While Yearbook exhibits smooth, visually interpretable drifts dominated by the evolution of P(X),
FMoW features complex multimodal shifts combining temporal and spatial factors. Together, they
provide complementary perspectives on temporal robustness: gradual aesthetic drift versus heterogeneous
real-world dynamics.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Models and Evaluated Metrics</title>
        <p>We evaluate several models representative of different learning paradigms: Empirical Risk Minimization
(ERM), the baseline model trained independently on each year’s data; CORAL, a domain alignment
method minimizing feature-level covariance shift; Elastic Weight Consolidation (EWC), a continual
learning regularization approach that penalizes deviation from previous parameters; Synaptic
Intelligence (SI), an alternative regularization-based continual learner; Fine-Tuning (FT), which sequentially
updates the model using new data without explicit drift control; and, finally, a pipeline that combines
Momentum Contrastive Learning techniques [15] with Streaming Machine Learning [16] models,
designed to improve temporal adaptability [17].</p>
        <p>For FMoW, we computed ID and OOD accuracy, TAS, SH, and DH. For Yearbook, we focused on ID,
OOD, and TAS. All experiments were re-run from scratch using the official Wild-Time implementation.
The combination of the metrics enables both global and dynamic analysis: ID-OOD scores reveal
absolute robustness, whereas TAS, SH, and DH expose temporal adaptation dynamics.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results on FMoW-Time</title>
        <p>To illustrate how the proposed metrics are derived, Figure 2 shows the temporal accuracy matrices
A(t, τ) and their normalized counterparts TTR(t, τ) for the Fine-Tuning (FT) model on FMoW-Time. Each
row corresponds to a model trained at time t and evaluated across future years τ &gt; t. The normalized
matrix TTR(t, τ) captures how well the model maintains its relative performance over time, with darker
cells indicating a smaller deviation from the oracle.</p>
        <p>In our experiments with this dataset, we configured the parameters of the proposed metrics to reflect
realistic temporal dynamics in the dataset. Specifically, we set the tolerance threshold for the Stability
Horizon to θ = 0.6, and the deviation threshold for the Drift Horizon to δ = 0.15. These values were
chosen to balance sensitivity and robustness, ensuring that minor fluctuations in yearly performance
do not dominate the evaluation while still capturing substantial temporal drifts.</p>
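        <p>With the sketches from Section 4, per-year horizons of the kind reported below could be obtained roughly as follows; the matrix acc here is synthetic, purely to make the snippet self-contained, and is not FMoW-Time data.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
T = 10
# Synthetic accuracy matrix: performance decays with temporal distance.
gap = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
acc = np.clip(0.7 - 0.03 * gap + 0.02 * rng.standard_normal((T, T)), 0.0, 1.0)

ttr = temporal_transfer_ratio(acc)
sh = [stability_horizon(ttr[t, t:], theta=0.6) for t in range(T)]
dh = [drift_horizon(ttr[t, t:], delta=0.15) for t in range(T)]
print(sh, dh)
        </preformat>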
        <p>Figure 2: (a) A(t, τ): Accuracy; (b) TTR(t, τ): TTR.</p>
        <p>For Fine-Tuning, the computed Stability and Drift Horizons are reported per year below, showing
how long the model remains reliable after training and when performance degradation becomes evident:
SH = [4, 6, 5, 5, 5, 4, 4, 6, 6, 6] (average 5.1 years)
DH = [2, 7, 7, 7, 6, 7, 7, 7, 7, 7] (average 6.4 years)
We use 7 as a sentinel value that indicates that the threshold was not crossed within the observable
window (H = 6). In these cases, the value is truncated to the maximum observable window, meaning that
the model maintained acceptable performance or did not show statistically significant drift throughout
the evaluation period. This behavior highlights temporal persistence rather than an absolute duration
beyond seven years. Overall, this pattern indicates that FT maintains acceptable performance for
approximately five years on average, while significant drift is only detected after about six years,
suggesting relatively stable behavior over time.
        </p>
        <p>Table 1 compares the six evaluated models, MoCo (with SML), Synaptic Intelligence (SI), CORAL,
Empirical Risk Minimization (ERM), A-GEM, and Fine-Tuning (FT), across all proposed metrics.
Traditional ID and OOD accuracies show overall consistency with the Wild-Time benchmark, while TAS, SH,
and DH provide deeper insights into temporal adaptation. For instance, although MoCo+SML achieves
a higher ID accuracy, its TAS and SH values are significantly lower, revealing poor retention of relative
performance over time. In contrast, methods such as SI and FT exhibit stronger temporal robustness,
maintaining higher TAS and SH despite similar absolute accuracy levels. Interestingly, the Drift Horizon
remains consistently around six years for most models, indicating a common temporal limit beyond
which adaptation becomes ineffective.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Results on Yearbook</title>
        <p>The Yearbook dataset provides a long temporal span, making it a suitable benchmark to analyze gradual
visual and distributional changes over decades. Table 2 reports the average and worst-case results for
the evaluated models across standard metrics (ID, OOD) and the proposed ones (TAS). As observed
in prior work, all models achieve strong ID and OOD accuracy, yet the proposed TAS metric provides
additional insight into how well each model adapts to the temporal evolution of the data.</p>
        <p>To better understand this distinction, Figure 3 analyzes the temporal behavior of MoCo+SML. The
plot shows the yearly evolution of In-Distribution (ID) accuracy, Out-of-Distribution (OOD) accuracy,
and the corresponding Temporal Adaptation Score (TAS). A clear example of how TAS complements
ID–OOD analysis appears around 1970: while ID and OOD accuracies exhibit a sharp gap, approximately
from 99% to 77%, the TAS value at that point remains close to 90%. This indicates that, although the
apparent drop suggests poor generalization, the model still retains 90% of its oracle performance. In
other words, the degradation is not purely due to lack of temporal adaptation, but rather to an intrinsic
increase in data difficulty, as also reflected by the decline in ID accuracy observed in subsequent years.
TAS thus allows distinguishing between a genuine adaptation failure and a natural evolution of the task
itself, revealing that models may remain relatively well-aligned with the underlying temporal dynamics
even when absolute performance decreases.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>This work introduced a framework for evaluating temporal adaptation in machine learning models.
We argued that existing metrics—such as In-Distribution (ID) and Out-of-Distribution (OOD)
accuracy—confuse intrinsic data difficulty with a model’s actual capacity to adapt over time. To address
this ambiguity, we proposed three complementary post-hoc metrics: the Temporal Adaptation Score
(TAS), the Stability Horizon (SH), and the Drift Horizon (DH). Together, they provide a dynamic and
interpretable view of model robustness under temporal distribution shift.</p>
      <p>Experiments on the Yearbook and FMoW benchmarks demonstrated that these metrics uncover
adaptation patterns that remain hidden under static ID–OOD comparisons. In particular, TAS captures
relative adaptation rather than absolute accuracy, distinguishing genuine temporal misalignment from
intrinsic degradation in data quality. The Stability Horizon quantifies how long a model remains reliable,
while the Drift Horizon identifies when degradation becomes statistically significant. Our results
suggest that what has often been interpreted as a persistent failure to generalize over time may, in
many cases, reflect a slower but still effective adaptation process.</p>
      <p>Several extensions are worth pursuing. First, while our metrics are post-hoc and require full temporal
supervision, future research could investigate online approximations that estimate adaptation dynamics
during deployment. Second, integrating these metrics with temporal model selection or active retraining
strategies could enable automatic detection of retraining points, reducing computational costs while
maintaining accuracy. Third, exploring their applicability beyond classification (e.g., to regression,
forecasting, or multimodal temporal learning) would help assess their generality across domains. Finally,
a deeper theoretical analysis of the relationship between temporal shift magnitude, adaptation speed,
and stability horizons could lead to more formal guarantees of temporal robustness.</p>
      <p>Overall, our findings highlight that evaluating adaptation under temporal distribution shifts requires
going beyond static accuracy metrics. By explicitly modeling how performance evolves over time, we
can move toward a more faithful understanding of model behavior in truly dynamic environments.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT (by OpenAI) for grammar and spelling
checks. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Wild-time: A benchmark of in-the-wild distribution shift over time</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>10309</fpage>
          -
          <lpage>10324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Melville</surname>
          </string-name>
          ,
          <article-title>UMAP: Uniform manifold approximation and projection for dimension reduction</article-title>
          , arXiv preprint arXiv:1802.03426 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ginosar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rakelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sachs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efros</surname>
          </string-name>
          ,
          <article-title>A century of portraits: A visual historical record of American high school yearbooks</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision Workshops</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Marklund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balsubramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gao</surname>
          </string-name>
          , et al.,
          <article-title>Wilds: A benchmark of in-the-wild distribution shifts</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5637</fpage>
          -
          <lpage>5664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Christie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fendley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <article-title>Functional map of the world</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on CVPR</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Žliobaitė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pechenizkiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouchachia</surname>
          </string-name>
          ,
          <article-title>A survey on concept drift adaptation</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>46</volume>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] I. Žliobaitė, M. Pechenizkiy, J. Gama, An overview of concept drift applications, Big Data Analysis: New Algorithms for a New Society (2015) 91–114.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Bifet, R. Gavaldà, Learning from time-changing data with adaptive windowing, in: Proceedings of the 2007 SIAM International Conference on Data Mining, SIAM, 2007, pp. 443–448.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences 114 (2017) 3521–3526.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, in: International Conference on Machine Learning, PMLR, 2017, pp. 3987–3995.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] A. Chaudhry, M. Ranzato, M. Rohrbach, M. Elhoseiny, Efficient lifelong learning with A-GEM, arXiv preprint arXiv:1812.00420 (2018).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y. Hsu, Y. Liu, Z. Kira, Re-evaluating continual learning scenarios: A categorization and case for strong baselines, CoRR abs/1810.12488 (2018). arXiv:1810.12488.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, L. Schmidt, Measuring robustness to natural distribution shifts in image classification, Advances in Neural Information Processing Systems 33 (2020) 18583–18599.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: CVPR, 2020.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] H. M. Gomes, J. Read, A. Bifet, J. P. Barddal, J. Gama, Machine learning for streaming data: state of the art, challenges, and opportunities, ACM SIGKDD Explorations Newsletter 21 (2019) 6–22.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Iovine, G. Zifer, A. Proia, E. Della Valle, Towards streaming land use classification of images with temporal distribution shifts, ESANN Proceedings (2025).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>