<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Streaming Continual Learning for Earth Observation Multimodal Foundation Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcello M. Declich</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)</institution>
          ,
          <addr-line>Via Leonardo da Vinci, 32, Milan, 20131</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Earth Observation (EO) systems generate continuous multimodal data streams at unprecedented scales. However, in this context, the literature offers solutions based on foundation models that operate within static training paradigms, which limit their effectiveness. Trained once on historical datasets and deployed without further learning, these models face critical issues when confronted with the dynamic nature of the environment, which includes emerging phenomena, sensor degradation, and evolving environmental patterns. This vision paper identifies three fundamental gaps: (1) the absence of memory-efficient anti-forgetting mechanisms at the foundation scale, (2) static cross-modal fusion strategies that cannot adapt to changing observational contexts, and (3) temporal representations that fail to distinguish cyclical patterns from distributional drift. Addressing these limitations requires convergence of foundation models, Continual Learning, and Streaming Machine Learning. This work envisions three key research directions: efficient model updating through selective replay and parameter regularization, explicit drift detection mechanisms, and context-dependent fusion strategies. These directions aim to enable EO systems that continuously learn from terabyte-per-day satellite streams while maintaining transfer learning capabilities and computational feasibility essential for operational deployment.</p>
      </abstract>
      <kwd-group>
        <kwd>Geospatial Foundation Models</kwd>
        <kwd>Streaming Continual Learning</kwd>
        <kwd>Multimodal Fusion</kwd>
        <kwd>Concept Drift Detection</kwd>
        <kwd>Earth Observation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the availability of digitized information about the Earth has increased exponentially,
opening new opportunities for applying large Artificial Intelligence (AI) analysis models in fields such
as environmental monitoring, disaster management and response, urban planning and smart cities,
security, and defense applications.</p>
      <p>These AI models are inherently multimodal systems, meaning that they must process and integrate
data from different types of representations. Earth Observation (EO) data is inherently multimodal,
ranging from high-resolution optical and Synthetic Aperture Radar (SAR) satellite imagery to
ground-based sensor networks, from textual reports to elevation and 3D products (e.g., Digital Elevation Models,
LiDAR point clouds). In addition, EO data includes radiometric measurements, i.e., the intensity recorded
by sensors across different wavelengths of the electromagnetic spectrum. This diversity particularly
benefits from AI approaches that can exploit a complex ecosystem of interconnected information. These data
are complementary, and their interconnection enables a richer understanding of the Earth system.</p>
      <p>The core challenge of multimodal learning is to combine information across modalities, a process
usually referred to as cross-modal fusion. For example, data coming from optical sensors is obscured
by clouds, whereas SAR penetrates them and provides an image of the Earth regardless of weather
conditions. At the same time, this diversity amplifies the challenges of generalizing over EO data.</p>
      <p>This heterogeneity of data is accompanied by diversity in data generation and reception. Data are
produced constantly as a continuous data stream (an unbounded sequence of data generated and transmitted
over time), with each satellite generating terabytes of data daily. The revisit frequency (the time interval
between successive observations of the same location by a satellite or satellite constellation) for a single
region varies, ranging from days for polar-orbiting satellites to continuous observation for geostationary
satellites. Resolution also varies significantly, with the spectral bands of the Sentinel-2 satellites capturing data at
10, 20, or 60 m resolution (for ESA Sentinel-2 instrument specifications, see https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-2/Instrument).</p>
      <p>Moreover, the assumption that data is independent and identically distributed (i.i.d.), which is usually
made in traditional machine learning, does not hold in the EO scenario. According to this assumption,
each observation must be drawn from the same probability distribution and not influence the others.</p>
      <p>A first class of violations of the i.i.d. assumption in EO data arises from spatial and temporal dependence
among observations. EO measurements are autocorrelated at multiple scales (e.g., diurnal, seasonal, or
orbital) and dependent on one another in several complementary ways.</p>
      <p>First, dependence may arise across different spatial tiles observed at the same time. Images acquired
at nearby locations are often correlated due to shared environmental conditions and similar land cover
types (e.g., fields, buildings, trees), leading to spatial dependence.</p>
      <p>Second, temporal dependence may arise when the same spatial tile is observed at different time points.
In this case, repeated acquisitions of the same location may be inherently correlated, as they reflect
the temporal evolution of the same underlying scene. Such dependence may be driven by periodic
patterns, including diurnal or seasonal cycles, such as the vegetation cycles illustrated in the top rows of
Figure 1. However, temporal dependence is not limited to strictly periodic phenomena. Observations
collected at adjacent time points tend to be strongly correlated due to gradual and continuous changes in
environmental conditions.</p>
      <p>Additionally, the observation modalities themselves are non-stationary in their information
acquisition characteristics. Dependence may also arise from the sensing process and temporal acquisition
patterns, as different modalities can be partially or entirely unavailable depending on the time of day
or atmospheric conditions. For instance, optical sensors cannot acquire data during nighttime hours.
Furthermore, the correlation structure between modalities is dynamic rather than fixed: observations
from different sensors may exhibit strong correlation under certain conditions, such as during clear,
sunny weather, while becoming uncorrelated or complementary under others, such as during cloudy
conditions when synthetic aperture radar provides information unavailable to optical instruments.</p>
      <p>A distinct, yet related, violation of the i.i.d. assumption occurs when the data distribution evolves over
time. Such changes are commonly referred to as concept drift and may be driven by long-term climate
trends, sensor degradation, anthropogenic activities, or extreme events. In this case, the relationship
between inputs and their associated outputs is no longer stationary, as illustrated
in the bottom rows of Figure 1.</p>
      <p>
        Drifts are extensively addressed by two research areas: Continual Learning (CL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Streaming
Machine Learning (SML) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] from two different perspectives. CL aims to avoid catastrophic forgetting, the
issue that arises when a model adapts to a new distribution and may lose knowledge it acquired from
older distributions. The CL goal is to strike a balance between acquiring new knowledge (plasticity)
and retaining the past (stability). On the other hand, SML specifically focuses on real-time adaptation,
often emphasizing the current distribution and potentially ignoring the past. SML also proposes explicit
concept drift detectors to detect changes and prevent models from experiencing strong performance
decreases. Recently, Streaming Continual Learning (SCL) [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ] has been introduced as a distinct
and unified paradigm that bridges CL and SML to produce complete and integrated solutions with the
goals of rapid adaptation to drifts and knowledge retention.
      </p>
      <p>
        Nevertheless, CL and SML approaches have been little studied in EO, and further investigation is
needed. The literature mainly reports applications to unimodal cases or simple vision tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and
their extension to complex, terabyte-scale multimodal fusion scenarios remains to be explored.
      </p>
      <p>In recent years, research in EO has focused on the development of Foundation Models (FM). FMs
are trained on huge amounts of unlabeled multimodal data using a self-supervised approach. The
models do not learn to solve a specific task but instead learn a representation of the data by combining
the various data modalities (and thus learn to balance them).</p>
      <p>FMs have high transfer learning capabilities and cross-modality understanding of the data, and they
simplify the implementation of downstream tasks by reducing the amount of task-specific training data,
allowing for few-shot adaptation. However, they operate in a static paradigm that trains the models
once on historical data and then deploys them without further learning.</p>
      <p>FMs, SML, and CL have each developed solutions for their own domains, but when applied to operational
EO, they are insufficient on their own for creating a system that is simultaneously adaptive, multimodal, and
drift-aware. Currently, there is no framework in the literature that simultaneously addresses continuous
multimodal learning, anti-forgetting applied to FMs, adaptive cross-modal fusion under distribution
shift, and explicit detection of concept drift for Earth observation systems. This work calls for the
convergence of FM, CL, and SML for operational EO systems.</p>
      <p>This work identifies three critical gaps that prevent existing approaches from scaling to continuous
multimodal learning: (1) the absence of memory-efficient anti-forgetting mechanisms at foundation scale,
(2) static cross-modal fusion that cannot adapt to changing sensor reliability, and (3) temporal representations
that conflate cyclical patterns with distributional drift.</p>
      <p>The rest of this paper is organized as follows. Section 2 discusses the principles and challenges of
CL and SML in detail, while Section 3 provides an overview of geospatial foundation models (GFMs).
Section 4 presents the limitations of the current GFMs, while Section 5 presents the proposed research
directions. Section 6 presents conclusions and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Learning in non-stationary environments</title>
      <p>In EO scenarios, data is typically generated continuously at a high frequency. This situation produces
what the literature defines as a data stream, which is formally an unbounded sequence of data points
x_1, x_2, ..., x_t, x_{t+1}, .... When focusing on classification problems, each data point at time t is a couple
(x_t, y_t), where x_t is a feature vector and y_t is the target label. The assumption is that the real label y_t,
associated with x_t, will be available only after receiving x_t and predicting ŷ_t.</p>
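      <p>To make this protocol concrete, the following minimal Python sketch implements the test-then-train (prequential) loop just described: the model predicts ŷ_t before the true label y_t is revealed, and only then updates. The model object and its predict_one/learn_one interface are illustrative assumptions, not a specific library API.</p>
      <preformat>
# A minimal sketch of the test-then-train (prequential) protocol.
# "model" is a hypothetical incremental learner exposing
# predict_one/learn_one; this mirrors common streaming APIs but is an
# assumption, not a specific library interface.

def prequential_loop(stream, model):
    """Predict each x_t before its label y_t is revealed, then train."""
    correct, seen = 0, 0
    for x_t, y_t in stream:              # unbounded sequence x_1, x_2, ...
        y_hat = model.predict_one(x_t)   # prediction made before y_t is known
        correct += int(y_hat == y_t)     # score against the revealed label
        seen += 1
        model.learn_one(x_t, y_t)        # only now update on (x_t, y_t)
    return correct / max(seen, 1)        # prequential accuracy
      </preformat>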
      <p>
        Additionally, the evolving nature of EO data presents non-stationarities. A concept is the hidden and
unobservable process that produces the data stream. It is modeled as a stochastic process that generates
data points according to the joint distribution P(X, Y). A concept drift occurs when this probability
distribution changes. Concept drifts are considered relevant when they require an update of the trained
model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The literature distinguishes between two main types of changes [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The first, known as virtual drift,
refers to a change in the input distribution P(X) that affects the feature distribution but not the decision
boundary. In EO, for example, consider a land-cover classification task: the seasonal transition
from winter to summer modifies spectral features but does not alter land cover categories or
how previously observed images are classified. A forest remains a forest, but new types of forests with
different reflectances may appear throughout the year. The latter, real drift, refers instead to the case in
which the decision boundary (formally defined as the probability P(Y|X)) itself changes, resulting in
an explicit modification of the mapping between inputs and outputs. Following the previous example,
recent land-cover standards require a minimum canopy cover to classify an area as forest. Regions that
were previously labeled forest may now be labeled shrubland. The appearance may stay similar, but the
class meaning changes.
      </p>
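      <p>The distinction can be illustrated with a small synthetic example in Python (numpy only). The 1-D "spectral" feature and the thresholds are purely illustrative assumptions: virtual drift shifts P(X) while the labelling rule is untouched, whereas real drift changes the rule P(Y|X) itself.</p>
      <preformat>
# Synthetic illustration of the two drift types (numpy only).
import numpy as np

rng = np.random.default_rng(0)

def concept_a(x):
    # original decision boundary P(Y|X): class 1 above 0.5
    return np.greater(x, 0.5).astype(int)

# Virtual drift: P(X) shifts (e.g., seasonal reflectance change),
# but the labelling rule is unchanged.
x_winter = rng.normal(0.3, 0.1, 1000)
x_summer = rng.normal(0.7, 0.1, 1000)        # P(X) moved; P(Y|X) identical
y_winter, y_summer = concept_a(x_winter), concept_a(x_summer)

# Real drift: the same inputs, but the rule P(Y|X) itself changes
# (e.g., a stricter canopy-cover threshold for the "forest" class).
def concept_b(x):
    return np.greater(x, 0.8).astype(int)    # new decision boundary

flipped = np.sum(concept_a(x_summer) != concept_b(x_summer))
print("labels changed by real drift:", int(flipped))
      </preformat>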
      <p>CL assumes a general problem to be split into subproblems called tasks (which correspond to concepts).
At each timestamp, the data stream generates a large batch of data (called an experience) containing all the
data points associated with a distribution. Moving back to the data stream definition, a CL stream is the
result of accumulating all data points associated with a specific input distribution P(X) in a batch and
presenting them to the model at once. Whenever a new experience is available, the solution has as much
time as needed to process it. The goal is to learn how to solve the new task without losing previously
acquired knowledge, thus avoiding catastrophic forgetting. Avoiding forgetting in this setting is crucial,
as changes introduce a new input distribution without contradicting what has been learned so far.</p>
      <p>
        CL proposes three strategies to avoid catastrophic forgetting [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ]: replay-based, regularization,
and architectural solutions. Replay-based [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] solutions involve storing a portion of the data observed
in previous tasks to be combined with current data during the training phase. Regularization
methods [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] work by constraining the loss function during training to preserve performance on previous tasks.
Architectural solutions mitigate catastrophic forgetting by adjusting the model’s structure as learning
progresses; such solutions include freezing weights or neurons, and expanding the network.
CL mainly uses Deep Learning approaches.
      </p>
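      <p>As a concrete illustration of the replay family, the sketch below implements a fixed-capacity buffer based on reservoir sampling, a common choice for deciding which past data points to retain. It is a minimal storage sketch, assuming nothing about the model, not a complete CL strategy.</p>
      <preformat>
import random

class ReservoirReplayBuffer:
    """Fixed-capacity replay memory based on reservoir sampling, so every
    point seen so far is retained with equal probability. A minimal sketch
    of the storage behind replay-based CL strategies, not a full method."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.items) &lt; self.capacity:
            self.items.append(example)        # still filling the buffer
        else:
            j = random.randrange(self.seen)   # keep with prob capacity/seen
            if j &lt; self.capacity:
                self.items[j] = example

    def sample(self, k):
        # mini-batch of past examples to mix with the current experience
        return random.sample(self.items, min(k, len(self.items)))
      </preformat>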
      <p>
        Conversely, SML does not make any assumptions about the nature of concept drifts, which can be both
real and virtual. It usually applies statistical machine learning models (decision trees or probabilistic
models) with the goal of focusing just on the current distribution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is typically applied to tabular
data or small computer vision examples. The issue of forgetting is usually overlooked, and the primary
objective is to produce models that can quickly adapt to new distributions. Rapid adaptability is, thus,
preferred over stability [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The main rationale is that if the solution takes little time to adapt to
changes, it does not need to retain past knowledge. Additionally, the SML literature proposes concept drift
detectors, such as ADWIN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Page-Hinkley [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], i.e., methods for detecting when a concept drift occurs and determining when to update the model.
      </p>
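      <p>As an example of such a detector, the following is a minimal self-contained implementation of the Page-Hinkley test for a rise in the mean of a univariate signal (e.g., a model's per-sample error). The default delta and lambda values are illustrative, not tuned recommendations.</p>
      <preformat>
class PageHinkley:
    """Minimal Page-Hinkley test for detecting an increase in the mean of
    a univariate stream. delta is the tolerated magnitude of change and
    lambda_ the alarm threshold; defaults are illustrative only."""

    def __init__(self, delta=0.005, lambda_=50.0):
        self.delta, self.lambda_ = delta, lambda_
        self.mean, self.n = 0.0, 0
        self.cum, self.cum_min = 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n    # running mean
        self.cum += x - self.mean - self.delta   # cumulative deviation m_t
        self.cum_min = min(self.cum_min, self.cum)
        # alarm when the deviation rises far above its historical minimum
        return (self.cum - self.cum_min) &gt; self.lambda_
      </preformat>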
    </sec>
    <sec id="sec-3">
      <title>3. Multimodal Geospatial Foundation Models</title>
      <p>
        Multimodal FMs are large-scale AI models, pretrained on vast and heterogeneous data, that learn
general-purpose representations and enable efficient adaptation to many downstream tasks across
different domains and modalities [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] with minimal additional fine-tuning. FMs provide a generalizable
representation framework for enabling cross-task and cross-modal generalization. In the geospatial
domain, foundation models inherit this same philosophy but are specifically designed to address
the unique characteristics and challenges of remote sensing data. Remote sensing imagery presents
fundamental challenges for conventional vision models due to its diverse spatial resolutions, extensive
spectral dimensions beyond visible light, and incorporation of specialized data types including radar
and LiDAR [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The field’s requirements for temporal analysis, multi-scale object detection, and
domain-specific tasks create a significant gap between natural image models and geospatial applications.
This need has guided scientific research toward the realization of FMs tailored to remote sensing characteristics.
      </p>
      <p>
        The evolution of GFMs can be divided into three main generations. The first generation consists of
unimodal models based on RGB imagery, trained with self-supervised approaches such as masked image
modeling to learn effective representations. The second generation introduces multimodal models that
combine data from different sensors, including optical, SAR, multispectral, or LiDAR sources, improving
robustness and accuracy. The third and most recent generation is represented by vision-language
models that align remote sensing imagery with natural language. This enables advanced applications
such as image captioning, visual question answering, and text-based image retrieval [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        In this paper, we focus on Multimodal GFMs, depicted in Figure 2; however, the underlying reasoning
can be extended to Vision-Language GFMs. Multimodal GFMs rely on Transformers [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], such as
ViT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], which are the current standard thanks to their ability to model the long-range dependencies that
are essential for large satellite images.
      </p>
      <p>
        Multimodal GFMs rely on three primary self-supervised training paradigms [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Masked Modeling
learns general representations by reconstructing randomly masked portions of the input. It is
particularly effective for capturing dense spatial contexts and fine-grained details in distinct modalities, as
demonstrated by models like SatMAE [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and RingMo [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Contrastive Learning builds discriminative
representations by maximizing the similarity between positive pairs (e.g., co-registered optical and
SAR patches) while minimizing it for negative ones. This paradigm is the cornerstone for cross-modal
alignment and connects visual features with linguistic semantics in Vision-Language models like
RemoteCLIP [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Finally, Generative Learning models the joint data distribution to generate new samples
or reconstruct missing modalities, proving essential for synthesizing complex spatial structures and
handling multi-scale hierarchies, as in MetaEarth [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
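      <p>As an illustration of the contrastive paradigm, the sketch below computes a symmetric InfoNCE loss over a batch of co-registered optical/SAR embedding pairs in PyTorch. The function name and temperature value are illustrative assumptions, not the objective of any specific model cited above.</p>
      <preformat>
# A minimal sketch of symmetric InfoNCE for co-registered optical/SAR
# pairs (PyTorch). Names and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def infonce_loss(opt_emb, sar_emb, temperature=0.07):
    """Row i of each (batch, dim) tensor comes from the same location,
    so the diagonal of the similarity matrix holds the positive pairs."""
    opt = F.normalize(opt_emb, dim=-1)        # unit-norm embeddings
    sar = F.normalize(sar_emb, dim=-1)
    logits = opt @ sar.t() / temperature      # pairwise cosine similarities
    targets = torch.arange(opt.size(0), device=opt.device)
    # cross-entropy pulls matched pairs together and pushes others apart
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
      </preformat>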
      <p>
        Despite the technical complexities, GFMs are widely adopted today because they offer a paradigm
shift from isolated, sensor-specific analysis to an integrated, semantic understanding of the Earth [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
Their key value lies in complementarity: for instance, fusing Optical RGB with SAR allows for
all-weather monitoring (seeing through clouds), while integrating LiDAR adds precise geometric structural
information unavailable to 2D sensors [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Furthermore, these models demonstrate strong zero-shot
and few-shot capabilities, enabling adaptation to new tasks with minimal labeled data. In terms of
applications, Multimodal GFMs are deployed across diverse downstream tasks categorized into visual
and vision-language domains. Visual tasks include Object Detection (e.g., ships, vehicles), Land Cover
Classification, and Change Detection for disaster response (e.g., flood assessment) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>(Figure 3: (a) One-shot pretraining; (b) Temporal encodings; (c) Static fusion.)</p>
    </sec>
    <sec id="sec-4">
      <title>4. Limitations of the Static Paradigm in GFMs</title>
      <p>
        While GFMs have demonstrated remarkable performance in enabling EO downstream tasks (e.g., strong
benchmark results on flood detection, burn scar mapping, and crop monitoring), their static training
paradigm presents fundamental limitations. These models are typically trained once on historical data,
effectively freezing their knowledge at a specific temporal snapshot. This offline approach creates
critical bottlenecks for effective operation in dynamic contexts. Three limitations, which are depicted
in Figure 3, exemplify these constraints and motivate the need for a paradigm shift toward a continuous
learning architecture:
1. One-Shot Offline Pretraining. GFMs are typically trained only once on huge historical datasets,
freezing their knowledge at that specific moment. When new phenomena emerge (e.g., new
patterns of deforestation, unprecedented extreme weather events, or sudden changes in land use),
the model cannot incorporate this information on its own. Retraining is expensive: updating
billions of parameters requires enormous computational resources [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] (e.g., thousands of
GPU hours), prohibitive costs, and weeks or months of time.
2. Limitations of Temporal Encodings. Existing GFMs are mainly based [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] on
Transformers [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and they rely on temporal encodings, which can be seen as an evolution of positional
encodings [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Temporal encodings are learned or predefined vector representations that inject
timestamp information into model inputs, enabling the model to associate each observation with
its corresponding time of acquisition and to learn how geospatial features change or evolve over
time. This design has proven effective, allowing models to develop a meaningful understanding
of temporal progression and to jointly reason over spatial information. However, temporal
encodings do not explicitly enforce relationships between different time instants within a time series,
which may limit the model’s ability to robustly extract features that are specifically informative
for long-term dynamics and abrupt changes.
3. Static Cross-Modal Relationships. The models integrate data from different sensors (optical,
radar, thermal, LiDAR) using attention weights learned during training that remain unchanged.
The same fusion strategy is applied regardless of the operating context. A system
that dynamically evaluates the relevance and reliability of each sensor or modality would instead allow
the importance attributed to each source to be remodulated in real time: for example, giving
more importance to SAR images in areas with high cloud cover, reducing the importance of a
malfunctioning sensor, or excluding certain sensors in extreme weather conditions (e.g., storms,
hurricanes, volcanic eruptions). What is missing in these models are methods for learning
context-dependent fusion strategies that adapt based on historical performance, current observational
conditions, and drift signals.
      </p>
      <p>(Figure 4: (a) Efficient updating; (b) Temporal structure; (c) Adaptive fusion.)</p>
    </sec>
    <sec id="sec-5">
      <title>5. A Vision for Lifelong GFM Learning</title>
      <p>
        The critical limitations described in the previous section require a novel research
approach at the intersection of FMs, CL, SML, and EO, one which simultaneously considers scale,
multimodality, non-stationarity, and operational constraints. We envision a new unified framework whose
research directions are presented below, with the goal of guiding the community towards systems
that can learn continuously from terabyte-per-day multimodal satellite streams while maintaining
the transfer learning capabilities and scale that make foundation models powerful. To address these
challenges, as illustrated in Figure 4, three interconnected research directions are proposed to enable
this vision:
1. Leverage Efficient Multimodal Foundation Model Updating. The computational and
storage cost of continually updating billions of parameters requires designing effective strategies for
continuously updating pretrained models, avoiding full model retraining for large GFMs. GFMs
should support CL paradigms that allow them to incorporate new information over time while
remaining consistent with previously acquired knowledge. This requires mechanisms to mitigate
catastrophic forgetting and preserve historical representations. Continual Pretraining [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] is
an area of research with this specific goal. In this direction, CaSSLe [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and PFR [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] utilize
distillation mechanisms to update representations and keep them consistent, thereby
minimizing the risk of catastrophic forgetting and ensuring they can represent previously observed
instances in a meaningful way. New approaches must be researched, or existing ones adapted
from CL and SML, for efficient replay at multimodal scale, efficient parameter
regularization to prevent catastrophic forgetting, and architectural designs tailored to
this specific problem.
2. Learned Temporal Patterns. Advancing FMs for EO requires a fundamental shift from passive
temporal modeling to explicit temporal structure that can distinguish seasonal patterns from
genuine distribution shifts. This evolution must integrate three complementary mechanisms that
together create temporally aware embeddings capable of supporting advanced predictive tasks
such as crop monitoring, change detection, and disaster response.
      </p>
      <p>
        The first mechanism addresses cyclical pattern recognition across multiple temporal scales,
including diurnal, weekly, seasonal, and annual rhythms that manifest differently across modalities.
Attention-based temporal encoders [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] can learn to recognize these recurring patterns and
retrieve appropriate learned representations when familiar cycles reappear, rather than treating
each recurrence as a novel task requiring adaptation.
      </p>
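      <p>As a minimal illustration of multi-scale cyclical structure, the sketch below encodes an acquisition timestamp with sine/cosine pairs at several cycle lengths, so that recurring seasons land close together in encoding space. The fixed periods are an illustrative stand-in for the features a learned attention-based encoder could consume or replace.</p>
      <preformat>
# Minimal multi-scale time encoding: sine/cosine pairs at daily, weekly,
# and annual cycle lengths. Periods are illustrative assumptions.
import numpy as np

def multiscale_time_encoding(t_days, periods=(1.0, 7.0, 365.25)):
    """Encode an acquisition timestamp (in days) as a vector of
    2 * len(periods) cyclical features."""
    feats = []
    for p in periods:
        phase = 2.0 * np.pi * t_days / p
        feats.extend([np.sin(phase), np.cos(phase)])
    return np.asarray(feats)

# Acquisitions exactly one year apart share (nearly) the same annual
# components, so recurring seasons land close together in encoding space.
      </preformat>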
      <p>
        The second mechanism focuses on utilizing explicit drift detection modules to recognize concept
drifts, integrating standard SML drift detectors, such as ADWIN [27] or Page-Hinkley [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
These detectors can be adapted to multimodal EO data to distinguish gradual drift from abrupt
shifts. These memory-efficient algorithms prove particularly valuable during disaster scenarios or
exceptional events like hurricanes, where rapid response depends on recognizing early indicators
of distribution change. Unlike cyclical patterns that require recognition but not continuous
updating, trends demand ongoing adaptation mechanisms that can monitor deviations from both
historical cycles and expected trajectories. Integrating these adaptation mechanisms directly into
pre-trained models would enable continuously updated representations that evolve alongside the
temporal dynamics of the observed phenomena.
      </p>
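      <p>One hedged sketch of how a univariate SML detector could be adapted to multimodal EO streams follows: reduce each batch of fused embeddings to a scalar drift signal and feed it to a detector such as the Page-Hinkley sketch given in Section 2. The reduction choice (cosine distance from a reference mean) and all names are illustrative assumptions.</p>
      <preformat>
# Reduce a batch of fused embeddings to a scalar signal and monitor it.
import numpy as np

def embedding_drift_signal(ref_mean, batch_embs):
    """Cosine distance between the current batch mean embedding and a
    reference mean computed on a stable period."""
    m = batch_embs.mean(axis=0)
    cos = np.dot(ref_mean, m) / (np.linalg.norm(ref_mean) * np.linalg.norm(m))
    return 1.0 - cos   # grows as representations move away from the reference

# Usage with the PageHinkley sketch from Section 2 (hypothetical values):
# detector = PageHinkley(delta=0.001, lambda_=1.0)
# drift = detector.update(embedding_drift_signal(ref_mean, batch_embs))
      </preformat>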
      <p>The third and most transformative mechanism involves moving beyond implicit temporal
reasoning by incorporating explicit spatiotemporal masking strategies within FM architectures. By
explicitly enforcing interactions between observations across different time instants (e.g., masking
image patches across time), models could learn representations inherently sensitive to seasonal
dynamics, long-term environmental evolution, and abrupt drifts. In this vision, temporal masking
is not merely a training heuristic but a core design principle for GFMs that, associated with a
drift detection mechanism, could truly understand Earth system dynamics over time.
3. Continuous and Context-dependent Cross-Modal Fusion. During pre-training, FMs typically
learn fixed fusion weights for information originating from different modalities. However, a more
efficient solution would involve assigning weights dynamically, adapting to the model's varying
deployment conditions.</p>
      <p>Modality fusion is critical when applying CL to multimodal sources. During the continuous
learning process, features from different modalities (e.g., image and text) tend to diverge and
produce a phenomenon referred to as “spatial disorder” [28]: the progressive divergence of
representations across different modalities that were originally aligned in a shared embedding
space and that start drifting apart. This misalignment leads to more severe
performance degradation compared to unimodal models. Consequently, classical multimodal
fusion techniques effective in static contexts fail in continuous settings, as different fusion
strategies exhibit varying degrees of susceptibility to catastrophic forgetting [28].
Contemporary GFMs have demonstrated robust performance in integrating multimodal data.
However, their reliance on static, pre-defined fusion mechanisms for handling data streams
introduces significant limitations. Specifically, current GFMs lack: (1) Adaptive Modality
Fusion: a mechanism to dynamically adjust weights to compensate for varying quality and
inherent heterogeneity. Approaches such as the modality fusion network presented in [29] offer
a viable path; rather than relying on static fusion, such a network, applied to EO scenarios,
would allow the model to continuously and adaptively adjust the contribution of each specific
modality (e.g., mitigating the fluctuating utility of optical data under cloud cover). (2) Modality
Extensibility: the architectural flexibility to seamlessly incorporate new sensor types or data
modalities. Strategies employing meta-learners [30] can facilitate this expansion without incurring
catastrophic forgetting or requiring extensive retraining of the entire foundation model.</p>
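      <p>A minimal sketch of such context-dependent fusion follows, assuming a small gating network that maps a context vector (e.g., cloud-cover fraction, sensor-health flags, drift signals) to per-modality weights instead of fixed pretrained ones. Layer sizes and context features are illustrative assumptions, not a reference architecture.</p>
      <preformat>
# Context-dependent fusion sketch (PyTorch); all sizes are illustrative.
import torch
import torch.nn as nn

class ContextGatedFusion(nn.Module):
    def __init__(self, n_modalities, ctx_dim):
        super().__init__()
        # tiny gating network: context features to one logit per modality
        self.gate = nn.Sequential(
            nn.Linear(ctx_dim, 32), nn.ReLU(),
            nn.Linear(32, n_modalities),
        )

    def forward(self, modality_embs, context):
        # modality_embs: (batch, n_modalities, dim); context: (batch, ctx_dim)
        w = torch.softmax(self.gate(context), dim=-1)        # reliability weights
        return (w.unsqueeze(-1) * modality_embs).sum(dim=1)  # weighted fusion

# e.g., under heavy cloud cover the gate can learn to up-weight the SAR
# embedding and down-weight the optical one at inference time.
      </preformat>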
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper has presented a vision for SCL in EO FMs, addressing fundamental limitations in current
approaches and outlining directions towards significantly enhanced operational performance. While
existing FMs demonstrate remarkable performance on static benchmarks, their offline training paradigm,
fixed temporal embeddings, and static cross-modal learned relationships limit their applicability to the
dynamic, non-stationary nature of continuous EO data.</p>
      <p>Three possible research directions have been identified to realize this vision. First, developing updating
mechanisms via selective replay strategies and distillation-based approaches that update representations
while maintaining consistency with past embeddings, without requiring full model retraining. Second,
creating explicit temporal architectures to discriminate cyclical patterns from distributional shifts (e.g.,
via multi-scale temporal attention-based encoders) and integrating trend/shift detection modules for
long-term changes, to support robust detection and adaptation. Third, implementing context-dependent
cross-modal fusion that dynamically adjusts modality relevance over time based on reliability scores,
current observations, and historical performance, rather than applying fixed fusion weights learned
during pretraining.</p>
      <p>The convergence of FMs, CL, and SML represents not merely an incremental improvement but
a necessary paradigm shift for operational EO systems. As satellite constellations expand and data
generation rates accelerate, the ability to continuously integrate new information while maintaining
learned representations becomes essential for EO applications that demand scalable and interpretable
solutions across diverse spatiotemporal contexts, ranging from disaster response to climate change
monitoring. The research directions outlined in this paper provide a roadmap for the community to
develop systems that are simultaneously adaptive, multimodal, and drift-aware, characteristics that are
essential for an ever-changing world.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The author acknowledges financial support from TEF through a PhD fellowship, during which this paper
was refined. The author would also like to thank Prof. Emanuele Della Valle and his PhD Candidate
Federico Giannini for their valuable scientific discussions, insightful feedback, careful proofreading,
and continuous guidance throughout the development of this work.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used Generative AI tools to improve grammar and
readability (ChatGPT, Gemini, and Claude) and to translate selected passages (DeepL). In addition, the
author used Gemini to generate and/or edit figures. After using these tools, the author reviewed and
edited the content as needed and takes full responsibility for the publication’s content.</p>
      <p>[27] A. Bifet, R. Gavaldà, Learning from time-changing data with adaptive windowing, in: Proceedings
of the Seventh SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis,
Minnesota, USA, SIAM, 2007, pp. 443-448. URL: https://doi.org/10.1137/1.9781611972771.42. doi:10.1137/1.9781611972771.42.
[28] D. Yu, X. Zhang, Y. Chen, A. Liu, Y. Zhang, P. S. Yu, I. King, Recent advances of multimodal
continual learning: A comprehensive survey, arXiv preprint arXiv:2410.05352 (2024).
[29] H. Wang, S. Zhou, Q. Wu, H. Li, F. Meng, L. Xu, H. Qiu, Confusion mixup regularized multimodal
fusion network for continual egocentric activity recognition, in: Proceedings of the IEEE/CVF
International Conference on Computer Vision, 2023, pp. 3560-3569.
[30] G. Song, X. Tan, Real-world cross-modal retrieval via sequential learning, IEEE Transactions on
Multimedia 23 (2020) 1708-1721.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lesort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lomonaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stoian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maltoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Filliat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          ,
          <article-title>Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges</article-title>
          ,
          <source>Information fusion 58</source>
          (
          <year>2020</year>
          )
          <fpage>52</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gavalda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <article-title>Machine learning for data streams: with practical examples in MOA</article-title>
          , MIT press,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cossu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gepperth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Della</given-names>
            <surname>Valle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <article-title>A practical guide to streaming continual learning</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>674</volume>
          (
          <year>2026</year>
          )
          <article-title>132951</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/S0925231226003486. doi:10.1016/j.neucom.2026.132951.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cossu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lomonaco</surname>
          </string-name>
          ,
          <article-title>Streaming continual learning for unified adaptive intelligence in dynamic environments</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>39</volume>
          (
          <year>2024</year>
          )
          <fpage>81</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gunasekara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Bifet,</surname>
          </string-name>
          <article-title>Survey on online streaming continual learning</article-title>
          ,
          <source>in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>6628</fpage>
          -
          <lpage>6637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Iovine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Proia</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. Della Valle</surname>
          </string-name>
          ,
          <article-title>Towards streaming land use classification of images with temporal distribution shifts</article-title>
          ,
          <source>in: ESANN 2025 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning</source>
          ,
          i6doc.com publ., Bruges
          (Belgium) and online,
          <year>2025</year>
          . URL: http://www.i6doc.com/en/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsymbal</surname>
          </string-name>
          ,
          <article-title>The problem of concept drift: definitions and related work</article-title>
          , Computer Science Department, Trinity College Dublin 106 (
          <year>2004</year>
          )
          <fpage>58</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>G. I. Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hyde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petitjean</surname>
          </string-name>
          , Characterizing concept drift,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>30</volume>
          (
          <year>2016</year>
          )
          <fpage>964</fpage>
          -
          <lpage>994</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ebrahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Continual learning of large language models: A comprehensive survey</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>58</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bazhenov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Siegelmann</surname>
          </string-name>
          , T. J. Sejnowski, C. Kanan,
          <article-title>Replay in deep learning: Current approaches and missing biological elements</article-title>
          ,
          <source>Neural computation 33</source>
          (
          <year>2021</year>
          )
          <fpage>2908</fpage>
          -
          <lpage>2950</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>G. I. Parisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kemker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Part</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wermter</surname>
          </string-name>
          ,
          <article-title>Continual lifelong learning with neural networks: A review</article-title>
          ,
          <source>Neural networks 113</source>
          (
          <year>2019</year>
          )
          <fpage>54</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maniu</surname>
          </string-name>
          ,
          <article-title>Data stream analysis: Foundations, major tasks and tools</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          <article-title>e1405</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sebastião</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <article-title>Supporting the page-hinkley test with empirical mode decomposition for change detection</article-title>
          ,
          <source>in: International symposium on methodologies for intelligent systems</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>492</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , J. Ma,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ghamisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Plaza</surname>
          </string-name>
          , L. Fang,
          <article-title>Survey of multimodal geospatial foundation models: Techniques, applications, and challenges</article-title>
          ,
          <source>arXiv preprint arXiv:2510.22964</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:2010.11929
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          , E. Rozi,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lobell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          , Satmae:
          <article-title>Pre-training transformers for temporal and multi-spectral satellite imagery</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>197</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chang</surname>
          </string-name>
          , et al.,
          <article-title>RingMo: A remote sensing foundation model with masked image modeling</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>61</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>RemoteCLIP: A vision language foundation model for remote sensing</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>62</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>MetaEarth: A generative foundation model for global-scale remote sensing image generation</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jakubik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Blumenstiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Scheurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sedona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maurogiovanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bosmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dionelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Marsocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fraccaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brunschwiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cavallaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bernabe-Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Longépé</surname>
          </string-name>
          ,
          <article-title>TerraMind: Large-Scale Generative Multimodality for Earth Observation</article-title>
          , arXiv preprint arXiv:2504.11171 (
          <year>2025</year>
          ). URL: http://arxiv.org/abs/2504.11171. doi:10.48550/arXiv.2504.11171.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Scaling laws for neural language models</article-title>
          , arXiv preprint arXiv:2001.08361 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cossu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lomonaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <article-title>Continual pre-training mitigates forgetting in language and vision</article-title>
          ,
          <source>Neural Networks</source>
          <volume>179</volume>
          (
          <year>2024</year>
          )
          <fpage>106492</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. G. T.</given-names>
            <surname>da Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Alameda-Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ricci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alahari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <article-title>Self-supervised models are continual learners</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>9621</fpage>
          -
          <lpage>9630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gomez-Villa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Twardowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Bagdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>van de Weijer</surname>
          </string-name>
          ,
          <article-title>Continually learning self-supervised representations with projected functional regularization</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3867</fpage>
          -
          <lpage>3877</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Horng</surname>
          </string-name>
          ,
          <article-title>Multivariate time series forecasting via attention-based encoder-decoder framework</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>388</volume>
          (
          <year>2020</year>
          )
          <fpage>269</fpage>
          -
          <lpage>279</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>