<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bridging Streaming Continual Learning via In-Context Large Tabular Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Afonso Lourenço</string-name>
          <email>fonso@isep.ipp.pt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>João Gama</string-name>
          <email>jgama@fep.up.pt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric P. Xing</string-name>
          <email>epxing@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Goreti Marreiros</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Concept N</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Concept history</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>GECAD, ISEP, Polytechnic of Porto</institution>
          ,
          <addr-line>Rua Dr. António Bernardino de Almeida, Porto, 4249-015</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>INESC-TEC, FEP, University of Porto</institution>
          ,
          <addr-line>Rua Dr. Roberto Frias, Porto, 4200-465</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Mohamed bin Zayed University of Artificial Intelligence</institution>
          ,
          <addr-line>Abu Dhabi, UAE</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Sketch &amp; reduce</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>In streaming scenarios, models must learn continuously, adapting to concept drifts without erasing previously acquired knowledge. However, existing research communities address these challenges in isolation. Continual Learning (CL) focuses on long-term retention and mitigating catastrophic forgetting, often without strict real-time constraints. Stream Learning (SL) emphasizes rapid, efficient adaptation to high-frequency data streams, but typically neglects forgetting. Recent efforts have tried to combine these paradigms, yet no clear algorithmic overlap exists. We argue that large in-context tabular models (LTMs) provide a natural bridge for Streaming Continual Learning (SCL). In our view, unbounded streams should be summarized on-the-fly into compact sketches that can be consumed by LTMs. This recovers the classical SL motivation of compressing massive streams with fixed-size guarantees, while simultaneously aligning with the experience-replay desiderata of CL. To clarify this bridge, we show how the SL and CL communities implicitly adopt a divide-and-conquer strategy to manage the tension between plasticity (performing well on the current distribution) and stability (retaining past knowledge), while also imposing a minimal complexity constraint that motivates diversification (avoiding redundancy in what is stored) and retrieval (re-prioritizing past information when needed). Within this perspective, we propose structuring SCL with LTMs around two core principles of data selection for in-context learning: (1) distribution matching, which balances plasticity and stability, and (2) distribution compression, which controls memory size through diversification and retrieval mechanisms.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept drift</kwd>
        <kwd>data stream mining</kwd>
        <kwd>foundational model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <sec id="sec-1-5">
        <sec id="sec-1-5-1">
          <title>1. Introduction</title>
          <p>
            For tabular stream learning (SL), ensembles of incremental decision trees (IDTs) have long been
state-of-the-art [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. They use statistical bounds to decide node splits and handle concept drift via subtree
replacement (Fig. 1a). As shallow learners, IDTs converge quickly online due to their few trainable
parameters. Yet, their learning capacity is limited by single-view features, plasticity loss from making
locally optimal splits, catastrophic forgetting of class-conditional estimators, and the inability to model
dependencies. Although various ad-hoc solutions have been proposed, mostly adding new candidate
components for ensembling, they tend to be narrow, addressing one problem while assuming others are
controlled. For example, ensembles may use drift detectors to swap to a more suitable model (Fig. 1b),
but fail to evaluate stored models for relevance if the drift does not trigger an alarm [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
          <p>
            Indeed, these ad-hoc intelligence components are only useful if we integrate them into a truly
autonomous system. In this regard, continual learning (CL) allows more powerful deep learning (DL)
schemas, adapting both through parameter addition and activation (Fig. 2). However, these struggle
with tabular streams [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. On one hand, the inductive biases of DL architectures assume structures which
offer little advantage for irregular patterns typical of tabular data [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. On the other hand, entangled
architectures converge slowly due to stochastic updates, and unfixed weights make prior knowledge
prone to being overwritten. Moreover, this plasticity does not guarantee learning efficiency, often
requiring multiple data passes to reduce interference [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ].
          </p>
          <p>
            Due to these algorithmic differences, SL and CL prioritize different aspects of stateful learning [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. CL
focuses on long-term retention and mitigating forgetting, often without strict real-time constraints [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ],
whereas SL emphasizes rapid adaptation to high-frequency streams but typically ignores high-order
dependencies and forgetting [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. Recent efforts have tried to combine these paradigms [
            <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
            ], yet
no clear overlap exists. To address this, we propose to leverage the disruptive success of in-context
large tabular models (LTMs) [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] as the unifying bridge for streaming continual learning (SCL): using
on-the-fly techniques to summarize unbounded data streams before feeding them to LTMs.
          </p>
          <p>
            In prior work [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], we showed that augmenting the TabPFN transformer [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] with a simple
inference-time sketching mechanism consistently outperforms state-of-the-art methods such as Adaptive Random
Forest [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] and Streaming Random Patches [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], on standard streaming benchmarks: NOAA,
SmartMeter, Electricity, Rialto, Posture, CoverType, and PokerHand. Here, we explain how this data-centric
view unifies the strengths of SL and CL: recovering the classical SL goal of compressing massive streams
into compact sketches whose size and computational cost remain bounded [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], while simultaneously
aligning with CL’s experience-replay desiderata for retaining past concepts [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. Our argument unfolds
in two steps:
• Extracting the shared desiderata of SL and CL. Both communities follow a divide-and-conquer
strategy driven by the underlying tension between plasticity (adapting to the current distribution)
and stability (retaining past knowledge). Under a minimal memory constraint, this tension gives
rise to two operational requirements: diversification, to avoid redundant or overlapping stored
information, and retrieval, to re-activate relevant past experience when needed.
• Mapping these desiderata to in-context stream mining with LTMs. We show that SCL
can be framed as selecting and organizing data for in-context learning, where distribution
matching governs the plasticity–stability balance, and distribution compression governs
memory efficiency through diversification and retrieval.
          </p>
        </sec>
        <sec id="sec-1-5-2">
          <title>2. Current SL and CL state-of-the-art</title>
          <p>
            Both SL and CL ultimately seek to maintain useful knowledge over time while adapting to new data.
However, when learning occurs under streaming, non-stationary conditions, this requires a
divide-and-conquer strategy. As new concepts accumulate, the feasible parameter space becomes progressively
restricted, making it increasingly difficult to adjust the model without interfering with previously
acquired knowledge (Fig. 3a). In the limit, finding parameter configurations that jointly satisfy all
concepts is NP-hard [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. Thus, the challenge is not merely retaining information, but doing so while
preserving efficient access and reuse. A central lever in controlling interference is the degree of parameter
sharing. Full sharing maximizes generalization but risks interference; no sharing avoids interference
but scales poorly with the number of concepts [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. A middle ground is modular compositionality,
where knowledge is distributed across components that can be selectively re-used, enabling forward
and backward transfer (Fig. 3b).
          </p>
          <p>(Fig. 3: (a) incrementally constrained solution space; (b) modularity.)</p>
          <p>
            For tabular data streams, ensembles of IDTs are widely used and highly effective [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. A single IDT
can adapt online and converge to near-optimal splits using one-pass summary statistics and statistically
grounded decision criteria, such as information gain tested with Hoeffding-style bounds [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. However,
predictive performance emerges primarily at the ensemble level. When concept drift occurs, trees
respond differently depending on their local minima and data histories. This natural diversity can be
leveraged: outdated or underperforming trees can be replaced, new trees introduced, and ensemble
votes re-weighted based on recent performance. Research in this area has therefore progressed along
two complementary axes. The first concerns the base learners themselves, balancing their ability to
adapt to new data (plasticity) while retaining useful structure from the past (stability). The second
focuses on ensemble management, ensuring that the collection of learners remains diverse enough to
cover different regions of the input space (diversification) and that previously useful models can be
reactivated when similar conditions reappear (retrieval).
2.1. Stability
Despite their ability to continually store incrementally arriving data, IDTs are often biased toward
recently observed distributions when new classes appear. Under strong temporal imbalance, where
older classes do not reappear, performance on previously learned concepts deteriorates, resulting in
catastrophic forgetting. This problem is amplified after splits, where conditional estimators are reset:
classes absent from the current stream lose representation, are excluded from entropy-based split
decisions, cannot be used in Naive Bayes classification, and have their priors removed at deeper nodes.
Forgetting in IDTs occurs through three mechanisms: (1) exclusion of older classes from split evaluation,
(2) failure of conditional classification due to missing estimates, and (3) disappearance of class priors in
new branches (Fig. 4) [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. Stability-oriented approaches mitigate this by preserving class information
during updates. One strategy propagates class-conditional attribute estimators and maintains class
priors in entropy and Bayesian computations [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. Another relies on short-term memory replay to
preserve representation continuity, e.g., through oversampling [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ], or per-class balanced queues [22].
          </p>
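          <p>To make this split test concrete, the sketch below (plain Python, with illustrative parameter values) computes the Hoeffding bound eps = sqrt(R^2 ln(1/delta) / (2n)) and uses it to decide whether the best candidate attribute at a leaf is reliably better than the runner-up. It is a minimal sketch of the generic test, not the implementation of any particular IDT library.</p>
          <preformat>
import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the true mean of a variable with range R lies
    # within eps = sqrt(R^2 * ln(1/delta) / (2 * n)) of its empirical mean.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, n, value_range=1.0, delta=1e-7, tie_threshold=0.05):
    """Decide whether a leaf has seen enough examples to split on its best attribute."""
    eps = hoeffding_bound(value_range, delta, n)
    if gain_best - gain_second > eps:
        return True   # the best attribute is reliably better than the runner-up
    if tie_threshold > eps:
        return True   # the two candidates are practically tied; split anyway
    return False      # keep accumulating sufficient statistics at this leaf

# Example: after 1000 examples the best candidate leads by 0.12 bits of information gain.
print(should_split(gain_best=0.45, gain_second=0.33, n=1000))  # True
</preformat>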
          <p>(Fig. 4: per-node class priors and conditional estimators for classes C0, C1, C2 along the stream.)</p>
          <p>Neural networks allow more expressive stability mechanisms. A common strategy is to approximate
joint optimization over past tasks by penalizing updates to parameters deemed important for previous
ones (Fig. 5). EWC [23] measures importance via the Fisher Information Matrix, MAS [24] via gradient
or Hebbian activity with constant memory, and SI [25] accumulates importance online from loss
reductions. RWalk [26] unifies these by computing importance in the Fisher-induced Riemannian
parameter space. Since all of these operate incrementally in weight space, they extend naturally to
streaming settings [27]. However, such methods typically exhibit temporary forgetting because SGD
must pass through regions of high loss on old tasks to reach the regularized joint optimum [28]. This
stability gap motivates modifying not only the objective, but also the optimization path [29]. To
address this, gradient projection methods enforce updates orthogonal to gradient subspaces of previous
tasks [30, 31]. However, while this reduces forgetting, strict orthogonality can overly limit knowledge
transfer. Recent work relaxes these constraints, enabling controlled sharing, e.g., NCL [32] re-scales
gradients using a Kronecker-factored Fisher approximation and combines projection with parameter
regularization.</p>
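          <p>The shared structure of these penalties can be written as a single quadratic term, lam/2 * sum_i F_i (theta_i - theta_old_i)^2, where F_i is a per-parameter importance estimate (the diagonal Fisher information in EWC, gradient-based importance in MAS, path-integral importance in SI). The NumPy sketch below is a minimal, hypothetical rendering of that term, not the exact objective of any one method.</p>
          <preformat>
import numpy as np

def importance_penalty(theta, theta_old, importance, lam=1.0):
    # EWC-style penalty: lam/2 * sum_i F_i * (theta_i - theta_old_i)^2
    diff = theta - theta_old
    return 0.5 * lam * np.sum(importance * diff ** 2)

def regularized_loss(task_loss, theta, theta_old, importance, lam=1.0):
    # Fit the new task while penalizing movement of parameters that mattered before.
    return task_loss + importance_penalty(theta, theta_old, importance, lam)

# Toy example with three parameters; the first one is deemed important for the old task.
theta_old = np.array([0.8, -0.2, 0.1])
theta_new = np.array([0.3, -0.1, 0.4])
fisher_diag = np.array([5.0, 0.1, 0.1])
print(regularized_loss(0.42, theta_new, theta_old, fisher_diag, lam=1.0))
</preformat>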
          <p>
2.2. Plasticity
In streaming settings, models may lose not only past knowledge but also the capacity to learn new
concepts. This occurs when parameters drift into regions of the loss landscape where optimization
becomes slow, leading to reduced plasticity [33]. Under this view, while plasticity is often framed as
freeing capacity in IDTs, e.g., via change detectors and post-pruning [34, 35], a more precise view lies in
how well the current parameters serve as a starting point for further learning, independent of how much
knowledge they store. In this regard, IDTs inherently exhibit low plasticity due to their conservative,
history-dependent structure. Because they grow by making locally optimal splits (e.g., via thresholds,
grace periods, tie-breaking rules), their ability to revise earlier decisions is limited. However, recent
work has challenged this rigidity. PLASTIC [36] introduces a restructuring mechanism that allows
an IDT to revisit and modify pruned subtrees (Fig. 6), exploiting the fact that a tree’s structure can
change without altering its predictive semantics. Similarly, DFDT [37] proposes reordering and pruning
operations to promote informative attributes toward the root, enabling adaptation in trapezoidal data
streams. DCFHT [38] extends this to capricious data streams.</p>
          <p>(Fig. 6: (a) disconnect subtree; (b) desired splits at branches; (c) move splits to the root; (d) iteratively rebuild the subtree.)</p>
          <p>
            Conversely, in neural networks, reduced plasticity often manifests as growing weight magnitudes
and saturated units. Regularization-based stability methods such as EWC, MAS, and SI encourage
parameters toward zero, but can unintentionally collapse weight matrix ranks and hinder adaptation.
To retain plasticity, one can instead regularize toward initialization (preserving “how to learn”) [39] or
toward curvature-preserving parameter distributions, for example via Wasserstein-based order-statistic
regularization [40]. Another direction focuses on reintroducing flexibility: S&amp;P [41] combines weight
decay with stochastic perturbations to restore movement in parameter space, though at the cost of
increased forgetting. ReDo [42] improves stability by resetting only saturated units, yet still struggles
with signal propagation issues [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Continual Backprop [43] makes this selective reset more principled
by tracking utility scores over incoming and outgoing weights and protecting newly reset units until
they mature. UPGD generalizes this by coupling gradient updates with adaptive perturbations, applying
minimal changes to useful units and stronger rejuvenation to dormant ones [27].
2.3. Diversification
Under tight memory constraints, a divide-and-conquer strategy naturally calls for diversification,
ensuring that models store complementary rather than redundant information (Fig. 7a). Even in
cases where the data is relatively simple, one can explore different perspectives of the same patterns
within a given computational budget (Fig. 7b). This can be imposed through hard boundaries, such
as explicit output specialization [44]. More commonly, however, streaming ensemble methods rely on
softer mechanisms that perturb the input or feature space: horizontally, e.g. through Poisson-based
instance weighting [45], or selective instance filtering [46]; and vertically, e.g. through random subspace
selection [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. Beyond implicit diversification, several approaches explicitly manage the ensemble
repository using diversity metrics such as double-fault [47], or the kappa statistic [48]. At a higher level,
heterogeneous ensembles can be maintained via local search heuristics [49], evolutionary strategies
[50], and meta-learning methods [51].
          </p>
          <p>(Fig. 7: diversification (a) in a complex space and (b) in a simpler space.)</p>
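          <p>The horizontal perturbation mentioned above is often realized with the classical online-bagging trick of Oza and Russell [45]: every incoming example is shown to each ensemble member k times, with k drawn from a Poisson(1) distribution, so members see slightly different virtual resamples of the same stream. The sketch below is a minimal illustration against a hypothetical incremental-learner interface (learn_one/predict_one), not the code of any particular library.</p>
          <preformat>
import random
from collections import Counter

class OnlineBaggingEnsemble:
    """Diversify ensemble members via Poisson(1) instance weighting (online bagging)."""

    def __init__(self, base_learner_factory, n_members=10, seed=42):
        self.members = [base_learner_factory() for _ in range(n_members)]
        self.rng = random.Random(seed)

    def learn_one(self, x, y):
        for member in self.members:
            k = self._poisson_one()        # how many times this member sees the example
            for _ in range(k):
                member.learn_one(x, y)     # hypothetical incremental-learner interface

    def predict_one(self, x):
        votes = Counter(member.predict_one(x) for member in self.members)
        return votes.most_common(1)[0][0]

    def _poisson_one(self):
        # Knuth's method for sampling Poisson(lambda=1) without external dependencies.
        k, p = 0, 1.0
        threshold = 2.718281828459045 ** -1.0   # exp(-1)
        while True:
            p *= self.rng.random()
            if threshold > p:
                return k
            k += 1
</preformat>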
          <p>
            In contrast, neural networks typically achieve diversification within a single model by controlling
how representations are shared or separated. One approach is to learn domain-invariant features by
sharing parameters across layers or models [52]. Another method allocates distinct feature-processing
pathways while minimizing discrepancies between them, avoiding explicit mapping across domains
[53]. Distillation-based methods support this by transferring feature-level knowledge from a classifier
trained on past labels to one trained on new labels, enabling the model to adapt without retraining from
scratch [54].
2.4. Retrieval
Balancing stability and plasticity in a shared representation is NP-hard [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. Instead, a model must
quickly recall past knowledge and decide which modules to update within a compositional framework
[55, 56]. In SL, this is achieved in several ways. Neighborhood-based dynamic selection identifies
supervised models with high competence in the local region around the query [57]. Referee
metamodels detect recurrence without statistical comparisons across all stored models [58]. Sequence
mining meta-models capture patterns in stored models, relying on expectations of transitions in their
competence [59, 60]. Repository matching approaches reuse learned models when similar contexts
reemerge, following drift detection (Fig. 1b) [61]. Finally, hybrid methods use Bayesian inference to
compute posterior probabilities for all candidate states: likelihoods estimated with a weighted cosine
distance of the incoming data window to a meta-representation of each model, and priors estimated
via transition matrices informed by drift detectors (Fig. 8) [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. In unsupervised scenarios, novelty
detection methods rely on clustering structures to find cohesive agglomerations of anomalies [62, 63].
          </p>
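          <p>The hybrid retrieval scheme sketched in Fig. 8 can be summarized as a standard Bayesian update: each stored model keeps a meta-representation of the data it was trained on, the likelihood of the incoming window under each model is scored by cosine similarity to that meta-representation, and a drift-informed transition matrix supplies the prior over which concept follows the currently active one. The NumPy code below is a simplified, hypothetical rendering of that idea, not the exact formulation of [2].</p>
          <preformat>
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieval_posterior(window, meta_reps, transition, current_state):
    """Posterior over stored models: prior from a transition matrix, likelihood from similarity."""
    window_rep = window.mean(axis=0)                 # meta-representation of the incoming window
    sims = np.array([cosine_similarity(window_rep, m) for m in meta_reps])
    likelihood = np.exp(sims) / np.exp(sims).sum()   # softmax over similarities
    prior = transition[current_state]                # P(next concept | current concept)
    posterior = likelihood * prior
    return posterior / posterior.sum()

# Toy example: two stored concepts; the stream currently runs under concept 0.
meta_reps = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
transition = np.array([[0.9, 0.1],                   # concepts tend to persist
                       [0.1, 0.9]])
window = np.array([[0.1, 0.9], [0.0, 1.1], [0.2, 0.8]])
print(retrieval_posterior(window, meta_reps, transition, current_state=0))
</preformat>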
          <p>In neural networks, retrieval can be facilitated through end-to-end optimization. A common approach
involves using multiple expert branches that are selectively activated by a gating mechanism, with
their outputs integrated via a data-dependent weighting scheme (Fig. 9b) [64]. Additionally,
routing-based techniques allow for greater specialization by decomposing sub-concepts into sequential or
parallel processing stages, enabling modules to be flexibly reused across different contexts (Fig. 9a)
[65, 66]. In challenging unsupervised streaming scenarios, retrieval often relies on identifying latent
concept boundaries. This can be achieved through methods such as cross-concept class discrimination
[67], adversarial one-class classification [68], and reverse distillation from one-class embeddings [69].
Furthermore, cross-concept identification can be explicitly modeled using separate networks [70],
learned binary masks [71], or various out-of-distribution detection approaches [72].
          </p>
          <p>(Fig. 9: (a) modular routing, composing modules as y = g3(h2(p1(x))); (b) mixture-of-experts, combining expert outputs f1, f2, f3 through a gating network.)</p>
        </sec>
        <sec id="sec-1-5-3">
          <title>3. In-context stream mining for SCL</title>
          <p>
            Traditional SL and CL methods follow a two-stage process: optimizing models over a sliding window,
followed by selecting the best configuration for the next window. Foundational models (FMs), however,
enable instant model deployment, leveraging prior learning to bypass the need for extensive tuning
[73]. Through pretraining on vast corpora, FMs acquire soft inductive biases, drawing from a wealth
of prior experiences. This results in emergent abilities, such as few-shot in-context learning (ICL),
which allows models to perform new tasks during inference by conditioning on a set of input-output
examples, without requiring parameter updates. Consequently, this ICL capability of FMs has spurred
a new research paradigm focused on designing architectures that are pre-trained on a wide range of
synthetic tabular datasets, referred to as large tabular models (LTMs) [74]. Unlike traditional models,
LTMs perform instant classification without fine-tuning [
            <xref ref-type="bibr" rid="ref11">11, 75</xref>
            ]. They adapt to unseen datasets in a
single forward pass by using various training examples as context, similarly to how large language
models (LLMs) use preceding tokens. Practically, an LTM is a transformer model (Fig. 10), trained on
data simulated with Bayesian neural networks or structural causal models [76], inductive biases from
decision trees [75], or DAG-based computational graphs [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], with the training set size acting as a
regularizer on the network’s expected complexity.
          </p>
          <p>(Fig. 10: LTM transformer architecture: N stacked blocks of masked multi-head attention and feed-forward layers with add &amp; norm, followed by a linear layer, softmax, and a cross-entropy loss.)</p>
          <p>
            Building on these developments, a new paradigm for SL and CL centers on stream-level context
construction: summarizing unbounded data streams on-the-fly before providing them to LTMs [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
Modern LTMs can process extremely large contexts, exceeding 500K samples [75] and 50K features
[77], which provides unprecedented flexibility in how sketches of streaming data can be represented.
Importantly, this data-centric perspective still allows us to leverage the aforementioned core insights
from model-centric SL and CL research concerning stability, plasticity, diversification, and retrieval.
In fact, these factors can now be controlled directly through context design rather than complex
architectural or optimization interventions, enabling simpler and more explicit trade-offs.
          </p>
          <p>Drawing from the literature on how FMs apply data selection strategies for pre-training,
instruction-tuning, alignment, and in-context learning [78], we identify two complementary axes for organizing
context design. The first concerns distribution matching: selecting data similar to the target, yielding
plasticity when emphasizing the current distribution, and stability when maintaining support across
prior distributions. The second concerns distribution compression: reducing redundancy while
maintaining representational power, which supports diversification when filling memory with
non-overlapping representative samples, and retrieval when dynamically constructing a task-specific
context from a larger pool.
3.1. Distribution matching
To better understand the goal of distribution matching, one can adopt a frequentist perspective (Fig.
11a) [79]. From a variance standpoint, an LTM, pre-tuned but untrained, with many hyperparameters
and multi-head attention, is highly sensitive to individual context samples, which increases its ability to
select effective submodels and reduces predictor variance. From a bias standpoint, hyperparameters
are optimized for the prior task distribution. If the prior is broad and not overly concentrated away
from the true hypothesis, the posterior predictive distribution closely approximates the true predictive
distribution. Consequently, the LTM’s ability to learn at inference depends on its structural properties,
with the optimal approximation characterized by a Kullback-Leibler criterion [79]. Intuitively, adding
more context samples reduces sensitivity to minor input perturbations, lowering variance, while bias
persists unless the context is concentrated near the target distribution.</p>
          <p>(Fig. 11: (a) in-context learning from a frequentist viewpoint, trading low variance against high bias through distribution matching; (b) dual-memory FIFO system, with long-term and short-term memories feeding the large tabular model.)</p>
          <p>With this perspective, when the data distribution shifts over time, the context may no longer reflect
the current environment, leading the model to produce biased predictions based on outdated patterns.
This requires a design trade-off:
• Plasticity: prioritizing recent examples to adapt quickly to new local patterns, at the risk of
losing information about past classes and concepts.
• Stability: retaining examples from earlier concepts to maintain a global understanding of all
classes, including those observed in the distant past.</p>
          <p>
            A simple yet effective solution is a dual-memory FIFO system (Fig. 11b) [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. The long-term memory
stores a fixed set of older samples across all known classes, preserving rare or infrequently seen
categories. In contrast, the short-term memory maintains the most recent portion of the stream,
capturing local variations, transient sub-concepts, and evolving intra-class dynamics. By combining
these two memories, the model achieves a balance between long-term stability and rapid adaptability
to short-term fluctuations in the data distribution.
          </p>
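          <p>A minimal sketch of this dual-memory design is shown below, pairing two FIFO buffers with an in-context tabular classifier. It assumes the publicly released TabPFNClassifier fit/predict interface as the LTM; the buffer sizes, the per-class long-term quota, and the update policy are illustrative choices rather than the exact configuration used in [12].</p>
          <preformat>
from collections import deque
import numpy as np
from tabpfn import TabPFNClassifier   # in-context large tabular model (LTM)

class DualMemorySCL:
    """Dual-memory FIFO sketch: a short-term buffer tracks the recent stream,
    while per-class long-term buffers preserve older samples of every known class."""

    def __init__(self, short_size=256, long_per_class=32):
        self.short = deque(maxlen=short_size)        # illustrative sizes
        self.long = {}                               # class label -> FIFO of (x, y) pairs
        self.long_per_class = long_per_class
        self.model = TabPFNClassifier()

    def update(self, x, y):
        self.short.append((x, y))
        bucket = self.long.setdefault(y, deque(maxlen=self.long_per_class))
        bucket.append((x, y))

    def _context(self):
        long_items = [pair for bucket in self.long.values() for pair in bucket]
        items = long_items + list(self.short)
        X = np.array([xi for xi, _ in items])
        y = np.array([yi for _, yi in items])
        return X, y

    def predict(self, x_query):
        X_ctx, y_ctx = self._context()
        # The context effectively is the model: no weight update, one forward pass.
        self.model.fit(X_ctx, y_ctx)
        return self.model.predict(np.array([x_query]))[0]
</preformat>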
          <p>However, this approach is naive. Plasticity is largely reactive, implemented via fading strategies rather
than proactive adaptation, while stability only addresses catastrophic forgetting of classes, without
ensuring invariant representations across all concepts. Addressing this requires diversification and
retrieval principles, adopting inductive biases to selectively match data distributions, such as:
smoothness, where nearby points in high-density regions are assumed to produce similar outputs; clustering,
where points in the same vicinity likely share a concept; and manifold, where high-dimensional data
lies on shared low-dimensional latent representations. Notably, because in-context stream mining
is formulated as a data selection problem rather than explicit model design, these inductive biases
naturally align with the goal of distribution compression, i.e., promoting efficient and representative
context construction.
3.2. Distribution compression
Distribution compression aims to prioritize heterogeneity and remove redundancies to reduce dataset
size. Different communities approach this goal differently: SL uses synopses and sketches to summarize
unbounded streams [80]; CL relies on experience replay to retain past concepts [81]; and FMs apply
data selection strategies across pre-training, instruction-tuning, alignment, in-context learning, and
fine-tuning [78]. Despite these differences, compression can generally be framed as a two-stage process:
• Diversification: populating and updating memory with representative, meaningful samples,
through informed addition and deletion.
• (Optional) Retrieval: distinguishing between memory population (which samples to store) and
sampling (which points to use for in-context learning).</p>
          <p>(Figure: distribution matching, diversification, and retrieval along the stream over time.)</p>
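          <p>The diversification stage can be grounded in the classical stream-summarization toolbox mentioned above: the sketch below populates a fixed-size memory with reservoir sampling maintained per class, so that each class's buffer remains an unbiased, bounded sample of everything seen for that class. The per-class budget and the uniform eviction rule are illustrative assumptions, not a prescription from the cited work.</p>
          <preformat>
import random

class ClassBalancedReservoir:
    """Fixed-size memory population: one reservoir-sampled buffer per class."""

    def __init__(self, per_class_budget=64, seed=0):
        self.budget = per_class_budget
        self.rng = random.Random(seed)
        self.buffers = {}   # class label -> list of stored samples
        self.seen = {}      # class label -> number of samples observed so far

    def add(self, x, y):
        buf = self.buffers.setdefault(y, [])
        self.seen[y] = self.seen.get(y, 0) + 1
        if self.budget > len(buf):
            buf.append(x)                      # still room: always store
        else:
            # Reservoir sampling keeps each of the n observed samples with
            # equal probability budget / n.
            j = self.rng.randrange(self.seen[y])
            if self.budget > j:
                buf[j] = x                     # evict a uniformly chosen old sample

    def memory(self):
        return [(x, y) for y, buf in self.buffers.items() for x in buf]
</preformat>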
          <p>Distribution compression is inherently linked to distribution matching, shaping how the model
balances stability and plasticity when selecting context examples. However, the relationship between
diversification and these objectives is subtler than it appears: although diversification is often associated
with enhanced stability, it does not necessarily compromise plasticity. For example, selecting examples
solely based on similarity to the query in embedding space [82] can lead to redundancy and omit less
similar yet informative concepts that support contrastive learning [83]. In contrast, true distribution
matching aims to maximize feature coverage [84], select examples according to difficulty [85], and
consider sample sensitivity [86]. Empirical evidence supports this (Fig. 13): sequential methods
that explicitly balance similarity to the query with diversity among selected examples consistently
outperform naive strategies, such as choosing the most similar examples, or selecting similar examples
from a diversity-reduced subset [83].</p>
          <p>(Fig. 13: context selection strategies: choosing similar examples only; diversifying first and then selecting similar examples; and jointly optimizing similarity and diversity.)</p>
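          <p>A common way to realize this joint optimization is a greedy, maximal-marginal-relevance style selection: each candidate is scored by its similarity to the query minus its maximal similarity to examples already chosen, so the context stays relevant yet non-redundant. The sketch below is a generic NumPy illustration; the trade-off weight and the cosine similarity are assumptions, not the exact procedure evaluated in [83].</p>
          <preformat>
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_context(query, candidates, k=4, trade_off=0.7):
    """Greedy similarity-diversity selection for in-context examples."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and k > len(selected):
        best_idx, best_score = None, -np.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            score = trade_off * relevance - (1.0 - trade_off) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

# Toy example: pick 2 of 4 stored samples for a query in a 2-D feature space.
query = np.array([1.0, 0.0])
pool = [np.array([0.9, 0.1]), np.array([0.95, 0.05]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
print(select_context(query, pool, k=2))
</preformat>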
          <p>Viewed from this perspective, diversification emerges as a mechanism that can simultaneously
enhance both plasticity and stability. The benefits of diversification, however, are limited. Its effectiveness
is constrained by the inherent difficulty of learning incremental concepts within a finite parameter
space. In this context, retrieval serves as a complementary strategy: by separating points for new
knowledge from those revisiting prior knowledge, retrieval allows for a divide-and-conquer approach
[87]. Importantly, retrieval itself can be understood as a higher-level application of diversification.
While diversification maintains representative and informative examples in memory to support both
plasticity and stability, retrieval applies the same principle to select query-specific context subsets,
reducing interference between old and new concepts.</p>
          <p>With this unified perspective, it becomes clear that heuristic or score-based methods benefiting
diversification also enhance retrieval; they simply operate at different stages of the learning process.
Critically, any such process must account for the fact that data points vary in potential: some are
more representative or informative than others. Classical streaming data selection methods address
this by framing the problem as online clustering, leveraging synopsis techniques such as histograms,
wavelets, or sketches to construct geometric and statistical descriptors of the data [80]. Concept drift is
typically detected by tracking assignment errors [88], or comparing recent and reference data within
clusters using tests like the univariate k-sample Anderson-Darling for each principal component of each
centroid [89]. However, these approaches are largely unsupervised, whereas our focus is on predictive
performance. Predictive strategies may instead maintain a dynamic set of short-term and long-term
prototypes based on error-driven representativeness learning and constrained clustering inspired by
synchronization [90]. In this vein, numerous data selection methods have emerged across the fields of
active learning [91], continual learning [92, 93], and training dynamics [94]. These methods generally
assign each instance a scalar score, either reflecting informativeness or representativeness (Fig. 14) [95].</p>
          <p>(Fig. 14: ground truth versus informative and representative sample selections.)</p>
          <p>On one hand, informativeness measures how critical a sample is for learning, often favoring hard
or uncertain instances. In active learning, this includes points where the model exhibits low confidence
[96], high reconstruction errors [97] or high disagreement [98]. In continual learning, analogous
metrics are employed, but with the implicit goal of mitigating catastrophic forgetting. Methods include
herding selection [99, 93], which samples based on distance to the class mean; discriminative sampling
[100], which targets decision boundary points; and entropy-based sampling [26], which selects
high-uncertainty predictions. In training dynamics, sample quality is assessed via learnability heuristics such
as forgetting frequency [92] or ease of learning [101]. Alternatively, separate scoring models [102, 103]
or reinforcement learning strategies [104, 105] can optimize sample selection.</p>
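          <p>Two of the scores above have particularly simple closed forms: prediction entropy (informativeness: prefer uncertain samples) and a herding-style distance to the running class mean (representativeness: prefer prototypical samples). The sketch below computes both, assuming the model exposes a probability vector for each sample; it is illustrative and not tied to any single cited method.</p>
          <preformat>
import numpy as np

def entropy_score(proba):
    # Higher entropy = more uncertain prediction = more informative sample.
    proba = np.clip(np.asarray(proba, dtype=float), 1e-12, 1.0)
    return float(-np.sum(proba * np.log(proba)))

def herding_score(x, class_mean):
    # Smaller distance to the class mean = more prototypical of its class,
    # which is the criterion herding-based replay memories favor.
    return float(np.linalg.norm(np.asarray(x) - np.asarray(class_mean)))

# An ambiguous prediction is more informative than a confident one.
print(entropy_score([0.5, 0.5]), entropy_score([0.99, 0.01]))
# Distance of a sample to its class prototype.
print(herding_score([1.0, 2.0], class_mean=[0.8, 1.9]))
</preformat>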
          <p>On the other hand, representativeness captures how well a subset reflects the overall data
distribution, complementing informativeness. In active learning, these help balance exploration-exploitation
trade-offs, capturing the structure of the raw data [106, 107] or the embedding space [108]. In contrast,
continual learning faces the challenge of maintaining memory samples, where similarity is often
balanced with diversity [109, 110, 111], or informativeness with representativeness [112, 84].
Examples include CoPE [113], which maintains class prototypes in a shared latent space while minimizing
intra-class variance and maximizing inter-class separation, and core-set methods such as
cardinality-constrained bilevel optimization [114, 115, 116]. However, unlike standard core-set selection, in-context
stream mining does not involve parameter updates.</p>
          <p>
            All these principles naturally extend to in-context stream mining, with one key distinction: in LTMs,
the memory population effectively is the model, since it directly defines the implicit decision boundaries
that the LTM can express [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. In contrast, in experience replay-based CL, the memory serves primarily
as a support mechanism for a separate parametric model, where samples are selectively replayed to
prevent forgetting, typically chosen using criteria such as max. loss, or min. margin confidence [117].
          </p>
          <p>From this perspective, the statistical uncertainty of in-context stream mining is primarily determined
by which observations are missing or underrepresented. While conventional SL emphasizes accumulated
prequential model uncertainty, in-context stream mining focuses on how the current query shifts the
hypothesis space in response to a context-query pairing. This is analogous to transductive reasoning,
where predictions are made on a closed set of instances without constructing a general model.</p>
          <p>Thus, while heuristic or score-based methods remain useful for evaluating individual samples,
in-context stream learning benefits from assessing the joint influence of sample subsets. Inclusion or
removal of combinations should be evaluated by their collective impact on learning [118], which
can be operationalized by prompting the LTM to rate samples and observing performance changes,
analogous to a leave-one-out procedure [119]. However, directly computing the performance drop for
every possible instance not only may provide insufficient signal, but also is computationally infeasible,
requiring 2<sup><italic>n</italic></sup> inferences for a dataset of size <italic>n</italic>. Nonetheless, the framework is valuable for studying
how combinations of training examples, rather than individual ones, affect generalization, for example
through inclusion or exclusion of entire prototypical classes [120].</p>
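          <p>Operationally, such group-level scoring amounts to refitting the LTM context without a candidate subset and measuring the drop on a small labelled probe set. The sketch below does this for a dictionary of candidate groups, using a scikit-learn-style fit/predict classifier as a stand-in for the LTM; exhaustively enumerating all subsets remains infeasible, so only a few coarse groups (for example, whole prototypical classes) would be probed in practice.</p>
          <preformat>
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # stand-in for an LTM with fit/predict

def probe_accuracy(model, X_ctx, y_ctx, X_probe, y_probe):
    model.fit(X_ctx, y_ctx)                           # "fitting" = conditioning on the context
    return float(np.mean(model.predict(X_probe) == y_probe))

def group_influence(model, X_ctx, y_ctx, groups, X_probe, y_probe):
    """Influence of each candidate group = accuracy drop when the group is removed."""
    base = probe_accuracy(model, X_ctx, y_ctx, X_probe, y_probe)
    scores = {}
    for name, idx in groups.items():
        keep = np.setdiff1d(np.arange(len(y_ctx)), idx)
        scores[name] = base - probe_accuracy(model, X_ctx[keep], y_ctx[keep], X_probe, y_probe)
    return scores

# Toy example: removing either whole prototypical class should hurt the probe accuracy.
X_ctx = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_ctx = np.array([0, 0, 1, 1])
groups = {"class_0": np.array([0, 1]), "class_1": np.array([2, 3])}
print(group_influence(KNeighborsClassifier(n_neighbors=1), X_ctx, y_ctx, groups, X_ctx, y_ctx))
</preformat>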
          <p>To make this approach tractable in practice, data selection can be framed as an online learning-to-rank
(LTR) problem, which leverages the counterfactual effects of sample inclusion while minimizing LTM
calls [121]. This naturally aligns with reinforcement learning or contextual bandit frameworks, where
the system selects ranked lists of exemplars and receives observed performance rewards [122]. The
challenge is balancing exploration (testing new rankings) with exploitation (using the best-known one).</p>
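          <p>A minimal instance of this exploration-exploitation loop is an epsilon-greedy bandit over candidate context configurations: each configuration is an arm, the reward is the observed prequential performance of the LTM when that configuration is used, and a running mean per arm drives selection. The sketch below is a generic skeleton under assumed names, not a specific online learning-to-rank algorithm from the cited work.</p>
          <preformat>
import random

class EpsilonGreedyContextSelector:
    """Treat each candidate context configuration (ranking of exemplars) as a bandit arm."""

    def __init__(self, n_arms, epsilon=0.1, seed=7):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        if self.epsilon > self.rng.random():
            return self.rng.randrange(len(self.means))                   # explore
        return max(range(len(self.means)), key=lambda a: self.means[a])  # exploit

    def update(self, arm, reward):
        # Incremental mean of observed rewards, e.g. accuracy on the next labelled batch.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Usage: pick a configuration, evaluate the LTM with it on the next batch, report back.
selector = EpsilonGreedyContextSelector(n_arms=3)
arm = selector.select()
selector.update(arm, reward=0.82)   # observed prequential accuracy for that configuration
</preformat>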
        </sec>
        <sec id="sec-1-5-4">
          <title>4. Conclusions</title>
          <p>LTMs emerge as a promising bridge between the CL and SL communities, warranting further
investigation. While extreme real-time and edge constraints still require smaller and faster LTMs, we
set aside such engineering constraints, which are likely to be addressed by advances in the TinyML
community [123]. Our focus is instead on the SCL context, where the central challenge is not model
size or speed, but orchestrating the dynamic interplay of data arrival, training, recovery, and inference.
As LTMs naturally become more efficient, research should prioritize algorithms that embody the core
principles of streaming continual learning: stability (preserving past knowledge), plasticity (adapting
to new data), diversification (reducing redundancy), and retrieval (enabling faster remembering). To
illustrate the practical realization of these principles, we draw on insights from multiple communities:
from SL, synopses and sketches provide effective stream summarization [80]; from CL, experience
replay preserves past concepts [81]; and from foundational models, data selection strategies enhance
pre-training, fine-tuning, and in-context learning [78].</p>
          <p>Looking further ahead, end-to-end solutions could selectively activate LTM components per instance,
guided by specialized prompter contexts fine-tuned on the most relevant data [124]. This strategy
is already supported by evidence that fine-tuning improves retrieval-based performance [125, 126].
However, applying such strategies in continual learning remains a significant challenge across all
modalities of foundational models [127, 128]. We therefore encourage the community to begin exploring
data selection strategies as lightweight inference-time wrappers, which can guide future developments
in a manner similar to the evolution of prompt engineering in NLP and vision. The authors also plan
to pursue some of these directions and welcome feedback and collaboration from those interested in
contributing to this line of research.</p>
        </sec>
        <sec id="sec-1-5-5">
          <title>Acknowledgments</title>
          <p>This work was funded by FCT under PhD scholarship PRT/BD/154713/2023 and project doi.org/10.54499/UIDP/00760/2020.</p>
        </sec>
        <sec id="sec-1-5-6">
          <title>Declaration on Generative AI</title>
          <p>The authors have not employed any Generative AI tools.</p>
          <p>
data streams, in: 2020 international joint conference on neural networks (IJCNN), IEEE, 2020, pp.
1–8.
[22] K. Malialis, C. G. Panayiotou, M. M. Polycarpou, Online learning with adaptive rebalancing in
nonstationary environments, IEEE transactions on neural networks and learning systems 32
(2020) 4445–4459.
[23] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan,
T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks,
Proceedings of the national academy of sciences 114 (2017) 3521–3526.
[24] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, T. Tuytelaars, Memory aware synapses:
Learning what (not) to forget, in: Proceedings of the European conference on computer vision
(ECCV), 2018, pp. 139–154.
[25] F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, Proceedings of
machine learning research 70 (2017) 3987–3995.
[26] A. Chaudhry, P. K. Dokania, T. Ajanthan, P. H. Torr, Riemannian walk for incremental learning:
Understanding forgetting and intransigence, in: Proceedings of the European Conference on
Computer Vision (ECCV), 2018, pp. 532–547.
[27] M. Elsayed, A. R. Mahmood, Addressing loss of plasticity and catastrophic forgetting in continual
learning, arXiv preprint arXiv:2404.00781 (2024).
[28] M. De Lange, G. van de Ven, T. Tuytelaars, Continual evaluation for lifelong learning: Identifying
the stability gap, arXiv preprint arXiv:2205.13452 (2022).
[29] T. Hess, T. Tuytelaars, G. M. van de Ven, Two complementary perspectives to continual learning:</p>
          <p>Ask not only what to optimize, but also how, arXiv preprint arXiv:2311.04898 (2023).
[30] M. Farajtabar, N. Azizan, A. Mott, A. Li, Orthogonal gradient descent for continual learning, in:</p>
          <p>International conference on artificial intelligence and statistics, PMLR, 2020, pp. 3762–3773.
[31] Y. Guo, W. Hu, D. Zhao, B. Liu, Adaptive orthogonal projection for batch and online continual
learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp.
6783–6791.
[32] T.-C. Kao, K. Jensen, G. van de Ven, A. Bernacchia, G. Hennequin, Natural continual learning:
success is a journey, not (just) a destination, Advances in neural information processing systems
34 (2021) 28067–28079.
[33] R. Pascanu, S. I. Mirzadeh, A study on the plasticity of neural networks, arXiv preprint
arXiv:2106.00042 (2021).
[34] D. Nowak Assis, J. P. Barddal, F. Enembreck, Behavioral insights of adaptive splitting decision
trees in evolving data stream classification, Knowledge and Information Systems (2025) 1–32.
[35] C. Manapragada, G. I. Webb, M. Salehi, Extremely fast decision tree, in: Proceedings of the
24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2018, pp.
1953–1962.
[36] M. Heyden, H. M. Gomes, E. Fouché, B. Pfahringer, K. Böhm, Leveraging plasticity in incremental
decision trees, in: Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, Springer, 2024, pp. 38–54.
[37] C. Schreckenberger, T. Glockner, H. Stuckenschmidt, C. Bartelt, Restructuring of hoeffding trees
for trapezoidal data streams, in: 2020 International Conference on Data Mining Workshops
(ICDMW), IEEE, 2020, pp. 416–423.
[38] R. Zhao, Y. You, J. Sun, J. Gama, J. Jiang, Online learning from drifting capricious data streams
with flexible hoeffding tree, Information Processing &amp; Management 62 (2025) 104221.
[39] S. Kumar, H. Marklund, B. Van Roy, Maintaining plasticity via regenerative regularization, arXiv
preprint arXiv:2308.11958 (2023).
[40] A. Lewandowski, H. Tanaka, D. Schuurmans, M. C. Machado, Directions of curvature as an
explanation for loss of plasticity, arXiv preprint arXiv:2312.00246 (2023).
[41] J. Ash, R. P. Adams, On warm-starting neural network training, in: Advances in Neural</p>
          <p>Information Processing Systems, volume 33, 2020, pp. 3884–3894.
[42] G. Sokar, R. Agarwal, P. S. Castro, U. Evci, The dormant neuron phenomenon in deep
reinforcement learning, in: International Conference on Machine Learning, PMLR, 2023, pp. 32145–32168.
[43] S. Dohare, A. Hernandez-Garcia, A. Lacoste, M. Weiss, Continual backprop: Stochastic gradient
descent with persistent randomness, in: International Conference on Machine Learning, 2021,
pp. 2660–2670.
[44] L. Neves, A. Lourenço, A. Cano, G. Marreiros, Online Hierarchical Partitioning of the Output Space
in Extreme Multi-Label Data Streams, IOS Press, 2025. URL: http://dx.doi.org/10.3233/faia250975.
doi:10.3233/faia250975.
[45] N. C. Oza, S. Russell, Experimental comparisons of online and batch versions of bagging and
boosting, in: Proceedings of the seventh ACM SIGKDD international conference on Knowledge
discovery and data mining, 2001, pp. 359–364.
[46] M. M. Idrees, M. Abulaish, A new combination of diversity techniques in ensemble classifiers for
handling complex concept drift, Applied Soft Computing 96 (2020) 106613.
[47] S. Abadifard, S. Bakhshi, S. Gheibuni, F. Can, Dyned: Dynamic ensemble diversification in data
stream classification, in: Proceedings of the 32nd ACM International Conference on Information
and Knowledge Management, 2023, pp. 3707–3711.
[48] A. Cano, B. Krawczyk, Kappa updated ensemble for drifting data stream mining, Machine</p>
          <p>Learning 109 (2020) 175–218.
[49] B. Veloso, J. Gama, B. Malheiro, Self hyper-parameter tuning for data streams, in: Discovery
Science: 21st International Conference, DS 2018, Limassol, Cyprus, October 29–31, 2018, Proceedings
21, Springer, 2018, pp. 241–255.
[50] A. R. Moya, B. Veloso, J. Gama, S. Ventura, Improving hyper-parameter self-tuning for data
streams by adapting an evolutionary approach, Data Mining and Knowledge Discovery 38 (2024)
1289–1315.
[51] A. L. D. Rossi, A. C. P. de Leon Ferreira, C. Soares, B. F. De Souza, et al., Metastream: A
meta-learning based method for periodic algorithm selection in time-changing data, Neurocomputing
(2014).
[52] P. Wang, N. Jin, D. Davies, W. L. Woo, Model-centric transfer learning framework for concept
drift detection, Knowledge-Based Systems 275 (2023) 110705.
[53] Z.-Y. Zhang, P. Zhao, Y. Jiang, Z.-H. Zhou, Learning with feature and distribution evolvable
streams, in: International Conference on Machine Learning, PMLR, 2020, pp. 11317–11327.
[54] Z. Wang, L. Liu, D. Tao, Deep streaming label learning, in: International Conference on Machine</p>
          <p>Learning, PMLR, 2020, pp. 9963–9972.
[55] F. Tajwar, A. Kumar, S. M. Xie, P. Liang, No true state-of-the-art? ood detection methods are
inconsistent across datasets, arXiv preprint arXiv:2109.05554 (2021).
[56] D. Risca, A. Lourenço, G. Marreiros, Boosting-inspired online learning with transfer for railway
maintenance, arXiv preprint arXiv:2504.08554 (2025).
[57] R. Davtalab, R. M. Cruz, R. Sabourin, A scalable dynamic ensemble selection using fuzzy
hyperboxes, Information Fusion 102 (2024) 102036.
[58] J. Gama, P. Kosina, Recurrent concepts in data streams classification, Knowledge and Information</p>
          <p>Systems 40 (2014) 489–507.
[59] Y. Yang, X. Wu, X. Zhu, Mining in anticipation for concept change: Proactive-reactive prediction
in data streams, Data mining and knowledge discovery 13 (2006) 261–289.
[60] O. Wu, Y. S. Koh, G. Dobbie, T. Lacombe, Probabilistic exact adaptive random forest for recurrent
concepts in data streams, International Journal of Data Science and Analytics (2022) 1–16.
[61] P. M. Gonçalves Jr, R. S. M. de Barros, Rcd: A recurring concept drift framework, Pattern</p>
          <p>Recognition Letters 34 (2013) 1018–1025.
[62] E. R. Faria, J. Gama, A. C. Carvalho, Novelty detection algorithm for data streams multi-class
problems, in: Proceedings of the 28th annual ACM symposium on applied computing, 2013, pp.
795–800.
[63] A. R. Paupério, A. Risca, Diogo Lourenço, G. Marreiros, R. Martins, Explainable anomaly detection
for industrial iot data streams, arXiv preprint arXiv:2512.08885 (2025).
[64] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large
neural networks: The sparsely-gated mixture-of-experts layer, in: International Conference on
Learning Representations, 2017.
[65] C. Rosenbaum, T. Klinger, M. Riemer, Routing networks and the challenges of modular and
compositional computation, arXiv preprint arXiv:1904.12774 (2019).
[66] O. Ostapenko, D. Suris, A. Szabó, T. Mikolov, Attention for compositional modularity, in:
NeurIPS’22 Workshop on All Things Attention: Bridging Different Perspectives on Attention,
2022.
[67] Y. Guo, B. Liu, D. Zhao, Dealing with cross-task class discrimination in online continual learning,
in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023,
pp. 20446–20455.
[68] M. Sabokrou, M. Khalooei, M. Fathy, E. Adeli, Adversarially learned one-class classifier for novelty
detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 3379–3388.
[69] H. Deng, X. Li, Anomaly detection via reverse distillation from one-class embedding, in:
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp.
9737–9746.
[70] J. von Oswald, C. Henning, J. Sacramento, B. F. Grewe, Continual learning with hypernetworks,
arXiv preprint arXiv:1906.00695 (2019).
[71] M. Wortsman, V. Ramanujan, R. Liu, A. Kembhavi, M. Rastegari, J. Yosinski, A. Farhadi,
Supermasks in superposition, in: Advances in Neural Information Processing Systems, 2020, pp.
15173–15184.
[72] G. Kim, C. Xiao, T. Konishi, Z. Ke, B. Liu, A theoretical study on solving continual learning,</p>
          <p>Advances in neural information processing systems 35 (2022) 5065–5079.
[73] T. B. Brown, Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[74] B. van Breugel, M. van der Schaar, Why tabular foundation models should be a research priority,
arXiv preprint arXiv:2405.01147 (2024).
[75] J. Qu, D. Holzmüller, G. Varoquaux, M. L. Morvan, Tabicl: A tabular foundation model for
in-context learning on large data, arXiv preprint arXiv:2502.05564 (2025).
[76] N. Hollmann, S. Müller, K. Eggensperger, F. Hutter, Tabpfn: A transformer that solves small
tabular classification problems in a second, arXiv preprint arXiv:2207.01848 (2022).
[77] C. Kolberg, K. Eggensperger, N. Pfeifer, Tabpfn-wide: Continued pre-training for extreme feature
counts, arXiv preprint arXiv:2510.06162 (2025).
[78] A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan,
H. Jeong, et al., A survey on data selection for language models, arXiv preprint arXiv:2402.16827
(2024).
[79] T. Nagler, J. Lützen, Statistical foundations of prior-data fitted networks, arXiv preprint
arXiv:2305.11175 (2023).
[80] J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. De Carvalho, J. Gama, Data stream
clustering: A survey, ACM Computing Surveys (CSUR) 46 (2013) 1–31.
[81] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, M. Ranzato, On
tiny episodic memories in continual learning, arXiv preprint arXiv:1902.10486 (2019).
[82] Z. Wu, Y. Wang, J. Ye, L. Kong, Self-adaptive in-context learning: An information compression
perspective for in-context example selection and ordering, arXiv preprint arXiv:2212.10375
(2022).
[83] W. Xiao, H. Zhao, L. Huang, The role of diversity in in-context learning for large language models,
arXiv preprint arXiv:2505.19426 (2025).
[84] I. Levy, B. Bogin, J. Berant, Diverse demonstrations improve in-context compositional
generalization, arXiv preprint arXiv:2212.06800 (2022).
[85] R. A. Cook, J. P. Lalor, A. Abbasi, No simple answer to data complexity: An examination of
instance-level complexity metrics for classification tasks, in: Proceedings of the 2025 Conference
of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human
Language Technologies (Volume 1: Long Papers), 2025, pp. 2553–2573.
[86] Y. Chen, C. Zhao, Z. Yu, K. McKeown, H. He, On the relation between sensitivity and accuracy in
in-context learning, arXiv preprint arXiv:2209.07661 (2022).
[87] Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy, et al.,
Dualprompt: Complementary prompting for rehearsal-free continual learning, in: European
conference on computer vision, Springer, 2022, pp. 631–648.
[88] Y. Sakamoto, K.-i. Fukui, J. Gama, D. Nicklas, K. Moriyama, M. Numao, Concept drift
detection with clustering via statistical change detection methods, in: 2015 Seventh International
Conference on Knowledge and Systems Engineering (KSE), IEEE, 2015, pp. 37–42.
[89] J. S.-W. Wan, S.-D. Wang, Concept drift detection based on pre-clustering and statistical testing,</p>
          <p>Journal of Internet Technology 22 (2021) 465–472.
[90] J. Shao, Z. Ahmadi, S. Kramer, Prototype-based learning on concept-drifting data streams, in:
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and
data mining, 2014, pp. 412–421.
[91] O. Sener, S. Savarese, Active learning for convolutional neural networks: A core-set approach,
arXiv preprint arXiv:1708.00489 (2017).
[92] M. Toneva, A. Sordoni, R. T. d. Combes, A. Trischler, Y. Bengio, G. J. Gordon, An empirical study
of example forgetting during deep neural network learning, arXiv preprint arXiv:1812.05159
(2018).
[93] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, K. Alahari, End-to-end incremental learning,
in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 233–248.
[94] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: International
conference on machine learning, PMLR, 2017, pp. 1885–1894.
[95] B. Pecher, I. Srba, M. Bielikova, J. Vanschoren, Automatic combination of sample selection
strategies for few-shot learning, arXiv preprint arXiv:2402.03038 (2024).
[96] B. Krawczyk, Active and adaptive ensemble learning for online activity recognition from data
streams, Knowledge-Based Systems 138 (2017) 69–78.
[97] Z. Yu, S. Huang, K. Yang, J. Lv, C. P. Chen, Ensemble approaches for dynamic data stream
classification under label scarcity, IEEE Transactions on Big Data (2025).
[98] D. M. Cavalcanti, R. Cerri, E. R. Faria, Arm-stream: active recovery of miscategorizations in
clustering-based data stream classifiers, Data Mining and Knowledge Discovery 39 (2025) 1–35.
[99] M. Welling, Herding dynamical weights to learn, in: Proceedings of the 26th annual international
conference on machine learning, 2009, pp. 1121–1128.
[100] Y. Liu, Y. Su, A.-A. Liu, B. Schiele, Q. Sun, Mnemonics training: Multi-class incremental learning
without forgetting, in: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern
Recognition, 2020, pp. 12245–12254.
[101] S. Swayamdipta, R. Schwartz, N. Lourie, Y. Wang, H. Hajishirzi, N. A. Smith, Y. Choi, Dataset
cartography: Mapping and diagnosing datasets with training dynamics, arXiv preprint
arXiv:2009.10795 (2020).
[102] O. Rubin, J. Herzig, J. Berant, Learning to retrieve prompts for in-context learning, arXiv preprint
arXiv:2112.08633 (2021).
[103] X. Li, X. Qiu, Finding support examples for in-context learning, arXiv preprint arXiv:2302.13539
(2023).
[104] Y. Zhang, S. Feng, C. Tan, Active example selection for in-context learning, arXiv preprint
arXiv:2211.04486 (2022).
[105] P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpurohit, P. Clark, A. Kalyan, Dynamic
prompt learning via policy gradient for semi-structured mathematical reasoning, arXiv preprint
arXiv:2209.14610 (2022).
[106] D. Himaja, V. Dondeti, S. Uppalapati, S. Virupaksha, Cluster based active learning for classification
of evolving streams, Evolutionary Intelligence 17 (2024) 2167–2191.
[107] C. Yin, S. Chen, Z. Yin, Clustering-based active learning classification towards data stream, ACM
Transactions on Intelligent Systems and Technology 14 (2023) 1–18.
[108] Z. Zhang, Y. Jiang, S. Zhang, X. Xu, Nonstationary data stream classification with online active
learning and siamese neural networks, IEEE Transactions on Neural Networks and Learning
Systems 33 (2021) 3087–3101.
[109] S. Sun, D. Calandriello, H. Hu, A. Li, M. Titsias, Information-theoretic online memory selection
for continual learning, arXiv preprint arXiv:2204.04763 (2022).
[110] F. Wiewel, B. Yang, Entropy-based sample selection for online continual learning, in: 2020 28th
European signal processing conference (EUSIPCO), IEEE, 2021, pp. 1477–1481.
[111] C. Qin, A. Zhang, C. Chen, A. Dagar, W. Ye, In-context learning with iterative demonstration
selection, arXiv preprint arXiv:2310.09881 (2023).
[112] J. Bang, H. Kim, Y. Yoo, J.-W. Ha, J. Choi, Rainbow memory: Continual learning with a memory
of diverse samples, in: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, 2021, pp. 8218–8227.
[113] M. De Lange, T. Tuytelaars, Continual prototype evolution: Learning online from non-stationary
data streams, in: Proceedings of the IEEE/CVF international conference on computer vision,
2021, pp. 8250–8259.
[114] Z. Borsos, M. Mutny, A. Krause, Coresets via bilevel optimization for continual learning and
streaming, Advances in neural information processing systems 33 (2020) 14879–14890.
[115] K. Killamsetty, D. Sivasubramanian, G. Ramakrishnan, R. Iyer, Glister: Generalization based data
subset selection for efficient and robust learning, in: Proceedings of the AAAI conference on
artificial intelligence, volume 35, 2021, pp. 8110–8118.
[116] B. Mirzasoleiman, J. Bilmes, J. Leskovec, Coresets for data-efficient training of machine learning
models, in: International Conference on Machine Learning, PMLR, 2020, pp. 6950–6960.
[117] A. Prabhu, P. H. Torr, P. K. Dokania, Gdumb: A simple approach that questions our progress in
continual learning, in: European Conference on Computer Vision, Springer, 2020, pp. 524–540.
[118] S. Yang, Z. Xie, H. Peng, M. Xu, M. Sun, P. Li, Dataset pruning: Reducing training data by
examining generalization influence, arXiv preprint arXiv:2205.09329 (2022).
[119] H. Tan, S. Wu, F. Du, Y. Chen, Z. Wang, F. Wang, X. Qi, Data pruning via moving-one-sample-out,
Advances in neural information processing systems 36 (2023) 18251–18262.
[120] S. Jain, H. Salman, A. Khaddaj, E. Wong, S. M. Park, A. Madry, A data-based perspective on
transfer learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2023, pp. 3613–3622.
[121] A. Grotov, M. De Rijke, Online learning to rank for information retrieval: Sigir 2016 tutorial, in:
Proceedings of the 39th International ACM SIGIR conference on Research and Development in
Information Retrieval, 2016, pp. 1215–1218.
[122] K. Purohit, V. Venktesh, S. Bhattacharya, A. Anand, Sample efficient demonstration selection for
in-context learning, arXiv preprint arXiv:2506.08607 (2025).
[123] S. Somvanshi, M. M. Islam, G. Chhetri, R. Chakraborty, M. S. Mimi, S. A. Shuvo, K. S. Islam, S. A.
Javed, S. A. Rafat, A. Dutta, et al., From tiny machine learning to tiny deep learning: A survey,
arXiv preprint arXiv:2506.18927 (2025).
[124] D. Xu, O. Cirit, R. Asadi, Y. Sun, W. Wang, Mixture of in-context prompters for tabular pfns,
arXiv preprint arXiv:2405.16156 (2024).
[125] F. den Breejen, S. Bae, S. Cha, T.-Y. Kim, S. H. Koh, S.-Y. Yun, Fine-tuning the retrieval mechanism
for tabular deep learning, arXiv preprint arXiv:2311.17641 (2023).
[126] V. Thomas, J. Ma, R. Hosseinzadeh, K. Golestan, G. Yu, M. Volkovs, A. Caterini, Retrieval &amp;
fine-tuning for in-context tabular models, arXiv preprint arXiv:2406.05207 (2024).
[127] J. Bell, L. Quarantiello, E. N. Coleman, L. Li, M. Li, M. Madeddu, E. Piccoli, V. Lomonaco, The
future of continual learning in the era of foundation models: Three key directions, arXiv preprint
arXiv:2506.03320 (2025).
[128] E. N. Coleman, L. Quarantiello, Z. Liu, Q. Yang, S. Mukherjee, J. Hurtado, V. Lomonaco,
Parameter-efficient continual fine-tuning: A survey, arXiv preprint arXiv:2504.13822 (2025).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Minku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stefanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <article-title>Ensemble learning for data stream analysis: A survey</article-title>
          ,
          <source>Information Fusion</source>
          <volume>37</volume>
          (
          <year>2017</year>
          )
          <fpage>132</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Halstead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Riddle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pechenizkiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <article-title>A probabilistic framework for adapting to changing and recurring concepts in data streams</article-title>
          ,
          <source>in: 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Online deep learning: Learning deep neural networks on the fly</article-title>
          ,
          <source>arXiv preprint arXiv:1711.03705</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>McElfresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khandagale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Valverde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prasad C</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goldblum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <article-title>When do neural nets outperform boosted trees on tabular data?</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>76336</fpage>
          -
          <lpage>76369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nikishin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dabney</surname>
          </string-name>
          ,
          <article-title>Understanding plasticity in neural networks</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>23190</fpage>
          -
          <lpage>23211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lourenço</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marreiros</surname>
          </string-name>
          ,
          <article-title>On-device edge learning for iot data streams: a survey</article-title>
          ,
          <source>arXiv preprint arXiv:2502.17788</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shroff</surname>
          </string-name>
          ,
          <article-title>Theory on forgetting and generalization of continual learning</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>21078</fpage>
          -
          <lpage>21100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Spinosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <article-title>Knowledge discovery from data streams</article-title>
          ,
          <source>in: Web Intelligence and Security</source>
          , IOS Press,
          <year>2010</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gunasekara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <article-title>Survey on online streaming continual learning</article-title>
          ,
          <source>in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>6628</fpage>
          -
          <lpage>6637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ziffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cossu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lomonaco</surname>
          </string-name>
          ,
          <article-title>Streaming continual learning for unified adaptive intelligence in dynamic environments</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>39</volume>
          (
          <year>2024</year>
          )
          <fpage>81</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hollmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Purucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishnakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Körfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Hoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Schirrmeister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Accurate predictions on small data with a tabular foundation model</article-title>
          ,
          <source>Nature</source>
          <volume>637</volume>
          (
          <year>2025</year>
          )
          <fpage>319</fpage>
          -
          <lpage>326</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lourenço</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marreiros</surname>
          </string-name>
          ,
          <article-title>In-context learning of evolving data streams with tabular foundational models</article-title>
          ,
          <source>arXiv preprint arXiv:2502.16840</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Read</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Barddal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Enembreck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abdessalem</surname>
          </string-name>
          ,
          <article-title>Adaptive random forests for evolving data stream classification</article-title>
          ,
          <source>Machine Learning</source>
          <volume>106</volume>
          (
          <year>2017</year>
          )
          <fpage>1469</fpage>
          -
          <lpage>1495</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Read</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <article-title>Streaming random patches for evolving data stream classification</article-title>
          ,
          <source>in: 2019 IEEE international conference on data mining (ICDM)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>240</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cormode</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muthukrishnan</surname>
          </string-name>
          ,
          <article-title>An improved data stream summary: the count-min sketch and its applications</article-title>
          ,
          <source>Journal of Algorithms</source>
          <volume>55</volume>
          (
          <year>2005</year>
          )
          <fpage>58</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Riemer</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Cases</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ajemian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Rish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tesauro</surname>
          </string-name>
          ,
          <article-title>Learning to learn without forgetting by maximizing transfer and minimizing interference</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Knoblauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Husain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Diethe</surname>
          </string-name>
          ,
          <article-title>Optimal continual learning has perfect memory and is np-hard</article-title>
          ,
          <source>arXiv preprint arXiv:2006.05188</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Risca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lourenço</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marreiros</surname>
          </string-name>
          ,
          <article-title>Continual learning for rotating machinery fault diagnosis with cross-domain environmental and operational variations</article-title>
          ,
          <source>arXiv preprint arXiv:2504.10151</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lourenço</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marreiros</surname>
          </string-name>
          ,
          <article-title>Dfdt: Dynamic fast decision tree for iot data stream mining on edge devices</article-title>
          ,
          <source>arXiv preprint arXiv:2502.14011</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Korycki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          ,
          <article-title>Streaming decision trees for lifelong learning</article-title>
          ,
          <source>in: Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13-17, 2021, Proceedings, Part I 21</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>502</fpage>
          -
          <lpage>518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Korycki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          ,
          <article-title>Online oversampling for sparsely labeled imbalanced and non-stationary</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>