<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Robust Drift Detection Algorithm with High Accuracy and Low False Positives Rate</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxime Fuccellaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Simon</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akka Zemmari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mangrove</institution>
          ,
          <addr-line>87 Avenue des Aygalades, 13015 Marseille</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bordeaux</institution>
          ,
          <addr-line>CNRS, Bordeaux INP, LaBRI, UMR 5800, 351, cours de la Libération F-33405 Talence</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The number of decision-making processes that rely on machine learning models to operate has been increasing in recent years. The safety of those systems is compromised when models deviate from their expected behavior. One root cause is a shift in the underlying data distribution, known as concept drift. A direct consequence of concept drift is a rapid drop in a model's predictive power. Accurate detection of drift is essential, as false alarms lead to unnecessary downtime and undermine confidence in the drift detection model. This paper introduces the Real-Drift Detector (RDD), a drift detector that is not triggered by virtual drift. RDD does not need class labels during the inference phase to operate. Our detector outperformed the state of the art in an extensive benchmark on a large panel of well-known datasets used in drift detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept Drift</kwd>
        <kwd>Real Drift</kwd>
        <kwd>Virtual Drift</kwd>
        <kwd>Unsupervised</kwd>
        <kwd>Hypothesis Testing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        More and more online systems rely, at least partly, on
a form of machine learning model to operate. The
widespread integration of Artificial Intelligence based
models has its roots in the constant progress made in the
field, which enables models to solve increasingly complex
tasks well suited to real world applications. The
democratisation of Machine Learning (ML) models allows
non-experts to use them to automate repetitive tasks,
and is simplified by the easy access to and processing of the
large quantities of data required to train predictive models.
The emergence of cloud computing has also been
accelerating the industrial use of ML models in production.
      </p>
      <p>
        However, ML models can be crippled by a wide range
of problems that raise serious questions regarding their
impact on the safety of systems and their consequences
for society. Some models inadvertently induce bias in
their predictions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], such as black box models that are
therefore excluded from applications where explainability
is a critical feature, such as loan applications. Moreover,
ML models are often poorly adapted to detect out of
distribution samples [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which are not classified correctly.
Models can then see a drop of performance during the
inference phase due to a change in the underlying data
distribution [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A shift of distribution is referred to as
Concept Drift (CD), and its detection will be the focus of
this paper.
      </p>
      <p>
        Machine learning models are built under the
hypothesis that data seen during the training phase share the
same distribution as unseen future data. Concept drift
appears when the underlying distribution of a data source
changes over time. If the static distribution hypothesis
is violated, historic data cannot be used to predict the
future, and predictive models see their performance drop.
Concept Drift can impact every ML domain, including
video analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In contrast to anomaly detection, where the goal is
to isolate a few out of distribution samples, concept drift
will cause a large part of the data to deviate from past
distributions. One way to categorize drifts is by their
impact on a model's performance: virtual drift describes
a distribution change that does not impact a model,
while real drift does. Let X be a set of variables
used to predict the target class vector y. We distinguish
three root causes of concept drift: it may come from a
change in the class distribution P(y | X), the feature
space P(X) or the class priors P(y). Where the change in
distribution is rapid, the drift is said to be abrupt. Drift is
incremental when the distribution shifts slowly over time.
Recurrent drift is defined as a distribution that oscillates
between two or more concepts. Drift detection differs
from outlier detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], as the goal is to identify and
take actions to deal with a global distribution change, and
not to remove out of distribution samples.
      </p>
      <p>
        We present RDD: Real-Drift Detector, an unsupervised
drift detection method based on the supervised partition
of the feature space, aiming to detect local distribution
changes that impact model performance. RDD works
in any number of dimensions. Our detector does not need
labels during inference and outperforms the state of the
art in a thorough experiment. In Section 2, we present
related work and position our paper. The RDD algorithm
is described in Section 3. In Section 4, the experimental
protocol is presented and the results are discussed.
Section 5 concludes this paper.
      </p>
      <p>
        Proceedings of the Workshop on Artificial Intelligence Safety 2023.
maxime.fuccellaro@u-bordeaux.fr (M. Fuccellaro);
laurent.simon@u-bordeaux.fr (L. Simon);
akka.zemmari@u-bordeaux.fr (A. Zemmari).
https://www.labri.fr/perso/lsimon/ (L. Simon);
https://www.labri.fr/perso/zemmari/ (A. Zemmari).
ORCID 0000-0003-0544-5503 (L. Simon); 0000-0002-9776-0449 (A. Zemmari).
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
      </p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>Over the last few years, concept drift has become a major field of research in the machine learning community.</p>
        <p>Focus was mostly aimed at models dealing with drift as
well as drift detectors. Recent advances in detecting drift
when true class labels are available led algorithms to
achieve almost perfect detection. Other (more realistic)
contexts still show room for improvement.</p>
        <p>
          To prevent drift from afecting ML models, several
mechanisms have been proposed. Assuming that recent
data share the same distribution as upcoming data, one
way is to continuously update a pool of models. In [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a
batch of new data is scored by a pool of models, and
each model's contribution is weighted by its recent
performance. At each new batch, a model is trained on
it and added to the pool, while a long-term poor
performer is removed. This ensures good performance in
the presence of concept drift and fast adaptation to
recurrent drifts. Methods that work on a dynamic pool
of models have been thoroughly studied [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          When detecting drift, the consensus is that an
increase in a model's error rate indicates the presence of
concept drift. Detection methods based on this
hypothesis have received a lot of attention, as this methodology
is able to systematically detect real drift while
consistently ignoring virtual drift. The detection methods
presented in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] work by monitoring a model's
error rate: a drop in performance is interpreted as the
presence of drift. Different statistical tests are used to
monitor the error rate and signal drifts.
        </p>
        <p>Both updating a pool of models and monitoring the
error rate deal with drift effectively. As virtual drift does not
impact accuracy, it is systematically ignored. However,
in order to work, both approaches need access to the
true labels immediately after the inference phase. This
is not realistic in real-world scenarios, where true class
labels are almost never available promptly and are
sometimes never known. To address this issue, several ways
of dealing with drift in an unsupervised manner have
been studied.</p>
        <p>
          To detect drift without class label availability,
window-based techniques have been studied. The authors
of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] introduce ADWIN, an adaptive sliding window
algorithm. It works by keeping a reference window
containing past instances. The window widens when no
change is detected, while its size decreases rapidly in
the presence of drift. The detection mechanism works
by repeatedly splitting the window, based on the time
of appearance, into two smaller sets. A drift is detected
when the averages of the values of the two sets are
statistically different. Several other window-based detectors
have been presented since, such as [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
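<p>As a minimal sketch of the window-splitting idea just described (not the actual ADWIN implementation, which uses exponential bucketing and an adaptive bound; the threshold below is a simplified Hoeffding-style assumption):</p>

```python
import math

def adwin_style_check(window, delta=0.05):
    """Simplified ADWIN-style cut test: try every split of the window
    into an 'old' and a 'recent' part and compare their means against
    a Hoeffding-style threshold."""
    n = len(window)
    for cut in range(2, n - 1):
        left, right = window[:cut], window[cut:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        # harmonic sample size of the two sub-windows
        m = 1.0 / (1.0 / len(left) + 1.0 / len(right))
        eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / delta))
        if abs(mean_left - mean_right) > eps:
            return True  # distribution change between the two sub-windows
    return False
```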
        <p>
          To avoid detecting drift over one-dimensional sliding
windows, [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] detects swift or gradual changes in data
values with minimum enclosing balls. A ball is defined
by a centroid and the minimum radius that encloses
all of the samples in the ball. A drift is detected
when too many values are labelled as outliers, in which
case the centroid is updated.
        </p>
        <p>
          To circumvent the unavailability of true labels, the
authors of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] train a model to distinguish past data from
recent data. All timestamp-like aggregates are removed
prior to training to prevent trivial identification.
The model's discriminative ability is assessed with the AUC
metric, using a threshold of 0.75 in their paper. Also using
time to find drift, the authors of [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] include the
timestamp attribute in the observations and train a model on
past and recent data to predict the target variable. If the
timestamp attribute is an informative feature, the target
variable depends on time and the presence of a drift is
assumed.
        </p>
        <p>
          The authors of [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] present another way of detecting
drift in an unsupervised manner. A first model (teacher)
is trained on past labelled data, a second model (student)
is trained to mimic the behavior of the teacher model.
        </p>
        <p>
          During the inference phase the authors monitor the
error rate of the student model and use [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to trigger an
alarm. In [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], drift detection in an unsupervised
image classification context is studied. The authors first
apply a dimension reduction technique before using a
two-sample test to find drift. A number of dimension
reduction techniques and two-sample tests, including MMD [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and
KS, are evaluated in an extensive study of different types
of shift applied to images. In [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] the authors incorporate
the target class in the dimension reduction mechanism,
enabling the detector to ignore virtual drift.
        </p>
        <p>
          The idea behind concept drift detection by statistical
tests is that a distribution change is a strong
indicator of drift [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. A distribution change enables
the detection of drift, but does not discriminate between
real drift and virtual drift. To the best of our knowledge,
few algorithms are able to discriminate between real drift
and virtual drift without access to true labels after
detection. In this paper we introduce RDD, a detector that does
not need true class labels to operate during the inference
phase. RDD works in any dimension and successfully
discriminates between real and virtual drift.
        </p>
        <sec id="sec-1-1-1">
          <title>3. RDD</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>3.1. Our detector</title>
        <p>The idea behind our model is that a real drift changes the distribution of the regions made by a class-dependent partitioning, while virtual drift does not. We use a decision tree to partition the feature space into regions of homogeneous class labels.</p>
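<p>As an illustration of such a class-dependent partition, here is a toy one-split stand-in (the paper uses a full decision tree; in practice one could use, for example, scikit-learn's DecisionTreeClassifier and its apply() method to obtain leaf indices):</p>

```python
def stump_partition(X, y, feature=0):
    """Toy stand-in for a decision-tree partition: split on the midpoint
    of one feature and map each (sample, label) pair to a 'leaf' id."""
    values = sorted(x[feature] for x in X)
    threshold = (values[0] + values[-1]) / 2.0
    leaves = {0: [], 1: []}
    for xi, yi in zip(X, y):
        leaf = 0 if xi[feature] <= threshold else 1
        leaves[leaf].append((xi, yi))
    return threshold, leaves
```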
        <p>Our intuition is that a data distribution change in a leaf
between the training phase and the inference phase indicates
a drift. It is our assumption that a real drift is likely to
change the region to which observations are assigned.
Such misplaced samples are likely to have a different
distribution than that of the training observations. A
distribution change leading to virtual drift is unlikely to
be seen locally, as it affects less the way observations
are distributed in leaves.</p>
        <p>In order to better discriminate real drift from virtual
drift, each region is attributed a weight. The weights
represent the ability of a given region to correctly assign
a label to an observation. A region that contains only
a single class of observations will have a large weight,
while a region that cannot separate samples well on
their label will have a small weight. This is done to
reduce the risk of a sudden class imbalance being detected
as a distribution change. Our model signals a drift when
enough regions flag their distribution as changing.</p>
        <p>During the initialisation step, leaves that
cannot well separate class labels are removed: all leaves
with more than 20% impurity are dropped. During early
experimental runs, we found that setting a low maximum
impurity percentage yields very few leaves with enough
samples to conduct the statistical test. We also found
that setting a high value prevents us from confidently
assigning a class to a leaf. We set the maximum impurity
value at 20% as it offers a good compromise, although
this value could be changed based on the data at hand.
In an effort to rank the remaining leaves, we attribute to
each leaf a weight which corresponds to the leaf's purity
during training.</p>
        <p>Weights capture the separation power of a
leaf. If a given observation is classified at a leaf with a
high weight, we may expect the probability of this
observation being misclassified to be low. On the other hand,
a leaf with a lower weight is more likely to
assign the wrong label to an observation. Our goal in
ranking the leaves by their predictive power is to
help our detector ignore virtual drift.</p>
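<p>A leaf's weight, as described above, is simply its training purity; a minimal sketch:</p>

```python
def leaf_weight(labels):
    """Weight of a leaf = its purity during training: the fraction of
    training samples in the leaf that belong to its majority class."""
    if not labels:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)
```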
        <p>The detection step is detailed in Algorithm 1. In lines
1-5, test set observations are assigned to a leaf dictionary based on
the leaf they fall in. In line 6, we go over each leaf that
contains test data; in line 7 we initialize the drifted-features
counter DF that tracks the number of drifted features. In
line 8, the number of observations at the leaf is checked.
In lines 9 through 13, for all leaves containing enough
test instances, we perform a Levene test of variance
equality on every dimension between the inference and
training instances contained in the leaf. We choose the
Levene test as it is adequate when the data distribution may
deviate slightly from the normal one. Of course, other
tests could be used, based on knowledge of the
underlying data distribution, to improve the detector's
performance (when the data distribution is strictly normal,
Bartlett's test should be used; the Brown–Forsythe test
may be an alternative when the data does not follow
a normal distribution). We did not make any assumptions
on the distribution and independence of variables; this will
be the focus of future work.</p>
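<p>In practice one would call scipy.stats.levene on each feature; as a self-contained sketch, the two-group Levene statistic (mean-centered variant) can be computed as:</p>

```python
def levene_statistic(a, b):
    """Levene's W statistic for equality of variances between two groups
    (mean-centered variant). In practice scipy.stats.levene returns this
    statistic together with a p-value to compare against the chosen
    significance level."""
    def abs_deviations(group):
        mean = sum(group) / len(group)
        return [abs(v - mean) for v in group]

    za, zb = abs_deviations(a), abs_deviations(b)
    n_a, n_b = len(za), len(zb)
    mean_a = sum(za) / n_a
    mean_b = sum(zb) / n_b
    grand = (sum(za) + sum(zb)) / (n_a + n_b)
    # between-group and within-group sums of squares of the deviations
    between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    within = sum((v - mean_a) ** 2 for v in za) + \
             sum((v - mean_b) ** 2 for v in zb)
    k = 2  # number of groups
    return ((n_a + n_b - k) / (k - 1)) * between / within
```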
        <p>In lines 14 through 18, leaves are classified as drifting
if the ratio of features that fail the homoscedasticity test
exceeds the user-defined feature-ratio threshold. In line 21, the
algorithm flags a drift if the weighted average of the leaves'
drift labels exceeds the user-defined leaf-ratio threshold.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3.2. Mathematical Background</title>
        <p>During the initialization step, a decision tree classifier
(ℳ) is fit over the training data. Like most drift detection
algorithms, we assume the training data is sampled from
a single concept. We do not consider the training data
to include past concepts that might offset the detector.</p>
        <p>For both the training and inference sets, we
consider the variables to follow a normal distribution.
This hypothesis is required to test for homoscedasticity
later on.</p>
        <p>
          After discarding all leaves that contain too
few samples or that are not pure enough, we store, for each leaf,
the training instances that belong to it. For
the statistical test to be significant [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], we remove all leaves of the
decision tree containing fewer than a minimum number of
observations, which we set to 20, so that every retained leaf
holds at least this many training and inference observations.
        </p>
        <p>By construction, the leaves of a decision tree do not
hold the same separation power over class labels. The intuition
is that leaves containing pure class labels should
be less affected by a P(y | X) concept change, as they are
generally further away from the decision boundary. On
the contrary, impure leaves are more likely to experience
a distribution change due to a P(X) drift, or to be
subject to misclassifications that may impact the inference
distribution.</p>
      </sec>
      <sec id="sec-1-4">
        <title>3.3. Hyper-parameter discussion</title>
        <p>The first hyper-parameter is the significance level of the
statistical test. Setting a low value reduces the risk of making
a type I error, which, in our case, means indicating drift when
there is none. This parameter was set to 0.01.</p>
        <p>The second hyper-parameter is the minimum ratio of features that must
reject the equal-variance hypothesis for a leaf to be flagged. A low value
means leaves will be considered as drifting even if only a few
features present a shift in variance (i.e. the detector will be
sensitive).</p>
        <p>Algorithm 1: RDD - Inference.
Inputs:
- ℳ: trained decision tree model;
- the dictionary of leaves mapping to the training instances;
- the leaf weights;
- the test set with d features and m samples;
- the number of variables in the dataset;
- the hypothesis rejection risk;
- the minimum number of observations in a leaf;
- the minimum ratio of null-hypothesis rejections for a leaf to drift;
- the minimum ratio of drifting leaves to trigger an alarm.
Variables: the dictionary of leaves mapping to the test instances in those
leaves; the drift status of the leaves; the number of features that drift
within a leaf.
Lines 1-5: for each test observation, if its leaf ℳ(x) is a retained leaf,
append the observation to that leaf's test instances.</p>
        <p>Figure 1: Influence of the two threshold parameters on the True
Positive rate, True Negative rate and H score. On each plot, the other
parameter is fixed at 0.3. The graph confirms our intuition that low
values tend to flag virtual drift as real drift while high values cause
the detector not to detect any drift.</p>
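<p>Under the descriptions above, the inference procedure of Algorithm 1 can be sketched roughly as follows (names and the injected per-feature test are illustrative, not the paper's notation; the per-feature test would in practice be a Levene test at the chosen significance level):</p>

```python
def rdd_inference(assign_leaf, train_leaves, X_test, weights,
                  feature_drifted, n_min=20, tau=0.3, lam=0.3):
    """Sketch of RDD inference. `assign_leaf` maps a sample to its leaf id
    (the trained tree); `feature_drifted(train_col, test_col)` returns True
    when a feature's variance test rejects the null hypothesis."""
    # 1. route test samples to the leaves of the trained tree (lines 1-5)
    test_leaves = {}
    for x in X_test:
        leaf = assign_leaf(x)
        if leaf in train_leaves:
            test_leaves.setdefault(leaf, []).append(x)
    # 2. per leaf: count features whose variance shifted
    drifted_weight, total_weight = 0.0, 0.0
    for leaf, test_rows in test_leaves.items():
        if len(test_rows) < n_min:
            continue  # not enough test instances for a significant test
        train_rows = train_leaves[leaf]
        d = len(test_rows[0])
        n_drifted = sum(
            feature_drifted([r[j] for r in train_rows],
                            [r[j] for r in test_rows])
            for j in range(d))
        total_weight += weights[leaf]
        if n_drifted / d >= tau:  # enough features rejected H0
            drifted_weight += weights[leaf]
    # 3. weighted vote over the leaves
    return total_weight > 0 and drifted_weight / total_weight >= lam
```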
        <p>The  parameter is the minimum ratio of drifted leaves
to signal a drift has taken place.</p>
        <p>In order to set relevant threshold values for our detector,
we conducted a hyper-parameter search on three datasets
(airlines, poker and weather). We excluded those datasets
from the experimental study to prevent bias. In Figure 1
we plot the influence of both thresholds on the TP rate, the TN
rate and the H score. The H score is detailed in Section 4.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiment</title>
      <p>
        In this experiment we assess our method's ability to
detect real drift while ignoring virtual drift. We extensively
tested our method against a wide panel of state of the art
detectors on an extensive set of both real and synthetic
datasets. The benchmarks used in this section are the
standard ones for testing drift detectors [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>4.1. Experimental Setup</title>
        <sec id="sec-2-1-1">
          <p>The usual procedure to test algorithms suited to handle
drift when true class labels are available after inference is
the test-then-train approach. A model predicts the class
on a batch of samples; then the true class is revealed and
the model updates itself. The global prediction accuracy
is then used to rank models.</p>
          <p>This setup is not suited to models that do not rely on
true label availability. In most datasets used to
benchmark drift handling methods, the presence of drift is
only assumed or artificially introduced by sorting the
observations on an attribute. To the best of our knowledge,
the exact occurrence of drift is unknown for all the
usual datasets. The experimental setup described
below allows us to know the exact drift occurrence and to
evaluate the effect it has on a model's accuracy.</p>
          <p>The goal of the experiment is to assess the performance
of detectors on real drift and virtual drift. Two distinct
perturbations are used to change the dataset: the Step
Drift, where a subset of the features are shuffled, and the
Noise Drift, where Gaussian 𝒩(1, 1) noise is added to a
subset of features (Gaussian noise with mean equal to 1 is
used to change the mean of the distribution, not to
obfuscate the signal). The idea is to be able to artificially
generate real drift and virtual drift. To create virtual drift,
we add one of the two perturbations to the 25% least important
features with respect to their predictive power for the class
labels. In doing so, we expect that changing the distribution
of several features will not affect a predictive model's
performance. To create real drift, we modify the 25%
most informative features by adding one of the two
perturbations. The intuition is that a change of distribution
on the most important features is likely to cause a drop
of performance in a predictive model. To find the 25% most
and least important features, we train a Random Forest
Classifier over the training data. We choose this model
as it is a robust, widely used model that achieves a good
level of performance on the datasets. For each dataset,
we introduce the 2 perturbations on the 2 sets of features,
thus creating 4 distinct drift sets.</p>
          <p>In order to have stationary non-drifting data before
adding our generated drifts, we first randomly shuffle
the observations. Each dataset is then partitioned into
three: a train set, a validation set and a drift set. 4 distinct
copies of the drift set are independently modified with
the 4 different perturbations described above. In order
to assess whether a drift is virtual or real, we fit a Random
Forest Classifier on the train set before reporting its
accuracy on the train set, the validation set and the 4 different
drift sets. The drop of the model's accuracy between the
different sets is used to classify drift as virtual or real. If
the difference in accuracies between the validation set
and the training set is lower than that between the validation
set and the drift set, we consider the drift induced to be
real; otherwise, it is considered a virtual drift.</p>
          <p>Table 1: Overview of the datasets used in our experiment.
All but one RW dataset are binary classification problems. 2 RW
datasets contain more than 100 features. For the synthetic datasets,
we limit the number of generated observations to 10 000.
Dataset (Dimensions) Classification:
Adult (48842, 66) Binary;
Bank (45211, 49) Binary;
Cov (110393, 51) Multi-class (7);
Digits08 (1499, 17) Binary;
Digits17 (1557, 17) Binary;
Elec (45312, 15) Binary;
Musk (6598, 167) Binary;
Phishing (11055, 47) Binary;
Spam (6213, 500) Binary;
Wine (6497, 13) Binary;
Hyperplane (10000, 11) Binary;
LED (10000, 26) Multi-class (10);
Waveform (10000, 41) Multi-class (3).</p>
          <p>Table 1 briefly describes the datasets used in the
experiment. The dataset dimensions range from 11 on
Hyperplane to 500 on Spam. The classification task is
binary on 10 datasets and multi-class on 3. This ensures
that RDD is tested in a variety of scenarios.</p>
          <p>Table 2: Accuracy of a Random Forest Classifier on the train
set, the validation set and the 4 different drift sets. Adding noise to
the least informative features (LN) leads to virtual drift on all
datasets. Adding step drift to the least informative features (LS)
also leads to virtual drift, except for the Musk dataset. When those
perturbations are made on the most informative features (MN and MS),
real drift results across all real datasets. We highlight in bold the
perturbations that lead to real drift.
Dataset: Train, Val., LN, LS, MN, MS:
Adult 1. .85 .85 .85 .28 .59;
Bank 1. .94 .92 .94 .52 .52;
Cov .99 .85 .84 .84 .49 .46;
DD1078 11.. .919. .919. .919. ..5747 ..8609;
Elec 1. .89 .87 .87 .57 .62;
Musk 1. .98 .94 .97 .51 .56;
Phis. .99 .97 .96 .96 .69 .47;
Spam 1. .98 .98 .98 .58 .59;
Wine 1. 1. 1. 1. .63 .44;
Hyp. 1. .87 .87 .85 .71 .87;
LED 1. 1. 1. 1. .58 .31;
Wav. 1. .85 .85 .85 .72 .41.</p>
          <p>In Table 2 we report the average accuracies of the Random
Forest Classifier over the training set, the validation set and
the four different drift sets. Changing the most important
features generates real drift, while changing the least important
features creates virtual drift regardless of the perturbation.
There are 3 exceptions: when noise is added to
the least important features, real drift is produced on the
Musk dataset; when corrupting the most important features,
the step perturbation produces virtual drift on the
Hyperplane dataset, while the noise perturbation yields
virtual drift on the Waveform dataset.</p>
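<p>A sketch of the two perturbations under these assumptions (pure Python, with `rng` an instance of `random.Random`; `features` is the chosen 25% subset of column indices):</p>

```python
import random

def step_drift(rows, features, rng):
    """Step drift: shuffle the values of the selected features across
    samples, breaking their joint distribution with the other features."""
    cols = {j: [r[j] for r in rows] for j in features}
    for j in features:
        rng.shuffle(cols[j])
    return [[cols[j][i] if j in cols else r[j] for j in range(len(r))]
            for i, r in enumerate(rows)]

def noise_drift(rows, features, rng):
    """Noise drift: add Gaussian N(1, 1) noise to the selected features,
    shifting their mean by one rather than obfuscating the signal."""
    return [[r[j] + rng.gauss(1.0, 1.0) if j in features else r[j]
             for j in range(len(r))]
            for r in rows]
```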
          <p>
            In an effort to aggregate both the True Positive and
True Negative rates into one metric, we will make use of the
H metric (1) defined in [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. Since we conduct our experiment
in a batch mode, we removed the impact of the detection
delay, defined as the number of drift samples processed
before signaling a drift. The Drift Accuracy is a
binary value that assesses the correctness of the detection:
it is equal to 1 when a virtual drift is ignored or when a
real drift is detected.
          </p>
          <p>
            H = (2 × TP × TN) / (TP + TN) (1)
          </p>
          <p>
            We evaluate RDD with significance level 0.01 and both ratio
thresholds set to 0.3, against:
• ADWIN [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] with its parameter set to 0.7
• Discriminative Drift Detector (D3) [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]
          </p>
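<p>Assuming equation (1) is the harmonic mean of the two rates (as the combined results discussed in Section 4.4 suggest), the H score reads:</p>

```python
def h_score(tp_rate, tn_rate):
    """H score per equation (1): the harmonic mean of the true positive
    and true negative rates, aggregating detection of real drift and
    rejection of virtual drift into one number."""
    if tp_rate + tn_rate == 0:
        return 0.0
    return 2 * tp_rate * tn_rate / (tp_rate + tn_rate)
```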
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Virtual Drift</title>
        <sec id="sec-2-2-1">
          <p>In Table 3 we present the detections made on virtual
drift induced by a Step corruption of the least important
features. Of the 7 detectors evaluated, 3 are able to
consistently ignore this type of virtual drift: TSDD, ST
and RDD. TSDD comes first, with no detections on
real-world (RW) datasets and almost none on synthetic data.
RDD makes no False Positives (FP) on 7 RW datasets and
on all synthetic datasets. On 3 real-world datasets the FP rate is
very low (0.1). The Student Teacher detector produces
no FP on 8 RW datasets and on 2 synthetic ones;
however, on 2 RW datasets the FP rate is high, at around
.5. ADWIN, along with the statistical test-based KS and
MMD, fails to ignore virtual drift on all but 1 RW dataset.
D3 does slightly better, ignoring virtual drift on 3 RW
datasets. On synthetic data, relatively few FP are made
by those 4 detectors.</p>
          <p>Table 4 exhibits the detection rates when adding noise to
the least important features. On the Musk dataset, this
type of perturbation produces real drift and, therefore,
detections are counted as True Positives (TP). ADWIN,
along with D3, KS and MMD, which were not specifically
built to handle virtual drift, systematically and wrongly
detect drift across all real and synthetic datasets. RDD
flags virtual drift on the Digits08 dataset in 4 out of 10
runs. Virtual drift is otherwise ignored by RDD. ST and
TSDD fail to ignore the virtual drift on 2 RW datasets.</p>
          <p>4.3. Real Drift</p>
          <p>In Table 5 we observe real drift induced by a Step drift on
the most informative features. The Hyperplane dataset exhibits
virtual drift with this corruption, and low values should
be regarded as TN. ADWIN, D3, KS and MMD, which all
exhibit poor performance on virtual drift, now achieve
almost perfect detection on RW datasets. However, D3
and ADWIN fail to detect real drift on synthetic data.
Our method systematically detects real drift on 4 RW
datasets and achieves good levels of detection on 3 others.
Drift is detected in 50% of the runs on the Phishing
dataset. The ST model achieves 4 perfect detections on
RW datasets and a good level of detection on 3 others. The
TSDD detector yields poor performance, detecting only 2
drifts out of all RW datasets. On the 2 synthetic datasets
with real drift, ADWIN, D3 and RDD fail to detect the
drift, while TSDD detects 1. The KS, MMD and ST detectors
succeed in their detection.</p>
          <p>Results for noise added to the most important features
are shown in Table 6. ADWIN, D3, KS and MMD achieve
perfect detection across both real and synthetic datasets.
RDD's detection results exceed those of ST and TSDD, with 7
perfect detections on RW datasets. TSDD and ST are tied,
with 5 accurate detections each on RW datasets. Only our
detector and TSDD ignore the virtual drift on the Waveform
dataset. ST and TSDD outperform RDD with one more perfect
detection on the synthetic datasets.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.4. Overall performances</title>
        <p>In Table 7, we report the true positive and
true negative results combined by (1). This table showcases the
overall performance achieved by each detector on each
dataset. Because drift induction generates either more
virtual or more real drift on the Musk, Hyperplane and
Waveform datasets, the TN score has a varying
impact there.</p>
        <p>Table 7 allows us to assess the ability of a model to
ignore virtual drift while detecting real drift. RDD yields the
best H scores on 7 RW datasets and ties for first place
on 2 synthetic ones. TSDD takes second place, with the
highest score on 1 RW dataset but coming first or tying for
first place on all synthetic datasets. ST comes third, with
the highest H scores on 2 RW datasets and a tie for first
place on 1 synthetic dataset. On RW datasets, we see that
ADWIN, D3, KS and MMD have overall low scores due
to their misclassification of virtual drift, despite having
detected all real drifts. On synthetic datasets their scores
are better, as few misclassifications were made
when a step drift was induced on the least informative
features.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>In this paper we introduced RDD, a drift detector that
does not need ground truth labels during the inference
phase. We extensively challenged our algorithm against
a number of state of the art drift detectors and over a
large panel of both real and synthetic datasets. We
experimentally proved that our method outperforms current
drift detection methods. We showed our detector’s ability
to detect real drift and to ignore virtual drift. As false
alarms are the main reason why drift detectors are not
on 2 synthetic ones. TSDD takes second place with the
highest score on 1 RW dataset but coming first or tying
ifrst place on all synthetic datasets. ST comes third with
the highest H scores on 2 RW datasets and tying first
place on 1 synthetic dataset. On RW datasets, we see that
ADWIN, D3, KS and MMD have overall low scores due
to their misclassification of virtual drift despite having
detected all real drifts. On synthetic datasets, their score
is better having not made too many misclassification
when a step drift was induced on the least informative
features.
widely used in production. We demonstrated the
usability of our detector for real world applications. We tuned
the hyper-parameters on 3 datasets not used in the
experimental study. We show that they are valid in a wide
range of real-world scenarios and that few efort should
be made when using the models in production. We also
demonstrated the ability of RDD to work in any
dimension, having the best detection accuracy on both datasets
that had over 100 features.</p>
      <p>Future work will consist of further modeling the
partition space. Research will also deal on how a drift detector
can be initialized in recurrent concept drift scenarios,
when no stationary dataset can be used to initialize a
detector.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.-K.</given-names>
            <surname>Reuel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Corso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          ,
          <article-title>Using adaptive stress testing to identify paths to ethical dilemmas in autonomous systems</article-title>
          , in: SafeAI@AAAI,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Feature space singularity for out-of-distribution detection</article-title>
          , arXiv preprint arXiv:2011.14654 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Hand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <article-title>The impact of changing populations on classifier performance</article-title>
          ,
          <source>in: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>367</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Philipsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Dueholm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Johansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nasrollahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Moeslund</surname>
          </string-name>
          ,
          <article-title>Seasons in drift: A long-term thermal imaging dataset for studying concept drift</article-title>
          ,
          <source>in: Thirty-fifth Conference on Neural Information Processing Systems</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Suprem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arulraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <article-title>Odin: Automated drift detection and recovery in video analytics</article-title>
          , arXiv preprint arXiv:2009.05440 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kamoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <article-title>Out-of-distribution detection with likelihoods assigned by deep generative models using multimodal prior distributions</article-title>
          , in: SafeAI@AAAI,
          <year>2020</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>116</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Elwell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Polikar</surname>
          </string-name>
          ,
          <article-title>Incremental learning of concept drift in nonstationary environments</article-title>
          ,
          <source>IEEE Transactions on Neural Networks</source>
          <volume>22</volume>
          (
          <year>2011</year>
          )
          <fpage>1517</fpage>
          -
          <lpage>1531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brzezinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stefanowski</surname>
          </string-name>
          ,
          <article-title>Reacting to different types of concept drift: The accuracy updated ensemble algorithm</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>25</volume>
          (
          <year>2013</year>
          )
          <fpage>81</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Maloof</surname>
          </string-name>
          ,
          <article-title>Dynamic weighted majority: An ensemble method for drifting concepts</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>8</volume>
          (
          <year>2007</year>
          )
          <fpage>2755</fpage>
          -
          <lpage>2790</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Baena-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>del Campo-Ávila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fidalgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gavalda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morales-Bueno</surname>
          </string-name>
          ,
          <article-title>Early drift detection method</article-title>
          ,
          <source>in: Fourth international workshop on knowledge discovery from data streams</source>
          , volume
          <volume>6</volume>
          ,
          <year>2006</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>de Lima Cabral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S. M.</given-names>
            <surname>de Barros</surname>
          </string-name>
          ,
          <article-title>Concept drift detection based on fisher's exact test</article-title>
          ,
          <source>Information Sciences</source>
          <volume>442</volume>
          (
          <year>2018</year>
          )
          <fpage>220</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gavalda</surname>
          </string-name>
          ,
          <article-title>Learning from time-changing data with adaptive windowing</article-title>
          ,
          <source>in: Proceedings of the 2007 SIAM international conference on data mining, SIAM</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heusinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-M.</given-names>
            <surname>Schleif</surname>
          </string-name>
          ,
          <article-title>Reactive soft prototype computing for concept drift streams</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>416</volume>
          (
          <year>2020</year>
          )
          <fpage>340</fpage>
          -
          <lpage>351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heusinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-M.</given-names>
            <surname>Schleif</surname>
          </string-name>
          ,
          <article-title>Reactive concept drift detection using coresets over sliding windows</article-title>
          ,
          <source>in: 2020 IEEE Symposium Series on Computational Intelligence (SSCI)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1350</fpage>
          -
          <lpage>1355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Gözüaçık</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Büyükçakır</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bonab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Can</surname>
          </string-name>
          ,
          <article-title>Unsupervised concept drift detection with a discriminative classifier</article-title>
          ,
          <source>in: Proceedings of the 28th ACM international conference on information and knowledge management</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2365</fpage>
          -
          <lpage>2368</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hickey</surname>
          </string-name>
          ,
          <article-title>Learning classification rules for telecom customer call data under concept drift</article-title>
          ,
          <source>Soft Computing</source>
          <volume>8</volume>
          (
          <year>2003</year>
          )
          <fpage>102</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cerqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <article-title>Unsupervised concept drift detection using a student-teacher approach</article-title>
          , in: International Conference on Discovery Science, Springer,
          <year>2020</year>
          , pp.
          <fpage>190</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rabanser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Günnemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <article-title>Failing loudly: An empirical study of methods for detecting dataset shift</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Borgwardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Rasch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>A kernel two-sample test</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>13</volume>
          (
          <year>2012</year>
          )
          <fpage>723</fpage>
          -
          <lpage>773</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <article-title>Task-sensitive concept drift detector with constraint embedding</article-title>
          ,
          <source>in: 2021 IEEE Symposium Series on Computational Intelligence (SSCI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>01</fpage>
          -
          <lpage>08</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Serdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cihan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yücel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Serdar</surname>
          </string-name>
          ,
          <article-title>Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies</article-title>
          ,
          <source>Biochemia Medica</source>
          <volume>31</volume>
          (
          <year>2021</year>
          )
          <fpage>27</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Sethi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantardzic</surname>
          </string-name>
          ,
          <article-title>On the reliable detection of concept drift from streaming unlabeled data</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>82</volume>
          (
          <year>2017</year>
          )
          <fpage>77</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Sethi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantardzic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arabmakki</surname>
          </string-name>
          ,
          <article-title>Monitoring classification blindspots to detect drifts from unlabeled data</article-title>
          ,
          <source>in: 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Van Looveren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Klaise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vacanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cobb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scillitoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Samoilescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Athorne</surname>
          </string-name>
          ,
          <article-title>Alibi Detect: Algorithms for outlier, adversarial and drift detection</article-title>
          ,
          <year>2019</year>
          . URL: https://github.com/SeldonIO/alibi-detect.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>