A Robust Drift Detection Algorithm with High Accuracy and Low False Positives Rate

Maxime Fuccellaro¹,², Laurent Simon¹ and Akka Zemmari¹
¹ University of Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, 351 cours de la Libération, F-33405 Talence
² Mangrove, 87 Avenue des Aygalades, 13015 Marseille
Contact: maxime.fuccellaro@u-bordeaux.fr (M. Fuccellaro); laurent.simon@u-bordeaux.fr (L. Simon); akka.zemmari@u-bordeaux.fr (A. Zemmari)
Proceedings of the Workshop on Artificial Intelligence Safety 2023

Abstract
The number of decision-making processes that rely on machine learning models to operate has been increasing in recent years. The safety of those systems is compromised when models deviate from their expected behavior. One root cause is a shift in the underlying data distribution, known as concept drift. A direct consequence of concept drift is a rapid drop in a model's predictive power. Accurate detection of drift is essential, as false alarms lead to unnecessary downtime and undermine confidence in the drift detection model. This paper introduces the Real-Drift Detector (RDD), a drift detector that is not triggered by virtual drift. RDD does not need class labels during the inference phase to operate. Our detector outperformed the state of the art in an extensive benchmark on a large panel of well-known datasets used in drift detection.

Keywords: Concept Drift, Real Drift, Virtual Drift, Unsupervised, Hypothesis Testing

1. Introduction

More and more online systems rely, at least partly, on some form of machine learning model to operate. The widespread integration of Artificial Intelligence based models has its roots in the constant progress made in the field, which enables models to solve increasingly complex tasks well suited to real-world applications. The democratisation of Machine Learning (ML) models allows non-experts to use them to automate repetitive tasks, and is simplified by easy access to and processing of the large quantities of data required to train predictive models. The emergence of cloud computing has also accelerated the industrial use of ML models in production.

However, ML models can be crippled by a wide range of problems that raise serious questions regarding their impact on the safety of systems and their consequences on society. Some models inadvertently induce bias in their predictions [1], such as black-box models that are therefore excluded from applications where explainability is a critical feature, such as loan applications. Moreover, ML models are often poorly adapted to detect out-of-distribution samples [2] that are not classified correctly. Models can then see a drop of performance during the inference phase due to a change in the underlying data distribution [3]. A shift of distribution is referred to as Concept Drift (CD), and its detection is the focus of this paper.

Machine learning models are built under the hypothesis that data seen during the training phase share the same distribution as unseen future data. Concept drift appears when the underlying distribution of a data source changes over time. If the static distribution hypothesis is violated, historic data cannot be used to predict the future, and predictive models see their performances drop. Concept drift can impact every ML domain, including video analysis [4], [5].

In contrast to anomaly detection, where the goal is to isolate a few out-of-distribution samples, concept drift causes a large part of the data to deviate from past distributions. One way to categorize drifts is by their impact on a model's performance: virtual drift describes a distribution change that does not impact a model, while real drift does. Let 𝑋 be a set of variables used to predict the target class vector 𝑦. We distinguish three root causes of concept drift: it may come from a change in the class distribution P(𝑋 | 𝑦), the feature space P(𝑋) or the class priors P(𝑦).
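To make the distinction concrete, consider the following minimal sketch (ours, for illustration only; the data, model and shifts are invented). A classifier whose decision depends only on the first feature keeps its accuracy when an uninformative feature drifts (virtual drift), but degrades when the informative feature shifts across the decision boundary (real drift):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Training data: the class depends only on feature 0; feature 1 is noise.
    X_train = rng.normal(size=(2000, 2))
    y_train = (X_train[:, 0] > 0).astype(int)
    clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

    # Virtual drift: P(X) changes on the uninformative feature only.
    X_virtual = X_train.copy()
    X_virtual[:, 1] = rng.normal(loc=5.0, scale=3.0, size=len(X_virtual))

    # Real drift: shift the informative feature across the decision boundary.
    X_real = X_train.copy()
    X_real[:, 0] = X_real[:, 0] + 2.0

    print("accuracy, no drift:     ", clf.score(X_train, y_train))    # ~1.0
    print("accuracy, virtual drift:", clf.score(X_virtual, y_train))  # ~1.0
    print("accuracy, real drift:   ", clf.score(X_real, y_train))     # drops to ~0.5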
Where the change in distribution is rapid, the drift is said to be abrupt. Drift is incremental when the distribution shifts slowly over time. Recurrent drift is defined as a distribution that oscillates between two or more concepts. Drift detection differs from outlier detection [6], as the goal is to identify and act on a global distribution change, not to remove out-of-distribution samples.

We present RDD: Real-Drift Detector, an unsupervised drift detection method based on a supervised partition of the feature space, aiming to detect local distribution changes that impact a model's performance. RDD works in any number of dimensions. Our detector does not need labels during inference and outperforms the state of the art in a thorough experiment. In Section 2, we present related work and position our paper. The RDD algorithm is described in Section 3. In Section 4, the experimental protocol is presented and the results are discussed. Section 5 concludes this paper.

2. Related Work

Over the last few years, concept drift has become a major field of research in the machine learning community. Focus was mostly aimed at models dealing with drift as well as drift detectors. Recent advances in detecting drift when true class labels are available have led algorithms to achieve almost perfect detection. Other (more realistic) contexts still show room for improvement.

To prevent drift from affecting ML models, several mechanisms have been proposed. Assuming that recent data share the same distribution as upcoming data, one way is to continuously update a pool of models. In [7], a batch of new data is scored by a pool of models, and each individual model's contribution is weighted by its recent performance. At each new batch, a model is trained on it and added to the pool, while a long-term poorly performing one is removed. This ensures good performance in the presence of concept drift and fast adaptation to recurrent drifts. Methods that work on a dynamic pool of models have been thoroughly studied [8], [9].

When detecting a drift, the consensus is that an increase in the error rate of a model signals the presence of concept drift. Detection methods based on that hypothesis have been given a lot of attention, as this methodology is able to systematically detect real drift while consistently ignoring virtual drift. The detection methods presented in [10] and [11] work by monitoring a model's error rate: a drop of performance is interpreted as the presence of drift, and different statistical tests are used to monitor the error rate and signal drifts.
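As an illustration of this family of detectors, the sketch below monitors a stream of 0/1 prediction errors and raises an alarm when the recent error rate exceeds the historical one by several standard deviations. This is a simplified, generic monitor in the spirit of these methods, not an implementation of [10] or [11]; the window size and the three-sigma threshold are our own choices.

    import math

    class ErrorRateMonitor:
        """Flag drift when the recent error rate is far above the historical one."""

        def __init__(self, window: int = 100, n_sigma: float = 3.0):
            self.window = window
            self.n_sigma = n_sigma
            self.history: list[int] = []  # 0 = correct prediction, 1 = error

        def update(self, error: int) -> bool:
            self.history.append(error)
            if len(self.history) < 2 * self.window:
                return False  # not enough data yet
            past = self.history[:-self.window]
            recent = self.history[-self.window:]
            p = sum(past) / len(past)                   # historical error rate
            std = math.sqrt(p * (1 - p) / self.window)  # std of a window mean
            return sum(recent) / self.window > p + self.n_sigma * std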
Both updating a pool of models and monitoring the error rate deal with drift effectively. As virtual drift does not impact accuracy, it is systematically ignored. However, in order to work, both approaches need access to the true labels immediately after the inference phase. This is not realistic in real-world scenarios, where true class labels are almost never available promptly and are sometimes never known. To address this issue, several ways of dealing with drift in an unsupervised manner have been studied.

To detect drift when class labels are unavailable, window-based techniques have been studied. The authors of [12] introduce ADWIN, an adaptive sliding window algorithm. It works by keeping a reference window containing past instances. The window widens when no change is detected, while its size decreases rapidly in the presence of drift. The detection mechanism works by repeatedly splitting the window, based on time of appearance, into two smaller sets. A drift is detected when the averages of the values of the two sets are statistically different. Several other window-based detectors have been presented since, such as [13].

To avoid detecting drift over one-dimensional sliding windows, [14] detects swift or gradual changes in data values with minimum enclosing balls. A ball is defined by a centroid and the minimum radius that encloses all of the samples in the ball. A drift is detected when too many values are labelled as outliers, in which case the centroid is updated.

To circumvent the unavailability of true labels, the authors of [15] trained a model to distinguish past data from recent data. All timestamp-like aggregates are removed prior to training the model to prevent trivial identification. The ability of the model is assessed using the AUC metric, with a threshold of 0.75 in their paper. Also using time to find drift, the authors of [16] include the timestamp attribute in the observations and train a model on past and recent data to predict the target variable. If the timestamp attribute is an informative feature, the target variable depends on time and the presence of a drift is assumed.

The authors of [17] present another way of detecting drift in an unsupervised manner. A first model (teacher) is trained on past labelled data, and a second model (student) is trained to mimic the behavior of the teacher model. During the inference phase, the authors monitor the error rate of the student model and use [12] to trigger an alarm. In [18], drift detection in an unsupervised image classification context is studied. The authors first apply a dimension reduction technique before using a two-sample test to find drift. A number of dimension reduction techniques and two-sample tests (including MMD [19] and KS) are evaluated in an extensive study of different types of shift applied to images. In [20], the authors incorporate the target class in the dimension reduction mechanism, enabling the detector to ignore virtual drift.

The idea behind concept drift detection by statistical tests is that a distribution change is a strong indicator of drift [14], [18]. A distribution change enables the detection of drift, but does not discriminate between real drift and virtual drift.
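The generic recipe of [18] can be sketched as follows: reduce the dimension, then run a two-sample test per component. This is an illustration under our own choices (PCA, per-component KS tests with a Bonferroni correction), not the cited implementation:

    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.decomposition import PCA

    def two_sample_drift_test(X_ref, X_new, n_components=5, alpha=0.01):
        """KS two-sample test on PCA-reduced data; True means a shift was found."""
        pca = PCA(n_components=n_components).fit(X_ref)
        R, N = pca.transform(X_ref), pca.transform(X_new)
        # Bonferroni-corrected per-component tests.
        p_values = [ks_2samp(R[:, j], N[:, j]).pvalue for j in range(n_components)]
        return min(p_values) < alpha / n_components

Such a test fires on any distribution change, whether or not it harms the model, which is exactly the limitation discussed above.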
To the best of our knowledge, few algorithms are able to discriminate between real drift and virtual drift without access to true labels after detection. In this paper we introduce RDD, a detector that does not need true class labels to operate during the inference phase. RDD works in any dimension and successfully discriminates between real and virtual drift.

3. RDD

3.1. Our detector

The idea behind our model is that a real drift changes the distribution of the regions made by a class-dependent partitioning, while virtual drift does not. We use a decision tree to partition the feature space into regions of homogeneous class labels.

Our intuition is that a data distribution change in a leaf between the training phase and the inference phase indicates a drift. It is our assumption that a real drift is likely to change which region the observations are assigned to. Such misplaced samples are likely to have a different distribution than that of the training observations. A distribution change leading to virtual drift is unlikely to be seen locally, as it affects less the way observations are distributed in leaves.

In order to better discriminate real drift from virtual drift, each region is attributed a weight. The weights represent the ability of a given region to correctly assign a label to an observation. A region that only contains a single class of observations will have a large weight, while a region that cannot well separate samples on their label will have a small weight. This is done to reduce the risk of a sudden class imbalance being detected as a distribution change. Our model signals a drift when enough regions flag their distribution as changing.

3.2. Mathematical Background

During the initialization step, a decision tree classifier (ℳ) is fit over the training data. Like most drift detection algorithms, we assume the training data is sampled from one single concept; we do not consider the training data to include past concepts that might offset the detector. For both the training (𝑇) and inference (𝐼) sets, we consider the variables to follow a normal distribution. This hypothesis is required to test for homoscedasticity later on. After discarding all leaves that contain too few samples or that are not pure enough, we store, for each leaf, the training instances that belong to it. Let 𝑇ₖ, 𝐼ₖ be the training and inference data within leaf 𝑘 of class 𝑐. For the test to be significant [21], we remove all leaves of the decision tree containing fewer than 𝜈 observations. We set 𝜈 = 20. We thus have ∀𝑘, min(|𝑇ₖ|, |𝐼ₖ|) ≥ 𝜈, as well as 𝑌_𝑇ₖ = 𝑌̂_𝐼ₖ = 𝑐.

By construction, the leaves of a decision tree do not all hold the same separation power over class labels. The intuition is that leaves containing pure class labels should be less affected by a P(𝑦 | 𝑋) concept change, as they are generally further away from the decision boundary. On the contrary, impure leaves are more likely to experience a distribution change due to a P(𝑦) drift, or to be subject to misclassifications that may impact the inference distribution. During the initialization step, leaves that cannot well separate class labels are removed: all leaves with more than 20% impurity are dropped. During early experimental runs, we found that setting a low maximum impurity percentage yields very few leaves with enough samples to conduct the statistical test. We also found that setting a high value prevents us from confidently assigning a class to a leaf. We set the maximum impurity value at 20% as it offers a good compromise, although this value could be changed based on the data at hand. In an effort to rank the remaining leaves, we attribute to each leaf a weight which corresponds to the leaf's purity during training.

Weights capture the separation power of a leaf. If a given observation is classified at a leaf with a high weight, we may expect the probability of this observation being misclassified to be low. On the other hand, a leaf with a lower weight is more susceptible to assigning the wrong label to an observation. Our goal in ranking the leaves by their predictive power is to help our detector ignore virtual drift.
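Under our reading of this initialization step, a minimal sketch could look as follows (the helper name init_rdd and the use of scikit-learn are ours; the paper does not prescribe an implementation):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def init_rdd(X_train, y_train, nu=20, max_impurity=0.20):
        """Fit the partitioning tree, keep pure and populated leaves,
        and weight each kept leaf by its training purity.
        Assumes integer class labels in y_train."""
        tree = DecisionTreeClassifier(min_samples_leaf=nu).fit(X_train, y_train)
        leaf_ids = tree.apply(X_train)  # leaf index of each training sample

        leaves, weights = {}, {}
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            labels = np.asarray(y_train)[mask]
            purity = np.bincount(labels).max() / len(labels)
            if len(labels) >= nu and 1 - purity <= max_impurity:
                leaves[leaf] = X_train[mask]   # training instances per leaf
                weights[leaf] = purity         # leaf weight = training purity
        return tree, leaves, weights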
The detection step is detailed in Algorithm 1. In lines 1-5, test set observations are attributed to 𝐿_test based on the leaf they fall in. In line 6, we go over each leaf that contains test data; in line 7, we initialize the drift features variable DF, which tracks the number of drifted features. In line 8, the number of observations at a leaf is checked. In lines 9 through 13, for all leaves containing enough test instances, we conduct a Levene test of variance equality on all dimensions between the inference and training sets contained in a leaf. We choose the Levene test here as it is adequate when the data distribution may slightly deviate from the normal one. Of course, other tests could be used, based on knowledge of the underlying data distribution, to improve the detector's performance (when the data distribution is strictly normal, Bartlett's test should be used; the Brown–Forsythe test may also be an alternative when the data does not follow a normal distribution). We did not make any assumptions on the distribution and independence of variables; this will be the focus of future work.

In lines 14 through 18, leaves are classified as drifting if the ratio of features that fail the homoscedasticity test exceeds the user-defined 𝛾 threshold. In line 21, the algorithm flags a drift if the weighted average of the leaves' drift labels exceeds the user-defined 𝛽 threshold.

Algorithm 1 RDD - Inference
Inputs:
- ℳ : Trained Decision Tree Model
- 𝐿_train : Dictionary of leaves mapping to the training instances
- 𝐼 ∈ ℝ^(d×m) : Test set with d features and m samples
- 𝑊 : Leaves weights
- 𝑑 : Number of variables in the dataset
Parameters:
- 𝛼 : Hypothesis rejection risk
- 𝜈 : Minimum number of observations in a leaf
- 𝛾 : Minimum ratio of ℋ₀ rejections for a leaf to drift
- 𝛽 : Minimum ratio of drifting leaves to trigger an alarm
Variables:
- 𝐿_test : Dictionary of leaves mapping to the test instances in those leaves
- 𝐷𝐿 : Drift status of leaves
- 𝐷𝐹 : Number of features that drift within a leaf

 1: for 𝑖 ∈ 𝐼 do
 2:   if ℳ(𝑖) ∈ 𝐿_train then
 3:     𝐿_test[ℳ(𝑖)] ← 𝐿_test[ℳ(𝑖)] + 𝑖
 4:   end if
 5: end for
 6: for 𝑖 ∈ 𝐿_test do
 7:   𝐷𝐹 = 0
 8:   if |𝐿_test[𝑖]| ≥ 𝜈 then
 9:     for 𝑗 ∈ [0, 𝑑−1] do
10:       if 𝐻_𝛼 : 𝜎_(𝐿_train[𝑖][𝑗]) ≠ 𝜎_(𝐿_test[𝑖][𝑗]) then
11:         𝐷𝐹 = 𝐷𝐹 + 1
12:       end if
13:     end for
14:     if 𝐷𝐹/𝑑 ≥ 𝛾 then
15:       𝐷𝐿[𝑖] = 1
16:     else
17:       𝐷𝐿[𝑖] = 0
18:     end if
19:   end if
20: end for
21: Return Σ_(𝑖∈𝐿_test) 𝑊ᵢ · 𝐷𝐿[𝑖] ≥ 𝛽
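Continuing the hypothetical init_rdd helper above, the detection step of Algorithm 1 can be sketched as follows (scipy's Levene test stands in for the per-feature homoscedasticity check; interpreting line 21 as a weighted average of drift labels is our reading of the text):

    from scipy.stats import levene

    def rdd_detect(tree, leaves, weights, X_test,
                   alpha=0.01, nu=20, gamma=0.3, beta=0.3):
        """Return True if the weighted ratio of drifting leaves exceeds beta."""
        leaf_ids = tree.apply(X_test)
        drifted, total = 0.0, 0.0
        for leaf, X_ref in leaves.items():
            X_new = X_test[leaf_ids == leaf]  # test instances in this leaf
            if len(X_new) < nu:
                continue
            # Count features whose variance differs between train and test
            # (H0 of equal variances rejected at level alpha).
            df = sum(levene(X_ref[:, j], X_new[:, j]).pvalue < alpha
                     for j in range(X_ref.shape[1]))
            drifted += weights[leaf] * (df / X_ref.shape[1] >= gamma)
            total += weights[leaf]
        return total > 0 and drifted / total >= beta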
3.3. Hyper-parameter discussion

The first hyper-parameter is 𝛼. Setting a low value for the 𝛼 parameter reduces the risk of a type I error, which, in our case, means indicating drift when there is none. The 𝛼 parameter was set to 0.01.

The 𝛾 parameter is the minimum ratio of features that must reject the equal variance hypothesis. A low 𝛾 parameter means leaves will be considered as drifting if only a few features present a shift in variance (i.e. the detector will be sensitive). The 𝛽 parameter is the minimum ratio of drifted leaves required to signal that a drift has taken place.

In order to set relevant 𝛽 and 𝛾 values for our detector, we conducted a hyper-parameter search on three datasets (airlines, poker and weather). We excluded those datasets from the experimental study to prevent bias. In Figure 1 we plot the influence of both 𝛽 and 𝛾 on the TP rate, the TN rate and the H score. The H score is detailed in Section 4.

[Figure 1: Influence of the 𝛽 and 𝛾 parameters on the True Positive rate, True Negative rate and H score. On the 𝛽 plot, 𝛾 = 0.3, and on the 𝛾 plot, 𝛽 = 0.3. The graph confirms our intuition that low values tend to flag virtual drift as real drift, while high values cause the detector not to detect any drift.]

4. Experiment

In this experiment we assess our method's ability to detect drift while ignoring virtual drift. We extensively tested our method against a wide panel of state-of-the-art detectors on an extensive set of both real and synthetic datasets. The benchmarks used in this section are the standard ones when testing drift detectors [20], [22], [23].

4.1. Experimental Setup

The usual procedure to test algorithms suited to handle drift, when true class labels are available after inference, is the test-then-train approach. A model predicts the class on a batch of samples; then the true class is revealed and the model updates itself. The global prediction accuracy is then used to rank models.

This setup is not suited to models that do not rely on true label availability. In most datasets used to benchmark drift handling methods, the presence of drift is only assumed, or artificially introduced by sorting the observations on an attribute. To the best of our knowledge, the exact occurrence of drift is unknown for all the usual datasets. The experimental setup described below allows us to know the exact drift occurrence and to evaluate the effect it has on a model's accuracy.

The goal of the experiment is to assess the performance of detectors on real drift and virtual drift. Two distinct perturbations are used to change the dataset: the Step Drift, where a subset of the features are shuffled, and the Noise Drift, where Gaussian 𝒩(1, 1) noise is added to a subset of features (Gaussian noise with mean equal to 1 is used to change the mean of the distribution, not to obfuscate the signal). The idea is to artificially generate real drift and virtual drift. To create virtual drift, we apply one of the two perturbations to the 25% least important features with respect to their predictive power on the class labels. In doing so, we expect that changing the distribution of several features will not affect a predictive model's performance. To create real drift, we modify the 25% most informative features by adding one of the two perturbations. The intuition is that a change of distribution on the most important features is likely to cause a drop of performance in a predictive model. To find the 25% most and least important features, we train a Random Forest Classifier over the training data. We choose this model as it is a robust, widely used model that achieves a good level of performance on the datasets.
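A sketch of this drift-generation protocol, as we understand it (the helper make_drift_sets and its defaults are ours):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def make_drift_sets(X_train, y_train, X_drift, seed=0):
        """Create the four perturbed copies of the drift set:
        {step, noise} x {least, most} important features."""
        rng = np.random.default_rng(seed)
        forest = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
        q = max(1, X_drift.shape[1] // 4)  # 25% of the features
        order = np.argsort(forest.feature_importances_)
        subsets = {"least": order[:q], "most": order[-q:]}

        out = {}
        for name, cols in subsets.items():
            step = X_drift.copy()   # Step Drift: shuffle the selected features
            for c in cols:
                step[:, c] = rng.permutation(step[:, c])
            noise = X_drift.copy()  # Noise Drift: add N(1, 1) noise
            noise[:, cols] = noise[:, cols] + rng.normal(1.0, 1.0, size=(len(noise), q))
            out[("step", name)], out[("noise", name)] = step, noise
        return out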
For each dataset, we introduce the two perturbations on the two sets of features, thus creating four distinct drift sets.

In order to have stationary non-drifting data before adding our generated drifts, we first randomly shuffle the observations. Each dataset is then partitioned into three: a train set, a validation set and a drift set. Four distinct copies of the drift set are independently modified with the four perturbations described above. In order to assess whether a drift is virtual or real, we fit a Random Forest Classifier on the train set before reporting its accuracy on the train set, the validation set and the four drift sets. The drop of the model's accuracy between the different sets is used to classify drift as virtual or real: if the difference in accuracy between the validation set and the training set is lower than that between the validation set and the drift set, we consider the induced drift to be real; otherwise, it is considered a virtual drift.

Table 1 briefly describes the datasets used in the experiment. The dataset dimensions range from 11 on Hyperplane to 500 on Spam. The classification task is binary on 10 datasets and multi-class on 3. This ensures that RDD is tested in a variety of scenarios.

Table 1: Overview of the datasets used in our experiment. All but one RW dataset are binary classification problems. Two RW datasets contain more than 100 features. For the synthetic datasets, we limit the number of generated observations to 10 000.

Dataset      Dimensions      Classification
Adult        (48842, 66)     Binary
Bank         (45211, 49)     Binary
Cov          (110393, 51)    Multi-class (7)
Digits08     (1499, 17)      Binary
Digits17     (1557, 17)      Binary
Elec         (45312, 15)     Binary
Musk         (6598, 167)     Binary
Phishing     (11055, 47)     Binary
Spam         (6213, 500)     Binary
Wine         (6497, 13)      Binary
Hyperplane   (10000, 11)     Binary
LED          (10000, 26)     Multi-class (10)
Waveform     (10000, 41)     Multi-class (3)

In Table 2 we report the average accuracies of the Random Forest Classifier over the training, validation and the four drift sets. Changing the most important features generates real drift, while changing the least important features creates virtual drift regardless of the perturbation. There are three exceptions: when noise is added to the least important features, real drift is produced on the Musk dataset; when corrupting the most important features, the step perturbation produces virtual drift on the Hyperplane dataset, while the noise perturbation yields virtual drift on the Waveform dataset.

Table 2: Accuracy of a Random Forest Classifier over the training, validation and drift sets. Adding the step perturbation to the least informative features (LS) leads to virtual drift on all datasets. Adding noise to the least informative features (LN) also leads to virtual drift, except on the Musk dataset. When those perturbations are made on the most informative features (MN and MS), they lead to real drift across all real datasets.

         Train   Val.   LN    LS    MN    MS
Adult    1.      .85    .85   .85   .28   .59
Bank     1.      .94    .92   .94   .52   .52
Cov      .99     .85    .84   .84   .49   .46
D08      1.      1.     .99   .99   .77   .69
D17      1.      .99    1.    1.    .54   .80
Elec     1.      .89    .87   .87   .57   .62
Musk     1.      .98    .94   .97   .51   .56
Phis.    .99     .97    .96   .96   .69   .47
Spam     1.      .98    .98   .98   .58   .59
Wine     1.      1.     1.    1.    .63   .44
Hyp.     1.      .87    .87   .85   .71   .87
LED      1.      1.     1.    1.    .58   .31
Wav.     1.      .85    .85   .85   .72   .41

In an effort to aggregate both the True Positives and the True Negatives into one metric, we make use of the H metric (1) defined in [20]. Since we conduct our experiment in batch mode, we removed the impact of the detection delay, defined as the number of drift samples processed before signaling a drift. The Drift Accuracy (D̂A) is a binary value that assesses the correctness of the detection: it is equal to 1 when a virtual drift is ignored or when a real drift is detected.

    H = 2 · (D̂A · TN) / (D̂A + TN)    (1)
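For instance, a detector that catches every real drift (D̂A = 1) but raises false alarms on half of the virtual-drift runs (TN = 0.5) gets H = 2 · 0.5 / 1.5 ≈ 0.67. As a one-line helper (ours, for illustration):

    def h_score(da: float, tn: float) -> float:
        """Harmonic mean of drift accuracy and true-negative rate, eq. (1)."""
        return 2 * da * tn / (da + tn) if (da + tn) else 0.0

    print(h_score(1.0, 0.5))  # 0.666...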
We evaluate RDD with 𝛼 = 0.01, 𝛽 = 0.3, 𝛾 = 0.3 against:

• ADWIN [12] with 𝛿 = 0.7
• Discriminative Drift Detector (D3) [15]
• Kolmogorov-Smirnov (KS) distribution test detector; we used the implementation of [24]
• Maximum Mean Discrepancy (MMD) [19]; we used the implementation of [24]
• Student-Teacher (ST) [17]
• Task Sensitive Drift Detector (TSDD) [20]

All detectors were used with their default parameter values unless specified otherwise. For the sake of readability, we only highlight the best results in Table 7.

4.2. Virtual Drift

In Table 3 we present the detections made on virtual drift induced by a Step corruption of the least important features. Of the 7 detectors evaluated, 3 are able to consistently ignore this type of virtual drift: TSDD, ST and RDD. TSDD comes first, with no detections on real-world (RW) datasets and almost none on synthetic data. RDD makes no False Positives (FP) on 7 RW datasets and on all synthetic datasets; on 3 RW datasets, the FP rate is very low (0.1). The Student-Teacher detector produces no FP on 8 RW datasets and on 2 synthetic ones; however, on 2 RW datasets its FP rate is high, at around 0.5. ADWIN, along with the statistical-test-based KS and MMD, fails to ignore virtual drift on all but 1 RW dataset. D3 does slightly better, ignoring virtual drift on 3 RW datasets. On synthetic data, relatively few FP are made by those 4 detectors.

Table 3: Least Important Step Drift: detection results on virtual drift when the least important features are shuffled. The lower the detection ratio, the better the detector. Our model RDD comes second, slightly outperformed by TSDD. ST comes third, with relatively few wrong detections in comparison to the other detectors, which wrongfully detect virtual drift.

        Adult  Bank  Cov  D08  D17  Elec  Musk  Phish.  Spam  Wine  Hyp.  LED  Wav.
RDD     0.0    0.0   0.0  0.1  0.1  0.0   0.0   0.0     0.0   0.1   0.0   0.0  0.0
ADWIN   1.0    1.0   1.0  1.0  1.0  0.0   1.0   1.0     1.0   1.0   0.0   0.1  0.1
D3      0.0    0.5   0.0  0.9  1.0  0.0   1.0   1.0     1.0   1.0   0.0   0.0  0.0
KS      0.9    1.0   0.0  1.0  1.0  0.0   1.0   1.0     1.0   1.0   0.0   1.0  0.1
MMD     1.0    1.0   0.3  1.0  1.0  1.0   1.0   1.0     1.0   1.0   0.2   0.2  0.6
ST      0.0    0.4   0.0  0.0  0.0  0.0   0.6   0.0     0.0   0.0   0.0   0.2  0.0
TSDD    0.0    0.0   0.0  0.0  0.0  0.0   0.0   0.0     0.0   0.0   0.0   0.1  0.0

Table 4 shows the detection rates when adding noise to the least important features. On the Musk dataset, this type of perturbation produces real drift, and detections there are therefore considered True Positives (TP). ADWIN, along with D3, KS and MMD, which were not specifically built to handle virtual drift, systematically and wrongfully detect drift across all real and synthetic datasets. RDD flags virtual drift on the Digits 08 dataset in 4 runs out of 10; virtual drift is otherwise ignored by RDD. ST and TSDD fail to ignore the virtual drift on 2 RW datasets.

Table 4: Least Important Noise Drift: detection results on virtual drift when noise is added to the least important features. The lower the detection ratio, the better the detector. Our model RDD comes first, with almost no false detections. ST and TSDD take second and third place with 2 false detections each, while the other detectors consistently detect virtual drift.

        Adult  Bank  Cov  D08  D17  Elec  Musk  Phish.  Spam  Wine  Hyp.  LED  Wav.
RDD     0.0    0.0   0.0  0.4  0.1  0.0   0.0   0.0     0.0   0.2   0.0   0.0  0.0
ADWIN   1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
D3      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
KS      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
MMD     1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
ST      0.0    1.0   0.2  0.0  0.1  0.0   0.9   0.0     0.7   0.0   0.0   0.0  0.0
TSDD    0.0    0.0   0.0  0.0  0.0  0.0   0.6   0.0     0.7   1.0   0.0   0.1  0.0
4.3. Real Drift

In Table 5 we observe real drift induced by a Step drift on the most informative features. The Hyperplane dataset exhibits virtual drift under this corruption, and low values there should be regarded as TN. ADWIN, D3, KS and MMD, which all exhibit poor performance on virtual drift, now achieve almost perfect detection on RW datasets. However, D3 and ADWIN fail to detect real drift on synthetic data. Our method systematically detects real drift on 4 RW datasets and achieves good levels of detection on 3 others; drift is detected on 50% of the runs on the Phishing dataset. The ST model achieves 4 perfect detections on RW datasets and good levels of detection on 3 others. The TSDD detector yields poor performance, detecting only 2 drifts out of all RW datasets. On the 2 synthetic datasets with real drift, ADWIN, D3 and RDD fail to detect the drift, while TSDD detects 1; the KS, MMD and ST detectors succeed in their detection.

Table 5: Most Important Step Drift: detection results on real drift when the most important features are shuffled. The higher the detection ratio, the better the model. MMD takes first place, followed by KS, ADWIN and D3, closely followed by our detector RDD and ST. TSDD outputs false negatives on all but 3 datasets.

        Adult  Bank  Cov  D08  D17  Elec  Musk  Phish.  Spam  Wine  Hyp.  LED  Wav.
RDD     0.9    0.8   0.0  0.9  1.0  1.0   1.0   0.5     0.2   1.0   0.0   0.0  0.0
ADWIN   1.0    1.0   1.0  0.8  1.0  0.0   1.0   1.0     1.0   1.0   0.0   0.2  0.2
D3      1.0    1.0   0.0  0.8  1.0  0.0   1.0   1.0     1.0   1.0   0.0   0.0  0.0
KS      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   0.2   1.0  0.9
MMD     1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   0.2   1.0  1.0
ST      0.7    0.6   1.0  0.1  0.6  0.3   1.0   1.0     1.0   0.0   0.0   0.9  0.8
TSDD    0.0    0.0   0.0  0.6  0.0  0.0   0.0   0.0     0.7   0.2   0.0   1.0  0.0

The detection results for noise on the most important features are shown in Table 6. ADWIN, D3, KS and MMD achieve perfect detection across both real and synthetic datasets. RDD's detection results exceed those of ST and TSDD, with 7 perfect detections on RW datasets. TSDD and ST are tied, with 5 accurate detections each on RW datasets. Only our detector and TSDD ignore the virtual drift on the Waveform dataset. ST and TSDD outperform RDD with one perfect detection on the virtual-drift synthetic datasets.

Table 6: Most Important Noise Drift: detection results on real drift when noise is added to the most important features. The higher the detection ratio, the better the model. ADWIN, D3, KS and MMD achieve perfect detection across all datasets exhibiting real drift. Our model RDD achieves perfect detection on 7 datasets, outperforming TSDD and ST.

        Adult  Bank  Cov  D08  D17  Elec  Musk  Phish.  Spam  Wine  Hyp.  LED  Wav.
RDD     1.0    0.9   0.0  1.0  1.0  0.5   1.0   1.0     1.0   1.0   0.0   0.5  0.0
ADWIN   1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
D3      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
KS      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
MMD     1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
ST      0.0    1.0   1.0  0.1  0.0  0.8   0.3   1.0     1.0   0.0   0.0   1.0  0.9
TSDD    1.0    1.0   0.0  0.1  0.4  0.0   0.8   0.0     0.6   1.0   0.0   1.0  0.0

4.4. Overall performances

In Table 7, we report the combined true positive and true negative results, aggregated with (1). This table showcases the overall performance achieved by each detector on each dataset. Because on the Musk, Hyperplane and Waveform datasets the drift induction generates either more virtual or more real drift, the TN score has a varying impact.

Table 7 allows us to assess the ability of a model to ignore virtual drift while detecting real drift. RDD yields the best H scores on 7 RW datasets and ties for first place on 2 synthetic ones. TSDD takes second place, with the highest score on 1 RW dataset but coming first or tying for first place on all synthetic datasets. ST comes third, with the highest H scores on 2 RW datasets and tying for first place on 1 synthetic dataset. On RW datasets, we see that ADWIN, D3, KS and MMD have overall low scores due to their misclassification of virtual drift, despite having detected all real drifts. On synthetic datasets, their score is better, as they did not make too many misclassifications when a step drift was induced on the least informative features.

Table 7: Harmonic Mean: the results of Tables 3 through 6 aggregated into one metric. The higher the metric, the better the detector is at both ignoring virtual drift and detecting real drift. Our model RDD comes first, with the highest score on 7 out of 10 RW datasets and on 2 of the 3 synthetic datasets. TSDD comes second and ST third; the two models have lower scores than RDD because some real drift is ignored. The ADWIN, D3, KS and MMD detectors do not yield high scores because of their inability to ignore virtual drift.

        Adult  Bank  Cov   D08   D17   Elec  Musk  Phish.  Spam  Wine  Hyp.  LED   Wav.
RDD     0.99   0.96  0.67  0.80  0.92  0.93  0.86  0.93    0.89  0.89  0.86  0.77  0.86
ADWIN   0.00   0.00  0.00  0.00  0.00  0.50  0.00  0.00    0.00  0.00  0.71  0.48  0.29
D3      0.60   0.36  0.50  0.09  0.00  0.50  0.00  0.00    0.00  0.00  0.71  0.50  0.29
KS      0.09   0.00  0.60  0.00  0.00  0.60  0.00  0.00    0.00  0.00  0.65  0.00  0.36
MMD     0.00   0.00  0.46  0.00  0.00  0.00  0.00  0.00    0.00  0.00  0.59  0.51  0.19
ST      0.81   0.39  0.92  0.71  0.75  0.87  0.50  1.00    0.73  0.67  0.86  0.91  0.71
TSDD    0.86   0.86  0.67  0.81  0.75  0.67  0.75  0.67    0.65  0.52  0.86  0.92  0.86

5. Conclusion

In this paper we introduced RDD, a drift detector that does not need ground-truth labels during the inference phase.
We extensively challenged our algorithm against a number of state-of-the-art drift detectors over a large panel of both real and synthetic datasets. We experimentally showed that our method outperforms current drift detection methods, demonstrating our detector's ability to detect real drift while ignoring virtual drift. Since false alarms are the main reason why drift detectors are not widely used in production, this demonstrates the usability of our detector for real-world applications. We tuned the hyper-parameters on 3 datasets not used in the experimental study and showed that they remain valid in a wide range of real-world scenarios, so that little tuning effort should be needed when using the model in production. We also demonstrated the ability of RDD to work in any dimension, having the best detection accuracy on both datasets with over 100 features.

Future work will consist of further modeling the partition space. Research will also address how a drift detector can be initialized in recurrent concept drift scenarios, when no stationary dataset can be used to initialize a detector.

References

[1] A.-K. Reuel, M. Koren, A. Corso, M. J. Kochenderfer, Using adaptive stress testing to identify paths to ethical dilemmas in autonomous systems, in: SafeAI@AAAI, 2022.
[2] H. Huang, Z. Li, L. Wang, S. Chen, B. Dong, X. Zhou, Feature space singularity for out-of-distribution detection, arXiv preprint arXiv:2011.14654 (2020).
[3] M. G. Kelly, D. J. Hand, N. M. Adams, The impact of changing populations on classifier performance, in: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999, pp. 367–371.
[4] I. A. Nikolov, M. P. Philipsen, J. Liu, J. V. Dueholm, A. S. Johansen, K. Nasrollahi, T. B. Moeslund, Seasons in drift: A long-term thermal imaging dataset for studying concept drift, in: Thirty-fifth Conference on Neural Information Processing Systems, 2021.
[5] A. Suprem, J. Arulraj, C. Pu, J. Ferreira, Odin: Automated drift detection and recovery in video analytics, arXiv preprint arXiv:2009.05440 (2020).
[6] R. Kamoi, K. Kobayashi, Out-of-distribution detection with likelihoods assigned by deep generative models using multimodal prior distributions, in: SafeAI@AAAI, 2020, pp. 113–116.
[7] R. Elwell, R. Polikar, Incremental learning of concept drift in nonstationary environments, IEEE Transactions on Neural Networks 22 (2011) 1517–1531.
[8] D. Brzezinski, J. Stefanowski, Reacting to different types of concept drift: The accuracy updated ensemble algorithm, IEEE Transactions on Neural Networks and Learning Systems 25 (2013) 81–94.
[9] J. Z. Kolter, M. A. Maloof, Dynamic weighted majority: An ensemble method for drifting concepts, The Journal of Machine Learning Research 8 (2007) 2755–2790.
[10] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavalda, R. Morales-Bueno, Early drift detection method, in: Fourth international workshop on knowledge discovery from data streams, volume 6, 2006, pp. 77–86.
[11] D. R. de Lima Cabral, R. S. M. de Barros, Concept drift detection based on Fisher's exact test, Information Sciences 442 (2018) 220–234.
[12] A. Bifet, R. Gavalda, Learning from time-changing data with adaptive windowing, in: Proceedings of the 2007 SIAM international conference on data mining, SIAM, 2007, pp. 443–448.
[13] C. Raab, M. Heusinger, F.-M. Schleif, Reactive soft prototype computing for concept drift streams, Neurocomputing 416 (2020) 340–351.
[14] M. Heusinger, F.-M. Schleif, Reactive concept drift detection using coresets over sliding windows, in: 2020 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE, 2020, pp. 1350–1355.
[15] Ö. Gözüaçık, A. Büyükçakır, H. Bonab, F. Can, Unsupervised concept drift detection with a discriminative classifier, in: Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 2365–2368.
[16] M. Black, R. Hickey, Learning classification rules for telecom customer call data under concept drift, Soft Computing 8 (2003) 102–108.
[17] V. Cerqueira, H. M. Gomes, A. Bifet, Unsupervised concept drift detection using a student–teacher approach, in: International Conference on Discovery Science, Springer, 2020, pp. 190–204.
[18] S. Rabanser, S. Günnemann, Z. Lipton, Failing loudly: An empirical study of methods for detecting dataset shift, Advances in Neural Information Processing Systems 32 (2019).
[19] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test, The Journal of Machine Learning Research 13 (2012) 723–773.
[20] A. Castellani, S. Schmitt, B. Hammer, Task-sensitive concept drift detector with constraint embedding, in: 2021 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE, 2021, pp. 01–08.
[21] C. C. Serdar, M. Cihan, D. Yücel, M. A. Serdar, Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies, Biochemia medica 31 (2021) 27–53.
[22] T. S. Sethi, M. Kantardzic, On the reliable detection of concept drift from streaming unlabeled data, Expert Systems with Applications 82 (2017) 77–99.
[23] T. S. Sethi, M. Kantardzic, E. Arabmakki, Monitoring classification blindspots to detect drifts from unlabeled data, in: 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), IEEE, 2016, pp. 142–151.
[24] A. Van Looveren, J. Klaise, G. Vacanti, O. Cobb, A. Scillitoe, R. Samoilescu, A. Athorne, Alibi Detect: Algorithms for outlier, adversarial and drift detection, 2019. URL: https://github.com/SeldonIO/alibi-detect.