A Robust Drift Detection Algorithm with High Accuracy and Low False Positives Rate

Maxime Fuccellaro¹,², Laurent Simon¹ and Akka Zemmari¹
¹ University of Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, 351 cours de la Libération, F-33405 Talence
² Mangrove, 87 Avenue des Aygalades, 13015 Marseille
Contact: maxime.fuccellaro@u-bordeaux.fr (M. Fuccellaro); laurent.simon@u-bordeaux.fr (L. Simon); akka.zemmari@u-bordeaux.fr (A. Zemmari)
Proceedings of the Workshop on Artificial Intelligence Safety 2023

Abstract
The number of decision-making processes that rely on machine learning models to operate has been increasing in recent years. The safety of those systems is compromised when models deviate from their expected behavior. One root cause is a shift in the underlying data distribution, known as concept drift. A direct consequence of concept drift is a rapid drop in a model's predictive power. Accurate detection of drift is essential, as false alarms lead to unnecessary downtime and undermine confidence in the drift detection model. This paper introduces the Real-Drift Detector (RDD), a drift detector that is not triggered by virtual drift. RDD does not need class labels during the inference phase to operate. Our detector outperformed the state of the art in an extensive benchmark on a large panel of well-known datasets used in drift detection.

Keywords: Concept Drift, Real Drift, Virtual Drift, Unsupervised, Hypothesis Testing

1. Introduction

More and more online systems rely, at least partly, on some form of machine learning model to operate. The widespread integration of Artificial Intelligence based models has its roots in the constant progress made in the field, which enables models to solve increasingly complex tasks well suited to real-world applications. The democratisation of Machine Learning (ML) models allows non-experts to use them to automate repetitive tasks, and is simplified by easy access to and processing of the large quantities of data required to train predictive models. The emergence of cloud computing has also accelerated the industrial use of ML models in production.

However, ML models can be crippled by a wide range of problems that raise serious questions regarding their impact on the safety of systems and their consequences on society. Some models inadvertently induce bias in their predictions [1], such as black-box models that are therefore excluded from applications where explainability is a critical feature, such as loan applications. Moreover, ML models are often poorly adapted to detect out-of-distribution samples [2] that are not classified correctly. Models can then see a drop of performance during the inference phase due to a change in the underlying data distribution [3]. A shift of distribution is referred to as Concept Drift (CD), and its detection is the focus of this paper.

Machine learning models are built under the hypothesis that data seen during the training phase share the same distribution as unseen future data. Concept drift appears when the underlying distribution of a data source changes over time. If the static distribution hypothesis is violated, historic data cannot be used to predict the future, and predictive models see their performances drop. Concept drift can impact every ML domain, including video analysis [4], [5].

In contrast to anomaly detection, where the goal is to isolate a few out-of-distribution samples, concept drift causes a large part of the data to deviate from past distributions. One way to categorize drifts is by their impact on a model's performance: virtual drift describes a distribution change that does not impact a model, while real drift does. Let 𝑋 be a set of variables used to predict the target class vector 𝑦. We distinguish three root causes of concept drift: it may come from a change in the class distribution P(𝑋 | 𝑦), the feature space P(𝑋) or the class priors P(𝑦).
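To make the distinction concrete, consider the following minimal sketch (ours, for illustration only; the data, model and shifts are invented). A classifier whose decision depends only on the first feature keeps its accuracy when an uninformative feature drifts (virtual drift), but degrades when the informative feature shifts across the decision boundary (real drift):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Training data: the class depends only on feature 0; feature 1 is noise.
    X_train = rng.normal(size=(2000, 2))
    y_train = (X_train[:, 0] > 0).astype(int)
    clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

    # Virtual drift: P(X) changes on the uninformative feature only.
    X_virtual = X_train.copy()
    X_virtual[:, 1] = rng.normal(loc=5.0, scale=3.0, size=len(X_virtual))

    # Real drift: shift the informative feature across the decision boundary.
    X_real = X_train.copy()
    X_real[:, 0] = X_real[:, 0] + 2.0

    print("accuracy, no drift:     ", clf.score(X_train, y_train))    # ~1.0
    print("accuracy, virtual drift:", clf.score(X_virtual, y_train))  # ~1.0
    print("accuracy, real drift:   ", clf.score(X_real, y_train))     # drops to ~0.5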
Where the change in distribution is rapid, the drift is said to be abrupt. Drift is incremental when the distribution shifts slowly over time. Recurrent drift is defined as a distribution that oscillates between two or more concepts. Drift detection differs from outlier detection [6], as the goal is to identify and act on a global distribution change, not to remove out-of-distribution samples.

We present RDD: Real-Drift Detector, an unsupervised drift detection method based on a supervised partition of the feature space, aiming to detect local distribution changes that impact a model's performance. RDD works in any number of dimensions. Our detector does not need labels during inference and outperforms the state of the art in a thorough experiment. In Section 2, we present related work and position our paper. The RDD algorithm is described in Section 3. In Section 4, the experimental protocol is presented and the results are discussed. Section 5 concludes this paper.

2. Related Work

Over the last few years, concept drift has become a major field of research in the machine learning community. Focus was mostly aimed at models dealing with drift as well as drift detectors. Recent advances in detecting drift when true class labels are available have led algorithms to achieve almost perfect detection. Other (more realistic) contexts still show room for improvement.

To prevent drift from affecting ML models, several mechanisms have been proposed. Assuming that recent data share the same distribution as upcoming data, one way is to continuously update a pool of models. In [7], a batch of new data is scored by a pool of models, and each individual model's contribution is weighted by its recent performance. At each new batch, a model is trained on it and added to the pool, while a long-term poorly performing one is removed. This ensures good performance in the presence of concept drift and fast adaptation to recurrent drifts. Methods that work on a dynamic pool of models have been thoroughly studied [8], [9].

When detecting a drift, the consensus is that an increase in the error rate of a model signals the presence of concept drift. Detection methods based on that hypothesis have been given a lot of attention, as this methodology is able to systematically detect real drift while consistently ignoring virtual drift. The detection methods presented in [10] and [11] work by monitoring a model's error rate: a drop of performance is interpreted as the presence of drift, and different statistical tests are used to monitor the error rate and signal drifts.
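As an illustration of this family of detectors, the sketch below monitors a stream of 0/1 prediction errors and raises an alarm when the recent error rate exceeds the historical one by several standard deviations. This is a simplified, generic monitor in the spirit of these methods, not an implementation of [10] or [11]; the window size and the three-sigma threshold are our own choices.

    import math

    class ErrorRateMonitor:
        """Flag drift when the recent error rate is far above the historical one."""

        def __init__(self, window: int = 100, n_sigma: float = 3.0):
            self.window = window
            self.n_sigma = n_sigma
            self.history: list[int] = []  # 0 = correct prediction, 1 = error

        def update(self, error: int) -> bool:
            self.history.append(error)
            if len(self.history) < 2 * self.window:
                return False  # not enough data yet
            past = self.history[:-self.window]
            recent = self.history[-self.window:]
            p = sum(past) / len(past)                   # historical error rate
            std = math.sqrt(p * (1 - p) / self.window)  # std of a window mean
            return sum(recent) / self.window > p + self.n_sigma * std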
Both updating a pool of models and monitoring the error rate deal with drift effectively. As virtual drift does not impact accuracy, it is systematically ignored. However, in order to work, both approaches need access to the true labels immediately after the inference phase. This is not realistic in real-world scenarios, where true class labels are almost never available promptly and are sometimes never known. To address this issue, several ways of dealing with drift in an unsupervised manner have been studied.

To detect drift when class labels are unavailable, window-based techniques have been studied. The authors of [12] introduce ADWIN, an adaptive sliding window algorithm. It works by keeping a reference window containing past instances. The window widens when no change is detected, while its size decreases rapidly in the presence of drift. The detection mechanism works by repeatedly splitting the window, based on time of appearance, into two smaller sets. A drift is detected when the averages of the values of the two sets are statistically different. Several other window-based detectors have been presented since, such as [13].

To avoid detecting drift over one-dimensional sliding windows, [14] detects swift or gradual changes in data values with minimum enclosing balls. A ball is defined by a centroid and the minimum radius that encloses all of the samples in the ball. A drift is detected when too many values are labelled as outliers, in which case the centroid is updated.

To circumvent the unavailability of true labels, the authors of [15] trained a model to distinguish past data from recent data. All timestamp-like aggregates are removed prior to training the model to prevent trivial identification. The ability of the model is assessed using the AUC metric, with a threshold of 0.75 in their paper. Also using time to find drift, the authors of [16] include the timestamp attribute in the observations and train a model on past and recent data to predict the target variable. If the timestamp attribute is an informative feature, the target variable depends on time and the presence of a drift is assumed.

The authors of [17] present another way of detecting drift in an unsupervised manner. A first model (teacher) is trained on past labelled data, and a second model (student) is trained to mimic the behavior of the teacher model. During the inference phase, the authors monitor the error rate of the student model and use [12] to trigger an alarm. In [18], drift detection in an unsupervised image classification context is studied. The authors first apply a dimension reduction technique before using a two-sample test to find drift. A number of dimension reduction techniques and two-sample tests (including MMD [19] and KS) are evaluated in an extensive study of different types of shift applied to images. In [20], the authors incorporate the target class in the dimension reduction mechanism, enabling the detector to ignore virtual drift.

The idea behind concept drift detection by statistical tests is that a distribution change is a strong indicator of drift [14], [18]. A distribution change enables the detection of drift, but does not discriminate between real drift and virtual drift.
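The generic recipe of [18] can be sketched as follows: reduce the dimension, then run a two-sample test per component. This is an illustration under our own choices (PCA, per-component KS tests with a Bonferroni correction), not the cited implementation:

    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.decomposition import PCA

    def two_sample_drift_test(X_ref, X_new, n_components=5, alpha=0.01):
        """KS two-sample test on PCA-reduced data; True means a shift was found."""
        pca = PCA(n_components=n_components).fit(X_ref)
        R, N = pca.transform(X_ref), pca.transform(X_new)
        # Bonferroni-corrected per-component tests.
        p_values = [ks_2samp(R[:, j], N[:, j]).pvalue for j in range(n_components)]
        return min(p_values) < alpha / n_components

Such a test fires on any distribution change, whether or not it harms the model, which is exactly the limitation discussed above.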
To the best of our knowledge, few algorithms are able to discriminate between real drift and virtual drift without access to true labels after detection. In this paper we introduce RDD, a detector that does not need true class labels to operate during the inference phase. RDD works in any dimension and successfully discriminates between real and virtual drift.

3. RDD

3.1. Our detector

The idea behind our model is that a real drift changes the distribution of the regions made by a class-dependent partitioning, while virtual drift does not. We use a decision tree to partition the feature space into regions of homogeneous class labels.

Our intuition is that a data distribution change in a leaf between the training phase and the inference phase indicates a drift. It is our assumption that a real drift is likely to change which region the observations are assigned to. Such misplaced samples are likely to have a different distribution than that of the training observations. A distribution change leading to virtual drift is unlikely to be seen locally, as it affects less the way observations are distributed in leaves.

In order to better discriminate real drift from virtual drift, each region is attributed a weight. The weights represent the ability of a given region to correctly assign a label to an observation. A region that only contains a single class of observations will have a large weight, while a region that cannot well separate samples on their label will have a small weight. This is done to reduce the risk of a sudden class imbalance being detected as a distribution change. Our model signals a drift when enough regions flag their distribution as changing.

3.2. Mathematical Background

During the initialization step, a decision tree classifier (ℳ) is fit over the training data. Like most drift detection algorithms, we assume the training data is sampled from one single concept; we do not consider the training data to include past concepts that might offset the detector. For both the training (𝑇) and inference (𝐼) sets, we consider the variables to follow a normal distribution. This hypothesis is required to test for homoscedasticity later on. After discarding all leaves that contain too few samples or that are not pure enough, we store, for each leaf, the training instances that belong to it. Let 𝑇ₖ, 𝐼ₖ be the training and inference data within leaf 𝑘 of class 𝑐. For the test to be significant [21], we remove all leaves of the decision tree containing fewer than 𝜈 observations. We set 𝜈 = 20. We thus have ∀𝑘, min(|𝑇ₖ|, |𝐼ₖ|) ≥ 𝜈, as well as 𝑌_𝑇ₖ = 𝑌̂_𝐼ₖ = 𝑐.

By construction, the leaves of a decision tree do not all hold the same separation power over class labels. The intuition is that leaves containing pure class labels should be less affected by a P(𝑦 | 𝑋) concept change, as they are generally further away from the decision boundary. On the contrary, impure leaves are more likely to experience a distribution change due to a P(𝑦) drift, or to be subject to misclassifications that may impact the inference distribution. During the initialization step, leaves that cannot well separate class labels are removed: all leaves with more than 20% impurity are dropped. During early experimental runs, we found that setting a low maximum impurity percentage yields very few leaves with enough samples to conduct the statistical test. We also found that setting a high value prevents us from confidently assigning a class to a leaf. We set the maximum impurity value at 20% as it offers a good compromise, although this value could be changed based on the data at hand. In an effort to rank the remaining leaves, we attribute to each leaf a weight which corresponds to the leaf's purity during training.

Weights capture the separation power of a leaf. If a given observation is classified at a leaf with a high weight, we may expect the probability of this observation being misclassified to be low. On the other hand, a leaf with a lower weight is more susceptible to assigning the wrong label to an observation. Our goal in ranking the leaves by their predictive power is to help our detector ignore virtual drift.
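Under our reading of this initialization step, a minimal sketch could look as follows (the helper name init_rdd and the use of scikit-learn are ours; the paper does not prescribe an implementation):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def init_rdd(X_train, y_train, nu=20, max_impurity=0.20):
        """Fit the partitioning tree, keep pure and populated leaves,
        and weight each kept leaf by its training purity.
        Assumes integer class labels in y_train."""
        tree = DecisionTreeClassifier(min_samples_leaf=nu).fit(X_train, y_train)
        leaf_ids = tree.apply(X_train)  # leaf index of each training sample

        leaves, weights = {}, {}
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            labels = np.asarray(y_train)[mask]
            purity = np.bincount(labels).max() / len(labels)
            if len(labels) >= nu and 1 - purity <= max_impurity:
                leaves[leaf] = X_train[mask]   # training instances per leaf
                weights[leaf] = purity         # leaf weight = training purity
        return tree, leaves, weights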
The detection step is detailed in Algorithm 1. In lines 1-5, test set observations are attributed to 𝐿_test based on the leaf they fall in. In line 6, we go over each leaf that contains test data; in line 7, we initialize the drift features variable DF, which tracks the number of drifted features. In line 8, the number of observations at a leaf is checked. In lines 9 through 13, for all leaves containing enough test instances, we conduct a Levene test of variance equality on all dimensions between the inference and training sets contained in a leaf. We choose the Levene test here as it is adequate when the data distribution may slightly deviate from the normal one. Of course, other tests could be used, based on knowledge of the underlying data distribution, to improve the detector's performance (when the data distribution is strictly normal, Bartlett's test should be used; the Brown–Forsythe test may also be an alternative when the data does not follow a normal distribution). We did not make any assumptions on the distribution and independence of variables; this will be the focus of future work.

In lines 14 through 18, leaves are classified as drifting if the ratio of features that fail the homoscedasticity test exceeds the user-defined 𝛾 threshold. In line 21, the algorithm flags a drift if the weighted average of the leaves' drift labels exceeds the user-defined 𝛽 threshold.

Algorithm 1 RDD - Inference
Inputs:
- ℳ : Trained Decision Tree Model
- 𝐿_train : Dictionary of leaves mapping to the training instances
- 𝐼 ∈ ℝ^(d×m) : Test set with d features and m samples
- 𝑊 : Leaves weights
- 𝑑 : Number of variables in the dataset
Parameters:
- 𝛼 : Hypothesis rejection risk
- 𝜈 : Minimum number of observations in a leaf
- 𝛾 : Minimum ratio of ℋ₀ rejections for a leaf to drift
- 𝛽 : Minimum ratio of drifting leaves to trigger an alarm
Variables:
- 𝐿_test : Dictionary of leaves mapping to the test instances in those leaves
- 𝐷𝐿 : Drift status of leaves
- 𝐷𝐹 : Number of features that drift within a leaf

 1: for 𝑖 ∈ 𝐼 do
 2:   if ℳ(𝑖) ∈ 𝐿_train then
 3:     𝐿_test[ℳ(𝑖)] ← 𝐿_test[ℳ(𝑖)] + 𝑖
 4:   end if
 5: end for
 6: for 𝑖 ∈ 𝐿_test do
 7:   𝐷𝐹 = 0
 8:   if |𝐿_test[𝑖]| ≥ 𝜈 then
 9:     for 𝑗 ∈ [0, 𝑑−1] do
10:       if 𝐻_𝛼 : 𝜎_(𝐿_train[𝑖][𝑗]) ≠ 𝜎_(𝐿_test[𝑖][𝑗]) then
11:         𝐷𝐹 = 𝐷𝐹 + 1
12:       end if
13:     end for
14:     if 𝐷𝐹/𝑑 ≥ 𝛾 then
15:       𝐷𝐿[𝑖] = 1
16:     else
17:       𝐷𝐿[𝑖] = 0
18:     end if
19:   end if
20: end for
21: Return Σ_(𝑖∈𝐿_test) 𝑊ᵢ · 𝐷𝐿[𝑖] ≥ 𝛽
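Continuing the hypothetical init_rdd helper above, the detection step of Algorithm 1 can be sketched as follows (scipy's Levene test stands in for the per-feature homoscedasticity check; interpreting line 21 as a weighted average of drift labels is our reading of the text):

    from scipy.stats import levene

    def rdd_detect(tree, leaves, weights, X_test,
                   alpha=0.01, nu=20, gamma=0.3, beta=0.3):
        """Return True if the weighted ratio of drifting leaves exceeds beta."""
        leaf_ids = tree.apply(X_test)
        drifted, total = 0.0, 0.0
        for leaf, X_ref in leaves.items():
            X_new = X_test[leaf_ids == leaf]  # test instances in this leaf
            if len(X_new) < nu:
                continue
            # Count features whose variance differs between train and test
            # (H0 of equal variances rejected at level alpha).
            df = sum(levene(X_ref[:, j], X_new[:, j]).pvalue < alpha
                     for j in range(X_ref.shape[1]))
            drifted += weights[leaf] * (df / X_ref.shape[1] >= gamma)
            total += weights[leaf]
        return total > 0 and drifted / total >= beta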
3.3. Hyper-parameter discussion

The first hyper-parameter is 𝛼. Setting a low value for the 𝛼 parameter reduces the risk of a type I error, which, in our case, means indicating drift when there is none. The 𝛼 parameter was set to 0.01.

The 𝛾 parameter is the minimum ratio of features that must reject the equal variance hypothesis. A low 𝛾 parameter means leaves will be considered as drifting if only a few features present a shift in variance (i.e. the detector will be sensitive). The 𝛽 parameter is the minimum ratio of drifted leaves required to signal that a drift has taken place.

In order to set relevant 𝛽 and 𝛾 values for our detector, we conducted a hyper-parameter search on three datasets (airlines, poker and weather). We excluded those datasets from the experimental study to prevent bias. In Figure 1 we plot the influence of both 𝛽 and 𝛾 on the TP rate, the TN rate and the H score. The H score is detailed in Section 4.

[Figure 1: Influence of the 𝛽 and 𝛾 parameters on the True Positive rate, True Negative rate and H score. On the 𝛽 plot, 𝛾 = 0.3, and on the 𝛾 plot, 𝛽 = 0.3. The graph confirms our intuition that low values tend to flag virtual drift as real drift, while high values cause the detector not to detect any drift.]

4. Experiment

In this experiment we assess our method's ability to detect drift while ignoring virtual drift. We extensively tested our method against a wide panel of state-of-the-art detectors on an extensive set of both real and synthetic datasets. The benchmarks used in this section are the standard ones when testing drift detectors [20], [22], [23].

4.1. Experimental Setup

The usual procedure to test algorithms suited to handle drift, when true class labels are available after inference, is the test-then-train approach. A model predicts the class on a batch of samples; then the true class is revealed and the model updates itself. The global prediction accuracy is then used to rank models.

This setup is not suited to models that do not rely on true label availability. In most datasets used to benchmark drift handling methods, the presence of drift is only assumed, or artificially introduced by sorting the observations on an attribute. To the best of our knowledge, the exact occurrence of drift is unknown for all the usual datasets. The experimental setup described below allows us to know the exact drift occurrence and to evaluate the effect it has on a model's accuracy.

The goal of the experiment is to assess the performance of detectors on real drift and virtual drift. Two distinct perturbations are used to change the dataset: the Step Drift, where a subset of the features are shuffled, and the Noise Drift, where Gaussian 𝒩(1, 1) noise is added to a subset of features (Gaussian noise with mean equal to 1 is used to change the mean of the distribution, not to obfuscate the signal). The idea is to artificially generate real drift and virtual drift. To create virtual drift, we apply one of the two perturbations to the 25% least important features with respect to their predictive power on the class labels. In doing so, we expect that changing the distribution of several features will not affect a predictive model's performance. To create real drift, we modify the 25% most informative features by adding one of the two perturbations. The intuition is that a change of distribution on the most important features is likely to cause a drop of performance in a predictive model. To find the 25% most and least important features, we train a Random Forest Classifier over the training data. We choose this model as it is a robust, widely used model that achieves a good level of performance on the datasets.
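A sketch of this drift-generation protocol, as we understand it (the helper make_drift_sets and its defaults are ours):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def make_drift_sets(X_train, y_train, X_drift, seed=0):
        """Create the four perturbed copies of the drift set:
        {step, noise} x {least, most} important features."""
        rng = np.random.default_rng(seed)
        forest = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
        q = max(1, X_drift.shape[1] // 4)  # 25% of the features
        order = np.argsort(forest.feature_importances_)
        subsets = {"least": order[:q], "most": order[-q:]}

        out = {}
        for name, cols in subsets.items():
            step = X_drift.copy()   # Step Drift: shuffle the selected features
            for c in cols:
                step[:, c] = rng.permutation(step[:, c])
            noise = X_drift.copy()  # Noise Drift: add N(1, 1) noise
            noise[:, cols] = noise[:, cols] + rng.normal(1.0, 1.0, size=(len(noise), q))
            out[("step", name)], out[("noise", name)] = step, noise
        return out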
For each dataset, we introduce the two perturbations on the two sets of features, thus creating four distinct drift sets.

In order to have stationary non-drifting data before adding our generated drifts, we first randomly shuffle the observations. Each dataset is then partitioned into three: a train set, a validation set and a drift set. Four distinct copies of the drift set are independently modified with the four perturbations described above. In order to assess whether a drift is virtual or real, we fit a Random Forest Classifier on the train set before reporting its accuracy on the train set, the validation set and the four drift sets. The drop of the model's accuracy between the different sets is used to classify drift as virtual or real: if the difference in accuracy between the validation set and the training set is lower than that between the validation set and the drift set, we consider the induced drift to be real; otherwise, it is considered a virtual drift.

Table 1 briefly describes the datasets used in the experiment. The dataset dimensions range from 11 on Hyperplane to 500 on Spam. The classification task is binary on 10 datasets and multi-class on 3. This ensures that RDD is tested in a variety of scenarios.

Table 1: Overview of the datasets used in our experiment. All but one RW dataset are binary classification problems. Two RW datasets contain more than 100 features. For the synthetic datasets, we limit the number of generated observations to 10 000.

Dataset      Dimensions      Classification
Adult        (48842, 66)     Binary
Bank         (45211, 49)     Binary
Cov          (110393, 51)    Multi-class (7)
Digits08     (1499, 17)      Binary
Digits17     (1557, 17)      Binary
Elec         (45312, 15)     Binary
Musk         (6598, 167)     Binary
Phishing     (11055, 47)     Binary
Spam         (6213, 500)     Binary
Wine         (6497, 13)      Binary
Hyperplane   (10000, 11)     Binary
LED          (10000, 26)     Multi-class (10)
Waveform     (10000, 41)     Multi-class (3)

In Table 2 we report the average accuracies of the Random Forest Classifier over the training, validation and the four drift sets. Changing the most important features generates real drift, while changing the least important features creates virtual drift regardless of the perturbation. There are three exceptions: when noise is added to the least important features, real drift is produced on the Musk dataset; when corrupting the most important features, the step perturbation produces virtual drift on the Hyperplane dataset, while the noise perturbation yields virtual drift on the Waveform dataset.

Table 2: Accuracy of a Random Forest Classifier over the training, validation and drift sets. Adding the step perturbation to the least informative features (LS) leads to virtual drift on all datasets. Adding noise to the least informative features (LN) also leads to virtual drift, except on the Musk dataset. When those perturbations are made on the most informative features (MN and MS), they lead to real drift across all real datasets.

         Train   Val.   LN    LS    MN    MS
Adult    1.      .85    .85   .85   .28   .59
Bank     1.      .94    .92   .94   .52   .52
Cov      .99     .85    .84   .84   .49   .46
D08      1.      1.     .99   .99   .77   .69
D17      1.      .99    1.    1.    .54   .80
Elec     1.      .89    .87   .87   .57   .62
Musk     1.      .98    .94   .97   .51   .56
Phis.    .99     .97    .96   .96   .69   .47
Spam     1.      .98    .98   .98   .58   .59
Wine     1.      1.     1.    1.    .63   .44
Hyp.     1.      .87    .87   .85   .71   .87
LED      1.      1.     1.    1.    .58   .31
Wav.     1.      .85    .85   .85   .72   .41

In an effort to aggregate both the True Positives and the True Negatives into one metric, we make use of the H metric (1) defined in [20]. Since we conduct our experiment in batch mode, we removed the impact of the detection delay, defined as the number of drift samples processed before signaling a drift. The Drift Accuracy (D̂A) is a binary value that assesses the correctness of the detection: it is equal to 1 when a virtual drift is ignored or when a real drift is detected.

    H = 2 · (D̂A · TN) / (D̂A + TN)    (1)
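For instance, a detector that catches every real drift (D̂A = 1) but raises false alarms on half of the virtual-drift runs (TN = 0.5) gets H = 2 · 0.5 / 1.5 ≈ 0.67. As a one-line helper (ours, for illustration):

    def h_score(da: float, tn: float) -> float:
        """Harmonic mean of drift accuracy and true-negative rate, eq. (1)."""
        return 2 * da * tn / (da + tn) if (da + tn) else 0.0

    print(h_score(1.0, 0.5))  # 0.666...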
We evaluate RDD with 𝛼 = 0.01, 𝛽 = 0.3, 𝛾 = 0.3 against:

• ADWIN [12] with 𝛿 = 0.7
• Discriminative Drift Detector (D3) [15]
• Kolmogorov-Smirnov (KS) distribution test detector; we used the implementation of [24]
• Maximum Mean Discrepancy (MMD) [19]; we used the implementation of [24]
• Student-Teacher (ST) [17]
• Task Sensitive Drift Detector (TSDD) [20]

All detectors were used with their default parameter values unless specified otherwise. For the sake of readability, we only highlight the best results in Table 7.

4.2. Virtual Drift

In Table 3 we present the detections made on virtual drift induced by a Step corruption of the least important features. Of the 7 detectors evaluated, 3 are able to consistently ignore this type of virtual drift: TSDD, ST and RDD. TSDD comes first, with no detections on real-world (RW) datasets and almost none on synthetic data. RDD makes no False Positives (FP) on 7 RW datasets and on all synthetic datasets; on 3 RW datasets, the FP rate is very low (0.1). The Student-Teacher detector produces no FP on 8 RW datasets and on 2 synthetic ones; however, on 2 RW datasets its FP rate is high, at around 0.5. ADWIN, along with the statistical-test-based KS and MMD, fails to ignore virtual drift on all but 1 RW dataset. D3 does slightly better, ignoring virtual drift on 3 RW datasets. On synthetic data, relatively few FP are made by those 4 detectors.

Table 3: Least Important Step Drift: detection results on virtual drift when the least important features are shuffled. The lower the detection ratio, the better the detector. Our model RDD comes second, slightly outperformed by TSDD. ST comes third, with relatively few wrong detections in comparison to the other detectors, which wrongfully detect virtual drift.

        Adult  Bank  Cov  D08  D17  Elec  Musk  Phish.  Spam  Wine  Hyp.  LED  Wav.
RDD     0.0    0.0   0.0  0.1  0.1  0.0   0.0   0.0     0.0   0.1   0.0   0.0  0.0
ADWIN   1.0    1.0   1.0  1.0  1.0  0.0   1.0   1.0     1.0   1.0   0.0   0.1  0.1
D3      0.0    0.5   0.0  0.9  1.0  0.0   1.0   1.0     1.0   1.0   0.0   0.0  0.0
KS      0.9    1.0   0.0  1.0  1.0  0.0   1.0   1.0     1.0   1.0   0.0   1.0  0.1
MMD     1.0    1.0   0.3  1.0  1.0  1.0   1.0   1.0     1.0   1.0   0.2   0.2  0.6
ST      0.0    0.4   0.0  0.0  0.0  0.0   0.6   0.0     0.0   0.0   0.0   0.2  0.0
TSDD    0.0    0.0   0.0  0.0  0.0  0.0   0.0   0.0     0.0   0.0   0.0   0.1  0.0

Table 4 shows the detection rates when adding noise to the least important features. On the Musk dataset, this type of perturbation produces real drift, and detections there are therefore considered True Positives (TP). ADWIN, along with D3, KS and MMD, which were not specifically built to handle virtual drift, systematically and wrongfully detect drift across all real and synthetic datasets. RDD flags virtual drift on the Digits 08 dataset in 4 runs out of 10; virtual drift is otherwise ignored by RDD. ST and TSDD fail to ignore the virtual drift on 2 RW datasets.

Table 4: Least Important Noise Drift: detection results on virtual drift when noise is added to the least important features. The lower the detection ratio, the better the detector. Our model RDD comes first, with almost no false detections. ST and TSDD take second and third place with 2 false detections each, while the other detectors consistently detect virtual drift.

        Adult  Bank  Cov  D08  D17  Elec  Musk  Phish.  Spam  Wine  Hyp.  LED  Wav.
RDD     0.0    0.0   0.0  0.4  0.1  0.0   0.0   0.0     0.0   0.2   0.0   0.0  0.0
ADWIN   1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
D3      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
KS      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
MMD     1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
ST      0.0    1.0   0.2  0.0  0.1  0.0   0.9   0.0     0.7   0.0   0.0   0.0  0.0
TSDD    0.0    0.0   0.0  0.0  0.0  0.0   0.6   0.0     0.7   1.0   0.0   0.1  0.0
4.3. Real Drift

In Table 5 we observe real drift induced by a Step drift on the most informative features. The Hyperplane dataset exhibits virtual drift under this corruption, and low values there should be regarded as TN. ADWIN, D3, KS and MMD, which all exhibit poor performance on virtual drift, now achieve almost perfect detection on RW datasets. However, D3 and ADWIN fail to detect real drift on synthetic data. Our method systematically detects real drift on 4 RW datasets and achieves good levels of detection on 3 others; drift is detected on 50% of the runs on the Phishing dataset. The ST model achieves 4 perfect detections on RW datasets and good levels of detection on 3 others. The TSDD detector yields poor performance, detecting only 2 drifts out of all RW datasets. On the 2 synthetic datasets with real drift, ADWIN, D3 and RDD fail to detect the drift, while TSDD detects 1; the KS, MMD and ST detectors succeed in their detection.

Table 5: Most Important Step Drift: detection results on real drift when the most important features are shuffled. The higher the detection ratio, the better the model. MMD takes first place, followed by KS, ADWIN and D3, closely followed by our detector RDD and ST. TSDD outputs false negatives on all but 3 datasets.

        Adult  Bank  Cov  D08  D17  Elec  Musk  Phish.  Spam  Wine  Hyp.  LED  Wav.
RDD     0.9    0.8   0.0  0.9  1.0  1.0   1.0   0.5     0.2   1.0   0.0   0.0  0.0
ADWIN   1.0    1.0   1.0  0.8  1.0  0.0   1.0   1.0     1.0   1.0   0.0   0.2  0.2
D3      1.0    1.0   0.0  0.8  1.0  0.0   1.0   1.0     1.0   1.0   0.0   0.0  0.0
KS      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   0.2   1.0  0.9
MMD     1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   0.2   1.0  1.0
ST      0.7    0.6   1.0  0.1  0.6  0.3   1.0   1.0     1.0   0.0   0.0   0.9  0.8
TSDD    0.0    0.0   0.0  0.6  0.0  0.0   0.0   0.0     0.7   0.2   0.0   1.0  0.0

The detection results for noise on the most important features are shown in Table 6. ADWIN, D3, KS and MMD achieve perfect detection across both real and synthetic datasets. RDD's detection results exceed those of ST and TSDD, with 7 perfect detections on RW datasets. TSDD and ST are tied, with 5 accurate detections each on RW datasets. Only our detector and TSDD ignore the virtual drift on the Waveform dataset. ST and TSDD outperform RDD with one perfect detection on the virtual-drift synthetic datasets.

Table 6: Most Important Noise Drift: detection results on real drift when noise is added to the most important features. The higher the detection ratio, the better the model. ADWIN, D3, KS and MMD achieve perfect detection across all datasets exhibiting real drift. Our model RDD achieves perfect detection on 7 datasets, outperforming TSDD and ST.

        Adult  Bank  Cov  D08  D17  Elec  Musk  Phish.  Spam  Wine  Hyp.  LED  Wav.
RDD     1.0    0.9   0.0  1.0  1.0  0.5   1.0   1.0     1.0   1.0   0.0   0.5  0.0
ADWIN   1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
D3      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
KS      1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
MMD     1.0    1.0   1.0  1.0  1.0  1.0   1.0   1.0     1.0   1.0   1.0   1.0  1.0
ST      0.0    1.0   1.0  0.1  0.0  0.8   0.3   1.0     1.0   0.0   0.0   1.0  0.9
TSDD    1.0    1.0   0.0  0.1  0.4  0.0   0.8   0.0     0.6   1.0   0.0   1.0  0.0

4.4. Overall performances

In Table 7, we report the combined true positive and true negative results, aggregated with (1). This table showcases the overall performance achieved by each detector on each dataset. Because on the Musk, Hyperplane and Waveform datasets the drift induction generates either more virtual or more real drift, the TN score has a varying impact.

Table 7 allows us to assess the ability of a model to ignore virtual drift while detecting real drift. RDD yields the best H scores on 7 RW datasets and ties for first place on 2 synthetic ones. TSDD takes second place, with the highest score on 1 RW dataset but coming first or tying for first place on all synthetic datasets. ST comes third, with the highest H scores on 2 RW datasets and tying for first place on 1 synthetic dataset. On RW datasets, we see that ADWIN, D3, KS and MMD have overall low scores due to their misclassification of virtual drift, despite having detected all real drifts. On synthetic datasets, their score is better, as they did not make too many misclassifications when a step drift was induced on the least informative features.

Table 7: Harmonic Mean: the results of Tables 3 through 6 aggregated into one metric. The higher the metric, the better the detector is at both ignoring virtual drift and detecting real drift. Our model RDD comes first, with the highest score on 7 out of 10 RW datasets and on 2 of the 3 synthetic datasets. TSDD comes second and ST third; the two models have lower scores than RDD because some real drift is ignored. The ADWIN, D3, KS and MMD detectors do not yield high scores because of their inability to ignore virtual drift.

        Adult  Bank  Cov   D08   D17   Elec  Musk  Phish.  Spam  Wine  Hyp.  LED   Wav.
RDD     0.99   0.96  0.67  0.80  0.92  0.93  0.86  0.93    0.89  0.89  0.86  0.77  0.86
ADWIN   0.00   0.00  0.00  0.00  0.00  0.50  0.00  0.00    0.00  0.00  0.71  0.48  0.29
D3      0.60   0.36  0.50  0.09  0.00  0.50  0.00  0.00    0.00  0.00  0.71  0.50  0.29
KS      0.09   0.00  0.60  0.00  0.00  0.60  0.00  0.00    0.00  0.00  0.65  0.00  0.36
MMD     0.00   0.00  0.46  0.00  0.00  0.00  0.00  0.00    0.00  0.00  0.59  0.51  0.19
ST      0.81   0.39  0.92  0.71  0.75  0.87  0.50  1.00    0.73  0.67  0.86  0.91  0.71
TSDD    0.86   0.86  0.67  0.81  0.75  0.67  0.75  0.67    0.65  0.52  0.86  0.92  0.86

5. Conclusion

In this paper we introduced RDD, a drift detector that does not need ground-truth labels during the inference phase.
We extensively challenged our algorithm against a number of state-of-the-art drift detectors over a large panel of both real and synthetic datasets. We experimentally showed that our method outperforms current drift detection methods, demonstrating our detector's ability to detect real drift while ignoring virtual drift. Since false alarms are the main reason why drift detectors are not widely used in production, this demonstrates the usability of our detector for real-world applications. We tuned the hyper-parameters on 3 datasets not used in the experimental study and showed that they remain valid in a wide range of real-world scenarios, so that little tuning effort should be needed when using the model in production. We also demonstrated the ability of RDD to work in any dimension, having the best detection accuracy on both datasets with over 100 features.

Future work will consist of further modeling the partition space. Research will also address how a drift detector can be initialized in recurrent concept drift scenarios, when no stationary dataset can be used to initialize a detector.

References

[1] A.-K. Reuel, M. Koren, A. Corso, M. J. Kochenderfer, Using adaptive stress testing to identify paths to ethical dilemmas in autonomous systems, in: SafeAI@AAAI, 2022.
[2] H. Huang, Z. Li, L. Wang, S. Chen, B. Dong, X. Zhou, Feature space singularity for out-of-distribution detection, arXiv preprint arXiv:2011.14654 (2020).
[3] M. G. Kelly, D. J. Hand, N. M. Adams, The impact of changing populations on classifier performance, in: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999, pp. 367–371.
[4] I. A. Nikolov, M. P. Philipsen, J. Liu, J. V. Dueholm, A. S. Johansen, K. Nasrollahi, T. B. Moeslund, Seasons in drift: A long-term thermal imaging dataset for studying concept drift, in: Thirty-fifth Conference on Neural Information Processing Systems, 2021.
[5] A. Suprem, J. Arulraj, C. Pu, J. Ferreira, Odin: Automated drift detection and recovery in video analytics, arXiv preprint arXiv:2009.05440 (2020).
[6] R. Kamoi, K. Kobayashi, Out-of-distribution detection with likelihoods assigned by deep generative models using multimodal prior distributions, in: SafeAI@AAAI, 2020, pp. 113–116.
[7] R. Elwell, R. Polikar, Incremental learning of concept drift in nonstationary environments, IEEE Transactions on Neural Networks 22 (2011) 1517–1531.
[8] D. Brzezinski, J. Stefanowski, Reacting to different types of concept drift: The accuracy updated ensemble algorithm, IEEE Transactions on Neural Networks and Learning Systems 25 (2013) 81–94.
[9] J. Z. Kolter, M. A. Maloof, Dynamic weighted majority: An ensemble method for drifting concepts, The Journal of Machine Learning Research 8 (2007) 2755–2790.
[10] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavalda, R. Morales-Bueno, Early drift detection method, in: Fourth international workshop on knowledge discovery from data streams, volume 6, 2006, pp. 77–86.
[11] D. R. de Lima Cabral, R. S. M. de Barros, Concept drift detection based on Fisher's exact test, Information Sciences 442 (2018) 220–234.
[12] A. Bifet, R. Gavalda, Learning from time-changing data with adaptive windowing, in: Proceedings of the 2007 SIAM international conference on data mining, SIAM, 2007, pp. 443–448.
[13] C. Raab, M. Heusinger, F.-M. Schleif, Reactive soft prototype computing for concept drift streams, Neurocomputing 416 (2020) 340–351.
[14] M. Heusinger, F.-M. Schleif, Reactive concept drift detection using coresets over sliding windows, in: 2020 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE, 2020, pp. 1350–1355.
[15] Ö. Gözüaçık, A. Büyükçakır, H. Bonab, F. Can, Unsupervised concept drift detection with a discriminative classifier, in: Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 2365–2368.
[16] M. Black, R. Hickey, Learning classification rules for telecom customer call data under concept drift, Soft Computing 8 (2003) 102–108.
[17] V. Cerqueira, H. M. Gomes, A. Bifet, Unsupervised concept drift detection using a student–teacher approach, in: International Conference on Discovery Science, Springer, 2020, pp. 190–204.
[18] S. Rabanser, S. Günnemann, Z. Lipton, Failing loudly: An empirical study of methods for detecting dataset shift, Advances in Neural Information Processing Systems 32 (2019).
[19] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test, The Journal of Machine Learning Research 13 (2012) 723–773.
[20] A. Castellani, S. Schmitt, B. Hammer, Task-sensitive concept drift detector with constraint embedding, in: 2021 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE, 2021, pp. 01–08.
[21] C. C. Serdar, M. Cihan, D. Yücel, M. A. Serdar, Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies, Biochemia medica 31 (2021) 27–53.
[22] T. S. Sethi, M. Kantardzic, On the reliable detection of concept drift from streaming unlabeled data, Expert Systems with Applications 82 (2017) 77–99.
[23] T. S. Sethi, M. Kantardzic, E. Arabmakki, Monitoring classification blindspots to detect drifts from unlabeled data, in: 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), IEEE, 2016, pp. 142–151.
[24] A. Van Looveren, J. Klaise, G. Vacanti, O. Cobb, A. Scillitoe, R. Samoilescu, A. Athorne, Alibi Detect: Algorithms for outlier, adversarial and drift detection, 2019. URL: https://github.com/SeldonIO/alibi-detect.