<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Robust Drift Detection Algorithm with High Accuracy and Low False Positives Rate</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxime Fuccellaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Simon</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akka Zemmari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mangrove</institution>
          ,
          <addr-line>87 Avenue des Aygalades, 13015 Marseille</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bordeaux</institution>
          ,
          <addr-line>CNRS, Bordeaux INP, LaBRI, UMR 5800, 351, cours de la Libération F-33405 Talence</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The number of decision-making processes that rely on machine learning models to operate has been increasing in recent years. The safety of those systems is compromised when models deviate from their expected behavior. One root cause is a shift in the underlying data distribution, known as concept drift. A direct consequence of concept drift is a rapid drop in a model's predictive power. Accurate detection of drift is essential, as false alarms lead to unnecessary downtime and undermine confidence in the drift detection model. This paper introduces the Real-Drift Detector (RDD), a drift detector that is not triggered by virtual drift. RDD does not need class labels during the inference phase to operate. Our detector outperformed the state of the art in an extensive benchmark on a large panel of well-known datasets used in drift detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept Drift</kwd>
        <kwd>Real Drift</kwd>
        <kwd>Virtual Drift</kwd>
        <kwd>Unsupervised</kwd>
        <kwd>Hypothesis Testing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        More and more online systems rely, at least partly, on
a form of machine learning model to operate. The
widespread integration of Artificial Intelligence based
models has its roots in the constant progress made in the
field, which enables models to solve increasingly complex
tasks well suited to real world applications. The
democratisation of Machine Learning (ML) models allows
non-experts to use them to automate repetitive tasks,
and is simplified by the easy access to and processing of the
large quantities of data required to train predictive models.
The emergence of cloud computing has also been
accelerating the industrial use of ML models in production.
      </p>
      <p>
        However, ML models can be crippled by a wide range
of problems that raise serious questions regarding their
impact on the safety of systems and their consequences
for society. Some models inadvertently induce bias in
their predictions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], such as black box models that are
therefore excluded from applications where explainability
is a critical feature, such as loan applications. Moreover,
ML models are often poorly adapted to detect out of
distribution samples [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which are not classified correctly.
Models can then see a drop of performance during the
inference phase due to a change in the underlying data
distribution [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A shift of distribution is referred to as
Concept Drift (CD), and its detection will be the focus of
this paper.
      </p>
      <p>
        Machine learning models are built under the
hypothesis that data seen during the training phase share the
same distribution as unseen future data. Concept drift
appears when the underlying distribution of a data source
changes over time. If the static distribution hypothesis
is violated, historic data cannot be used to predict the
future, and predictive models see their performance drop.
Concept Drift can impact every ML domain, including
video analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In contrast to anomaly detection, where the goal is
to isolate a few out of distribution samples, concept drift
will cause a large part of the data to deviate from past
distributions. One way to categorize drifts is by their
impact on a model's performance: virtual drift describes
a distribution change that does not impact a model,
while real drift does. Let X be a set of variables
used to predict the target class vector y. We distinguish
three root causes of concept drift: it may come from a
change in the class distribution P(y | X), the feature
space P(X) or the class priors P(y). Where the change in
distribution is rapid, the drift is said to be abrupt. Drift is
incremental when the distribution shifts slowly over time.
Recurrent drift is defined as a distribution that oscillates
between two or more concepts. Drift detection differs
from outlier detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], as the goal is to identify and
take actions to deal with a global distribution change, and
not to remove out of distribution samples.
      </p>
      <p>
        We present RDD: Real-Drift Detector, an unsupervised
drift detection method based on the supervised partition
of the feature space, aiming to detect local distribution
changes that impact model performance. RDD works
in any number of dimensions. Our detector does not need
labels during inference and outperforms the state of the
art in a thorough experiment. In Section 2, we present
related work and position our paper. The RDD algorithm
is described in Section 3. In Section 4, the experimental
protocol is presented and the results are discussed.
Section 5 concludes this paper.
      </p>
      <p>
        Proceedings of the Workshop on Artificial Intelligence Safety 2023.
maxime.fuccellaro@u-bordeaux.fr (M. Fuccellaro);
laurent.simon@u-bordeaux.fr (L. Simon);
akka.zemmari@u-bordeaux.fr (A. Zemmari).
https://www.labri.fr/perso/lsimon/ (L. Simon);
https://www.labri.fr/perso/zemmari/ (A. Zemmari).
ORCID 0000-0003-0544-5503 (L. Simon); 0000-0002-9776-0449 (A. Zemmari).
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
      </p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>Over the last few years, concept drift has become a major field of research in the machine learning community.</p>
        <p>Focus was mostly aimed at models dealing with drift as
well as drift detectors. Recent advances in detecting drift
when true class labels are available led algorithms to
achieve almost perfect detection. Other (more realistic)
contexts still show room for improvement.</p>
        <p>
          To prevent drift from afecting ML models, several
mechanisms have been proposed. Assuming that recent
data share the same distribution as upcoming data, one
way is to continuously update a pool of models. In [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a
batch of new data is scored by a pool of models, and
each model's contribution is weighted by its recent
performance. At each new batch, a model is trained on
it and added to the pool, while a long-term poor
performer is removed. This ensures good performance in
the presence of concept drift and fast adaptation to
recurrent drifts. Methods that work on a dynamic pool
of models have been thoroughly studied [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          When detecting drift, the consensus is that an
increase in a model's error rate indicates the presence of
concept drift. Detection methods based on this
hypothesis have received a lot of attention, as this methodology
is able to systematically detect real drift while
consistently ignoring virtual drift. The detection methods
presented in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] work by monitoring a model's
error rate: a drop in performance is interpreted as the
presence of drift. Different statistical tests are used to
monitor the error rate and signal drifts.
        </p>
        <p>Both updating a pool of models and monitoring the
error rate deal with drift effectively. As virtual drift does not
impact accuracy, it is systematically ignored. However,
in order to work, both approaches need access to the
true labels immediately after the inference phase. This
is not realistic in real-world scenarios, where true class
labels are almost never available promptly and are
sometimes never known. To address this issue, several ways
of dealing with drift in an unsupervised manner have
been studied.</p>
        <p>
          To detect drift without class label availability,
window-based techniques have been studied. The authors
of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] introduce ADWIN, an adaptive sliding window
algorithm. It works by keeping a reference window
containing past instances. The window widens when no
change is detected, while its size decreases rapidly in
the presence of drift. The detection mechanism works
by repeatedly splitting the window, based on the time
of appearance, into two smaller sets. A drift is detected
when the averages of the values of the two sets are
statistically different. Several other window-based detectors
have been presented since, such as [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
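<p>As a minimal sketch of the window-splitting idea just described (not the actual ADWIN implementation, which uses exponential bucketing and an adaptive bound; the threshold below is a simplified Hoeffding-style assumption):</p>

```python
import math

def adwin_style_check(window, delta=0.05):
    """Simplified ADWIN-style cut test: try every split of the window
    into an 'old' and a 'recent' part and compare their means against
    a Hoeffding-style threshold."""
    n = len(window)
    for cut in range(2, n - 1):
        left, right = window[:cut], window[cut:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        # harmonic sample size of the two sub-windows
        m = 1.0 / (1.0 / len(left) + 1.0 / len(right))
        eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / delta))
        if abs(mean_left - mean_right) > eps:
            return True  # distribution change between the two sub-windows
    return False
```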
        <p>
          To avoid detecting drift over one-dimensional sliding
windows, [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] detects swift or gradual changes in data
values with minimum enclosing balls. A ball is defined
by a centroid and the minimum radius that encloses
all of the samples in the ball. A drift is detected
when too many values are labelled as outliers, in which
case the centroid is updated.
        </p>
        <p>
          To circumvent the unavailability of true labels, the
authors of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] train a model to distinguish past data from
recent data. All timestamp-like aggregates are removed
prior to training to prevent trivial identification.
The model's discriminative ability is assessed with the AUC
metric, using a threshold of 0.75 in their paper. Also using
time to find drift, the authors of [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] include the
timestamp attribute in the observations and train a model on
past and recent data to predict the target variable. If the
timestamp attribute is an informative feature, the target
variable depends on time and the presence of a drift is
assumed.
        </p>
        <p>
          The authors of [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] present another way of detecting
drift in an unsupervised manner. A first model (teacher)
is trained on past labelled data, a second model (student)
is trained to mimic the behavior of the teacher model.
        </p>
        <p>
          During the inference phase the authors monitor the
error rate of the student model and use [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to trigger an
alarm. In [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], drift detection in an unsupervised
image classification context is studied. The authors first
apply a dimension reduction technique before using a
two-sample test to find drift. A number of dimension
reduction techniques and two-sample tests, including MMD [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and
KS, are evaluated in an extensive study of different types
of shift applied to images. In [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] the authors incorporate
the target class in the dimension reduction mechanism,
enabling the detector to ignore virtual drift.
        </p>
        <p>
          The idea behind concept drift detection by statistical
tests is that a distribution change is a strong
indicator of drift [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. A distribution change enables
the detection of drift, but does not discriminate between
real drift and virtual drift. To the best of our knowledge,
few algorithms are able to discriminate between real drift
and virtual drift without access to true labels after
detection. In this paper we introduce RDD, a detector that does
not need true class labels to operate during the inference
phase. RDD works in any dimension and successfully
discriminates between real and virtual drift.
        </p>
        <sec id="sec-1-1-1">
          <title>3. RDD</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>3.1. Our detector</title>
        <p>The idea behind our model is that a real drift changes the distribution of the regions made by a class-dependent partitioning, while virtual drift does not. We use a decision tree to partition the feature space into regions of homogeneous class labels.</p>
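<p>As an illustration of such a class-dependent partition, here is a toy one-split stand-in (the paper uses a full decision tree; in practice one could use, for example, scikit-learn's DecisionTreeClassifier and its apply() method to obtain leaf indices):</p>

```python
def stump_partition(X, y, feature=0):
    """Toy stand-in for a decision-tree partition: split on the midpoint
    of one feature and map each (sample, label) pair to a 'leaf' id."""
    values = sorted(x[feature] for x in X)
    threshold = (values[0] + values[-1]) / 2.0
    leaves = {0: [], 1: []}
    for xi, yi in zip(X, y):
        leaf = 0 if xi[feature] <= threshold else 1
        leaves[leaf].append((xi, yi))
    return threshold, leaves
```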
        <p>Our intuition is that a data distribution change in a leaf
between the training phase and the inference phase indicates
a drift. It is our assumption that a real drift is likely to
change the region to which observations are assigned.
Such misplaced samples are likely to have a different
distribution than that of the training observations. A
distribution change leading to virtual drift is unlikely to
be seen locally, as it affects less the way observations
are distributed in leaves.</p>
        <p>In order to better discriminate real drift from virtual
drift, each region is attributed a weight. The weights
represent the ability of a given region to correctly assign
a label to an observation. A region that contains only
a single class of observations will have a large weight,
while a region that cannot separate samples well on
their label will have a small weight. This is done to
reduce the risk of a sudden class imbalance being detected
as a distribution change. Our model signals a drift when
enough regions flag their distribution as changing.</p>
        <p>During the initialisation step, leaves that
cannot well separate class labels are removed: all leaves
with more than 20% impurity are dropped. During early
experimental runs, we found that setting a low maximum
impurity percentage yields very few leaves with enough
samples to conduct the statistical test. We also found
that setting a high value prevents us from confidently
assigning a class to a leaf. We set the maximum impurity
value at 20% as it offers a good compromise, although
this value could be changed based on the data at hand.
In an effort to rank the remaining leaves, we attribute to
each leaf a weight which corresponds to the leaf's purity
during training.</p>
        <p>Weights capture the separation power of a
leaf. If a given observation is classified at a leaf with a
high weight, we may expect the probability of this
observation being misclassified to be low. On the other hand,
a leaf with a lower weight is more likely to
assign the wrong label to an observation. Our goal in
ranking the leaves by their predictive power is to
help our detector ignore virtual drift.</p>
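<p>A leaf's weight, as described above, is simply its training purity; a minimal sketch:</p>

```python
def leaf_weight(labels):
    """Weight of a leaf = its purity during training: the fraction of
    training samples in the leaf that belong to its majority class."""
    if not labels:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)
```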
        <p>The detection step is detailed in Algorithm 1. In lines
1-5, test set observations are assigned to a leaf dictionary based on
the leaf they fall in. In line 6, we go over each leaf that
contains test data; in line 7 we initialize the drifted-features
counter DF that tracks the number of drifted features. In
line 8, the number of observations at the leaf is checked.
In lines 9 through 13, for all leaves containing enough
test instances, we perform a Levene test of variance
equality on every dimension between the inference and
training instances contained in the leaf. We choose the
Levene test as it is adequate when the data distribution may
deviate slightly from the normal one. Of course, other
tests could be used, based on knowledge of the
underlying data distribution, to improve the detector's
performance (when the data distribution is strictly normal,
Bartlett's test should be used; the Brown–Forsythe test
may be an alternative when the data does not follow
a normal distribution). We did not make any assumptions
on the distribution and independence of variables; this will
be the focus of future work.</p>
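<p>In practice one would call scipy.stats.levene on each feature; as a self-contained sketch, the two-group Levene statistic (mean-centered variant) can be computed as:</p>

```python
def levene_statistic(a, b):
    """Levene's W statistic for equality of variances between two groups
    (mean-centered variant). In practice scipy.stats.levene returns this
    statistic together with a p-value to compare against the chosen
    significance level."""
    def abs_deviations(group):
        mean = sum(group) / len(group)
        return [abs(v - mean) for v in group]

    za, zb = abs_deviations(a), abs_deviations(b)
    n_a, n_b = len(za), len(zb)
    mean_a = sum(za) / n_a
    mean_b = sum(zb) / n_b
    grand = (sum(za) + sum(zb)) / (n_a + n_b)
    # between-group and within-group sums of squares of the deviations
    between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    within = sum((v - mean_a) ** 2 for v in za) + \
             sum((v - mean_b) ** 2 for v in zb)
    k = 2  # number of groups
    return ((n_a + n_b - k) / (k - 1)) * between / within
```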
        <p>In lines 14 through 18, leaves are classified as drifting
if the ratio of features that fail the homoscedasticity test
exceeds the user-defined feature-ratio threshold. In line 21, the
algorithm flags a drift if the weighted average of the leaves'
drift labels exceeds the user-defined leaf-ratio threshold.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3.2. Mathematical Background</title>
        <p>During the initialization step, a decision tree classifier
(ℳ) is fit over the training data. Like most drift detection
algorithms, we assume the training data is sampled from
a single concept. We do not consider the training data
to include past concepts that might offset the detector.</p>
        <p>For both the training and inference sets, we
consider the variables to follow a normal distribution.
This hypothesis is required to test for homoscedasticity
later on.</p>
        <p>
          After discarding all leaves that contain too
few samples or that are not pure enough, we store, for each leaf,
the training instances that belong to it. For
the statistical test to be significant [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], we remove all leaves of the
decision tree containing fewer than a minimum number of
observations, which we set to 20, so that every retained leaf
holds at least this many training and inference observations.
        </p>
        <p>By construction, the leaves of a decision tree do not
hold the same separation power over class labels. The intuition
is that leaves containing pure class labels should
be less affected by a P(y | X) concept change, as they are
generally further away from the decision boundary. On
the contrary, impure leaves are more likely to experience
a distribution change due to a P(X) drift, or to be
subject to misclassifications that may impact the inference
distribution.</p>
      </sec>
      <sec id="sec-1-4">
        <title>3.3. Hyper-parameter discussion</title>
        <p>The first hyper-parameter is the significance level of the
statistical test. Setting a low value reduces the risk of making
a type I error, which, in our case, means indicating drift when
there is none. This parameter was set to 0.01.</p>
        <p>The second hyper-parameter is the minimum ratio of features that must
reject the equal-variance hypothesis for a leaf to be flagged. A low value
means leaves will be considered as drifting even if only a few
features present a shift in variance (i.e. the detector will be
sensitive).</p>
        <p>Algorithm 1: RDD - Inference.
Inputs:
- ℳ: trained decision tree model;
- the dictionary of leaves mapping to the training instances;
- the leaf weights;
- the test set with d features and m samples;
- the number of variables in the dataset;
- the hypothesis rejection risk;
- the minimum number of observations in a leaf;
- the minimum ratio of null-hypothesis rejections for a leaf to drift;
- the minimum ratio of drifting leaves to trigger an alarm.
Variables: the dictionary of leaves mapping to the test instances in those
leaves; the drift status of the leaves; the number of features that drift
within a leaf.
Lines 1-5: for each test observation, if its leaf ℳ(x) is a retained leaf,
append the observation to that leaf's test instances.</p>
        <p>Figure 1: Influence of the two threshold parameters on the True
Positive rate, True Negative rate and H score. On each plot, the other
parameter is fixed at 0.3. The graph confirms our intuition that low
values tend to flag virtual drift as real drift while high values cause
the detector not to detect any drift.</p>
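<p>Under the descriptions above, the inference procedure of Algorithm 1 can be sketched roughly as follows (names and the injected per-feature test are illustrative, not the paper's notation; the per-feature test would in practice be a Levene test at the chosen significance level):</p>

```python
def rdd_inference(assign_leaf, train_leaves, X_test, weights,
                  feature_drifted, n_min=20, tau=0.3, lam=0.3):
    """Sketch of RDD inference. `assign_leaf` maps a sample to its leaf id
    (the trained tree); `feature_drifted(train_col, test_col)` returns True
    when a feature's variance test rejects the null hypothesis."""
    # 1. route test samples to the leaves of the trained tree (lines 1-5)
    test_leaves = {}
    for x in X_test:
        leaf = assign_leaf(x)
        if leaf in train_leaves:
            test_leaves.setdefault(leaf, []).append(x)
    # 2. per leaf: count features whose variance shifted
    drifted_weight, total_weight = 0.0, 0.0
    for leaf, test_rows in test_leaves.items():
        if len(test_rows) < n_min:
            continue  # not enough test instances for a significant test
        train_rows = train_leaves[leaf]
        d = len(test_rows[0])
        n_drifted = sum(
            feature_drifted([r[j] for r in train_rows],
                            [r[j] for r in test_rows])
            for j in range(d))
        total_weight += weights[leaf]
        if n_drifted / d >= tau:  # enough features rejected H0
            drifted_weight += weights[leaf]
    # 3. weighted vote over the leaves
    return total_weight > 0 and drifted_weight / total_weight >= lam
```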
        <p>The  parameter is the minimum ratio of drifted leaves
to signal a drift has taken place.</p>
        <p>In order to set relevant threshold values for our detector,
we conducted a hyper-parameter search on three datasets
(airlines, poker and weather). We excluded those datasets
from the experimental study to prevent bias. In Figure 1
we plot the influence of both thresholds on the TP rate, the TN
rate and the H score. The H score is detailed in Section 4.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiment</title>
      <p>
        In this experiment we assess our method's ability to
detect real drift while ignoring virtual drift. We extensively
tested our method against a wide panel of state of the art
detectors on an extensive set of both real and synthetic
datasets. The benchmarks used in this section are the
standard ones for testing drift detectors [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>4.1. Experimental Setup</title>
        <sec id="sec-2-1-1">
          <p>The usual procedure to test algorithms suited to handle
drift when true class labels are available after inference is
the test-then-train approach. A model predicts the class
on a batch of samples; then the true class is revealed and
the model updates itself. The global prediction accuracy
is then used to rank models.</p>
          <p>This setup is not suited to models that do not rely on
true label availability. In most datasets used to
benchmark drift handling methods, the presence of drift is
only assumed or artificially introduced by sorting the
observations on an attribute. To the best of our knowledge,
the exact occurrence of drift is unknown for all the
usual datasets. The experimental setup described
below allows us to know the exact drift occurrence and to
evaluate the effect it has on a model's accuracy.</p>
          <p>The goal of the experiment is to assess the performance
of detectors on real drift and virtual drift. Two distinct
perturbations are used to change the dataset: the Step
Drift, where a subset of the features are shuffled, and the
Noise Drift, where Gaussian 𝒩(1, 1) noise is added to a
subset of features (Gaussian noise with mean equal to 1 is
used to change the mean of the distribution, not to
obfuscate the signal). The idea is to be able to artificially
generate real drift and virtual drift. To create virtual drift,
we add one of the two perturbations to the 25% least important
features with respect to their predictive power for the class
labels. In doing so, we expect that changing the distribution
of several features will not affect a predictive model's
performance. To create real drift, we modify the 25%
most informative features by adding one of the two
perturbations. The intuition is that a change of distribution
on the most important features is likely to cause a drop
of performance in a predictive model. To find the 25% most
and least important features, we train a Random Forest
Classifier over the training data. We choose this model
as it is a robust, widely used model that achieves a good
level of performance on the datasets. For each dataset,
we introduce the 2 perturbations on the 2 sets of features,
thus creating 4 distinct drift sets.</p>
          <p>In order to have stationary non-drifting data before
adding our generated drifts, we first randomly shuffle
the observations. Each dataset is then partitioned into
three: a train set, a validation set and a drift set. 4 distinct
copies of the drift set are independently modified with
the 4 different perturbations described above. In order
to assess whether a drift is virtual or real, we fit a Random
Forest Classifier on the train set before reporting its
accuracy on the train set, the validation set and the 4 different
drift sets. The drop of the model's accuracy between the
different sets is used to classify drift as virtual or real. If
the difference in accuracies between the validation set
and the training set is lower than that between the validation
set and the drift set, we consider the drift induced to be
real; otherwise, it is considered a virtual drift.</p>
          <p>Table 1: Overview of the datasets used in our experiment.
All but one RW dataset are binary classification problems. 2 RW
datasets contain more than 100 features. For the synthetic datasets,
we limit the number of generated observations to 10 000.
Dataset (Dimensions) Classification:
Adult (48842, 66) Binary;
Bank (45211, 49) Binary;
Cov (110393, 51) Multi-class (7);
Digits08 (1499, 17) Binary;
Digits17 (1557, 17) Binary;
Elec (45312, 15) Binary;
Musk (6598, 167) Binary;
Phishing (11055, 47) Binary;
Spam (6213, 500) Binary;
Wine (6497, 13) Binary;
Hyperplane (10000, 11) Binary;
LED (10000, 26) Multi-class (10);
Waveform (10000, 41) Multi-class (3).</p>
          <p>Table 1 briefly describes the datasets used in the
experiment. The dataset dimensions range from 11 on
Hyperplane to 500 on Spam. The classification task is
binary on 10 datasets and multi-class on 3. This ensures
that RDD is tested in a variety of scenarios.</p>
          <p>Table 2: Accuracy of a Random Forest Classifier on the train
set, the validation set and the 4 different drift sets. Adding noise to
the least informative features (LN) leads to virtual drift on all
datasets. Adding step drift to the least informative features (LS)
also leads to virtual drift, except for the Musk dataset. When those
perturbations are made on the most informative features (MN and MS),
real drift results across all real datasets. We highlight in bold the
perturbations that lead to real drift.
Dataset: Train, Val., LN, LS, MN, MS:
Adult 1. .85 .85 .85 .28 .59;
Bank 1. .94 .92 .94 .52 .52;
Cov .99 .85 .84 .84 .49 .46;
DD1078 11.. .919. .919. .919. ..5747 ..8609;
Elec 1. .89 .87 .87 .57 .62;
Musk 1. .98 .94 .97 .51 .56;
Phis. .99 .97 .96 .96 .69 .47;
Spam 1. .98 .98 .98 .58 .59;
Wine 1. 1. 1. 1. .63 .44;
Hyp. 1. .87 .87 .85 .71 .87;
LED 1. 1. 1. 1. .58 .31;
Wav. 1. .85 .85 .85 .72 .41.</p>
          <p>In Table 2 we report the average accuracies of the Random
Forest Classifier over the training set, the validation set and
the four different drift sets. Changing the most important
features generates real drift, while changing the least important
features creates virtual drift regardless of the perturbation.
There are 3 exceptions: when noise is added to
the least important features, real drift is produced on the
Musk dataset; when corrupting the most important features,
the step perturbation produces virtual drift on the
Hyperplane dataset, while the noise perturbation yields
virtual drift on the Waveform dataset.</p>
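<p>A sketch of the two perturbations under these assumptions (pure Python, with `rng` an instance of `random.Random`; `features` is the chosen 25% subset of column indices):</p>

```python
import random

def step_drift(rows, features, rng):
    """Step drift: shuffle the values of the selected features across
    samples, breaking their joint distribution with the other features."""
    cols = {j: [r[j] for r in rows] for j in features}
    for j in features:
        rng.shuffle(cols[j])
    return [[cols[j][i] if j in cols else r[j] for j in range(len(r))]
            for i, r in enumerate(rows)]

def noise_drift(rows, features, rng):
    """Noise drift: add Gaussian N(1, 1) noise to the selected features,
    shifting their mean by one rather than obfuscating the signal."""
    return [[r[j] + rng.gauss(1.0, 1.0) if j in features else r[j]
             for j in range(len(r))]
            for r in rows]
```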
          <p>
            In an effort to aggregate both the True Positive and
True Negative rates into one metric, we will make use of the
H metric (1) defined in [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. Since we conduct our experiment
in a batch mode, we removed the impact of the detection
delay, defined as the number of drift samples processed
before signaling a drift. The Drift Accuracy is a
binary value that assesses the correctness of the detection:
it is equal to 1 when a virtual drift is ignored or when a
real drift is detected.
          </p>
          <p>
            H = (2 × TP × TN) / (TP + TN) (1)
          </p>
          <p>
            We evaluate RDD with significance level 0.01 and both ratio
thresholds set to 0.3, against:
• ADWIN [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] with its parameter set to 0.7
• Discriminative Drift Detector (D3) [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]
          </p>
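<p>Assuming equation (1) is the harmonic mean of the two rates (as the combined results discussed in Section 4.4 suggest), the H score reads:</p>

```python
def h_score(tp_rate, tn_rate):
    """H score per equation (1): the harmonic mean of the true positive
    and true negative rates, aggregating detection of real drift and
    rejection of virtual drift into one number."""
    if tp_rate + tn_rate == 0:
        return 0.0
    return 2 * tp_rate * tn_rate / (tp_rate + tn_rate)
```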
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Virtual Drift</title>
        <sec id="sec-2-2-1">
          <p>In Table 3 we present the detections made on virtual
drift induced by a Step corruption of the least important
features. Of the 7 detectors evaluated, 3 are able to
consistently ignore this type of virtual drift: TSDD, ST
and RDD. TSDD comes first, with no detections on
real-world (RW) datasets and almost none on synthetic data.
RDD makes no False Positives (FP) on 7 RW datasets and
on all synthetic datasets. On 3 real-world datasets the FP rate is
very low (0.1). The Student Teacher detector produces
no FP on 8 RW datasets and on 2 synthetic ones;
however, on 2 RW datasets the FP rate is high, at around
.5. ADWIN, along with the statistical test-based KS and
MMD, fails to ignore virtual drift on all but 1 RW dataset.
D3 does slightly better, ignoring virtual drift on 3 RW
datasets. On synthetic data, relatively few FP are made
by those 4 detectors.</p>
          <p>Table 4 exhibits the detection rates when adding noise to
the least important features. On the Musk dataset, this
type of perturbation produces real drift and, therefore,
detections are counted as True Positives (TP). ADWIN,
along with D3, KS and MMD, which were not specifically
built to handle virtual drift, systematically and wrongly
detect drift across all real and synthetic datasets. RDD
flags virtual drift on the Digits08 dataset in 4 out of 10
runs. Virtual drift is otherwise ignored by RDD. ST and
TSDD fail to ignore the virtual drift on 2 RW datasets.</p>
          <p>4.3. Real Drift</p>
          <p>In Table 5 we observe real drift induced by a Step drift on
the most informative features. The Hyperplane dataset exhibits
virtual drift with this corruption, and low values should
be regarded as TN. ADWIN, D3, KS and MMD, which all
exhibit poor performance on virtual drift, now achieve
almost perfect detection on RW datasets. However, D3
and ADWIN fail to detect real drift on synthetic data.
Our method systematically detects real drift on 4 RW
datasets and achieves good levels of detection on 3 others.
Drift is detected in 50% of the runs on the Phishing
dataset. The ST model achieves 4 perfect detections on
RW datasets and a good level of detection on 3 others. The
TSDD detector yields poor performance, detecting only 2
drifts out of all RW datasets. On the 2 synthetic datasets
with real drift, ADWIN, D3 and RDD fail to detect the
drift, while TSDD detects 1. The KS, MMD and ST detectors
succeed in their detection.</p>
          <p>Results for noise added to the most important features
are shown in Table 6. ADWIN, D3, KS and MMD achieve
perfect detection across both real and synthetic datasets.
RDD's detection results exceed those of ST and TSDD, with 7
perfect detections on RW datasets. TSDD and ST are tied,
with 5 accurate detections each on RW datasets. Only our
detector and TSDD ignore the virtual drift on the Waveform
dataset. ST and TSDD outperform RDD with one more perfect
detection on the synthetic datasets.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.4. Overall performances</title>
        <p>In Table 7, we report the true positive and
true negative results combined by (1). This table showcases the
overall performance achieved by each detector on each
dataset. Because drift induction generates either more
virtual or more real drift on the Musk, Hyperplane and
Waveform datasets, the TN score has a varying
impact there.</p>
        <p>Table 7 allows us to assess the ability of a model to
ignore virtual drift while detecting real drift. RDD yields the
best H scores on 7 RW datasets and ties for first place
on 2 synthetic ones. TSDD takes second place, with the
highest score on 1 RW dataset but coming first or tying for
first place on all synthetic datasets. ST comes third, with
the highest H scores on 2 RW datasets and a tie for first
place on 1 synthetic dataset. On RW datasets, we see that
ADWIN, D3, KS and MMD have overall low scores due
to their misclassification of virtual drift, despite having
detected all real drifts. On synthetic datasets their scores
are better, as few misclassifications were made
when a step drift was induced on the least informative
features.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>In this paper we introduced RDD, a drift detector that
does not need ground truth labels during the inference
phase. We extensively challenged our algorithm against
a number of state of the art drift detectors and over a
large panel of both real and synthetic datasets. We
experimentally proved that our method outperforms current
drift detection methods. We showed our detector’s ability
to detect real drift and to ignore virtual drift. As false
alarms are the main reason why drift detectors are not
on 2 synthetic ones. TSDD takes second place with the
highest score on 1 RW dataset but coming first or tying
ifrst place on all synthetic datasets. ST comes third with
the highest H scores on 2 RW datasets and tying first
place on 1 synthetic dataset. On RW datasets, we see that
ADWIN, D3, KS and MMD have overall low scores due
to their misclassification of virtual drift despite having
detected all real drifts. On synthetic datasets, their score
is better having not made too many misclassification
when a step drift was induced on the least informative
features.
widely used in production. We demonstrated the
usability of our detector for real world applications. We tuned
the hyper-parameters on 3 datasets not used in the
experimental study. We show that they are valid in a wide
range of real-world scenarios and that few efort should
be made when using the models in production. We also
demonstrated the ability of RDD to work in any
dimension, having the best detection accuracy on both datasets
that had over 100 features.</p>
      <p>Future work will consist of further modeling the
partition space. Research will also deal on how a drift detector
can be initialized in recurrent concept drift scenarios,
when no stationary dataset can be used to initialize a
detector.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.-K.</given-names>
            <surname>Reuel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Corso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          ,
          <article-title>Using adaptive stress testing to identify paths to ethical dilemmas in autonomous systems</article-title>
          , in: SafeAI@AAAI,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Feature space singularity for out-of-distribution detection</article-title>
          , arXiv preprint arXiv:2011.14654 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Hand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <article-title>The impact of changing populations on classifier performance</article-title>
          ,
          <source>in: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>367</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Philipsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Dueholm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Johansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nasrollahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Moeslund</surname>
          </string-name>
          ,
          <article-title>Seasons in drift: A long-term thermal imaging dataset for studying concept drift</article-title>
          ,
          <source>in: Thirty-fifth Conference on Neural Information Processing Systems</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Suprem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arulraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <article-title>Odin: Automated drift detection and recovery in video analytics</article-title>
          , arXiv preprint arXiv:2009.05440 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kamoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <article-title>Out-of-distribution detection with likelihoods assigned by deep generative models using multimodal prior distributions</article-title>
          , in: SafeAI@AAAI,
          <year>2020</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>116</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Elwell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Polikar</surname>
          </string-name>
          ,
          <article-title>Incremental learning of concept drift in nonstationary environments</article-title>
          ,
          <source>IEEE Transactions on Neural Networks</source>
          <volume>22</volume>
          (
          <year>2011</year>
          )
          <fpage>1517</fpage>
          -
          <lpage>1531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brzezinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stefanowski</surname>
          </string-name>
          ,
          <article-title>Reacting to different types of concept drift: The accuracy updated ensemble algorithm</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>25</volume>
          (
          <year>2013</year>
          )
          <fpage>81</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Maloof</surname>
          </string-name>
          ,
          <article-title>Dynamic weighted majority: An ensemble method for drifting concepts</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>8</volume>
          (
          <year>2007</year>
          )
          <fpage>2755</fpage>
          -
          <lpage>2790</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Baena-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>del Campo-Ávila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fidalgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gavalda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morales-Bueno</surname>
          </string-name>
          ,
          <article-title>Early drift detection method</article-title>
          ,
          <source>in: Fourth international workshop on knowledge discovery from data streams</source>
          , volume
          <volume>6</volume>
          ,
          <year>2006</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>de Lima Cabral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S. M.</given-names>
            <surname>de Barros</surname>
          </string-name>
          ,
          <article-title>Concept drift detection based on fisher's exact test</article-title>
          ,
          <source>Information Sciences</source>
          <volume>442</volume>
          (
          <year>2018</year>
          )
          <fpage>220</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gavalda</surname>
          </string-name>
          ,
          <article-title>Learning from time-changing data with adaptive windowing</article-title>
          ,
          <source>in: Proceedings of the 2007 SIAM international conference on data mining, SIAM</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heusinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-M.</given-names>
            <surname>Schleif</surname>
          </string-name>
          ,
          <article-title>Reactive soft prototype computing for concept drift streams</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>416</volume>
          (
          <year>2020</year>
          )
          <fpage>340</fpage>
          -
          <lpage>351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heusinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-M.</given-names>
            <surname>Schleif</surname>
          </string-name>
          ,
          <article-title>Reactive concept drift detection using coresets over sliding windows</article-title>
          ,
          <source>in: 2020 IEEE Symposium Series on Computational Intelligence (SSCI)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1350</fpage>
          -
          <lpage>1355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Gözüaçık</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Büyükçakır</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bonab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Can</surname>
          </string-name>
          ,
          <article-title>Unsupervised concept drift detection with a discriminative classifier</article-title>
          ,
          <source>in: Proceedings of the 28th ACM international conference on information and knowledge management</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2365</fpage>
          -
          <lpage>2368</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hickey</surname>
          </string-name>
          ,
          <article-title>Learning classification rules for telecom customer call data under concept drift</article-title>
          ,
          <source>Soft Computing</source>
          <volume>8</volume>
          (
          <year>2003</year>
          )
          <fpage>102</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cerqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <article-title>Unsupervised concept drift detection using a student-teacher approach</article-title>
          , in: International Conference on Discovery Science, Springer,
          <year>2020</year>
          , pp.
          <fpage>190</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rabanser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Günnemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <article-title>Failing loudly: An empirical study of methods for detecting dataset shift</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Borgwardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Rasch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>A kernel two-sample test</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>13</volume>
          (
          <year>2012</year>
          )
          <fpage>723</fpage>
          -
          <lpage>773</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <article-title>Task-sensitive concept drift detector with constraint embedding</article-title>
          ,
          <source>in: 2021 IEEE Symposium Series on Computational Intelligence (SSCI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>01</fpage>
          -
          <lpage>08</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Serdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cihan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yücel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Serdar</surname>
          </string-name>
          ,
          <article-title>Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies</article-title>
          ,
          <source>Biochemia Medica</source>
          <volume>31</volume>
          (
          <year>2021</year>
          )
          <fpage>27</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Sethi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantardzic</surname>
          </string-name>
          ,
          <article-title>On the reliable detection of concept drift from streaming unlabeled data</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>82</volume>
          (
          <year>2017</year>
          )
          <fpage>77</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Sethi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantardzic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arabmakki</surname>
          </string-name>
          ,
          <article-title>Monitoring classification blindspots to detect drifts from unlabeled data</article-title>
          ,
          <source>in: 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Van Looveren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Klaise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vacanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cobb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scillitoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Samoilescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Athorne</surname>
          </string-name>
          ,
          <article-title>Alibi Detect: Algorithms for outlier, adversarial and drift detection</article-title>
          ,
          <year>2019</year>
          . URL: https://github.com/SeldonIO/alibi-detect.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>