<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Quality in Data Streams by Modular Change Point Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yaron Kanza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajat Malik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Divesh Srivastava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caroline Stone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gordon Woodhull</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>AT&amp;T Chief Data Office</institution>
          ,
          <addr-line>New Jersey</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Sensors that collect data from complex systems generate a stream of measurements, for example, measuring CPU utilization of machines in a data center, gathering meteorological data like atmospheric pressure and humidity levels across the USA, or tracking the occupancy of taxis in a large city. Downstream systems use the streamed data in a variety of applications, including training machine learning models and making data-driven decisions as part of automation. This makes data quality critical and requires detecting significant, unexpected, and rapid changes in indicative features of the streaming data. This can be done by detecting change points in the stream - points where the underlying distribution of a statistical feature of the stream fundamentally changes. In this paper, we discuss different types of change points in the data stream - changes that indicate a potential data quality problem. We present a modular method for combining operations on data streams to examine data quality in a flexible and adaptable way. Experiments over real-world and synthetic data streams show the effectiveness of the modular approach in comparison to traditional anomaly detection methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Anomaly detection</kwd>
        <kwd>change point detection</kwd>
        <kwd>data streams</kwd>
        <kwd>modular architecture</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>When monitoring complex systems like cellular
networks, data centers, cloud infrastructures and content
delivery networks, the monitoring system generates a
data stream of telemetry, such as processing times, data
transfer times, communication latency, CPU utilization,
memory usage, network throughput, and other statistics
that can help to track the health of the system.
Monitoring is also used for collecting meteorological data for
weather forecasting, traffic data to regulate and mitigate
congestion in highways and highly-used roads, tracking
the operation of machines and facilities, and continuously
gathering data for real-time systems.</p>
      <p>Data streams are often analyzed to detect anomalies
and irregularities. Anomalies and irregularities in the
stream may indicate a problem in the underlying system
or may reveal an event that requires intervention. Since
the data in the stream is the basis for critical decisions,
poor data quality may affect those decisions. In addition,
collected data sets are often used for training machine
learning models. The models are trained to learn the
expected behavior of systems and applications. Thus, the
data that is fed into these models in the training process
should be accurate and representative. This requires
high data quality. Otherwise, the trained models could
be biased or yield inaccurate results. The impact of data
quality on machine learning is discussed in [1].</p>
      <p>Maintaining high-quality data is crucial when
critical applications depend on the monitored system or on
models that are trained over the data. This is essential
in applications for forecasting events, and for detecting
security attacks, frauds, outages, and the effect of natural
events like storms on infrastructures and services.</p>
      <p>Data quality has many aspects, including
completeness (no missing data), consistency (the data does not
lead to contradictory inferences), cleanliness (no noise),
conformity (complying with standards and rules), and
continuity (uniformity in the arrival of the data). Some of
these aspects can be evaluated using standard anomaly
detection tools, but only to a limited extent. Therefore,
there is a need to combine a variety of tools for effective
data-quality assurance.</p>
      <p>There are many tools and methods for detecting
anomalies (outliers) in streaming data [2]. Anomalies
are values in the data stream that are significantly
different from the values that are expected based on
previous observations. Often, anomalies can indicate that
the system does not function properly. However, most
anomalies are ephemeral and can be ignored, because by
the time that they are noticed the system is already back
to normal. So, it is often essential to focus on lasting
changes in the data stream, detect them, and alert on
them. This raises several questions. First, what type of
changes should the system detect? Second, how should
the system raise alerts, without overwhelming the user with too many alerts
but also without missing critical alerts?</p>
    </sec>
    <sec id="sec-2">
      <title>3. Quality Measures over Streams</title>
      <p>In this paper our focus is on detection of change points,
that is, points where the underlying distribution of a
statistical feature of the stream changes in a significant,
non-ephemeral, and unexpected way. We present a modular
architecture for change point detection over streaming
data, to provide flexibility and adaptability for a large
variety of data streams and diverse use cases.</p>
      <p>The paper is organized as follows. In Section 2 we
discuss related work. Section 3 introduces quality measures
for data streams. In Section 4 we present methods
for detection of change points. Section 5 describes our
modular architecture and its benefits. Section 6 presents
the results of our experimental evaluation. In Section 7
we discuss our conclusions and future work.</p>
      <p>In this section we provide formal definitions and present
the problem of discovering changes in the underlying
distribution of quality measures over a data stream. Unlike
time series with a bounded number of points, streams
often have high volume, velocity, variety, and veracity,
so quality measurements should be adapted to streams
accordingly [20]. We present examples and illustrate
our method based on real data taken from the Numenta
Anomaly Benchmark (https://github.com/numenta/NAB), e.g., a sequence
from a stream of taxi occupancy in the Twin Cities.</p>
      <p>A data stream is a sequence of measurements S = m_1, m_2, . . .,
where each measurement m_i = (t_i, v_i) is
a pair of a valid time t_i and a measured value v_i. The valid
time t_i is the time when the value v_i was measured. The
time when the measurement is processed as part of the
stream is considered the transaction time. The delay δ_i of
measurement m_i is the difference between the valid time
and the transaction time.</p>
      <p>Related work. The study in this paper is related to the following three
research areas: data quality, anomaly detection, and change
point detection. These areas have been studied extensively;
however, the approach of modular change point detection,
which we present in this paper, is novel.</p>
      <p>Data quality. Quality measures for data streams have
been studied in different contexts [3, 4, 5, 6]. Klein [7]
examined data quality in sensor data streaming. Karkouch
et al. [8] explored data quality in streams produced by IoT
devices. Brown et al. [9] studied methods for coping with
glitches in spatiotemporal streams by applying
smoothing and imputation to data streams produced by spatially
distributed sensors. The importance of empiricism in
data quality studies has been emphasized in [10].</p>
      <p>For a time series where all the measurements are given
a priori, computing statistics like the mean and variance is
simple. But in streaming data, new values arrive
continuously and the statistics change frequently. So, values
like the mean and variance should be based on recent values
in the stream, not on the entire history. This can be done
using a sliding window [21] or a decaying mean [22, 23].</p>
      <p>Sliding window. When using a sliding window W
of size n, at time i ≥ n, the sub-sequence S_n[m_i] =
v_{i−n+1}, v_{i−n+2}, . . . , v_i of the stream S comprises the
n most recent values in the stream up to measurement
m_i. The mean μ_i, variance σ_i^2, standard deviation σ_i,
median M_i, and other statistics of S_n[m_i] are computed in
the usual way. Since for each measurement m_i there is a
different window, the statistics of S_n[m_i] may be different
from the statistics of S_n'[m_i'] when i ≠ i' or n ≠ n'.</p>
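      <p>To illustrate, maintaining sliding-window statistics can be sketched as follows (a minimal sketch, not the implementation used in the paper; the window size and the sample stream are arbitrary):</p>
      <preformat>
```python
from collections import deque
from statistics import mean, median, pvariance

def sliding_stats(stream, n):
    """Yield (mean, variance, median) over the n most recent values,
    once at least n values have arrived."""
    window = deque(maxlen=n)  # keeps only the n most recent values
    for v in stream:
        window.append(v)
        if len(window) == n:
            yield mean(window), pvariance(window), median(window)

stats = list(sliding_stats([1, 2, 3, 4, 5, 6], n=4))
```
      </preformat>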
      <p>Anomaly detection. Anomaly detection in time series
has received a lot of attention in the literature. Many
different anomaly detection methods have been
developed and tested [11, 12, 13]. See Schmidl et al. [2] for
a recent comparison of many methods. However, point
anomalies are often ephemeral and do not reflect
significant changes in the stream or data-quality issues. Some
studies of anomalies considered anomalous subsequences
rather than point anomalies. Boniol et al. [14, 15] studied
a method for finding subsequences of a time series that
are the farthest from a normal distribution. However,
their assumption of normal distribution in the data does
not hold in many real-world data streams, like those that
we explore. Moreover, these studies do not focus on data
quality or on change point detection.</p>
      <sec id="sec-2-1">
        <title>Change point detection</title>
        <p>Change detection has been studied for time series [16]
and data streams [17, 18, 19]; however, these methods
were not designed for data quality measures and do not
explore the modular approach that we present in this paper.</p>
        <p>Decaying mean and variance. A decaying mean μ̄_i is
computed with a decay parameter 0 &lt; α ≤ 1, such that
μ̄_1 = v_1 and, for i &gt; 1, μ̄_i = α v_i + (1 − α) μ̄_{i−1}. We
refer to the residual at time i as the difference v_i − μ̄_i,
where v_i is the measured value at time i and μ̄_i is the
decaying mean at that point. The decaying variance at
time i is a decaying average over the squared residuals, that is,
σ̄_1^2 = 0 and, for i &gt; 1, σ̄_i^2 = α (v_i − μ̄_i)^2 + (1 − α) σ̄_{i−1}^2.</p>
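        <p>The recurrences above can be sketched as follows (an illustrative sketch; the decay parameter value 0.3 is arbitrary):</p>
        <preformat>
```python
def decaying_stats(values, alpha=0.3):
    """Decaying mean and variance: mu_1 = v_1, then
    mu_i = alpha*v_i + (1-alpha)*mu_{i-1};
    var_1 = 0, then var_i = alpha*(v_i - mu_i)**2 + (1-alpha)*var_{i-1}."""
    mu, var, out = None, 0.0, []
    for v in values:
        if mu is None:
            mu = v  # mu_1 = v_1
        else:
            mu = alpha * v + (1 - alpha) * mu
            var = alpha * (v - mu) ** 2 + (1 - alpha) * var
        out.append((mu, var))
    return out

history = decaying_stats([10.0, 10.0, 10.0, 20.0])
```
      </preformat>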
      </sec>
      <sec id="sec-2-2">
        <title>Point outlier</title>
        <p>A point outlier is a value that significantly
exceeds the expected value, e.g., a value v_i that is above
or below the mean by more than 2.5 standard deviations,
|v_i − μ_i| &gt; 2.5 σ_i. Outliers could indicate a volatile
data quality problem. In Fig. 1, the red dots are outliers
returned by the kNN outlier detection method.</p>
        <p>Data quality. Various data quality issues can be detected
based on changes in the statistical properties of a data stream.
∙ Skewness of Delay (third moment). The delay may
behave somewhat like an asymmetric wave, and the skew
will indicate whether the problem is increasing or decreasing.
∙ Outlier Rate. In many cases, the rate of point outliers
is an indicator of data quality problems, e.g., jitter in a
communication network. In some systems it is expected
to have a few glitches and anomalies from time to time.
But a major increase in the rate or concentration of point
outliers is regarded as a data quality issue.</p>
      </sec>
      <p>The goal is to apply data quality measurements in an effective and modular way and raise an alert when there are significant changes in the stream for the relevant data quality measures.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Detecting Changes in a Stream</title>
      <p>Data streams and their statistical properties vary and
depend on the application. In this paper we suggest
a modular approach for anomaly detection over data
streams. Each module receives a stream of data items and
returns a stream of data items. A modular architecture
is achieved by combining different modules such that
the output stream of one module is the input of the next
module. In this section we define some of the modules
and their composition.</p>
      <p>Some of the characteristics of the stream can
be measured using the moments of the distribution, of
the measured values, or of the delays. Commonly, for a
random variable X, the k-th moment is E[(X − μ)^k], i.e.,
the normalized expectancy of the residuals to the power
of k. The following are measurable changes in a data
stream that can be evaluated using moments.
∙ Level Shift in Value (first moment). A significant change
in the values of measurements can be the result of a
data-quality problem. For example, in a system that monitors
temperatures, an unexpected lasting increase or decrease
in the measured values can be the result of a calibration
issue or a malfunction of sensors. In Fig. 1, there is a level
shift around the date of September 12.
∙ Level Shift in Variance (second moment). A significant
change in the variance of measurements can be the result
of noise. The noise could affect measurement accuracy
and impact the data quality. For example, noise could be
the result of partial interference to a sensor.</p>
      <p>∙ Level Shift in Skewness (third moment). The skew
measures the symmetry of the distribution. It can be
measured as the distribution of the differences μ_i − M_i
between the mean and median values. It may reflect bias
that affects data quality.
∙ Varying Delay (second moment). A change in the
variance of the delay indicates that measurements are
arriving inconsistently. This can often cause data loss or
improper data processing by downstream applications.</p>
      <p>Value extraction. Given the initial stream S, the first
module extracts the statistical values that we want to
measure. For example, we can extract from the stream of
measurements a stream of values v_1, v_2, . . ., a stream of
delays δ_1, δ_2, . . ., a stream of mean values μ_1, μ_2, . . ., a
stream of variance or skew values for the measured values or the delays,
a stream of point outliers, and so on. The residuals for
computing the mean, the variance, or the skew can be
based on a sliding window or a decaying mean.</p>
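      <p>A value-extraction module can be sketched as a generator that maps the stream of (valid time, value) pairs to a derived stream (a hedged illustration; the sample measurements and arrival times are hypothetical):</p>
      <preformat>
```python
def values(stream):
    """Extract the stream of measured values v_1, v_2, ... from (t, v) pairs."""
    for t, v in stream:
        yield v

def delays(stream, arrival_times):
    """Extract the stream of delays: transaction time minus valid time."""
    for (t, v), arrival in zip(stream, arrival_times):
        yield arrival - t

measurements = [(0, 5.0), (1, 6.0), (2, 7.0)]
vals = list(values(measurements))
dels = list(delays(measurements, [0.5, 1.2, 3.0]))
```
      </preformat>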
      <p>∙ Changes in Volume. The volume is the number of
measurements that arrive in each time interval. Unexpected
changes in the volume may indicate that some measurements
are missing, are duplicated, or arrive from data sources
that should not be included in the stream.
∙ Delayed Data (first moment). The measurements may
arrive one by one or in a batch. The delay is the difference
between the valid time and the transaction time of the
measurement. A significant increase in the difference
may indicate that something is delaying the data arrival,
which may lead to missing data, data points that arrive
out of order, or measurements that arrive too late for
some online applications.</p>
      <p>Smoothing and imputation. In some cases, we may
want to apply smoothing or convolution to emphasize
certain features of the stream. Smoothing can be done in
different ways, e.g., by replacing values with smoothed
values s_1, s_2, . . . based on a moving average and a trend
factor b_i, where s_1 = v_1, b_1 = v_2 − v_1, s_i = α v_i + (1 −
α)(s_{i−1} + b_{i−1}), and b_i = β (s_i − s_{i−1}) + (1 − β) b_{i−1},
for some 0 &lt; α &lt; 1 and 0 &lt; β &lt; 1. Seasonality can
also be included in the smoothing using Holt–Winters
smoothing [24]. Smoothing can also be executed using
Kernel Density Estimation (KDE) [25], by applying a
kernel function to the stream.
Predicted values using a moving average, Holt–Winters
exponential smoothing, ARIMA, and other forecasting
methods can be used for imputation of missing values,
to create a stream that is more complete if the next step
of the processing is by a method that does not cope well
with missing values.</p>
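      <p>The smoothing recurrence above can be sketched directly (an illustration of plain double exponential smoothing, without the Holt–Winters seasonal term; the choices α = β = 0.5 are arbitrary):</p>
      <preformat>
```python
def double_exponential_smoothing(v, alpha=0.5, beta=0.5):
    """s_1 = v_1, b_1 = v_2 - v_1; then
    s_i = alpha*v_i + (1-alpha)*(s_{i-1} + b_{i-1}),
    b_i = beta*(s_i - s_{i-1}) + (1-beta)*b_{i-1}."""
    s, b = v[0], v[1] - v[0]
    smoothed = [s]
    for x in v[1:]:
        s_prev = s
        s = alpha * x + (1 - alpha) * (s_prev + b)
        b = beta * (s - s_prev) + (1 - beta) * b
        smoothed.append(s)
    return smoothed

out = double_exponential_smoothing([1.0, 2.0, 3.0, 4.0])
```
      </preformat>
      <p>On a perfectly linear stream the smoothed values track the input exactly, since the trend factor captures the constant slope.</p>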
      <p>Distribution comparison with moving windows. A
comparison of the underlying distributions is executed
for two consecutive moving windows. By measuring
the distance between the distributions, we get a new
stream of values. Formally, given the stream S, let
v_{i+1−n}, v_{i+2−n}, . . . , v_i be the n values of window S_n[m_i],
and let v_{i+1}, v_{i+2}, . . . , v_{i+n} be the n values of window
S_n[m_{i+n}]. Note that S_n[m_i] and S_n[m_{i+n}] are consecutive
windows. The distributions D[m_i] and D[m_{i+n}] of
the values in the two windows can be compared by computing
the Earth Mover's Distance (EMD), also known as the
Wasserstein distance, the Jensen–Shannon divergence,
the Kullback–Leibler divergence, etc. For every i, the difference
between the distributions D[m_i] and D[m_{i+n}] yields a
value d_i, and the result is a sequence d_i, d_{i+1}, d_{i+2}, . . .,
that is, a stream of differences between the distributions.
Extreme values in this stream indicate a significant change.</p>
      <p>Combining modules. In Figures 5-8 we see how a
composition of modules is applied to detect a level shift in
the variance. Fig. 7 shows the stream that is produced by
applying EMD to the two rolling consecutive windows.
Note that there are two large peaks, or elevated parts,
in the sequence. One is at the beginning of the change
and the other is at the end of it. Fig. 8 shows the rolling
Z-score of the sequence in Fig. 7. We can see the effectiveness
of detecting the change point in comparison to ordinary
anomaly detection, e.g., kNN anomaly detection as depicted in Fig. 6.</p>
      <p>Early detection. The comparison of two windows of
size n may lead to a delay in detection. For measurement
m_i = (t_i, v_i), comparing the window S_n[m_i] of n
values that precede m_i with the n values of S_n[m_{i+n}] that
follow m_i requires waiting for n measurements to be
delivered in the stream after seeing m_i.</p>
      <sec id="sec-3-1">
        <title>Rolling Z-score</title>
        <p>In each stream, including the stream
that is produced by the comparison of distributions over
the sliding windows, we can find extreme values by using
the Z-score with respect to the moving average, or by some
other anomaly detection method. The extreme values are
clustered, to prevent a burst of alerts. In Fig. 4 we see the
rolling Z-score as the blue line and the extreme values as
a cluster of red dots.</p>
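        <p>A rolling Z-score module can be sketched as follows (an illustrative sketch; the window size and the 3-standard-deviation threshold are arbitrary):</p>
        <preformat>
```python
from collections import deque

def rolling_zscore(values, n, threshold=3.0):
    """Z-score of each value w.r.t. the mean and standard deviation of
    the preceding n values; returns the indices of extreme values."""
    window, flagged = deque(maxlen=n), []
    for i, v in enumerate(values):
        if len(window) == n:
            mu = sum(window) / n
            var = sum((x - mu) ** 2 for x in window) / n
            if var > 0 and abs(v - mu) / var ** 0.5 > threshold:
                flagged.append(i)
        window.append(v)
    return flagged

alerts = rolling_zscore([1, 1, 2, 1, 2, 1, 2, 1, 9], n=6)
```
      </preformat>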
      </sec>
      <p>This delay can be mitigated by computing an estimation
of the distance between the distributions and issuing a warning if the
estimation indicates a high likelihood for a change point.
Let dist(W_1, W_2) be a function that computes the difference
in distribution between two windows. To assess
the distance early, we define a function est(i, j) that estimates
the distance between the windows S_n[m_i] and S_n[m_{i+n}]
after seeing measurement m_j, for i &lt; j &lt; i + n, based on
the values v_{i+1}, . . . , v_j seen so far. Earlier estimations are
based on fewer values, so they are less accurate, but they
may provide an early indication of the change and trigger
a warning that there is a high likelihood for a change point.</p>
      <sec id="sec-3-2">
        <title>5. Modular Architecture</title>
        <p>In this paper we suggest a modular architecture for
change point detection. In a modular architecture, the
components receive a stream of values and produce a
stream of values, so components can be composed in
different ways, dynamically. Typically, processing is in a
chain-like structure where the first component receives
a stream of measurements as the input and the last component
yields a stream of alerts, as illustrated in Fig. 9
and Fig. 10 (Figure 9: Composition of components into a chain to
discover change points in the data stream).</p>
        <p>There are several benefits to the modular approach.
One is reusing components, e.g., in Section 4, modules
for computing EMD or the rolling Z-score were applied to
measurement values and to variance values. Hence, the
same modules can be reused in different change point
detection tasks.</p>
        <p>Another benefit of the modular approach is dynamic
composition of components. Modules can be added,
adjusted, or removed from a chain to accommodate changes
in the streaming data. For example, consider two chains
C_1 and C_2 of components. Chain C_1, designed for
detecting a level shift, comprises (1) extracting the measurement
values v_1, v_2, . . . from the stream, (2) applying EMD to
the extracted values, and (3) using the rolling Z-score for
finding change points. Chain C_2 is the same as C_1,
except that in the first step it extracts the residual values
|v_1 − μ̄_1|, |v_2 − μ̄_2|, . . . and finds change points for the
variance. In this case, if a significant increase in the variance
is detected by C_2, the system can add an initial
component to C_1 for smoothing the values v_1, v_2, . . .
before applying level-shift detection. This reduces the
noise caused by the large variance, to prevent an undesirable
effect on the level-shift detection. If detection
of missing values is applied, a detected increase in missing
values may lead to adding an imputation module to
chain C_2, so that the missing values will not affect the
monitoring of the variance.</p>
        <p>In some cases, we could have trees instead of linear
chains, where the stream of a component can be directed
to two or more branches (sub-chains). A composition
may form a DAG when some components aggregate or
combine the results of two or more streams.</p>
        <p>Selecting the components and the order in which they
are composed can be done based on a labeled ground
truth. The system architect will examine typical data
quality issues in the use case the system is built for
and will try different combinations of modules, to find
the combination that provides the best detection accuracy.
This process can be automated, so that the system
could check the detection chains periodically against the
ground truth, and the best combination of modules will
be selected and used.</p>
        <p>We implemented the modules in Python on top of
Databricks, to utilize the large distributed storage and
computation capacity of Spark and to have the flexibility
of Python and Databricks notebooks.</p>
        <p>Evaluation. We computed for the different methods
their precision (the percentage of correct detection cases
out of all detection cases), recall (the percentage of
correct detection cases out of all the true cases), the percentage
of false positive cases out of all positive cases, and the
number of false positive cases. This shows how many
false alerts could be raised. Note that too many alerts can
lead to a case where alerts are ignored [27], i.e., alert
fatigue, so we want to avoid false alerts.</p>
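        <p>The chain abstraction can be sketched as function composition over streams (a toy illustration; the modules here are hypothetical stand-ins, not the actual EMD and Z-score modules):</p>
        <preformat>
```python
def chain(*modules):
    """Compose modules left to right: the output stream of one
    module is the input stream of the next."""
    def run(stream):
        for module in modules:
            stream = module(stream)
        return stream
    return run

# Illustrative modules over lists of numbers.
extract = lambda s: [v for v in s]                       # value extraction
diffs = lambda s: [abs(b - a) for a, b in zip(s, s[1:])] # stream of changes
alert = lambda s: [i for i, d in enumerate(s) if d > 2]  # alert positions

detector = chain(extract, diffs, alert)
hits = detector([1, 1, 1, 9, 9, 9])
```
      </preformat>
      <p>Because each module consumes and produces a stream, modules can be appended, removed, or replaced without touching the rest of the chain.</p>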
      </sec>
      <p>The modular approach can also be implemented over stream
processing systems like Apache Flink [26], by leveraging the stream
processing API they provide. This would automatically
add data-quality capabilities to these systems.</p>
      <sec id="sec-3-3">
        <title>6. Experimental Evaluation</title>
        <p>We conducted an experimental evaluation to (1) show
the effectiveness of our method for change point detection,
in comparison to ordinary outlier detection, and
(2) demonstrate the benefits of the modular design when
combining and reusing components.</p>
        <p>Data. In the experiments we used real data from the Numenta
Anomaly Benchmark and streamed the measurements.
To have a ground truth, we inserted data-quality
issues into the time series, like adding to selected regions
a level shift, noise, outliers, gaps, delays, etc. This gave
us the ability to distinguish between true positive cases,
at a change point, and false positive cases, not near a
change point. We present experiments with two data
sets. (1) Taxi is real taxi occupancy data collected in
2015, in the Twin Cities Metro area, Minnesota. (2) CPU
Util. is CPU utilization at an AWS cluster.</p>
        <p>Methods. We used different combinations of components.
As a baseline we used kNN, the kNN unsupervised
outlier detection method, which finds the closest k nearest
neighbors of every data point and measures the average
distance to them.</p>
        <p>Results. The results show the effectiveness of detecting
change points using the combined components. Table 1
shows that by executing EMD combined with Z-score on
the modified CPU stream (Fig. 11), the detection has much
higher accuracy than kNN (Fig. 12: kNN over the CPU
utilization stream). Note that kNN has a
large number of false detection cases, because it detects
point outliers that are not part of a change point.</p>
        <p>In most of our tests all the change points were detected,
i.e., an alert was raised at or near the change point. In
these cases the recall was 1. Note that change points
are noticeable in the time series that we explored, so
preventing false positive cases in these tests is a greater
challenge than preventing false negative cases. The modular
approach can be used to create chains with varying
sensitivity to false positives or false negatives, according
to the application and the features of the data stream.</p>
        <p>The results for detecting variance shifts are presented
in Table 2. Note that for a variance level shift, kNN generates
too many false alerts. When using EMD combined
with Z-score, the detection has high precision and high
recall; however, with JSD the combined method does not
detect the level shift and has low recall, because JSD is
designed for categorical data and not for metric data.</p>
        <p>Table 3 presents the results of detection of a shift in
the frequency of point outliers. We can see in Table 3 that
applying a rolling window for counting the frequency of
outliers detected by kNN, combined with Z-score, does not
have high accuracy. This is because kNN generates too
many anomalies, not just near the change point. When
executing ARIMA as an outlier detection method, the accuracy
is still low. However, when executing a rolling window
that counts the outlier frequency detected by ARIMA and
applying Z-score to the result, we get a precision of 0.85.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Discussion</title>
      <p>Detecting data quality issues in streaming data is
challenging because (1) the data can frequently change, (2) not
all of the data is available while it is streaming, and (3) data
quality can be affected by delays or changes in the underlying
distribution of the data arriving from the applications
that generate the data. However, many data quality issues
can be discovered as change points in the distribution of
a statistical measure.</p>
      <p>There are different types of statistical measures, data
types, and data quality issues. Instead of developing a
completely independent method for each case, we suggest
a modular approach in which basic statistical components
over streams can be combined and reused for
detection of change points. We show in this paper that
the combined components are much more effective than
traditional methods for point outliers. We show results
for kNN, but we also tested other outlier detection methods,
including ARIMA, Z-score, and Histogram-Based
Outlier Scoring (HBOS), and got similar results. When
using traditional outlier detection methods over real data
there are too many outliers, and creating an alert whenever
an outlier is detected could overwhelm the users
and make them ignore alerts (the "Boy Who Cried Wolf"
effect [27]). Thus, it is essential to only raise alerts when
there are significant change points. In this paper we
show that our modular approach is effective at detecting
change points without raising too many false alerts.</p>
      <p>One of the limitations of change point detection is that
it may miss concept drifts (changes over time in unforeseen
ways) [28]. Detection of concept drifts may require
a complementary method, so further study is needed.</p>
      <p>While the modular method presented in this paper
provides a promising direction for the detection of data
quality issues over streaming data, more study is needed
over a larger variety of data streams and for additional
use cases. Future work includes the development of a
method that could help users select the best combination
of components and parameters for their streaming
data use cases. Future work also includes exploring the
approach of ranking alerts based on the length and complexity
of the chain used for the detection. The premise is
that simpler chains may detect more noticeable changes,
and thus, changes detected by simple chains should have
higher priority than detections by complex chains.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-1"><mixed-citation>[1] L. Budach, M. Feuerpfeil, N. Ihde, A. Nathansen, N. Noack, H. Patzlaff, H. Harmouch, F. Naumann, The effects of data quality on machine learning performance, arXiv preprint arXiv:2207.14529 (2022).</mixed-citation></ref>
      <ref id="ref-2"><mixed-citation>[2] S. Schmidl, P. Wenig, T. Papenbrock, Anomaly detection in time series: a comprehensive evaluation, PVLDB 15 (2022) 1779–1797.</mixed-citation></ref>
      <ref id="ref-3"><mixed-citation>[3] T. Dasu, R. Duan, D. Srivastava, Data quality for temporal streams, IEEE Data Eng. Bull. 39 (2016) 78–92.</mixed-citation></ref>
      <ref id="ref-4"><mixed-citation>[4] A. Klein, H.-H. Do, G. Hackenbroich, M. Karnstedt, W. Lehner, Representing data quality for streaming and static data, in: 2007 IEEE 23rd International Conference on Data Engineering Workshop, IEEE, 2007, pp. 3–10.</mixed-citation></ref>
      <ref id="ref-5"><mixed-citation>[5] F. Korn, S. Muthukrishnan, Y. Zhu, Checks and balances: Monitoring data quality problems in network traffic databases, in: Proceedings 2003 VLDB Conference, Morgan Kaufmann, 2003, pp. 536–547.</mixed-citation></ref>
      <ref id="ref-6"><mixed-citation>[6] B. Saha, D. Srivastava, Data quality: The other face of big data, in: 2014 IEEE 30th International Conference on Data Engineering, IEEE, 2014, pp. 1294–1297.</mixed-citation></ref>
      <ref id="ref-7"><mixed-citation>[7] A. Klein, W. Lehner, Representing data quality in sensor data streaming environments, Journal of Data and Information Quality (JDIQ) 1 (2009) 1–28.</mixed-citation></ref>
      <ref id="ref-8"><mixed-citation>[8] A. Karkouch, H. Mousannif, H. Al Moatassime, T. Noel, Data quality in internet of things: A state-of-the-art survey, Journal of Network and Computer Applications 73 (2016) 57–81.</mixed-citation></ref>
      <ref id="ref-9"><mixed-citation>[9] P. E. Brown, T. Dasu, Y. Kanza, D. Srivastava, From rocks to pebbles: Smoothing spatiotemporal data streams in an overlay of sensors, ACM Transactions on Spatial Algorithms and Systems 5 (2019).</mixed-citation></ref>
      <ref id="ref-10"><mixed-citation>[10] S. Sadiq, T. Dasu, X. L. Dong, J. Freire, I. F. Ilyas, S. Link, R. J. Miller, F. Naumann, X. Zhou, D. Srivastava, Data quality: The role of empiricism, ACM SIGMOD Record 46 (2018) 35–43.</mixed-citation></ref>
      <ref id="ref-11"><mixed-citation>[11] S. Ahmad, A. Lavin, S. Purdy, Z. Agha, Unsupervised real-time anomaly detection for streaming data, Neurocomputing 262 (2017) 134–147.</mixed-citation></ref>
      <ref id="ref-12"><mixed-citation>[12] A. Blázquez-García, A. Conde, U. Mori, J. A. Lozano, A review on outlier/anomaly detection in time series data, ACM Computing Surveys (CSUR) 54 (2021) 1–33.</mixed-citation></ref>
      <ref id="ref-13"><mixed-citation>[13] A. Boukerche, L. Zheng, O. Alfandi, Outlier detection: Methods, models, and classification, ACM Computing Surveys (CSUR) 53 (2020) 1–37.</mixed-citation></ref>
      <ref id="ref-14"><mixed-citation>[14] P. Boniol, J. Paparrizos, T. Palpanas, M. J. Franklin, SAND: Streaming subsequence anomaly detection, PVLDB 14 (2021) 1717–1729.</mixed-citation></ref>
      <ref id="ref-15"><mixed-citation>[15] P. Boniol, M. Linardi, F. Roncallo, T. Palpanas, M. Meftah, E. Remy, Unsupervised and scalable subsequence anomaly detection in large data series, The VLDB Journal (2021) 1–23.</mixed-citation></ref>
      <ref id="ref-16"><mixed-citation>[16] S. Aminikhanghahi, D. J. Cook, A survey of methods for time series change point detection, Knowledge and Information Systems 51 (2017) 339–367.</mixed-citation></ref>
      <ref id="ref-17"><mixed-citation>[17] D. Kifer, S. Ben-David, J. Gehrke, Detecting change in data streams, in: Proceedings of the 30th International Conference on Very Large Data Bases, VLDB '04, VLDB Endowment, 2004, pp. 180–191.</mixed-citation></ref>
      <ref id="ref-18"><mixed-citation>[18] L. I. Kuncheva, Change detection in streaming multivariate data using likelihood detectors, IEEE Transactions on Knowledge and Data Engineering 25 (2011) 1175–1180.</mixed-citation></ref>
      <ref id="ref-19"><mixed-citation>[19] D.-H. Tran, M. M. Gaber, K.-U. Sattler, Change detection in streaming data in the era of big data: models and issues, ACM SIGKDD Explorations Newsletter 16 (2014) 30–38.</mixed-citation></ref>
      <ref id="ref-20"><mixed-citation>[20] J. Merino, I. Caballero, B. Rivas, M. Serrano, M. Piattini, A data quality in use model for big data, Future Generation Computer Systems 63 (2016) 123–130.</mixed-citation></ref>
      <ref id="ref-21"><mixed-citation>[21] B. Babcock, M. Datar, R. Motwani, Sampling from a moving window over streaming data, in: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2002, pp. 633–634.</mixed-citation></ref>
      <ref id="ref-22"><mixed-citation>[22] E. Cohen, M. Strauss, Maintaining time-decaying stream aggregates, in: Proc. of the 22nd ACM Symposium on Principles of Database Systems, 2003, pp. 223–233.</mixed-citation></ref>
      <ref id="ref-23"><mixed-citation>[23] G. Cormode, F. Korn, S. Tirthapura, Exponentially decayed aggregates on data streams, in: 24th Int. Conf. on Data Engineering, IEEE, 2008, pp. 1379–1381.</mixed-citation></ref>
      <ref id="ref-24"><mixed-citation>[24] S. Gelper, R. Fried, C. Croux, Robust forecasting with exponential and Holt–Winters smoothing, Journal of Forecasting 29 (2010) 285–300.</mixed-citation></ref>
      <ref id="ref-25"><mixed-citation>[25] M. Rosenblatt, Remarks on some nonparametric estimates of a density function, The Annals of Mathematical Statistics (1956) 832–837.</mixed-citation></ref>
      <ref id="ref-26"><mixed-citation>[26] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas, Apache Flink: Stream and batch processing in a single engine, The Bulletin of the Technical Committee on Data Engineering 38 (2015).</mixed-citation></ref>
      <ref id="ref-27"><mixed-citation>[27] P. E. Brown, T. Dasu, Y. Kanza, E. Koutsofios, R. Malik, D. Srivastava, Don't cry wolf, in: International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2019, pp. 616–617.</mixed-citation></ref>
      <ref id="ref-28"><mixed-citation>[28] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Computing Surveys (CSUR) 46 (2014) 1–37.</mixed-citation></ref>
    </ref-list>
  </back>
</article>