<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Quality in Data Streams by Modular Change Point Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yaron Kanza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajat Malik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Divesh Srivastava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caroline Stone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gordon Woodhull</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>AT&amp;T Chief Data Office</institution>
          ,
          <addr-line>New Jersey</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Sensors that collect data from complex systems generate a stream of measurements, for example, measuring CPU utilization of machines in a data center, gathering meteorological data like atmospheric pressure and humidity levels across the USA, or tracking the occupancy of taxis in a large city. Downstream systems use the streamed data in a variety of applications, including training machine learning models and making data-driven decisions as part of automation. This makes data quality critical and requires detecting significant, unexpected, and rapid changes in indicative features of the streaming data. This can be done by detecting change points in the stream - points where the underlying distribution of a statistical feature of the stream fundamentally changes. In this paper, we discuss different types of change points in the data stream - changes that indicate a potential data quality problem. We present a modular method for combining operations on data streams to examine data quality in a flexible and adaptable way. Experiments over real-world and synthetic data streams show the effectiveness of the modular approach in comparison to traditional anomaly detection methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Anomaly detection</kwd>
        <kwd>change point detection</kwd>
        <kwd>data streams</kwd>
        <kwd>modular architecture</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>When monitoring complex systems like cellular
networks, data centers, cloud infrastructures and content
delivery networks, the monitoring system generates a
data stream of telemetry, such as processing times, data
transfer times, communication latency, CPU utilization,
memory usage, network throughput, and other statistics
that can help to track the health of the system.
Monitoring is also used for collecting meteorological data for
weather forecasting, traffic data to regulate and mitigate
congestion in highways and highly-used roads, tracking
the operation of machines and facilities, and continuously
gathering data for real-time systems.</p>
      <p>Data streams are often analyzed to detect anomalies
and irregularities. Anomalies and irregularities in the
stream may indicate a problem in the underlying system
or may reveal an event that requires intervention. Since
the data in the stream is the basis for critical decisions,
poor data quality may affect those decisions. In addition,
collected data sets are often used for training machine
learning models. The models are trained to learn the
expected behavior of systems and applications. Thus, the
data that is fed into these models in the training process
should be accurate and representative. This requires
high data quality. Otherwise, the trained models could
be biased or yield inaccurate results. The impact of data
quality on machine learning is discussed in [1].</p>
      <p>Maintaining high-quality data is crucial when
critical applications depend on the monitored system or on
models that are trained over the data. This is essential
in applications for forecasting events, and for detecting
security attacks, frauds, outages, and the effect of natural
events like storms on infrastructures and services.</p>
      <p>Data quality has many aspects, including
completeness (no missing data), consistency (the data does not
lead to contradictory inferences), cleanliness (no noise),
conformity (complying with standards and rules), and
continuity (uniformity in the arrival of the data). Some of
these aspects can be evaluated using standard anomaly
detection tools, but only to a limited extent. Therefore,
there is a need to combine a variety of tools for effective
data-quality assurance.</p>
      <p>There are many tools and methods for detecting
anomalies (outliers) in streaming data [2]. Anomalies
are values in the data stream that are significantly
different from the values that are expected based on
previous observations. Often, anomalies can indicate that
the system does not function properly. However, most
anomalies are ephemeral and can be ignored, because by
the time that they are noticed the system is already back
to normal. So, it is often essential to focus on lasting
changes in the data stream, detect them, and alert on
them. This raises several questions. First, what type of
changes should the system detect? Second, how should
the system raise alerts, without overwhelming the user with too many alerts
but also without missing critical alerts?</p>
    </sec>
    <sec id="sec-2">
      <title>3. Quality Measures over Streams</title>
      <p>In this paper our focus is on detection of change points,
that is, points where the underlying distribution of a
statistical feature of the stream changes in a significant,
non-ephemeral, and unexpected way. We present a modular
architecture for change point detection over streaming
data, to provide flexibility and adaptability for a large
variety of data streams and diverse use cases.</p>
      <p>The paper is organized as follows. In Section 2 we
discuss related work. Section 3 introduces quality measures
for data streams. In Section 4 we present methods
for detection of change points. Section 5 describes our
modular architecture and its benefits. Section 6 presents
the results of our experimental evaluation. In Section 7
we discuss our conclusions and future work.</p>
      <p>In this section we provide formal definitions and present
the problem of discovering changes in the underlying
distribution of quality measures over a data stream. Unlike
time series with a bounded number of points, streams
often have high volume, velocity, variety, and veracity,
so quality measurements should be adapted to streams
accordingly [20]. We present examples and illustrate
our method based on real data taken from the Numenta
Anomaly Benchmark (https://github.com/numenta/NAB), e.g., a sequence
from a stream of taxi occupancy in the Twin Cities.</p>
      <p>A data stream is a sequence of measurements S = m_1, m_2, . . .,
where each measurement m_i = (t_i, v_i) is
a pair of a valid time t_i and a measured value v_i. The valid
time t_i is the time when the value v_i was measured. The
time when the measurement is processed as part of the
stream is considered the transaction time. The delay δ_i of
measurement m_i is the difference between the valid time
and the transaction time.</p>
      <p>Related work. The study in this paper is related to the following three
research areas: data quality, anomaly detection, and change
point detection. These areas have been studied extensively;
however, the approach of modular change point detection,
which we present in this paper, is novel.</p>
      <p>Data quality. Quality measures for data streams have
been studied in different contexts [3, 4, 5, 6]. Klein [7]
examined data quality in sensor data streaming. Karkouch
et al. [8] explored data quality in streams produced by IoT
devices. Brown et al. [9] studied methods for coping with
glitches in spatiotemporal streams by applying
smoothing and imputation to data streams produced by spatially
distributed sensors. The importance of empiricism in
data quality studies has been emphasized in [10].</p>
      <p>For a time series where all the measurements are given
a priori, computing statistics like the mean and variance is
simple. But in streaming data, new values arrive
continuously and the statistics change frequently. So, values
like the mean and variance should be based on recent values
in the stream, not on the entire history. This can be done
using a sliding window [21] or a decaying mean [22, 23].</p>
      <p>Sliding window. When using a sliding window W
of size n, at time i ≥ n, the sub-sequence S_n[m_i] =
v_{i−n+1}, v_{i−n+2}, . . . , v_i of the stream S comprises the
n most recent values in the stream up to measurement
m_i. The mean μ_i, variance σ_i^2, standard deviation σ_i,
median M_i, and other statistics of S_n[m_i] are computed in
the usual way. Since for each measurement m_i there is a
different window, the statistics of S_n[m_i] may be different
from the statistics of S_n'[m_i'] when i ≠ i' or n ≠ n'.</p>
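      <p>To illustrate, maintaining sliding-window statistics can be sketched as follows (a minimal sketch, not the implementation used in the paper; the window size and the sample stream are arbitrary):</p>
      <preformat>
```python
from collections import deque
from statistics import mean, median, pvariance

def sliding_stats(stream, n):
    """Yield (mean, variance, median) over the n most recent values,
    once at least n values have arrived."""
    window = deque(maxlen=n)  # keeps only the n most recent values
    for v in stream:
        window.append(v)
        if len(window) == n:
            yield mean(window), pvariance(window), median(window)

stats = list(sliding_stats([1, 2, 3, 4, 5, 6], n=4))
```
      </preformat>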
      <p>Anomaly detection. Anomaly detection in time series
has received a lot of attention in the literature. Many
different anomaly detection methods have been
developed and tested [11, 12, 13]. See Schmidl et al. [2] for
a recent comparison of many methods. However, point
anomalies are often ephemeral and do not reflect
significant changes in the stream or data-quality issues. Some
studies of anomalies considered anomalous subsequences
rather than point anomalies. Boniol et al. [14, 15] studied
a method for finding subsequences of a time series that
are the farthest from a normal distribution. However,
their assumption of normal distribution in the data does
not hold in many real-world data streams, like those that
we explore. Moreover, these studies do not focus on data
quality or on change point detection.</p>
      <sec id="sec-2-1">
        <title>Change point detection</title>
        <p>Change detection has been studied for time series [16]
and data streams [17, 18, 19]; however, these methods
were not designed for data quality measures and do not
explore the modular approach that we present in this paper.</p>
        <p>Decaying mean and variance. A decaying mean μ̄_i is
computed with a decay parameter 0 &lt; α ≤ 1, such that
μ̄_1 = v_1 and, for i &gt; 1, μ̄_i = α v_i + (1 − α) μ̄_{i−1}. We
refer to the residual at time i as the difference v_i − μ̄_i,
where v_i is the measured value at time i and μ̄_i is the
decaying mean at that point. The decaying variance at
time i is a decaying average over the squared residuals, that is,
σ̄_1^2 = 0 and, for i &gt; 1, σ̄_i^2 = α (v_i − μ̄_i)^2 + (1 − α) σ̄_{i−1}^2.</p>
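        <p>The recurrences above can be sketched as follows (an illustrative sketch; the decay parameter value 0.3 is arbitrary):</p>
        <preformat>
```python
def decaying_stats(values, alpha=0.3):
    """Decaying mean and variance: mu_1 = v_1, then
    mu_i = alpha*v_i + (1-alpha)*mu_{i-1};
    var_1 = 0, then var_i = alpha*(v_i - mu_i)**2 + (1-alpha)*var_{i-1}."""
    mu, var, out = None, 0.0, []
    for v in values:
        if mu is None:
            mu = v  # mu_1 = v_1
        else:
            mu = alpha * v + (1 - alpha) * mu
            var = alpha * (v - mu) ** 2 + (1 - alpha) * var
        out.append((mu, var))
    return out

history = decaying_stats([10.0, 10.0, 10.0, 20.0])
```
      </preformat>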
      </sec>
      <sec id="sec-2-2">
        <title>Point outlier</title>
        <p>A point outlier is a value that significantly
exceeds the expected value, e.g., a value v_i that is above
or below the mean by more than 2.5 standard deviations,
|v_i − μ_i| &gt; 2.5 σ_i. Outliers could indicate a volatile
data quality problem. In Fig. 1, the red dots are outliers
returned by the kNN outlier detection method.</p>
        <p>Data quality. Various data quality issues can be detected
based on changes in the statistical properties of a data stream.
∙ Skewness of Delay (third moment). The delay may
behave somewhat like an asymmetric wave, and the skew
will indicate whether the problem is increasing or decreasing.
∙ Outlier Rate. In many cases, the rate of point outliers
is an indicator of data quality problems, e.g., jitter in a
communication network. In some systems it is expected
to have a few glitches and anomalies from time to time.
But a major increase in the rate or concentration of point
outliers is regarded as a data quality issue.</p>
      </sec>
      <p>The goal is to apply data quality measurements in an effective and modular way and raise an alert when there are significant changes in the stream for the relevant data quality measures.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Detecting Changes in a Stream</title>
      <p>Data streams and their statistical properties vary and
depend on the application. In this paper we suggest
a modular approach for anomaly detection over data
streams. Each module receives a stream of data items and
returns a stream of data items. A modular architecture
is achieved by combining different modules such that
the output stream of one module is the input of the next
module. In this section we define some of the modules
and their composition.</p>
      <p>Some of the characteristics of the stream can
be measured using the moments of the distribution, of
the measured values, or of the delays. Commonly, for a
random variable X, the k-th moment is E[(X − μ)^k], i.e.,
the normalized expectancy of the residuals to the power
of k. The following are measurable changes in a data
stream that can be evaluated using moments.
∙ Level Shift in Value (first moment). A significant change
in the values of measurements can be the result of a
data-quality problem. For example, in a system that monitors
temperatures, an unexpected lasting increase or decrease
in the measured values can be the result of a calibration
issue or a malfunction of sensors. In Fig. 1, there is a level
shift around the date of September 12.
∙ Level Shift in Variance (second moment). A significant
change in the variance of measurements can be the result
of noise. The noise could affect measurement accuracy
and impact the data quality. For example, noise could be
the result of partial interference to a sensor.</p>
      <p>∙ Level Shift in Skewness (third moment). The skew
measures the symmetry of the distribution. It can be
measured as the distribution of the differences μ_i − M_i
between the mean and median values. It may reflect bias
that affects data quality.
∙ Varying Delay (second moment). A change in the
variance of the delay indicates that measurements are
arriving inconsistently. This can often cause data loss or
improper data processing by downstream applications.</p>
      <p>Value extraction. Given the initial stream S, the first
module extracts the statistical values that we want to
measure. For example, we can extract from the stream of
measurements a stream of values v_1, v_2, . . ., a stream of
delays δ_1, δ_2, . . ., a stream of mean values μ_1, μ_2, . . ., a
stream of variance or skew values for the measured values or the delays,
a stream of point outliers, and so on. The residuals for
computing the mean, the variance, or the skew can be
based on a sliding window or a decaying mean.</p>
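      <p>A value-extraction module can be sketched as a generator that maps the stream of (valid time, value) pairs to a derived stream (a hedged illustration; the sample measurements and arrival times are hypothetical):</p>
      <preformat>
```python
def values(stream):
    """Extract the stream of measured values v_1, v_2, ... from (t, v) pairs."""
    for t, v in stream:
        yield v

def delays(stream, arrival_times):
    """Extract the stream of delays: transaction time minus valid time."""
    for (t, v), arrival in zip(stream, arrival_times):
        yield arrival - t

measurements = [(0, 5.0), (1, 6.0), (2, 7.0)]
vals = list(values(measurements))
dels = list(delays(measurements, [0.5, 1.2, 3.0]))
```
      </preformat>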
      <p>∙ Changes in Volume. The volume is the number of
measurements that arrive in each time interval. Unexpected
changes in the volume may indicate that some measurements
are missing, are duplicated, or arrive from data sources
that should not be included in the stream.
∙ Delayed Data (first moment). The measurements may
arrive one by one or in a batch. The delay is the difference
between the valid time and the transaction time of the
measurement. A significant increase in the difference
may indicate that something is delaying the data arrival,
which may lead to missing data, data points that arrive
out of order, or measurements that arrive too late for
some online applications.</p>
      <p>Smoothing and imputation. In some cases, we may
want to apply smoothing or convolution to emphasize
certain features of the stream. Smoothing can be done in
different ways, e.g., by replacing values with smoothed
values s_1, s_2, . . . based on a moving average and a trend
factor b_i, where s_1 = v_1, b_1 = v_2 − v_1, s_i = α v_i + (1 −
α)(s_{i−1} + b_{i−1}), and b_i = β (s_i − s_{i−1}) + (1 − β) b_{i−1},
for some 0 &lt; α &lt; 1 and 0 &lt; β &lt; 1. Seasonality can
also be included in the smoothing using Holt–Winters
smoothing [24]. Smoothing can also be executed using
Kernel Density Estimation (KDE) [25], by applying a
kernel function to the stream.
Predicted values using a moving average, Holt–Winters
exponential smoothing, ARIMA, and other forecasting
methods can be used for imputation of missing values,
to create a stream that is more complete if the next step
of the processing is by a method that does not cope well
with missing values.</p>
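      <p>The smoothing recurrence above can be sketched directly (an illustration of plain double exponential smoothing, without the Holt–Winters seasonal term; the choices α = β = 0.5 are arbitrary):</p>
      <preformat>
```python
def double_exponential_smoothing(v, alpha=0.5, beta=0.5):
    """s_1 = v_1, b_1 = v_2 - v_1; then
    s_i = alpha*v_i + (1-alpha)*(s_{i-1} + b_{i-1}),
    b_i = beta*(s_i - s_{i-1}) + (1-beta)*b_{i-1}."""
    s, b = v[0], v[1] - v[0]
    smoothed = [s]
    for x in v[1:]:
        s_prev = s
        s = alpha * x + (1 - alpha) * (s_prev + b)
        b = beta * (s - s_prev) + (1 - beta) * b
        smoothed.append(s)
    return smoothed

out = double_exponential_smoothing([1.0, 2.0, 3.0, 4.0])
```
      </preformat>
      <p>On a perfectly linear stream the smoothed values track the input exactly, since the trend factor captures the constant slope.</p>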
      <p>Distribution comparison with moving windows. A
comparison of the underlying distributions is executed
for two consecutive moving windows. By measuring
the distance between the distributions, we get a new
stream of values. Formally, given the stream S, let
v_{i+1−n}, v_{i+2−n}, . . . , v_i be the n values of window S_n[m_i],
and let v_{i+1}, v_{i+2}, . . . , v_{i+n} be the n values of window
S_n[m_{i+n}]. Note that S_n[m_i] and S_n[m_{i+n}] are consecutive
windows. The distributions D[m_i] and D[m_{i+n}] of
the values in the two windows can be compared by computing
the Earth Mover's Distance (EMD), also known as the
Wasserstein distance, the Jensen–Shannon divergence,
the Kullback–Leibler divergence, etc. For every i, the difference
between the distributions D[m_i] and D[m_{i+n}] yields a
value d_i, and the result is a sequence d_i, d_{i+1}, d_{i+2}, . . .,
that is, a stream of differences between the distributions.
Extreme values in this stream indicate a significant change.</p>
      <p>Combining modules. In Figures 5-8 we see how a
composition of modules is applied to detect a level shift in
the variance. Fig. 7 shows the stream that is produced by
applying EMD to the two rolling consecutive windows.
Note that there are two large peaks, or elevated parts,
in the sequence. One is at the beginning of the change
and the other is at the end of it. Fig. 8 shows the rolling
Z-score of the sequence in Fig. 7. We can see the effectiveness
of detecting the change point in comparison to ordinary
anomaly detection, e.g., kNN anomaly detection as depicted in Fig. 6.</p>
      <p>Early detection. The comparison of two windows of
size n may lead to a delay in detection. For measurement
m_i = (t_i, v_i), comparing the window S_n[m_i] of n
values that precede m_i with the n values of S_n[m_{i+n}] that
follow m_i requires waiting for n measurements to be
delivered in the stream after seeing m_i.</p>
      <sec id="sec-3-1">
        <title>Rolling Z-score</title>
        <p>In each stream, including the stream
that is produced by the comparison of distributions over
the sliding windows, we can find extreme values by using
the Z-score with respect to the moving average, or by some
other anomaly detection method. The extreme values are
clustered, to prevent a burst of alerts. In Fig. 4 we see the
rolling Z-score as the blue line and the extreme values as
a cluster of red dots.</p>
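        <p>A rolling Z-score module can be sketched as follows (an illustrative sketch; the window size and the 3-standard-deviation threshold are arbitrary):</p>
        <preformat>
```python
from collections import deque

def rolling_zscore(values, n, threshold=3.0):
    """Z-score of each value w.r.t. the mean and standard deviation of
    the preceding n values; returns the indices of extreme values."""
    window, flagged = deque(maxlen=n), []
    for i, v in enumerate(values):
        if len(window) == n:
            mu = sum(window) / n
            var = sum((x - mu) ** 2 for x in window) / n
            if var > 0 and abs(v - mu) / var ** 0.5 > threshold:
                flagged.append(i)
        window.append(v)
    return flagged

alerts = rolling_zscore([1, 1, 2, 1, 2, 1, 2, 1, 9], n=6)
```
      </preformat>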
      </sec>
      <p>This delay can be mitigated by computing an estimation
of the distance between the distributions and issuing a warning if the
estimation indicates a high likelihood for a change point.
Let dist(W_1, W_2) be a function that computes the difference
in distribution between two windows. To assess
the distance early, we define a function est(i, j) that estimates
the distance between the windows S_n[m_i] and S_n[m_{i+n}]
after seeing measurement m_j, for i &lt; j &lt; i + n, based on
the values v_{i+1}, . . . , v_j seen so far. Earlier estimations are
based on fewer values, so they are less accurate, but they
may provide an early indication of the change and trigger
a warning that there is a high likelihood for a change point.</p>
      <sec id="sec-3-2">
        <title>5. Modular Architecture</title>
        <p>In this paper we suggest a modular architecture for
change point detection. In a modular architecture, the
components receive a stream of values and produce a
stream of values, so components can be composed in
different ways, dynamically. Typically, processing is in a
chain-like structure where the first component receives
a stream of measurements as the input and the last component
yields a stream of alerts, as illustrated in Fig. 9
and Fig. 10 (Figure 9: Composition of components into a chain to
discover change points in the data stream).</p>
        <p>There are several benefits to the modular approach.
One is reusing components, e.g., in Section 4, modules
for computing EMD or the rolling Z-score were applied to
measurement values and to variance values. Hence, the
same modules can be reused in different change point
detection tasks.</p>
        <p>Another benefit of the modular approach is dynamic
composition of components. Modules can be added,
adjusted, or removed from a chain to accommodate changes
in the streaming data. For example, consider two chains
C_1 and C_2 of components. Chain C_1, designed for
detecting a level shift, comprises (1) extracting the measurement
values v_1, v_2, . . . from the stream, (2) applying EMD to
the extracted values, and (3) using the rolling Z-score for
finding change points. Chain C_2 is the same as C_1,
except that in the first step it extracts the residual values
|v_1 − μ̄_1|, |v_2 − μ̄_2|, . . . and finds change points for the
variance. In this case, if a significant increase in the variance
is detected by C_2, the system can add an initial
component to C_1 for smoothing the values v_1, v_2, . . .
before applying level-shift detection. This reduces the
noise caused by the large variance, to prevent an undesirable
effect on the level-shift detection. If detection
of missing values is applied, a detected increase in missing
values may lead to adding an imputation module to
chain C_2, so that the missing values will not affect the
monitoring of the variance.</p>
        <p>In some cases, we could have trees instead of linear
chains, where the stream of a component can be directed
to two or more branches (sub-chains). A composition
may form a DAG when some components aggregate or
combine the results of two or more streams.</p>
        <p>Selecting the components and the order in which they
are composed can be done based on a labeled ground
truth. The system architect will examine typical data
quality issues in the use case the system is built for
and will try different combinations of modules, to find
the combination that provides the best detection accuracy.
This process can be automated, so that the system
could check the detection chains periodically against the
ground truth, and the best combination of modules will
be selected and used.</p>
        <p>We implemented the modules in Python on top of
Databricks, to utilize the large distributed storage and
computation capacity of Spark and to have the flexibility
of Python and Databricks notebooks.</p>
        <p>Evaluation. We computed for the different methods
their precision (the percentage of correct detection cases
out of all detection cases), recall (the percentage of
correct detection cases out of all the true cases), the percentage
of false positive cases out of all positive cases, and the
number of false positive cases. This shows how many
false alerts could be raised. Note that too many alerts can
lead to a case where alerts are ignored [27], i.e., alert
fatigue, so we want to avoid false alerts.</p>
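        <p>The chain abstraction can be sketched as function composition over streams (a toy illustration; the modules here are hypothetical stand-ins, not the actual EMD and Z-score modules):</p>
        <preformat>
```python
def chain(*modules):
    """Compose modules left to right: the output stream of one
    module is the input stream of the next."""
    def run(stream):
        for module in modules:
            stream = module(stream)
        return stream
    return run

# Illustrative modules over lists of numbers.
extract = lambda s: [v for v in s]                       # value extraction
diffs = lambda s: [abs(b - a) for a, b in zip(s, s[1:])] # stream of changes
alert = lambda s: [i for i, d in enumerate(s) if d > 2]  # alert positions

detector = chain(extract, diffs, alert)
hits = detector([1, 1, 1, 9, 9, 9])
```
      </preformat>
      <p>Because each module consumes and produces a stream, modules can be appended, removed, or replaced without touching the rest of the chain.</p>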
      </sec>
      <p>The modular approach can also be implemented over stream
processing systems like Apache Flink [26], by leveraging the stream
processing API they provide. This would automatically
add data-quality capabilities to these systems.</p>
      <sec id="sec-3-3">
        <title>6. Experimental Evaluation</title>
        <p>We conducted an experimental evaluation to (1) show
the effectiveness of our method for change point detection,
in comparison to ordinary outlier detection, and
(2) demonstrate the benefits of the modular design when
combining and reusing components.</p>
        <p>Data. In the experiments we used real data from the Numenta
Anomaly Benchmark and streamed the measurements.
To have a ground truth, we inserted data-quality
issues into the time series, like adding to selected regions
a level shift, noise, outliers, gaps, delays, etc. This gave
us the ability to distinguish between true positive cases,
at a change point, and false positive cases, not near a
change point. We present experiments with two data
sets. (1) Taxi is real taxi occupancy data collected in
2015, in the Twin Cities Metro area, Minnesota. (2) CPU
Util. is CPU utilization at an AWS cluster.</p>
        <p>Methods. We used different combinations of components.
As a baseline we used kNN, the kNN unsupervised
outlier detection method, which finds the closest k nearest
neighbors of every data point and measures the average
distance to them.</p>
        <p>Results. The results show the effectiveness of detecting
change points using the combined components. Table 1
shows that by executing EMD combined with Z-score on
the modified CPU stream (Fig. 11), the detection has much
higher accuracy than kNN (Fig. 12: kNN over the CPU
utilization stream). Note that kNN has a
large number of false detection cases, because it detects
point outliers that are not part of a change point.</p>
        <p>In most of our tests all the change points were detected,
i.e., an alert was raised at or near the change point. In
these cases the recall was 1. Note that change points
are noticeable in the time series that we explored, so
preventing false positive cases in these tests is a greater
challenge than preventing false negative cases. The modular
approach can be used to create chains with varying
sensitivity to false positives or false negatives, according
to the application and the features of the data stream.</p>
        <p>The results for detecting variance shifts are presented
in Table 2. Note that for a variance level shift, kNN generates
too many false alerts. When using EMD combined
with Z-score, the detection has high precision and high
recall; however, with JSD the combined method does not
detect the level shift and has low recall, because JSD is
designed for categorical data and not for metric data.</p>
        <p>Table 3 presents the results of detection of a shift in
the frequency of point outliers. We can see in Table 3 that
applying a rolling window for counting the frequency of
outliers detected by kNN, combined with Z-score, does not
have high accuracy. This is because kNN generates too
many anomalies, not just near the change point. When
executing ARIMA as an outlier detection method, the accuracy
is still low. However, when executing a rolling window
that counts the outlier frequency detected by ARIMA and
applying Z-score to the result, we get a precision of 0.85.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Discussion</title>
      <p>Detecting data quality issues in streaming data is
challenging because (1) the data can frequently change, (2) not
all of the data is available while it is streaming, and (3) data
quality can be affected by delays or changes in the underlying
distribution of the data arriving from the applications
that generate the data. However, many data quality issues
can be discovered as change points in the distribution of
a statistical measure.</p>
      <p>There are different types of statistical measures, data
types, and data quality issues. Instead of developing a
completely independent method for each case, we suggest
a modular approach in which basic statistical components
over streams can be combined and reused for
detection of change points. We show in this paper that
the combined components are much more effective than
traditional methods for point outliers. We show results
for kNN, but we also tested other outlier detection methods,
including ARIMA, Z-score, and Histogram-Based
Outlier Scoring (HBOS), and got similar results. When
using traditional outlier detection methods over real data
there are too many outliers, and creating an alert whenever
an outlier is detected could overwhelm the users
and make them ignore alerts (the "Boy Who Cried Wolf"
effect [27]). Thus, it is essential to only raise alerts when
there are significant change points. In this paper we
show that our modular approach is effective at detecting
change points without raising too many false alerts.</p>
      <p>One of the limitations of change point detection is that
it may miss concept drifts (changes over time in unforeseen
ways) [28]. Detection of concept drifts may require
a complementary method, so further study is needed.</p>
      <p>While the modular method presented in this paper
provides a promising direction for the detection of data
quality issues over streaming data, more study is needed
over a larger variety of data streams and for additional
use cases. Future work includes the development of a
method that could help users select the best combination
of components and parameters for their streaming
data use cases. Future work also includes exploring the
approach of ranking alerts based on the length and complexity
of the chain used for the detection. The premise is
that simpler chains may detect more noticeable changes,
and thus, changes detected by simple chains should have
higher priority than detections by complex chains.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-1"><mixed-citation>[1] L. Budach, M. Feuerpfeil, N. Ihde, A. Nathansen, N. Noack, H. Patzlaff, H. Harmouch, F. Naumann, The effects of data quality on machine learning performance, arXiv preprint arXiv:2207.14529 (2022).</mixed-citation></ref>
      <ref id="ref-2"><mixed-citation>[2] S. Schmidl, P. Wenig, T. Papenbrock, Anomaly detection in time series: a comprehensive evaluation, PVLDB 15 (2022) 1779–1797.</mixed-citation></ref>
      <ref id="ref-3"><mixed-citation>[3] T. Dasu, R. Duan, D. Srivastava, Data quality for temporal streams, IEEE Data Eng. Bull. 39 (2016) 78–92.</mixed-citation></ref>
      <ref id="ref-4"><mixed-citation>[4] A. Klein, H.-H. Do, G. Hackenbroich, M. Karnstedt, W. Lehner, Representing data quality for streaming and static data, in: 2007 IEEE 23rd International Conference on Data Engineering Workshop, IEEE, 2007, pp. 3–10.</mixed-citation></ref>
      <ref id="ref-5"><mixed-citation>[5] F. Korn, S. Muthukrishnan, Y. Zhu, Checks and balances: Monitoring data quality problems in network traffic databases, in: Proceedings 2003 VLDB Conference, Morgan Kaufmann, 2003, pp. 536–547.</mixed-citation></ref>
      <ref id="ref-6"><mixed-citation>[6] B. Saha, D. Srivastava, Data quality: The other face of big data, in: 2014 IEEE 30th International Conference on Data Engineering, IEEE, 2014, pp. 1294–1297.</mixed-citation></ref>
      <ref id="ref-7"><mixed-citation>[7] A. Klein, W. Lehner, Representing data quality in sensor data streaming environments, Journal of Data and Information Quality (JDIQ) 1 (2009) 1–28.</mixed-citation></ref>
      <ref id="ref-8"><mixed-citation>[8] A. Karkouch, H. Mousannif, H. Al Moatassime, T. Noel, Data quality in internet of things: A state-of-the-art survey, Journal of Network and Computer Applications 73 (2016) 57–81.</mixed-citation></ref>
      <ref id="ref-9"><mixed-citation>[9] P. E. Brown, T. Dasu, Y. Kanza, D. Srivastava, From rocks to pebbles: Smoothing spatiotemporal data streams in an overlay of sensors, ACM Transactions on Spatial Algorithms and Systems 5 (2019).</mixed-citation></ref>
      <ref id="ref-10"><mixed-citation>[10] S. Sadiq, T. Dasu, X. L. Dong, J. Freire, I. F. Ilyas, S. Link, R. J. Miller, F. Naumann, X. Zhou, D. Srivastava, Data quality: The role of empiricism, ACM SIGMOD Record 46 (2018) 35–43.</mixed-citation></ref>
      <ref id="ref-11"><mixed-citation>[11] S. Ahmad, A. Lavin, S. Purdy, Z. Agha, Unsupervised real-time anomaly detection for streaming data, Neurocomputing 262 (2017) 134–147.</mixed-citation></ref>
      <ref id="ref-12"><mixed-citation>[12] A. Blázquez-García, A. Conde, U. Mori, J. A. Lozano, A review on outlier/anomaly detection in time series data, ACM Computing Surveys (CSUR) 54 (2021) 1–33.</mixed-citation></ref>
      <ref id="ref-13"><mixed-citation>[13] A. Boukerche, L. Zheng, O. Alfandi, Outlier detection: Methods, models, and classification, ACM Computing Surveys (CSUR) 53 (2020) 1–37.</mixed-citation></ref>
      <ref id="ref-14"><mixed-citation>[14] P. Boniol, J. Paparrizos, T. Palpanas, M. J. Franklin, SAND: Streaming subsequence anomaly detection, PVLDB 14 (2021) 1717–1729.</mixed-citation></ref>
      <ref id="ref-15"><mixed-citation>[15] P. Boniol, M. Linardi, F. Roncallo, T. Palpanas, M. Meftah, E. Remy, Unsupervised and scalable subsequence anomaly detection in large data series, The VLDB Journal (2021) 1–23.</mixed-citation></ref>
      <ref id="ref-16"><mixed-citation>[16] S. Aminikhanghahi, D. J. Cook, A survey of methods for time series change point detection, Knowledge and Information Systems 51 (2017) 339–367.</mixed-citation></ref>
      <ref id="ref-17"><mixed-citation>[17] D. Kifer, S. Ben-David, J. Gehrke, Detecting change in data streams, in: Proceedings of the 30th International Conference on Very Large Data Bases, VLDB '04, VLDB Endowment, 2004, pp. 180–191.</mixed-citation></ref>
      <ref id="ref-18"><mixed-citation>[18] L. I. Kuncheva, Change detection in streaming multivariate data using likelihood detectors, IEEE Transactions on Knowledge and Data Engineering 25 (2011) 1175–1180.</mixed-citation></ref>
      <ref id="ref-19"><mixed-citation>[19] D.-H. Tran, M. M. Gaber, K.-U. Sattler, Change detection in streaming data in the era of big data: models and issues, ACM SIGKDD Explorations Newsletter 16 (2014) 30–38.</mixed-citation></ref>
      <ref id="ref-20"><mixed-citation>[20] J. Merino, I. Caballero, B. Rivas, M. Serrano, M. Piattini, A data quality in use model for big data, Future Generation Computer Systems 63 (2016) 123–130.</mixed-citation></ref>
      <ref id="ref-21"><mixed-citation>[21] B. Babcock, M. Datar, R. Motwani, Sampling from a moving window over streaming data, in: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2002, pp. 633–634.</mixed-citation></ref>
      <ref id="ref-22"><mixed-citation>[22] E. Cohen, M. Strauss, Maintaining time-decaying stream aggregates, in: Proc. of the 22nd ACM Symposium on Principles of Database Systems, 2003, pp. 223–233.</mixed-citation></ref>
      <ref id="ref-23"><mixed-citation>[23] G. Cormode, F. Korn, S. Tirthapura, Exponentially decayed aggregates on data streams, in: 24th Int. Conf. on Data Engineering, IEEE, 2008, pp. 1379–1381.</mixed-citation></ref>
      <ref id="ref-24"><mixed-citation>[24] S. Gelper, R. Fried, C. Croux, Robust forecasting with exponential and Holt–Winters smoothing, Journal of Forecasting 29 (2010) 285–300.</mixed-citation></ref>
      <ref id="ref-25"><mixed-citation>[25] M. Rosenblatt, Remarks on some nonparametric estimates of a density function, The Annals of Mathematical Statistics (1956) 832–837.</mixed-citation></ref>
      <ref id="ref-26"><mixed-citation>[26] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas, Apache Flink: Stream and batch processing in a single engine, The Bulletin of the Technical Committee on Data Engineering 38 (2015).</mixed-citation></ref>
      <ref id="ref-27"><mixed-citation>[27] P. E. Brown, T. Dasu, Y. Kanza, E. Koutsofios, R. Malik, D. Srivastava, Don't cry wolf, in: International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2019, pp. 616–617.</mixed-citation></ref>
      <ref id="ref-28"><mixed-citation>[28] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Computing Surveys (CSUR) 46 (2014) 1–37.</mixed-citation></ref>
    </ref-list>
  </back>
</article>