Towards Unsupervised Data Quality Validation on Dynamic Data

Sergey Redyuk (Technische Universität Berlin), sergey.redyuk@tu-berlin.de
Volker Markl (Technische Universität Berlin), volker.markl@tu-berlin.de
Sebastian Schelter (New York University), sebastian.schelter@nyu.edu

Validating the quality of data is crucial for establishing the trustworthiness of data pipelines. State-of-the-art solutions for data validation and error detection require explicit domain expertise (e.g., in the form of rules or patterns) [1] or manually labeled examples [7]. In real-world applications, domain knowledge is often incomplete and data changes over time, which limits the applicability of existing solutions. We propose an unsupervised approach for detecting data quality degradation early and automatically. We will present the approach, its key assumptions, and preliminary results on public data to demonstrate how data quality can be monitored without manually curated rules and constraints.

Exemplary use case. Consider a data engineering team at a retail company which has to regularly ingest product data from heterogeneous sources such as web crawls, databases, or key-value stores, with the goal of indexing the products for a search engine. Errors in the data, such as missing values or typos in the category descriptions, lead to various problems: attributes with missing values might not be indexed, or the products might end up in the wrong category. Ultimately, customers may not be able to find the products via the search engine. Tackling such data quality issues is tedious, as manual solutions require in-depth domain knowledge and result in complex engineering efforts.

Proposed approach. We focus on scenarios where systems regularly ingest potentially erroneous external data. We apply a machine learning-based approach which automatically learns to identify “acceptable” data batches and raises alerts for batches that vary significantly from previous observations. We analyze structured data that arrives periodically in batches (e.g., via a nightly ingestion of log files). At time t, we assume that previously ingested data (timestamps 1 to t − 1) is of “acceptable” quality if it did not result in system crashes or require manual repairing. We use these previously ingested data batches as examples with the goal of identifying future erroneous batches. Note that we do not look for erroneous records, but aim to identify errors that corrupt an entire batch, such as the accidental introduction of a large number of missing values in a column.

Figure 1 illustrates our approach: We compute a set of statistics for every column of an observed data batch: completeness, the approximate number of distinct values, as well as mean, standard deviation, minimum, and maximum values for numeric columns. We record these statistics as time series over multiple batches (1). We apply time series forecasting methods [4] to estimate the expected data statistics for the next batch (the green area in (2)). When a new batch of data becomes available, we compute its actual statistics (3) and compare them to the estimate (4). If the observed statistic differs significantly from the estimated value, we raise an alert about a potential degradation of the data quality.

[Figure 1 (schematic): two example statistics, the mean of column A and the number of distinct values of column B, plotted as time series over previously observed data batches (up to “yesterday”) and a new data batch to validate (“today”); circled markers 1–4 indicate the steps described above.]

Figure 1: Overview of the proposed approach for data quality monitoring: we maintain time series of column statistics from previously observed data batches of “acceptable” quality. To decide whether a new batch should be accepted, we compare its statistics to a forecast-based estimate of the expected statistics based on the observed time series.

Preliminary results. We conducted a preliminary evaluation on datasets of flight information [6] and crawled Facebook posts, for which we have chronological information as well as erroneous and manually cleaned variants. We show a series of “acceptable” data batches from the past to our approach and have it decide whether the next data batch is “acceptable” or erroneous (we randomly choose either of those for evaluation). We repeat this for multiple timespans and compute binary classification metrics such as accuracy and F1-score. We find that the popular time series forecasting method exponential smoothing, combined with a simple decision strategy for outlier detection (inclusion in a 90% confidence interval), works well in many cases and provides F1-scores of up to 96% for the Flights dataset. In contrast, existing baseline solutions such as TFX Data Validation [1] or statistical tests [3] only achieve F1-scores of 64% and 62%, respectively.
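To make the pipeline concrete, the following minimal sketch implements the loop from Figure 1 with off-the-shelf components: pandas for the per-column statistics and statsmodels' SimpleExpSmoothing for the forecast. The function names (batch_statistics, validate_batch), the derivation of the 90% interval from in-sample residuals, and the synthetic demo data are our own illustrative assumptions; the paper does not prescribe a concrete implementation.

```python
# A minimal sketch of the monitoring loop, assuming pandas, scipy, and
# statsmodels. Function names, the residual-based interval, and the
# synthetic demo are illustrative choices, not details from the paper.
import numpy as np
import pandas as pd
from scipy.stats import norm
from statsmodels.tsa.holtwinters import SimpleExpSmoothing


def batch_statistics(batch: pd.DataFrame) -> dict:
    """Compute the per-column statistics described above for one batch."""
    stats = {}
    for col in batch.columns:
        stats[f"{col}_completeness"] = batch[col].notna().mean()
        stats[f"{col}_num_distinct"] = batch[col].nunique()
        if pd.api.types.is_numeric_dtype(batch[col]):
            stats[f"{col}_mean"] = batch[col].mean()
            stats[f"{col}_std"] = batch[col].std()
            stats[f"{col}_min"] = batch[col].min()
            stats[f"{col}_max"] = batch[col].max()
    return stats


def validate_batch(history: pd.DataFrame, new_batch: pd.DataFrame,
                   confidence: float = 0.90) -> list:
    """Return the statistics of the new batch that fall outside a
    forecast-based prediction interval."""
    z = norm.ppf(0.5 + confidence / 2.0)  # ~1.645 for a 90% interval
    new_stats = batch_statistics(new_batch)
    alerts = []
    for name, series in history.items():  # one time series per statistic
        fit = SimpleExpSmoothing(series.to_numpy()).fit()
        forecast = fit.forecast(1)[0]
        # Simple exponential smoothing yields point forecasts only, so we
        # approximate the interval width from the in-sample residuals.
        sigma = np.std(fit.resid)
        lower, upper = forecast - z * sigma, forecast + z * sigma
        if not lower <= new_stats[name] <= upper:
            alerts.append((name, new_stats[name], (lower, upper)))
    return alerts


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Simulate 30 "acceptable" daily batches, then corrupt a new batch by
    # introducing a large number of missing values in one column.
    batches = [pd.DataFrame({"price": rng.normal(10.0, 2.0, 1000),
                             "category": rng.choice(list("abcde"), 1000)})
               for _ in range(31)]
    new_batch = batches.pop()
    new_batch.loc[new_batch.sample(frac=0.5, random_state=0).index,
                  "price"] = np.nan

    history = pd.DataFrame([batch_statistics(b) for b in batches])
    for name, value, interval in validate_batch(history, new_batch):
        print(f"ALERT: {name} = {value:.3f}, expected range {interval}")
```

Note that each statistic is modeled as an independent univariate time series in this sketch, matching the setup whose multivariate extension is listed as future work below; any forecasting method that yields calibrated prediction intervals could be substituted for the residual-based approximation.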
Next directions. We intend to conduct an extensive evaluation on additional datasets against several baselines [2, 5, 8] with respect to prediction performance, execution time, and scalability. Furthermore, we will investigate the benefits of applying multivariate forecasting methods for our use case.

Acknowledgements. This work was funded by the HEIBRiDS graduate school, with the support of the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data, BBDC 2 (01IS18025A), BZML (01IS18037A), and the Software Campus Program (01IS17052).

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

REFERENCES
[1] Denis Baylor et al. 2017. TFX: A TensorFlow-based production-scale machine learning platform. KDD, 1387–1395.
[2] Eric Breck, Marty Zinkevich, Neoklis Polyzotis, Steven Whang, and Sudip Roy. 2019. Data Validation for Machine Learning. SysML.
[3] Frank J. Massey Jr. 1951. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association 46, 253 (1951), 68–78.
[4] Douglas C. Montgomery, Cheryl L. Jennings, and Murat Kulahci. 2015. Introduction to time series analysis and forecasting. John Wiley & Sons.
[5] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data Profiling with Metanome. PVLDB 8, 12 (2015), 1860–1863.
[6] Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23, 4 (2000), 3–13.
[7] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. PVLDB 10, 11 (2017), 1190–1201.
[8] Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating Large-scale Data Quality Verification. PVLDB 11, 12 (2018), 1781–1794.