Towards Unsupervised Data Quality Validation on Dynamic Data

Sergey Redyuk (Technische Universität Berlin), sergey.redyuk@tu-berlin.de
Volker Markl (Technische Universität Berlin), volker.markl@tu-berlin.de
Sebastian Schelter (New York University), sebastian.schelter@nyu.edu

Validating the quality of data is crucial for establishing the trustworthiness of data pipelines. State-of-the-art solutions for data validation and error detection require explicit domain expertise (e.g., in the form of rules or patterns) [1] or manually labeled examples [7]. In real-world applications, domain knowledge is often incomplete and data changes over time, which limits the applicability of existing solutions. We propose an unsupervised approach for detecting data quality degradation early and automatically. We will present the approach, its key assumptions, and preliminary results on public data to demonstrate how data quality can be monitored without manually curated rules and constraints.

Exemplary use case. Consider a data engineering team at a retail company which has to regularly ingest product data from heterogeneous sources such as web crawls, databases, or key-value stores, with the goal of indexing the products for a search engine. Errors in the data, such as missing values or typos in the category descriptions, lead to various problems: attributes with missing values might not be indexed, or the products might end up in the wrong category. Ultimately, customers may not be able to find the products via the search engine. Tackling such data quality issues is tedious, as manual solutions require in-depth domain knowledge and result in complex engineering efforts.

Proposed approach. We focus on scenarios where systems regularly ingest potentially erroneous external data. We apply a machine learning-based approach which automatically learns to identify “acceptable” data batches and raises alerts for batches that vary significantly from previous observations. We analyze structured data that arrives periodically in batches (e.g., via a nightly ingestion of log files). At time t, we assume that previously ingested data (timestamps 1 to t − 1) is of “acceptable” quality if it did not result in system crashes or require manual repairing. We use these previously ingested data batches as examples with the goal of identifying future erroneous batches. Note that we do not look for erroneous records, but aim to identify errors that corrupt an entire batch, such as the accidental introduction of a large number of missing values in a column.

Figure 1 illustrates our approach: We compute a set of statistics for every column of an observed data batch: completeness, the approximate number of distinct values, as well as mean, standard deviation, minimum, and maximum values for numeric columns. We record these statistics as time series over multiple batches (1). We apply time series forecasting methods [4] to estimate the expected data statistics for the next batch (the green area in (2)). When a new batch of data becomes available, we compute its actual statistics (3) and compare them to the estimate (4). If the observed statistic differs significantly from the estimated value, we raise an alert about a potential degradation of the data quality.

[Figure 1 (schematic): two example statistics, the mean of column A and the number of distinct values of column B, plotted as time series over previously observed data batches (up to “yesterday”) and a new data batch to validate (“today”); circled markers 1–4 indicate the steps described above.]

Figure 1: Overview of the proposed approach for data quality monitoring: we maintain time series of column statistics from previously observed data batches of “acceptable” quality. To decide whether a new batch should be accepted, we compare its statistics to a forecast-based estimate of the expected statistics based on the observed time series.

Preliminary results. We conducted a preliminary evaluation on datasets of flight information [6] and crawled Facebook posts, for which we have chronological information as well as erroneous and manually cleaned variants. We show a series of “acceptable” data batches from the past to our approach and have it decide whether the next data batch is “acceptable” or erroneous (we randomly choose either of those for evaluation). We repeat this for multiple timespans and compute binary classification metrics such as accuracy and F1-score. We find that the popular time series forecasting method exponential smoothing, combined with a simple decision strategy for outlier detection (inclusion in a 90% confidence interval), works well in many cases and provides F1-scores of up to 96% for the Flights dataset. In contrast, existing baseline solutions such as TFX Data Validation [1] or statistical tests [3] only achieve F1-scores of 64% and 62%, respectively.
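To make the pipeline concrete, the following minimal sketch implements the loop from Figure 1 with off-the-shelf components: pandas for the per-column statistics and statsmodels' SimpleExpSmoothing for the forecast. The function names (batch_statistics, validate_batch), the derivation of the 90% interval from in-sample residuals, and the synthetic demo data are our own illustrative assumptions; the paper does not prescribe a concrete implementation.

```python
# A minimal sketch of the monitoring loop, assuming pandas, scipy, and
# statsmodels. Function names, the residual-based interval, and the
# synthetic demo are illustrative choices, not details from the paper.
import numpy as np
import pandas as pd
from scipy.stats import norm
from statsmodels.tsa.holtwinters import SimpleExpSmoothing


def batch_statistics(batch: pd.DataFrame) -> dict:
    """Compute the per-column statistics described above for one batch."""
    stats = {}
    for col in batch.columns:
        stats[f"{col}_completeness"] = batch[col].notna().mean()
        stats[f"{col}_num_distinct"] = batch[col].nunique()
        if pd.api.types.is_numeric_dtype(batch[col]):
            stats[f"{col}_mean"] = batch[col].mean()
            stats[f"{col}_std"] = batch[col].std()
            stats[f"{col}_min"] = batch[col].min()
            stats[f"{col}_max"] = batch[col].max()
    return stats


def validate_batch(history: pd.DataFrame, new_batch: pd.DataFrame,
                   confidence: float = 0.90) -> list:
    """Return the statistics of the new batch that fall outside a
    forecast-based prediction interval."""
    z = norm.ppf(0.5 + confidence / 2.0)  # ~1.645 for a 90% interval
    new_stats = batch_statistics(new_batch)
    alerts = []
    for name, series in history.items():  # one time series per statistic
        fit = SimpleExpSmoothing(series.to_numpy()).fit()
        forecast = fit.forecast(1)[0]
        # Simple exponential smoothing yields point forecasts only, so we
        # approximate the interval width from the in-sample residuals.
        sigma = np.std(fit.resid)
        lower, upper = forecast - z * sigma, forecast + z * sigma
        if not lower <= new_stats[name] <= upper:
            alerts.append((name, new_stats[name], (lower, upper)))
    return alerts


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Simulate 30 "acceptable" daily batches, then corrupt a new batch by
    # introducing a large number of missing values in one column.
    batches = [pd.DataFrame({"price": rng.normal(10.0, 2.0, 1000),
                             "category": rng.choice(list("abcde"), 1000)})
               for _ in range(31)]
    new_batch = batches.pop()
    new_batch.loc[new_batch.sample(frac=0.5, random_state=0).index,
                  "price"] = np.nan

    history = pd.DataFrame([batch_statistics(b) for b in batches])
    for name, value, interval in validate_batch(history, new_batch):
        print(f"ALERT: {name} = {value:.3f}, expected range {interval}")
```

Note that each statistic is modeled as an independent univariate time series in this sketch, matching the setup whose multivariate extension is listed as future work below; any forecasting method that yields calibrated prediction intervals could be substituted for the residual-based approximation.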
Next directions. We intend to conduct an extensive evaluation on additional datasets against several baselines [2, 5, 8] with respect to prediction performance, execution time, and scalability. Furthermore, we will investigate the benefits of applying multivariate forecasting methods for our use case.

Acknowledgements. This work was funded by the HEIBRiDS graduate school, with the support of the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data, BBDC 2 (01IS18025A), BZML (01IS18037A), and the Software Campus Program (01IS17052).

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

REFERENCES
[1] Denis Baylor et al. 2017. TFX: A TensorFlow-based production-scale machine learning platform. KDD, 1387–1395.
[2] Eric Breck, Marty Zinkevich, Neoklis Polyzotis, Steven Whang, and Sudip Roy. 2019. Data Validation for Machine Learning. SysML.
[3] Frank J. Massey Jr. 1951. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association 46, 253 (1951), 68–78.
[4] Douglas C. Montgomery, Cheryl L. Jennings, and Murat Kulahci. 2015. Introduction to time series analysis and forecasting. John Wiley & Sons.
[5] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data Profiling with Metanome. PVLDB 8, 12 (2015), 1860–1863.
[6] Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23, 4 (2000), 3–13.
[7] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. PVLDB 10, 11 (2017), 1190–1201.
[8] Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating Large-scale Data Quality Verification. PVLDB 11, 12 (2018), 1781–1794.