<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Towards Unsupervised Data Quality Validation on Dynamic Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergey Redyuk</string-name>
          <email>sergey.redyuk@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volker Markl</string-name>
          <email>volker.markl@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Schelter</string-name>
          <email>sebastian.schelter@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University</institution>, <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität Berlin</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technische Universität Berlin</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title/>
      <p>Validating the quality of data is crucial for establishing the
trustworthiness of data pipelines. State-of-the-art solutions for data
validation and error detection require explicit domain expertise
(e.g., in the form of rules or patterns) [1] or manually labeled
examples [7]. In real-world applications, domain knowledge is often
incomplete and data changes over time, which limits the applicability
of existing solutions. We propose an unsupervised approach for
detecting data quality degradation early and automatically. We
will present the approach, its key assumptions, and preliminary
results on public data to demonstrate how data quality can be
monitored without manually curated rules and constraints.
Exemplary use case. Consider a data engineering team at a
retail company, which has to regularly ingest product data from
heterogeneous sources such as web crawls, databases, or
key-value stores, with the goal of indexing the products for a search
engine. Errors in the data, such as missing values or typos in the
category descriptions, lead to various problems: attributes with
missing values might not be indexed, or the products might end
up in the wrong category. Ultimately, customers may not be able
to find the products via the search engine. Tackling such data
quality issues is tedious, as manual solutions require in-depth
domain knowledge and result in complex engineering efforts.
Proposed approach. We focus on scenarios where systems
regularly ingest potentially erroneous external data. We apply a
machine learning-based approach which automatically learns
to identify “acceptable” data batches, and raises alerts for data
batches that vary significantly from previous observations. We
analyze structured data that arrives periodically in batches (e.g.,
via a nightly ingestion of log files). At time t, we assume that
previously ingested data (timestamps 1 to t − 1) is of “acceptable”
quality if it did not result in system crashes or require manual
repairing. We use these previously ingested data batches as
examples with the goal to identify future erroneous batches. Note that
we do not look for erroneous records, but aim to identify errors
that corrupt an entire batch, such as the accidental introduction
of a large number of missing values in a column.</p>
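      <p>For illustration, the following minimal Python sketch shows this batch-level view under the assumption (ours, not the paper's) that batches arrive as pandas DataFrames: a hypothetical helper profiles a single batch into the per-column statistics described in the next paragraph, with pandas' exact nunique() standing in for an approximate distinct count.</p>
      <preformat>
import pandas as pd

def profile_batch(batch: pd.DataFrame) -> dict:
    """Summarize one data batch as a flat dict of per-column statistics."""
    stats = {}
    for column in batch.columns:
        values = batch[column]
        # completeness: fraction of non-missing values in the column
        stats[(column, "completeness")] = values.notna().mean()
        # exact distinct count, standing in for an approximate sketch (e.g., HyperLogLog)
        stats[(column, "num_distinct")] = values.nunique()
        if pd.api.types.is_numeric_dtype(values):
            stats[(column, "mean")] = values.mean()
            stats[(column, "std")] = values.std()
            stats[(column, "min")] = values.min()
            stats[(column, "max")] = values.max()
    return stats
      </preformat>
      <p>Recording these values over consecutive batches yields one time series per column and statistic, which is the input to the forecasting step described next.</p>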
      <p>Figure 1 illustrates our approach: We compute a set of statistics
for every column of an observed data batch: completeness,
approximate number of distinct values, as well as mean, standard
deviation, minimum and maximum values for numeric columns.
We record these statistics as time series over multiple batches (1).
We apply time series forecasting methods [4] to estimate the
expected data statistics for the next batch (the green area in (2)).
When a new batch of data becomes available, we compute its
actual statistics (3) and compare them to the estimate (4). If the
observed statistic differs significantly from the estimated value,
we raise an alert about a potential degradation of the data quality
(see the code sketch below).
Preliminary Results. We conducted a preliminary evaluation
on datasets of flight information [6] and crawled Facebook posts,
for which we have chronological information as well as erroneous
and manually cleaned variants. We show a series of “acceptable”
data batches from the past to our approach and have it decide
whether the next data batch is “acceptable” or erroneous (we
randomly choose either of those for evaluation). We repeat this for
multiple timespans and compute binary classification metrics such
as accuracy and F1-score. We find that the popular time series
forecasting method exponential smoothing, combined with a simple
decision strategy for outlier detection (inclusion in a 90%
confidence interval), works well in many cases and provides
F1-scores of up to 96% for the Flights dataset. In contrast,
existing baseline solutions such as TFX Data Validation [1] and
statistical tests [3] reach F1-scores of only 64% and 62%,
respectively.
Next directions. We intend to conduct an extensive evaluation
on additional datasets against several baselines [2, 5, 8] with
respect to the prediction performance, execution time, and
scalability. Furthermore, we will investigate the benefits of applying
multivariate forecasting methods for our use case.</p>
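      <p>As a minimal sketch of the forecasting and decision step (steps (2) to (4) above), the Python snippet below applies hand-rolled simple exponential smoothing to one statistic series and derives a residual-based 90% prediction interval. It is an illustration under these assumptions, not the exact forecasting configuration used in the evaluation.</p>
      <preformat>
import numpy as np

def forecast_interval(history, alpha=0.5, z=1.645):
    """One-step simple exponential smoothing forecast with a roughly 90% interval.

    history: past values of one column statistic (one value per batch).
    alpha:   smoothing factor; z=1.645 matches a two-sided 90% normal interval.
    """
    history = np.asarray(history, dtype=float)
    level = history[0]
    residuals = []
    for value in history[1:]:
        residuals.append(value - level)      # one-step-ahead forecast error
        level = alpha * value + (1.0 - alpha) * level
    spread = z * (np.std(residuals) if residuals else 0.0)
    return level - spread, level + spread    # expected (lower, upper) bound

def validate_statistic(history, observed):
    """Accept the new batch's statistic if it lies inside the expected interval."""
    lower, upper = forecast_interval(history)
    return not (observed > upper or lower > observed)

# Example: completeness of one column over past batches, then two new batches
past = [0.99, 0.98, 0.99, 0.97, 0.99, 0.98]
print(validate_statistic(past, 0.98))   # True  -- accepted
print(validate_statistic(past, 0.60))   # False -- raise a data quality alert
      </preformat>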
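      <p>The evaluation protocol described above then reduces to ordinary binary classification over flagged versus accepted batches. With hypothetical labels and scikit-learn assumed for the metrics, it might look as follows.</p>
      <preformat>
from sklearn.metrics import accuracy_score, f1_score

# 1 = batch flagged as erroneous, 0 = batch accepted as "acceptable";
# ground truth comes from the erroneous vs. manually cleaned dataset variants
# (the label values below are purely illustrative)
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
      </preformat>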
      <p>Acknowledgements. This work was funded by the HEIBRiDS graduate school,
with the support of the German Ministry for Education and Research as BIFOLD
(Berlin Institute for the Foundations of Learning and Data), BBDC 2 (01IS18025A),
BZML (01IS18037A), and the Software Campus Program (01IS17052).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>