Let’s Make it Dirty with BART!

Donatello Santoro¹, Patricia C. Arocena², Boris Glavic³, Giansalvatore Mecca¹, Renée J. Miller², and Paolo Papotti⁴
¹ Università della Basilicata - Italy, ² University of Toronto - Canada, ³ Illinois Inst. of Technology - USA, ⁴ Eurecom - France

(Discussion Paper)

SEBD 2018, June 24-27, 2018, Castellaneta Marina, Italy. Copyright held by the author(s).

Abstract. In the last few years many automatic or semi-automatic data-repairing algorithms have been proposed in order to improve the quality of a given database. Due to the richness of research proposals, it is important to conduct experimental evaluations to assess each tool’s potential. Bart is an open-source error-generation system conceived to support thorough experimental evaluations of these data-repairing systems. In this paper we discuss how generating errors in data is a complex problem, with several facets. We introduce the important notions of detectability and repairability of an error, which stand at the core of Bart. Then, we show how, by changing the features of errors, it is possible to influence quite significantly the performance of the tools. Finally, we concretely put to work five data-repairing algorithms on dirty data of various kinds generated using Bart, and discuss their performance.

1 Introduction

Data quality is a very important concern in data management. To date, many (disparate) automatic and semi-automatic data-cleaning algorithms have been proposed in the database community [4, 5, 6, 8, 12, 13]. These algorithms come from different inspirations. Most of them are constraint-based: they assume that the target database comes with a set of data-quality rules – for example, functional dependencies (FDs) or conditional functional dependencies (CFDs) – and data is repaired to remove violations of these constraints. Others, on the contrary, do not rely on constraints, but rather use statistics-based approaches to identify suspect values or outliers and try to repair them.

Due to the richness of research proposals, it is important to conduct thorough and fair experimental evaluations to assess each tool’s potential. In fact, other fields like entity resolution, record linkage, schema mapping and data exchange have worked to develop consolidated tools and benchmarks for empirically evaluating algorithms [1, 3, 14]. Thorough evaluation of data-cleaning systems requires systematic control over the amount of errors in a dataset, and over how hard these errors are to repair. Dirty datasets must be paired with a ground-truth clean dataset to enable evaluation of the quality of a repair produced by a cleaning algorithm. To support rigorous empirical evaluations, an error-generation system must be able to generate multiple dirty versions of a dataset with low user effort, and scale to large datasets.

Bart. Bart (Benchmarking Algorithms for data Repairing and Translation) [2, 17] is the first error-generation tool explicitly conceived to support empirical evaluations of data-repairing algorithms as per the requirements outlined above. It takes as input a clean database and a set of data-quality rules, and injects errors into the database. Rules are expressed using the powerful language of denial constraints [15], and errors can be of several kinds, such as typos, duplicated values, nulls, and outliers. We show the major components of the system in Figure 1. A user interacts with Bart by creating error-generation tasks, using a GUI or a command-line interface.
These tasks are then interpreted by Bart’s error-generation engine to create dirty versions of a clean database.

The system provides the highest possible level of control over the error-generation process. Among other parameters, it allows users to choose the percentage of errors and whether they want a guarantee that errors are detectable using the given constraints, and it even provides an estimate of how hard it will be to restore the database to its original clean state.

Fig. 1. Bart System Overview (GUI and command-line interface on top of the error-generation engine, which reads an error-generation task and a clean database from the DBMS and produces one or more dirty versions)

Bart is open-source: its codebase is available on GitHub (https://github.com/dbunibas/BART) and can be further extended by the community to develop new features and functionalities.

Evaluation Overview. The empirical evaluation conveys three primary insights about the system. (i) First, we discuss how the fine-level control over the features of errors distinguishes Bart from previous error-generation techniques used in evaluating data-repairing systems. (ii) Then, we discuss how the characteristics of errors may significantly influence the quality of repairs generated by a system. (iii) Finally, we demonstrate five different algorithms in action on dirty data generated using Bart, to reveal new insights on their (relative) performance as the characteristics of the errors are varied by Bart. Overall, the reader will learn how the availability of a tool like Bart may help to level the field and raise the bar for evaluation standards in data cleaning.

The paper is organized as follows. Section 2 introduces the main motivation for the system, and the notions of detectability and repairability of errors. Section 3 provides an overview of the system and of its main use cases. Finally, Section 4 discusses the empirical evaluation, and the main lessons that can be learned from it.

     Player
     Name      Season   Team      Stadium      Goals
t1 : Giovinco  2013-14  Juventus  Juv.Stadium  3
t2 : Giovinco  2014-15  Toronto   BMO Field    23
t3 : Pirlo     2014-15  Juventus  Juv.Stadium  5
t4 : Pirlo     2015-16  N.Y.City  Yankee St.   0
t5 : Vidal     2014-15  Juventus  Juv.Stadium  8
t6 : Vidal     2015-16  Bayern    Allianz Ar.  3

Fig. 2. Example Clean Database

2 Concept and Motivation

Assume we are given a database about soccer players, shown in Figure 2, and we want to assess the performance of repair algorithms according to a few data-quality rules. (i) A first FD stating that Name and Season are a key for the table: d1 : Name, Season → Team, Stadium, Goals. (ii) And a second FD stating that Team implies Stadium: d2 : Team → Stadium.

We specify these rules in Bart using the language of denial constraints. Denial constraints are a very expressive language, capable of capturing most data-quality rules used for data repairing, including FDs, CFDs, cleaning equality-generating dependencies, editing rules, fixing rules, and ordering constraints [13]. For the sake of simplicity, here we omit the technical details about the syntax and semantics of denial constraints, and show example data-quality rules in the more familiar syntax of FDs.
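Still, to give a flavor of the language, note that an FD like d2 admits a standard first-order rendering as a denial constraint [15], stating that no two Player tuples may agree on Team while disagreeing on Stadium:

    d2 : ∀ t, t′ ∈ Player : ¬( t.Team = t′.Team ∧ t.Stadium ≠ t′.Stadium )

Similarly, d1 forbids two tuples that agree on Name and Season but disagree on Team (and likewise for Stadium and Goals).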
To evaluate data-repairing systems, we proceed as follows. (i) We start with a clean instance I, like the one in Figure 2, and the set of constraints Σ = {d1, d2} discussed above. (ii) We inject errors by applying a set of cell changes; each cell change ch = ⟨ti.A := v⟩ updates the value of attribute A in tuple ti to a new value v, e.g., ⟨t1.Season := 2011-12⟩. By applying a set of cell changes Ch to I, we obtain a new instance Id = Ch(I), named the dirty instance. (iii) We run a data-repairing algorithm over Id to obtain a repaired instance Irep. We measure the quality of the algorithm using precision and recall. Intuitively, we count how many changes in Ch have been restored to their original values in Irep. Further details are in the full paper [2].

We now want to emphasize how different ways to change the cells of the clean instance may lead to errors that show completely different features when considered from the perspective of a data-repairing tool.

2.1 Detectability

When evaluating a constraint-based repair algorithm, we want to make sure the errors we inject are detectable by the system in question. After all, an error that cannot be detected cannot be repaired. To reason about detectability, we need a notion for determining whether a cell change is involved in a constraint violation. Consider the following cell change, ch1 = ⟨t1.Season := 2012-13⟩, which updates tuple t1 as follows:

     Player
     Name      Season   Team      Stadium      Goals
t1 : Giovinco  2012-13  Juventus  Juv.Stadium  3

This change does not introduce a violation of any of the constraints in Σ, i.e., after the cell change the modified database instance still satisfies all the constraints. Therefore, any data-repairing tool that relies on the constraints to detect dirtiness in the database will not be able to detect the change. We call this an undetectable change.

More formally, a cell change ch = ⟨ti.A := v⟩ in Ch introduces a detectable error in I for constraint dc if: (i) cell ti.A is involved in a violation of dc in instance Id, and (ii) cell ti.A was not involved in a violation of dc in instance I. Here “involved in a violation” is defined based on the fact that a constraint can be associated with a query that returns sets of cells that cause a violation. We call these queries violation-detection queries. A cell is involved in a violation of a constraint dc if it is in the result of the violation-detection query for dc [2]. For example, the violation-detection query for d2 is:

    Qd2(i, i′, t, a, a′) = Player(i, n, s, t, a, g), Player(i′, n′, s′, t, a′, g′), a ≠ a′, i ≠ i′.

Here we assume that every tuple has a unique identifier, which by convention is the first attribute, and we express queries in a notation similar to Datalog. This query returns the Team and Stadium attributes of pairs of Player tuples with the same team, but different stadiums. Using the tuple ids, sets of cells involved in violations can be determined based on the result of this query.

Bart allows users to control whether changes to the clean database are detectable using the constraints in Σ. Bart may be configured to generate random changes, which need not be detectable, or errors that are guaranteed to be detectable using the constraints. Note this requires Bart to reason efficiently and holistically about a set of changes to ensure that they are detectable using a given set of constraints.

Interestingly, this latter requirement significantly increases the complexity of the error-generation process. In fact, generating a given number of errors in a clean database that are detectable using a set of constraints Σ is an NP-complete problem [2]. To deal with this complexity, Bart implements several novel optimizations [2] that balance the need for control over the nature of errors with scalability.
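To make these notions concrete, the following is a minimal Python sketch – our illustration, not Bart’s actual code – of the violation-detection query for d2 and of the detectability check it induces. Attribute names follow Figure 2; tuples are modeled as dictionaries with a unique identifier under the (hypothetical) key 'tid':

    def violations_d2(player):
        """Violation-detection query for d2 (Team -> Stadium): returns the
        sets of cells (tid, attribute) that jointly cause a violation."""
        result = []
        for t in player:
            for u in player:
                if (t['tid'] != u['tid'] and t['Team'] == u['Team']
                        and t['Stadium'] != u['Stadium']):
                    result.append({(t['tid'], 'Team'), (t['tid'], 'Stadium'),
                                   (u['tid'], 'Team'), (u['tid'], 'Stadium')})
        return result

    def is_detectable(tid, attr, clean, dirty, violation_query):
        """A cell change <ti.A := v> is detectable for a constraint iff the
        changed cell is involved in a violation in the dirty instance Id
        but was not involved in one in the clean instance I."""
        cell = (tid, attr)
        def involved(instance):
            return any(cell in violation for violation in violation_query(instance))
        return involved(dirty) and not involved(clean)

For instance, after the undetectable change ch1 above, violations_d2 (and the analogous query for d1) return no violation involving the changed cell, so is_detectable returns False for it.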
2.2 Repairability

An alternative change that does introduce a detectable error is the following: ch2 = ⟨t1.Season := 2014-15⟩. After this update, tuples t1 and t2 violate FD d1, which states that Name and Season are a key for the table:

     Player
     Name      Season   Team      Stadium      Goals
t1 : Giovinco  2014-15  Juventus  Juv.Stadium  3
t2 : Giovinco  2014-15  Toronto   BMO Field    23

This change is easily detected using the constraints. Still, it is quite difficult for an automatic data-repairing algorithm to restore the database to its clean state. Notice, in fact, that after this change the original value 2013-14 has been removed from the active domain of the dirty database. A correct repair cannot be found by any repair algorithm that uses the values in the database as the candidates for repair. Bart uses the notion of repairability of an error to characterize this aspect. In the case above, it would assign repairability 0 to change ch2.

Different detectable changes may have quite different repairability values. As an example, consider now change ch3 = ⟨t1.Stadium := Delle Alpi⟩. The change is detectable using FD d2. In addition, the redundancy in the dirty database may be used to repair the database:

     Player
     Name      Season   Team      Stadium      Goals
t1 : Giovinco  2013-14  Juventus  Delle Alpi   3
t3 : Pirlo     2014-15  Juventus  Juv.Stadium  5
t5 : Vidal     2014-15  Juventus  Juv.Stadium  8

The new, dirty tuple t1 is involved in two violations of d2, one with t3 and another with t5. In both cases, the dirty value conflicts with Juv.Stadium. By a straightforward probabilistic argument, Bart would calculate a repairability of 2/3 for this error, and rank it as a medium-repairability error.

Errors may have even higher repairability, up to 1 in some cases. Consider, for example, an additional rule d3 : Team[Juventus], Season[2013-14] → Stadium[Juv.Stadium]. This CFD states unequivocally that Juventus played its home games for season 2013-14 in the Juventus Stadium. Since this knowledge is part of the constraint, the dirty cell can easily be restored to its original, clean state.

2.3 Other Kinds of Errors

To conclude this discussion of the features of errors, we note that the notions of detectability and repairability, which are centered around detecting violations of constraints, are not the only ones supported by Bart. Consider, for example, change ch4 = ⟨t1.Goals := 123⟩. This change is not detectable using the constraints. However, it might be detected by a statistics-based data-repairing algorithm, because it introduces an outlier into the distribution of values of the Goals attribute. Bart can be configured to generate changes of this kind as well.

Fig. 3. User interface

3 Overview of the System

Bart provides users with the graphical user interface shown in Figure 3 to handle error-generation tasks. An error-generation task E is composed of four key elements: (i) a database schema S; (ii) a set Σ of denial constraints (DCs) encoding data-quality rules over S; (iii) an instance I of S that is clean with respect to Σ; (iv) a set of configuration parameters Conf (shown in Figure 3, area 1) to control the error-generation process. These parameters specify, among other things, which relations can be changed, how many errors should be introduced, and how many of these errors should be detectable. They also let the user control the degree of repairability of the errors.
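Since Conf lets users constrain the repairability of the generated errors, it may help to see how such an estimate can be computed. The following Python fragment is a minimal sketch of the frequency-based argument used for ch3 in Section 2.2; it is our simplification, not Bart’s actual estimator, which is described in [2]:

    from collections import Counter

    def estimate_repairability(dirty_value, conflicting_values):
        """Estimate repairability as the fraction of cells in the violation
        cluster that carry the value a repair algorithm would need in order
        to restore the clean state (a simplified sketch; see [2])."""
        cluster = Counter([dirty_value] + conflicting_values)
        # The presumed correct value is the most frequent value among the
        # cells the dirty cell conflicts with.
        correct = Counter(conflicting_values).most_common(1)[0][0]
        return cluster[correct] / sum(cluster.values())

    # ch3 = <t1.Stadium := Delle Alpi> conflicts with t3 and t5:
    print(estimate_repairability('Delle Alpi', ['Juv.Stadium', 'Juv.Stadium']))
    # 0.666... -> the 2/3 repairability discussed in Section 2.2

Under this reading, ch2 from Section 2.2 gets repairability 0, since no cell in the database carries the correct value 2013-14.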
Based on this, Bart supports several use cases. The main one consists of generating a desired degree of detectable errors for each constraint. In addition, users may also specify a range of repairability values for each constraint; Bart will estimate the repairability of changes, and only generate errors with estimated repairability within that range. In addition to detectable errors, Bart may also generate random errors of several kinds: typos (e.g., ‘databse’), duplicated values, bogus or null values (e.g., ‘999’, ‘***’). Random errors may be freely mixed with constraint-induced ones. Finally, Bart can introduce outliers in numerical attributes.

Bart provides sophisticated features to analyze the characteristics of the errors that are introduced in the data. It generates charts to analyze the number of errors detected by each constraint, and their estimated repairability (shown in Figure 3, area 3). It also offers a versioning system that allows users to generate different dirty databases for a given scenario, and compare the characteristics of their errors.

Finally, Bart offers a flexible set of metrics to measure the quality of the repairs generated by a data-repairing tool. In fact, different algorithms can repair data in different ways. For example, some algorithms can produce repairs that mark dirty cells using a variable, while others always restore the dirty instance with a constant value. Different metrics have been proposed and are implemented in Bart to uniformly evaluate these heterogeneous changes in the data [4, 11].

4 Empirical Evaluation

We conduct an empirical evaluation [17] of several data-repairing algorithms over dirty data generated using Bart. We used two publicly available tools, Llunatic [11] and Nadeef [8], to run four rule-based data-repairing algorithms: (i) Greedy [5, 7]; (ii) Holistic [6]; (iii) Llunatic [10]; and (iv) Sampling [4]. In addition, we evaluated (v) SCARE [18], a statistics-based tool.

The tools were tested with several repair tasks, based on synthetic and real datasets, some of them constraint-based and some statistics-based. We briefly list them here: (i) Employees is a synthetic scenario described in the full paper [2]; (ii) Customers is a synthetic scenario from Geerts et al. [10]; (iii) Tax is a synthetic scenario from Fan et al. [9]; (iv) Bus is a real-world scenario from Dallachiesa et al. [8]; and (v) Hospital is a real-world scenario used in several data-repairing papers (e.g., [8, 10]).

Datasets and constraints have been chosen to exhibit different characteristics. Some have high redundancy in their data. Others contain numerical attributes, and constraints containing ordering (<, >) comparisons. Some datasets have master data [16] and CFDs, while others have only FDs. All these differences help to validate our techniques and the tools under examination. Note that the purpose of the evaluation was not to assess the quality of the repair algorithms, but rather to show how Bart can be used to uncover new insights into the data-repairing process. Some important insights are the following.

Lesson 1: Data-Repairing is not yet Mature. A wide degree of variability in quality among all algorithms emerges from our evaluations. This variability does not clearly emerge from evaluations reported in the literature, suggesting there is no definitive data-repairing algorithm yet.

Lesson 2: Repairability Matters. We observe different trends with respect to repairability.
Typically, repair algorithms return very good repairs when sufficient information is available (i.e., high repairability); however, their quality tends to degrade quickly as repairability decreases. A key observation is that repairability has a strong correlation with the quality of the repairs. In this respect, we believe it nicely captures the “hardness” of the data-repairing problem and helps in getting a concrete intuition of the power and the limits of existing solutions.

Lesson 3: We Need to Document Our Dirty Data. We may conclude that tools exhibit quite different performance on data-repairing problems of different natures, and that repairability is a natural proxy to characterize how “difficult” a data-repairing problem is. In light of this, and to level the field, we believe it is crucial to have at our disposal systematic error-generation tools and to properly document the characteristics of the dirty data used in empirical evaluations of data-repairing solutions. Bart is a concrete step in this direction.

Lesson 4: Generating Errors is Hard. The problem of systematically generating errors, however, is not an easy one. We show how different configurations of the error-generation task affect the overall scalability of the system, and discuss the main optimizations that Bart relies on in order to tame the complexity of the process.

References

1. B. Alexe, W. Tan, and Y. Velegrakis. Comparing and Evaluating Mapping Systems with STBenchmark. PVLDB, 1(2):1468–1471, 2008.
2. P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, and D. Santoro. Messing-Up with BART: Error Generation for Evaluating Data Cleaning Algorithms. PVLDB, 9(2):36–47, 2015.
3. M. Benedikt, G. Konstantinidis, G. Mecca, B. Motik, P. Papotti, D. Santoro, and E. Tsamoura. Benchmarking the Chase. In PODS, pages 37–52, 2017.
4. G. Beskales, I. F. Ilyas, and L. Golab. Sampling the Repairs of Functional Dependency Violations under Hard Constraints. PVLDB, 3(1):197–207, 2010.
5. P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. In SIGMOD, pages 143–154, 2005.
6. X. Chu, I. F. Ilyas, and P. Papotti. Holistic Data Cleaning: Putting Violations into Context. In ICDE, pages 458–469, 2013.
7. G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving Data Quality: Consistency and Accuracy. In VLDB, pages 315–326, 2007.
8. M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a Commodity Data Cleaning System. In SIGMOD, pages 541–552, 2013.
9. W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional Functional Dependencies for Capturing Data Inconsistencies. ACM TODS, 33, 2008.
10. F. Geerts, G. Mecca, P. Papotti, and D. Santoro. Mapping and Cleaning. In ICDE, pages 232–243, 2014.
11. F. Geerts, G. Mecca, P. Papotti, and D. Santoro. That’s All Folks! LLUNATIC Goes Open Source. PVLDB, 7(13):1565–1568, 2014.
12. J. He, E. Veltri, D. Santoro, G. Li, G. Mecca, P. Papotti, and N. Tang. Interactive and Deterministic Data Cleaning. In SIGMOD, pages 893–907, 2016.
13. I. F. Ilyas and X. Chu. Trends in Cleaning Relational Data: Consistency and Deduplication. Foundations and Trends in Databases, 5(4):281–393, 2015.
14. H. Köpcke, A. Thor, and E. Rahm. Evaluation of Entity Resolution Approaches on Real-World Match Problems. PVLDB, 3(1-2):484–493, 2010.
15. A. Lopatenko and L. Bravo. Efficient Approximation Algorithms for Repairing Inconsistent Databases.
In ICDE, pages 216–225, 2007.
16. D. Loshin. Master Data Management. Knowl. Integrity, Inc., 2009.
17. D. Santoro, P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, and P. Papotti. BART in Action: Error Generation and Empirical Evaluations of Data-Cleaning Systems. In SIGMOD, pages 2161–2164, 2016.
18. M. Yakout, L. Berti-Équille, and A. K. Elmagarmid. Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes. In SIGMOD, pages 553–564, 2013.