      Workload-Aware Discovery of Integrity Constraints for
                        Data Cleaning

                                                   Eduardo H. M. Pena
                                         supervised by Eduardo Cunha de Almeida
                                                       Federal University of Paraná
                                                         ehmpena@inf.ufpr.br

Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018, Rio de Janeiro, Brazil. Copyright (c) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Violations of integrity constraints (ICs) usually indicate that data are not consistent with a valid representation of real-world entities, and therefore that the data are dirty. Recent studies present facilities to make dirty databases consistent with a given set of ICs. Yet their evaluation scenarios assume humans-in-the-loop to feed ICs to these facilities. Manually designing ICs is burdensome, and it is doomed to fail if we consider the dynamic changes of data and applications. An attractive alternative is the automatic discovery of ICs, but the discovered results are mostly accidental. Our research proposal sticks to the automatic discovery approach, but leverages information found in the application workload to mine semantically valuable ICs. This paper overviews our research on discovering ICs that are particularly suitable for data cleaning, i.e., ICs that are resilient to data and application evolution.

1. INTRODUCTION
Data profiling is a complex task that helps to uncover relevant metadata for datasets. Typical examples of metadata include basic statistics (e.g., value distributions), patterns of data values, and integrity constraints (ICs). These kinds of information provide valuable insights into large datasets by serving various use cases, such as data exploration, query optimization, and data cleaning [1]. Our research program is built upon data profiling for data cleaning.

Dirty data remain ubiquitous in enterprises. Industry and academia constantly report the severe impact that low data quality has on businesses [6, 11]. The key question, what are dirty data?, has been extensively studied [8], with many studies revisiting ICs to enforce business rules over data [7, 8, 11]. In this context, a violation of ICs usually indicates that data are not consistent with a valid representation of real-world entities, and therefore that the data are dirty. Many approaches follow the idea of detecting and repairing IC violations for data cleaning. This research trend has brought many contributions: reasoning about various ICs other than functional dependencies (FDs) [9]; practical techniques for IC violation detection [11]; and methods to repair database errors by enforcing ICs [17]; to name but a few.

It is crucial that there be effective methods to discover ICs from data. Manually designing ICs is a cost-prohibitive and, worse yet, error-prone task: domain experts would need to keep up with the ever-evolving semantics of data and applications. Moreover, the set of IC candidates is usually too large for human validation, even for small datasets. Indeed, many algorithms for the automatic discovery of ICs have been developed [14, 5]. In theory, data cleaning pipelines could use IC discovery algorithms to mine ICs from sample data, and then use those ICs as data quality rules to detect and repair errors in the database. However, recent evaluations of IC-based cleaning tools have only used manually defined ICs. Therefore, we believe there are still challenges to overcome before a fully automated IC-based data cleaning pipeline can operate.

Discovering ICs from data might return unreliable result sets. If the discovery is carried out on dirty data, the discovered ICs are likely dirty themselves. Even if the discovery is carried out on cleaned samples of data, which are hard and expensive to get, most of the discovered ICs are likely accidental. The number of discovered ICs radically increases as the number of attributes in the dataset goes up. For example, datasets with dozens of attributes and a few thousand records produce result sets with thousands of functional dependencies [14]. Although some classes of IC allow for implication analysis to discard redundant ICs (e.g., canonical covers of functional dependencies), the number of discovered ICs that carry no semantic meaning remains huge. Furthermore, the effectiveness of current methods for data repairing is highly influenced by the number of ICs the methods consider. Our goal is to mine ICs for data cleaning: if these ICs are compliant with the most current semantics of data and applications, they should help to repair the database and to spot erroneous data in future loads.

We propose an initial approach to discovering ICs that are relevant for data cleaning. We stick to the automatic IC discovery approach, but, to support environments in which both data and applications evolve, we focus on ICs matching the current queries in the pipe. The intuition is to use the query workload of the application to identify the data spots and structural fragments (e.g., sets of attributes and predicates) that are important to users. These assets generate a workload characterization that is semantically relevant for the current application.
Our approach uses this characterization to guide the IC discovery process. From the workload characterization, our approach estimates a set of features for ICs. Our vision is a classifier that leverages these features to classify a set of ICs based on their applicability for data cleaning. In this thesis, we aim to answer the following research questions:

• How can database workloads benefit IC discovery?
• How can automatic IC discovery serve data cleaning?
2. RELATED WORK
Most of the recent approaches to IC-based data cleaning manually define the sets of ICs used to evaluate their solutions [19, 17]. Many of these approaches assume that ICs and data do not change very often, and use a fixed set of ICs to repair the database through modifications and deletions of inconsistent tuples. A few approaches consider evolving ICs; in this case, the database is repaired after modifications to both records and ICs [19]. More recently, the authors of [18] leverage automatic IC discovery for detecting data errors. However, they use the discovered ICs (they focus on FDs) only to generate questions that guide an interaction with experts. The goal of this interaction is to discover as many FD-related errors in the data as possible on a budget of questions.

Most works on IC discovery have focused on attribute dependencies. Liu et al. [12] present a comprehensive review of the topic. Papenbrock et al. [14] have looked into implementation details, experimental evaluation, and comparison of various FD discovery algorithms. Studies on IC discovery usually focus on the algorithmic issues of the process, leaving the applicability of the discovered results aside. To handle the large number of discovered ICs, some approaches score ICs on instance-driven metrics for user validation [12, 5]. Unfortunately, these scoring schemes do not consider the dynamic changes in data and applications.
                                                                            cide whether or not DCs are reliable. Feature extraction
3. MINING RELEVANT ICs
This section describes the main steps we are taking towards answering our research questions.

3.1 Denial Constraints
In the context of this thesis, a database is clean if it is consistent with a given set of ICs. Some types of ICs can express semantic rules that other ICs cannot, and vice versa. Denial constraints (DCs) [5] are known to be a response to this expressiveness issue because they generalize important types of ICs, such as functional dependencies (FDs), conditional FDs, and check constraints. Recent works have introduced methods that automatically repair violations of DCs in the database [17]; hence, our proposal is also based on DCs.

DCs define sets of predicates that databases must satisfy to prevent attributes from taking combinations of values considered semantically inconsistent. Consider a relational database schema R and a set of operators W : {=, ≠, <, ≤, >, ≥}. A DC [2, 5] has the form ϕ : ∀tx, ty, ... ∈ r, ¬(P1 ∧ ... ∧ Pm), where tx, ty, ... are tuples of an instance r of relation R, and R ∈ R. A predicate Pi is a comparison atom of either the form v1 wo v2 or v1 wo c, where v1, v2 are variables tid.Aj, Aj ∈ R, id ∈ {x, y, ...}, c is a constant from Aj's domain, and wo ∈ W. For example, consider the schema of relation Employees(Name, Manager, Salary, Bonus), Employees ∈ R. A DC ϕ can express the constraint "if two employees are managed by the same person, the one earning a higher salary has a higher bonus" as follows: ¬(tx.Manager = ty.Manager ∧ tx.Salary > ty.Salary ∧ tx.Bonus < ty.Bonus).

An instance r of a relation satisfies a DC ϕ if at least one predicate of ϕ is false for every pair of tuples of r, i.e., the predicates of ϕ cannot all be true at the same time.
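To make these semantics concrete, the following sketch (ours, not taken from [5]; data values are invented for illustration) checks the example DC over a toy Employees instance by testing every ordered tuple pair:

from itertools import permutations

# Tuples of the Employees relation: (Name, Manager, Salary, Bonus).
employees = [
    ("Alice", "Carol", 90000, 9000),
    ("Bob",   "Carol", 80000, 9500),  # violates the DC together with Alice
    ("Dan",   "Erin",  70000, 7000),
]

def violates(tx, ty):
    # All predicates of the DC hold for (tx, ty), i.e., the pair violates
    # ¬(tx.Manager = ty.Manager ∧ tx.Salary > ty.Salary ∧ tx.Bonus < ty.Bonus).
    return tx[1] == ty[1] and tx[2] > ty[2] and tx[3] < ty[3]

# The DC is satisfied only if no ordered tuple pair makes all predicates true.
violations = [(tx[0], ty[0]) for tx, ty in permutations(employees, 2)
              if violates(tx, ty)]
print(violations)  # [('Alice', 'Bob')]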
3.2 Proposed Approach
We envision an automated data cleaning system. Figure 1 shows an overview of the expected system architecture.

[Figure 1: Proposed Data Cleaning Pipeline. The query workload W (e.g., SELECT A,B FROM R WHERE D<10) drives the workload characterization step, which builds samples s and attribute clusters over the database. These feed the DC profiler: DC discovery computes the evidence set Es = {Qtµ,tν | ∀(tµ, tν) ∈ s} over a predicate space P′ = {P1, P2, ...} and, optionally guided by DC templates ϕ : ¬(P1 ∧ ...), emits candidate DCs; the DC classifier selects the reliable ones, which drive data repairing.]

The system regularly collects query logs from the database server to build workloads W. The characterization of a workload W takes into account the query structures and the tuples that are selected to produce the query results. The outputs of the workload characterization are data samples and attribute clusters.

The DC profiler takes as input the data samples, the attribute clusters, and, optionally, user-defined DC templates. The profiler sets the predicate space (from which DCs are mined) according to the attribute clusters. Relying on this predicate space, our DC discovery algorithm discovers approximate DCs from the samples. The next step is to discard the superfluous DCs and select the relevant ones. It is important to discard superfluous DCs so that data repairing only has to consider DCs that capture the current semantics of the data and applications. In [15], we studied the use of query workloads for the automatic selection of FDs. We developed ranking strategies to select FDs that match the current workload. The results have shown that it is possible to select FDs that are semantically close to the query workload and whose statistical distributions are reasonably similar to those of the general FDs produced by exhaustive FD discovery. However, selecting DCs is more complex and requires further investigation. We have been evaluating decision tree ensembles for this task. The general idea is to estimate different features for DCs, and then use a classifier to decide whether or not the DCs are reliable. Feature extraction is based on structural properties of DCs (e.g., how many predicates they have) and on metrics that correlate DCs with the database and the query workload. The DCs classified as reliable are expected to hold in the database as long as the database manipulations produce semantically clean data. Our implementation of the DC classifier is still under way.

The last task is finding and repairing the tuples that violate at least one constraint from the DC profiling results. We have been using HoloClean [17] for this task. Our current setup only uses the DC-based error detection and error repairing modules of the tool. However, we believe that our system could be adjusted to produce different signals to feed the probabilistic graphical model of HoloClean, for example, by deriving features from the attribute clusters and samples for the graphical model. We postpone this kind of investigation because we are currently focusing on the data profiling aspects of the system.
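To summarize the flow of Figure 1, a minimal orchestration sketch follows; every function name is a hypothetical placeholder for a pipeline component, not an implemented API:

# Hypothetical wiring of the pipeline in Figure 1; all helpers are
# placeholders, and the real components are still under development.
def clean_database(db, query_log):
    workload = build_workload(query_log)            # W = {<q1, w1>, ...}
    samples = build_samples(db, workload)           # horizontal access patterns
    clusters = cluster_attributes(workload)         # vertical access patterns
    spaces = to_predicate_spaces(clusters, db)      # one space per cluster
    candidates = [dc
                  for space, sample in zip(spaces, samples)
                  for dc in discover_dcs(sample, space)]   # BFASTDC
    reliable = [dc for dc in candidates
                if classify(dc, db, workload) == "reliable"]
    return repair(db, reliable)                     # e.g., HoloClean modules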
                                                                                                           there is only one relation in the database. For each query qi ,
                                                                                                           our system associates the attributes in the operators of the
3.3 Discovering DCs
We first worked on the algorithmic issues of discovering DCs from data. At the beginning of our research program, the only known algorithm for DC discovery was FASTDC [5], for which there was no publicly available implementation. We implemented FASTDC from scratch, and our results agree with those of the original publication. Next, we briefly describe FASTDC and the overall improvements we have developed for the algorithm.

The first step in discovering DCs is to set the predicate space P from which DCs are derived. Experts may define predicates for attributes based on the database structure, or use specialized tools for the task, e.g., [20]. The satisfied predicate set Qtµ,tν of an arbitrary pair of tuples (tµ, tν) ∈ r is a subset Q ⊂ P such that for every P ∈ Q, P(tµ, tν) is true. The set of satisfied predicate sets of r is the evidence set Er = {Qtµ,tν | ∀(tµ, tν) ∈ r}. Different tuple pairs may return the same predicate set; hence, each Q ∈ Er is associated with an occurrence counter.

A cover for Er is a set of predicates that intersects with every satisfied predicate set of Er, and it is minimal if none of its subsets equally intersects with Er. The authors of FASTDC demonstrate that the minimal covers of Er represent the predicates of minimal DCs [5]. FASTDC uses a depth-first search (DFS) strategy to find minimal covers for Er.
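As an illustration, the following sketch builds the evidence set Er with occurrence counters; it is a naive quadratic construction over a toy predicate space, not the optimized FASTDC code:

from collections import Counter
from itertools import permutations

# Toy instance r of Employees and a small predicate space P; each predicate
# is a named Boolean function over an ordered tuple pair (tx, ty).
r = [{"Manager": "Carol", "Salary": 90000, "Bonus": 9000},
     {"Manager": "Carol", "Salary": 80000, "Bonus": 9500},
     {"Manager": "Erin",  "Salary": 70000, "Bonus": 7000}]

P = {
    "Manager(tx)=Manager(ty)": lambda tx, ty: tx["Manager"] == ty["Manager"],
    "Salary(tx)>Salary(ty)":   lambda tx, ty: tx["Salary"] > ty["Salary"],
    "Bonus(tx)<Bonus(ty)":     lambda tx, ty: tx["Bonus"] < ty["Bonus"],
}

def evidence_set(r, P):
    # E_r: the satisfied predicate set of every ordered tuple pair, with an
    # occurrence counter for pairs that yield the same set.
    E = Counter()
    for tx, ty in permutations(r, 2):
        E[frozenset(name for name, pred in P.items() if pred(tx, ty))] += 1
    return E

A minimal cover of the resulting evidence set, i.e., a minimal set of predicates that intersects every element of Er, then yields the predicate set of a minimal DC.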
FASTDC builds the evidence set by evaluating every predicate of the predicate space P on every pair of tuples of the relation instance r. Our improved version of the algorithm, BFASTDC [16], is a bitwise version of FASTDC that exploits bit-level operations to avoid unnecessary tuple comparisons. BFASTDC builds associations between attribute values and lists of tuple identifiers, so that different combinations of these associations indicate which tuple pairs satisfy each predicate. To frame evidence sets, BFASTDC operates over auxiliary bit structures that store predicate satisfaction data. This allows the algorithm to use simple logical operations (e.g., conjunctions and disjunctions) to imply the satisfaction of the remaining predicates. In addition, BFASTDC can use two modifications described in [5] to discover approximate and constant DCs. These DC variants let the discovery process work with data containing errors (e.g., integrated data from multiple sources).
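The bitwise idea can be illustrated as follows; this is our simplified sketch of predicate bitsets over tuple-pair identifiers (reusing the toy Employees instance above), not the actual BFASTDC data layout:

# Each predicate maps to a bitset over ordered tuple pairs of (t0, t1, t2):
# bit k is set iff pair number k satisfies the predicate.
pairs = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1)]
ALL = (1 << len(pairs)) - 1

def bitset(satisfying_pairs):
    bits = 0
    for k, pair in enumerate(pairs):
        if pair in satisfying_pairs:
            bits |= 1 << k
    return bits

eq_manager = bitset({(0, 1), (1, 0)})          # tx.Manager = ty.Manager
gt_salary  = bitset({(0, 1), (0, 2), (1, 2)})  # tx.Salary > ty.Salary

# The satisfaction of a derived predicate follows from logical operations,
# e.g., tx.Salary <= ty.Salary is the complement of tx.Salary > ty.Salary:
le_salary = ~gt_salary & ALL

# Tuple pairs satisfying a conjunction of predicates: a single AND.
both = eq_manager & gt_salary                  # bit 0, i.e., pair (0, 1)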
In our experiments, BFASTDC produced considerable improvements in DC discovery performance (up to 24-fold compared to FASTDC). Unfortunately, the number of discovered DCs was unmanageably large, even after selecting only the DCs with high scores for the interestingness dimension described by the authors of FASTDC [5].

Discovering DCs involves handling large search spaces. Running times for DC discovery increase when more tuples are considered, and increase dramatically when the discovery is set for large predicate spaces. Instead of running BFASTDC over the entire dataset and large predicate spaces, our system focuses the discovery efforts on the areas that are relevant for the application workload.
3.4 Workload Characterization
We hypothesize that the information from the application workload (e.g., selection filters in SQL statements) is a powerful asset to narrow down the large number of discovered DCs. Query workloads present strong access patterns which, at either the horizontal level (individual tuples) or the vertical level (individual attributes), point to specific database regions that are accessed more frequently than others.

Consider a set of user queries Q = {q1, ..., qm} that is expected to run on the database. For simplicity, we assume there is only one relation in the database. For each query qi, our system associates the attributes in the operators of qi (e.g., projection and selection) with the attributes of the relation R(A1, ..., An) to define an attribute occurrence matrix (AOM) O as follows:

    oij = 1 if qi uses attribute Aj,
    oij = 0 otherwise.

Each entry oij indicates whether attribute Aj of relation R is required to answer query qi.

The AOM model is straightforward but can be enhanced to support dynamic workloads in which the incoming queries are not necessarily identical to each other. We use the approach described in [3], which models workloads as sets of queries paired with weights, W = {⟨q1, w1⟩, ..., ⟨qn, wn⟩}. Each weight wi is associated with a query qi and measures the significance of qi in the workload. Thus, AOMs are weighted according to the importance of their queries. The AOMs for each application are straightforward to model if the domain experts are aware of the applications that will run on the database. Furthermore, modern DBMSs commonly provide tools to help with this kind of workload modeling [4].
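As an illustration, a weighted AOM could be materialized from an already-parsed workload as below; the query representation and the weights are assumptions made for the example, not the system's actual data structures:

# Attributes of the single relation R(A1, ..., An).
attributes = ["Name", "Manager", "Salary", "Bonus"]

# Parsed workload: each query lists the attributes used by its operators
# (projection, selection, ...) and carries a significance weight w_i.
workload = [
    {"attrs": {"Name", "Salary"},   "weight": 0.7},  # e.g., SELECT Name ... WHERE Salary ...
    {"attrs": {"Manager", "Bonus"}, "weight": 0.3},
]

# Weighted AOM: row i describes query q_i, column j describes attribute A_j.
aom = [[q["weight"] if a in q["attrs"] else 0.0 for a in attributes]
       for q in workload]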
The authors of [3] describe a probabilistic method to measure the similarity between incoming queries and a workload W. Our system uses this method to identify substantial changes in the workload, so that it knows when to start profiling DCs again. For each different workload, our system estimates the data samples and predicate spaces over which the DC discovery runs.

The data samples are expected to represent the data that is most accessed by the applications and, therefore, to spot the data that most likely produces errors. We have been testing three strategies to build samples from queries (a sketch of the baseline strategy follows below). The baseline (naive) strategy builds a single sample with every tuple that is used to answer the queries in the workload. The second strategy builds a sample for each fundamental region [3]; these regions partition the tuples of the relation instance such that each region intersects with the tuples selected by a maximum number of queries. The third strategy is based on a self-pruning splay tree (SPST) index [10]. The SPST keeps the most accessed tuple references near the root of the tree; our system uses the strategy of [10] to prune the tree and to build samples from the remaining partitions.
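For instance, the baseline strategy reduces to collecting every tuple that some query selects. The sketch below assumes each query exposes its selection predicate as a Boolean function; the fundamental-region and SPST strategies are more involved and are not shown:

def naive_sample(relation, workload):
    # Single sample: every tuple used to answer some query in the workload.
    sample = []
    for tid, tup in enumerate(relation):
        if any(q["selects"](tup) for q in workload):
            sample.append((tid, tup))
    return sample

# Example: SELECT A, B FROM R WHERE D < 10 contributes the predicate below.
relation = [{"A": 1, "B": 2, "D": 5}, {"A": 3, "B": 4, "D": 42}]
workload = [{"selects": lambda tup: tup["D"] < 10, "weight": 1.0}]
print(naive_sample(relation, workload))  # [(0, {'A': 1, 'B': 2, 'D': 5})]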
Predicate spaces come from the AOMs and restrict the search space of DC discovery to predicates over attributes with a high affinity to each other. Our system uses the frequency of attributes in the queries to transform AOMs into attribute affinity matrices (AAMs), in which each entry represents the summed access frequencies of the queries over the corresponding attributes. Further, it uses the graph-based algorithm described in [13] to build a spanning tree from which attribute clusters are derived. These clusters form the predicate spaces as follows: the system defines single-tuple and two-tuple predicates on categorical attributes using the operators {=, ≠}, and on numerical attributes using the operators {=, ≠, <, >, ≤, ≥}. It defines predicates involving two different attributes only if the values of the two attributes are of the same order of magnitude.

BFASTDC runs for every pair of predicate space and data sample. The output of this phase is a set of approximate DCs (i.e., DCs approximately satisfied by the whole dataset). In preliminary experiments, we noticed that many of these DCs were, in fact, exact, or at least had structures similar to many DCs found by exhaustive discovery.
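Continuing the AOM sketch above, one common way to read this affinity is to accumulate, for each pair of attributes, the weights of the queries that use both; the exact formula used in [13] may differ:

# aff[i][j]: summed weights of the queries that access both A_i and A_j.
n = len(attributes)
aam = [[0.0] * n for _ in range(n)]
for q_row, q in zip(aom, workload):
    for i in range(n):
        for j in range(n):
            if q_row[i] > 0 and q_row[j] > 0:
                aam[i][j] += q["weight"]
# Clustering a spanning tree over this affinity graph (as in [13]) yields
# the attribute clusters that delimit each predicate space.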
3.5 Features for DCs
Among the discovered DCs, how do we choose those that are relevant for data cleaning? To answer that question, we are building a classifier that simulates expert judgment. The main challenge is to form descriptions of "good" DCs. We have included the workload characterization in the discovery process so that our system can estimate features regarding a DC's application adherence, and not only features regarding its data support.

So far, we have considered the following features for DCs: the statistical significance of a DC for the input dataset, as in [5]; the number of predicates [5]; the similarity between pairs of DCs, an adaptation of the method described in [15]; the similarity between DCs and AAMs, also an adaptation of [15]; and the number of DC-related errors in the samples and in the entire dataset.

We have been developing ensembles [21] that combine different decision trees trained on the above features. For now, the training sets for the decision trees are user-defined DC templates. One could also use the set of DCs currently defined in the database, or manually define acceptable values for the features of the DCs.
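A compact sketch of this classification step is shown below; it assumes labeled feature vectors X, y derived from user-defined DC templates, the feature helpers are hypothetical placeholders, and scikit-learn's random forest stands in for our ensemble, whose implementation is still under way:

from sklearn.ensemble import RandomForestClassifier

def features(dc, dataset, workload):
    # Hypothetical feature extraction; every helper below is a placeholder.
    return [statistical_significance(dc, dataset),  # interestingness, as in [5]
            len(dc.predicates),                     # structural property [5]
            similarity_to_known_dcs(dc),            # adapted from [15]
            affinity_with_aam(dc, workload),        # adapted from [15]
            error_count(dc, dataset)]               # DC-related errors

# X: feature vectors of labeled DCs (e.g., derived from DC templates);
# y: 1 if a DC is considered reliable for cleaning, 0 otherwise.
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
reliable = [dc for dc in candidate_dcs
            if clf.predict([features(dc, dataset, workload)])[0] == 1]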
4. RESEARCH PLAN
This research program is currently in its second year. The research challenges we will have to face to answer our research questions are as follows. First, we will establish an evaluation scenario in which data, DCs, and applications evolve. So far, we have been running preliminary experiments on datasets that have already been used to evaluate DC discovery [5]. We plan to adapt the evaluation protocols of [19] and [3] to our workload-aware scenario. Then, we will define and detail the metrics that classify DCs as suitable for data cleaning. With these metrics in place, we should be able to devise rules for knowing when a DC is no longer valid. Such metrics will also help to evaluate the overall system accuracy. Finally, we plan to compare the accuracy and scalability of our system against traditional cleaning solutions.
5. REFERENCES
[1] Z. Abedjan, L. Golab, and F. Naumann. Profiling relational data: A survey. The VLDB Journal, 24(4):557–581, Aug. 2015.
[2] L. Bertossi. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers, 2011.
[3] S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst., 32(2), June 2007.
[4] S. Chaudhuri and V. Narasayya. Self-tuning database systems: A decade of progress. In VLDB, pages 3–14, 2007.
[5] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498–1509, Aug. 2013.
[6] W. W. Eckerson. Data quality and the bottom line: Achieving business success through a commitment to high quality data. Technical report, Data Warehousing Institute, 2002.
[7] W. Fan. Data quality: From theory to practice. SIGMOD Rec., 44(3):7–18, Dec. 2015.
[8] W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool Publishers, 2012.
[9] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst., 33(2):6:1–6:48, June 2008.
[10] P. Holanda and E. C. de Almeida. SPST-Index: A self-pruning splay tree index for caching database cracking. In EDBT, pages 458–461, 2017.
[11] I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281–393, 2015.
[12] J. Liu, J. Li, C. Liu, and Y. Chen. Discover dependencies from data - a review. IEEE TKDE, 24(2):251–264, Feb. 2012.
[13] S. Navathe, K. Karlapalem, and M. Ra. A mixed partitioning methodology for distributed database design. Journal of Computer and Software Engineering, 3(4), 1995.
[14] T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schönberg, J. Zwiener, and F. Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. PVLDB, 8(10):1082–1093, June 2015.
[15] E. H. M. Pena and E. C. de Almeida. Uso de instâncias de dados e carga de trabalho para mineração de restrições de integridade. In SBBD, pages 312–317, 2017.
[16] E. H. M. Pena and E. C. de Almeida. BFASTDC: A bitwise algorithm for mining denial constraints. In DEXA, 2018, to appear.
[17] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, Aug. 2017.
[18] S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J.-A. Quiane-Ruiz, and N. Tang. UGuide: User-guided discovery of FD-detectable errors. In SIGMOD, pages 1385–1397, 2017.
[19] M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In ICDE, pages 244–255, March 2014.
[20] M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, and D. Srivastava. On multi-column foreign key discovery. PVLDB, 3(1-2), Sept. 2010.
[21] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 1st edition, 2012.