=Paper=
{{Paper
|id=Vol-2175/paper04
|storemode=property
|title=Workload-Aware Discovery of Integrity Constraints for Data Cleaning
|pdfUrl=https://ceur-ws.org/Vol-2175/paper04.pdf
|volume=Vol-2175
|authors=Eduardo Pena
|dblpUrl=https://dblp.org/rec/conf/vldb/Pena18
}}
==Workload-Aware Discovery of Integrity Constraints for Data Cleaning==
Eduardo H. M. Pena, supervised by Eduardo Cunha de Almeida
Federal University of Paraná
ehmpena@inf.ufpr.br

Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018, Rio de Janeiro, Brazil. Copyright (c) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Violations of integrity constraints (ICs) usually indicate that data are not consistent with a valid representation of real-world entities, and that the data are therefore dirty. Recent studies present facilities to make dirty databases consistent with a given set of ICs, yet their evaluation scenarios assume humans in the loop to feed ICs to these facilities. Manually designing ICs is burdensome, and it is doomed to fail if we consider the dynamic changes of data and applications. An attractive alternative is the automatic discovery of ICs, but the discovery results are mostly accidental. Our research proposal sticks to the automatic discovery approach, but leverages information found in the application workload to mine semantically valuable ICs. This paper overviews our research on discovering ICs that are particularly suitable for data cleaning, i.e., ICs that are resilient to data and application evolution.

1. INTRODUCTION

Data profiling is a complex task that helps to uncover relevant metadata for datasets. Typical examples of metadata include basic statistics (e.g., value distributions), patterns of data values, and integrity constraints (ICs). This kind of information provides valuable insights on large datasets by serving various use cases, such as data exploration, query optimization, and data cleaning [1]. Our research program is built upon data profiling for data cleaning.

Dirty data remain ubiquitous in enterprises. Industry and academia constantly report the severe impact that low data quality has on businesses [6, 11]. The key question, what are dirty data?, has been extensively studied [8], with many studies revisiting ICs to enforce business rules over data [7, 8, 11]. In this context, a violation of ICs usually indicates that data are not consistent with a valid representation of real-world entities, and therefore the data are dirty. Many approaches follow the idea of detecting and repairing IC violations for data cleaning. The development of such research has brought many contributions: reasoning about various ICs other than functional dependencies (FDs) [9]; practical techniques for IC violation detection [11]; and methods to repair database errors by enforcing ICs [17]; to name but a few.

It is crucial that there be effective methods to discover ICs from data. Manually designing ICs is a cost-prohibitive and, worse yet, error-prone task. To do so, domain experts would need to keep up with the ever-evolving semantics of data and applications. Moreover, the set of IC candidates is usually too large for human validation, even for small datasets. Indeed, many algorithms for the automatic discovery of ICs have been developed [14, 5]. In theory, data cleaning pipelines could use IC discovery algorithms to mine ICs from sample data, and then use those ICs as data quality rules to detect and repair errors in the database. However, recent evaluations of IC-based cleaning tools have only used manually defined ICs. Therefore, we believe there are still challenges to overcome before a fully automated IC-based data cleaning pipeline can operate.
Discovering ICs from data might return unreliable result sets. If the discovery is carried out on dirty data, the discovered ICs are likely dirty themselves. Even if the discovery is carried out on cleaned samples of data, which are hard and expensive to get, most of the discovered ICs are likely accidental. The number of discovered ICs increases radically as the number of attributes in the dataset goes up. For example, datasets with dozens of attributes and a few thousand records produce result sets with thousands of functional dependencies [14]. Although some classes of IC allow for implication analysis to discard redundant ICs (e.g., canonical covers of functional dependencies), the number of discovered ICs that show no semantic meaning remains huge. Furthermore, the effectiveness of current methods for data repairing is highly influenced by the number of ICs the methods consider.

Our goal is mining ICs for data cleaning. If these ICs are compliant with the most current semantics of data and application, they should help to repair the database and spot erroneous data in future loads. We propose an initial approach to discovering ICs that are relevant for data cleaning. We stick to the automatic IC discovery approach, but to support environments in which both data and application evolve, we focus on ICs matching the current queries in the pipeline. The intuition here is to use the query workload of the application to identify data spots and structural fragments (e.g., sets of attributes and predicates) that are important to users. These assets generate a workload characterization that is semantically relevant for the current application. Our approach uses this characterization to guide the IC discovery process. From the workload characterization, our approach estimates a set of features for ICs. Our vision is a classifier that leverages these features to classify a set of ICs based on their applicability for data cleaning. In this thesis, we aim to answer the following research questions:

• How can database workloads benefit IC discovery?
• How can automatic IC discovery serve data cleaning?

2. RELATED WORK

Most of the recent approaches to IC-based data cleaning manually define the sets of ICs used to evaluate their solutions [19, 17]. Many of these approaches assume that ICs and data do not change very often, and use a fixed set of ICs to repair the database through modifications and deletions of inconsistent tuples. A few approaches consider evolving ICs. In this case, the database is repaired after modifications to both records and ICs [19]. More recently, the authors of [18] leverage automatic IC discovery for detecting data errors. However, they use the discovered ICs (they focus on FDs) only to generate questions that guide an interaction with experts. The goal of this interaction is to discover as many FD-related errors in the data as possible on a budget of questions.

Most works on IC discovery have focused on attribute dependencies. Liu et al. [12] present a comprehensive review of the topic. Papenbrock et al. [14] have looked into implementation details, experimental evaluation, and comparison of various FD discovery algorithms. The studies on IC discovery usually focus on the algorithmic issues of the process, leaving the applicability of the discovered results aside. To handle the large number of discovered ICs, some approaches score ICs on instance-driven metrics for user validation [12, 5]. Unfortunately, these scoring schemes do not consider the dynamic changes in data and applications.
3. MINING RELEVANT ICs

This section describes the main steps we are taking towards answering our research questions.

3.1 Denial Constraints

In the context of this thesis, a database is clean if it is consistent with a given set of ICs. Some types of ICs can express semantic rules that other ICs cannot, or vice versa. Denial constraints (DCs) [5] are known to be a response to this expressiveness issue because they generalize important types of ICs, such as functional dependencies (FDs), conditional FDs, and check constraints. Recent works have introduced methods that automatically repair violations of DCs in the database [17]; hence, our proposal is also based on DCs.

DCs define sets of predicates that databases must satisfy to prevent attributes from taking combinations of values considered semantically inconsistent. Consider a relational database schema R and a set of operators W: {=, ≠, <, ≤, >, ≥}. A DC [2, 5] has the form ϕ: ∀tx, ty, ... ∈ r, ¬(P1 ∧ ... ∧ Pm), where tx, ty, ... are tuples of an instance r of relation R, and R ∈ R. A predicate Pi is a comparison atom of either the form v1 wo v2 or v1 wo c, where v1, v2 are variables tid.Aj, Aj ∈ R, id ∈ {x, y, ...}, c is a constant from the domain of Aj, and wo ∈ W. For example, consider the schema of relation Employees(Name, Manager, Salary, Bonus), Employees ∈ R. A DC ϕ can express the constraint "if two employees are managed by the same person, the one earning a higher salary has a higher bonus" as follows: ¬(tx.Manager = ty.Manager ∧ tx.Salary > ty.Salary ∧ tx.Bonus < ty.Bonus). An instance of relation r satisfies a DC ϕ if at least one predicate of ϕ is false for every pair of tuples of r, i.e., the predicates of ϕ cannot all be true at the same time.
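To make these semantics concrete, here is a minimal Python sketch that evaluates the example DC over a tiny Employees instance by testing its predicates on every ordered tuple pair. The relation values, the predicate encoding, and the violations helper are our illustration, not an API from the paper.

```python
from itertools import permutations

# Illustrative tuples for Employees(Name, Manager, Salary, Bonus).
employees = [
    {"Name": "Ann", "Manager": "Joe", "Salary": 90, "Bonus": 10},
    {"Name": "Bob", "Manager": "Joe", "Salary": 80, "Bonus": 15},
]

# Predicates of ϕ: the DC is violated by a pair for which ALL of them hold.
predicates = [
    lambda tx, ty: tx["Manager"] == ty["Manager"],
    lambda tx, ty: tx["Salary"] > ty["Salary"],
    lambda tx, ty: tx["Bonus"] < ty["Bonus"],
]

def violations(r, preds):
    """Return the ordered tuple pairs witnessing a violation of
    ϕ: ¬(P1 ∧ ... ∧ Pm), i.e., pairs for which every predicate is true."""
    return [(tx, ty) for tx, ty in permutations(r, 2)
            if all(p(tx, ty) for p in preds)]

# Ann earns more than Bob under the same manager but gets a lower bonus.
print(violations(employees, predicates))
```

An empty result means the instance satisfies ϕ; here the pair (Ann, Bob) is reported, matching the informal reading of the constraint.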
3.2 Proposed Approach

We envision an automated data cleaning system. Figure 1 shows an overview of the expected system architecture. The system regularly collects query logs from the database server to build workloads W. The characterization of a workload W takes into account query structures and the tuples that are selected to produce the query results. The outputs of the workload characterization are data samples and attribute clusters.

[Figure 1: Proposed Data Cleaning Pipeline. Applications issue a query workload W (e.g., SELECT A, B FROM R WHERE D < 10) against the database. The workload characterization builds samples and clusters attributes; the DC profiler derives a predicate space P′ = {P1, P2, ...}, builds evidence sets Es = {Qtµ,tν | ∀(tµ, tν) ∈ s}, and discovers candidate DCs ϕ: ¬(P1 ∧ ...), optionally guided by user-defined DC templates; a DC classifier selects the reliable DCs, which feed data repairing.]

The DC profiler takes as input data samples, attribute clusters, and, optionally, user-defined DC templates. The profiler sets the predicate space (from which DCs are mined) according to the attribute clusters. Relying on this predicate space, our DC discovery algorithm discovers approximate DCs from the samples. The next step is to discard the superfluous DCs and select the relevant ones. It is important to discard superfluous DCs so that data repairing only has to consider DCs that capture the current semantics of the data and applications. In [15], we studied the use of query workloads for the automatic selection of FDs. We developed ranking strategies to select FDs that match the current workload. The results have shown that it is possible to select FDs that are semantically close to the query workload, and with statistical distributions reasonably similar to those of the general FDs produced by exhaustive FD discovery. However, selecting DCs is more complex and requires further investigation. We have been evaluating decision tree ensembles for this task. The general idea is to estimate different features for DCs, and then use a classifier to decide whether or not DCs are reliable. Feature extraction is based on structural properties of DCs (e.g., how many predicates they have) and on metrics that correlate DCs with the database and query workload. The DCs that are classified as reliable are expected to hold in the database as long as the database manipulations produce semantically clean data. Our implementation of the DC classifier is still under way.

The last task is finding and repairing the tuples that violate at least one constraint from the DC profiling results. We have been using HoloClean [17] for this task. Our current setup only uses the DC-based error detection and error repairing modules of the tool. However, we believe that our system could be adjusted to produce different signals to feed the probabilistic graphical model of HoloClean, for example, by deriving features for the graphical model from attribute clusters and samples. We postpone this kind of investigation because we are currently focusing on the data profiling aspects of the system.

3.3 Discovering DCs

We first worked on the algorithmic issues of discovering DCs from data. At the beginning of our research program, the only known algorithm for DC discovery was FASTDC [5], for which there was no publicly available implementation. We have implemented FASTDC from scratch, and our results agree with those of the original publication. Next, we briefly describe FASTDC and the overall improvements we have developed for the algorithm.

The first step in discovering DCs is to set the predicate space P from which DCs are derived. Experts may define predicates for attributes based on the database structure, or use specialized tools for the task, e.g., [20]. The satisfied predicate set Qtµ,tν of an arbitrary pair of tuples (tµ, tν) ∈ r is a subset Q ⊂ P such that for every P ∈ Q, P(tµ, tν) is true. The set of satisfied predicate sets of r is the evidence set Er = {Qtµ,tν | ∀(tµ, tν) ∈ r}. Different tuple pairs may return the same predicate set; hence, each Q ∈ Er is associated with an occurrence counter. A cover for Er is a set of predicates that intersects with every satisfied predicate set of Er, and it is minimal if none of its subsets equally intersects with Er. The authors of FASTDC demonstrate that the minimal covers of Er represent the predicates of minimal DCs [5]. FASTDC uses a depth-first search (DFS) strategy to find minimal covers for Er.
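The sketch below restates the evidence-set and cover definitions in executable form, assuming predicates are given as named Boolean functions over tuple pairs. FASTDC's actual depth-first cover search is more elaborate; the brute-force minimality check here is only illustrative.

```python
from collections import Counter
from itertools import permutations

def evidence_set(r, predicate_space):
    """Build Er: for every ordered tuple pair, record the set of satisfied
    predicates; identical sets share one occurrence counter."""
    ev = Counter()
    for tx, ty in permutations(r, 2):
        ev[frozenset(n for n, p in predicate_space.items() if p(tx, ty))] += 1
    return ev

def is_minimal_cover(cand, ev):
    """cand covers Er if it intersects every satisfied predicate set; it is
    minimal if no proper subset also does (minimal covers = minimal DCs [5])."""
    covers = lambda s: all(s & q for q in ev)
    return covers(cand) and not any(covers(cand - {p}) for p in cand)

# Toy instance and predicate space (illustrative names, not FASTDC's API).
# The space is closed under complement, so no pair satisfies an empty set.
r = [{"A": 1, "B": 2}, {"A": 1, "B": 3}, {"A": 2, "B": 2}]
P = {"A=": lambda tx, ty: tx["A"] == ty["A"],
     "A≠": lambda tx, ty: tx["A"] != ty["A"],
     "B<": lambda tx, ty: tx["B"] < ty["B"],
     "B≥": lambda tx, ty: tx["B"] >= ty["B"]}

ev = evidence_set(r, P)
print(ev)                                  # satisfied predicate sets + counters
print(is_minimal_cover({"B<", "B≥"}, ev))  # True: a (trivial) minimal cover
```

On this toy data the discovered cover is trivial; real predicate spaces over meaningful attributes yield non-trivial minimal DCs.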
FASTDC builds the evidence set by evaluating every predicate of the predicate space P on every pair of tuples of the relation instance r. Our improved version of the algorithm, BFASTDC [16], is a bitwise version of FASTDC that exploits bit-level operations to avoid unnecessary tuple comparisons. BFASTDC builds associations between attribute values and lists of tuple identifiers, so that different combinations of these associations indicate which tuple pairs satisfy predicates. To frame evidence sets, BFASTDC operates over auxiliary bit structures that store predicate satisfaction data. This allows our algorithm to use simple logical operations (e.g., conjunctions and disjunctions) to imply the satisfaction of the remaining predicates. In addition, BFASTDC can use two modifications described in [5] to discover approximate and constant DCs. These DC variants let the discovery process work with data containing errors (e.g., integrated data from multiple sources).

In our experiments, BFASTDC produced considerable improvements in DC discovery performance (up to 24-fold compared to FASTDC). Unfortunately, the number of discovered DCs was unmanageably large, even after selecting only the DCs with high scores for the interestingness dimension described by the authors of FASTDC [5]. Discovering DCs entails large search spaces: running times for DC discovery increase when more tuples are considered, and increase dramatically when the discovery is set for large predicate spaces. Instead of running BFASTDC over the entire dataset and large predicate spaces, our system focuses the discovery efforts on areas that are relevant for the application workload.

3.4 Workload Characterization

We hypothesize that the information from the application workload (e.g., selection filters in SQL statements) is a powerful asset to narrow down the large number of discovered DCs. Query workloads present strong access patterns which, either at the horizontal level (individual tuples) or at the vertical level (individual attributes), point to specific database regions that are accessed more frequently than others.

Consider a set of user queries Q = {q1, ..., qm} that is expected to run on the database. For simplicity, we assume there is only one relation in the database. For each query qi, our system associates the attributes in the operators of qi (e.g., projection and selection) with the attributes of the relation R(A1, ..., An) to define an attribute occurrence matrix (AOM) O as follows: oij = 1 if qi uses attribute Aj, and oij = 0 otherwise. Each entry oij thus indicates whether attribute Aj of relation R is required to answer query qi.

The AOM model is straightforward, but it can be enhanced to support dynamic workloads in which the incoming queries are not necessarily identical to each other. We use the approach described in [3], which models workloads as sets of queries paired with weights, W = {⟨q1, w1⟩, ..., ⟨qn, wn⟩}. Each weight wi is associated with a query qi and measures the significance of qi in the workload. Thus, AOMs are weighted according to the importance of their queries. The AOMs for each application are straightforward to model if the domain experts are aware of the applications that will run on the database. Furthermore, modern DBMSs commonly provide tools to help with this kind of workload modeling [4].
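A small sketch of how a weighted AOM could be assembled once queries have been parsed into the attribute sets they touch. The function name, the numpy representation, and the final affinity computation are our assumptions; the affinity step anticipates one plausible reading of the attribute affinity matrix discussed below, not the paper's code.

```python
import numpy as np

def attribute_occurrence_matrix(query_attrs, attributes, weights=None):
    """AOM O: o_ij = 1 if query q_i uses attribute A_j, else 0.
    With weights, row i is scaled by w_i, giving the weighted AOM of
    Section 3.4. `query_attrs` holds one set of attribute names per query."""
    col = {a: j for j, a in enumerate(attributes)}
    O = np.zeros((len(query_attrs), len(attributes)))
    for i, used in enumerate(query_attrs):
        for a in used:
            O[i, col[a]] = 1.0
    if weights is not None:
        O *= np.asarray(weights, dtype=float)[:, None]
    return O

# Two queries over R(A, B, C, D); the second query is twice as significant.
# E.g., SELECT A, B FROM R WHERE D < 10 touches {A, B, D}.
w = [1.0, 2.0]
O = attribute_occurrence_matrix([{"A", "B", "D"}, {"C", "D"}],
                                ["A", "B", "C", "D"], weights=w)
print(O)

# One plausible affinity reading, based on [13]: entry (j, k) sums the
# weights of all queries that access both A_j and A_k (an assumption).
B = (O > 0).astype(float)
print(B.T @ (np.asarray(w)[:, None] * B))
```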
The authors of [3] describe a probabilistic method to measure the similarity between incoming queries and the workload W. Our system uses this method to identify substantial changes in the workload, so that it knows when to start profiling DCs again. For each distinct workload, our system estimates the data samples and predicate spaces over which DC discovery runs.

The data samples are expected to represent the data that is most accessed by the applications and, therefore, to spot the data most likely to produce errors. We have been testing three strategies to build samples from queries. The baseline (naive) strategy builds a single sample with every tuple that is used to answer the queries in the workload. The second strategy builds a sample for each fundamental region [3]: the regions partition the tuples of the relation instance such that each region intersects with the tuples selected by a maximum number of queries. The third strategy is based on a self-pruning splay tree (SPST) index [10]. The SPST keeps the most accessed tuple references near the root of the tree. Our system uses the strategy of [10] to prune the tree and build samples from the remaining partitions.

Predicate spaces come from AOMs, and restrict the search space of DC discovery to predicates with a high affinity to each other. Our system uses the frequency of attributes in the queries to transform AOMs into attribute affinity matrices (AAMs), in which each entry represents the sum of the access frequencies of all queries for that attribute. Further, it uses the graph-based algorithm described in [13] to build a spanning tree from which attribute clusters are derived. These clusters form the predicate spaces as follows. The system defines single-tuple and two-tuple predicates on categorical attributes using the operators {=, ≠}, and on numerical attributes using the operators {=, ≠, <, >, ≤, ≥}. It defines predicates involving two different attributes provided that the values of the two attributes are of the same order of magnitude.

BFASTDC runs for every pair of predicate space and data sample. The output of this phase are sets of approximate DCs (i.e., DCs approximately satisfied by the whole dataset). In preliminary experiments, we noticed that many of these DCs were, in fact, exact, or at least had structures similar to many DCs found by exhaustive discovery.

3.5 Features for DCs

Among the discovered DCs, how do we choose those that are relevant for data cleaning? To answer that question, we are building a classifier that simulates the experts' judgment. The main challenge is to form descriptions of "good" DCs. We have included the workload characterization in the discovery process so that our system can estimate features regarding a DC's application adherence, and not only features regarding data support. So far, we have considered the following features for DCs: the statistical significance of DCs for the input dataset, as in [5]; the number of predicates [5]; the similarity between pairs of DCs, an adaptation of the method described in [15]; the similarity between DCs and AAMs, also an adaptation of [15]; and the number of DC-related errors in the samples and in the entire dataset. We have been developing ensembles [21] to combine different decision trees, which are trained on the above features, as sketched below. For now, the training sets for the decision trees are user-defined DC templates. One could also use the set of DCs currently defined in the database, or manually define acceptable values for the features of the DCs.
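As a sketch of the intended classifier, the snippet below feeds hand-made feature vectors for two DCs into a random forest. The paper commits only to decision-tree ensembles [21] and to the feature list above; the choice of scikit-learn's RandomForestClassifier, the feature ordering, and all numbers are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per discovered DC; columns follow the features listed above:
# statistical significance, predicate count, similarity to other DCs,
# similarity to the AAM, and DC-related error counts (sample, full data).
FEATURES = ["significance", "n_predicates", "dc_similarity",
            "aam_similarity", "errors_sample", "errors_full"]

# Synthetic training rows standing in for user-defined DC templates.
X_train = np.array([
    [0.97, 3, 0.10, 0.80, 0, 2],     # e.g., a DC matching a template
    [0.55, 7, 0.85, 0.05, 40, 900],  # e.g., an accidental DC
])
y_train = np.array([1, 0])           # 1 = reliable, 0 = discard

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Classify a newly discovered DC: keep it for data repairing?
x_new = np.array([[0.90, 4, 0.20, 0.70, 1, 5]])
print(clf.predict(x_new))
```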
4. RESEARCH PLAN

This research program is currently in its second year. The research challenges we will have to face to answer our research questions are as follows. First, we will establish an evaluation scenario in which data, DCs, and applications are all evolving. So far, we have been running preliminary experiments on datasets that have already been used to evaluate DC discovery [5]. We plan to adapt the evaluation protocols of [19] and [3] to our workload-aware scenario. Then, we will define and detail the metrics that classify DCs as suitable for data cleaning. With these metrics in place, we should be able to devise rules to know when a DC is no longer valid. Such metrics will also help to evaluate the overall system accuracy. Finally, we plan to compare the accuracy and scalability of our system against traditional cleaning solutions.

5. REFERENCES

[1] Z. Abedjan, L. Golab, and F. Naumann. Profiling relational data: A survey. The VLDB Journal, 24(4):557–581, Aug. 2015.
[2] L. Bertossi. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers, 2011.
[3] S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst., 32(2), June 2007.
[4] S. Chaudhuri and V. Narasayya. Self-tuning database systems: A decade of progress. In VLDB, pages 3–14, 2007.
[5] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498–1509, Aug. 2013.
[6] W. W. Eckerson. Data quality and the bottom line: Achieving business success through a commitment to high quality data. Technical report, Data Warehousing Institute, 2002.
[7] W. Fan. Data quality: From theory to practice. SIGMOD Rec., 44(3):7–18, Dec. 2015.
[8] W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool Publishers, 2012.
[9] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst., 33(2):6:1–6:48, June 2008.
[10] P. Holanda and E. C. de Almeida. SPST-Index: A self-pruning splay tree index for caching database cracking. In EDBT, pages 458–461, 2017.
[11] I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281–393, 2015.
[12] J. Liu, J. Li, C. Liu, and Y. Chen. Discover dependencies from data - a review. IEEE TKDE, 24(2):251–264, Feb. 2012.
[13] S. Navathe, K. Karlapalem, and M. Ra. A mixed partitioning methodology for distributed database design. Journal of Computer and Software Engineering, 3(4), 1995.
[14] T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schönberg, J. Zwiener, and F. Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. PVLDB, 8(10):1082–1093, June 2015.
[15] E. H. M. Pena and E. C. de Almeida. Uso de instâncias de dados e carga de trabalho para mineração de restrições de integridade (in Portuguese). In SBBD, pages 312–317, 2017.
[16] E. H. M. Pena and E. C. de Almeida. BFASTDC: A bitwise algorithm for mining denial constraints. In DEXA, 2018. To appear.
[17] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, Aug. 2017.
[18] S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J.-A. Quiane-Ruiz, and N. Tang. UGuide: User-guided discovery of FD-detectable errors. In SIGMOD, pages 1385–1397, 2017.
[19] M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In ICDE, pages 244–255, March 2014.
[20] M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, and D. Srivastava. On multi-column foreign key discovery. PVLDB, 3(1-2), Sept. 2010.
[21] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 1st edition, 2012.