=Paper=
{{Paper
|id=Vol-2175/paper04
|storemode=property
|title=Workload-Aware Discovery of Integrity Constraints for Data Cleaning
|pdfUrl=https://ceur-ws.org/Vol-2175/paper04.pdf
|volume=Vol-2175
|authors=Eduardo Pena
|dblpUrl=https://dblp.org/rec/conf/vldb/Pena18
}}
==Workload-Aware Discovery of Integrity Constraints for Data Cleaning==
Eduardo H. M. Pena, supervised by Eduardo Cunha de Almeida
Federal University of Paraná
ehmpena@inf.ufpr.br

Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018, Rio de Janeiro, Brazil. Copyright (c) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Violations of integrity constraints (ICs) usually indicate that data are not consistent with a valid representation of real-world entities, and that the data are therefore dirty. Recent studies present facilities to make dirty databases consistent with a given set of ICs, yet their evaluation scenarios assume humans in the loop to feed ICs to these facilities. Manually designing ICs is burdensome, and it is doomed to fail if we consider the dynamic changes of data and applications. An attractive alternative is the automatic discovery of ICs, but the discovery results are mostly accidental. Our research proposal sticks to the automatic discovery approach, but leverages information found in the application workload to mine semantically valuable ICs. This paper overviews our research on discovering ICs that are particularly suitable for data cleaning, i.e., ICs that are resilient to data and application evolution.

1. INTRODUCTION

Data profiling is a complex task that helps to uncover relevant metadata for datasets. Typical examples of metadata include basic statistics (e.g., value distributions), patterns of data values, and integrity constraints (ICs). This kind of information provides valuable insights on large datasets by serving various use cases, such as data exploration, query optimization, and data cleaning [1]. Our research program is built upon data profiling for data cleaning.

Dirty data remain ubiquitous in enterprises. Industry and academia constantly report the severe impact that low data quality has on businesses [6, 11]. The key question, what are dirty data?, has been extensively studied [8], with many studies revisiting ICs to enforce business rules over data [7, 8, 11]. In this context, a violation of ICs usually indicates that data are not consistent with a valid representation of real-world entities, and therefore the data are dirty. Many approaches follow the idea of detecting and repairing IC violations for data cleaning. The development of such research has brought many contributions: reasoning about various ICs other than functional dependencies (FDs) [9]; practical techniques for IC violation detection [11]; and methods to repair database errors by enforcing ICs [17]; to name but a few.

It is crucial that there be effective methods to discover ICs from data. Manually designing ICs is a cost-prohibitive and, worse yet, error-prone task. To do so, domain experts would need to keep up with the ever-evolving semantics of data and applications. Moreover, the set of IC candidates is usually too large for human validation, even for small datasets. Indeed, many algorithms for the automatic discovery of ICs have been developed [14, 5]. In theory, data cleaning pipelines could use IC discovery algorithms to mine ICs from sample data, and then use those ICs as data quality rules to detect and repair errors in the database. However, recent evaluations of IC-based cleaning tools have only used manually defined ICs. Therefore, we believe there are still challenges to overcome before a fully automated IC-based data cleaning pipeline can operate.
Discovering ICs from data might return unreliable result sets. If the discovery is carried out on dirty data, the discovered ICs are likely dirty themselves. Even if the discovery is carried out on cleaned samples of data, which are hard and expensive to get, most of the discovered ICs are likely accidental. The number of discovered ICs increases radically as the number of attributes in the dataset goes up. For example, datasets with dozens of attributes and a few thousand records produce result sets with thousands of functional dependencies [14]. Although some classes of IC allow for implication analysis to discard redundant ICs (e.g., canonical covers of functional dependencies), the number of discovered ICs that show no semantic meaning remains huge. Furthermore, the effectiveness of current methods for data repairing is highly influenced by the number of ICs the methods consider.

Our goal is mining ICs for data cleaning. If these ICs are compliant with the most current semantics of data and application, they should help to repair the database and spot erroneous data in future loads. We propose an initial approach to discovering ICs that are relevant for data cleaning. We stick to the automatic IC discovery approach, but to support environments in which both data and application evolve, we focus on ICs matching the current queries in the pipeline. The intuition here is to use the query workload of the application to identify data spots and structural fragments (e.g., sets of attributes and predicates) that are important to users. These assets generate a workload characterization that is semantically relevant for the current application. Our approach uses this characterization to guide the IC discovery process. From the workload characterization, our approach estimates a set of features for ICs. Our vision is a classifier that leverages these features to classify a set of ICs based on their applicability for data cleaning. In this thesis, we aim to answer the following research questions:

• How can database workloads benefit IC discovery?
• How can automatic IC discovery serve data cleaning?

2. RELATED WORK

Most of the recent approaches to IC-based data cleaning manually define the sets of ICs used to evaluate their solutions [19, 17]. Many of these approaches assume that ICs and data do not change very often, and use a fixed set of ICs to repair the database through modifications and deletions of inconsistent tuples. A few approaches consider evolving ICs. In this case, the database is repaired after modifications to both records and ICs [19]. More recently, the authors of [18] leverage automatic IC discovery for detecting data errors. However, they use the discovered ICs (they focus on FDs) only to generate questions that guide an interaction with experts. The goal of this interaction is to discover as many FD-related errors in the data as possible on a budget of questions.

Most works on IC discovery have focused on attribute dependencies. Liu et al. [12] present a comprehensive review of the topic. Papenbrock et al. [14] have looked into implementation details, experimental evaluation, and comparison of various FD discovery algorithms. The studies on IC discovery usually focus on the algorithmic issues of the process, leaving the applicability of the discovered results aside. To handle the large number of discovered ICs, some approaches score ICs on instance-driven metrics for user validation [12, 5]. Unfortunately, these scoring schemes do not consider the dynamic changes in data and applications.
3. MINING RELEVANT ICs

This section describes the main steps we are taking towards answering our research questions.

3.1 Denial Constraints

In the context of this thesis, a database is clean if it is consistent with a given set of ICs. Some types of ICs can express semantic rules that other ICs cannot, or vice versa. Denial constraints (DCs) [5] are known to be a response to this expressiveness issue because they generalize important types of ICs, such as functional dependencies (FDs), conditional FDs, and check constraints. Recent works have introduced methods that automatically repair violations of DCs in the database [17]; hence, our proposal is also based on DCs.

DCs define sets of predicates that databases must satisfy to prevent attributes from taking combinations of values considered semantically inconsistent. Consider a relational database schema R and a set of operators W: {=, ≠, <, ≤, >, ≥}. A DC [2, 5] has the form ϕ: ∀tx, ty, ... ∈ r, ¬(P1 ∧ ... ∧ Pm), where tx, ty, ... are tuples of an instance r of relation R, and R ∈ R. A predicate Pi is a comparison atom of either the form v1 wo v2 or v1 wo c, where v1, v2 are variables tid.Aj, Aj ∈ R, id ∈ {x, y, ...}, c is a constant from the domain of Aj, and wo ∈ W. For example, consider the schema of relation Employees(Name, Manager, Salary, Bonus), Employees ∈ R. A DC ϕ can express the constraint "if two employees are managed by the same person, the one earning a higher salary has a higher bonus" as follows: ¬(tx.Manager = ty.Manager ∧ tx.Salary > ty.Salary ∧ tx.Bonus < ty.Bonus). An instance of relation r satisfies a DC ϕ if at least one predicate of ϕ is false for every pair of tuples of r, i.e., the predicates of ϕ cannot all be true at the same time.
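To make these semantics concrete, here is a minimal Python sketch that evaluates the example DC over a tiny Employees instance by testing its predicates on every ordered tuple pair. The relation values, the predicate encoding, and the violations helper are our illustration, not an API from the paper.

```python
from itertools import permutations

# Illustrative tuples for Employees(Name, Manager, Salary, Bonus).
employees = [
    {"Name": "Ann", "Manager": "Joe", "Salary": 90, "Bonus": 10},
    {"Name": "Bob", "Manager": "Joe", "Salary": 80, "Bonus": 15},
]

# Predicates of ϕ: the DC is violated by a pair for which ALL of them hold.
predicates = [
    lambda tx, ty: tx["Manager"] == ty["Manager"],
    lambda tx, ty: tx["Salary"] > ty["Salary"],
    lambda tx, ty: tx["Bonus"] < ty["Bonus"],
]

def violations(r, preds):
    """Return the ordered tuple pairs witnessing a violation of
    ϕ: ¬(P1 ∧ ... ∧ Pm), i.e., pairs for which every predicate is true."""
    return [(tx, ty) for tx, ty in permutations(r, 2)
            if all(p(tx, ty) for p in preds)]

# Ann earns more than Bob under the same manager but gets a lower bonus.
print(violations(employees, predicates))
```

An empty result means the instance satisfies ϕ; here the pair (Ann, Bob) is reported, matching the informal reading of the constraint.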
3.2 Proposed Approach

We envision an automated data cleaning system. Figure 1 shows an overview of the expected system architecture. The system regularly collects query logs from the database server to build workloads W. The characterization of a workload W takes into account query structures and the tuples that are selected to produce the query results. The outputs of the workload characterization are data samples and attribute clusters.

[Figure 1: Proposed Data Cleaning Pipeline. Applications issue a query workload W (e.g., SELECT A, B FROM R WHERE D < 10) against the database. The workload characterization builds samples and clusters attributes; the DC profiler derives a predicate space P′ = {P1, P2, ...}, builds evidence sets Es = {Qtµ,tν | ∀(tµ, tν) ∈ s}, and discovers candidate DCs ϕ: ¬(P1 ∧ ...), optionally guided by user-defined DC templates; a DC classifier selects the reliable DCs, which feed data repairing.]

The DC profiler takes as input data samples, attribute clusters, and, optionally, user-defined DC templates. The profiler sets the predicate space (from which DCs are mined) according to the attribute clusters. Relying on this predicate space, our DC discovery algorithm discovers approximate DCs from the samples. The next step is to discard the superfluous DCs and select the relevant ones. It is important to discard superfluous DCs so that data repairing only has to consider DCs that capture the current semantics of the data and applications. In [15], we studied the use of query workloads for the automatic selection of FDs. We developed ranking strategies to select FDs that match the current workload. The results have shown that it is possible to select FDs that are semantically close to the query workload, and with statistical distributions reasonably similar to those of the general FDs produced by exhaustive FD discovery. However, selecting DCs is more complex and requires further investigation. We have been evaluating decision tree ensembles for this task. The general idea is to estimate different features for DCs, and then use a classifier to decide whether or not DCs are reliable. Feature extraction is based on structural properties of DCs (e.g., how many predicates they have) and on metrics that correlate DCs with the database and query workload. The DCs that are classified as reliable are expected to hold in the database as long as the database manipulations produce semantically clean data. Our implementation of the DC classifier is still under way.

The last task is finding and repairing the tuples that violate at least one constraint from the DC profiling results. We have been using HoloClean [17] for this task. Our current setup only uses the DC-based error detection and error repairing modules of the tool. However, we believe that our system could be adjusted to produce different signals to feed the probabilistic graphical model of HoloClean, for example, by deriving features for the graphical model from attribute clusters and samples. We postpone this kind of investigation because we are currently focusing on the data profiling aspects of the system.

3.3 Discovering DCs

We first worked on the algorithmic issues of discovering DCs from data. At the beginning of our research program, the only known algorithm for DC discovery was FASTDC [5], for which there was no publicly available implementation. We have implemented FASTDC from scratch, and our results agree with those of the original publication. Next, we briefly describe FASTDC and the overall improvements we have developed for the algorithm.

The first step in discovering DCs is to set the predicate space P from which DCs are derived. Experts may define predicates for attributes based on the database structure, or use specialized tools for the task, e.g., [20]. The satisfied predicate set Qtµ,tν of an arbitrary pair of tuples (tµ, tν) ∈ r is a subset Q ⊂ P such that for every P ∈ Q, P(tµ, tν) is true. The set of satisfied predicate sets of r is the evidence set Er = {Qtµ,tν | ∀(tµ, tν) ∈ r}. Different tuple pairs may return the same predicate set; hence, each Q ∈ Er is associated with an occurrence counter. A cover for Er is a set of predicates that intersects with every satisfied predicate set of Er, and it is minimal if none of its subsets equally intersects with Er. The authors of FASTDC demonstrate that the minimal covers of Er represent the predicates of minimal DCs [5]. FASTDC uses a depth-first search (DFS) strategy to find minimal covers for Er.
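The sketch below restates the evidence-set and cover definitions in executable form, assuming predicates are given as named Boolean functions over tuple pairs. FASTDC's actual depth-first cover search is more elaborate; the brute-force minimality check here is only illustrative.

```python
from collections import Counter
from itertools import permutations

def evidence_set(r, predicate_space):
    """Build Er: for every ordered tuple pair, record the set of satisfied
    predicates; identical sets share one occurrence counter."""
    ev = Counter()
    for tx, ty in permutations(r, 2):
        ev[frozenset(n for n, p in predicate_space.items() if p(tx, ty))] += 1
    return ev

def is_minimal_cover(cand, ev):
    """cand covers Er if it intersects every satisfied predicate set; it is
    minimal if no proper subset also does (minimal covers = minimal DCs [5])."""
    covers = lambda s: all(s & q for q in ev)
    return covers(cand) and not any(covers(cand - {p}) for p in cand)

# Toy instance and predicate space (illustrative names, not FASTDC's API).
# The space is closed under complement, so no pair satisfies an empty set.
r = [{"A": 1, "B": 2}, {"A": 1, "B": 3}, {"A": 2, "B": 2}]
P = {"A=": lambda tx, ty: tx["A"] == ty["A"],
     "A≠": lambda tx, ty: tx["A"] != ty["A"],
     "B<": lambda tx, ty: tx["B"] < ty["B"],
     "B≥": lambda tx, ty: tx["B"] >= ty["B"]}

ev = evidence_set(r, P)
print(ev)                                  # satisfied predicate sets + counters
print(is_minimal_cover({"B<", "B≥"}, ev))  # True: a (trivial) minimal cover
```

On this toy data the discovered cover is trivial; real predicate spaces over meaningful attributes yield non-trivial minimal DCs.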
FASTDC builds the evidence set by evaluating every predicate of the predicate space P on every pair of tuples of the relation instance r. Our improved version of the algorithm, BFASTDC [16], is a bitwise version of FASTDC that exploits bit-level operations to avoid unnecessary tuple comparisons. BFASTDC builds associations between attribute values and lists of tuple identifiers, so that different combinations of these associations indicate which tuple pairs satisfy predicates. To frame evidence sets, BFASTDC operates over auxiliary bit structures that store predicate satisfaction data. This allows our algorithm to use simple logical operations (e.g., conjunctions and disjunctions) to imply the satisfaction of the remaining predicates. In addition, BFASTDC can use two modifications described in [5] to discover approximate and constant DCs. These DC variants let the discovery process work with data containing errors (e.g., integrated data from multiple sources).

In our experiments, BFASTDC produced considerable improvements in DC discovery performance (up to 24-fold compared to FASTDC). Unfortunately, the number of discovered DCs was unmanageably large, even after selecting only the DCs with high scores for the interestingness dimension described by the authors of FASTDC [5]. Discovering DCs entails large search spaces: running times for DC discovery increase when more tuples are considered, and increase dramatically when the discovery is set for large predicate spaces. Instead of running BFASTDC over the entire dataset and large predicate spaces, our system focuses the discovery efforts on areas that are relevant for the application workload.

3.4 Workload Characterization

We hypothesize that the information from the application workload (e.g., selection filters in SQL statements) is a powerful asset to narrow down the large number of discovered DCs. Query workloads present strong access patterns which, either at the horizontal level (individual tuples) or at the vertical level (individual attributes), point to specific database regions that are accessed more frequently than others.

Consider a set of user queries Q = {q1, ..., qm} that is expected to run on the database. For simplicity, we assume there is only one relation in the database. For each query qi, our system associates the attributes in the operators of qi (e.g., projection and selection) with the attributes of the relation R(A1, ..., An) to define an attribute occurrence matrix (AOM) O as follows: oij = 1 if qi uses attribute Aj, and oij = 0 otherwise. Each entry oij thus indicates whether attribute Aj of relation R is required to answer query qi.

The AOM model is straightforward, but it can be enhanced to support dynamic workloads in which the incoming queries are not necessarily identical to each other. We use the approach described in [3], which models workloads as sets of queries paired with weights, W = {⟨q1, w1⟩, ..., ⟨qn, wn⟩}. Each weight wi is associated with a query qi and measures the significance of qi in the workload. Thus, AOMs are weighted according to the importance of their queries. The AOMs for each application are straightforward to model if the domain experts are aware of the applications that will run on the database. Furthermore, modern DBMSs commonly provide tools to help with this kind of workload modeling [4].
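A small sketch of how a weighted AOM could be assembled once queries have been parsed into the attribute sets they touch. The function name, the numpy representation, and the final affinity computation are our assumptions; the affinity step anticipates one plausible reading of the attribute affinity matrix discussed below, not the paper's code.

```python
import numpy as np

def attribute_occurrence_matrix(query_attrs, attributes, weights=None):
    """AOM O: o_ij = 1 if query q_i uses attribute A_j, else 0.
    With weights, row i is scaled by w_i, giving the weighted AOM of
    Section 3.4. `query_attrs` holds one set of attribute names per query."""
    col = {a: j for j, a in enumerate(attributes)}
    O = np.zeros((len(query_attrs), len(attributes)))
    for i, used in enumerate(query_attrs):
        for a in used:
            O[i, col[a]] = 1.0
    if weights is not None:
        O *= np.asarray(weights, dtype=float)[:, None]
    return O

# Two queries over R(A, B, C, D); the second query is twice as significant.
# E.g., SELECT A, B FROM R WHERE D < 10 touches {A, B, D}.
w = [1.0, 2.0]
O = attribute_occurrence_matrix([{"A", "B", "D"}, {"C", "D"}],
                                ["A", "B", "C", "D"], weights=w)
print(O)

# One plausible affinity reading, based on [13]: entry (j, k) sums the
# weights of all queries that access both A_j and A_k (an assumption).
B = (O > 0).astype(float)
print(B.T @ (np.asarray(w)[:, None] * B))
```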
The authors of [3] describe a probabilistic method to measure the similarity between incoming queries and the workload W. Our system uses this method to identify substantial changes in the workload, so that it knows when to start profiling DCs again. For each distinct workload, our system estimates the data samples and predicate spaces over which DC discovery runs.

The data samples are expected to represent the data that is most accessed by the applications and, therefore, to spot the data most likely to produce errors. We have been testing three strategies to build samples from queries. The baseline (naive) strategy builds a single sample with every tuple that is used to answer the queries in the workload. The second strategy builds a sample for each fundamental region [3]: the regions partition the tuples of the relation instance such that each region intersects with the tuples selected by a maximum number of queries. The third strategy is based on a self-pruning splay tree (SPST) index [10]. The SPST keeps the most accessed tuple references near the root of the tree. Our system uses the strategy of [10] to prune the tree and build samples from the remaining partitions.

Predicate spaces come from AOMs, and restrict the search space of DC discovery to predicates with a high affinity to each other. Our system uses the frequency of attributes in the queries to transform AOMs into attribute affinity matrices (AAMs), in which each entry represents the sum of the access frequencies of all queries for that attribute. Further, it uses the graph-based algorithm described in [13] to build a spanning tree from which attribute clusters are derived. These clusters form the predicate spaces as follows. The system defines single-tuple and two-tuple predicates on categorical attributes using the operators {=, ≠}, and on numerical attributes using the operators {=, ≠, <, >, ≤, ≥}. It defines predicates involving two different attributes provided that the values of the two attributes are of the same order of magnitude.

BFASTDC runs for every pair of predicate space and data sample. The output of this phase are sets of approximate DCs (i.e., DCs approximately satisfied by the whole dataset). In preliminary experiments, we noticed that many of these DCs were, in fact, exact, or at least had structures similar to many DCs found by exhaustive discovery.

3.5 Features for DCs

Among the discovered DCs, how do we choose those that are relevant for data cleaning? To answer that question, we are building a classifier that simulates the experts' judgment. The main challenge is to form descriptions of "good" DCs. We have included the workload characterization in the discovery process so that our system can estimate features regarding a DC's application adherence, and not only features regarding data support. So far, we have considered the following features for DCs: the statistical significance of DCs for the input dataset, as in [5]; the number of predicates [5]; the similarity between pairs of DCs, an adaptation of the method described in [15]; the similarity between DCs and AAMs, also an adaptation of [15]; and the number of DC-related errors in the samples and in the entire dataset. We have been developing ensembles [21] to combine different decision trees, which are trained on the above features, as sketched below. For now, the training sets for the decision trees are user-defined DC templates. One could also use the set of DCs currently defined in the database, or manually define acceptable values for the features of the DCs.
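As a sketch of the intended classifier, the snippet below feeds hand-made feature vectors for two DCs into a random forest. The paper commits only to decision-tree ensembles [21] and to the feature list above; the choice of scikit-learn's RandomForestClassifier, the feature ordering, and all numbers are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per discovered DC; columns follow the features listed above:
# statistical significance, predicate count, similarity to other DCs,
# similarity to the AAM, and DC-related error counts (sample, full data).
FEATURES = ["significance", "n_predicates", "dc_similarity",
            "aam_similarity", "errors_sample", "errors_full"]

# Synthetic training rows standing in for user-defined DC templates.
X_train = np.array([
    [0.97, 3, 0.10, 0.80, 0, 2],     # e.g., a DC matching a template
    [0.55, 7, 0.85, 0.05, 40, 900],  # e.g., an accidental DC
])
y_train = np.array([1, 0])           # 1 = reliable, 0 = discard

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Classify a newly discovered DC: keep it for data repairing?
x_new = np.array([[0.90, 4, 0.20, 0.70, 1, 5]])
print(clf.predict(x_new))
```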
4. RESEARCH PLAN

This research program is currently in its second year. The research challenges we will have to face to answer our research questions are as follows. First, we will establish an evaluation scenario in which data, DCs, and applications are all evolving. So far, we have been running preliminary experiments on datasets that have already been used to evaluate DC discovery [5]. We plan to adapt the evaluation protocols of [19] and [3] to our workload-aware scenario. Then, we will define and detail the metrics that classify DCs as suitable for data cleaning. With these metrics in place, we should be able to devise rules to know when a DC is no longer valid. Such metrics will also help to evaluate the overall system accuracy. Finally, we plan to compare the accuracy and scalability of our system against traditional cleaning solutions.

5. REFERENCES

[1] Z. Abedjan, L. Golab, and F. Naumann. Profiling relational data: A survey. The VLDB Journal, 24(4):557–581, Aug. 2015.
[2] L. Bertossi. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers, 2011.
[3] S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst., 32(2), June 2007.
[4] S. Chaudhuri and V. Narasayya. Self-tuning database systems: A decade of progress. In VLDB, pages 3–14, 2007.
[5] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498–1509, Aug. 2013.
[6] W. W. Eckerson. Data quality and the bottom line: Achieving business success through a commitment to high quality data. Technical report, Data Warehousing Institute, 2002.
[7] W. Fan. Data quality: From theory to practice. SIGMOD Rec., 44(3):7–18, Dec. 2015.
[8] W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool Publishers, 2012.
[9] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst., 33(2):6:1–6:48, June 2008.
[10] P. Holanda and E. C. de Almeida. SPST-Index: A self-pruning splay tree index for caching database cracking. In EDBT, pages 458–461, 2017.
[11] I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281–393, 2015.
[12] J. Liu, J. Li, C. Liu, and Y. Chen. Discover dependencies from data - a review. IEEE TKDE, 24(2):251–264, Feb. 2012.
[13] S. Navathe, K. Karlapalem, and M. Ra. A mixed partitioning methodology for distributed database design. Journal of Computer and Software Engineering, 3(4), 1995.
[14] T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schönberg, J. Zwiener, and F. Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. PVLDB, 8(10):1082–1093, June 2015.
[15] E. H. M. Pena and E. C. de Almeida. Uso de instâncias de dados e carga de trabalho para mineração de restrições de integridade (in Portuguese). In SBBD, pages 312–317, 2017.
[16] E. H. M. Pena and E. C. de Almeida. BFASTDC: A bitwise algorithm for mining denial constraints. In DEXA, 2018. To appear.
[17] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, Aug. 2017.
[18] S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J.-A. Quiane-Ruiz, and N. Tang. UGuide: User-guided discovery of FD-detectable errors. In SIGMOD, pages 1385–1397, 2017.
[19] M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In ICDE, pages 244–255, March 2014.
[20] M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, and D. Srivastava. On multi-column foreign key discovery. PVLDB, 3(1-2), Sept. 2010.
[21] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 1st edition, 2012.