=Paper=
{{Paper
|id=Vol-2578/PIE4
|storemode=property
|title=Coverage-based Rewriting for Data Preparation
|pdfUrl=https://ceur-ws.org/Vol-2578/PIE4.pdf
|volume=Vol-2578
|authors=Chiara Accinelli,Simone Minisi,Barbara Catania
|dblpUrl=https://dblp.org/rec/conf/edbt/AccinelliMC20
}}
==Coverage-based Rewriting for Data Preparation==
Coverage-based Rewriting for Data Preparation

Chiara Accinelli, Simone Minisi, Barbara Catania
University of Genoa, Italy
chiara.accinelli@dibris.unige.it, simone.minisi@dibris.unige.it, barbara.catania@dibris.unige.it

© 2020 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference, March 30-April 2, 2020 on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC BY 4.0.

ABSTRACT

The development of technological solutions satisfying non-discriminating requirements is currently one of the main challenges for data processing. Concepts like fairness, i.e., lack of bias, and diversity, i.e., the degree to which different kinds of objects are represented in a dataset, have recently been taken into account in designing non-discriminating set selection, ranking, and OLAP approaches. Information extraction is, however, also at the basis of back-end data processing, for preparing, e.g., extracting and transforming, data, usually based on SQL queries, before loading it inside a data warehouse for further front-end processing. An unfair data preparation process might therefore have a relevant impact on front-end analysis. As an example, an under-represented category in the warehouse might lead to an under-representation of that category in most of the following processes. In this paper, we start from this consideration and we propose an approach for automatically rewriting back-end queries, whose results do not guarantee some coverage constraints, into the "closest" queries satisfying those constraints. Through rewriting, coverage-based modifications of data preparation steps are traced for further processing. We also present some preliminary experimental results and we identify some directions for future work.

1 INTRODUCTION

Background. Nowadays, large-scale technologies for the management and the analysis of big data have a relevant and positive impact: they can improve people's lives and enhance scientific exploration. At the same time, it becomes increasingly important to understand the nature of these impacts at the social level and to take responsibility for them, especially when they deal with human-related data [7].

The development of technological solutions satisfying non-discriminating requirements is currently one of the main challenges in data processing and, in particular, in data management. Due to the above-mentioned social relevance, themes such as diversity, non-discrimination, fairness, protection of minorities, and transparency are becoming increasingly crucial when dealing with any data processing step. More concretely, consider a population upon which a data processing (either operational or analytical) task has to be applied. Suppose that a subset of our population shares some characteristics that should not be employed for discrimination (e.g., race, gender, disability status). It is important to guarantee that the result of the processing task is not discriminating with respect to the considered sensitive attributes. This may include ensuring a fair probability of selection, not giving undue relevance to individuals sharing these properties, or other related constraints. This type of requirement is made desirable by the ethical need to take responsibility; it is also made mandatory by the recent General Data Protection Regulation of the European Union [5], which imposes that this type of guarantee is provided by design, i.e., intrinsically embedded in the mechanisms of the data processing workflow.

Among the relevant themes from a social point of view, various non-discriminating constraints, like fairness, i.e., lack of bias, and diversity, i.e., the degree to which different kinds of objects are represented in a dataset, have recently received attention from the data management community. As pointed out in [6, 8], an important direction is developing a holistic treatment of non-discriminating constraints through the different stages of the data management and analysis life-cycle, from data cleaning, integration, and preprocessing, through selection and ranking, to result interpretation. In this context, a key issue concerns enforcing non-discriminating constraints incrementally, through individual independent choices, rather than as a constraint on the set of final results. This kind of guarantee is known as coverage.

Problem. Recently, various approaches to cope with non-discriminating constraints during front-end information extraction stages, aiming at presenting the users with data organized so as to satisfy their needs, have been proposed in the context of set selection [20], ranking [2, 3, 22], and OLAP [14, 15, 17]. Information extraction is, however, also at the basis of back-end processing in analytical systems, in which source data have to be prepared, namely extracted, transformed, and loaded, into an analytical repository (a data warehouse) for further front-end processing. This kind of process is usually simpler, from the point of view of data manipulation, than the relevant front-end search functionalities, and can often be represented in terms of Select-Project-Join (SPJ) queries. However, an unfair data preparation process might have a huge impact on many front-end processes: as an example, an under-represented category in the warehouse might lead to an under-representation of that category in any following process. As pointed out in [16], this motivates the study of fairness-aware data transformations, where the idea is to minimally rewrite the transformation query so that certain fairness constraints are guaranteed to be satisfied in the result of the transformation.

In this paper, we start from these considerations and we consider fairness constraints guaranteeing that there are enough entries in the dataset for each object category before front-end processes start. Such a property has been called coverage in [4] and has relationships with both fairness and diversity issues [22]. More precisely, consider a (source) dataset I in which some items are characterized by a sensitive attribute (e.g., gender, race, age), and an SPJ query Q over I including the sensitive attribute in the projection list. Now suppose that some coverage constraints are given with respect to the result of Q when executed over I (e.g., we would like the number of females to be greater than a given threshold) and that Q, when executed over I, does not satisfy them. We are interested in rewriting Q into a new query Q′ such that Q′ relaxes predicates in Q, thus Q ⊆ Q′, and Q′ is the smallest query containing Q that satisfies the coverage constraints when executed over I.

Notice that we are interested in rewriting Q into Q′ in order to achieve transparency: through rewriting, the coverage-aware relaxation process is traced for further analysis or processing. In this respect, the approach we propose can be seen as a revision of existing query refinement techniques addressing the empty or few answer problem (see, e.g., [12]), adapted to take care of coverage and fairness issues.
Contributions. The main contributions of the paper can be summarized as follows:
• We formally define a query rewriting problem based on coverage constraints.
• We present a baseline algorithm for query rewriting, relaxing the query with the aim of satisfying the constraints at hand, and we propose two optimizations based on pruning and incremental refinement.
• We present some preliminary experimental results showing that the proposed approach is effective and efficient when applied over some real datasets.

Organization. The remainder of this paper is organized as follows. Section 2 discusses related work about non-discrimination in information extraction and query relaxation. The problem we consider is then presented in Section 3. The proposed grid-based algorithm is introduced in Section 4, while experimental results are presented in Section 5. Finally, Section 6 outlines future work and presents some concluding remarks.

2 RELATED WORK

There are several areas of research that are related to our work, which we discuss below.

Non-discrimination in information extraction. Information extraction aims at presenting the users with data organized so as to satisfy their needs. Non-discriminating information extraction takes into account non-discriminating properties, like fairness and diversity, in generating the results to be returned to the user. Fairness and diversity constraints have been taken into account for designing set selection [20] and ranking [2, 3, 22] approaches. Fairness has also been considered to provide fairness-aware rewriting of OLAP, i.e., aggregation-based, queries [14, 15], and database repair approaches [17, 18] that provide provable fairness guarantees about classifiers trained on database training labels. The notion of causal fairness and the design of data management techniques for it have been investigated in [16]. Coverage over multiple categorical attributes has been introduced in [4], where efficient techniques for determining the least amount of additional data that must be obtained to guarantee coverage satisfaction have been proposed. A web application made up of a collection of visual widgets that implement the latest research results on fairness, stability, and transparency for rankings, and that communicate details of the applied methodology to the end-user, is presented in [19, 21]. In [1], an interaction model for explain-and-repair data transformation systems is proposed, in which users interactively define constraints for transformation code and the resultant data; the system satisfies these constraints as much as possible and provides an explanation for any encountered problem. In this paper, we consider coverage constraints, as proposed in [4], over a single sensitive attribute. The problem we address is closely related to [1]. However, differently from [1], we consider transformations corresponding to SPJ queries, with the aim of satisfying coverage constraints through an approach that can be easily integrated inside an SQL engine.

Query relaxation. Query relaxation is a well-studied problem in the database literature, in which the goal is to help users in revising query specification or execution so that the generated results better match users' desired results. Previous work has exploited query logs [9] and the data being queried [23]. In order to address the empty or few answer problem, a skyline-based refined query execution has been proposed in [10], while an interactive framework for query relaxation through rewriting has been presented in [12]. The approach we propose can be seen as a reinterpretation of this last query relaxation framework to take care of coverage and fairness issues without any required user interaction.

3 PROBLEM DEFINITION

Let D = {R_1, ..., R_n} be a relational database schema and let S ∈ R_i, i ∈ {1, ..., n}, be an attribute, called sensitive attribute, with domain D_S = {s_1, ..., s_h}. Let Q be an arbitrary SQL query and let sel_1, ..., sel_d be the selection conditions in Q, with sel_j ≡ A_j op v_j, where attribute A_j has domain D_{A_j}, belongs to the schema of a relation R_j in D, and v_j ∈ D_{A_j}. When needed, we denote Q by Q⟨v_1, ..., v_d⟩ or Q⟨v⟩, with v ≡ (v_1, ..., v_d). For the sake of simplicity, we assume that A_j, j = 1, ..., d, is a numeric attribute; the proposed approach can however be easily extended to deal with any ordered domain.

We restrict our attention to SQL queries Q in which: (i) the sensitive attribute S belongs to the projection list; (ii) Q⟨v⟩ ⊆ Q⟨u⟩ when u_i ≥ v_i, i = 1, ..., d (denoted by u ≥ v). Such queries are called sensitive selection monotone queries.

We then denote by Q↓s_i the query σ_{S=s_i}(Q), i ∈ {1, ..., h}, and by |Q(I)| the cardinality of the result of Q when executed over a database instance I of D. For each value s_i of the sensitive attribute S, a constraint CC_i ≡ |Q↓s_i| ≥ k_i is called a coverage constraint with respect to s_i, or simply a coverage constraint.

The problem we want to address is the following: given a database instance I, a sensitive selection monotone query Q⟨v⟩, and a set of coverage constraints CC = {CC_1, ..., CC_t}, t ≤ h, not satisfied by Q(I), rewrite Q⟨v⟩ into a new relaxed query Q′ ≡ Q⟨u⟩ so that: (i) Q ⊆ Q′; (ii) for all i ∈ {1, ..., t}, CC_i is satisfied by Q′(I); (iii) there is no other query Q″ satisfying conditions (i) and (ii) such that Q″ ⊂ Q′ (thus, Q′ is the smallest query satisfying (i) and (ii)); (iv) Q′ ≡ Q⟨u⟩ is the closest query to Q⟨v⟩ according to the Euclidean distance between v and u. Query Q′ is called a coverage-based rewriting of Q with respect to CC and I.

Coverage constraints can be provided together with query Q, or they can already be available in the system. The latter could be useful when they represent generally valid non-discrimination rules that must be satisfied by any query execution. Since the user might not be aware of the available constraints, we relax only the selection predicates appearing in Q, with the aim of generating relaxed queries that are syntactically close to Q (thus, potentially increasing user satisfaction). Additionally, differently from [12], we assume the user is not involved in the relaxation process, to make the rewriting process "lighter" from the user point of view.

Consider a numeric selection predicate sel_j ≡ A_j < v_j. A relaxation of sel_j is any predicate sel_j′ ≡ A_j < v_j′ such that v_j′ ≥ v_j. Any predicate on a numeric domain can be converted to a predicate of the form A_j < v_j. For instance, a predicate A_j > v_j can be transformed into -A_j < -v_j; a range predicate of the form v_jl < A_j < v_ju can be transformed into the two separate predicates -A_j < -v_jl and A_j < v_ju; the equality A_j = v_j can be rewritten as the conjunction of A_j ≥ v_j and A_j ≤ v_j. In the rest of the paper, for ease of exposition, we assume that numeric predicates have been appropriately transformed into predicates of the form A_j < v_j.

4 A GRID-BASED APPROACH FOR COVERAGE-BASED REWRITING

Let Q⟨v⟩ be a sensitive selection monotone query and let CC ≡ {CC_1, ..., CC_t} be a set of coverage constraints. Relaxed queries generated through coverage-based rewriting starting from Q⟨v⟩ and CC have the form Q⟨u⟩, with u ≥ v, and can be represented as points u in a d-dimensional space.
Let I be an instance of D and suppose that each selection attribute A_j assumes in I values in D_{A_j} ≡ [min_{A_j}, max_{A_j}], j = 1, ..., d. In order to navigate the d-dimensional space in a discrete way, we discretize D_{A_j} into a fixed number of bins n, which we assume to be expressed as 2^m for some integer m. The space can thus be represented as a multidimensional n × ... × n matrix, called the cumulative multi-dimensional grid for Q and CC in I (CMG_{Q,CC,I}, or CMG for short when no ambiguity arises). Each cell of the CMG with index (i_1, ..., i_d), i_j ∈ {0, ..., n-1}, corresponds to the query Q⟨v_1 + (max_{A_1} - v_1)/n · i_1, ..., v_d + (max_{A_d} - v_d)/n · i_d⟩. For the sake of simplicity, we denote such a query by Q_(i_1,...,i_d). Notice that Q_(0,...,0) ≡ Q⟨v⟩.

In order to detect a coverage-based rewriting for Q with respect to CC and I, we store in each cell (i_1, ..., i_d) of the CMG the estimated cardinality |Q_(i_1,...,i_d)(I)| (in the following denoted by CMG((i_1,...,i_d)).1) and a list of estimated cardinalities, one for each value of the sensitive attribute for which a coverage constraint has been specified, namely (|Q_(i_1,...,i_d)↓s_1(I)|, ..., |Q_(i_1,...,i_d)↓s_t(I)|) (denoted by CMG((i_1,...,i_d)).2).

Given an index i, we denote by ur(i) (upper right elements of i) the set {l | l ≥ i}, and by ll(i) (lower left elements of i) the set {l | l ≤ i}. Since we deal with sensitive selection monotone queries, it is easy to show that the following properties hold:
(a) CMG(l).1 ≥ CMG(i).1 and CMG(l).2 ≥ CMG(i).2 when l ≥ i;
(b) Q_i ⊆ Q_l for all l ∈ ur(i);
(c) Q_l ⊆ Q_i for all l ∈ ll(i).

As an example, Figure 1(b) shows the matrix CMG for the dataset in Figure 1(a), when only one coverage constraint is provided. For the sake of simplicity, each cell i reports only CMG(i).2. Grey cells correspond to ur((1, 1)) while light grey cells correspond to ll((1, 1)). Cell (1, 1) is included in both ur((1, 1)) and ll((1, 1)).

In order to identify a coverage-based rewriting for Q, we propose three algorithms. The first is a baseline algorithm that identifies the solution by navigating the CMG, one cell after the other, at increasing distance from Q. The second algorithm is obtained from the first one by pruning the space, thus reducing the number of visited cells with a limited overhead; finally, the third algorithm visits the space, after pruning, by iteratively refining the size and the number of matrix cells to be visited, thus converging to the solution faster.

4.1 The Baseline Approach

Algorithm CRBase (Coverage-based Rewriting Baseline algorithm) visits the CMG according to a Space Filling Curve (SFC) in which cells are visited at increasing distance from CMG(0) (see Figure 1(c)), relying on a function Next(). During the visit, we look for the cell corresponding to the query with the minimum cardinality that satisfies the coverage-based constraints CC. In particular, let index be the set containing all the CMG cell indexes. We initialize min with the index of the most distant query from Q, corresponding to cell (n-1, ..., n-1). Notice that in Q_min each selection condition contains the maximum value for the selection attribute, thus all conditions are always satisfied in I and Q_min(I) returns all the input tuples. As a consequence, if Q_min(I) does not satisfy CC (line 7), no coverage rewriting can be generated and a void result is returned. Otherwise, we navigate the space one cell after the other according to the SFC (line 10). For each visited cell with index i, the coverage-based constraints CC are checked over Q_i(I) through the predicate Check(Q_i(I), CC). If they are satisfied (Check(Q_i(I), CC) is true, line 11), or they are not satisfied but the cardinality of Q_i(I) is higher than the current minimum cardinality (line 16), we remove ur(i) from index (lines 12 and 17), since all cells in ur(i) correspond to queries with higher cardinalities with respect to the currently identified minimum. When the constraints are satisfied, the index of the current minimum query is updated accordingly (line 13).

Algorithm 1 CRBase
1: function CRBase(Q, CC, I)
2:   Input Q: a sensitive selection monotone query; CC: a set of coverage-based constraints; I: a database instance
3:   Output min: index of a coverage-based rewriting of Q with respect to CC and I
4:   index = {all the cell indexes in the CMG}
5:   min = (n-1, ..., n-1)
6:   i = (0, ..., 0)
7:   if not Check(Q_min(I), CC) then return ⊥
8:   else
9:     while index ≠ ∅ do
10:      i = Next(i, index)
11:      if Check(Q_i(I), CC) then
12:        index = index \ ur(i)
13:        if |Q_i(I)| < |Q_min(I)| then min = i
14:        end if
15:      else
16:        if |Q_min(I)| < |Q_i(I)| then
17:          index = index \ ur(i)
18:        end if
19:      end if
20:    end while
21:    return min
22:  end if
23: end function

4.2 Adding Pruning

Algorithm CRBase can be optimized, leading to algorithm CRBaseP (CRBase with Pruning), by adding some pruning rules. Such rules generalize the pruning already applied by algorithm CRBase at lines 12 and 17, with the aim of further reducing the space to be visited and coping with the well-known curse of dimensionality. Each pruning rule generates two sets of cells:

a) Locking set (LS). For each i ∈ LS, query Q_i(I) satisfies the coverage constraints. Based on properties (a) and (b), all the queries in ur(i) can therefore be removed from index, since they satisfy the coverage constraints but they are not the closest to the initial query and their cardinality might not be the minimum one. We denote by LS* the set ∪_{i∈LS} ur(i).

b) No Solution set (NS). For each i ∈ NS, query Q_i(I) does not satisfy the coverage constraints. Based on properties (a) and (c), this means that all the queries in ll(i) can be removed from index, since they cannot satisfy the coverage constraints. We denote by NS* the set ∪_{i∈NS} ll(i).
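To make the baseline concrete, here is a minimal self-contained sketch of the CRBase idea in 2D. The toy dataset layout, the distance-sorted visit order used in place of a true space-filling curve, and the exhaustive result computation (instead of the cardinality estimation of Section 4.4) are all simplifying assumptions of ours:

```python
from itertools import product
from math import hypot

def cr_base(rows, v, upper, cc, n=8):
    """Toy 2D CRBase. rows: (a1, a2, sensitive) tuples; v: initial
    cut-offs (the query is a1 < v1 AND a2 < v2); upper: max attribute
    values; cc: dict sensitive value -> required minimum count.
    Returns the relaxed cut-offs of the chosen rewriting, or None."""
    def cutoffs(cell):                    # cell index -> relaxed cut-offs
        return tuple(v[j] + (upper[j] - v[j]) / n * cell[j] for j in range(2))

    def run(cell):                        # evaluate Q_cell over the data
        u = cutoffs(cell)
        return [r for r in rows if r[0] < u[0] and r[1] < u[1]]

    def check(res):                       # Check(Q_i(I), CC)
        return all(sum(1 for r in res if r[2] == s) >= k for s, k in cc.items())

    if not check(run((n - 1, n - 1))):    # even the largest query fails
        return None
    best, best_card = (n - 1, n - 1), len(run((n - 1, n - 1)))
    # visit cells at increasing Euclidean distance from (0, 0)
    index = sorted(product(range(n), repeat=2), key=lambda c: hypot(*c))
    pruned = set()
    for cell in index:
        if cell in pruned:
            continue
        res = run(cell)
        if check(res) and len(res) < best_card:
            best, best_card = cell, len(res)
        if check(res) or len(res) > best_card:
            # all cells in ur(cell) are supersets: prune them
            pruned |= {c for c in index
                       if c[0] >= cell[0] and c[1] >= cell[1]}
    return cutoffs(best)
```

On the toy data, the returned cut-offs are the smallest relaxation whose result meets every per-value threshold; a real implementation would replace run() with sampling-based cardinality estimates.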
We consider two pruning rules:
• Diagonal pruning. Using binary search, we look for the cell (c, ..., c), c ∈ {0, ..., n-1}, closest to (0, ..., 0) such that Q_(c,...,c)(I) satisfies the coverage constraints. We add (c, ..., c) to LS and (c-1, ..., c-1) to NS. The number of cardinality estimations performed by diagonal pruning is O(log n), where n is the number of bins.
• Dimensional pruning. For the sake of simplicity, we explain dimensional pruning in a 2-dimensional space; the pruning can however be easily extended to any d-dimensional matrix. Suppose that the horizontal axis corresponds to the selection attribute A_1 and the vertical one to the selection attribute A_2. For each position p, p = 0, ..., n-1, in the horizontal axis, we look for the first cell (p, q), q ∈ {0, ..., n-1}, that satisfies the coverage constraints. It is easy to show that (p, q) ∈ LS and (p, z) ∈ NS, for all z ≤ q-1. A similar computation is then applied to the vertical axis. Assuming binary search is used to locate the first satisfying cell for each direction, the number of cardinality estimations performed by dimensional pruning is O(dn log n), where n is the number of bins and d the number of axes, i.e., the number of selection conditions in the input query Q.

Figure 2 shows an example of the considered pruning rules, with a single coverage constraint set to |Q↓F(I)| ≥ 7.

Figure 1: Data representation: (a) data distribution, (b) CMG_{Q,CC,I}, (c) SFC.
Figure 2: Pruning in 2D with Locking and NoSolution sets: (a) diagonal pruning, (b) dimensional pruning.

The pruning rules can be easily integrated into the CRBase algorithm as follows:
• sets LS and NS are computed before line 7, according to diagonal and dimensional pruning;
• at line 4, index is initialized with all the cell indexes in the CMG minus those identified by the pruning phase, i.e., index = {all the cell indexes in the CMG} \ (LS* ∪ NS*).

4.3 Adding Iteration

CRBaseP can be further optimized by iteratively increasing the number of bins during the search, up to a given maximum n = 2^m. The new algorithm is denoted by CRBasePI (CRBaseP with Iteration). Each iteration, by increasing the number of bins, increases the number of matrix cells to be visited for pruning and search. As a consequence, each iteration increases the precision with which we refine the query and compute cardinalities. More precisely, at each iteration j ≥ 1, the number of bins is set to 2^j and Algorithm CRBaseP is used to prune the space, navigate it, and update the index min. Only when we reach the maximum precision, i.e., when j = m and the number of bins is equal to n, is the computed min returned as the result.

4.4 Cardinality Estimation

Our query refinement framework requires fast and accurate cardinality estimates for the queries corresponding to visited cells of the CMG. These estimates could be obtained from the cardinality estimation component of the database system. However, such estimates are often incorrect, especially for queries with multiple joins and selection predicates. Since the accuracy of the proposed coverage-based rewriting depends on the cardinality estimation approach used, similarly to [12] we rely on sampling-based estimators. In order to avoid sampling repeatedly for each visited cell, we compute an ε-approximation of the base tables and we rely on the approach in [13] for generating the sample of joined tables. We compute just one random sample for each query at hand (uniformly, independently, and without replacement). According to [13], such a sample can be reused for all structurally equivalent queries (i.e., queries containing the same number of selection conditions and the same number of joins).
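Diagonal pruning amounts to a standard binary search along the grid diagonal. The monotone check callback and the function names below are illustrative assumptions of ours:

```python
def diagonal_pruning(n, d, check):
    """Binary-search the diagonal cells (c, ..., c) of an n^d grid for
    the smallest c whose query satisfies the coverage constraints.
    check(cell) must be monotone: once true on the diagonal, it stays
    true for larger c. Returns (LS_cell, NS_cell); either may be None."""
    if not check((n - 1,) * d):          # largest query fails: no solution
        return None, (n - 1,) * d
    lo, hi, first = 0, n - 1, None
    while lo <= hi:                      # O(log n) cardinality estimations
        mid = (lo + hi) // 2
        if check((mid,) * d):
            first, hi = mid, mid - 1
        else:
            lo = mid + 1
    ls = (first,) * d
    ns = (first - 1,) * d if first > 0 else None
    return ls, ns
```

Because of property (a), check is monotone along the diagonal, which is what justifies the binary search and the O(log n) bound.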
5 EXPERIMENTAL EVALUATION

5.1 Experimental Setup

All experiments were conducted on a PC with an Intel Core i5-8300H 2.30 GHz CPU and 16GB of main memory, running Microsoft Windows 10. All programs were coded in Python 3.8.

We considered two real datasets with different sizes, stored in PostgreSQL: the Adult dataset [11] and the US Forbes Richest People dataset (https://www.forbes.com/forbes-400/#ec679fa7e2ff). The Adult dataset was originally extracted from the 1994 Census database and has been used in assessing fairness-aware data management techniques (see, e.g., [16]). It contains 48842 individuals, described by six numerical attributes (age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week) and eight categorical attributes (workclass, education, marital-status, occupation, relationship, race, sex, native-country). For our experiments, we use sex as the sensitive attribute; it presents an unbalanced proportion of males (67%) and females (33%). For this dataset, we generated a random sample with 9604 tuples, guaranteeing query cardinality estimation with 0.1% error and 99% confidence. The sample is small enough to be stored in memory. The US Forbes Richest People dataset contains information about the richest people in the US in 2016 (about 400 individuals), described by six numerical attributes (worthchange, age, realtimeworth, realtimerank, timestamp, realtimeposition) and nine categorical attributes (name, lastname, uri, imageuri, source, industry, gender, headquarters, state). For our experiments, we use gender as the sensitive attribute; it presents an unbalanced proportion of males (87%) and females (13%).

To measure the performance of the algorithms, we consider both the execution time and the number of query cardinality estimations, corresponding to the number of visited cells. To measure the accuracy of the algorithms, similarly to [12], given an initial query Q over a database instance I, rewritten into a new query Q′, we compute the relaxation degree as the ratio (|Q′(I)| - |Q(I)|) / |Q(I)|.

We then performed four groups of experiments aiming at: (i) comparing the three proposed algorithms; (ii) analyzing their performance and accuracy with respect to the number of bins and a single coverage constraint; (iii) analyzing their performance while changing the reference dataset; (iv) analyzing their performance and accuracy while changing the number of coverage constraints. These preliminary experimental results refer to selection queries; we leave the experimental analysis of join queries to future work.

5.2 Experimental Results

Impact of pruning and iteration. In the first group of experiments, we compare the three proposed algorithms to understand the impact of pruning and iteration on performance and relaxation degree. To this aim, we consider a query Q4 with four selection conditions, selecting individuals who are over 20 years old, have an education number equal to or greater than 13, work more than 20 hours per week, and have a capital gain higher than 5500:

Q4: SELECT * FROM adult_data
    WHERE age > 20 AND education_num >= 13
      AND hours_per_week > 20 AND capital_gain > 5500

Consider the coverage constraint |Q↓Female(I)| ≥ 250. Query Q4 returns 1242 tuples (selectivity equal to 2.5%), out of which 200 are females (about 16%), thus the constraint is not satisfied.

Figure 3: Comparison of coverage rewriting algorithms: (a) execution time, (b) visited cells, (c) relaxation degree.

Figure 3 shows the execution time (in seconds) of the three algorithms while varying the number of bins n. As expected, the number of visited cells increases (and, as a consequence, performance decreases) as n increases. CRBasePI exhibits the best performance, showing the benefits of relying on pruning and iteration. On the other hand, all techniques return the same coverage-based rewritten query, whose relaxation degree decreases while increasing the number of bins (see Figure 3(c)). Similar results have been obtained with other queries. Therefore, in the following, we present experimental results only for CRBasePI. Additionally, since the number of visited cells and the execution time are correlated, in the following we report only execution time results.

Impact of the number of selection conditions. In the second group of experiments, we analyze in more detail the execution time and the relaxation degree achieved by CRBasePI, varying the number of bins n and the threshold of the coverage constraint. The considered queries (Q2 and Q3, listed below, and Q4, presented above) contain a different number of selection conditions, but all of them have a selectivity between 2% and 4%, and the percentage of returned females is about 16% of the result. In this way, we can analyze CRBasePI performance independently of differences in query selectivity (similar results have been obtained with queries with higher selectivities).

Q2: SELECT * FROM adult_data
    WHERE hours_per_week > 20 AND capital_gain > 5500
Q3: SELECT * FROM adult_data
    WHERE age > 20 AND hours_per_week > 20
      AND capital_gain > 5500

For each query, we consider a coverage constraint |Q↓Female(I)| ≥ nf + 25%, where nf denotes the number of females returned by the initial query. From Figure 4(a), we observe that, by varying the number of selection conditions, and therefore the dimensionality of the CMG, the execution time rapidly increases starting from 32 bins.
This is a well-known problem, due to the curse of dimensionality. From Figure 4(b), we notice that, starting from 16 bins, we get a relaxation degree lower than 0.4. The relaxation degree is almost stable starting from 32 bins. By combining the results shown in Figures 4(a) and (b), we can consider 32 bins a reasonable choice, providing a good compromise between performance and accuracy. From Figure 5(a), we notice that the pruning time is quite low (lower than half a second) even with n = 4; thus, pruning does not deeply suffer from the curse of dimensionality, while the search time does (Figure 5(b)).

[Figure 4: CRBasePI performance and accuracy, with respect to the number of bins. (a) Execution time; (b) relaxation degree.]

[Figure 5: CRBasePI pruning and search time, with respect to the number of bins. (a) Pruning time; (b) search time.]

As a final experiment for this group, we analyze CRBasePI performance by fixing the number of bins to 32 and varying the coverage constraint, setting the threshold to n_f plus a given percentage (25%, 33%, 50%, 66%). From Figure 6(a), we observe that, by increasing the female coverage threshold, the execution time increases (quite rapidly for n = 4, due to the curse of dimensionality) since more cells are visited by the algorithm (the distance between the original query and the final one is higher). On the other hand, the relaxation degree increases almost linearly (see Figure 6(b)) with respect to the additional percentage of females required by the coverage constraint since, by increasing females, males increase as well.

[Figure 6: CRBasePI performance and accuracy, with respect to the coverage threshold. (a) Execution time; (b) relaxation degree.]

Impact of sampling. In our next experiment, we examine performance when varying the dataset size. To this aim, we consider the Forbes dataset (https://www.forbes.com/forbes-400/#ec679fa7e2ff). Since it is quite small (about 400 individuals), we do not generate any sample and we execute queries directly on the input dataset. Due to space constraints, we consider only queries with two and three selection predicates, keeping query selectivity in the range considered for the Adult dataset (about 3%):

Q2_f: SELECT * FROM forbes
      WHERE realtimeworth > 10000 AND age < 65

Q3_f: SELECT * FROM forbes
      WHERE realtimeworth > 10000 AND age < 65
        AND age > 45

Figure 7 compares the execution times of corresponding queries executed over the Adult and Forbes datasets. As expected, the dataset size influences the query cardinality estimation time and, as a consequence, the total execution time. A similar result holds while increasing the female threshold (Figure 8). In both cases, differences in times are more evident for a higher number of selection conditions (n = 3).

[Figure 7: Performance comparison between the Adult and Forbes datasets, with respect to the number of bins. Execution time for (a) n = 2 and (b) n = 3.]

[Figure 8: Performance comparison between the Adult and Forbes datasets, with respect to the coverage threshold. Execution time for (a) n = 2 and (b) n = 3.]

Impact of the number of coverage constraints. In order to analyze the impact of the number of coverage constraints, we consider query Q2 and the following sets of coverage constraints:

CC1 = k_Q2^Female ≥ 456
CC2 = k_Q2^Female ≥ 456 AND k_Q2^Male ≥ 1800
CC3 = k_Q2^Female ≥ 456 AND k_Q2^Male ≥ 2400

CC1 is the constraint we considered in the previous experiments. CC2 specifies an additional coverage constraint on males, which is however already satisfied by the coverage-based rewriting generated when considering CC1. Thus, we expect that the solution returned with CC1 is the same as that returned with CC2. On the other hand, with CC3, we expect a different rewriting, increasing the number of males, as required by the constraint.
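Checking whether a (rewritten) query result satisfies a set of coverage constraints like CC1–CC3 reduces to counting tuples per group of the sensitive attribute. The sketch below is illustrative; the row representation and helper name are ours, not the paper's implementation:

```python
from collections import Counter

def satisfies(rows, constraints):
    # constraints: mapping from a value of the sensitive attribute
    # (here 'sex') to the minimum number of result tuples required
    # for that group, i.e., k_Q^g >= threshold for every group g.
    counts = Counter(row["sex"] for row in rows)
    return all(counts[group] >= k for group, k in constraints.items())

# A toy query result with 500 females and 1900 males.
result = [{"sex": "F"}] * 500 + [{"sex": "M"}] * 1900

CC1 = {"F": 456}
CC2 = {"F": 456, "M": 1800}
CC3 = {"F": 456, "M": 2400}

print(satisfies(result, CC1), satisfies(result, CC2), satisfies(result, CC3))
# True True False
```

This mirrors the behavior discussed in the experiments: a rewriting that satisfies CC1 may already satisfy CC2 at no extra cost, whereas CC3 can only be met by a different, larger relaxation.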
Figure 9(a) shows the execution time of query Q2 with respect to the considered coverage constraints. When considering CC2, the execution time slightly increases since, for each visited cell, one more query cardinality estimation is needed. However, since, by construction, when CC1 is satisfied CC2 is satisfied as well, the number of visited cells in the two cases coincides. On the other hand, when considering CC3, more cells are visited and, as a consequence, the execution time increases as well. From Figure 9(b) we see that, as expected, the relaxation degrees for CC1 and CC2 coincide, since in both cases the returned relaxed queries coincide. Relaxation is higher for CC3, since a higher coverage threshold is considered.

[Figure 9: CRBasePI performance and accuracy, with respect to different coverage constraints. (a) Execution time; (b) relaxation degree.]

6 CONCLUDING REMARKS

In this paper, we have presented a preliminary approach for coverage-based rewriting of SQL queries, suitable for addressing non-discrimination in early data processing stages. The proposed techniques revise and extend existing query refinement approaches, defined for addressing the empty or few answer problem, to take care of coverage issues, without the need of any user interaction. Experimental results show that the proposed approach is effective and efficient. Future work includes: (i) further optimizations of the CRBasePI algorithm to cope with the curse of dimensionality issue, by exploiting CMG sparsity and additional heuristics; (ii) considering more than one sensitive attribute and automating the identification of the sensitive attribute to analyze first; (iii) dealing with categorical attributes in the selection conditions and more complex fairness constraints; (iv) taking into account data summaries of the input dataset, thus making the rewriting reusable for datasets characterized by the same synopses; (v) since data preparation is an iterative process, considering coverage-based rewriting as a new relational operation to be taken into account during all query processing steps, including query optimization.

REFERENCES
[1] Dolan Antenucci and Michael J. Cafarella. 2018. Constraint-based Explanation and Repair of Filter-Based Transformations. PVLDB 11, 9 (2018), 947–960.
[2] Abolfazl Asudeh, H. V. Jagadish, Gerome Miklau, and Julia Stoyanovich. 2018. On Obtaining Stable Rankings. PVLDB 12, 3 (2018), 237–250.
[3] Abolfazl Asudeh, H. V. Jagadish, Julia Stoyanovich, and Gautam Das. 2019. Designing Fair Ranking Schemes. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD 2019). 1259–1276.
[4] Abolfazl Asudeh, Zhongjun Jin, and H. V. Jagadish. 2019. Assessing and Remedying Coverage for a Given Dataset. In 35th IEEE International Conference on Data Engineering (ICDE 2019). 554–565.
[5] Piero A. Bonatti and Sabrina Kirrane. 2019. Big Data and Analytics in the Age of the GDPR. In 2019 IEEE International Congress on Big Data (BigData Congress 2019). 7–16.
[6] Marina Drosou, H. V. Jagadish, Evaggelia Pitoura, and Julia Stoyanovich. 2017. Diversity in Big Data: A Review. Big Data 5, 2 (2017), 73–84.
[7] Serge Abiteboul et al. 2018. Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151). Dagstuhl Manifestos 7, 1 (2018), 1–29.
[8] Donatella Firmani, Letizia Tanca, and Riccardo Torlone. 2019. Data Processing: Reflections on Ethics. In Proceedings of the 1st International Workshop on Processing Information Ethically (PIE@CAiSE 2019).
[9] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. 2010. SnipSuggest: Context-Aware Autocompletion for SQL. PVLDB 4, 1 (2010), 22–33.
[10] Nick Koudas, Chen Li, Anthony K. H. Tung, and Rares Vernica. 2006. Relaxing Join and Selection Queries. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006). 199–210.
[11] Moshe Lichman. 2013. UCI Machine Learning Repository. Technical Report.
[12] Chaitanya Mishra and Nick Koudas. 2009. Interactive Query Refinement. In 12th International Conference on Extending Database Technology (EDBT 2009). 862–873.
[13] Matteo Riondato, Mert Akdere, Ugur Çetintemel, Stanley B. Zdonik, and Eli Upfal. 2011. The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2011), Part II. 661–676.
[14] Babak Salimi, Corey Cole, Peter Li, Johannes Gehrke, and Dan Suciu. 2018. HypDB: A Demonstration of Detecting, Explaining and Resolving Bias in OLAP Queries. PVLDB 11, 12 (2018), 2062–2065.
[15] Babak Salimi, Johannes Gehrke, and Dan Suciu. 2018. Bias in OLAP Queries: Detection, Explanation, and Removal. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD 2018). 1021–1035.
[16] Babak Salimi, Bill Howe, and Dan Suciu. 2019. Data Management for Causal Algorithmic Fairness. IEEE Data Eng. Bull. 42, 3 (2019), 24–35.
[17] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Capuchin: Causal Database Repair for Algorithmic Fairness. CoRR abs/1902.08283 (2019).
[18] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional Fairness: Causal Database Repair for Algorithmic Fairness. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD 2019). 793–810.
[19] Julia Stoyanovich and Bill Howe. 2019. Nutritional Labels for Data and Models. IEEE Data Eng. Bull. 42, 3 (2019), 13–23.
[20] Julia Stoyanovich, Ke Yang, and H. V. Jagadish. 2018. Online Set Selection with Fairness and Diversity Constraints. In Proceedings of the 21st International Conference on Extending Database Technology (EDBT 2018). 241–252.
[21] Chenkai Sun, Abolfazl Asudeh, H. V. Jagadish, Bill Howe, and Julia Stoyanovich. 2019. MithraLabel: Flexible Dataset Nutritional Labels for Responsible Data Science. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM 2019). 2893–2896.
[22] Ke Yang, Vasilis Gkatzelis, and Julia Stoyanovich. 2019. Balanced Ranking with Diversity Constraints. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019). 6035–6042.
[23] Junjie Yao, Bin Cui, Liansheng Hua, and Yuxin Huang. 2012. Keyword Query Reformulation on Structured Data. In IEEE 28th International Conference on Data Engineering (ICDE 2012). 953–964.