Coverage-based Queries: Nondiscrimination Awareness in Data Transformation

Chiara Accinelli∗, Barbara Catania and Giovanna Guerrini
University of Genoa, Italy

Abstract
When people-related data are used in automated decision-making processes, social inequities can be amplified. Thus, the development of technological solutions satisfying nondiscrimination requirements, in terms of, e.g., fairness, diversity, and coverage, is currently one of the main challenges for the data management and data analytics communities. In particular, coverage constraints guarantee that a dataset includes enough items for each (protected) category of interest, thus increasing diversity with the aim of limiting the introduction of bias during the next analytical steps. While coverage constraints have mainly been used for designing data repair solutions, in our study we investigate their effects on data processing pipelines, with special reference to data transformation. To this aim, we first introduce coverage-based queries as a means for ensuring coverage constraint satisfaction on a selection-based query result, through rewriting. We then present two approximate algorithms for coverage-based query processing: the first, covRew, initially introduced in [3], relies on data discretization and sampling; the second, covKnn, is a novel contribution and relies on a nearest neighbour approach for coverage-based query processing. The algorithms are experimentally compared with respect to efficiency and effectiveness, on a real dataset.

Keywords
nondiscrimination, data transformation, coverage, rewriting

ITADATA2022: The 1st Italian Conference on Big Data and Data Science, September 20-21, 2022, Milan, Italy
∗ Corresponding author.
Email: chiara.accinelli@dibris.unige.it (C. Accinelli); barbara.catania@dibris.unige.it (B. Catania); giovanna.guerrini@dibris.unige.it (G. Guerrini)
ORCID: 0000-0002-6626-9470 (C. Accinelli); 0000-0002-6443-169X (B. Catania); 0000-0001-9125-9867 (G. Guerrini)

1. Introduction

Today, we are surrounded by information that is increasingly being used to make decisions with an impact on people's lives. It is therefore very important to understand the nature of this social impact and take responsibility for it. To this aim, the development of responsible and nondiscrimination-aware technological solutions is one of the main challenges in data management and analytics, and the satisfaction of ethical requirements is becoming a crucial element for asserting the quality of a dataset [9]. More precisely, nondiscrimination-aware data decision systems should keep humans in the loop in all the data processing steps, from acquisition to wrangling and analysis: on the one hand, they must ensure transparency and interpretability, for tracing the whole process; on the other, they should guarantee nondiscrimination with respect to protected groups of individuals. This is achieved by characterizing groups of interest in terms of sensitive attributes, such as gender or economic status, and relying on the
specification of properties, like fairness, i.e., lack of bias [14], diversity, i.e., the degree to which different kinds of objects are represented in a dataset [8], and coverage [4], guaranteeing a sufficient representation of any category of interest in a dataset. A significant ongoing research effort [8, 18] aims at a holistic treatment of nondiscrimination along the stages of the data processing life-cycle, rather than as a constraint on the final result: the sooner a problem is spotted, the fewer problems arise in the later analytical steps of the chain. Approaches have been proposed both with reference to front-end (e.g., OLAP queries [13], set selection [19], ranking [21]) and back-end stages (e.g., dataset repair during data acquisition, with a special focus on coverage in [4, 5, 12] and causal fairness [15]). We refer the reader to [7] for a short survey.

In our work, we are interested in nondiscrimination constraints defined in terms of coverage. Indeed, it is well known that the quality of a classifier depends not only on the algorithm used but also on the number of instances, i.e., the coverage, of each group in the dataset [16]. While coverage constraints have mainly been used for repairing raw datasets before any transformation [4, 5, 12], we investigate their effects on the data processing pipeline, with special reference to data transformed through query execution.

Example 1. Consider the StudentPerformance dataset¹ and suppose we want to predict who, among the best students, has parents with a high level of education (e.g., from graduation upwards), through a classification task. The information about the lunch at school allows us to deduce the economic status of the student's family, thus lunch can be taken as a sensitive attribute. Suppose we would like to train a model whose accuracy does not strongly depend on the economic status of the student's family, assuming that the lower the accuracy difference between the considered student groups, the less the model discriminates against them. Now suppose we qualify the best students with two selection conditions over the math and reading subjects: math_score ≥ 80 AND reading_score ≥ 80. This query returns just 13 students with free/reduced lunch out of 143 individuals. When training a Random Forest classifier on the query result, we get a model with an accuracy equal to 0.71 and an accuracy difference equal to 0.28 (accuracy for standard lunch = 0.78, accuracy for reduced lunch = 0.5). Thus, the insufficient number of students with reduced lunch in the query result affects the accuracy difference of the model. Now suppose we modify the initial query into the following one: math_score ≥ 71 AND reading_score ≥ 68. This query returns 74 free/reduced instances out of 350 individuals. The accuracy of the classification model trained on this new result changes slightly (0.69), but the accuracy difference improves to 0.006. The improvement is due to the higher number, i.e., the higher coverage, of protected instances in the query result. The price to pay is a relaxation of the original query. ♦

¹ Available at https://www.kaggle.com/spscientist/students-performance-in-exams.

In this paper, we present the main results we achieved for ensuring coverage constraint satisfaction during data transformation, represented in terms of selection-based queries.
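The scenario of Example 1 can be reproduced along the following lines; this is a minimal sketch, not the authors' exact experimental code, and it assumes the Kaggle column names normalized to snake_case, a binary target encoding "graduation upwards" as bachelor's or master's degree, and an illustrative 70/30 stratified split.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset and normalize column names (e.g., "math score" -> "math_score").
df = pd.read_csv("StudentsPerformance.csv")
df.columns = [c.replace(" ", "_").replace("/", "_") for c in df.columns]

# Assumed binary target: parental education from graduation upwards.
high_edu = {"bachelor's degree", "master's degree"}
df["target"] = df["parental_level_of_education"].isin(high_edu).astype(int)

def group_accuracies(result, features):
    """Train a Random Forest on a query result; report accuracy per lunch group."""
    X, y, groups = result[features], result["target"], result["lunch"]
    X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(
        X, y, groups, test_size=0.3, random_state=0, stratify=groups)
    pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
    return {g: accuracy_score(y_te[g_te == g], pred[g_te == g])
            for g in groups.unique()}

features = ["math_score", "reading_score", "writing_score"]
# Initial selection-based query Q and its relaxed rewriting from Example 1.
q1 = df[(df.math_score >= 80) & (df.reading_score >= 80)]
q2 = df[(df.math_score >= 71) & (df.reading_score >= 68)]
print(group_accuracies(q1, features))
print(group_accuracies(q2, features))
```

Exact accuracy values depend on the split and hyperparameters; the point is the per-group accuracy gap shrinking as the coverage of the protected group grows.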
In order to avoid disparate treatment discrimination [6], and similarly to what has been done in [13] for OLAP queries and causal fairness and, more recently, in [17] for range queries and different types of associational fairness, the input selection-based query, when executed over a given dataset I, is modified through rewriting, obtaining a new query that, when executed over I, satisfies the given constraints and is still close to the initial request. As far as we know, and according to [16], no other solutions addressing coverage-based rewriting have been proposed so far. More precisely, we first introduce coverage-based queries as a means for ensuring coverage constraint satisfaction in selection-based query results. We then present two approximate algorithms for coverage-based query processing: the first, covRew, initially introduced in [3], relies on data discretization and sampling; the second, covKnn, presented here for the first time, relies on a nearest neighbour approach for coverage-based query processing. The approaches are then experimentally compared with respect to efficiency and effectiveness, on a real dataset.

The remainder of this paper is organized as follows. Section 2 presents preliminary definitions and the reference problem. covRew is briefly described in Section 3, while Section 4 presents covKnn, an alternative nearest neighbour approximate approach for coverage-based query processing. Experimental results are presented in Section 5. Finally, Section 6 discusses related work and Section 7 concludes and outlines future work directions.

2. Problem Definition

Dataset. We assume data are represented as a collection of tabular datasets (e.g., relations in a relational database, data frames in the Pandas analytical environment; called tables in the following) I ≡ {I_1, ..., I_r}, with disjoint schemas for the sake of simplicity. Let A_1, ..., A_m be the attributes of I and let D_{A_j} denote the domain of A_j. A set of discrete-valued attributes S = {S_1, ..., S_n} is of particular concern since such attributes allow the identification of protected groups; they are called sensitive attributes. Examples of sensitive attributes are gender and race.

Data transformation operations. We focus on selection-based data transformations (or queries) over stored or computed datasets, in data processing pipelines that might alter the representation (i.e., the coverage) of specific groups of interest, defined in terms of sensitive attribute values. Examples are data slicing operations in Pandas² or ColumnTransformers in Scikit-Learn³. We consider boolean combinations of atomic selection conditions sel_i ≡ A_i θ v_i, with v_i ∈ D_{A_i}, θ ∈ {<, ≤, >, ≥}, A_i a numeric attribute,⁴ i = 1, ..., d, A_i ∉ S. A selection-based query Q is thus denoted by Q⟨v_1, ..., v_d⟩ or Q⟨v⟩, with v ≡ (v_1, ..., v_d). Attributes A_1, ..., A_d are called the dimensions of the selection-based query.

² https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html
³ https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
⁴ For the sake of simplicity, selection attributes are assumed numeric. The approach can be easily extended to deal with any other ordered domain.
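As an illustration of how a selection-based query maps to Pandas slicing, the sketch below evaluates a conjunctive query Q⟨v_1, ..., v_d⟩ given as a list of atomic conditions; it covers only conjunctions, while the definition above admits arbitrary boolean combinations, and the names are illustrative.

```python
import operator
import pandas as pd

# Comparison operators allowed in atomic selection conditions A_i theta v_i.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def apply_selection(df: pd.DataFrame, conditions) -> pd.DataFrame:
    """Evaluate a conjunctive selection-based query given as
    [(attribute, theta, v_i), ...] via Pandas boolean slicing."""
    mask = pd.Series(True, index=df.index)
    for attr, theta, v in conditions:
        mask &= OPS[theta](df[attr], v)
    return df[mask]

# The query of Example 1:
# best = apply_selection(students, [("math_score", ">=", 80),
#                                   ("reading_score", ">=", 80)])
```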
Coverage constraints. Conditions on the number of entries belonging to a given protected group of interest returned by the execution of a selection-based query can be specified in terms of coverage constraints [4, 12]. A coverage constraint C has the form ↓^{S_l1,...,S_lh}_{s_l1,...,s_lh} ≥ k and specifies that the minimum number of instances with sensitive attribute S_li equal to s_li, i = 1, ..., h, in a query result has to be k. As an example, ↓^{gender}_{female} ≥ 10 specifies that a query result should include at least 10 female individuals. The group a coverage constraint refers to is called the protected group and is characterized by the selection condition PG ≡ S_l1 = s_l1 ∧ ... ∧ S_lh = s_lh. Even if coverage thresholds are usually application-specific and should be determined through statistical analysis, the central limit theorem suggests that the number of representatives should be around 30 and, according to [20], at least 20 to 50 instances for each group are required.
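Checking such a constraint on a query result is straightforward; a minimal sketch follows, assuming the result is a Pandas DataFrame and the protected group is given as sensitive-attribute/value pairs (names are illustrative).

```python
import pandas as pd

def satisfies_coverage(result: pd.DataFrame, protected: dict, k: int) -> bool:
    """Check a coverage constraint: the protected group
    PG = (S_l1 = s_l1 AND ... AND S_lh = s_lh), given as a dict of
    sensitive-attribute/value pairs, must count at least k tuples."""
    mask = pd.Series(True, index=result.index)
    for attr, val in protected.items():
        mask &= result[attr] == val
    return int(mask.sum()) >= k

# The constraint above: at least 10 female individuals in the result.
# satisfies_coverage(q_result, {"gender": "female"}, 10)
```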
Figure 1: Coverage-based query properties. (a) Input and induced queries; (b) Skyline points.

Definition of coverage-based queries. Let C be a set of coverage constraints over a set of sensitive attributes S and let Q⟨v⟩ be a selection-based query. A coverage-based query ξ^C_Q for C and Q is a selection-based query that, given a dataset I, stretches the result Q(I) as little as possible so that the constraints in C are satisfied. More precisely, for each dataset I: (i) a tuple u exists such that ξ^C_Q(I) = Q⟨u⟩(I); (ii) Q ⊆ ξ^C_Q; (iii) all coverage constraints are satisfied by ξ^C_Q(I); (iv) ξ^C_Q is the closest query to Q in terms of minimality, i.e., result cardinality, and proximity, i.e., syntactic closeness. Minimality means that no other query Q′ satisfying conditions (i)-(iii) exists such that card(Q′(I)) < card(ξ^C_Q(I)); proximity means that Q⟨u⟩ is the closest query to Q⟨v⟩, according to the Euclidean distance (defined in a unit space) between v and u, among the queries satisfying conditions (i)-(iii) and minimality.

Properties. It has been proved that a coverage-based query ξ^C_Q satisfies the following properties: (P1) It can be represented in a canonical form in which each selection condition has the form A_i ≤ v_i or A_i < v_i; Q⟨u⟩ can thus be represented as the point u in the d-dimensional space defined by the selection attributes (see Q in Figure 1(a)). (P2) Let ξ^C_Q(I) ≡ Q⟨u⟩(I). It can be proved that u coincides with the upper right vertex of the minimum bounding box of at most d distinct points in I, Q, and the origin of the space (see the green triangle in Figure 1(a)). Such vertices are called induced points and the set of all induced points corresponds to the search space for coverage-based queries. (P3) There is a relationship between u and the skyline of the induced points corresponding to queries that, when executed over I, satisfy C. The dominance relation, needed for the skyline computation, is defined over the selection attributes, assuming the lower the better (see Figure 1(b)). It can be proved that u coincides with the skyline point corresponding to the query with the minimal cardinality at the lowest distance from Q.

The problem. The problem we address concerns the design of efficient algorithms for processing ξ^C_Q. The designed algorithms improve on the naïve approach derived from properties P1, P2, and P3, which is inherently inefficient due to the size of the search space and the skyline computation. We do not rely on any index data structure, so that both stored and computed datasets can be considered. The proposed approximate algorithms are briefly described in the next sections, while an optimized precise one is currently under evaluation.

Figure 2: Data representation and processing. (a) Data distribution; (b) Multi-dimensional grid; (c) Cardinality estimates and solution.

3. Approximate Coverage-based Query Processing through Discretization and Sampling

In order to improve the efficiency of the naïve approach, the approach proposed in [1, 2, 3] relies on a discretized search space instead of generating and visiting the whole set of induced points. Additionally, cardinality estimation, needed for constraint and minimality checking, is performed through a sample-based approach. The proposed technique, called covRew, can be applied over any dataset for which a sample is available or can be easily computed on the fly, and relies on two main steps:

• Pre-processing. The approximate search space is generated by considering the intersection points of a grid obtained by discretizing each axis (one for each selection attribute A_i in Q⟨v⟩, from the query value v_i to the maximum value of A_i in I), using standard binning approaches (e.g., equi-width and equi-depth); a sketch is given after this list. Each point on the grid corresponds to a selection-based query of type Q⟨a⟩, thus satisfying conditions (i) and (ii) of the reference problem.

• Processing. Given I, Q⟨u⟩ such that ξ^C_Q(I) = Q⟨u⟩(I) is then computed by visiting the discretized search space starting from v, one point after the other, at increasing distance from v, checking minimality and proximity in the approximated space. The properties of the discretized search space and the canonical form are considered for pruning the space (algorithm covRewP in [2]), possibly increasing the number of points to be visited at different iterations (algorithms covRewI and covRewIP in [2], depending on whether pruning is considered together with iterations).

The trade-off between accuracy and efficiency of the proposed algorithms has been evaluated on both synthetic and real-world datasets (see [2] for details).
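The following sketch illustrates the pre-processing step under stated assumptions: the dataset is a NumPy array restricted to the canonical-form dimensions (each condition rewritten as A_i ≤ v_i), the binning is equi-depth, and grid points are returned in the visiting order of the processing step (increasing unit-space distance from the query point). It is an illustration, not the covRew implementation of [2].

```python
import itertools
import numpy as np

def equi_depth_grid(data: np.ndarray, q, n_bins: int) -> np.ndarray:
    """Per-axis equi-depth discretization of the region between the query
    point q (canonical form) and each axis maximum; returns the grid
    intersection points sorted by distance from q in the unit space."""
    axes = []
    for j, qj in enumerate(q):
        col = data[:, j]
        beyond = col[col >= qj]            # values between the query and the max
        edges = np.quantile(beyond, np.linspace(0, 1, n_bins + 1))
        axes.append(np.unique(np.append(edges, qj)))
    span = np.array([ax[-1] - ax[0] or 1.0 for ax in axes])  # unit-space scale
    grid = np.array(list(itertools.product(*axes)))
    dist = np.sqrt((((grid - np.asarray(q)) / span) ** 2).sum(axis=1))
    return grid[np.argsort(dist)]          # candidate queries, in visiting order
```

With n_bins = 4 over the two negated score attributes, this reproduces a grid analogous to the one of Example 2 below, up to sampling.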
Example 2. Consider the scenario introduced in Example 1. Q can be rewritten in canonical form as -math_score ≤ -80 AND -reading_score ≤ -80. Figure 2(a) shows the dataset projected over the attributes -math_score and -reading_score. Points are colored and shaped according to the sensitive attribute values: red crosses for free/reduced lunches and black dots for standard lunches. The result of Q in canonical form corresponds to the region at the bottom left of the query point (-80, -80). The space to be discretized corresponds to the green rectangle. By applying the equi-depth approach to each axis with 4 bins, we obtain the grid in Figure 2(b). The discretized search space corresponds to the grid intersection points: each point represents a selection-based query containing Q. During the processing step, the points of the grid are visited, under different strategies, starting from (-80, -80), at increasing distance from it (blue numbers represent the visiting order). For each visited point Q⟨a⟩, we estimate the cardinality of σ_{lunch=free/reduced}(Q⟨a⟩(I)), needed for checking constraint satisfaction, and of Q⟨a⟩(I), by relying on a sample-based approach (see the pairs of numbers associated with each point in Figure 2(c)). The result corresponds to the point u = (-71, -65) (see Figure 2(c)): it satisfies the coverage constraint (card(σ_{lunch=free/reduced}(Q⟨u⟩(I))) = 79, property (iii)) and has the minimum cardinality (370) at the lowest distance from Q (property (iv)). Distances are computed in the reference unit space (normalized values are shown in red in Figure 2(c)). Query Q is thus rewritten into -math_score ≤ -71 AND -reading_score ≤ -65. ♦

The number of cardinality estimations in covRew is in O(n^d), where n is the number of bins used for the discretization of each axis and d is the number of selection conditions. As shown in [2], pruning strategies help in reducing such exponential complexity in real cases. On the other hand, thanks to the sample-based approach, the performance of covRew does not depend on the dataset cardinality.

4. A Nearest Neighbour Approach to Approximate Coverage-based Query Processing

Similarly to the naïve approach derived from properties P1, P2, and P3 in Section 2, covRew works in a space of query points to identify an approximation of Q⟨u⟩, needed to process the coverage-based query ξ^C_Q(I). An alternative approach is to compute an approximation of Q⟨u⟩ directly from the data instances. To this aim, we first look for the missing number of protected group instances closest to the query point, with respect to an input distance function f, and then use them to compute the induced query representing an approximation of Q⟨u⟩. More precisely, given one coverage constraint C ≡ ↓^{S_l1,...,S_lh}_{s_l1,...,s_lh} ≥ k, a selection-based query Q⟨v⟩, and a dataset I, the proposed approach, called covKnn, relies on four main steps (a sketch follows the list):⁵

1. compute the set of protected group instances that are not included in Q(I), corresponding to S_PG ≡ σ_PG(I − Q(I));
2. determine the number of missing protected instances h = k − card(σ_PG(Q(I)));
3. identify the h nearest neighbours to v, with respect to f, in S_PG;
4. return the query induced by the h nearest neighbours as an approximation of Q⟨u⟩.

It is easy to show that covKnn guarantees the satisfaction of properties (i)-(iii) of coverage-based queries; however, minimality and proximity (property (iv)) are not necessarily satisfied by the induced query, even if they are implicitly taken into account during the nearest neighbour computation.

⁵ covKnn can be easily extended to deal with a set of constraints by considering the query induced by the union of the h_i missing instances for each coverage constraint C_i in the input set.
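A minimal sketch of the four steps follows, assuming the dataset is a NumPy array over the canonical-form dimensions (all conditions A_i ≤ v_i) and a boolean mask flags protected-group tuples; the distance f is the box query distance discussed in the next paragraph (Eq. (1)), here with optional generic per-axis weights. Names are illustrative, not the authors' implementation.

```python
import numpy as np

def box_distance(a, v, w=None) -> float:
    """Box query distance of an instance point a from the query box Q<v>
    in canonical form, as in Eq. (1); w holds optional per-axis weights."""
    d = np.maximum(np.asarray(a, float) - np.asarray(v, float), 0.0)
    w = np.ones_like(d) if w is None else np.asarray(w, float)
    return float(np.sqrt(((w * d) ** 2).sum()))

def cov_knn(data, protected_mask, v, k, w=None):
    """covKnn sketch: data is an (n, d) array over the canonical-form
    dimensions; protected_mask flags the protected-group tuples."""
    v = np.asarray(v, float)
    in_q = (data <= v).all(axis=1)                    # tuples already in Q(I)
    missing = k - int((protected_mask & in_q).sum())  # step 2
    if missing <= 0:
        return v                                      # Q already satisfies C
    spg = data[protected_mask & ~in_q]                # step 1: S_PG
    dists = np.array([box_distance(a, v, w) for a in spg])
    nn = spg[np.argsort(dists)[:missing]]             # step 3: h nearest
    return np.maximum(v, nn.max(axis=0))              # step 4: induced query
```

The componentwise maximum in the last step builds the upper-right vertex of the minimum bounding box of the h neighbours and Q, i.e., the induced query of property P2.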
One additional consideration is needed for the distance function to be used. As observed in [11], distance functions based on the general theory of vector p-norms, like the Euclidean or Manhattan distances, are distances between points. They are therefore not suitable when considering distances from boxes corresponding to a query space, as in our case, since it is not obvious which point inside the box should be treated as the reference point. In such a context, the user would usually prefer answers that are close to the periphery of the box. The box query distance proposed in [11] achieves this goal. Given an input query Q⟨v⟩ and a database instance point a that does not belong to Q(I), when considering selection-based queries in canonical form, it is computed as follows:

$$\sqrt{\sum_{i=1}^{d} (d_i(v_i, a_i))^2} \quad \text{where } d_i(v_i, a_i) = \begin{cases} a_i - v_i & \text{if } a_i > v_i \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

The box query distance can be customized to user needs by using weights. In our case, the idea is to use weights that make the distance of a point a to each query axis closer to the area corresponding to the contribution of a to the induced query, for that axis. To this aim, we use the Inverse Aspect-Ratio weights presented in [11].

Example 3. Consider the scenario introduced in Example 1. Under covKnn, the processing proceeds as follows: (i) the search space S_PG is first computed (see the white space in Figure 3(a)): it contains 342 points out of the 1000 points of the whole dataset; (ii) assuming that Q(I) contains 13 protected group instances, the number of missing instances is determined as h = 70 − 13 = 57; (iii) the 57 nearest neighbour points to the query point (-80, -80) in S_PG are computed, using the box query distance (green points in Figure 3(a)); (iv) the induced query of the detected points is then computed (see Figure 3(b)): it corresponds to the query -math_score ≤ -70 AND -reading_score ≤ -69. ♦

In covKnn, no cardinality estimate is required (minimality is not checked and the coverage constraint is satisfied by the induced query by construction). The complexity is O(md), where m is the cardinality of S_PG, since, for each instance in S_PG, the distance from Q is computed according to Eq. (1). As a final remark, we notice that the obtained result is different from the one obtained with covRew (see Example 2): they represent two possible and potentially distinct approximations of the coverage-based query result.

Figure 3: Query nearest neighbours and the corresponding induced query. (a) Query nearest neighbour points; (b) Induced query.

5. Experimental Evaluation

5.1. Experimental setup

All experiments were conducted on a machine with an Intel(R) Core(TM) i9-9900K 3.60GHz CPU and 16GB main memory, running Ubuntu 20.04. All programs were coded in Python 3.8 and the dataset is stored on PostgreSQL 9.6.19.

Dataset. The experiments refer to the Adult dataset,⁶ exploited for assessing many nondiscrimination data management techniques. It contains 48,842 individuals, described by six numerical attributes and eight categorical attributes. For our experiments, we use the attributes sex, race, and marital_status as sensitive attributes.

⁶ https://archive.ics.uci.edu/ml/datasets/Adult
Queries and coverage constraints. We identified 9 selection-based queries Q_{i,j} with a variable number d of selection conditions and selectivity %sel (see Table 1): i corresponds to the number of selection conditions and j indicates the selectivity (h corresponds to %sel ≤ 15%, m to 15% < %sel < 40%, and l to %sel ≥ 40%). We then considered many problem instances by combining the selection-based queries with various coverage constraints, requiring a different number of protected group instances. Table 2 presents the considered instances, pointing out the cardinality of the initial query result card(Q(Adult)), the cardinality of each protected group in the initial query result card(Q↓_{C_i}(Adult)), the missing number of protected group instances according to the selected coverage constraint, and the cardinality of S_PG (#space).

Algorithms. The experiments compare covKnn and covRewIP; the precise algorithm derived from the properties presented in Section 2 is used as a baseline. covRewIP is the grid-based algorithm guaranteeing the best trade-off between accuracy and efficiency according to [2], executed using the equi-depth discretization, with 32 bins, applied over a random sample of size 9,604 of the Adult dataset. In the covKnn implementation, we rely on a naïve top-k approach, in which we keep in a buffer the best k points seen so far, for nearest neighbour computation.⁷

Distance function. We performed many experiments, considering many different weights for the box query distance proposed in [11]. The reported experimental results refer to the box query distance with squared Inverse Aspect-Ratio weights (see [11] for details).

Efficiency and effectiveness indicators. Efficiency is analyzed in terms of the execution time for determining the new query to be executed (averaged over 10 executions). For better pointing out the impact of the dataset size, we include the time to load the dataset for covKnn (0.19 s) and to generate and load the sample for covRewIP (0.071 s).⁸ Effectiveness is evaluated by comparing: (i) the achieved semantic relaxation, called relaxation degree in [2], i.e., the ratio between the cardinality of ξ^C_Q(Adult) − Q(Adult) (i.e., the number of new tuples returned) and the cardinality of Q(Adult); (ii) the syntactical relaxation, i.e., the proximity of Q⟨u⟩ with respect to the initial query Q⟨v⟩. The baseline approach is used only for the analysis of accuracy since experiments, not reported here for space constraints, show that its average performance is worse than that of the approximate approaches, especially for large-scale datasets.

⁷ Notice that, since the box query distance is not a metric function and since, in general, the dataset might not be stored, we cannot rely on usual data structures, like k-d trees.
⁸ The time for computing the query result is not included since it is equal for both approaches.
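A minimal sketch of the two effectiveness indicators follows, assuming min-max normalization of each dimension to the unit space; the exact normalization used in [2] may differ, and the names are illustrative.

```python
import numpy as np

def relaxation_degree(card_rewritten: int, card_original: int) -> float:
    """Semantic relaxation: number of new tuples returned by the rewritten
    query, relative to the cardinality of the original result Q(I)."""
    return (card_rewritten - card_original) / card_original

def proximity(u, v, lo, hi) -> float:
    """Syntactic relaxation: Euclidean distance between Q<u> and Q<v> after
    min-max normalizing each dimension (bounds lo, hi) to the unit space."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    return float(np.sqrt((((u - v) / (hi - lo)) ** 2).sum()))

# e.g., relaxation_degree(card_xi, card_q); proximity(u, v, mins, maxs)
```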
Table 1: Selected queries

IdQ | Query | d | %sel
Q_{2,h} | age ≤ 46 AND education_num ≥ 14 | 2 | 5%
Q_{2,m} | hours_per_week ≥ 41 AND age ≥ 38 | 2 | 16%
Q_{2,l} | education_num ≤ 11 AND hours_per_week ≤ 40 | 2 | 54%
Q_{3,h} | education_num ≥ 13 AND age ≤ 34 AND hours_per_week ≤ 40 | 3 | 5%
Q_{3,m} | age ≤ 54 AND education_num ≥ 13 AND capital_gain ≤ 3000 | 3 | 20%
Q_{3,l} | age ≤ 39 AND hours_per_week ≤ 40 AND capital_loss ≤ 2500 | 3 | 40%
Q_{4,h} | age ≤ 34 AND education_num ≥ 13 AND hours_per_week ≤ 40 AND capital_loss ≤ 1400 | 4 | 5%
Q_{4,m} | capital_gain ≤ 1500 AND age ≤ 34 AND capital_loss ≤ 500 AND hours_per_week ≥ 38 | 4 | 28%
Q_{4,l} | education_num ≤ 13 AND hours_per_week ≥ 32 AND capital_gain ≤ 450 AND age ≤ 49 | 4 | 57%

Table 2: Problem instances

IdI | IdQ | C | card(Q(I)) | card(Q↓_{C_i}(I)) [#added] | #space
I1 | Q_{2,h} | ↓^{sex}_{female} ≥ 780 | 2,447 | 730 [50] | 15,462
I2 | Q_{2,m} | ↓^{sex}_{female} ≥ 1450 | 7,962 | 1,387 [63] | 14,805
I3 | Q_{2,m} | ↓^{sex,marital_status}_{female,married-civ-spouse} ≥ 280 | 7,962 | 249 [31] | 2,231
I4 | Q_{2,m} | ↓^{marital_status}_{married-civ-spouse} ≥ 5700 | 7,962 | 5,545 [155] | 16,834
I5 | Q_{2,l} | ↓^{sex}_{female} ≥ 10,600 | 26,504 | 10,525 [75] | 5,667
I6 | Q_{2,l} | ↓^{sex}_{female} ≥ 11,200 | 26,504 | 10,525 [675] | 5,667
I7 | Q_{2,l} | ↓^{sex}_{female} ≥ 12,000 | 26,504 | 10,525 [1475] | 5,667
I8 | Q_{2,l} | ↓^{sex}_{female} ≥ 13,000 | 26,504 | 10,525 [2475] | 5,667
I9 | Q_{3,h} | ↓^{sex}_{female} ≥ 1250 | 2,603 | 1,197 [53] | 14,995
I10 | Q_{3,h} | ↓^{race}_{black} ≥ 210 | 2,603 | 194 [16] | 4,491
I11 | Q_{3,m} | ↓^{sex}_{female} ≥ 3100 | 9,186 | 2,960 [140] | 13,232
I12 | Q_{3,l} | ↓^{sex}_{female} ≥ 8500 | 20,064 | 8,444 [56] | 7,748
I13 | Q_{3,l} | ↓^{sex}_{female} ≥ 9000 | 20,064 | 8,444 [556] | 7,748
I14 | Q_{4,h} | ↓^{sex}_{female} ≥ 1230 | 2,507 | 1,159 [71] | 15,033
I15 | Q_{4,m} | ↓^{sex}_{female} ≥ 4400 | 13,541 | 4,330 [70] | 11,862
I16 | Q_{4,m} | ↓^{race}_{black} ≥ 1450 | 13,541 | 1,369 [81] | 3,316
I17 | Q_{4,m} | ↓^{sex,race}_{female,black} ≥ 680 | 13,541 | 633 [47] | 1,675
I18 | Q_{4,l} | ↓^{sex}_{female} ≥ 8650 | 27,606 | 8,594 [56] | 7,598
I19 | Q_{4,l} | ↓^{sex}_{female} ≥ 9100 | 27,606 | 8,594 [506] | 7,598

5.2. Experimental results

Impact of the search space size on performance. From Figure 4(a), we observe that, as expected, the time increases as the size of the search space m increases, for a fixed number of selection conditions. The search space size can increase by either considering queries with the same or similar selectivity and varying the considered protected group (e.g., I3 < I2 < I4; I10