=Paper=
{{Paper
|id=Vol-2841/PIE+Q_1
|storemode=property
|title=Data Wrangling for Fair Classification
|pdfUrl=https://ceur-ws.org/Vol-2841/PIE+Q_1.pdf
|volume=Vol-2841
|authors=Lacramioara Mazilu,Nikolaos Konstantinou,Norman Paton,Alvaro A. A. Fernandes
|dblpUrl=https://dblp.org/rec/conf/edbt/Mazilu0PF21
}}
==Data Wrangling for Fair Classification==
Lacramioara Mazilu (University of Manchester, Manchester, United Kingdom, lacramioaramazilu@gmail.com), Norman W. Paton (University of Manchester, Manchester, United Kingdom, npaton@manchester.ac.uk), Nikolaos Konstantinou (University of Manchester, Manchester, United Kingdom, nkons@live.com), Alvaro A. A. Fernandes (University of Manchester, Manchester, United Kingdom, fernandesaaa@gmail.com)

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23-26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Whenever decisions that affect people are informed by classifiers, there is a risk that the decisions can discriminate against certain groups as a result of bias in the training data. There has been significant work to address this, based on pre-processing the inputs to the classifier, changing the classifier itself, or post-processing the results of the classifier. However, upstream from these steps, there may be a variety of data wrangling processes that select and integrate the data that is used to train the classifier, and these steps could themselves lead to bias. In this paper, we propose an approach that generates schema mappings in ways that take into account bias observed in classifiers trained on the results of different mappings. The approach searches a space of candidate interventions in the mapping generation process, which change how these mappings are generated, informed by a bias-aware fitness function. The resulting approach is evaluated using the Adult Census and German Credit data sets.

KEYWORDS

fairness, bias, classification, data preparation, data wrangling

1 INTRODUCTION

Fairness in machine learning is important because machine learning supports decision making; problems resulting from bias have been widely recognised, and there are many results on fairness-enhancing interventions in machine learning [2]. Proposals have been made for taking fairness into account before, during and after learning.

Although proposals that focus on interventions before learning have considered a variety of techniques for selecting or modifying training data [7], most of this work assumes that the training data is already available. As such, the interventions take place after the data preparation steps that select and integrate data sets for analysis. As data scientists spend a considerable portion of their time on such steps, an opportunity seems to exist for making interventions earlier in the data processing pipeline.

In our previous work [12], we investigated how data wrangling processes could be adjusted to address dataset properties that can give rise to bias, specifically sample size disparity and proxy attributes. Such interventions are promising, in that they can be applied to unlabelled data, but there is no direct guarantee that addressing the underlying property will in fact reduce classifier bias.

In this paper, we investigate interventions in the data wrangling process that directly target classifier bias, when labels are available. Specifically, we assume a situation in which data preparation selects and combines datasets for use in training, and we intervene in data preparation to ensure that these selection and combination decisions are informed by the fairness of the resulting classifier. As in other work, our focus is on the pre-training step; however, we do not rely on the assumption that the data has already been gathered and that fairness is achieved through operations on the integrated data source. Instead, our focus is on creating the integrated data source in a way that takes fairness measurements into consideration.

The contributions are as follows:

(1) The proposal of a strategy for fairness-aware data preparation for classification.
(2) The realisation of the strategy as a search for fair data preparation plans.
(3) An evaluation of (2) on benchmark data sets that shows how the interventions can improve a specific fairness metric, namely demographic parity. We also show how the accuracy of the trained classifier is impacted by the interventions.

The remainder of this paper is structured as follows. Section 2 provides the context for this work by reviewing related results. Section 3 describes technical background on data preparation required for later sections. Section 4 details the approach, in which a space of alternative wrangling strategies is explored. Section 5 presents an empirical evaluation, and Section 6 reviews progress against the objectives.
2 RELATED WORK

The problem considered in this paper is as follows: assume we have access to data sets that can be used to populate a target schema that is suitable for training a classifier; combine the data sets in such a way as to reduce the bias in the resulting classifier.

Previous related work has tended to assume that there is an existing data set that is suitable for training the classifier. In such a case, the data is considered to have a sensitive attribute (such as gender or race), and the objective is to train a classifier that reduces bias with respect to some outcome, represented by labels in the data (such as a decision to hire or promote an individual). This section reviews existing work on fairness in machine learning, looking at interventions that are designed to reduce bias before, during and after training. The emphasis is on interventions before training, as wrangling falls there.

Before. Approaches to reducing bias that influence the data used for training may remove attributes that correlate with the sensitive attribute, change the labels to reduce bias, choose samples of the available data, or assign weights to different rows that impact their influence on the classifier [7]. Alternatively, pre-processing techniques may directly tackle properties of the underlying data set that may be problematic, such as coverage [10].

There has been less work on data wrangling, the focus of this paper; although some papers explicitly discuss data preparation, such work has generally addressed later stages of the pipeline than this paper. Valentim et al. [15] empirically compare several interventions, specifically the removal of the sensitive attribute, feature discretisation and sampling techniques. Accinelli et al. [1] consider coverage within data preparation, though primarily by selecting data rather than determining how the data should be wrangled. In our previous work [12], we steer the generation of data preparation plans in ways that reflect risk factors for bias. The current paper follows a similar approach, in the sense that plan generation is steered, but directly calls the classifier on candidate plans, thereby steering plan generation based on direct evidence of bias.

During. Approaches to reducing bias within the learning algorithm are often associated with metrics that quantify some notion of bias, and the learning algorithm then explicitly trades off bias with accuracy [13]. For example, Hajian et al. [5] provide techniques for constraining the levels of bias exhibited by classification rules to a specified level.

After. Approaches to reducing bias after learning operate by selectively changing labels at prediction time (e.g., [6, 8]).

Work on bias-aware data preparation can be seen as complementary to most other work on fair classification; benefits that arise with respect to fairness in upstream processes simply reduce the scale of the problem for later interventions.
3 TECHNICAL BACKGROUND

The approach to fair data preparation depends on identifying interventions during data wrangling that affect the behaviour of a wrangling pipeline. We build on the VADA data wrangling system [9], the relevant steps of which are illustrated in Figure 1. In VADA, given a target schema definition and the data sources, the system can generate schema mappings that populate the target. We particularly exploit the automatic generation of data preparation plans to generate new ways of integrating the data.

Figure 1: Data wrangling pipeline.

The steps that are relevant to this paper are:

Matching: identify relationships between source and target attributes, where the former may be suitable for providing data for the latter.

Profiling: identify candidate keys for the source relations, and (partial) inclusion dependencies that identify overlaps between attribute values in different sources.

Mapping: informed by the results of matching and profiling, generate candidate mappings in the form of views that can be used to populate the target table from the sources [11].

Selection: given the results of the mappings, select the top k results from the mappings that best satisfy some criterion. In this paper, the criterion used is completeness: prefer the mappings with the fewest missing values.
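As an illustration of the Selection step, the Python sketch below ranks candidate mapping results by completeness and keeps the top k. It is a minimal sketch, not the VADA implementation; the representation of a mapping result as a list of rows with None marking missing values is an assumption made for the example.

<pre>
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]  # one tuple of a mapping result; None marks a missing value


def completeness(rows: List[Row]) -> float:
    """Fraction of non-null cells across all rows of one mapping result."""
    cells = [value for row in rows for value in row.values()]
    if not cells:
        return 0.0
    return sum(value is not None for value in cells) / len(cells)


def select_top_k(mapping_results: Dict[str, List[Row]], k: int) -> List[str]:
    """Selection step: keep the k mapping results with the fewest missing values."""
    ranked = sorted(mapping_results,
                    key=lambda name: completeness(mapping_results[name]),
                    reverse=True)
    return ranked[:k]
</pre>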
4 APPROACH

4.1 Overview of Approach

The approach to fair data preparation for classification involves exploring a space of candidate data preparation plans, and assessing the fairness of the classifiers built using the results of these plans. The steps in the approach are illustrated in Figure 2, and are discussed below.

In the approach, a Candidate Solution captures a collection of interventions into the wrangling process that are deemed to be helpful for reducing bias. Specifically, the interventions involve removing matches or inclusion dependencies from consideration during data wrangling. For example, assume that m1 is a match and i1 is an inclusion dependency. It is possible that the data resulting from the wrangling pipeline in Figure 1 produces a fairer output data set when: (i) m1 is available but i1 is not; (ii) m1 is not available but i1 is; (iii) m1 and i1 are both available; or (iv) neither m1 nor i1 is available. These different environments in which to wrangle constitute a search space of interventions that lead to the production of different data sets, which in turn may lead to more or less biased classifiers.
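To make the search space concrete, the following sketch enumerates the wrangling environments obtained by withholding subsets of a match m1 and an inclusion dependency i1, corresponding to cases (i)-(iv) above. The identifiers and the enumeration are purely illustrative; in the actual approach each environment would be handed to the wrangling pipeline of Figure 1.

<pre>
from itertools import combinations

# Illustrative identifiers for one match and one inclusion dependency (cases (i)-(iv) above).
CANDIDATE_INTERVENTIONS = ["m1", "i1"]


def wrangling_environments(interventions):
    """Enumerate every subset of interventions, i.e., every combination of matches
    and inclusion dependencies that could be withheld from the wrangling pipeline."""
    for size in range(len(interventions) + 1):
        for subset in combinations(interventions, size):
            yield frozenset(subset)


# The four environments for {m1, i1}: withhold neither, only one of them, or both.
for environment in wrangling_environments(CANDIDATE_INTERVENTIONS):
    print(sorted(environment) or "no exclusions")
</pre>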
Assume we have a match m : S.a → T.a′ relating attribute a from source S to attribute a′ in target T. The removal of m from the wrangling process can have one of the following effects: (i) m is replaced by another match involving the same table, so that now S.a′′ → T.a′ and as a result T.a′ is populated differently; (ii) m is replaced by another match involving a different table, so that now R.b → T.a′ and as a result T.a′ is populated differently; or (iii) no alternative match is found, so T.a′ is populated with null values.

Assume we have an inclusion dependency I = R.a ⊆θ S.b between attributes in the sources R and S, where θ is the degree of overlap between the attributes a and b. The removal of I can have one of the following effects: (i) there is another inclusion dependency between R and S, which leads to the same tables being considered for joining, but on different attributes; (ii) there is another, indirect way in which R and S can be joined, through an intermediate table that can be used instead by the mapping generator; or (iii) the tables are no longer joined with each other.

To consider different ways of preparing the data, the approach is to explore the space of possible interventions, with a view to identifying combinations that lead to less biased results. The space of alternative interventions is explored using Tabu search [4], a local search algorithm that employs a diversity heuristic to avoid becoming stuck in local optima. When describing the approach, we will periodically refer to a running example.

Example 1. Assume that a model is trained to predict which individuals will be hired, with the sensitive binary attribute gender, which can have the values male or female. Consider a target schema T(name, age, qualification, gender, hired), in which hired indicates the hiring decision for the person with the given name, age, qualification and gender.

Figure 2: Steps in approach.

4.2 The Steps

In this section, each of the steps in Figure 2 is described in turn.

Stopping condition. Each iteration of the search uses the data wrangling pipeline in Figure 1 to create new data preparation plans, the data from which is used to train a classifier. As such, each iteration can be considered to be quite expensive, and certainly more expensive than evaluating a typical fitness function. As a result, the search cannot be allowed to carry out arbitrary numbers of iterations. The search terminates with its best solution when it has completed a specified number of wrangles (the number of executions of the data wrangling pipeline in Figure 1 during the search in Figure 2). The number is an input parameter that is usually chosen by considering the trade-off between runtime and bias reduction (an analysis is reported in our previous work [12]).

Create neighbourhood. A Candidate Solution is a set of interventions. An intervention is a match or inclusion dependency that is to be excluded from consideration when generating an integration using the wrangling pipeline in Figure 1. The Current Candidate Solution is the starting point for the creation of a neighbourhood. The neighbourhood is a set of Candidate Solutions, each of which is obtained by extending the Current Candidate Solution with a single intervention from a set of Candidate Interventions. The Candidate Interventions are the matches and inclusion dependencies used in the mappings created by running the wrangling pipeline in the context of the interventions from the Current Candidate Solution.

Example 2. Following on from Example 1, assume that there are sources S1(name, age, height) and S2(name, hired), and that there are matches m1 from S1.name to T.name, m2 from S1.age to T.age, m3 from S2.name to T.name, and m4 from S2.hired to T.hired. Assume that a view has been produced for populating the target that contains the join S1 ⋈ S2 on S1.name = S2.name, informed by an inclusion dependency i1 between S1.name and S2.name. In addition, assume that the Candidate Interventions are {m1, m2, m3, m4, i1} and that the Current Candidate Solution is {m1}. Then the neighbourhood will consist of the Candidate Solutions {m1, m2}, {m1, m3}, {m1, m4} and {m1, i1}.
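A minimal sketch of the Create Neighbourhood step, reproducing Example 2, is given below. It assumes interventions are represented by string identifiers and Candidate Solutions by frozensets; the tabu check anticipates the Update Tabu List step described later.

<pre>
from typing import FrozenSet, Iterable, List, Set

CandidateSolution = FrozenSet[str]  # the matches / inclusion dependencies excluded from wrangling


def create_neighbourhood(current: CandidateSolution,
                         candidate_interventions: Iterable[str],
                         tabu_list: Set[CandidateSolution]) -> List[CandidateSolution]:
    """Extend the current solution by one candidate intervention at a time,
    skipping solutions that have already been visited (recorded in the tabu list)."""
    neighbourhood = []
    for intervention in candidate_interventions:
        neighbour = frozenset(current | {intervention})
        if neighbour != current and neighbour not in tabu_list:
            neighbourhood.append(neighbour)
    return neighbourhood


# Example 2: Current Candidate Solution {m1}, Candidate Interventions {m1, m2, m3, m4, i1}.
current = frozenset({"m1"})
neighbourhood = create_neighbourhood(current, ["m1", "m2", "m3", "m4", "i1"], tabu_list=set())
# neighbourhood contains {m1, m2}, {m1, m3}, {m1, m4} and {m1, i1}
</pre>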
Evaluate candidate solutions. Each of the Candidate Solutions in the neighbourhood is passed in turn to the data wrangling pipeline in Figure 1. The pipeline generates mappings using the automatically derived matches and inclusion dependencies, but excluding those in the interventions of the Candidate Solution. The resulting mappings are evaluated, thereby associating each Candidate Solution with a data set that populates the target schema.

Train classifier. Having generated a data set for each Candidate Solution, the next step is to train a classifier on each of the data sets, and to compute the bias of each of the resulting classifiers. The overall approach is independent of the type of classifier used, but in the experiments we use the J48 implementation of the C4.5 decision tree learning algorithm [14].

For each data set, we carry out k-fold cross validation. As such, the classifier is trained k times using k-1 folds and evaluated on the remaining fold, each time with a different evaluation fold. Its fairness is then assessed by averaging the bias over the evaluation folds. The overall approach is independent of the notion of bias used, but in the experiments we use demographic parity [3]. Demographic parity measures group fairness. Where the Positive Rate (PR) for a group is the fraction of the items in the group that have the given outcome (e.g., the fraction of the people in the group that are hired), the demographic parity for a fold f is computed as:

DP_f = | PR_f(G1) - PR_f(G2) |    (1)

where G1 is one group (e.g., females) and G2 is the other (e.g., males). Where DP_f = 0, there is no bias. The overall demographic parity for a Candidate Solution is computed as:

DP_CS = (1/k) * sum_{f=1..k} DP_f    (2)

Note that the role of training during the execution of the steps in Figure 2 is to obtain evidence to inform the search for fairer plans, and not to produce a final classifier.
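The sketch below illustrates Equations (1) and (2): demographic parity on each evaluation fold, averaged over k folds of cross validation. It uses scikit-learn's DecisionTreeClassifier as a stand-in for Weka's J48 (both are C4.5-style decision trees), and the column names follow the running example; both choices are assumptions made purely for illustration.

<pre>
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


def demographic_parity(y_pred, sensitive, group_1, group_2, positive):
    """Equation (1): |PR_f(G1) - PR_f(G2)| for one evaluation fold."""
    pr_1 = np.mean(y_pred[sensitive == group_1] == positive)
    pr_2 = np.mean(y_pred[sensitive == group_2] == positive)
    return abs(pr_1 - pr_2)


def cross_validated_bias(df: pd.DataFrame, label="hired", sensitive_attr="gender",
                         positive="yes", k=5) -> float:
    """Equation (2): average demographic parity over the k evaluation folds."""
    X = pd.get_dummies(df.drop(columns=[label]))
    y = df[label]
    fold_bias = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        clf = DecisionTreeClassifier()  # stand-in for the J48 (C4.5) classifier
        clf.fit(X.iloc[train_idx], y.iloc[train_idx])
        y_pred = clf.predict(X.iloc[test_idx])
        sensitive = df[sensitive_attr].iloc[test_idx].to_numpy()
        fold_bias.append(demographic_parity(y_pred, sensitive, "female", "male", positive))
    return float(np.mean(fold_bias))
</pre>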
Choose best solution. The best solution should have low bias. However, as the approach explores a space of interventions that are liable to lead to increasingly sparse results (due to the deletion of matches and inclusion dependencies), it is important not to end up with a solution that has low bias but poor accuracy as a result of sparse training data. So, to prefer solutions that retain more data that can be used as evidence by the classifier, the objective function for the search is:

obj = (w1 * DP_CS) + (w2 * null_CS)    (3)

where null_CS is the ratio of nulls in the data set, and w1 and w2 are weights, which are both set to 0.5 in the experiments, i.e., completeness and fairness are equally important in the end result. The search thus prefers solutions with low bias and high completeness.
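A direct transcription of Equation (3), under the assumption that the wrangled data set is held in a pandas DataFrame:

<pre>
import pandas as pd


def null_ratio(df: pd.DataFrame) -> float:
    """null_CS in Equation (3): the fraction of missing values in the wrangled data set."""
    return float(df.isna().to_numpy().mean())


def objective(dp_cs: float, null_cs: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """Equation (3): lower is better, trading bias off against completeness."""
    return w1 * dp_cs + w2 * null_cs
</pre>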
Update Tabu list. The Tabu search maintains a collection of points in the search space that have already been visited. This list is updated here to include all solutions that have been evaluated in the current iteration. The Tabu List is also used during the Create Neighbourhood step, to avoid revisiting parts of the search space that have been explored before.

Example 3. Following on from Example 2, assume that the current Tabu list is [{m1}]. Then, after the exploration of the Candidate Solutions in the neighbourhood, i.e., {m1, m2}, {m1, m3}, {m1, m4} and {m1, i1}, these are added to the Tabu list, so that the search keeps track of the plans that have been explored and thus need not be explored again.

Tabu diversity mechanism. Tabu search is essentially a greedy local search algorithm, with a single Current Candidate Solution, which is normally replaced at each iteration by the best solution in its neighbourhood, in accordance with the objective function. To avoid being trapped in local optima, a diversity mechanism can be used to jump to a new location in the search space. In our implementation of Tabu, when the best solution has not been improved for several iterations, the new Best Candidate Solution is set to the highest scoring of a subset of plans from the Tabu List.
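Putting the steps together, the following sketch outlines the search of Figure 2 as a Tabu search over sets of excluded matches and inclusion dependencies. The evaluate hook is hypothetical: it stands for one wrangle (Figure 1) plus classifier training, returning the objective score of Equation (3) and the DP_CS of Equation (2). The diversity mechanism is simplified here to a jump to a random previously visited plan, and the optional bias threshold mirrors the early-stopping condition used in some experiments (Section 5.2).

<pre>
import random


def tabu_search(candidate_interventions, evaluate, max_wrangles=80,
                bias_threshold=None, stagnation_limit=5):
    """Sketch of the search in Figure 2. `evaluate(solution)` is a hypothetical hook
    that wrangles under the given exclusions, trains the classifier, and returns
    (obj, DP_CS); lower objective scores are better."""
    current = frozenset()                       # start with no interventions
    best_score, _ = evaluate(current)
    best, tabu, wrangles, stagnant = current, {current}, 1, 0

    while wrangles < max_wrangles:
        neighbourhood = [frozenset(current | {iv}) for iv in candidate_interventions
                         if iv not in current and frozenset(current | {iv}) not in tabu]
        if not neighbourhood:
            break
        scored = []
        for solution in neighbourhood:          # one wrangle + one training run each
            score, dp_cs = evaluate(solution)
            scored.append((score, solution))
            tabu.add(solution)
            wrangles += 1
            if bias_threshold is not None and dp_cs < bias_threshold:
                return solution                 # acceptable bias reached, e.g. DP_CS < 0.05
            if wrangles >= max_wrangles:
                break
        score, current = min(scored, key=lambda pair: pair[0])
        if score < best_score:                  # greedy move improved on the best so far
            best, best_score, stagnant = current, score, 0
        else:
            stagnant += 1
            if stagnant >= stagnation_limit:    # simplified diversity mechanism
                current, stagnant = random.choice(list(tabu)), 0
    return best
</pre>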
5 EVALUATION

5.1 Experimental Setup

Data Sets. To investigate the effectiveness of the approach described in Section 4, we have evaluated it in the context of data preparation scenarios derived from the following benchmark data sets:

• German Credit (https://www.kaggle.com/uciml/german-credit): The role of the classifier is to determine whether or not it is risky to give credit to a person. The dataset contains 1,000 tuples, each with 21 attributes. The sensitive attribute is Gender, which contains the values male (69%) and female (31%). The class attribute is Risky, which contains the values Yes or No, stating whether or not it is risky to give someone a loan.

• Adult Census (https://archive.ics.uci.edu/ml/datasets/adult): The role of the classifier is to predict whether a person makes over $50K a year. The dataset contains 10,000 tuples (due to limited resources, we used stratified sampling to choose 10,000 tuples out of the 48,842 tuples in the original dataset; the sample maintains the properties of the initial dataset), each with 14 attributes. The sensitive attribute is Gender, which contains the values male (66%) and female (33%). The class attribute is Salary, which contains the values >50K and <=50K, stating whether or not someone is predicted to earn more than 50K per year.

These are benchmark data sets that have been used in other studies on fairness, e.g., [15-18]. However, each is supplied as a single table, whereas experimenting on data wrangling requires several tables. So, for the experiment, each of the data sets is partitioned horizontally into groups of rows, which are then vertically partitioned. This subdivision of the tables creates different ways in which the target can be populated, and allows interventions to take place that apply to subsets of the data. For Adult Census, we horizontally divided the dataset using the 42 contained country values, and then divided each partition into 2 vertical partitions, amounting to 84 initial sources. For German Credit, we divided the initial dataset into 4 horizontal partitions, each of which was then divided into 3 vertical partitions, resulting in 12 initial sources.
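To give an idea of how a single benchmark table can be turned into multiple sources, the sketch below horizontally partitions a DataFrame on one attribute and then splits each partition vertically into column groups. The column names and the presence of a shared join key are illustrative assumptions, not the exact partitioning code used in the experiments.

<pre>
import pandas as pd


def partition_sources(df: pd.DataFrame, horizontal_key: str, vertical_groups, join_key: str):
    """Split one table into many sources: one horizontal partition per value of
    `horizontal_key`, each then split vertically into the given column groups.
    Every vertical fragment keeps `join_key` so the target can be re-populated by joins."""
    sources = []
    for _, horizontal in df.groupby(horizontal_key):
        for columns in vertical_groups:
            sources.append(horizontal[[join_key] + [c for c in columns if c != join_key]].copy())
    return sources


# Illustrative use in the spirit of the Adult Census setup (42 country values x 2
# vertical partitions = 84 initial sources); column names here are hypothetical:
# sources = partition_sources(adult_df, "native_country",
#                             [["age", "education", "gender"], ["occupation", "salary"]],
#                             join_key="person_id")
</pre>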
In the experiments, the number of sources is varied; adding additional sources provides more ways of populating the target, and in turn more potential interventions to improve fairness. In order to create opportunities for different intervention plans, we created alternative sources with synthetic constant values for non-sensitive attributes. Thus, for each benchmark dataset, we created five scenarios, each with an increase of 25% alternative sources over the previous scenario; e.g., for Adult Census, we created five scenarios with 84, 105, 127, 147, and 168 sources, i.e., with a 0, 25, 50, 75, and 100% increase over the initial number of sources (84). The target schema is that of the original benchmark data sets.

Configuring the search. There are a number of parameters that control the behaviour of the search. The maximum number of wrangles for both data sets is set to 80; this is to obtain manageable experiment run times, while also allowing enough searching to generate plausible results. The value was obtained after a sensitivity analysis. In addition, up to the top 130 mappings are obtained from mapping generation; in practice, this is more than sufficient for the scenarios used, and often the mapping generator will produce significantly fewer mappings than this.

In the German Credit data set, the maximum number of tuples in the result of the wrangling process is set to 620 (the original data set contained 1,000 rows). This is to ensure that the results of the fair wrangling can be selective, leaving out results that are associated with more bias. For the same reason, the maximum number of tuples for the Adult Census dataset is 6,616 out of 10,000. The results of the classifiers are obtained using 5-fold cross validation.

Classifier. The reported results are for the J48 decision tree classifier from Weka (https://www.cs.waikato.ac.nz/ml/weka/). The experiments were also run with other classifiers, such as Logistic Regression, obtaining negligible variation between the experimental results; thus, we report only one set of results. Although the experiments involve binary classification tasks, the overall approach does not depend upon this.

Baseline. The baseline in the experiments is obtained by running the wrangling process without the approach from Section 4. Thus we are able to obtain a direct indication of the effectiveness of the interventions made. Note that there is no direct competitor in the literature with which to compare.

Experimental setup. The experiments were run on an Intel Core i5 with 2×2.7 GHz and 8 GB RAM.

5.2 Results

German Credit Data. Figures 3 and 4 show the results of the experiment with the German Credit data set. For both figures, the horizontal axis reports the number of sources in each scenario. In Figure 3, the vertical axis represents the demographic parity bias (computed using Equation 2), while in Figure 4 the vertical axis represents the accuracy of the classifier trained and evaluated on the output datasets. Both bias and accuracy are averages computed through cross validation. The following can be observed: (i) The fair wrangling approach performs as well as or better than the baseline for demographic parity in all scenarios. (ii) Both methods sometimes have the same demographic parity for different numbers of sources; this is because the same plan may be selected even though the number of available sources grows. (iii) The fairness interventions do not always provide improving results as more sources become available; this is because the search is not exhaustive, and thus, although larger numbers of sources provide more opportunities for effective interventions to be discovered, there is no guarantee that the best intervention plans will be obtained. As mentioned in Section 5.1, the parameter for the amount of explored search space was set based on a sensitivity analysis similar to the one reported for the experiments presented in our previous work [12]. Furthermore, the additional sources may not always provide good opportunities to increase fairness. (iv) Accuracy did not suffer as a result of the interventions.

Figure 3: Demographic parity for the German Credit data set for different numbers of sources.
Figure 4: Classifier accuracy for the German Credit data set for different numbers of sources.

Adult Census Data. Figures 5 and 6 show the results of the experiment with the Adult Census data set. The following can be observed: (i) The results are similar to those for the German Credit data set; we do not repeat the recurring explanations here. (ii) The overall bias levels are smaller for the Adult Census data set than for the German Credit one, and in several cases bias is almost completely removed. (iii) The accuracy for the Adult Census data set is higher than in the German Credit scenarios. This may be due to the difference in the amount of data used to train the classifier, i.e., Adult Census contains significantly more rows than the German Credit training set, which may lead to an underfit model for the German Credit scenarios.

Figure 5: Demographic parity for the Adult Census data set for different numbers of sources.
Figure 6: Classifier accuracy for the Adult Census data set for different numbers of sources.

Runtime. Table 1 shows the runtime for each dataset considering its different scenarios, i.e., with varying percentages of synthetic data sources (as explained in Section 5.1). The runtime is expressed in seconds. Note that for the same percentage, the number of sources for each dataset differs; e.g., for the 50% case, the German Credit scenario contains 18 sources, while the Adult Census scenario contains 127 sources.

Table 1: Runtime (in seconds) for varying numbers of sources (percent of synthetic sources) on each dataset.

Dataset       | 0%    | 25% | 50%   | 75%   | 100%
German Credit | 427   | 562 | 780   | 1,002 | 1,309
Adult Census  | 1,416 | 48  | 1,691 | 1,605 | 41

For the German Credit dataset, the runtime increases as the number of sources increases. This is because: i) the Tabu search space increases and there are more intervention plans to be explored; and ii) the wrangling process takes longer in the cases with more sources, because each of the wrangling components described in Section 3 depends on the number of input data sources, e.g., the mapping component considers combinations between all subsets of sources, and thus the number of subsets increases with the number of initial sources [11].

For the Adult Census dataset, it is interesting to note that there is no clear pattern in the runtime values. This is because, for some of the scenarios, e.g., 25% and 100%, the search ends if the bias is below a set threshold. Such a threshold is needed because, in the case of demographic parity, one cannot expect the ratios of positive labels for the male and female values of the sensitive attribute to be exactly equal; thus, we need to allow an acceptable difference. In these scenarios, we set the threshold at 0.05, i.e., DP_CS < 0.05 (see Equation 2). The threshold is set according to the fairness measure used, e.g., demographic parity in our case. For the Adult Census scenarios with 25% and 100% synthetic sources, the wrangling process stops after the initial wrangle, as it detects that the output dataset presents an acceptable bias value, and thus it does not prepare further interventions from the initial wrangle.
6 CONCLUSIONS

Our previous work tackled underlying causes of bias for unlabelled data [12]. This paper complements that work for labelled data, by wrangling in a way that directly responds to the bias observed when wrangling with different mappings. This work focuses on creating fair datasets using demographic parity as the measure of bias. However, other metrics may be investigated, e.g., equalized odds, equal opportunity, etc., depending on the properties that are desired of the output dataset.

We now revisit the claimed contributions from the introduction:

(1) The proposal of a strategy for fairness-aware data preparation for classification. The proposed strategy searches a space of interventions that impact how data is integrated for analysis. The approach builds on the ability to automate aspects of the data preparation process. Interventions are carried out, integrations generated, and the resulting data is used to train classifiers. The classifiers are then tested for bias, and the most promising interventions investigated further.

(2) The realisation of the strategy as a search for fair data preparation plans. The strategy has been implemented as a Tabu search over a space of interventions that include the exclusion of matches and inclusion dependencies. Removing selected matches and inclusion dependencies may reduce bias by removing problematic mappings, or by prioritising alternative ways of preparing the data.

(3) An evaluation of (2) for benchmark data sets that shows how the interventions can improve a specific fairness metric, namely demographic parity. We also show how the accuracy of the trained classifier is impacted by the interventions. Results using two evaluation scenarios show that the approach reduces bias in comparison with a non-fairness-aware base case. The evaluation also shows that, compared to the baseline, the accuracy of the classifier trained as a result of the interventions did not suffer.

ACKNOWLEDGMENTS

We are pleased to acknowledge the support of the UK Engineering and Physical Sciences Research Council, through the VADA Programme Grant (EP/M025268/1).

REFERENCES

[1] Chiara Accinelli, Simone Minisi, and Barbara Catania. 2020. Coverage-based Rewriting for Data Preparation. In Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, March 30, 2020 (CEUR Workshop Proceedings, Vol. 2578), Alexandra Poulovassilis et al. (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2578/PIE4.pdf
[2] Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. 2019. A comparative study of fairness-enhancing interventions in machine learning. In FAT*. ACM, 329-338. https://doi.org/10.1145/3287560.3287589
[3] Pratik Gajane. 2017. On formalizing fairness in prediction with machine learning. CoRR abs/1710.03184 (2017). arXiv:1710.03184 http://arxiv.org/abs/1710.03184
[4] Fred Glover. 1986. Future paths for integer programming and links to artificial intelligence. Computers & Operations Research 13, 5 (1986), 533-549. https://doi.org/10.1016/0305-0548(86)90048-1
[5] Sara Hajian and Josep Domingo-Ferrer. 2013. A Methodology for Direct and Indirect Discrimination Prevention in Data Mining. IEEE Trans. Knowl. Data Eng. 25, 7 (2013), 1445-1459. https://doi.org/10.1109/TKDE.2012.72
[6] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of Opportunity in Supervised Learning. In NIPS (Barcelona, Spain). Red Hook, NY, USA, 3323-3331.
[7] Faisal Kamiran and Toon Calders. 2011. Data preprocessing techniques for classification without discrimination. KAIS 33 (2011), 1-33.
[8] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. 2012. Decision Theory for Discrimination-Aware Classification. In ICDM. USA, 924-929. https://doi.org/10.1109/ICDM.2012.45
[9] Nikolaos Konstantinou, Edward Abel, Luigi Bellomarini, Alex Bogatu, Cristina Civili, Endri Irfanie, Martin Koehler, Lacramioara Mazilu, Emanuel Sallinger, Alvaro Fernandes, Georg Gottlob, John Keane, and Norman Paton. 2019. VADA: An Architecture for End User Informed Data Preparation. Journal of Big Data 6:74 (2019). https://doi.org/10.1186/s40537-019-0237-9
[10] Yin Lin, Yifan Guan, Abolfazl Asudeh, and H. V. Jagadish. 2020. Identifying Insufficient Data Coverage in Databases with Multiple Relations. Proc. VLDB Endow. 13, 11 (2020), 2229-2242. http://www.vldb.org/pvldb/vol13/p2229-lin.pdf
[11] Lacramioara Mazilu, Norman W. Paton, Alvaro A. A. Fernandes, and Martin Koehler. 2019. Dynamap: Schema Mapping Generation in the Wild. In Proceedings of the 31st International Conference on Scientific and Statistical Database Management (Santa Cruz, CA, USA) (SSDBM '19). Association for Computing Machinery, New York, NY, USA, 37-48. https://doi.org/10.1145/3335783.3335785
[12] Lacramioara Mazilu, Norman W. Paton, Nikolaos Konstantinou, and Alvaro A. A. Fernandes. 2020. Fairness in Data Wrangling. In 21st International Conference on Information Reuse and Integration for Data Science, IRI 2020, Las Vegas, NV, USA, August 11-13, 2020. IEEE, 341-348. https://doi.org/10.1109/IRI49571.2020.00056
[13] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A Survey on Bias and Fairness in Machine Learning. CoRR abs/1908.09635 (2019). arXiv:1908.09635 http://arxiv.org/abs/1908.09635
[14] J. Ross Quinlan. 1996. Bagging, Boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI, William J. Clancey and Daniel S. Weld (Eds.). AAAI Press / The MIT Press, 725-730. http://www.aaai.org/Library/AAAI/1996/aaai96-108.php
[15] Inês Valentim, Nuno Lourenço, and Nuno Antunes. 2019. The Impact of Data Preparation on the Fairness of Software Systems. CoRR abs/1910.02321 (2019). arXiv:1910.02321 http://arxiv.org/abs/1910.02321
[16] Vladimiro Zelaya, Paolo Missier, and Dennis Prangle. 2019. Parametrised Data Sampling for Fairness Optimisation. In KDD XAI.
[17] Richard Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. 2013. Learning Fair Representations. In ICML (Atlanta, GA, USA) (ICML '13). III-325 to III-333.
[18] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating Unwanted Biases with Adversarial Learning. In AAAI (New Orleans, LA, USA). NY, USA, 335-340. https://doi.org/10.1145/3278721.3278779