<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Data Ethics: Towards a Trade-off Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Azzalini</string-name>
          <email>fabio.azzalini@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cinzia Cappiello</string-name>
          <email>cinzia.cappiello@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiara Criscuolo</string-name>
          <email>chiara.criscuolo@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilla Sancricca</string-name>
          <email>camilla.sancricca@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Letizia Tanca</string-name>
          <email>letizia.tanca@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Milano</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>2023</issue>
      <abstract>
        <p>In the last decades, one of the main drivers for organizational success has been data-driven decision-making: strategic decisions are based on data analysis and interpretation. In this scenario, relying on dependable results becomes imperative. Therefore we must ensure that input data have good quality and that the algorithms on which the analysis is based are fair: in general, Data Quality (DQ) and Data Ethics (DE) should be guaranteed. However, maximizing DQ and DE simultaneously is non-trivial, since DQ improvement techniques can negatively affect DE and vice versa. Discovering which relationships exist between DQ and DE and thoroughly analyzing them is therefore of paramount importance. The goal of this paper is to study whether, in a given context, there is a trade-off between DQ and DE: specifically, we consider the Completeness dimension of DQ, and the Fairness dimension of DE. The results of our experiments, based on two well-known real-world datasets, provided details about this trade-off and allowed us to draw some guidelines.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Data Ethics</kwd>
        <kwd>Fairness</kwd>
        <kwd>Data Protection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last decades, a data-driven culture has spread across several
domains. The availability of large amounts of data and
algorithms has made our lives more efficient and easier,
and strategic decisions are made based on data analysis
and interpretation; therefore, relying on dependable
results becomes imperative. We need to be sure that the
data sources have good quality and that the algorithms on
which the analysis is based are fair and do not introduce
bias in the decision process.</p>
      <p>In fact, on the one hand, the performance of Machine
Learning (ML) algorithms may be seriously affected by the
quality of the input data [1]: inaccurate, incomplete, and
inconsistent data may produce poor analysis results.
Therefore, in addition to the well-known storage and
processing problems related to data collection, addressing
Data Quality (DQ) has become a fundamental issue [2, 3].
The most used DQ dimensions are Accuracy, Completeness,
Consistency, and Timeliness [2]: Accuracy is the extent to
which data are correct, reliable and certified; Completeness
is the degree to which a dataset represents the corresponding
portion of the real world. On the other hand, we note that,
for Data Science to be reliable, DQ should also be accompanied
by Data Ethics (DE): dimensions such as Fairness,
Transparency, and Data Protection should be guaranteed
as well [4].</p>
      <p>It is already well known that there may be contrasting
objectives also among the dimensions of DE, for instance,
between Transparency and Data Protection. In the same
way, the relationship between the DQ dimensions [2]
and the ethical ones is complex. For example, commonly
used DQ improvement techniques (e.g., imputing
missing values using the mean value) might modify the
overall distribution of values in the dataset, leading to a
reduction of Fairness; on the other hand, some Bias Mitigation
techniques modify real data values to remove unfairness,
thus lowering Accuracy, which is a fundamental dimension
of DQ. However, there are also contexts in which
the user does not care about Fairness, like in the analysis
of sensor data or in forecasting raw-material prices. In
these cases, we do not have protected attributes (e.g., sex,
race, ethnicity, etc.) and not even proxy ones (e.g., education,
zip code, etc.). Moreover, in some applications,
differences in treatment and outcomes among different
groups are justified and explained: for example, disproportional
recruitment rates for males and females might
be explained by the fact that more males have higher
education [5]; thus, Fairness is not always an issue.</p>
      <p>This research aims to study if, in a given context, a
trade-off between Data Quality and Data Ethics exists
and, in this case, give guidelines to the user according to
that specific context. In this paper, we focus on the Completeness
dimension of DQ, and on the Fairness dimension
of DE. To this aim, we have designed experiments that
take a dataset as input and perform an assessment of
these dimensions before and after applying some operations
that should improve them. The rest of the paper
is organized as follows: Section 2 summarizes related
work, while Section 3 introduces preliminary concepts of
both areas of Data Quality and Data Ethics and describes
the method we used to analyze the relationship between
the Completeness dimension of DQ and the Fairness
dimension of DE; Section 4 presents the experiments we
conducted on real-world datasets, and Section 5 concludes
the paper.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <p>Research studies on the relationship between DQ and
DE are in a very preliminary phase. In this section, we
will first present seminal works on Fairness and then
introduce two first attempts at studying its important
relationship with Completeness. We do not focus on DQ
systems since, in this paper, we will resort to well-known
and established DQ definitions and techniques [2].</p>
      <p>In the literature, one of the most notable solutions
aiming to measure and enforce Fairness is AI Fairness
360 [6], an open-source framework. It aims to mitigate
data bias, quantified using different statistical measures,
by exploiting pre-processing techniques (i.e., procedures
that, before the application of a prediction algorithm, make
sure that the learning data are fair) and statistical measures
to solve bias in the dataset. Similarly, Fairlearn [7],
another pre-processing, open-source, community-driven
project, aims to help data scientists improve the Fairness of
their ML systems by means of statistical Fairness metrics.
Both works focus on techniques that manipulate the data
to make them fairer; however, they do not consistently
consider the impact that their techniques have on DQ.</p>
      <p>A system that considers also DQ is described in the
paper by Abraham et al. [8], who proposed FairLOF, a
Fairness-aware outlier-detection framework. This work
starts from the fact that underrepresented groups, although
relevant in the dataset, could be identified as outliers;
specifically, it works on calibrating the so-called local
outlier factor, by means of which a fairer outlier detection
is possible. Though this system actually focuses on
a specific problem, it can be considered a starting point
for studying the relationship between DQ and DE. A
similar system has been presented by Biswas et al. [9],
whose goal is to investigate the impact of data preparation
pipelines on algorithmic Fairness, focusing on
deep-learning techniques. The authors conduct a detailed
evaluation of several Fairness metrics applied to
different deep-learning applications and discover that
many data preparation actions can introduce bias in the
data and, consequently, in the final prediction. However,
they do not employ any Fairness improvement technique
inside their pipelines, considering only how DQ techniques
impact Fairness, and not vice versa.</p>
      <p>Guha et al. [10] conducted a study to investigate
whether errors, e.g., missing values, outliers, and label
noise, can be related to demographic characteristics.
Moreover, they investigate if automated data cleaning
actions could impact Fairness. In their study, they
discovered that tuples related to disadvantaged groups were
more affected by the presence of missing values; instead,
the number of mislabeled data was lower in the disadvantaged
groups w.r.t. the privileged ones. Moreover, they
proved that, in general, the probability that automated
data cleaning contributes to worsening Fairness is higher
w.r.t. improving it. Finally, there is a work on the specific
relationship between Fairness and missing values [11].
We discuss the differences with our setting in Section 4.2.3.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Experiment Design</title>
      <sec id="sec-2-1">
        <title>This section presents the method we used to investigate</title>
        <p>the relationship between DE w.r.t. Fairness, and the DQ,
w.r.t. the Completeness. Figure 1 schematizes the typical
Data Science pipeline used to derive knowledge from data.
The pipeline begins with the Acquisition and Extraction
step: the information relevant to the data-science task is
collected. The second step of the pipeline aims to solve
the Data Quality issues: DQ Improvement and Annotation
procedures are used to “sanitize” the data sources in such
a way as to make them complete, correct and consistent.
In the third phase, if needed, Data Integration provides a
unified view of the data sources acquired in the first phase.
Finally, in the last two steps, the predictive models are
learned (Analysis and Modeling), and data and results are
visualized (Visualization and Evaluation). We position
our solution between the first and second step of the</p>
        <p>SUGGESTIONS</p>
        <sec id="sec-2-1-1">
          <title>3.1. Preliminaries</title>
          <p>Data Quality (DQ) is defined as “fitness for use,” i.e.,
the ability of a data collection to meet the user
requirements [12]. Data Quality is a multi-dimensional concept:
a DQ model is composed of DQ dimensions representing
the different aspects to be considered (e.g., errors,
duplicates, format errors, typos, or missing values). The
experiments concentrate on the Completeness DQ
dimension. Completeness characterizes the extent to which
a dataset represents the corresponding portion of the real
world. For instance, in a relational database, Completeness
is strictly related to the presence of null values. A simple
way to assess the Completeness of a table is to calculate
the ratio between the number of non-null values and the
total number of cells in the table. It is important to specify
that we also use the Accuracy dimension to evaluate the
resulting data correctness. Accuracy is, in fact, defined as
the closeness between a data value v and a data value v’,
considered as the correct representation of the real-life
phenomenon that the value v aims to represent. It is
associated with syntactic and semantic issues that might
create a discrepancy between the value stored in the dataset
and the correct value. How each of these two dimensions is used
will be explained in the description of the method.</p>
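          <p>As a concrete illustration of how these two dimensions can be assessed, the following minimal sketch (our own example, not the code used in the paper) computes the Completeness of a table as the ratio of non-null cells, and a cell-wise Accuracy against a reference copy that is assumed to hold the correct values; the data are assumed to be handled as pandas DataFrames.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' code): assessing Completeness and Accuracy
# of a pandas DataFrame, following the definitions given in Section 3.1.
import pandas as pd

def completeness(df):
    """Ratio between non-null cells and the total number of cells."""
    total_cells = df.shape[0] * df.shape[1]
    non_null_cells = int(df.notna().sum().sum())
    return non_null_cells / total_cells

def cell_accuracy(df, reference):
    """Fraction of cells whose value matches the reference (correct) table."""
    matched = int((df.values == reference.values).sum())
    total = reference.shape[0] * reference.shape[1]
    return matched / total

# Toy usage: one cell is nulled out, so Completeness drops to 5/6.
reference = pd.DataFrame({"age": [25, 40, 31], "sex": ["F", "M", "F"]})
observed = reference.copy()
observed.loc[1, "age"] = None               # introduce a missing value
print(completeness(observed))               # 0.833...
print(cell_accuracy(observed.fillna(-1), reference))
          </preformat>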
          <p>Fairness, whose most used definition is “the absence
of any prejudice or favoritism toward an individual or
a group based on their inherent or acquired
characteristics” [13, p. 100], is one of the most important dimensions
of Data Ethics (DE). Fairness is based on the concept of
protected or sensitive attribute. A protected attribute is
a characteristic for which non-discrimination should be
established, such as religion, race, sex, and so on [14]. A
protected group is a set of individuals identified by
having the same value of a protected attribute (e.g., females,
young people, Hispanic people). There is no unique
metric of Fairness: many facets exist, and a model is
considered fair if it satisfies some or all of these metrics. The
most used technique to identify unfairness in datasets is
to train a classification algorithm to predict the binary
value of the target class, which can be a positive outcome,
like obtaining a loan or having a high income, or a
negative outcome, like not obtaining a loan or having a low
income, and then use Fairness metrics to understand
whether the prediction of this model encompasses
discrimination for the protected group: if the metric results
show discrimination, we can conclude that the dataset also
contains unfair behaviors, since the model learned
the bias from it. Specifically, we measure the importance
of protected attributes in determining the result of the
model. The following statistical metrics study how
specific values of the protected attributes impact the result of
the prediction algorithm (e.g., women are very frequently
associated with salaries lower than 50k$/year, while men
earn more than 50k$/year). Informally: the Disparate Impact
Ratio compares the probability of obtaining a positive outcome
for individuals inside and outside the protected group [15];
the Predictive Parity Ratio evaluates whether protected and
unprotected groups have the same probability that a group
member who received a positive prediction actually belongs
to the negative class [14]; the False Positive Ratio evaluates
whether the probability of receiving a false positive prediction
is the same for all protected groups [14].</p>
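          <p>The detection procedure just described can be sketched as follows (an illustrative example of ours, not the authors' pipeline): a Decision Tree is trained on an Adult-like table, assumed here to be numerically encoded with a binary target column named ‘income’ and a protected column named ‘sex’, and the rate of positive predictions is then compared across the protected groups.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' pipeline): detecting unfairness by training
# a classifier and comparing positive-prediction rates across protected groups.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def positive_rate_by_group(df, y_pred, protected="sex"):
    """P(Y_hat = 1) for each value of the protected attribute."""
    preds = pd.Series(y_pred, index=df.index, name="y_pred")
    return preds.groupby(df[protected]).mean()

def audit(data):
    # Assumed input: a numerically encoded Adult-like DataFrame with a binary
    # column 'income' (1 = positive outcome) and a protected column 'sex'.
    X = data.drop(columns=["income"])
    y = data["income"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # A large gap between the groups' rates signals that the model, and hence
    # the data it was trained on, encodes a bias w.r.t. the protected attribute.
    return positive_rate_by_group(X_test, y_pred)
          </preformat>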
        </sec>
        <sec id="sec-2-1-2">
          <title>3.2. A Method to analyze the DQ and DE trade-off</title>
          <p>This section presents the two pipelines we defined to
execute the experiments. In the first one, which can be
applied both to datasets affected by ethical issues and to
ethics-compliant datasets, we injected errors in the input dataset,
causing data quality issues, and then applied DQ improvement
techniques, measuring their impact on DE. In the
second pipeline, we applied DE improvement techniques
to a dataset affected by ethical problems and measured
their impact on DQ. Through these results, we studied
the trade-off between DQ and DE. In our experiments,
we considered the trade-off between the Completeness
DQ dimension and the Fairness DE dimension, while the
Accuracy DQ dimension is used to evaluate the final DQ
level in both pipelines. We used the Adult Census Income
dataset1 and the German Credit dataset2 and considered
‘sex’ as the protected attribute. Since the Adult Census
Income dataset already contains bias w.r.t. the income of
US citizens, injecting further bias to perform the experiments
was not necessary; therefore, we used it in both
pipelines. The German Credit dataset, instead, is not affected
by bias; thus, we could not apply Bias Mitigation
techniques, and we tested it only in the first pipeline.
The first operation, performed in both pipelines, is the
Ethical Evaluation, in our case based on a classification
algorithm that computes the Fairness level of the dataset.
For the DQ level, we already knew that it was 100% for
both datasets. We now describe the two pipelines shown
in Figure 2.</p>
          <p>DQ-Oriented Experiments. The input dataset was free
of DQ problems. For this reason, we had to inject errors
in order to evaluate the impact of the DQ improvement
techniques. In our case, to affect Completeness, we replaced
existing values with null values. By injecting a
different percentage of uniformly distributed DQ errors3
(from 90% to 0%, with a decreasing step of 10%), the Error
Injection phase generates ten instances of the original
dataset, at different levels of quality. These ‘dirty’ versions
are the input of the DQ Improvement phase, in
which a DQ improvement technique is applied. In our
case, an Imputation technique was selected. The ten repaired
datasets obtained as output were analyzed in the
Final Evaluation phase, to check the impact of the DQ
improvement on the Fairness and Accuracy measures,
used to evaluate respectively the lack of bias and the data
correctness. This procedure was repeated for the different
Imputation methods. The pipeline output is the Suggested
DQ Improvement step, in which we suggest the best DQ
improvement technique based on the Accuracy and Fairness
results. The final users can choose the Imputation technique
with the minimum impact on Fairness according
to their preferred trade-off.</p>
          <p>DE-Oriented Experiments. Also in this case the input
dataset was free of DQ problems. As regards Fairness, we
did not have an error-injection phase since, this time, the
considered dataset (Adult Census Income) was already
biased. The DE Improvement phase consisted of applying
a Bias Mitigation technique to remove unfairness. Also
here, the repaired dataset was analyzed in the Final Evaluation
phase, where both Fairness and Accuracy are measured,
repeating this phase for all the selected Bias Mitigation
techniques. Some of these techniques, since they act
by directly replacing the data values with other (fake)
values, also allow controlling the amount of bias repair
executed. For example, Correlation Remover [7], fully described
in the next section, modifies the actual values to
minimize the correlation between the feature attributes
and the sensitive ones. The output of the pipeline is the
Suggested DE Improvement step, in which we propose the
best DE improvement technique based on both the DQ and
DE evaluation results. The final users can choose the
Bias Mitigation technique having the minimum impact
on Accuracy according to their preferred trade-off.</p>
          <p>1 https://archive.ics.uci.edu/ml/datasets/adult
2 https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
3 Related to a specific DQ dimension.</p>
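          <p>As an illustration of the Error Injection phase described above, the following minimal sketch (ours, not the authors' implementation) nulls out a uniformly sampled fraction of the cells of a pandas DataFrame and generates the ten ‘dirty’ versions, from 90% down to 0% injected errors.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' implementation) of the Error Injection phase:
# a given percentage of cells, chosen uniformly at random, is set to null.
import numpy as np
import pandas as pd

def inject_missing(df, error_fraction, seed=0):
    """Return a copy of df where error_fraction (0-1) of the cells are nulled."""
    rng = np.random.default_rng(seed)
    dirty = df.copy().astype(object)
    n_rows, n_cols = dirty.shape
    n_errors = int(round(error_fraction * n_rows * n_cols))
    flat_idx = rng.choice(n_rows * n_cols, size=n_errors, replace=False)
    for idx in flat_idx:
        dirty.iat[idx // n_cols, idx % n_cols] = np.nan
    return dirty

def dirty_versions(df):
    # Ten instances of the dataset, from 90% down to 0% injected errors
    # (decreasing step of 10%), as in the DQ-Oriented pipeline.
    return {pct: inject_missing(df, pct / 100) for pct in range(90, -1, -10)}
          </preformat>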
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>In this section, we first introduce the experimental setup</title>
        <p>and then describe the results, both from the DQ and the
DE perspectives.
1https://archive.ics.uci.edu/ml/datasets/adult
2https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+
data)
3Related to a specific DQ dimension</p>
        <sec id="sec-2-2-1">
          <title>4.1. Experimental setup</title>
          <p>DQ Improvement phase. In this paper, we consider
three Data Imputation techniques: Density-based, where
missing values are imputed for each feature following the
distribution of the non-empty values; Mode Imputation,
where the most frequent value is imputed; and Rare-based,
where the least frequent value is imputed.</p>
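          <p>The three Imputation techniques can be sketched as follows (an illustrative example of ours; the paper does not publish its implementation): Mode imputes the most frequent value, Rare-based the least frequent one, and Density-based samples from the empirical distribution of the non-null values of each column.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' code) of the three Data Imputation techniques.
import numpy as np
import pandas as pd

def impute(df, strategy, seed=0):
    rng = np.random.default_rng(seed)
    repaired = df.copy()
    for col in repaired.columns:
        counts = repaired[col].value_counts(dropna=True)   # sorted, most frequent first
        missing = repaired[col].isna()
        if counts.empty or not missing.any():
            continue
        if strategy == "mode":
            repaired.loc[missing, col] = counts.index[0]     # most frequent value
        elif strategy == "rare":
            repaired.loc[missing, col] = counts.index[-1]    # least frequent value
        elif strategy == "density":
            probs = counts / counts.sum()
            sampled = rng.choice(probs.index.to_numpy(),
                                 size=int(missing.sum()),
                                 p=probs.to_numpy())
            repaired.loc[missing, col] = sampled             # follow the distribution
    return repaired
          </preformat>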
          <p>Bias Mitigation phase. Three Bias Mitigation techniques
are proposed to remove the unfairness from data.
The first one, Correlation Remover [7], removes the negative
correlation between the protected attribute and the
classification label by modifying the non-protected attributes
that are in turn correlated with the protected one:
mathematically speaking, it poses a minimization problem
on the correlation between the feature attributes
and the sensitive ones, by centering the sensitive values,
training a linear regressor on the non-sensitive ones
and reporting the residual. The second one is Learning
Fair Representation [6], which maps each data tuple
(corresponding to an individual) to a ‘prototype’, an artificial
representation of the data containing the same
protected attribute but with modified values for the other
features, to remove the correlation between the protected
attributes and the target ones. To do so, this method uses
a neural network with the objective of retaining as much
information as possible. The last one, Optimized Preprocessing [6],
solves an optimization problem with the
objective of minimizing the difference between the modified
distribution and the original one; specifically, it aims
to reduce the discrimination by mapping different feature
attributes to the classification labels of the individuals
inside the dataset, while keeping the protected attributes
unchanged, to limit the dependency of the prediction on
the sensitive attributes. In all three cases, the techniques
involve only the numerical features.</p>
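          <p>The partial bias repair mentioned in Section 3.2 can be illustrated with Fairlearn's CorrelationRemover, the implementation behind the first technique [7]. The sketch below is our own illustration rather than the authors' experimental code; it assumes a pandas DataFrame X_num of numerical features that includes a numerically encoded protected column ‘sex’, and it relies on the Fairlearn parameters sensitive_feature_ids and alpha as we understand them from that library's documentation.</p>
          <preformat preformat-type="code">
# Minimal sketch of applying Correlation Remover [7] with a tunable amount of
# bias repair (alpha). Column names are illustrative assumptions.
import pandas as pd
from fairlearn.preprocessing import CorrelationRemover

def remove_correlation(X_num, alpha=1.0):
    """alpha = 1.0 removes the correlation completely, 0.0 leaves data unchanged."""
    remover = CorrelationRemover(sensitive_feature_ids=["sex"], alpha=alpha)
    transformed = remover.fit_transform(X_num)
    # Assumption: Fairlearn returns the remaining (non-sensitive) columns in
    # their original order, with the sensitive column dropped.
    other_cols = [c for c in X_num.columns if c != "sex"]
    return pd.DataFrame(transformed, columns=other_cols, index=X_num.index)
          </preformat>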
          <p>Evaluation Metrics. To evaluate the DQ level of the
dataset, during the Evaluation phase, the Accuracy metric
has been selected. To this aim, the distance between the
original and the final dataset has been computed. Thus,
we extracted the number n_matched of values that correspond
to each other in the original and the final dataset, and
measured the Accuracy as n_matched / n, where n is the total
number of cells. Since there is no standard system for measuring
Fairness, we used two different systems. For the DQ-Oriented
Experiments, we measured Fairness by means of a set of
already defined formulas. Instead, for the DE-Oriented
Experiments, we computed the Fairness metrics offered
by the Fairlearn [7] mitigation tool. The two results
are comparable since there is a very small delta between
the two. For the DQ-Oriented Experiments, the
three metrics, taken from [14, 15], selected to evaluate
Fairness (see Section 3) are expressed as:
Disparate Impact Ratio (DIR) = P(Ŷ=1 | A=discr) / P(Ŷ=1 | A=priv);
Predictive Parity Ratio (PPR) = P(Y=0 | Ŷ=1, A=discr) / P(Y=0 | Ŷ=1, A=priv);
False Positive Ratio (FPR) = P(Ŷ=1 | Y=0, A=discr) / P(Ŷ=1 | Y=0, A=priv);
where A is a protected attribute that has two values, discr (=discriminated)
and priv (=privileged); Y is the actual classification result, with two values
(or labels) 0 or 1; and Ŷ is the algorithm-predicted decision for the
individual, with two values of the outcome, 0 (negative
outcome) or 1 (positive outcome). The ideal value for all
three metrics is 1, which means both groups are treated
equally. If the value is between 0 and 1 − ε, the discriminated
group is treated unfairly, whereas if the value is
greater than or equal to 1 + ε, the privileged group is treated
unfairly. Parameter ε is a threshold value that must be set
by an expert. In our experiments we set the ε parameter
equal to 0.2.</p>
          <p>Dataset and classification algorithm. As explained in
Section 3, we considered two datasets. The first one is
the Adult Census Income dataset, typically used to predict
whether the income of an individual exceeds 50k$
per year. It comprises 48842 tuples, described by 15 attributes,
including the target class. This dataset contains
more than one protected attribute (‘race’, ‘sex’, and ‘native
country’), but our study considered only the attribute
‘sex’. The second one is the German Credit dataset, which
collects information on individuals that are classified
based on whether they are deemed good or bad payers
when asking for a loan. It comprises 1000 tuples,
consisting of 20 attributes, including the target class. The
sensitive attribute is ‘personal-status-sex’, i.e., the marital
status, from which the protected attribute ‘sex’ can be
derived. Differently from the previous one, this dataset is
not affected by bias with respect to ‘sex’. Finally, we used
as classification algorithm the Decision Tree Classifier
offered by the scikit-learn Python library.</p>
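          <p>For concreteness, the three ratios defined above can be computed directly from a model's predictions. The following minimal sketch (ours, not the paper's evaluation code) assumes NumPy arrays and a binary protected attribute whose discriminated and privileged values are passed explicitly; the discr-over-priv orientation follows the reconstruction given above.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' code) of the three Fairness metrics used in
# the DQ-Oriented Experiments. Values close to 1 mean both groups are treated
# equally; with the paper's threshold epsilon = 0.2, values below 1 - epsilon
# indicate unfair treatment of the discriminated group, values at or above
# 1 + epsilon unfair treatment of the privileged group.
import numpy as np

def _rate(event, condition):
    """Conditional relative frequency P(event | condition)."""
    return float(event[condition].mean())

def fairness_ratios(y_true, y_pred, group, discr="Female", priv="Male"):
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    d = group == discr
    p = group == priv
    pos = y_pred == 1          # predicted positive outcome
    neg = y_true == 0          # actual negative class
    dir_ratio = _rate(pos, d) / _rate(pos, p)
    ppr_ratio = _rate(neg, np.logical_and(d, pos)) / _rate(neg, np.logical_and(p, pos))
    fpr_ratio = _rate(pos, np.logical_and(d, neg)) / _rate(pos, np.logical_and(p, neg))
    return {"DIR": dir_ratio, "PPR": ppr_ratio, "FPR": fpr_ratio}
          </preformat>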
        </sec>
        <sec id="sec-2-3">
          <title>4.2. Result evaluation</title>
          <p>This section presents the main results we obtained. In
Figure 3, the x-axis represents the Completeness level;
instead, in Figure 4, the x-axis shows the degree of Bias
Mitigation. In both figures, the y-axis represents the level
of the evaluated metrics.</p>
          <sec id="sec-2-3-1">
            <title>4.2.1. DQ-Oriented Experiments</title>
            <p>The plots shown in Figure 3 focus on the DQ-Oriented
Experiments, in which the Accuracy and Fairness results are
compared for the three Imputation techniques explained
in Section 4.1.</p>
            <p>Biased dataset. The three plots at the top of Figure 3
show the results for the Adult dataset. In general, the
Mode and the Density-based Imputations reach higher
Accuracy with respect to the Rare-based one, since the
latter modifies the original distribution of values more
than the others. From the Fairness point of view, we can
observe that the Predictive Parity Ratio (PPR) metric can
assume values greater than 1 + ε (i.e., 1.2). This means that
the privileged class (men) is treated unfairly for that specific
Fairness aspect; i.e., the probability of belonging to
class 0 (low income) for a man that instead was predicted
to class 1 (high income) is lower than the probability
of belonging to class 0 for a woman predicted to class 1.
On the contrary, the False Positive Ratio (FPR) always
takes opposite values with respect to the PPR. These two
metrics are symmetrical since they represent opposite
Fairness aspects: FPR evaluates whether the probability
of predicting class 1 is the same both for men and women
belonging to class 0.</p>
            <p>As we can notice, in this specific experiment, the Mode
Imputation introduces minimal changes to the Fairness
metrics, since imputing the most frequent value does not
affect the distribution of the original ones. Instead, the
Density-based Imputation behaves much better: in fact, as
the percentage of injected errors increases, Fairness
increases for all three metrics. This is related to the vast
majority of class 0 in the dataset; since the Imputation
follows the value distribution, it means that those labels
(class 0) have a higher probability of being assigned to men
(who are over-represented). In this way, the dataset will be
balanced. We can conclude that the application of this
Imputation method improves Fairness. Finally, when applying
the Rare-based Imputation, when Completeness varies
between 100% and 40%, Fairness increases; for Completeness
values below 40%, Fairness decreases very quickly. In this
specific case, this happens because, by imputing the less
frequent values, the dataset will be more balanced in favor
of the protected class. As the percentage of injected errors
grows, the rare values become too many, unbalancing the
dataset again.</p>
            <p>Unbiased dataset. The three plots at the bottom of
Figure 3 show the results for the German dataset. Since
the two datasets have a similar distribution, after the
application of the Imputation techniques the Accuracy
takes similar values as in the previous case. Since the
dataset is already fair, the FPR and DIR metrics
assume values around 1, while the PPR is almost 2. After
applying the Imputation techniques, FPR and DIR are
not affected, while the value of PPR is closer to 1 (i.e.,
the probability of belonging to class 0 (bad credit) for
a man predicted to class 1 (good credit) is lower than
the probability of belonging to class 0 for a woman predicted
to class 1); therefore the PPR has improved with
respect to its initial value. In this case, the Imputation
techniques balanced the PPR, improving it as much as
they modify the original distribution of the values. In
fact, Rare-based Imputation, which modifies the original
distribution more, introduces unbalance, causing further
deterioration of Fairness over 60% injected errors.
From these results, we can notice a trade-off between Accuracy
and Fairness; from the DQ-Oriented Experiments
we see that this trade-off can be more or less emphasized
depending on the DQ improvement technique applied.</p>
          </sec>
          <sec id="sec-2-3-2">
            <title>4.2.2. DE-Oriented Experiments</title>
            <p>The plots shown in Figure 4 focus on the DE-Oriented
Experiments. We compared the Accuracy and Fairness
results for the Bias Mitigation techniques explained in
Section 4.1. The results of the experiments conducted on
the entire dataset are represented at the top of Figure 4.
The Bias Mitigation techniques we used focus only on
numerical attributes; thus, the results shown at the bottom
of Figure 4 show the same experiments based only
on the numerical features. We now present our results
by analyzing one Bias Mitigation technique at a time.</p>
            <p>Correlation Remover. When applying Correlation Remover
for a partial Bias Mitigation between 0 and 1, the
Fairness metrics (DIR, FPR, and PPR) slightly improve,
but with an important loss in Accuracy (from 1.0 to 0.6).
This happens because the removal of correlation strongly
modifies the data, greatly affecting Accuracy. Considering
the case in which only the numerical features are
involved, the Fairness metrics are negatively affected.
This represents a case of over-correction: by modifying
the entire dataset, data are too far from the original ones,
and the results are no longer reliable.</p>
            <p>Learning Fair Representation. Applying Learning Fair
Representation, we have the same loss in Accuracy as
for Correlation Remover, since it modifies the numerical
features in order to remove correlations. However, this
technique also aims to minimize information loss and thus
does not cause such a radical modification as the previous
method. Therefore, the Fairness improvement is minimal
considering the full dataset, while considering only the
numerical features, two metrics over three improve (DIR
and FPR).</p>
            <p>Optimized Preprocessing. Using Optimized Preprocessing,
the Accuracy remains unchanged before and after the
mitigation process. This happens because there is no data
modification, but only weights are given to the numerical
features in order to reduce the correlation between the
protected attribute and the prediction. However, applying
this technique to the full dataset is not sufficient to
improve Fairness because the categorical features still
affect the prediction. Moreover, applying this technique
considering only the numerical features improves one
Fairness metric (FPR) over three.</p>
            <p>In the DE-Oriented Experiments we detected a trade-off
between Accuracy and Fairness, and this relationship can
be more or less strong depending on the Bias Mitigation
technique that is applied.</p>
          </sec>
          <sec id="sec-2-3-3">
            <title>4.2.3. A brief comparison</title>
            <p>We can now summarize the differences between our work
and the approach of [11] presented in Section 2: in [11]
the authors studied only the Completeness dimension of
DQ, while we also evaluate the results using Accuracy;
the Fairness metric studied in [11] is only one, while we
studied two more metrics; in [11] the initial dataset used
for the experiments is an unclean one, while we control
the process by applying error injection to a previously
cleaned dataset; finally, in [11] the Imputation techniques
used are only Mode and Mean, while we also apply Rare-based
and Density-based Imputation techniques.</p>
          </sec>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>Takeaway message. From our experiments, we have
noticed that the application of Data Imputation techniques,
in some particular cases, e.g., Density-based Imputation
and Rare-based Imputation on the Adult dataset,
can contribute to improving Fairness. Moreover, in the
experiments starting from unbiased data, Fairness was
not affected by the application of the Imputation techniques.
In most cases, we noticed a trade-off: the Bias
Mitigation technique that less affects the Accuracy, in
general the Optimized Preprocessing technique, is not the
one that improves Fairness the most, and vice versa; for
these cases, we can deduce that techniques that succeed
in preserving both Accuracy and Fairness do not exist.
Therefore, as a takeaway message, we can affirm that the
best Data Imputation/Bias Mitigation technique to
apply strictly depends on the analysis goal. If users
are more interested in preserving Fairness aspects, they
will concentrate on a subset of techniques at the cost of
losing DQ; if the major interest is to optimize the improvement
of the DQ, they will apply a subset of DQ
improvement tasks that could affect Fairness. It is worth
noting that situations may also exist in which Accuracy
and Fairness are not in conflict; however, this is strictly
context-dependent.</p>
      <p>Conclusions. In this work, we analyzed the relationship
between Data Quality (DQ) and Data Ethics (DE). Specifically,
we focus on the Completeness dimension of DQ,
and on the Fairness dimension of DE. Through a series of
experiments, we demonstrated that between DQ and DE
a trade-off is present. In fact, the experiments showed us
that the application of Fairness improvement operations
can lead to a deterioration of Accuracy, used to evaluate
the DQ, and vice versa. Analyzing the experiments in
more detail, we can also state that the amount of Accuracy
deterioration after Fairness improvements depends on
the Bias Mitigation technique, as well as the deterioration
of Fairness can depend on the selected Imputation
technique. Future work will focus on the definition of
clear guidelines to recommend the best choice of DQ/DE
improvement techniques to be applied depending on the
scope of the analysis. Moreover, we could enrich the
gathered knowledge with more datasets, DQ and DE
dimensions, and Bias Mitigation techniques [16, 17].</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was supported by EU Horizon Framework
grant agreement 101069543 (CS-AWARE-NEXT) and by
project ICT4Dev, funded by AICS (Italian Agency for
Development Cooperation).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1"><label>1</label><mixed-citation>A. Jain, et al., Overview and importance of data quality for machine learning tasks, in: Proceedings of the 26th ACM SIGKDD, 2020, pp. 3561–3562.</mixed-citation></ref>
      <ref id="ref2"><label>2</label><mixed-citation>C. Batini, M. Scannapieco, Data and Information Quality – Dimensions, Principles and Techniques, Data-Centric Systems and Applications, Springer, 2016.</mixed-citation></ref>
      <ref id="ref3"><label>3</label><mixed-citation>C. Sancricca, C. Cappiello, Supporting the design of data preparation pipelines (2022) 149–158.</mixed-citation></ref>
      <ref id="ref4"><label>4</label><mixed-citation>D. Firmani, L. Tanca, R. Torlone, Ethical dimensions for data quality, JDIQ 12 (2019) 1–5.</mixed-citation></ref>
      <ref id="ref5"><label>5</label><mixed-citation>F. Kamiran, I. Žliobaitė, Explainable and non-explainable discrimination in classification, Discrimination and Privacy in the Information Society: Data mining and profiling in large databases (2013) 155–170.</mixed-citation></ref>
      <ref id="ref6"><label>6</label><mixed-citation>R. K. Bellamy, et al., AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias, IBM Journal of Research and Development 63 (2019) 4–1.</mixed-citation></ref>
      <ref id="ref7"><label>7</label><mixed-citation>S. Bird, et al., Fairlearn: A toolkit for assessing and improving fairness in AI, Microsoft, Tech. Rep. MSR-TR-2020-32 (2020).</mixed-citation></ref>
      <ref id="ref8"><label>8</label><mixed-citation>S. S. Abraham, FairLOF: fairness in outlier detection, Data Science and Engineering 6 (2021) 485–499.</mixed-citation></ref>
      <ref id="ref9"><label>9</label><mixed-citation>S. Biswas, H. Rajan, Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline, in: Proceedings of the 29th ACM Joint Meeting on ESEC/FSE, 2021, pp. 981–993.</mixed-citation></ref>
      <ref id="ref10"><label>10</label><mixed-citation>S. Guha, F. A. Khan, J. Stoyanovich, S. Schelter, Automated data cleaning can hurt fairness in machine learning-based decision making, in: 2023 IEEE 39th International Conference on Data Engineering (ICDE), IEEE, 2023, pp. 3747–3754.</mixed-citation></ref>
      <ref id="ref11"><label>11</label><mixed-citation>F. Martínez-Plumed, C. Ferri, D. Nieves, J. Hernández-Orallo, Fairness and missing values, arXiv preprint arXiv:1905.12728 (2019).</mixed-citation></ref>
      <ref id="ref12"><label>12</label><mixed-citation>R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers, JMIS 12 (1996) 5–33.</mixed-citation></ref>
      <ref id="ref13"><label>13</label><mixed-citation>N. A. Saxena, et al., How do fairness definitions fare? Testing public attitudes towards three algorithmic definitions of fairness in loan allocations, Artif. Intell. 283 (2020) 103238.</mixed-citation></ref>
      <ref id="ref14"><label>14</label><mixed-citation>S. Verma, J. Rubin, Fairness definitions explained, in: Proceedings of the FairWare@ICSE, 2018, pp. 1–7.</mixed-citation></ref>
      <ref id="ref15"><label>15</label><mixed-citation>N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Comput. Surv. 54 (2022) 115:1–115:35.</mixed-citation></ref>
      <ref id="ref16"><label>16</label><mixed-citation>F. Azzalini, C. Criscuolo, L. Tanca, E-FAIR-DB: functional dependencies to discover data bias and enhance data equity, JDIQ 14 (2022) 1–26.</mixed-citation></ref>
      <ref id="ref17"><label>17</label><mixed-citation>F. Azzalini, C. Criscuolo, L. Tanca, FAIR-DB: A system to discover unfairness in datasets, in: ICDE, IEEE, 2022, pp. 3494–3497.</mixed-citation></ref>
    </ref-list>
  </back>
</article>