<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Data Ethics: Towards a Trade-off Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Azzalini</string-name>
          <email>fabio.azzalini@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cinzia Cappiello</string-name>
          <email>cinzia.cappiello@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiara Criscuolo</string-name>
          <email>chiara.criscuolo@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilla Sancricca</string-name>
          <email>camilla.sancricca@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Letizia Tanca</string-name>
          <email>letizia.tanca@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Milano</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>2023</issue>
      <abstract>
        <p>In the last decades, one of the main drivers for organizational success has been data-driven decision-making: strategic decisions are based on data analysis and interpretation. In this scenario, relying on dependable results becomes imperative. Therefore we must ensure that input data have good quality and that the algorithms on which the analysis is based are fair: in general, Data Quality (DQ) and Data Ethics (DE) should be guaranteed. However, maximizing DQ and DE simultaneously is non-trivial, since DQ improvement techniques can negatively affect DE and vice versa. Discovering which relationships exist between DQ and DE and thoroughly analyzing them is therefore of paramount importance. The goal of this paper is to study whether, in a given context, there is a trade-off between DQ and DE: specifically, we consider the Completeness dimension of DQ, and the Fairness dimension of DE. The results of our experiments, based on two well-known real-world datasets, provided details about this trade-off and allowed us to draw some guidelines.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Data Ethics</kwd>
        <kwd>Fairness</kwd>
        <kwd>Data Protection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last decades, a data-driven culture has spread across several
domains. The availability of large amounts of data and
algorithms has made our lives more efficient and easier,
and strategic decisions are made based on data analysis
and interpretation; therefore, relying on dependable
results becomes imperative. We need to be sure that the
data sources have good quality and that the algorithms on
which the analysis is based are fair and do not introduce
bias in the decision process.</p>
      <p>In fact, on the one hand, the performance of Machine
Learning (ML) algorithms may be seriously affected by the
quality of the input data [1]: inaccurate, incomplete, and
inconsistent data may produce poor analysis results.
Therefore, in addition to the well-known storage and
processing problems related to data collection, addressing
Data Quality (DQ) has become a fundamental issue [2, 3].
The most used DQ dimensions are Accuracy, Completeness,
Consistency, and Timeliness [2]: Accuracy is the extent to
which data are correct, reliable and certified; Completeness
is the degree to which a dataset represents the corresponding
portion of the real world. On the other hand, we note that,
for Data Science to be reliable, DQ should also be accompanied
by Data Ethics (DE): dimensions such as Fairness,
Transparency, and Data Protection should be guaranteed
as well [4].</p>
      <p>It is already well known that there may be contrasting
objectives also among the dimensions of DE, for instance,
between Transparency and Data Protection. In the same
way, the relationship between the DQ dimensions [2]
and the ethical ones is complex. For example, commonly
used DQ improvement techniques (e.g., imputing
missing values using the mean value) might modify the
overall distribution of values in the dataset, leading to a
reduction of Fairness; on the other hand, some Bias Mitigation
techniques modify real data values to remove unfairness,
thus lowering Accuracy, which is a fundamental dimension
of DQ. However, there are also contexts in which
the user does not care about Fairness, like in the analysis
of sensor data or in forecasting raw-material prices. In
these cases, we do not have protected attributes (e.g., sex,
race, ethnicity, etc.) and not even proxy ones (e.g., education,
zip code, etc.). Moreover, in some applications,
differences in treatment and outcomes among different
groups are justified and explained: for example, disproportional
recruitment rates for males and females might
be explained by the fact that more males have higher
education [5]; thus, Fairness is not always an issue.</p>
      <p>This research aims to study if, in a given context, a
trade-off between Data Quality and Data Ethics exists
and, in this case, give guidelines to the user according to
that specific context. In this paper, we focus on the Completeness
dimension of DQ, and on the Fairness dimension
of DE. To this aim, we have designed experiments that
take a dataset as input and perform an assessment of
these dimensions before and after applying some operations
that should improve them. The rest of the paper
is organized as follows: Section 2 summarizes related
work, while Section 3 introduces preliminary concepts of
both areas of Data Quality and Data Ethics and describes
the method we used to analyze the relationship between
the Completeness dimension of DQ and the Fairness
dimension of DE; Section 4 presents the experiments we
conducted on real-world datasets, and Section 5 concludes
the paper.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <p>Research studies on the relationship between DQ and
DE are in a very preliminary phase. In this section, we
will first present seminal works on Fairness and then
introduce two first attempts at studying its important
relationship with Completeness. We do not focus on DQ
systems since, in this paper, we will resort to well-known
and established DQ definitions and techniques [2].</p>
      <p>In the literature, one of the most notable solutions
aiming to measure and enforce Fairness is AI Fairness
360 [6], an open-source framework. It aims to mitigate
data bias, quantified using different statistical measures,
by exploiting pre-processing techniques (i.e., procedures
that, before the application of a prediction algorithm, make
sure that the learning data are fair) and statistical measures
to solve bias in the dataset. Similarly, Fairlearn [7],
another pre-processing, open-source, community-driven
project, aims to help data scientists improve the Fairness of
their ML systems by means of statistical Fairness metrics.
Both works focus on techniques that manipulate the data
to make them fairer; however, they do not consistently
consider the impact that their techniques have on DQ.</p>
      <p>A system that considers also DQ is described in the
paper by Abraham et al. [8], who proposed FairLOF, a
Fairness-aware outlier-detection framework. This work
starts from the fact that underrepresented groups, although
relevant in the dataset, could be identified as outliers;
specifically, it works on calibrating the so-called local
outlier factor, by means of which a fairer outlier detection
is possible. Though this system actually focuses on
a specific problem, it can be considered a starting point
for studying the relationship between DQ and DE. A
similar system has been presented by Biswas et al. [9],
whose goal is to investigate the impact of data preparation
pipelines on algorithmic Fairness, focusing on
deep-learning techniques. The authors conduct a detailed
evaluation of several Fairness metrics applied to
different deep-learning applications and discover that
many data preparation actions can introduce bias in the
data and, consequently, in the final prediction. However,
they do not employ any Fairness improvement technique
inside their pipelines, considering only how DQ techniques
impact Fairness, and not vice versa.</p>
      <p>Guha et al. [10] conducted a study to investigate
whether errors, e.g., missing values, outliers, and label
noise, can be related to demographic characteristics.
Moreover, they investigate if automated data cleaning
actions could impact Fairness. In their study, they
discovered that tuples related to disadvantaged groups were
more affected by the presence of missing values; instead,
the number of mislabeled data was lower in the disadvantaged
groups w.r.t. the privileged ones. Moreover, they
proved that, in general, the probability that automated
data cleaning contributes to worsening Fairness is higher
w.r.t. improving it. Finally, there is a work on the specific
relationship between Fairness and missing values [11].
We discuss the differences with our setting in Section 4.2.3.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Experiment Design</title>
      <sec id="sec-2-1">
        <title>This section presents the method we used to investigate</title>
        <p>the relationship between DE w.r.t. Fairness, and the DQ,
w.r.t. the Completeness. Figure 1 schematizes the typical
Data Science pipeline used to derive knowledge from data.
The pipeline begins with the Acquisition and Extraction
step: the information relevant to the data-science task is
collected. The second step of the pipeline aims to solve
the Data Quality issues: DQ Improvement and Annotation
procedures are used to “sanitize” the data sources in such
a way as to make them complete, correct and consistent.
In the third phase, if needed, Data Integration provides a
unified view of the data sources acquired in the first phase.
Finally, in the last two steps, the predictive models are
learned (Analysis and Modeling), and data and results are
visualized (Visualization and Evaluation). We position
our solution between the first and second step of the</p>
        <p>SUGGESTIONS</p>
        <sec id="sec-2-1-1">
          <title>3.1. Preliminaries</title>
          <p>Data Quality (DQ) is defined as “fitness for use,” i.e.,
the ability of a data collection to meet the user
requirements [12]. Data Quality is a multi-dimensional concept:
a DQ model is composed of DQ dimensions representing
the different aspects to be considered (e.g., errors,
duplicates, format errors, typos, or missing values). The
experiments concentrate on the Completeness DQ
dimension. Completeness characterizes the extent to which
a dataset represents the corresponding portion of the real
world. For instance, in a relational database, Completeness
is strictly related to the presence of null values. A simple
way to assess the Completeness of a table is to calculate
the ratio between the number of non-null values and the
total number of cells in the table. It is important to specify
that we also use the Accuracy dimension to evaluate the
resulting data correctness. Accuracy is, in fact, defined as
the closeness between a data value v and a data value v’,
considered as the correct representation of the real-life
phenomenon that the value v aims to represent. It is
associated with syntactic and semantic issues that might
create a discrepancy between the value stored in the dataset
and the correct value. How each of these two dimensions is used
will be explained in the description of the method.</p>
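          <p>As a concrete illustration of how these two dimensions can be assessed, the following minimal sketch (our own example, not the code used in the paper) computes the Completeness of a table as the ratio of non-null cells, and a cell-wise Accuracy against a reference copy that is assumed to hold the correct values; the data are assumed to be handled as pandas DataFrames.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' code): assessing Completeness and Accuracy
# of a pandas DataFrame, following the definitions given in Section 3.1.
import pandas as pd

def completeness(df):
    """Ratio between non-null cells and the total number of cells."""
    total_cells = df.shape[0] * df.shape[1]
    non_null_cells = int(df.notna().sum().sum())
    return non_null_cells / total_cells

def cell_accuracy(df, reference):
    """Fraction of cells whose value matches the reference (correct) table."""
    matched = int((df.values == reference.values).sum())
    total = reference.shape[0] * reference.shape[1]
    return matched / total

# Toy usage: one cell is nulled out, so Completeness drops to 5/6.
reference = pd.DataFrame({"age": [25, 40, 31], "sex": ["F", "M", "F"]})
observed = reference.copy()
observed.loc[1, "age"] = None               # introduce a missing value
print(completeness(observed))               # 0.833...
print(cell_accuracy(observed.fillna(-1), reference))
          </preformat>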
          <p>Fairness, whose most used definition is “the absence
of any prejudice or favoritism toward an individual or
a group based on their inherent or acquired
characteristics” [13, p. 100], is one of the most important dimensions
of Data Ethics (DE). Fairness is based on the concept of
protected or sensitive attribute. A protected attribute is
a characteristic for which non-discrimination should be
established, such as religion, race, sex, and so on [14]. A
protected group is a set of individuals identified by
having the same value of a protected attribute (e.g., females,
young people, Hispanic people). There is no unique
metric of Fairness: many facets exist, and a model is
considered fair if it satisfies some or all of these metrics. The
most used technique to identify unfairness in datasets is
to train a classification algorithm to predict the binary
value of the target class, which can be a positive outcome,
like obtaining a loan or having a high income, or a
negative outcome, like not obtaining a loan or having a low
income, and then use Fairness metrics to understand
whether the prediction of this model encompasses
discrimination for the protected group: if the metric results
show discrimination, we can conclude that the dataset also
contains unfair behaviors, since the model learned
the bias from it. Specifically, we measure the importance
of protected attributes in determining the result of the
model. The following statistical metrics study how
specific values of the protected attributes impact the result of
the prediction algorithm (e.g., women are very frequently
associated with salaries lower than 50k$/year, while men
earn more than 50k$/year). Informally: the Disparate Impact
Ratio compares the probability of obtaining a positive outcome
for individuals inside and outside the protected group [15];
the Predictive Parity Ratio evaluates whether protected and
unprotected groups have the same probability that a group
member who received a positive prediction actually belongs
to the negative class [14]; the False Positive Ratio evaluates
whether the probability of receiving a false positive prediction
is the same for all protected groups [14].</p>
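          <p>The detection procedure just described can be sketched as follows (an illustrative example of ours, not the authors' pipeline): a Decision Tree is trained on an Adult-like table, assumed here to be numerically encoded with a binary target column named ‘income’ and a protected column named ‘sex’, and the rate of positive predictions is then compared across the protected groups.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' pipeline): detecting unfairness by training
# a classifier and comparing positive-prediction rates across protected groups.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def positive_rate_by_group(df, y_pred, protected="sex"):
    """P(Y_hat = 1) for each value of the protected attribute."""
    preds = pd.Series(y_pred, index=df.index, name="y_pred")
    return preds.groupby(df[protected]).mean()

def audit(data):
    # Assumed input: a numerically encoded Adult-like DataFrame with a binary
    # column 'income' (1 = positive outcome) and a protected column 'sex'.
    X = data.drop(columns=["income"])
    y = data["income"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # A large gap between the groups' rates signals that the model, and hence
    # the data it was trained on, encodes a bias w.r.t. the protected attribute.
    return positive_rate_by_group(X_test, y_pred)
          </preformat>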
        </sec>
        <sec id="sec-2-1-2">
          <title>3.2. A Method to analyze the DQ and DE trade-off</title>
          <p>This section presents the two pipelines we defined to
execute the experiments. In the first one, which can be
applied both to datasets affected by ethical issues and to
ethics-compliant datasets, we injected errors in the input dataset,
causing data quality issues, and then applied DQ improvement
techniques, measuring their impact on DE. In the
second pipeline, we applied DE improvement techniques
to a dataset affected by ethical problems and measured
their impact on DQ. Through these results, we studied
the trade-off between DQ and DE. In our experiments,
we considered the trade-off between the Completeness
DQ dimension and the Fairness DE dimension, while the
Accuracy DQ dimension is used to evaluate the final DQ
level in both pipelines. We used the Adult Census Income
dataset1 and the German Credit dataset2 and considered
‘sex’ as the protected attribute. Since the Adult Census
Income dataset already contains bias w.r.t. the income of
US citizens, injecting further bias to perform the experiments
was not necessary; therefore, we used it in both
pipelines. The German Credit dataset, instead, is not affected
by bias; thus, we could not apply Bias Mitigation
techniques, and we tested it only in the first pipeline.
The first operation, performed in both pipelines, is the
Ethical Evaluation, in our case based on a classification
algorithm that computes the Fairness level of the dataset.
For the DQ level, we already knew that it was 100% for
both datasets. We now describe the two pipelines shown
in Figure 2.</p>
          <p>DQ-Oriented Experiments. The input dataset was free
of DQ problems. For this reason, we had to inject errors
in order to evaluate the impact of the DQ improvement
techniques. In our case, to affect Completeness, we replaced
existing values with null values. By injecting a
different percentage of uniformly distributed DQ errors3
(from 90% to 0%, with a decreasing step of 10%), the Error
Injection phase generates ten instances of the original
dataset, at different levels of quality. These ‘dirty’ versions
are the input of the DQ Improvement phase, in
which a DQ improvement technique is applied. In our
case, an Imputation technique was selected. The ten repaired
datasets obtained as output were analyzed in the
Final Evaluation phase, to check the impact of the DQ
improvement on the Fairness and Accuracy measures,
used to evaluate respectively the lack of bias and the data
correctness. This procedure was repeated for the different
Imputation methods. The pipeline output is the Suggested
DQ Improvement step, in which we suggest the best DQ
improvement technique based on the Accuracy and Fairness
results. The final users can choose the Imputation technique
with the minimum impact on Fairness according
to their preferred trade-off.</p>
          <p>DE-Oriented Experiments. Also in this case the input
dataset was free of DQ problems. As regards Fairness, we
did not have an error-injection phase since, this time, the
considered dataset (Adult Census Income) was already
biased. The DE Improvement phase consisted of applying
a Bias Mitigation technique to remove unfairness. Also
here, the repaired dataset was analyzed in the Final Evaluation
phase, where both Fairness and Accuracy are measured,
repeating this phase for all the selected Bias Mitigation
techniques. Some of these techniques, since they act
by directly replacing the data values with other (fake)
values, also allow controlling the amount of bias repair
executed. For example, Correlation Remover [7], fully described
in the next section, modifies the actual values to
minimize the correlation between the feature attributes
and the sensitive ones. The output of the pipeline is the
Suggested DE Improvement step, in which we propose the
best DE improvement technique based on both the DQ and
DE evaluation results. The final users can choose the
Bias Mitigation technique having the minimum impact
on Accuracy according to their preferred trade-off.</p>
          <p>1 https://archive.ics.uci.edu/ml/datasets/adult
2 https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
3 Related to a specific DQ dimension.</p>
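          <p>As an illustration of the Error Injection phase described above, the following minimal sketch (ours, not the authors' implementation) nulls out a uniformly sampled fraction of the cells of a pandas DataFrame and generates the ten ‘dirty’ versions, from 90% down to 0% injected errors.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' implementation) of the Error Injection phase:
# a given percentage of cells, chosen uniformly at random, is set to null.
import numpy as np
import pandas as pd

def inject_missing(df, error_fraction, seed=0):
    """Return a copy of df where error_fraction (0-1) of the cells are nulled."""
    rng = np.random.default_rng(seed)
    dirty = df.copy().astype(object)
    n_rows, n_cols = dirty.shape
    n_errors = int(round(error_fraction * n_rows * n_cols))
    flat_idx = rng.choice(n_rows * n_cols, size=n_errors, replace=False)
    for idx in flat_idx:
        dirty.iat[idx // n_cols, idx % n_cols] = np.nan
    return dirty

def dirty_versions(df):
    # Ten instances of the dataset, from 90% down to 0% injected errors
    # (decreasing step of 10%), as in the DQ-Oriented pipeline.
    return {pct: inject_missing(df, pct / 100) for pct in range(90, -1, -10)}
          </preformat>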
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>In this section, we first introduce the experimental setup</title>
        <p>and then describe the results, both from the DQ and the
DE perspectives.
1https://archive.ics.uci.edu/ml/datasets/adult
2https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+
data)
3Related to a specific DQ dimension</p>
        <sec id="sec-2-2-1">
          <title>4.1. Experimental setup</title>
          <p>DQ Improvement phase. In this paper, we consider
three Data Imputation techniques: Density-based, where
missing values are imputed for each feature following the
distribution of the non-empty values; Mode Imputation,
where the most frequent value is imputed; and Rare-based,
where the least frequent value is imputed.</p>
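          <p>The three Imputation techniques can be sketched as follows (an illustrative example of ours; the paper does not publish its implementation): Mode imputes the most frequent value, Rare-based the least frequent one, and Density-based samples from the empirical distribution of the non-null values of each column.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' code) of the three Data Imputation techniques.
import numpy as np
import pandas as pd

def impute(df, strategy, seed=0):
    rng = np.random.default_rng(seed)
    repaired = df.copy()
    for col in repaired.columns:
        counts = repaired[col].value_counts(dropna=True)   # sorted, most frequent first
        missing = repaired[col].isna()
        if counts.empty or not missing.any():
            continue
        if strategy == "mode":
            repaired.loc[missing, col] = counts.index[0]     # most frequent value
        elif strategy == "rare":
            repaired.loc[missing, col] = counts.index[-1]    # least frequent value
        elif strategy == "density":
            probs = counts / counts.sum()
            sampled = rng.choice(probs.index.to_numpy(),
                                 size=int(missing.sum()),
                                 p=probs.to_numpy())
            repaired.loc[missing, col] = sampled             # follow the distribution
    return repaired
          </preformat>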
          <p>Bias Mitigation phase. Three Bias Mitigation techniques
are proposed to remove the unfairness from data.
The first one, Correlation Remover [7], removes the negative
correlation between the protected attribute and the
classification label by modifying the non-protected attributes
that are in turn correlated with the protected one:
mathematically speaking, it poses a minimization problem
on the correlation between the feature attributes
and the sensitive ones, by centering the sensitive values,
training a linear regressor on the non-sensitive ones
and reporting the residual. The second one is Learning
Fair Representation [6], which maps each data tuple
(corresponding to an individual) to a ‘prototype’, an artificial
representation of the data containing the same
protected attribute but with modified values for the other
features, to remove the correlation between the protected
attributes and the target ones. To do so, this method uses
a neural network with the objective of retaining as much
information as possible. The last one, Optimized Preprocessing [6],
solves an optimization problem with the
objective of minimizing the difference between the modified
distribution and the original one; specifically, it aims
to reduce the discrimination by mapping different feature
attributes to the classification labels of the individuals
inside the dataset, while keeping the protected attributes
unchanged, to limit the dependency of the prediction on
the sensitive attributes. In all three cases, the techniques
involve only the numerical features.</p>
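          <p>The partial bias repair mentioned in Section 3.2 can be illustrated with Fairlearn's CorrelationRemover, the implementation behind the first technique [7]. The sketch below is our own illustration rather than the authors' experimental code; it assumes a pandas DataFrame X_num of numerical features that includes a numerically encoded protected column ‘sex’, and it relies on the Fairlearn parameters sensitive_feature_ids and alpha as we understand them from that library's documentation.</p>
          <preformat preformat-type="code">
# Minimal sketch of applying Correlation Remover [7] with a tunable amount of
# bias repair (alpha). Column names are illustrative assumptions.
import pandas as pd
from fairlearn.preprocessing import CorrelationRemover

def remove_correlation(X_num, alpha=1.0):
    """alpha = 1.0 removes the correlation completely, 0.0 leaves data unchanged."""
    remover = CorrelationRemover(sensitive_feature_ids=["sex"], alpha=alpha)
    transformed = remover.fit_transform(X_num)
    # Assumption: Fairlearn returns the remaining (non-sensitive) columns in
    # their original order, with the sensitive column dropped.
    other_cols = [c for c in X_num.columns if c != "sex"]
    return pd.DataFrame(transformed, columns=other_cols, index=X_num.index)
          </preformat>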
          <p>Evaluation Metrics. To evaluate the DQ level of the
dataset, during the Evaluation phase, the Accuracy metric
has been selected. To this aim, the distance between the
original and the final dataset has been computed. Thus,
we extracted the number n_matched of values that correspond
to each other in the original and the final dataset, and
measured the Accuracy as n_matched / n, where n is the total
number of cells. Since there is no standard system for measuring
Fairness, we used two different systems. For the DQ-Oriented
Experiments, we measured Fairness by means of a set of
already defined formulas. Instead, for the DE-Oriented
Experiments, we computed the Fairness metrics offered
by the Fairlearn [7] mitigation tool. The two results
are comparable since there is a very small delta between
the two. For the DQ-Oriented Experiments, the
three metrics, taken from [14, 15], selected to evaluate
Fairness (see Section 3) are expressed as:
Disparate Impact Ratio (DIR) = P(Ŷ=1 | A=discr) / P(Ŷ=1 | A=priv);
Predictive Parity Ratio (PPR) = P(Y=0 | Ŷ=1, A=discr) / P(Y=0 | Ŷ=1, A=priv);
False Positive Ratio (FPR) = P(Ŷ=1 | Y=0, A=discr) / P(Ŷ=1 | Y=0, A=priv);
where A is a protected attribute that has two values, discr (=discriminated)
and priv (=privileged); Y is the actual classification result, with two values
(or labels) 0 or 1; and Ŷ is the algorithm-predicted decision for the
individual, with two values of the outcome, 0 (negative
outcome) or 1 (positive outcome). The ideal value for all
three metrics is 1, which means both groups are treated
equally. If the value is between 0 and 1 − ε, the discriminated
group is treated unfairly, whereas if the value is
greater than or equal to 1 + ε, the privileged group is treated
unfairly. Parameter ε is a threshold value that must be set
by an expert. In our experiments we set the ε parameter
equal to 0.2.</p>
          <p>Dataset and classification algorithm. As explained in
Section 3, we considered two datasets. The first one is
the Adult Census Income dataset, typically used to predict
whether the income of an individual exceeds 50k$
per year. It comprises 48842 tuples, described by 15 attributes,
including the target class. This dataset contains
more than one protected attribute (‘race’, ‘sex’, and ‘native
country’), but our study considered only the attribute
‘sex’. The second one is the German Credit dataset, which
collects information on individuals that are classified
based on whether they are deemed good or bad payers
when asking for a loan. It comprises 1000 tuples,
consisting of 20 attributes, including the target class. The
sensitive attribute is ‘personal-status-sex’, i.e., the marital
status, from which the protected attribute ‘sex’ can be
derived. Differently from the previous one, this dataset is
not affected by bias with respect to ‘sex’. Finally, we used
as classification algorithm the Decision Tree Classifier
offered by the scikit-learn Python library.</p>
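          <p>For concreteness, the three ratios defined above can be computed directly from a model's predictions. The following minimal sketch (ours, not the paper's evaluation code) assumes NumPy arrays and a binary protected attribute whose discriminated and privileged values are passed explicitly; the discr-over-priv orientation follows the reconstruction given above.</p>
          <preformat preformat-type="code">
# Minimal sketch (not the authors' code) of the three Fairness metrics used in
# the DQ-Oriented Experiments. Values close to 1 mean both groups are treated
# equally; with the paper's threshold epsilon = 0.2, values below 1 - epsilon
# indicate unfair treatment of the discriminated group, values at or above
# 1 + epsilon unfair treatment of the privileged group.
import numpy as np

def _rate(event, condition):
    """Conditional relative frequency P(event | condition)."""
    return float(event[condition].mean())

def fairness_ratios(y_true, y_pred, group, discr="Female", priv="Male"):
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    d = group == discr
    p = group == priv
    pos = y_pred == 1          # predicted positive outcome
    neg = y_true == 0          # actual negative class
    dir_ratio = _rate(pos, d) / _rate(pos, p)
    ppr_ratio = _rate(neg, np.logical_and(d, pos)) / _rate(neg, np.logical_and(p, pos))
    fpr_ratio = _rate(pos, np.logical_and(d, neg)) / _rate(pos, np.logical_and(p, neg))
    return {"DIR": dir_ratio, "PPR": ppr_ratio, "FPR": fpr_ratio}
          </preformat>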
        </sec>
        <sec id="sec-2-3">
          <title>4.2. Result evaluation</title>
          <p>This section presents the main results we obtained. In
Figure 3, the x-axis represents the Completeness level;
instead, in Figure 4, the x-axis shows the degree of Bias
Mitigation. In both figures, the y-axis represents the level
of the evaluated metrics.</p>
          <sec id="sec-2-3-1">
            <title>4.2.1. DQ-Oriented Experiments</title>
            <p>The plots shown in Figure 3 focus on the DQ-Oriented
Experiments, in which the Accuracy and Fairness results are
compared for the three Imputation techniques explained
in Section 4.1.</p>
            <p>Biased dataset. The three plots at the top of Figure 3
show the results for the Adult dataset. In general, the
Mode and the Density-based Imputations reach higher
Accuracy with respect to the Rare-based one, since the
latter modifies the original distribution of values more
than the others. From the Fairness point of view, we can
observe that the Predictive Parity Ratio (PPR) metric can
assume values greater than 1 + ε (i.e., 1.2). This means that
the privileged class (men) is treated unfairly for that specific
Fairness aspect; i.e., the probability of belonging to
class 0 (low income) for a man that instead was predicted
to class 1 (high income) is lower than the probability
of belonging to class 0 for a woman predicted to class 1.
On the contrary, the False Positive Ratio (FPR) always
takes opposite values with respect to the PPR. These two
metrics are symmetrical since they represent opposite
Fairness aspects: FPR evaluates whether the probability
of predicting class 1 is the same both for men and women
belonging to class 0.</p>
            <p>As we can notice, in this specific experiment, the Mode
Imputation introduces minimal changes to the Fairness
metrics, since imputing the most frequent value does not
affect the distribution of the original ones. Instead, the
Density-based Imputation behaves much better: in fact, as
the percentage of injected errors increases, Fairness
increases for all three metrics. This is related to the vast
majority of class 0 in the dataset; since the Imputation
follows the value distribution, it means that those labels
(class 0) have a higher probability of being assigned to men
(who are over-represented). In this way, the dataset will be
balanced. We can conclude that the application of this
Imputation method improves Fairness. Finally, when applying
the Rare-based Imputation, when Completeness varies
between 100% and 40%, Fairness increases; for Completeness
values below 40%, Fairness decreases very quickly. In this
specific case, this happens because, by imputing the less
frequent values, the dataset will be more balanced in favor
of the protected class. As the percentage of injected errors
grows, the rare values become too many, unbalancing the
dataset again.</p>
            <p>Unbiased dataset. The three plots at the bottom of
Figure 3 show the results for the German dataset. Since
the two datasets have a similar distribution, after the
application of the Imputation techniques the Accuracy
takes similar values as in the previous case. Since the
dataset is already fair, the FPR and DIR metrics
assume values around 1, while the PPR is almost 2. After
applying the Imputation techniques, FPR and DIR are
not affected, while the value of PPR is closer to 1 (i.e.,
the probability of belonging to class 0 (bad credit) for
a man predicted to class 1 (good credit) is lower than
the probability of belonging to class 0 for a woman predicted
to class 1); therefore the PPR has improved with
respect to its initial value. In this case, the Imputation
techniques balanced the PPR, improving it as much as
they modify the original distribution of the values. In
fact, Rare-based Imputation, which modifies the original
distribution more, introduces unbalance, causing further
deterioration of Fairness over 60% injected errors.
From these results, we can notice a trade-off between Accuracy
and Fairness; from the DQ-Oriented Experiments
we see that this trade-off can be more or less emphasized
depending on the DQ improvement technique applied.</p>
          </sec>
          <sec id="sec-2-3-2">
            <title>4.2.2. DE-Oriented Experiments</title>
            <p>The plots shown in Figure 4 focus on the DE-Oriented
Experiments. We compared the Accuracy and Fairness
results for the Bias Mitigation techniques explained in
Section 4.1. The results of the experiments conducted on
the entire dataset are represented at the top of Figure 4.
The Bias Mitigation techniques we used focus only on
numerical attributes; thus, the results shown at the bottom
of Figure 4 show the same experiments based only
on the numerical features. We now present our results
by analyzing one Bias Mitigation technique at a time.</p>
            <p>Correlation Remover. When applying Correlation Remover
for a partial Bias Mitigation between 0 and 1, the
Fairness metrics (DIR, FPR, and PPR) slightly improve,
but with an important loss in Accuracy (from 1.0 to 0.6).
This happens because the removal of correlation strongly
modifies the data, greatly affecting Accuracy. Considering
the case in which only the numerical features are
involved, the Fairness metrics are negatively affected.
This represents a case of over-correction: by modifying
the entire dataset, data are too far from the original ones,
and the results are no longer reliable.</p>
            <p>Learning Fair Representation. Applying Learning Fair
Representation, we have the same loss in Accuracy as
for Correlation Remover, since it modifies the numerical
features in order to remove correlations. However, this
technique also aims to minimize information loss and thus
does not cause such a radical modification as the previous
method. Therefore, the Fairness improvement is minimal
considering the full dataset, while considering only the
numerical features, two metrics over three improve (DIR
and FPR).</p>
            <p>Optimized Preprocessing. Using Optimized Preprocessing,
the Accuracy remains unchanged before and after the
mitigation process. This happens because there is no data
modification, but only weights are given to the numerical
features in order to reduce the correlation between the
protected attribute and the prediction. However, applying
this technique to the full dataset is not sufficient to
improve Fairness because the categorical features still
affect the prediction. Moreover, applying this technique
considering only the numerical features improves one
Fairness metric (FPR) over three.</p>
            <p>In the DE-Oriented Experiments we detected a trade-off
between Accuracy and Fairness, and this relationship can
be more or less strong depending on the Bias Mitigation
technique that is applied.</p>
          </sec>
          <sec id="sec-2-3-3">
            <title>4.2.3. A brief comparison</title>
            <p>We can now summarize the differences between our work
and the approach of [11] presented in Section 2: in [11]
the authors studied only the Completeness dimension of
DQ, while we also evaluate the results using Accuracy;
the Fairness metric studied in [11] is only one, while we
studied two more metrics; in [11] the initial dataset used
for the experiments is an unclean one, while we control
the process by applying error injection to a previously
cleaned dataset; finally, in [11] the Imputation techniques
used are only Mode and Mean, while we also apply Rare-based
and Density-based Imputation techniques.</p>
          </sec>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>Takeaway message. From our experiments, we have
noticed that the application of Data Imputation techniques,
in some particular cases, e.g., Density-based Imputation
and Rare-based Imputation on the Adult dataset,
can contribute to improving Fairness. Moreover, in the
experiments starting from unbiased data, Fairness was
not affected by the application of the Imputation techniques.
In most cases, we noticed a trade-off: the Bias
Mitigation technique that less affects the Accuracy, in
general the Optimized Preprocessing technique, is not the
one that improves Fairness the most, and vice versa; for
these cases, we can deduce that techniques that succeed
in preserving both Accuracy and Fairness do not exist.
Therefore, as a takeaway message, we can affirm that the
best Data Imputation/Bias Mitigation technique to
apply strictly depends on the analysis goal. If users
are more interested in preserving Fairness aspects, they
will concentrate on a subset of techniques at the cost of
losing DQ; if the major interest is to optimize the improvement
of the DQ, they will apply a subset of DQ
improvement tasks that could affect Fairness. It is worth
noting that situations may also exist in which Accuracy
and Fairness are not in conflict; however, this is strictly
context-dependent.</p>
      <p>Conclusions. In this work, we analyzed the relationship
between Data Quality (DQ) and Data Ethics (DE). Specifically,
we focus on the Completeness dimension of DQ,
and on the Fairness dimension of DE. Through a series of
experiments, we demonstrated that between DQ and DE
a trade-off is present. In fact, the experiments showed us
that the application of Fairness improvement operations
can lead to a deterioration of Accuracy, used to evaluate
the DQ, and vice versa. Analyzing the experiments in
more detail, we can also state that the amount of Accuracy
deterioration after Fairness improvements depends on
the Bias Mitigation technique, as well as the deterioration
of Fairness can depend on the selected Imputation
technique. Future work will focus on the definition of
clear guidelines to recommend the best choice of DQ/DE
improvement techniques to be applied depending on the
scope of the analysis. Moreover, we could enrich the
gathered knowledge with more datasets, DQ and DE
dimensions, and Bias Mitigation techniques [16, 17].</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was supported by EU Horizon Framework
grant agreement 101069543 (CS-AWARE-NEXT) and by
project ICT4Dev, funded by AICS (Italian Agency for
Development Cooperation).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1"><label>1</label><mixed-citation>A. Jain, et al., Overview and importance of data quality for machine learning tasks, in: Proceedings of the 26th ACM SIGKDD, 2020, pp. 3561–3562.</mixed-citation></ref>
      <ref id="ref2"><label>2</label><mixed-citation>C. Batini, M. Scannapieco, Data and Information Quality – Dimensions, Principles and Techniques, Data-Centric Systems and Applications, Springer, 2016.</mixed-citation></ref>
      <ref id="ref3"><label>3</label><mixed-citation>C. Sancricca, C. Cappiello, Supporting the design of data preparation pipelines (2022) 149–158.</mixed-citation></ref>
      <ref id="ref4"><label>4</label><mixed-citation>D. Firmani, L. Tanca, R. Torlone, Ethical dimensions for data quality, JDIQ 12 (2019) 1–5.</mixed-citation></ref>
      <ref id="ref5"><label>5</label><mixed-citation>F. Kamiran, I. Žliobaitė, Explainable and non-explainable discrimination in classification, Discrimination and Privacy in the Information Society: Data mining and profiling in large databases (2013) 155–170.</mixed-citation></ref>
      <ref id="ref6"><label>6</label><mixed-citation>R. K. Bellamy, et al., AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias, IBM Journal of Research and Development 63 (2019) 4–1.</mixed-citation></ref>
      <ref id="ref7"><label>7</label><mixed-citation>S. Bird, et al., Fairlearn: A toolkit for assessing and improving fairness in AI, Microsoft, Tech. Rep. MSR-TR-2020-32 (2020).</mixed-citation></ref>
      <ref id="ref8"><label>8</label><mixed-citation>S. S. Abraham, FairLOF: fairness in outlier detection, Data Science and Engineering 6 (2021) 485–499.</mixed-citation></ref>
      <ref id="ref9"><label>9</label><mixed-citation>S. Biswas, H. Rajan, Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline, in: Proceedings of the 29th ACM Joint Meeting on ESEC/FSE, 2021, pp. 981–993.</mixed-citation></ref>
      <ref id="ref10"><label>10</label><mixed-citation>S. Guha, F. A. Khan, J. Stoyanovich, S. Schelter, Automated data cleaning can hurt fairness in machine learning-based decision making, in: 2023 IEEE 39th International Conference on Data Engineering (ICDE), IEEE, 2023, pp. 3747–3754.</mixed-citation></ref>
      <ref id="ref11"><label>11</label><mixed-citation>F. Martínez-Plumed, C. Ferri, D. Nieves, J. Hernández-Orallo, Fairness and missing values, arXiv preprint arXiv:1905.12728 (2019).</mixed-citation></ref>
      <ref id="ref12"><label>12</label><mixed-citation>R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers, JMIS 12 (1996) 5–33.</mixed-citation></ref>
      <ref id="ref13"><label>13</label><mixed-citation>N. A. Saxena, et al., How do fairness definitions fare? Testing public attitudes towards three algorithmic definitions of fairness in loan allocations, Artif. Intell. 283 (2020) 103238.</mixed-citation></ref>
      <ref id="ref14"><label>14</label><mixed-citation>S. Verma, J. Rubin, Fairness definitions explained, in: Proceedings of the FairWare@ICSE, 2018, pp. 1–7.</mixed-citation></ref>
      <ref id="ref15"><label>15</label><mixed-citation>N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Comput. Surv. 54 (2022) 115:1–115:35.</mixed-citation></ref>
      <ref id="ref16"><label>16</label><mixed-citation>F. Azzalini, C. Criscuolo, L. Tanca, E-FAIR-DB: functional dependencies to discover data bias and enhance data equity, JDIQ 14 (2022) 1–26.</mixed-citation></ref>
      <ref id="ref17"><label>17</label><mixed-citation>F. Azzalini, C. Criscuolo, L. Tanca, FAIR-DB: A system to discover unfairness in datasets, in: ICDE, IEEE, 2022, pp. 3494–3497.</mixed-citation></ref>
    </ref-list>
  </back>
</article>