Investigating Bias with a Synthetic Data Generator: Empirical Evidence and Philosophical Interpretation

Alessandro Castelnovo1,*, Riccardo Crupi1, Nicole Inverardi1, Daniele Regoli1 and Andrea Cosentini1
1 Data Science & Artificial Intelligence, Intesa Sanpaolo S.p.A., Italy
* Corresponding author.
1st Workshop on Bias, Ethical AI, Explainability and the role of Logic and Logic Programming, BEWARE-22, co-located with AIxIA 2022, November 28 - December 2, 2022, University of Udine, Udine, Italy

Abstract
Machine learning applications are becoming increasingly pervasive in our society. Since these decision-making systems rely on data-driven learning, the risk is that they will systematically spread the bias embedded in data. In this paper, we propose to analyze biases by introducing a framework for generating synthetic data with specific types of bias and their combinations. We delve into the nature of these biases, discussing their relationship to moral and justice frameworks. Finally, we exploit our proposed synthetic data generator to perform experiments on different scenarios, with various bias combinations. We thus analyze the impact of biases on performance and fairness metrics both in non-mitigated and mitigated machine learning models.

Keywords
Synthetic Data, Bias, Fairness, Worldview, Machine Learning

1. Introduction

As society grows more digital, a greater amount of data becomes accessible for decision-making. In this context, machine learning techniques are increasingly being adopted by businesses, governments, and organizations in many important domains that affect people’s lives every day. However, algorithms, like humans, are susceptible to biases that might lead to unfair outcomes [1]. Bias is not a recent problem; rather, it is ingrained in human society and, as a result, it is reflected in data [2]. The risk is that the adoption of machine learning algorithms could amplify or introduce biases that will reoccur in society in a perpetual cycle [3].

Several projects and initiatives have been launched in recent years aimed at bias mitigation and the development of fairness-aware machine learning models. Following [2], we divide these works into three main categories:

• Understanding bias. Approaches that help to understand how bias is generated in society and manifests in data. This category contains studies of the differences among biases, as well as their definition and formalization.
• Accounting for bias. Approaches discussing how to manage bias depending on the context, the regulation, and the vision and strategy on fairness [4, 5]. As discussed in [6], different definitions of fairness and their implementations correspond to different axiomatic beliefs about the world (or worldviews), which are in general mutually incompatible.
• Mitigating bias. Technical approaches aimed at developing machine learning models that reduce bias while optimizing performance. Depending on the stage of the machine learning pipeline at which the bias is mitigated, these methods are typically divided into pre-processing [7, 8], in-processing [9] and post-processing [10, 11].

One common approach to investigating the nature of bias is to conduct experiments on ad hoc scenarios through the generation of synthetic data [12, 13].
The benefits of this strategy include the possibility of examining circumstances that are not available in real-world data but that may occur, and, even when real-world data is available, of precisely controlling and understanding the data generation mechanism. Moreover, it is indisputable that making data, and the related challenges, accessible to the research community for analysis could help the development of sound policy decisions and benefit society [13].

1.1. Contribution

With this work, we aim to contribute to the literature on fairness in machine learning in each of the three areas discussed above through the use of synthetic data. In particular, we contribute to understanding bias by introducing a model framework for generating synthetic data with specific types of bias. Our formalisation of these various types of bias is based on the theoretical classifications present in the relevant literature, such as the surveys on bias in machine learning by Mehrabi et al. [3] and Ntoutsi et al. [2].

Against the background of the stream of literature about the relation between moral worldviews and biases, and in particular following [6, 14, 15], we analyze the worldviews related to each bias that our framework is able to generate, thus providing some insights into the discussion on accounting for bias.

Finally, regarding mitigating bias, we leverage our framework to generate twenty-five different scenarios characterized by the presence of various bias combinations. In each setting, we investigate the behavior and effects of traditional machine learning mitigation strategies [16].

An open source implementation of the proposed model framework is available at github.com/rcrupiISP/ISParity.

2. Related Works

2.1. Synthetic data

Synthetic data generation is a relevant practice for both businesses and the scientific community, and as a result it has received a lot of attention in the literature. The main directions behind the generation of synthetic data are: the emulation of certain key information in real datasets while preserving privacy [13, 17]; and the generation of different testing scenarios for evaluating phenomena not covered by the available data [12]. Assefa et al. [17] presented basic use cases with specific examples in the financial domain, such as internal data use restrictions, data sharing, tackling class imbalance, lack of historical data and training advanced machine learning models. Moreover, the authors defined Privacy preserving, Human readability and Compactness as desirable properties for synthetic representations. It is important to remark that synthetic data generation is not a data anonymisation technique [18], but rather an alternative data sanitisation method to data masking for preserving privacy in published data [19]. In fact, synthetic data are typically randomly generated with constraints to protect sensitive private information and retain valid inferences with the attributes in the original data [13]. Synthetic data are generally classified into fully synthetic data, partially synthetic data and hybrid synthetic data (see [20] for further details). We refer to [19] for a detailed overview of the techniques for generating synthetic data sets.

As introduced in the previous section, we are interested in producing fully synthetic data to replicate some common biases that could affect the data, and in investigating fairness-related issues that arise from the development of machine learning models on them.
In this regard, there are numerous works in the literature that generate synthetic datasets to simulate desired scenarios and, from these, test discrimination-aware methods [21, 22, 23, 24]. For example, Reddy et al. [23] evaluate different fairness methods trained with deep neural networks on synthetic datasets containing different imbalanced and correlated data configurations, in order to verify the limits of current models and better understand in which setups they are subject to failure. We contribute to this field of literature by introducing a modeling framework to generate synthetic data presenting specific forms of bias. Our formalisation of these different kinds of bias builds upon theoretical classifications present in the relevant literature, such as the works of Mehrabi et al. [3] and Ntoutsi et al. [2]. We leverage our proposed method to actually generate several datasets, each characterized by a specific combination of biases, and perform experiments on them to examine the effects of such biases on state-of-the-art mitigation approaches.

2.2. Bias and moral framework in decision-making

There is little consensus in the literature regarding bias classification and taxonomy. Moreover, the very notion of bias depends on profound ethical and philosophical considerations, which is likely one of the very causes of the lack of consensus. Different understandings of bias and fairness depend on the prior assumption of a belief system. Friedler et al. [6] and Hertweck et al. [14] talk about worldviews. In particular, [6] outlines two extreme cases, referred to as What You See Is What You Get (WYSIWYG) and We are All Equal (WAE). Starting from the definition of three different metric spaces, these two perspectives differ in the way they consider the relations between those spaces. The first space is named Construct Space (CS) and represents all the unobservable realized characteristics of an individual, such as intelligence, skills, determination or commitment. The second space is the Observable Space (OS) and contains all the measurable properties that aim to quantify the unobservable features; think, e.g., of IQ or aptitude tests. The last space is the Decision Space (DS), representing the set of choices made by the algorithm on the basis of the measurements available in OS. Note that shades of ambiguity are already detectable at this level, because the mappings between spaces are susceptible to distortions. Moreover, CS is by definition unobservable, thus we can only make assumptions about it.

According to WYSIWYG, CS and OS are essentially equal, and any distortion between the two is altogether irrelevant for the fairness of the decision resulting in DS. On the contrary, WAE does not make assumptions about the similarity of OS and CS, and moreover assumes that we are all equal in CS, i.e. that any difference between CS and OS is due to a biased observation process that results in an unfair mapping between CS and OS. With this distance between worldviews in mind, the notion of fairness inspired by [25], affirming that individuals who are close in CS shall also be close in DS (commonly known as individual fairness), appears diversified and differently achievable. If WYSIWYG is chosen, non-discrimination is guaranteed as soon as the mapping between OS and DS is fair, since CS ≈ OS.
On the other hand, according to WAE, the mapping between CS and OS is distorted by some bias whenever a difference among individuals emerges (this difference is named Measurement Bias in [14]); therefore, to obtain a fair mapping between CS and DS, those biases should be properly mitigated. Building on [6], Hertweck et al. [14] describe a more realistic and nuanced scenario by introducing the notion of Potential Space (PS): individuals belonging to different groups may indeed have different realized talents (i.e. they actually differ in CS), and these may be accurately measured by resumes (i.e. CS ≈ OS); but, if we assume that these groups have the same potential talents (i.e. they are equal in PS), then the realized difference must be due to some form of unfair treatment of one group, which is referred to as life bias. Hertweck et al. call this view We Are All Equal in Potential Space (WAEPS). Actually, as argued in [14], we can effectively think of the WAE assumption as a family of assumptions, depending on the point in time at which the equality is assumed to hold: the further back in time we assume equality between individuals, the stronger the consideration of life circumstances becomes, and thus the less discrimination between individuals is considered legitimate.

These extreme worldviews amount, on the one hand, to accepting the situation as it is observed (WYSIWYG) and, on the other, to inferring some form of unfairness whenever there is some observed disparity (WAE). To avoid such extreme scenarios, philosophical theories around Equality of Opportunity (EO) offer some suggestions and interpretive tools for approaching biases in different situations [26, 27, 28]. In this sense, western political philosophy and the algorithmic fairness literature meet in the formulation of fairness around the concept of equal opportunities for all members of society. Heidari et al. [15] list three different EO conceptions, going from more permissive to more stringent: Libertarian EO, by which individuals are held accountable for any characterizing feature, sensitive ones included; Formal EO, by which individuals are not held accountable for differences in sensitive features only; Substantive EO, by which there is a set of individual characteristics that are due to circumstances and others that are a consequence of individual effort, and people should be held accountable only on the basis of the latter. The choice of which characteristics fall within the realm of circumstances and which can be considered individual effort is far from obvious. Depending on the EO framework that one is willing to embrace, observed disparities may be seen as “just” or “unjust” forms of bias. In the following section, we shall describe the most common biases, explaining how they relate to these fundamental concepts.

3. Fundamental types of bias

Considering that the assumptions about the worldview and the EO framework affect the conception as well as the assessment of biases, in what follows we focus on what we consider the fundamental building blocks of most types of biases, namely: Historical bias, Measurement bias, Representation bias, Omitted variable bias.

Historical bias —sometimes referred to as social bias, life bias, or structural bias [3, 2, 14]— occurs whenever a variable of the dataset relevant to some specific goal or task depends on some sensitive characteristic of individuals, although in principle it should not.
An example of this bias is the different average income of men and women, which is due to long-lasting social pressures in a man-centered society and does not reflect intrinsic differences between the sexes. Following [3], we can speak of a form of bias going from users to data: this type of bias directly affects the actual phenomenon generating the data. A similar situation may arise when the dependence on sensitive individual characteristics involves the variable that we are trying to estimate or predict. For instance, there are cases in which the target of model estimation is itself prone to some form of bias, e.g. because it is the outcome of some human decision. Think, e.g., of building a data-driven process to decide whether or not to grant a loan on the basis of past loan officers’ decisions, rather than of actual repayments.

Note that the actual presence of historical bias is conditional on a prior assumption of the WAE worldview. Indeed, arguing that in principle there should be no dependence on some sensitive features makes sense only if a moral belief of substantial equity is required to begin with. Otherwise, according to WYSIWYG, CS is fairly reported in OS and therefore structural differences between individuals are legitimate sources of inequality. Moreover, accepting the Libertarian EO or the Substantive EO frameworks would involve the legitimate use of some sensitive features, respectively because they are a property of the self and because we should be aware of their influence on the values of non-sensitive features. Ultimately, the presence of historical bias depends on the assumption of WAE at the initial time of life. As argued in [14], interpreting bias as historical means conceiving equality at the level of PS, which describes the innate/native potential of each individual.

Measurement bias occurs when a proxy of some variable relevant to a specific goal or target is employed, and that proxy happens to depend on some sensitive characteristics. For instance, one may argue that IQ is not a “fair” approximation of actual “intelligence”, and it might systematically favour/disfavour specific groups of individuals. Statistically speaking, this type of bias is not very different from historical bias —since it results in employing a variable correlated with sensitive attributes— but the underlying mechanism is nevertheless different: in this case the bias need not be present in the phenomenon itself, but rather may be a consequence of the choice of the data to be employed. In other words, this is an example of bias from data to algorithm in the taxonomy of [3], i.e. a bias due to data availability, choice and collection. Incidentally, notice that this form of bias —using a biased proxy of a relevant variable— might as well occur with the target variable. In this situation, it is the quantity that we need to estimate/predict that is somehow “flawed”.

The fact that measurement bias also depends on a choice component, namely the choice of the dataset, extends its relevance to the WYSIWYG worldview as well. Indeed, the choice of “what to measure” determines and modifies what is made observable. Awareness of biased measurements would therefore probably require mitigation also under WYSIWYG. Alternatively, according to WAE, measurement bias may lie in the mapping between construct and observed space, that is, between a “real” ability of an individual and an observable quantity that tries to measure it.
[Figure 1 (diagram): Atomic biases. (a) Historical bias on feature and target (dashed) variable; (b) Omitted variable bias; (c) Measurement bias on 𝑅; (d) Measurement bias on 𝑌. Grey-filled circles represent variables employed by the model 𝑓^.]

Representation bias occurs when, for some reason, data are not representative of the world population. One subgroup of individuals, e.g. identified by some sensitive characteristic such as ethnicity or age, may be heavily under-represented. This under-representation may occur in different ways. It may be at random, i.e. the subgroup is less numerous than it should be, but without any particular skewness in the other characteristics: in this case, this single mechanism is not sufficient to create disparities, but it may exacerbate existing ones. Alternatively, the under-represented subgroup might contain individuals with disproportionate characteristics with respect to their corresponding world population, e.g. only low-income individuals, or only low-education individuals. In this last case, representation bias may be enough to create disparities in decision-making processes based on those data (the two under-sampling mechanisms are sketched in the snippet below). The mechanism underlying the representation disparities should be analyzed on the basis of the assumed worldview and/or the chosen EO framework: e.g. if the data contain an under-represented ethnic minority, one should investigate why this is so. If the target population is itself different from the world population (i.e. it is not merely a matter of poor data collection), then one should consider the reasons why this ethnic minority is under-represented in the target population; e.g. in the Substantive EO framework, one should understand whether these reasons have to be regarded as circumstance or as a consequence of individual effort. In the latter case, the representation disparities are not to be considered “unfair” per se. Representation bias is strictly connected to sampling bias, in that it embodies problems arising during data collection, e.g. collecting disproportionately fewer observations from one subgroup, possibly skewed with respect to some characteristics. Like measurement bias, this is a form of bias going from data to algorithm.

Omitted variable bias may occur when the collected dataset omits a variable relevant to some specific goal or task. In this case, if the variables that are present in the dataset have some dependence on sensitive characteristics of individuals, then a machine learning model trained on such a dataset will learn those dependencies, thus producing outcomes with a spurious dependence on sensitive attributes. Assuming the Formal EO framework, sensitive features are omitted by default. While this may appear fairer, because the decision is made solely on the basis of the relevant attributes, on the other hand it becomes arduous to mitigate structural biases that affect achievements. Depending on worldview assumptions or on the chosen EO framework, the mechanism through which the remaining variables happen to depend on sensitive individual characteristics should be analyzed as well, to understand/decide whether this dependence is legitimate or whether it is itself a consequence of some bias at work.
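The two under-sampling mechanisms just described, purely random versus skewed with respect to a relevant characteristic, can be sketched as follows. This is an illustrative helper written for this discussion, not part of the released code; it assumes a pandas DataFrame with a sensitive column A and a resource column R, anticipating the notation and the proportion parameter 𝑝𝑢 of section 4.

```python
import pandas as pd


def undersample_minority(df, p_u, skewed=False, seed=0):
    """Keep only a fraction of the A=1 group, sized p_u relative to the A=0 group."""
    majority = df[df["A"] == 0]
    minority = df[df["A"] == 1]
    n_keep = min(len(minority), int(p_u * len(majority)))
    if skewed:
        # Non-random under-representation: keep only the A=1 individuals with the
        # lowest R, so the retained subgroup is also skewed on a relevant characteristic.
        kept = minority.nsmallest(n_keep, "R")
    else:
        # Random under-representation: the subgroup is smaller, but not otherwise distorted.
        kept = minority.sample(n=n_keep, random_state=seed)
    # Recombine and shuffle the rows.
    return pd.concat([majority, kept]).sample(frac=1, random_state=seed).reset_index(drop=True)
```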
The above list of biases should be seen as the set of the most important mechanisms through which “unfair” disparities happen to result in data-driven decision making. Notice, however, that in terms of consequences on the data, it may well be that different types of bias result in very similar effects. For example, representation bias may create in the dataset spurious correlations between sensitive characteristics of individuals and other characteristics relevant to the problem at hand, a situation very similar to the correlations present as a consequence of historical bias. This reminds us that in reality we are not aware of the type of bias (or biases) affecting the data, and that their interpretation depends on prior assumptions.

4. Dataset Generation

We propose a simple modeling framework able to simulate the bias-generating mechanisms described in section 3. The rationale behind the model is to be sufficiently flexible to accommodate all the main forms of bias generation, while maintaining a structure as simple and intuitive as possible, to facilitate human readability and ensure compactness, avoiding unnecessary complexities that might hide the relevant patterns.

As noted in section 3, following [3] we can distinguish between biases from users to data and from data to algorithm, namely between biases that impact the phenomenon to be studied and thus the dataset, and biases that impact the dataset directly but not the phenomenon itself. Formally, we model the relevant quantities describing a phenomenon as random variables; in particular, we label 𝑌 the target variable, namely the quantity to be estimated or predicted on the basis of other feature variables, which we collectively call 𝑋. As usual, we assume that the underlying phenomenon is described by the formula

𝑌 = 𝑓(𝑋) + 𝜖, (1)

where 𝑓 represents the actual relationship between features and target variables, modulated by some idiosyncratic noise 𝜖. A data-driven decision maker infers, from a (training) set of samples {(𝑥̃𝑖, 𝑦𝑖)}𝑖=1,...,𝑁, an estimate of 𝑓 that we label 𝑓^, thus producing its best estimate of 𝑌, namely

𝑌^ = 𝑓^(𝑋̃). (2)

The use of 𝑋̃ rather than 𝑋 reflects the fact that the set of variables employed to make inferences about a phenomenon may not coincide with the actual variables that play a role in that phenomenon. This is precisely what happens in some forms of bias, such as measurement bias or omitted variable bias. Notice that users-to-data types of bias directly impact equation (1), while data-to-algorithm biases act at the level of equation (2). It is possible to make a schematic representation of the building blocks of biases discussed in section 3 via Directed Acyclic Graphs (DAGs), representing causal impacts among variables (see e.g. [29, 30, 31]).

In general, in order to provide an intuitive grasp of interesting mechanisms and patterns, we shall make reference to the following situation:

- we label with 𝑅 variables representing resources of individuals —be they economic resources, or personal talents and skills— which are relevant for the problem, i.e. they directly impact the target 𝑌;
- we label with 𝐴 variables indicating sensitive attributes, such as ethnicity, gender, etc.;
- we label with 𝑃𝑅 proxy variables that we have access to instead of the original variable 𝑅. We may think of 𝑅 as soft skills, or talent, or intelligence, i.e. something that we cannot measure or sometimes even quantify directly;
- we label with 𝑄 additional variables, which may or may not be relevant for the problem (i.e. impacting 𝑌) and which may or may not be impacted by either 𝑅 or 𝐴, e.g. the neighborhood one lives in.

With these examples in mind, we can e.g. think of 𝑌 —i.e. the target of our decision-making process— as the repayment of a debt or the work performance.
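To make the distinction between equations (1) and (2) concrete, here is a minimal, self-contained sketch (ours, for illustration only; variable names and parameter values are arbitrary) in which the phenomenon is generated from 𝑅 and 𝑄 as in eq. (1), while the decision maker learns 𝑓^ from a feature set 𝑋̃ that omits 𝑅, as in eq. (2).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000

# Underlying phenomenon, eq. (1): Y is driven by resources R and an additional variable Q.
R = rng.normal(size=n)
Q = rng.binomial(3, 0.5, size=n)
Y = ((R - 0.5 * Q + rng.normal(scale=0.5, size=n)) > 0).astype(int)

X = np.column_stack([R, Q])   # the variables that actually generate Y
X_tilde = Q.reshape(-1, 1)    # what the decision maker can access (here, R is omitted)

# Data-driven decision maker, eq. (2): f_hat is learned from X_tilde, not from X.
idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0)
f_hat = RandomForestClassifier(n_estimators=100, random_state=0)
f_hat.fit(X_tilde[idx_tr], Y[idx_tr])
Y_hat = f_hat.predict(X_tilde[idx_te])   # the decision maker's best estimate of Y
```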
In particular, figure 1 shows minimal graph representations of historical, omitted variable and measurement biases. Historical bias occurs when the relevant variable 𝑅 is somehow impacted by the sensitive feature 𝐴. Omitted variable bias occurs when, for some reason, we omit the relevant variable 𝑅 from our dataset and employ another variable which happens to be impacted by 𝐴. Measurement bias occurs when the relevant variable 𝑅 is, in general, free of bias, but we cannot access it, so we employ a proxy 𝑃𝑅 which is impacted by the sensitive characteristic 𝐴. Measurement bias can also occur on the target variable 𝑌, when we can access only a proxy 𝑃𝑌 of the phenomenon that we want to predict.

The following system of equations formalizes the relationships between the variables that are used to simulate specific forms of bias:

𝐴 = 𝐵𝐴,  𝐵𝐴 ∼ Ber(𝑝𝐴);   (3a)
𝑅 = −𝛽ℎ^𝑅 𝐴 + 𝑁𝑅,  𝑁𝑅 ∼ 𝒩(𝜇𝑅, 𝜎𝑅²);   (3b)
𝑄 = 𝐵𝑄,  𝐵𝑄 | (𝑅, 𝐴) ∼ Bin(𝐾, 𝑝𝑄(𝑅, 𝐴)),   (3c)
𝑝𝑄(𝑅, 𝐴) = sigmoid(−(𝛼𝑅𝑄 𝑅 − 𝛽ℎ^𝑄 𝐴));   (3d)
𝑆 = 𝛼𝑅 𝑅 − 𝛼𝑄 𝑄 − 𝛽ℎ^𝑌 𝐴 + 𝑁𝑆,  𝑁𝑆 ∼ 𝒩(0, 𝜎𝑆²);   (3e)
𝑌 = 1{𝑆 > 𝑠̄}.   (3f)

When simulating measurement bias, either on the resources 𝑅 or on the target 𝑌, we use the following proxies as noisy (and biased) substitutes of the actual variables:

𝑃𝑅 = 𝑅 − 𝛽𝑚^𝑅 𝐴 + 𝑁𝑃𝑅,  𝑁𝑃𝑅 ∼ 𝒩(0, 𝜎𝑃𝑅²);   (4a)
𝑃𝑆 = 𝑆 − 𝛽𝑚^𝑌 𝐴 + 𝑁𝑃𝑆,  𝑁𝑃𝑆 ∼ 𝒩(0, 𝜎𝑃𝑆²);   (4b)
𝑃𝑌 = 1{𝑃𝑆 > 𝑠̄}.   (4c)

By varying the values of the parameters, we are able to generate different aspects of bias as follows:

- 𝛽ℎ^𝑗 determines the presence and amplitude of the historical bias on the variable 𝑗 ∈ {𝑅, 𝑄, 𝑌};
- 𝛽𝑚^𝑗, when the proxy 𝑃𝑗 is used instead of the original variable 𝑗, governs the intensity of the measurement bias on 𝑗 ∈ {𝑅, 𝑌};
- 𝛼𝑅, 𝛼𝑄 control the linear impact on 𝑆 (and thus on 𝑌) of 𝑅 and 𝑄, respectively; 𝛼𝑅𝑄 represents the intensity of the dependence of 𝑄 on 𝑅.

Additionally, in order to account for representation bias, we undersample the 𝐴 = 1 group. The amount of undersampling is governed by the parameter 𝑝𝑢, defined as the proportion of the under-represented group 𝐴 = 1 with respect to the majority group 𝐴 = 0. We draw the undersampling conditioned on 𝑅, by selecting the 𝐴 = 1 individuals with lower 𝑅. Finally, simulating omission bias is as simple as dropping the variable 𝑅 from the set of features the model uses to estimate 𝑌.
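For concreteness, the following sketch implements equations (3a)-(3f) and the proxies (4a)-(4c) in NumPy. It is a simplified re-implementation written for this exposition, not the released code at github.com/rcrupiISP/ISParity; parameter names mirror the symbols above, and all default values are arbitrary illustrative choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def generate_synthetic_data(n=10_000, p_a=0.3, K=3, s_bar=0.0, seed=0,
                            beta_h_R=0.0, beta_h_Q=0.0, beta_h_Y=0.0,   # historical bias
                            beta_m_R=0.0, beta_m_Y=0.0,                 # measurement bias
                            alpha_R=1.0, alpha_Q=1.0, alpha_RQ=1.0,
                            mu_R=0.0, sigma_R=1.0, sigma_S=0.5, sigma_P=0.5):
    rng = np.random.default_rng(seed)
    A = rng.binomial(1, p_a, n)                                   # (3a) sensitive attribute
    R = -beta_h_R * A + rng.normal(mu_R, sigma_R, n)              # (3b) resources
    p_Q = sigmoid(-(alpha_RQ * R - beta_h_Q * A))                 # (3d)
    Q = rng.binomial(K, p_Q)                                      # (3c) additional variable
    S = alpha_R * R - alpha_Q * Q - beta_h_Y * A + rng.normal(0, sigma_S, n)  # (3e)
    Y = (S > s_bar).astype(int)                                   # (3f) target
    # Noisy, possibly biased proxies, eqs. (4a)-(4c); relevant only when simulating measurement bias.
    P_R = R - beta_m_R * A + rng.normal(0, sigma_P, n)
    P_S = S - beta_m_Y * A + rng.normal(0, sigma_P, n)
    P_Y = (P_S > s_bar).astype(int)
    return {"A": A, "R": R, "Q": Q, "Y": Y, "P_R": P_R, "P_Y": P_Y}


# Examples of turning on individual biases (parameter values are illustrative):
#   historical bias on R:   generate_synthetic_data(beta_h_R=1.5)
#   historical bias on Y:   generate_synthetic_data(beta_h_Y=1.5)
#   measurement bias on R:  generate_synthetic_data(beta_m_R=1.5), then use P_R instead of R
#   omission bias:          drop R from the features; representation bias: undersample A == 1
```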
5. Experiments

We employ our modeling framework to generate a total of twenty-five different datasets by combining a set of five users-to-data biases with a set of four data-to-algorithm biases, plus a configuration without additional biases. The aim of the experiments is to investigate the effects of these bias combinations on machine learning algorithms. As clarified in section 3, users-to-data biases impact the phenomenon to be studied and thus the dataset. The five scenarios that we have chosen to explore, represented in figure 2, are defined as follows:

I) No historical bias: 𝑌 = 𝑓(𝑅, 𝑄) + 𝜖,  𝑅, 𝑄 ⊥⊥ 𝐴. (5)
II) Historical bias with effect compensation: 𝑌 = 𝑓(𝑅, 𝑄) + 𝜖,  𝑅 = 𝑅(𝐴), 𝑄 = 𝑄(𝐴), 𝑌 ⊥⊥ 𝐴. (6)
III) Historical bias on 𝑅: 𝑌 = 𝑓(𝑅, 𝑄) + 𝜖,  𝑅 = 𝑅(𝐴), 𝑄 ⊥⊥ 𝐴. (7)
IV) Historical bias on 𝑌: 𝑌 = 𝑓(𝑅, 𝑄, 𝐴) + 𝜖,  𝑅, 𝑄 ⊥⊥ 𝐴. (8)
V) Historical bias on 𝑅 and 𝑌: 𝑌 = 𝑓(𝑅, 𝑄, 𝐴) + 𝜖,  𝑅 = 𝑅(𝐴), 𝑄 ⊥⊥ 𝐴. (9)

[Figure 2 (diagram): The 5 considered scenarios with users-to-data biases: I) No historical bias; II) Historical bias with effect compensation; III) Historical bias on 𝑅; IV) Historical bias on 𝑌; V) Historical bias on 𝑅 and 𝑌.]

We combine each scenario with the following data-to-algorithm biases, having an effect on the variables that we assume we can access in order to train the models:

- no additional bias;
- measurement bias on 𝑅;
- omission bias on 𝑅;
- representation bias on 𝐴 = 1;
- measurement bias on 𝑌.

We thus generate twenty-five synthetic datasets, each characterized by a specific choice of the set of parameters discussed in section 4. Details on the values of the parameters and the complete Python code to reproduce the experiments are publicly available at github.com/rcrupiISP/ISParity.

On each generated dataset, we train and test the following three algorithms: (i) a Random Forest (RF) [32] that aims to maximize performance by utilizing all available variables; (ii) a Blinded Random Forest (BRF), i.e. a random forest that does not use the sensitive variable 𝐴, in order to avoid creating decision-making patterns based on it; (iii) an Equalized Random Forest (ERF), i.e. the same model as in (i) with a post-processing fairness mitigation block [16] that imposes the same rate of 𝑌^ = 1 between the different classes of 𝐴 (i.e. imposing Demographic Parity [33]).

Fairness of outcomes is evaluated through the Demographic Parity difference (ΔDP) [33], i.e. the difference between the rates of 𝑌^ = 1 among the two sensitive groups 𝐴 = 0 and 𝐴 = 1, namely ΔDP = 𝑃(𝑌^ = 1 | 𝐴 = 0) − 𝑃(𝑌^ = 1 | 𝐴 = 1). ΔDP can also be computed for the “true” target variable 𝑌, thus capturing the dependence of the target variable on the sensitive attribute 𝐴. Performance is evaluated by overall Accuracy (Acc). As an additional metric, both of fairness and of performance, we compute the difference in accuracy between the 𝐴 = 0 and 𝐴 = 1 groups (ΔAcc).

Table 1
Experiment results. Results regarding the 25 experiments: 5 users-to-data scenarios (rows) combined with 5 data-to-algorithm types of bias (columns). Metrics employed are Accuracy (Acc), Demographic Parity difference (ΔDP) and Accuracy difference between the 𝐴 = 0 and 𝐴 = 1 groups (ΔAcc). ΔDP for the “true” target variable 𝑌 is also provided for each scenario (values in round brackets next to the scenario name). RF is a random forest trained with all the available information, BRF is a RF blinded with respect to 𝐴, ERF is a RF with a post-processing fairness mitigation to impose Demographic Parity. Values are expressed as percentage points; each cell reports RF / BRF / ERF.

| users-to-data bias (ΔDP of 𝑌) | metric | no additional bias | + meas. bias on 𝑅 (𝑃𝑅 substitutes 𝑅) | + omission bias (𝑅 is omitted) | + representation bias (𝐴 = 1 randomly undersampled) | + meas. bias on 𝑌 (𝑃𝑌 substitutes 𝑌) |
|---|---|---|---|---|---|---|
| I) No historical bias (0) | Acc | 86.4 / 86.4 / 86.4 | 81.2 / 81.1 / 81.2 | 62.8 / 62.8 / 62.8 | 86.3 / 86.3 / 86.3 | 83.3 / 84.5 / 84.3 |
| | ΔDP | 0.1 / 0.1 / 0.0 | 0.8 / 7.4 / 0.5 | 1.2 / 1.2 / 1.0 | 0.7 / 0.6 / 0.4 | 11.2 / 0.1 / 0.0 |
| | ΔAcc | 0.1 / 0.1 / 0.1 | 0.5 / 0.7 / 0.5 | 0.4 / 0.4 / 0.4 | 2.0 / 1.8 / 2.0 | 4.0 / 0.1 / 0.0 |
| II) Historical bias compensation (0.9) | Acc | 86.3 / 86.4 / 86.4 | 81.1 / 81.1 / 81.1 | 62.7 / 61.6 / 62.4 | 86.3 / 86.3 / 86.4 | 83.7 / 84.2 / 84.6 |
| | ΔDP | 0.1 / 0.8 / 0.3 | 2.2 / 5.6 / 0.9 | 9.8 / 27.1 / 0.9 | 1.6 / 0.7 / 0.6 | 11.5 / 3.9 / 0.4 |
| | ΔAcc | 0.0 / 0.1 / 0.0 | 0.4 / 0.7 / 0.3 | 0.7 / 1.6 / 0.0 | 1.4 / 1.5 / 1.2 | 3.4 / 0.7 / 0.5 |
| III) Historical bias on 𝑅 (5) | Acc | 86.7 / 86.6 / 85.5 | 81.5 / 81.4 / 80.6 | 64.5 / 62.6 / 62.6 | 86.4 / 86.3 / 86.2 | 84.0 / 84.9 / 83.7 |
| | ΔDP | 12.6 / 12.9 / 0.0 | 11.8 / 19.1 / 0.7 | 36.7 / 1.2 / 1.0 | 11.2 / 10.3 / 0.0 | 25.1 / 13.1 / 0.1 |
| | ΔAcc | 2.2 / 2.1 / 2.3 | 2.5 / 2.2 / 2.3 | 1.2 / 2.6 / 2.6 | 0.9 / 1.0 / 1.3 | 1.7 / 1.5 / 4.9 |
| IV) Historical bias on 𝑌 (12) | Acc | 86.5 / 85.5 / 85.5 | 81.5 / 81.3 / 80.6 | 64.5 / 62.6 / 62.6 | 86.4 / 86.1 / 86.2 | 84.3 / 83.5 / 83.9 |
| | ΔDP | 11.2 / 0.1 / 0.0 | 10.9 / 7.4 / 0.6 | 36.7 / 1.2 / 1.0 | 10.7 / 0.5 / 0.0 | 23.1 / 0.2 / 0.1 |
| | ΔAcc | 1.7 / 0.0 / 2.4 | 2.7 / 2.9 / 2.2 | 1.3 / 2.6 / 2.6 | 1.0 / 1.6 / 1.3 | 1.1 / 5.7 / 4.7 |
| V) Historical bias on 𝑅 and 𝑌 (7) | Acc | 86.9 / 86.0 / 83.1 | 82.0 / 81.6 / 78.7 | 66.4 / 62.1 / 62.0 | 86.5 / 86.3 / 85.6 | 84.6 / 84.3 / 81.2 |
| | ΔDP | 25.1 / 13.1 / 0.1 | 27.0 / 18.8 / 0.6 | 36.7 / 1.2 / 1.0 | 19.1 / 10.3 / 1.3 | 37.4 / 13.3 / 0.1 |
| | ΔAcc | 4.0 / 3.8 / 3.7 | 5.1 / 5.0 / 4.6 | 4.7 / 4.1 / 4.1 | 4.2 / 1.4 / 6.5 | 0.1 / 6.6 / 11.3 |
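As a reference for how the entries of Table 1 are obtained, the following helper (ours, not part of the released code) computes Acc, ΔDP and ΔAcc from a vector of predictions and the sensitive attribute; applying it with the predictions replaced by the true 𝑌 gives the ΔDP of the target reported in round brackets. The post-processing mitigation used for ERF is the one provided by Fairlearn [16].

```python
import numpy as np


def fairness_metrics(y_true, y_pred, a):
    """Accuracy, Demographic Parity difference and accuracy difference between A=0 and A=1."""
    y_true, y_pred, a = (np.asarray(v) for v in (y_true, y_pred, a))
    acc = np.mean(y_pred == y_true)
    # Demographic Parity difference: P(Y_hat = 1 | A = 0) - P(Y_hat = 1 | A = 1)
    delta_dp = y_pred[a == 0].mean() - y_pred[a == 1].mean()
    # Accuracy difference between the two sensitive groups
    acc_0 = np.mean(y_pred[a == 0] == y_true[a == 0])
    acc_1 = np.mean(y_pred[a == 1] == y_true[a == 1])
    # Table 1 reports the magnitudes of these quantities as percentage points.
    return {"Acc": acc, "DeltaDP": delta_dp, "DeltaAcc": acc_0 - acc_1}
```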
6. Discussion

Table 1 summarizes the results obtained from the twenty-five experiments on generated datasets, on the basis of which we draw the following non-exhaustive list of insights.

On the effect of equalization: as expected, the adoption of a model that enforces group equalization of the outcomes results in lower performance in the majority of the scenarios considered. This gap is proportional to the level of disparity present in the data. On the contrary, when there is representation bias, the effects on performance are less visible, since enforcing equalization impacts fewer instances. This suggests that observing significant reductions in ΔDP with negligible performance deterioration is not per se sufficient to conclude that a mitigation strategy has no effect on performance —as expressed by numerous studies in the field of fairness— since the protected class often happens to be a minority. Observing the difference in accuracy (ΔAcc), on the other hand, is a much more robust indicator, because it is unaffected by the different sizes of the protected subgroups.

On the effect of blindness: excluding sensitive attributes when training machine learning models —as suggested by the Formal EO framework— has a negative impact on performance only when 𝑌 is directly influenced by 𝐴. When the factors that determine 𝑌 contain historical bias, this technique has no benefit in terms of demographic parity. It is worth noting that, in the presence of measurement bias in scenarios I, II, and III, the effects on ΔDP are considerably worse than when 𝐴 is included in the model. This occurs because the machine learning model actually needs 𝐴 to remove the excess dependence induced by the inclusion of the proxy.

On the performance-bias trade-off in moral frameworks: one crucial aspect to keep in mind when developing a decision system is that what is and what is not a bias depends on the moral framework and worldview chosen in the first place. In this regard, the choice of one modeling strategy over another (e.g. using BRF instead of ERF) is a response to the chosen worldview. Thus, given a specific moral view, there are circumstances where it is justifiable to prefer performance over group equality and vice versa. Notice that the choice of a framework often needs to be complemented with additional (crucial) decisions: e.g. choosing Substantive EO is not per se enough to establish whether some phenomenon underlying the data is or is not a bias, since one needs to agree on which characteristics are to be considered “individual effort” (as opposed to “circumstance”), and thus legitimately employed as the rationale for decisions.

On the interpretation of results: in this work, we simulate the presence of one or more biases and observe the corresponding effects on the data. In real-world scenarios, data are typically observed without knowing the exact bias-generation mechanism underlying them. It is the assumption on the data generation mechanism that shapes the interpretation of the biases present in the context under consideration. This concept becomes more evident when the target variable in the data serves as a proxy for the true target.
In our experiments with measurement bias on 𝑌, we trained the models with the proxy and evaluated the results with the real variable. This approach, even if unlikely in real-world situations, demonstrates that reducing bias can also lead to improved performance. This set of experiments shows that, under a specific worldview, mitigation strategies that result in lower performance on the target variable in the observed space may lead to higher performance on the “true” target.

On the absence of bias: when operating in an ideal bias-free environment, i.e. when any possible relevant characteristic is independent of sensitive features, all machine learning strategies provide the same performance and fairness results. Furthermore, in this ideal situation, all worldviews would agree, even if they are mutually incompatible in the general case. Regardless of how trivial this conclusion may be, it is essential to mention it in order to promote any legal, social, or strategic initiative with the goal of eliminating societal biases.

7. Conclusion

In this work we have investigated how different types of bias impact fairness and performance metrics of machine learning models, by introducing a unified framework for generating synthetic data. Although the test scenarios used to draw our conclusions are synthetic, we have discussed moral frameworks and worldviews that can be applied in real-world situations. We release the presented model framework as open source at github.com/rcrupiISP/ISParity in order to encourage further research into the challenge of bias in data. For example, one crucial aspect to be further analyzed is the sensitivity of the results to the magnitude of the parameters; another is to control the effect of random fluctuations by averaging the results of many simulations for each choice of the parameter set.

References

[1] J. Angwin, J. Larson, S. Mattu, L. Kirchner, Machine bias: There’s software used across the country to predict future criminals, and it’s biased against blacks, ProPublica (2016).
[2] E. Ntoutsi, P. Fafalios, U. Gadiraju, V. Iosifidis, W. Nejdl, M.-E. Vidal, S. Ruggieri, F. Turini, S. Papadopoulos, E. Krasanakis, et al., Bias in data-driven artificial intelligence systems—an introductory survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2020) e1356.
[3] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Computing Surveys (CSUR) 54 (2021) 1–35.
[4] A. Castelnovo, L. Malandri, F. Mercorio, M. Mezzanzanica, A. Cosentini, Towards fairness through time, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2021, pp. 647–663.
[5] A. Castelnovo, R. Crupi, G. Del Gamba, G. Greco, A. Naseer, D. Regoli, B. S. M. Gonzalez, BeFair: Addressing fairness in the banking sector, in: 2020 IEEE International Conference on Big Data (Big Data), IEEE, 2020, pp. 3652–3661.
[6] S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, The (im)possibility of fairness: Different value systems require different mechanisms for fair decision making, Communications of the ACM 64 (2021) 136–143.
[7] M. A. Madaio, L. Stark, J. Wortman Vaughan, H. Wallach, Co-designing checklists to understand organizational challenges and opportunities around fairness in AI, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–14.
[8] F. Kamiran, T. Calders, Classifying without discriminating, in: 2009 2nd International Conference on Computer, Control and Communication, IEEE, 2009, pp. 1–6.
[9] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, H. Wallach, A reductions approach to fair classification, in: International Conference on Machine Learning, PMLR, 2018, pp. 60–69.
[10] M. Hardt, E. Price, N. Srebro, Equality of opportunity in supervised learning, in: Advances in Neural Information Processing Systems, 2016, pp. 3315–3323.
[11] P. K. Lohia, K. N. Ramamurthy, M. Bhide, D. Saha, K. R. Varshney, R. Puri, Bias mitigation post-processing for individual and group fairness, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 2847–2851.
[12] T. Le Quy, A. Roy, V. Iosifidis, W. Zhang, E. Ntoutsi, A survey on datasets for fairness-aware machine learning, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2022) e1452.
[13] T. E. Raghunathan, Synthetic data, Annual Review of Statistics and Its Application 8 (2021) 129–140.
[14] C. Hertweck, C. Heitz, M. Loi, On the moral justification of statistical parity, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 747–757.
[15] H. Heidari, M. Loi, K. P. Gummadi, A. Krause, A moral framework for understanding fair ML through economic models of equality of opportunity, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019, pp. 181–190.
[16] S. Bird, M. Dudík, R. Edgar, B. Horn, R. Lutz, V. Milan, M. Sameki, H. Wallach, K. Walker, Fairlearn: A toolkit for assessing and improving fairness in AI, Technical Report MSR-TR-2020-32, Microsoft, 2020. URL: https://www.microsoft.com/en-us/research/publication/fairlearn-a-toolkit-for-assessing-and-improving-fairness-in-ai/.
[17] S. A. Assefa, D. Dervovic, M. Mahfouz, R. E. Tillman, P. Reddy, M. Veloso, Generating synthetic data in finance: opportunities, challenges and pitfalls, in: Proceedings of the First ACM International Conference on AI in Finance, 2020, pp. 1–8.
[18] A. Majeed, S. Lee, Anonymization techniques for privacy preserving data publishing: A comprehensive survey, IEEE Access 9 (2020) 8512–8545.
[19] H. Surendra, H. Mohan, A review of synthetic data generation methods for privacy preserving data publishing, International Journal of Scientific & Technology Research 6 (2017) 95–101.
[20] C. C. Aggarwal, S. Y. Philip, Privacy-preserving data mining: models and algorithms, Springer Science & Business Media, 2008.
[21] A. D’Amour, H. Srinivasan, J. Atwood, P. Baljekar, D. Sculley, Y. Halpern, Fairness is not static: deeper understanding of long term fairness via simulation studies, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 525–534.
[22] W.-Y. Loh, L. Cao, P. Zhou, Subgroup identification for precision medicine: A comparative review of 13 methods, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (2019) e1326.
[23] C. Reddy, D. Sharma, S. Mehri, A. Romero-Soriano, S. Shabanian, S. Honari, Benchmarking bias mitigation algorithms in representation learning through fairness metrics, in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
[24] A. Castelnovo, R. Crupi, G. Greco, D. Regoli, I. G. Penco, A. C. Cosentini, A clarification of the nuances in the fairness metrics landscape, Scientific Reports 12 (2022) 1–21.
[25] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. Zemel, Fairness through awareness, in: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 2012, pp. 214–226.
[26] M. Fleurbaey, Fairness, responsibility, and welfare, OUP Oxford, 2008.
[27] J. E. Roemer, A. Trannoy, Equality of opportunity, in: Handbook of Income Distribution, volume 2, Elsevier, 2015, pp. 217–300.
[28] J. Rawls, A theory of justice, in: Ethics, Routledge, 2004, pp. 229–234.
[29] J. Pearl, Causality, Cambridge University Press, 2009.
[30] J. Pearl, D. Mackenzie, The book of why: the new science of cause and effect, Basic Books, 2018.
[31] J. Peters, J. M. Mooij, D. Janzing, B. Schölkopf, Causal discovery with continuous additive noise models, Journal of Machine Learning Research 15 (2014).
[32] G. Biau, E. Scornet, A random forest guided tour, Test 25 (2016) 197–227.
[33] S. Barocas, M. Hardt, A. Narayanan, Fairness and Machine Learning, fairmlbook.org, 2019. http://www.fairmlbook.org.