Combining Fairness and Causal Graphs to Advance Both Lea Cohausz1,* , Jakob Kappenberger1 and Heiner Stuckenschmidt1 1 University of Mannheim, Germany Abstract Recent work on fairness in Machine Learning (ML) demonstrated that it is important to know the causal relationships among variables to decide whether a sensitive variable may have a problematic influence on the prediction and what fairness metric and potential bias mitigation strategy to use. These causal relationships can best be represented by Directed Acyclic Graphs (DAGs). However, so far, there is no clear classification of different causal structures containing sensitive variables in these DAGs. This paper’s first contribution is classifying the structures into four classes, each with different implications for fairness. However, we first need to learn the DAGs to uncover these structures. Structure learning algorithms exist but currently do not make systematic use of the background knowledge we have when considering fairness in ML, although the background knowledge could increase the correctness of the DAGs. Therefore, the second contribution is an adaptation of the structure learning methods. This is evaluated in the paper, demonstrating that the adaptation increases correctness. The two contributions of this paper are implemented in our publicly available Python package causalfair, allowing everyone to evaluate which relationships in the data might become problematic when applying ML. Keywords Fairness, Causal Models, Algorithmic Bias, Bayesian Network Structure Learning 1. Introduction The importance of fairness in AI and, more specifically, Machine Learning (ML) has been recognized in recent years, in particular in areas directly concerning humans, such as education, finance, or health care [1, 2, 3]. One way to discuss whether an ML system can be considered fair is to look at the outcome of the ML model [4]. Then, fairness is usually evaluated by looking at metrics of algorithmic bias1 , such as Demographic Parity (DP) [4]. These encode different notions of fairness, and once a metric has been decided on, it can indicate whether a model is fair according to this notion. However, apart from the normative2 , overarching question of which metric is most fair, the choice of metric, its interpretation, and how we should deal with potential fairness concerns is context-dependent. This aspect was also noted in previous works, remarking that taking a causal view allows us to account for much of the context-dependency and, hence, to properly assess whether an AI system’s outcome should be considered fair [5, 6, 7]. AEQUITAS 2024: Workshop on Fairness and Bias in AI | co-located with ECAI 2024, Santiago de Compostela, Spain * Corresponding author. $ lea.cohausz@uni-mannheim.de (L. Cohausz); jakob.kappenberger@uni-mannheim.de (J. Kappenberger); heiner.stuckenschmidt@uni-mannheim.de (H. Stuckenschmidt) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 1 Although frequently coined fairness metrics, we stick to the term metrics of algorithmic bias to indicate that fairness is a multi-dimensional concept. 2 That means that the decision depends on a person’s individual beliefs. CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings In particular, Chiappa et al. argued for using Causal Bayesian Networks (CBNs) to understand how variables influence each other and what the data-generating mechanism looks like [5]. This procedure can help determine which metrics of algorithmic bias to choose, how to interpret them, and how to deal with potential fairness concerns. However, so far, no clear classification of specific causal structures and their indications for fairness exists that also considers that the data is used in an ML context. What follows is that there is no implementation that can automatically detect different kinds of structures, which allows researchers and practitioners to check what parts of their data they intend to learn a model on could be problematic. In addition, existing work assumes that the CBNs are already constructed. However, constructing CBNs is non-trivial as we typically do not know all relationships existing in the data (i.e., cannot simply use expert knowledge), and data-driven causal structure learning methods are known to be error-prone in more complicated settings [8, 9]. In many fairness settings, however, we automatically have some background knowledge, i.e., we know which variables in the data are sensitive, and this has certain implications for learning the causal structure, as we will see. Hence, the contributions of the paper are twofold. • We create a classification of different causal structures and explain their implications for fairness assessment. • We adapt data-driven causal structure learning algorithms to include background knowl- edge we have in fairness settings. We also evaluate whether the adaptation increases the accuracy of the CBNs on synthetic data for which we know the ground truth.3 We implemented both contributions in our publicly available Python package causalfair that researchers and practitioners can use to assess whether and how the data used for Machine Learning (ML) is problematic. The package can be found online (https://github.com/lea-cohausz/ causalfair).4 2. Background Before discussing these aspects, we will briefly detail what CBNs are. The graphical part of CBNs consists of Directed Acyclic Graphs (DAGs). A DAG is a graph with nodes (also called vertices) 𝒳 that, in the case of a Bayesian Network, encode random variables and directed edges ℰ connecting the vertices [10]. An edge from one node to another, i.e., 𝑥𝑖 → 𝑥𝑗 , means that the first node causally influences the second node. A path in a DAG encompasses a sequence of directed edges, i.e., 𝑥𝑖 → 𝑥𝑗 → ... → 𝑥𝑡 . Furthermore, it holds for a CBN that a variable 𝑥𝑖 is only dependent on its parents and independent of all other variables given its parents, i.e.: ∏︁ 𝑃 (𝑥𝑖 ) = 𝑃 (𝑥𝑖 |𝑃 𝑎(𝑥𝑖 )) (1) where 𝑃 𝑎(𝑥𝑖 ) are the parents of 𝑥𝑖 . Therefore, CBNs encode independence information. In DAG (6) in Figure 1, 𝐴 and 𝑌 are conditionally independent given 𝑋, which we write as 𝐴 ⊥ 𝑌 |𝑋. This is equivalent to saying that all information relevant for 𝑌 is encoded in 𝑋, and we do 3 In addition, our online repository contains an example of a real-life dataset. 4 Our experiments are also available here https://github.com/lea-cohausz/Causalfair_Experiments. not need to know 𝐴 to learn something about 𝑌 [10]. Note, however, that 𝐴 and 𝑌 are only conditionally independent, which means that the variables are correlated. An imperfect ML model may use this correlation, even though all information necessary is encoded in 𝑋. 2.1. Sensitive Variables When assessing whether an ML model is fair or not concerning the model’s outcome, we usually use metrics of algorithmic bias [4]. All of these metrics have in common that they monitor differences in the model’s outcome with regard to sensitive variables. These sensitive variables are usually demographic variables [4, 11]. Demographic variables are, among others, variables such as gender, age, socio-economic status, and variables pertaining to this information. Another definition is that demographic features are features that cannot be changed within the context of the setting [11]. For example, if we have a model in the educational setting, all those variables should be considered demographic and potentially sensitive, which cannot be changed within the educational setting (i.e., gender is not changed by education, but educational attainment itself is). The different fairness metrics require different absences of statistical relationships between a sensitive feature and the prediction of the target variable. Because DAGs encode independence relationships and information on which variables in- fluence each other, they are well suited for uncovering potential fairness problems in the data. Chiappa and Isaac showed that by looking at DAGs representing the data-generating mechanism, we could determine whether sensitive variables causally influence the target variable or not [5]. Based on this, we can then make an informed decision about whether this is actually prob- lematic and which fairness metric can be used [5]. However, no clear classification of different structures was made, and existing work mostly focused on whether a structure is potentially problematic according to the causal structure [5, 7]. However, when we have the ultimate goal of using the data to build ML models, more considerations apply [6]. Most importantly, ML models frequently use correlated but causally unconnected variables, even if the information in this variable is also entailed in another causally connected variable [12]. To the best of our knowledge, we are the first to provide a clear classification of the structures in which sensitive variables are involved. 3. Different Types of Structures Figure 1 shows different structures, including a sensitive and a target variable we identified as potentially existing within a DAG. These structures (i.e., the different ways in which sensitive features and the target can be part of a larger DAG) can further be classified with respect to whether and how the sensitive variable involved in the structure is problematic. We have identified four classes of structures regarding this. We want to highlight again that the final decision of whether a sensitive variable has a problematic influence is still up to the expert.5 In general, we speak of a problem for ML if it is likely that an ML model will use the sensitive variable or variables heavily dependent on observed or unobserved sensitive variables (proxies) 5 We recommend Cohausz et al. to receive an idea of when we might deem relationships problematic and how this relates to selecting relevant fairness metrics [7]. (1) (2a) (2b) (3a) (3b) A A A1 A2 A M A Y Y X X X Y X Y Y (4) (5) (6) A X2 A X Y X1 Y X A Y Figure 1: Different causal structures. The numbers correspond to the numbers in the table. for its prediction. We will now briefly mention these classes (i.e., the different ways in which sensitive variables have or do not have a potential impact on the target and, thus, fairness) before delving into the different structures within these classes. In the following, we use 𝐴 to refer to a sensitive variable and 𝑌 to refer to the target; other letters refer to other predictive variables. • Potentially problematic structures that are problematic for ML (structures 1, 2a, 2b, 5). This class is characterized by a direct or indirect connection or both but with the same direction of effects. To deal with the fairness problem, we need to remove 𝐴 and potentially mitigate the effect of 𝐴 on 𝑋 if the relationships are deemed problematic. In the following, we call these problematic variables. • Unproblematic structures that are unproblematic for ML (structure 4). This class is characterized by no undirected path between 𝐴 and 𝑌 . In the following, we call these unproblematic variables. • Unproblematic structures but potentially problematic from an ML perspective (structure 3a). This class occurs if all directed paths from 𝐴 to 𝑌 are blocked. In the following, we call these blocked variables. • Potentially problematic structures where removing the sensitive variable (structure 3b) is problematic. This class occurs if there is a direct and indirect connection from 𝐴 to 𝑌 and the effects are opposing. In the following, we call these opposing effects variables. causalfair returns both the exact structures a sensitive variable is involved in as well as the classification of these structures. We want to highlight that the different structures within the same class still have to be viewed and handled differently [7]. We will now discuss the different classes in slightly greater detail. Table 1 summarizes the structures. Problematic Variables: As already mentioned, structures (1), (2a), (2b), and (5) are prob- lematic. They all have in common that there is at least one directed path from the sensitive variable to the target. Information about the target is encoded in the sensitive variables, which means that an ML model might use the correlation and, thus, place direct importance on the sensitive variable – which is potentially problematic. In the indirect case, as all information is also encoded in the mediating variable, the model may or may not place importance on it. Still, information from this variable will be passed on through the mediating variable. If we do not think that information from the sensitive variable should be used, then we should remove the variable and, in the indirect case, mitigate the effect of the variable while monitoring fairness metrics. We may also decide that only the direct effect should not be used (e.g., in (5)). Example: As an example, ethnicity may influence students’ grades in a specific course due to discrimination. In this case, the sensitive variables influence the target in such a way that the target is also biased. Unproblematic Variables: Structure (4) stands for all networks that are fragmented without a connection between the subnetwork containing the sensitive variable and the subnetwork containing the target variable. In these cases, we do not have to be worried much. No information about 𝑌 is encoded in 𝐴. Still, as imperfect ML models (in particular Neural Networks) tend to assign some importance even to irrelevant features [13], it is probably best to remove both 𝐴 and 𝑋1. Then, nothing needs to be monitored or mitigated. Example: Gender may influence height, but neither of those variables is relevant to whether students pass a course. Hence, there is no statistical relationship between any of the variables of the different fragments at all. Blocked Variables: For (3a), similar to (4), there is no path of directed edges that leads from the demographic variable to the target. In difference to (4), here, there is no lack of connection. Instead, 𝑀 blocks the paths, meaning no information from the sensitive variable is transported to the target. From a network perspective, the structure would be unproblematic, but from an ML perspective, it might not be. 𝑋, which is influenced by 𝐴, is correlated with 𝑌 . Although all information regarding the target is contained in 𝑀 and 𝑋 ⊥ 𝑌 |𝑀 , an ML model may still use and assign importance to 𝑋 due to the correlation. If the model uses 𝑋, it will likely also use 𝐴 to correct for the bias in 𝑋. This consequence was also observed by Ashurst et al. [6]. Therefore, we have two options for handling this. We can remove both 𝐴 and 𝑋, but if we remove 𝐴, we must also remove 𝑋. Otherwise, we would introduce a bias in our predictions that is not reflected by the real and unbiased target variable. Alternatively, we leave both in the data and closely monitor all metrics for algorithmic bias. Example: The students’ motivation influences both passing course X and passing course Y (the target). In course X, the professor discriminates against one gender. In this case, all information relevant to the target is encoded in the motivation, and X and gender are statistically independent of the target. However, course X is not independent of the motivation, which is a relevant variable for the target. This relationship may lead to an ML model that places weight on course X and, consequently, gender. Opposing Effects Variables: Although (3b) looks like (5) and, thus, should be classified as problematic, this becomes a very different case when the direct and indirect effects are opposing. This is the case if we have a missing variable. For example, if 𝑀 in (3a) is not observed, then the DAG learned from data will be (3b). If we do not know 𝑀 , then it will appear like 𝑋 and 𝐴 influence 𝑌 , and 𝐴 also influences 𝑋. The influence 𝐴 has on 𝑋 corrects itself through the connection 𝐴 → 𝑌 , meaning the target is unbiased. From a causal structure perspective, such a structure is clearly problematic. However, from an ML perspective, we again have the two options we had for (3a). We either leave 𝑋 and 𝐴 in, as the opposing effects of 𝐴 on 𝑋 and 𝑌 , Table 1 This table provides a summary of whether the structures in the corresponding figure are problematic from a causal structure and ML point of view, and what strategy to use to mitigate algorithmic bias. structure (1) (2a) (2b) (3a) (3b) (4) (5) (6) problematic according to yes yes yes no yes no yes no causal structure problematic for yes yes yes maybe maybe no yes yes ML remove A, remove As, remove A, none, mitigate mitigate remove A & X remove A & X mitigate strategy remove A or remove remove A influence influence or none or none influence X1 and A of A on X of As on X of A on X respectively, may cancel each other out. Or we must remove both. Then, however, we may lose a lot of predictive power. Although (5) and (3b) look identical from a network perspective, the implications are very different. Hence, for causalfair, we check when such a structure exists whether the effects of 𝐴 on 𝑋 and of 𝐴 on 𝑌 point in opposite directions (demonstrated in the graph with the two colors). If this is not the case, then it is (5). Otherwise, causalfair informs the user of the structure. It is important to know that the resulting graph learned from data is not strictly speaking a causal graph as a relevant variable is missing. Example: If the variable motivation, as described in the example for the blocked variables, is unobserved, the graph (3b) would likely be learned by a structure learning algorithm. Finally, structure (6) has been considered in the literature before, but we argue that we usually do not have to think about this case because sensitive variables are – at least according to the definition explained above – usually not changeable by other variables.6 4. Causal Structure Learning Detecting the above-described structures relies on accurate DAGs. These DAGs first need to be constructed. There are several ways to do so: 1. Expert knowledge. While we can construct the DAG using background knowledge [14], we usually do not know about causal relationships, or our ideas might not match the data. Still, expert knowledge is important: We often know about the temporal ordering of variables and, therefore, know that certain relationships cannot exist (e.g., grades cannot influence ethnicity). 2. Data-driven methods. Research on causal structure learning has produced several methods to learn CBNs from data [15]. If certain assumptions hold and data is sufficient, these methods work rather well [8]. In more realistic cases, however, the methods cannot reliably produce accurate DAGs [9]. 3. Combining expert knowledge and data-driven methods. We may know that some relationships in the data are impossible or must exist, but we do not know about all relationships. We can feed this knowledge to the structure learning algorithms. Although 6 If, however, this happens to be false in a specific setting, then we should remove 𝐴 to avoid that an ML model uses the correlation. combining both methods seems to lead to better results, doing so has been researched comparatively poorly, and some data-driven methods do not even allow the incorporation of background knowledge [16]. In part, this lack of research is because there is no general procedure for it, and it greatly depends on the data, knowledge, and general situation. We argue, however, that when constructing a graph to assess fairness, we can use a standard procedure to combine background knowledge and data-driven methods. The reason for this is the background information we automatically have when considering fairness. 4.1. Background Information As mentioned in section 2, it follows from a definition of sensitive features that non-sensitive variables cannot influence them [11]. Additionally, the target variable usually follows all other variables temporally. For example, if we try to predict admission to a university, all information that can be used has existed longer than the admission decision. Therefore, we can separate the variables into three groups: target variables (which cannot influence any other variables), sensitive variables (which cannot be influenced by any other variables), and regular predictive variables, for which it logically follows that they cannot influence sensitive variables or be influenced by the target. There may also be situations where sensitive variables can be influenced by other sensitive variables or where we know there is an order within the other predictive variables. However, we generally have at least three tiers: the target, other predictive features, and sensitive variables. With the specification of these tiers, we already have a lot of background knowledge: we can require that the data-driven structure learning methods do not include any edges that are impossible according to this specification. Using this background knowledge is particularly helpful, as it is also the knowledge we need to evaluate algorithmic bias, anyhow: knowing which variables are sensitive and what the target is. Additional knowledge we have about the structures can also be specified.7 When we now want to learn DAGs from data, we first need to choose among the families of data-driven methods. We will evaluate one method from each of the three most popular families: constraint-based methods, score-based methods, or methods from functional causal modeling and discuss how the background knowledge can be used [15].8 4.2. Constraint-Based Structure Learning Constraint-based structure learning consists of two stages. During the first stage, edges are removed iteratively from an initially complete undirected graph by performing independence tests [15]. Edges can be removed when two variables are (conditionally) independent of each other. Whenever an edge is removed, the variables that make these variables conditionally independent are stored. For example, if 𝐴 and 𝐵 are independent given C, i.e., 𝐴 ⊥ 𝐵|𝐶, then 𝐶 is stored. During the second stage, as many edges as possible are oriented. To do this, we look at groups of three variables 𝐴, 𝐵, 𝐶, and their separating sets. If we have two variables 7 That is, whether certain variables cannot have ingoing edges or cannot be influenced by certain other variables. 8 Hybrid methods connecting constraint-based and score-based structure learning also exist [15]. In practice, hybrid methods have been proven to work less well than the mentioned individual methods [9, 8]. Hence, we will not consider them here. 𝐴, 𝐵 that are conditionally independent and both are dependent on the same third variable 𝐶 and their separating set does not include 𝐶, i.e., 𝐶 ∈ / 𝑆𝐴𝐵 , then we have that 𝐴 → 𝐶 and 𝐵 → 𝐶. 𝐶 is a so-called collider, and 𝐴, 𝐵, 𝐶 form a v-structure. After all v-structures are identified and the corresponding edges are oriented, other edges are oriented to avoid new v-structures. This concludes the second stage. It has to be noted that not all edges are usually oriented, as only those edges that are part of a v-structure or directly avoid a v-structure can be oriented. Therefore, constraint-based methods do not return a DAG but a Complete Partial DAG (CPDAG). Constraint-based methods are guaranteed to return the correct CPDAG if the independence tests return correct results [15, 10]. Constraint-based algorithms are known to miss more edges than other methods but also insert fewer incorrect edges [9, 8]. We use the PC-Stable (abbreviated in this paper as PC) algorithm, which has been found to work well [15]. Adaptation: Including background information in constraint-based methods is not straight- forward as the first stage cannot really be modified, and no implementation so far allows a user to specify background information [16]. Our approach is to use the background information at the end of the second stage: If we have an undirected edge and our background knowledge does not allow one direction, then the edge is oriented accordingly. Afterward, further edges are oriented to avoid v-structures again. Compared to the adaptations of the other methods, this method makes comparatively little use of the background information. It is also not guaranteed that relationships that go against our background knowledge do not exist because the edge may already have been oriented during the v-structure orientation. However, if the CPDAG is correct until we inject the background knowledge, the resulting graph will also be correct. 4.3. Score-Based Structure Learning In score-based structure learning, we aim to find a DAG that maximizes a score [15]. Hence, the search space of possible graphs must be searched, and the possible graphs must be compared with a score (e.g., an information-theoretic score). Searching the space of possible graphs is usually (though not always) done with a heuristic approach. Despite its simplicity, one algorithm frequently used directly or in some variants is the Hill-Climber (HC) [9, 8]. HC starts with an empty graph and iteratively adds or deletes those edges that lead to the highest increase in the chosen score until the score no longer improves. A DAG that is at least a local maxima is returned, but reaching the global maxima is not guaranteed [15]. Adaptation: Adapting score-based methods to handle the background information is easier, as we can restrict the search space, i.e., edges that are impossible according to our classification will never be added [16].9 Constaninou et al. recently experimented with different kinds of background knowledge and their effect on the accuracy of DAGs but found that restricting edges only has a small effect [16]. However, we limit the search space more fundamentally. 4.4. Functional Causal Models The key idea of Functional Causal Modeling is that variables can be determined by a function of their parent variables and a noise term that is independent of their parents. If the function is 9 Similarly, we could also add information that a certain edge needs to exist – then the edge is directly added and can never be removed. Figure 2: The smallest DAG we created to evaluate against. The red paths are paths where problematic information is transported to the target 𝑦. correctly identified, it is the case that the noise term is only independent of the parent variables for one direction and not for the other. Hence, algorithms belonging to this family search for such relationships between variables. It should be noted that this method is usually used for continuous data, although it can also be used for discrete data. One of the most prominent algorithms belonging to this family is the Linear Non-Gaussian Acyclic Model (LiNGAM) [17]. Adaptation: Similar to HC, we can include background knowledge by preventing LiNGAM from considering certain relationships. Those are then not even attempted to be modeled. 5. Evaluation We will now evaluate whether our adaptations increase the correctness with which DAGs are learned. For this, we will look at several ground truth DAGs from which we sample data. Then, we will attempt to reconstruct the DAGs. We will vary the data size and the background information available. Additionally, we will check whether the sensitive variables are correctly classified according to the different classes we defined in section 3. We have the following research questions: RQ1: Does using background information improve the correctness of the models? RQ2: Is background information particularly helpful for specific data sizes or methods (PC- Stable, HC, LiNGAM)? RQ3: Does the classification accuracy of demographic variables according to section 3 increase with more background information correspondingly? 5.1. Strategy In order to evaluate the research questions, we need to know the ground truth DAGs. For this, we selected five Bayesian Networks (BN) from the “bnlearn” library that are frequently used to evaluate structure learning algorithms: asia, earthquake, sachs, alarm, and insurance [18]. For each of the DAGs, we selected some root nodes to represent the sensitive variables. We also selected one of the leaf nodes as the target. Because sampling from these BNs produces discrete data, but we also want to test with continuous data, we created three additional synthetic DAGs of different sizes for which we sampled continuous data. An example of the smallest continuous network we created can be seen in Figure 2. We specified non-linear relationships for two of the networks (II, III). A summary of the ground truth DAGs can be seen in Table 2. For each of the networks, we extracted which variables belong to each of the four classes of sensitive variables. For Figure 2, we have that 𝑎3 and 𝑎4 belong to the class that is problematic regarding both the causal structure and from the ML perspective (the paths they are involved in are highlighted in red). For 𝑎1, we have that it is neither, as it has no connection to 𝑦. 𝑎2 is not a problematic variable, as 𝑥2 blocks it, but it might be problematic when using ML. In this DAG, there is no opposing effects variable setting. Having gathered the ground truth information, we can now run the experiments. For each DAG, we vary the number of data instances used (500, 1000, and 10000). We also vary whether we have information available (Info) or not (No Info). For each configuration, we sample the data 30 times to receive reliable results. In detail, we proceed like this: Algorithm 1 The algorithm shows the setup of the experiments for each DAG. 1: for sample size in {500, 1000, 10000} do 2: for experiment in range(0,30) do 3: sample data 4: for method in {PC, HC, LiNGAM} do 5: for information in {No Info, Info} do 6: learn DAG 7: compare to ground truth We compare the correctness of the graph by a) computing how many of the true edges are present in the computed DAG10 (true positives) and b) how many edges in the computed DAG are incorrect (false positives).11 To make the values comparable across DAGs, we normalize them by dividing them by the number of actually existing edges in the ground-truth DAG. This procedure means that the range for the incorrect edges is theoretically [0, ∞], as, of course, more incorrect edges can be inserted than correct edges exist. Still, this normalizes the value with regard to the size of the DAG, and in practice, the value is never larger than 1. For the true positives, the value is bounded to 1. Likewise, for RQ3, we look at the accuracy with which variables are classified into the four classes. 5.2. Results Figure 3 (a) shows the results relevant for answering RQ1. We can see that providing information increases the correctly found edges compared to having no information available. There are much fewer wrongly placed edges when using information compared to not using information. 10 Note that for PC, we evaluate against the CPDAG and that the ground-truth CPDAG also changes with more information available. Only exact matches (i.e., same orientation or both unoriented) are counted. 11 Note that while these are usual metrics in structure learning research, other metrics are also frequently used, such as, e.g., the Hamming Distance [9, 15]. However, we believe this provides a relatively easy-to-understand view of the results. Table 2 This table provides a summary of the DAGs used in the evaluation. It states the number of nodes and edges, problematic variables, blocked variables, opposite effects variables, and unproblematic variables. The final column shows the average percentage of correct edges found when using the structure learning algorithms across all settings. |Opposing Name |Nodes| |Edges| |Problematic| |Blocked| |Unproblematic| % correct Effects| asia 8 8 2 0 0 0 0.66 earthquake 5 4 2 0 0 0 0.93 sachs 11 17 1 0 0 1 0.44 alarm 37 46 9 0 2 0 0.66 insurance 27 52 1 0 0 0 0.60 Synthetic I 9 7 2 1 0 1 0.89 Synthetic II 10 13 4 0 2 0 0.65 Synthetic III 20 29 3 2 0 1 0.48 That this metric is more affected than the one measuring correct edges is as expected, as the restrictions we define through the background knowledge directly impact this. In general, we can clearly answer RQ1 in the affirmative. Tables 3 and 4 show the results for RQ2. Generally, we can observe in Table 3 that the percentage of correct edges increases with more data, and the percentage of incorrect edges slightly increases with more data. However, it does not appear that background information is more valuable for more or less data available. As shown in Table 4, the conclusion is a bit more mixed for the methods. The percentage of correct edges for PC actually slightly decreases with more information available; the percentage of incorrect edges decreases quite a bit, though. It should be noted, however, that the ground-truth CPDAGs for PC also vary with information, so the numbers are not directly comparable. HC and LiNGAM greatly benefit from the information. In accordance with previous research, we can observe that HC performs best, whereas PC misses a lot of correct edges and LiNGAM places a lot of wrong edges [8]. For RQ2, we can say that background information is generally helpful for all settings but that HC and LiNGAM benefit more from it. Table 3 Table 4 Results by sample size and information. Results by method and information. Sample correct wrong correct wrong Info Method Info Size edges edges edges edges No Info 0.54 0.54 No Info 0.54 0.33 500 Info 0.49 0.33 PC Info 0.55 0.22 No Info 0.54 0.55 No Info 0.58 0.38 1000 Info 0.62 0.35 HC Info 0.73 0.19 No Info 0.62 0.59 No Info 0.52 0.98 10000 Info 0.71 0.41 LiNGAM Info 0.69 0.68 Figure 3 (b) answers RQ3. We can see that most of the problematic and unproblematic variables (a) (b) Figure 3: The average correctness of the learned DAGs (a) and the average percentage of correctly classified variables (b). are found. This positive result is not true for the other two classes. However, looking deeper into the data, HC actually finds roughly 67% of all blocked variables when having information available, whereas PC and LiNGAM do much worse and push down the average. HC also finds more than 75% of all problematic variables when having information available; LiNGAM is performing above average, too. Finding situations that are indicative of opposing effects variables is tough for all methods, although HC still does much better than average (roughly 35% when having information available). Looking deeper into our results, we observed that the variables are often misclassified as problematic. In this way, at least, attention is drawn to potential problems with them. Most importantly, Figure 3 (b) shows that providing information helps classify sensitive variables, confirming RQ3. The difference is not very large, though. In general, using the background information helps with learning more correct DAGs and classifying the sensitive variables. HC and LiNGAM greatly benefit from background informa- tion, and PC-Stable does so less. The above evaluation only serves to show that background information improves the correctness of DAGs. Because of space constraints, we did not add an evaluation of how well the problematic structures are detected (i.e., the paths along which problematic influences might exist). This evaluation will be added in future work. The correct- ness of the DAGs can be further improved by focusing on score-based methods and potentially adding even more background information (e.g., some sensitive variables may be influenced by other sensitive variables). We want to highlight that causalfair allows us to specify more background knowledge, such as whether specific variables must be connected. 6. Conclusion and Limitations 6.1. Limitations As a general limitation, we want to highlight that the learned DAGs may not reflect real causal relationships. Either important predictive variables (e.g., as highlighted by the discussion of structure (3b) in section 3) or sensitive variables that have an effect might simply not be in the data. While this is a general problem of measuring fairness, it is important to stress that our CBNs do not necessarily provide a complete picture of the causal mechanisms producing the target. Moreover, structure learning algorithms have limitations when the data is extremely imbal- anced, contains many missing values, and when the relationship between variables is non-linear and complex. In other words, real data could pose a challenge. Real data has rarely been used to evaluate structure learning algorithms, in general, [8], but doing so is, of course, very important. Thus, future research should focus on a real-life evaluation as well. 6.2. Conclusion and Outlook In this paper, we introduced a classification of sensitive variables into four classes depending on whether and how they are involved in causal structures that could be problematic in the ML context. Additionally, we showed that we can improve the data-driven learning of DAGs by using background knowledge we naturally have in fairness settings. These contributions are implemented in our Python package causalfair. We hope researchers and practitioners use this package to evaluate whether they have problematic relationships in their data before learning ML models. In the future, we plan to add more structure learning methods (particularly score-based) to the package. Furthermore, we believe that future research should focus on performing more targeted bias mitigation that can also handle it if we only consider some but not all paths from a sensitive variable to a target as problematic. Chiappa and Isaac discuss a technique to estimate the path-specific effects of variables [5], and we believe this is a good starting point. Moreover, we believe that more effort should be put into constructing accurate DAGs. We show in this paper that background knowledge helps immensely in learning better DAGs, and we believe that further advancing the learning of DAGs using background knowledge should be a future research endeavor. Acknowledgments Lea Cohausz is funded by the grant “Consequences of Artificial Intelligence for Urban Societies (CAIUS),” by Volkswagen Foundation. References [1] D. Ueda, T. Kakinuma, S. Fujita, K. Kamagata, Y. Fushimi, R. Ito, Y. Matsui, T. Nozaki, T. Nakaura, N. Fujima, et al., Fairness of artificial intelligence in healthcare: review and recommendations, Japanese Journal of Radiology 42 (2024) 3–15. [2] R. S. Baker, A. Hawn, Algorithmic bias in education, International Journal of Artificial Intelligence in Education (2021) 1–41. [3] P. Birzhandi, Y.-S. Cho, Application of fairness to healthcare, organizational justice, and finance: a survey, Expert Systems with Applications 216 (2023) 119465. [4] A. Castelnovo, R. Crupi, G. Greco, D. Regoli, I. G. Penco, A. C. Cosentini, A clarification of the nuances in the fairness metrics landscape, Scientific Reports 12 (2022) 4209. URL: https: //www.nature.com/articles/s41598-022-07939-1. doi:10.1038/s41598-022-07939-1. [5] S. Chiappa, W. S. Isaac, A causal bayesian networks viewpoint on fairness, Privacy and Identity Management. Fairness, Accountability, and Transparency in the Age of Big Data: 13th IFIP WG 9.2, 9.6/11.7, 11.6/SIG 9.2. 2 International Summer School, Vienna, Austria, August 20-24, 2018, Revised Selected Papers 13 (2019) 3–20. [6] C. Ashurst, R. Carey, S. Chiappa, T. Everitt, Why fair labels can yield unfair predictions: Graphical conditions for introduced unfairness, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 9494–9503. [7] L. Cohausz, J. Kappenberger, H. Stuckenschmidt, What fairness metrics can really tell you: A case study in the educational domain (2024). [8] M. Scutari, C. E. Graafland, J. M. Gutiérrez, Who learns better bayesian network structures: Accuracy and speed of structure learning algorithms, International Journal of Approximate Reasoning 115 (2019) 235–253. [9] M. Scanagatta, A. Salmerón, F. Stella, A survey on bayesian network structure learning from data, Progress in Artificial Intelligence 8 (2019) 425–439. [10] J. Pearl, Causality, Cambridge university press, 2009. [11] R. S. Baker, L. Esbenshade, J. Vitale, S. Karumbaiah, et al., Using demographic data as predictor variables: a questionable choice, Journal of Educational Data Mining 15 (2023) 22–52. [12] L. Cohausz, A. Tschalzev, C. Bartelt, H. Stuckenschmidt, Investigating the importance of demographic features for edm-predictions (2023). [13] L. Grinsztajn, E. Oyallon, G. Varoquaux, Why do tree-based models still outperform deep learning on typical tabular data?, Advances in neural information processing systems 35 (2022) 507–520. [14] B. Hicks, K. Kitto, L. Payne, S. Buckingham Shum, Thinking with causal models: A visual formalism for collaboratively crafting assumptions, in: LAK22: 12th International Learning Analytics and Knowledge Conference, 2022, pp. 250–259. [15] N. K. Kitson, A. C. Constantinou, Z. Guo, Y. Liu, K. Chobtham, A survey of bayesian network structure learning, Artificial Intelligence Review (2023) 1–94. [16] A. C. Constantinou, Z. Guo, N. K. Kitson, The impact of prior knowledge on causal structure learning, Knowledge and Information Systems (2023) 1–50. [17] S. Shimizu, Lingam: Non-gaussian methods for estimating causal structures, Behav- iormetrika 41 (2014) 65–98. [18] M. Scutari, M. M. Scutari, H.-P. MMPC, Package ‘bnlearn’, Bayesian network structure learning, parameter learning and inference, R package version 4 (2019).