<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Fairness and Causal Graphs to Advance Both</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lea Cohausz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakob Kappenberger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heiner Stuckenschmidt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Recent work on fairness in Machine Learning (ML) demonstrated that it is important to know the causal relationships among variables to decide whether a sensitive variable may have a problematic influence on the prediction and what fairness metric and potential bias mitigation strategy to use. These causal relationships can best be represented by Directed Acyclic Graphs (DAGs). However, so far, there is no clear classification of different causal structures containing sensitive variables in these DAGs. This paper's first contribution is classifying the structures into four classes, each with different implications for fairness. However, we first need to learn the DAGs to uncover these structures. Structure learning algorithms exist but currently do not make systematic use of the background knowledge we have when considering fairness in ML, although this background knowledge could increase the correctness of the DAGs. Therefore, the second contribution is an adaptation of the structure learning methods. This adaptation is evaluated in the paper, demonstrating that it increases correctness. The two contributions of this paper are implemented in our publicly available Python package causalfair, allowing everyone to evaluate which relationships in the data might become problematic when applying ML.</p>
      </abstract>
      <kwd-group>
<kwd>Fairness</kwd>
        <kwd>Causal Models</kwd>
        <kwd>Algorithmic Bias</kwd>
        <kwd>Bayesian Network Structure Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The importance of fairness in AI and, more specifically, Machine Learning (ML) has been
recognized in recent years, in particular in areas directly concerning humans, such as education,
finance, or health care [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. One way to discuss whether an ML system can be considered
fair is to look at the outcome of the ML model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Then, fairness is usually evaluated by looking
at metrics of algorithmic bias1, such as Demographic Parity (DP) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These encode different
notions of fairness, and once a metric has been decided on, it can indicate whether a model
is fair according to this notion. However, apart from the normative2, overarching question of
which metric is most fair, the choice of metric, its interpretation, and how we should deal with
potential fairness concerns are context-dependent. This aspect was also noted in previous works,
remarking that taking a causal view allows us to account for much of the context-dependency
and, hence, to properly assess whether an AI system’s outcome should be considered fair [5, 6, 7].
      </p>
      <p>In particular, Chiappa and Isaac argued for using Causal Bayesian Networks (CBNs) to understand
how variables influence each other and what the data-generating mechanism looks like [5]. This
procedure can help determine which metrics of algorithmic bias to choose, how to interpret
them, and how to deal with potential fairness concerns.</p>
      <p>However, so far, no clear classification of specific causal structures and their implications
for fairness exists that also considers that the data is used in an ML context. Consequently,
there is no implementation that can automatically detect different kinds of structures and that
allows researchers and practitioners to check which parts of the data they intend to
learn a model on could be problematic. In addition, existing work assumes that the CBNs are
already constructed. However, constructing CBNs is non-trivial, as we typically do not know all
relationships existing in the data (i.e., we cannot simply use expert knowledge), and data-driven
causal structure learning methods are known to be error-prone in more complicated settings
[8, 9]. In many fairness settings, however, we automatically have some background knowledge,
i.e., we know which variables in the data are sensitive, and this has certain implications for
learning the causal structure, as we will see. Hence, the contributions of the paper are twofold.
• We create a classification of different causal structures and explain their implications for
fairness assessment.
• We adapt data-driven causal structure learning algorithms to include the background
knowledge we have in fairness settings. We also evaluate whether the adaptation increases the
accuracy of the CBNs on synthetic data for which we know the ground truth.3
We implemented both contributions in our publicly available Python package causalfair that
researchers and practitioners can use to assess whether and how the data used for ML
is problematic. The package can be found online (https://github.com/lea-cohausz/
causalfair).4</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Before discussing these aspects, we will briefly detail what CBNs are. The graphical part of
CBNs consists of Directed Acyclic Graphs (DAGs). A DAG is a graph with nodes (also called
vertices) 𝒱 that, in the case of a Bayesian Network, encode random variables, and directed edges
ℰ connecting the vertices [10]. An edge from one node to another, i.e., 𝑋 → 𝑌, means that the
first node causally influences the second node. A path in a DAG encompasses a sequence of
directed edges, i.e., 𝑋1 → 𝑋2 → ... → 𝑋𝑛. Furthermore, it holds for a CBN that a variable 𝑋 is
only dependent on its parents and independent of all other variables given its parents, i.e.:
𝑃(𝑋1, ..., 𝑋𝑛) = ∏𝑖 𝑃(𝑋𝑖 | 𝑝𝑎(𝑋𝑖))
(1)
where 𝑝𝑎(𝑋𝑖) are the parents of 𝑋𝑖. Therefore, CBNs encode independence information. In DAG
(6) in Figure 1, 𝐴 and 𝑌 are conditionally independent given 𝑋, which we write as 𝐴 ⊥ 𝑌 | 𝑋.
This is equivalent to saying that all information relevant for 𝑌 is encoded in 𝑋, and we do
not need to know 𝐴 to learn something about 𝑌 [10]. Note, however, that 𝐴 and 𝑌 are only
conditionally independent, which means that the variables are correlated. An imperfect ML
model may use this correlation, even though all information necessary is encoded in 𝑋.
3In addition, our online repository contains an example of a real-life dataset.
4Our experiments are also available here: https://github.com/lea-cohausz/Causalfair_Experiments.</p>
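      <p>To make the conditional-independence statement concrete, the following minimal simulation (a hypothetical binary mechanism; all parameters are invented for illustration) samples from a structure shaped like DAG (6) and compares the marginal and the conditional association between 𝐴 and 𝑌:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Invented data-generating mechanism shaped like DAG (6): X influences both A and Y.
x = rng.binomial(1, 0.5, n)
a = rng.binomial(1, np.where(x == 1, 0.8, 0.2))  # A depends only on X
y = rng.binomial(1, np.where(x == 1, 0.7, 0.1))  # Y depends only on X

# Marginally, A and Y are clearly correlated ...
marginal = abs(np.corrcoef(a, y)[0, 1])

# ... but given X they are (approximately) independent: A is d-separated from Y.
conditional = max(abs(np.corrcoef(a[x == v], y[x == v])[0, 1]) for v in (0, 1))

print(marginal, conditional)  # marginal is large, conditional is near zero
```
      </preformat>
      <p>An imperfect model with access to 𝐴 can exploit exactly this marginal correlation, which is why the conditional view matters.</p>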
      <sec id="sec-2-1">
        <title>2.1. Sensitive Variables</title>
        <p>
          When assessing whether an ML model is fair or not concerning the model’s outcome, we usually
use metrics of algorithmic bias [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. All of these metrics have in common that they monitor
diferences in the model’s outcome with regard to sensitive variables. These sensitive variables
are usually demographic variables [
          <xref ref-type="bibr" rid="ref4">4, 11</xref>
          ]. Demographic variables are, among others, variables
such as gender, age, socio-economic status, and variables pertaining to this information. Another
definition is that demographic features are features that cannot be changed within the context
of the setting [11]. For example, if we have a model in the educational setting, all those variables
should be considered demographic and potentially sensitive, which cannot be changed within
the educational setting (i.e., gender is not changed by education, but educational attainment
itself is). The diferent fairness metrics require diferent absences of statistical relationships
between a sensitive feature and the prediction of the target variable.
        </p>
        <p>Because DAGs encode independence relationships and information on which variables
influence each other, they are well suited for uncovering potential fairness problems in the data.
Chiappa and Isaac showed that by looking at DAGs representing the data-generating mechanism,
we can determine whether sensitive variables causally influence the target variable or not
[5]. Based on this, we can then make an informed decision about whether this is actually
problematic and which fairness metric can be used [5]. However, no clear classification of different
structures was made, and existing work mostly focused on whether a structure is potentially
problematic according to the causal structure [5, 7]. However, when we have the ultimate goal
of using the data to build ML models, more considerations apply [6]. Most importantly, ML
models frequently use correlated but causally unconnected variables, even if the information
in such a variable is also entailed in another, causally connected variable [12]. To the best of our
knowledge, we are the first to provide a clear classification of the structures in which sensitive
variables are involved.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Different Types of Structures</title>
      <p>Figure 1 shows different structures, including a sensitive and a target variable, that we identified as
potentially existing within a DAG. These structures (i.e., the different ways in which sensitive
features and the target can be part of a larger DAG) can further be classified with respect to
whether and how the sensitive variable involved in the structure is problematic. We have
identified four classes of structures regarding this. We want to highlight again that the final
decision of whether a sensitive variable has a problematic influence is still up to the expert.5
In general, we speak of a problem for ML if it is likely that an ML model will use the sensitive
variable or variables heavily dependent on observed or unobserved sensitive variables (proxies)
for its prediction.
5We recommend Cohausz et al. [7] to get an idea of when we might deem relationships problematic and how this
relates to selecting relevant fairness metrics.</p>
      <p>[Figure 1: The structures (1), (2a), (2b), (3a), (3b), (4), (5), and (6), each involving one or more sensitive variables (A, A1, A2), predictive variables (X, X1, X2), a mediator (M), and the target (Y).]</p>
      <p>We will now briefly mention these classes (i.e., the different ways in which
sensitive variables have or do not have a potential impact on the target and, thus, fairness)
before delving into the different structures within these classes. In the following, we use 𝐴 to
refer to a sensitive variable and 𝑌 to refer to the target; other letters refer to other predictive
variables.</p>
      <p>• Potentially problematic structures that are problematic for ML (structures 1, 2a, 2b, 5). This
class is characterized by a direct or an indirect connection (or both) with the same direction
of effects. To deal with the fairness problem, we need to remove 𝐴 and potentially mitigate
the effect of 𝐴 on 𝑋 if the relationships are deemed problematic. In the following, we
call these problematic variables.
• Unproblematic structures that are unproblematic for ML (structure 4). This class is
characterized by no undirected path between 𝐴 and 𝑌. In the following, we call these
unproblematic variables.
• Unproblematic structures that are potentially problematic from an ML perspective (structure
3a). This class occurs if all directed paths from 𝐴 to 𝑌 are blocked. In the following, we
call these blocked variables.
• Potentially problematic structures where removing the sensitive variable (structure 3b) is
problematic. This class occurs if there is a direct and an indirect connection from 𝐴 to 𝑌
and the effects are opposing. In the following, we call these opposing effects variables.
causalfair returns both the exact structures a sensitive variable is involved in as well as the
classification of these structures. We want to highlight that the different structures within the
same class still have to be viewed and handled differently [7]. We will now discuss the different
classes in slightly greater detail. Table 1 summarizes the structures.</p>
      <p>Problematic Variables: As already mentioned, structures (1), (2a), (2b), and (5) are
problematic. They all have in common that there is at least one directed path from the sensitive
variable to the target. Information about the target is encoded in the sensitive variable, which
means that an ML model might use the correlation and, thus, place direct importance on the
sensitive variable – which is potentially problematic. In the indirect case, as all information
is also encoded in the mediating variable, the model may or may not place importance on
the sensitive variable itself. Still, information from this variable will be passed on through the mediating variable. If
we do not think that information from the sensitive variable should be used, then we should
remove the variable and, in the indirect case, mitigate the effect of the variable while monitoring
fairness metrics. We may also decide that only the direct effect should not be used (e.g., in (5)).
Example: Ethnicity may influence students’ grades in a specific course due to
discrimination. In this case, the sensitive variable influences the target in such a way that the
target itself is also biased.</p>
      <p>Unproblematic Variables: Structure (4) stands for all networks that are fragmented, without
a connection between the subnetwork containing the sensitive variable and the subnetwork
containing the target variable. In these cases, we do not have to worry much. No information
about 𝐴 is encoded in 𝑌. Still, as imperfect ML models (in particular Neural Networks) tend to
assign some importance even to irrelevant features [13], it is probably best to remove both 𝐴
and 𝑋1. Then, nothing needs to be monitored or mitigated. Example: Gender may influence
height, but neither of those variables is relevant to whether students pass a course. Hence, there
is no statistical relationship between any of the variables of the different fragments at all.</p>
      <p>Blocked Variables: For (3a), similar to (4), there is no path of directed edges that leads from
the demographic variable to the target. In contrast to (4), however, there is no lack of connection here.
Instead, the paths from the sensitive variable to the target are blocked, meaning no information from the sensitive variable is transported
to the target. From a network perspective, the structure would be unproblematic, but from an
ML perspective, it might not be. 𝑋, which is influenced by 𝐴, is correlated with 𝑌. Although
all information regarding the target is contained in 𝑀 and 𝑋 ⊥ 𝑌 | 𝑀, an ML model may still
use and assign importance to 𝑋 due to the correlation. If the model uses 𝑋, it will likely also
use 𝐴 to correct for the bias in 𝑋. This consequence was also observed by Ashurst et al. [6].
Therefore, we have two options for handling this. We can remove both 𝐴 and 𝑋, but if we
remove 𝐴, we must also remove 𝑋. Otherwise, we would introduce a bias in our predictions
that is not reflected by the real and unbiased target variable. Alternatively, we leave both in the
data and closely monitor all metrics for algorithmic bias. Example: The students’ motivation
influences both passing course X and passing course Y (the target). In course X, the professor
discriminates against one gender. In this case, all information relevant to the target is encoded in
the motivation, and course X and gender are statistically independent of the target given the motivation. However, course X
is not independent of the motivation, which is a relevant variable for the target. This relationship
may lead to an ML model that places weight on course X and, consequently, gender.</p>
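      <p>This mechanism can be illustrated numerically. The following sketch uses an invented linear version of the course example (coefficients and noise levels are arbitrary): even though the target carries no gender term, a regression without access to the motivation places clearly nonzero weight on both the biased course grade and the sensitive variable:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Invented linear version of structure (3a): M influences X and Y, A biases X.
m = rng.normal(size=n)                           # motivation (the blocker)
a = rng.binomial(1, 0.5, n).astype(float)        # sensitive variable, e.g. gender
x = m - 1.0 * a + rng.normal(scale=0.5, size=n)  # course X, biased downward by A
y = m + rng.normal(scale=0.5, size=n)            # target, unbiased (no A term)

# A model that cannot see M regresses Y on A and X: both receive clearly
# nonzero weight, with A compensating for the bias that A induces in X.
features = np.column_stack([np.ones(n), a, x])
coef_a, coef_x = np.linalg.lstsq(features, y, rcond=None)[0][1:]
print(coef_a, coef_x)  # both around 0.8 in this parameterization
```
      </preformat>
      <p>Dropping only 𝐴 from this toy regression would leave the bias in 𝑋 uncorrected, matching the argument above that 𝐴 and 𝑋 must be removed together.</p>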
      <p>Opposing Effects Variables: Although (3b) looks like (5) and, thus, should be classified as
problematic, it becomes a very different case when the direct and indirect effects are opposing.
This is the case if we have a missing variable. For example, if 𝑀 in (3a) is not observed, then
the DAG learned from data will be (3b). If we do not know 𝑀, then it will appear as if 𝐴 and
𝑋 influence 𝑌, and 𝐴 also influences 𝑋. The influence 𝐴 has on 𝑋 corrects itself through the
connection 𝑋 → 𝑌, meaning the target is unbiased. From a causal structure perspective, such
a structure is clearly problematic. However, from an ML perspective, we again have the two
options we had for (3a). We either leave 𝐴 and 𝑋 in, as the opposing effects of 𝐴 on 𝑋 and 𝑌,
respectively, may cancel each other out. Or we must remove both. Then, however, we may lose
a lot of predictive power. Although (5) and (3b) look identical from a network perspective, the
implications are very different. Hence, when such a structure exists, causalfair checks
whether the effects of 𝐴 on 𝑌 and of 𝐴 on 𝑋 point in opposite directions (demonstrated in the
graph with the two colors). If this is not the case, then it is (5). Otherwise, causalfair informs
the user of the structure. It is important to know that the resulting graph learned from data is,
strictly speaking, not a causal graph, as a relevant variable is missing. Example: If the variable
motivation, as described in the example for the blocked variables, is unobserved, graph (3b)
would likely be learned by a structure learning algorithm.</p>
      <p>Finally, structure (6) has been considered in the literature before, but we argue that we usually
do not have to think about this case because sensitive variables are – at least according to the
definition explained above – usually not changeable by other variables.6</p>
    </sec>
    <sec id="sec-4">
      <title>4. Causal Structure Learning</title>
      <p>Detecting the above-described structures relies on accurate DAGs. These DAGs first need to be
constructed. There are several ways to do so:
1. Expert knowledge. While we can construct the DAG using background knowledge [14],
we usually do not know about all causal relationships, or our ideas might not match the
data. Still, expert knowledge is important: We often know about the temporal ordering of
variables and, therefore, know that certain relationships cannot exist (e.g., grades cannot
influence ethnicity).
2. Data-driven methods. Research on causal structure learning has produced several
methods to learn CBNs from data [15]. If certain assumptions hold and data is sufficient,
these methods work rather well [8]. In more realistic cases, however, the methods cannot
reliably produce accurate DAGs [9].
3. Combining expert knowledge and data-driven methods. We may know that some
relationships in the data are impossible or must exist, but we do not know about all
relationships. We can feed this knowledge to the structure learning algorithms. Although
combining both methods seems to lead to better results, doing so has been researched
comparatively little, and some data-driven methods do not even allow the incorporation
of background knowledge [16]. In part, this lack of research is because there is no general
procedure for it, and it greatly depends on the data, the knowledge, and the general situation.
We argue, however, that when constructing a graph to assess fairness, we can use a standard
procedure to combine background knowledge and data-driven methods. The reason for this is
the background information we automatically have when considering fairness.
6If, however, this happens to be false in a specific setting, then we should remove 𝐴 to avoid that an ML model uses
the correlation.</p>
      <sec id="sec-4-1">
        <title>4.1. Background Information</title>
        <p>As mentioned in section 2, it follows from the definition of sensitive features that non-sensitive
variables cannot influence them [11]. Additionally, the target variable usually follows all other
variables temporally. For example, if we try to predict admission to a university, all information
that can be used has existed longer than the admission decision. Therefore, we can separate
the variables into three groups: target variables (which cannot influence any other variables),
sensitive variables (which cannot be influenced by any other variables), and regular predictive
variables, for which it logically follows that they cannot influence sensitive variables or be
influenced by the target. There may also be situations where sensitive variables can be influenced
by other sensitive variables or where we know there is an order within the other predictive
variables. However, we generally have at least three tiers: the target, other predictive features,
and sensitive variables. With the specification of these tiers, we already have a lot of background
knowledge: we can require that the data-driven structure learning methods do not include any
edges that are impossible according to this specification. Using this background knowledge is
particularly helpful, as it is also the knowledge we need to evaluate algorithmic bias anyway:
knowing which variables are sensitive and what the target is. Additional knowledge we have
about the structures can also be specified.7</p>
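        <p>As a sketch, the tier specification can be encoded as a blacklist of impossible edges (the variable names are invented for illustration and do not reflect causalfair's actual API):</p>
        <preformat>
```python
# Sketch: encode the three tiers as a blacklist of impossible edges.
# Nothing may point into a sensitive variable, and nothing may point
# out of the target; the variable names below are invented examples.
def forbidden_edges(sensitive, predictive, target):
    banned = set()
    for s in sensitive:
        banned.update((v, s) for v in predictive + [target])   # no edge into sensitive
    banned.update((target, v) for v in sensitive + predictive) # no edge out of target
    return banned

banned = forbidden_edges(["gender"], ["grade_x", "motivation"], "admission")
print(("grade_x", "gender") in banned)  # True: a grade cannot influence gender
print(("gender", "grade_x") in banned)  # False: this direction stays allowed
```
        </preformat>
        <p>Such a blacklist is what the adaptations in the following subsections pass to the respective learning algorithms.</p>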
        <p>When we now want to learn DAGs from data, we first need to choose among the families
of data-driven methods. We will evaluate one method from each of the three most popular
families – constraint-based methods, score-based methods, and functional causal
modeling – and discuss how the background knowledge can be used [15].8</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Constraint-Based Structure Learning</title>
        <p>Constraint-based structure learning consists of two stages. During the first stage, edges are
removed iteratively from an initially complete undirected graph by performing independence
tests [15]. Edges can be removed when two variables are (conditionally) independent of each
other. Whenever an edge is removed, the variables that make these variables conditionally
independent are stored. For example, if 𝐴 and 𝐵 are independent given 𝐶, i.e., 𝐴 ⊥ 𝐵 | 𝐶, then
𝐶 is stored. During the second stage, as many edges as possible are oriented. To do this, we
look at groups of three variables 𝐴, 𝐵, 𝐶, and their separating sets. If we have two variables
𝐴, 𝐵 that are conditionally independent and both are dependent on the same third variable
𝐶, and their separating set does not include 𝐶, i.e., 𝐶 ∉ 𝒮, then we have that 𝐴 → 𝐶 and
𝐵 → 𝐶. 𝐶 is a so-called collider, and 𝐴, 𝐵, 𝐶 form a v-structure. After all v-structures are
identified and the corresponding edges are oriented, other edges are oriented to avoid new
v-structures. This concludes the second stage. It has to be noted that usually not all edges are
oriented, as only those edges that are part of a v-structure or directly avoid a v-structure can
be oriented. Therefore, constraint-based methods do not return a DAG but a Completed Partially
Directed Acyclic Graph (CPDAG). Constraint-based methods are guaranteed to return the correct CPDAG if the
independence tests return correct results [15, 10]. Constraint-based algorithms are known to
miss more edges than other methods but also insert fewer incorrect edges [9, 8]. We use the
PC-Stable (abbreviated in this paper as PC) algorithm, which has been found to work well [15].
7That is, whether certain variables cannot have ingoing edges or cannot be influenced by certain other variables.
8Hybrid methods connecting constraint-based and score-based structure learning also exist [15]. In practice, hybrid
methods have been shown to work less well than the mentioned individual methods [9, 8]. Hence, we will not
consider them here.</p>
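        <p>The two stages can be illustrated on a three-variable toy example. The sketch below (a drastically simplified PC run, with a hand-coded independence oracle standing in for statistical tests) recovers the v-structure 𝐴 → 𝐶 ← 𝐵 from the stored separating sets:</p>
        <preformat>
```python
from itertools import combinations

# Toy run of the two PC stages; ground truth is the collider A -> C and B -> C.
def indep(x, y, cond):
    # Only A and B are independent, and only when not conditioning on the
    # collider C (conditioning on a collider makes its parents dependent).
    return {x, y} == {"A", "B"} and "C" not in cond

nodes = ["A", "B", "C"]

# Stage 1: start from the complete skeleton, remove edges, store separating sets.
skeleton = {frozenset(p) for p in combinations(nodes, 2)}
sepset = {}
for x, y in combinations(nodes, 2):
    for size in (0, 1):
        for cond in combinations([n for n in nodes if n not in (x, y)], size):
            if indep(x, y, cond):
                skeleton.discard(frozenset((x, y)))
                sepset[frozenset((x, y))] = set(cond)

# Stage 2: for every non-adjacent pair x, y with a common neighbour z, orient
# x -> z and y -> z whenever z is NOT in the separating set of x and y.
directed = set()
for x, y in combinations(nodes, 2):
    if frozenset((x, y)) in skeleton:
        continue
    for z in nodes:
        if z not in (x, y) and frozenset((x, z)) in skeleton and frozenset((y, z)) in skeleton:
            if z not in sepset[frozenset((x, y))]:
                directed.update({(x, z), (y, z)})

print(sorted(directed))  # [('A', 'C'), ('B', 'C')]: the v-structure is recovered
```
        </preformat>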
        <p>Adaptation: Including background information in constraint-based methods is not
straightforward, as the first stage cannot really be modified, and no implementation so far allows a user
to specify background information [16]. Our approach is to use the background information at
the end of the second stage: If we have an undirected edge and our background knowledge does
not allow one direction, then the edge is oriented accordingly. Afterward, further edges are
oriented to avoid v-structures again. Compared to the adaptations of the other methods, this
approach makes comparatively little use of the background information. It is also not guaranteed
that relationships that go against our background knowledge do not exist, because an edge
may already have been oriented during the v-structure orientation. However, if the CPDAG is
correct up to the point where we inject the background knowledge, the resulting graph will also be correct.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Score-Based Structure Learning</title>
        <p>In score-based structure learning, we aim to find a DAG that maximizes a score [15]. Hence, the
search space of possible graphs must be searched, and the possible graphs must be compared
with a score (e.g., an information-theoretic score). Searching the space of possible graphs is
usually (though not always) done with a heuristic approach. Despite its simplicity, one algorithm
frequently used directly or in some variant is the Hill-Climber (HC) [9, 8]. HC starts with an
empty graph and iteratively adds or deletes those edges that lead to the highest increase in
the chosen score until the score no longer improves. A DAG that is at least a local maximum is
returned, but reaching the global maximum is not guaranteed [15].</p>
        <p>Adaptation: Adapting score-based methods to handle the background information is easier,
as we can restrict the search space, i.e., edges that are impossible according to our classification
will never be added [16].9 Constantinou et al. recently experimented with different kinds of
background knowledge and their effect on the accuracy of DAGs but found that restricting
edges only has a small effect [16]. However, we limit the search space more fundamentally.</p>
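        <p>A minimal hill-climbing sketch with such a restricted search space might look as follows (linear-Gaussian toy data and a BIC-style score; an illustration, not the HC implementation used in the paper). Cycle checks are omitted because the blacklist already rules out every cycle in this three-variable example:</p>
        <preformat>
```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
n = 5_000

# Invented linear-Gaussian ground truth A -> X -> Y; A sensitive, Y target.
data = {}
data["A"] = rng.normal(size=n)
data["X"] = 0.8 * data["A"] + rng.normal(size=n)
data["Y"] = 0.9 * data["X"] + rng.normal(size=n)

def bic(child, parents):
    # Gaussian log-likelihood of `child` regressed on `parents`, BIC-penalized.
    y = data[child]
    Z = np.column_stack([np.ones(n)] + [data[p] for p in sorted(parents)])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return -n / 2 * np.log(resid.var()) - Z.shape[1] / 2 * np.log(n)

def score(dag):
    return sum(bic(child, parents) for child, parents in dag.items())

# Background-knowledge blacklist from the tiers: nothing into A, nothing out of Y.
banned = {("X", "A"), ("Y", "A"), ("Y", "X")}

dag = {v: set() for v in data}  # child -> set of parents, start empty
improved = True
while improved:
    improved = False
    for u, v in permutations(data, 2):  # candidate move: toggle edge u -> v
        if (u, v) in banned:
            continue  # the restricted search space never considers these edges
        undo = dag[v].add if u in dag[v] else dag[v].remove
        before = score(dag)
        (dag[v].remove if u in dag[v] else dag[v].add)(u)  # toggle the edge
        if score(dag) - before > 1e-9:
            improved = True
        else:
            undo(u)

print({child: sorted(ps) for child, ps in dag.items()})  # A -> X and X -> Y are recovered
```
        </preformat>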
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Functional Causal Models</title>
        <p>The key idea of Functional Causal Modeling is that variables can be determined by a function
of their parent variables and a noise term that is independent of their parents. If the function is
correctly identified, the noise term is independent of the parent variables only
for one direction and not for the other. Hence, algorithms belonging to this family search for
such relationships between variables. It should be noted that this method is usually used for
continuous data, although it can also be used for discrete data. One of the most prominent
algorithms belonging to this family is the Linear Non-Gaussian Acyclic Model (LiNGAM) [17].
9Similarly, we could also add information that a certain edge needs to exist – then the edge is directly added and can
never be removed.</p>
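        <p>The direction-finding idea can be illustrated with a small sketch (an invented linear model with uniform noise; the squared-correlation dependence score is a crude stand-in for the independence measures real LiNGAM implementations use):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Invented linear model with non-Gaussian (uniform) noise: X causes Y.
x = rng.uniform(-1, 1, n)
y = 0.8 * x + rng.uniform(-1, 1, n)

def residual(target, regressor):
    b = np.cov(target, regressor)[0, 1] / regressor.var()
    return target - b * regressor

def dependence(u, v):
    # Crude higher-order dependence proxy: correlation of the squares.
    # It vanishes for independent variables but generally not otherwise.
    return abs(np.corrcoef(u**2, v**2)[0, 1])

forward = dependence(x, residual(y, x))  # residual is the true noise: independent
reverse = dependence(y, residual(x, y))  # anti-causal residual: still dependent
print(forward, reverse)  # forward is near zero, reverse is clearly larger
```
        </preformat>
        <p>Because the asymmetry vanishes for Gaussian noise, the non-Gaussianity assumption in LiNGAM is essential.</p>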
        <p>Adaptation: Similar to HC, we can include background knowledge by preventing LiNGAM
from considering certain relationships. LiNGAM then does not even attempt to model them.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>We will now evaluate whether our adaptations increase the correctness with which DAGs
are learned. For this, we will look at several ground-truth DAGs from which we sample data.
Then, we will attempt to reconstruct the DAGs. We will vary the data size and the background
information available. Additionally, we will check whether the sensitive variables are correctly
classified according to the different classes we defined in section 3. We have the following
research questions:
RQ1: Does using background information improve the correctness of the models?
RQ2: Is background information particularly helpful for specific data sizes or methods
(PC-Stable, HC, LiNGAM)?
RQ3: Does the classification accuracy of demographic variables according to section 3
correspondingly increase with more background information?</p>
      <sec id="sec-5-1">
        <title>5.1. Strategy</title>
        <p>In order to evaluate the research questions, we need to know the ground-truth DAGs. For this,
we selected five Bayesian Networks (BNs) from the “bnlearn” library that are frequently used to
evaluate structure learning algorithms: asia, earthquake, sachs, alarm, and insurance [18]. For
each of the DAGs, we selected some root nodes to represent the sensitive variables. We also
selected one of the leaf nodes as the target. Because sampling from these BNs produces discrete
data, but we also want to test with continuous data, we created three additional synthetic DAGs
of different sizes for which we sampled continuous data. The smallest continuous
network we created can be seen in Figure 2. We specified non-linear relationships for two of
the networks (II, III). A summary of the ground-truth DAGs can be seen in Table 2.</p>
        <p>For each of the networks, we extracted which variables belong to each of the four classes of
sensitive variables. For Figure 2, 𝐴3 and 𝐴4 belong to the class that is problematic both
regarding the causal structure and from the ML perspective (the paths they are involved
in are highlighted in red). 𝐴1 is neither, as it has no connection to 𝑌. 𝐴2 is
not a problematic variable, as its paths to 𝑌 are blocked, but it might be problematic when using ML. In this
DAG, there is no opposing effects variable setting.</p>
        <p>Having gathered the ground-truth information, we can now run the experiments. For each
DAG, we vary the number of data instances used (500, 1000, and 10000). We also vary whether
we have background information available (Info) or not (No Info). For each configuration, we sample the
data 30 times to obtain reliable results. In detail, we proceed like this:
Algorithm 1 The algorithm shows the setup of the experiments for each DAG.
1: for sample size in {500, 1000, 10000} do
2:   for experiment in range(0, 30) do
3:     sample data
4:     for method in {PC, HC, LiNGAM} do
5:       for information in {No Info, Info} do
6:         learn DAG
7:         compare to ground truth</p>
        <p>We compare the correctness of the graphs by a) computing how many of the true edges are
present in the computed DAG10 (true positives) and b) how many edges in the computed DAG
are incorrect (false positives).11 To make the values comparable across DAGs, we normalize
them by dividing them by the number of actually existing edges in the ground-truth DAG. This
procedure means that the range for the incorrect edges is theoretically [0, ∞), as, of course,
more incorrect edges can be inserted than correct edges exist. Still, this normalizes the value
with regard to the size of the DAG, and in practice, the value is never larger than 1. For the
true positives, the value is bounded by 1. Likewise, for RQ3, we look at the accuracy with which
variables are classified into the four classes.</p>
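        <p>The two normalized quantities can be sketched as follows (function and variable names are invented for illustration):</p>
        <preformat>
```python
# Sketch of the normalized edge metrics: both counts are divided by the
# number of edges in the ground-truth DAG, as described above.
def edge_metrics(true_edges, learned_edges):
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    correct = len(true_edges.intersection(learned_edges)) / len(true_edges)
    incorrect = len(learned_edges.difference(true_edges)) / len(true_edges)
    return correct, incorrect

truth = {("A", "X"), ("X", "Y")}
learned = {("A", "X"), ("Y", "X")}  # one true edge found, one edge reversed
print(edge_metrics(truth, learned))  # (0.5, 0.5)
```
        </preformat>
        <p>Note that, in this sketch, a reversed edge counts both as a missing true edge and as an incorrect edge.</p>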
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>10 Note that for PC, we evaluate against the CPDAG and that the ground-truth CPDAG also changes with more
information available. Only exact matches (i.e., same orientation or both unoriented) are counted.
11 Note that while these are the usual metrics in structure learning research, other metrics are also frequently
used, such as the Hamming distance [9, 15]. However, we believe our choice provides a relatively easy-to-understand
view of the results.</p>
        <p>That this metric is more affected than the one measuring correct edges is expected, as the
restrictions we define through the background knowledge impact it directly. In general, we
can clearly answer RQ1 in the affirmative.</p>
        <p>Tables 3 and 4 show the results for RQ2. Generally, we can observe in Table 3 that the percentage
of correct edges increases with more data, and the percentage of incorrect edges slightly
increases with more data. However, background information does not appear to be more
valuable when more or less data is available. As shown in Table 4, the picture is more mixed
across methods. The percentage of correct edges for PC actually decreases slightly with more
information available; the percentage of incorrect edges decreases quite a bit, though. It should
be noted, however, that the ground-truth CPDAGs for PC also vary with the information, so the
numbers are not directly comparable. HC and LiNGAM benefit greatly from the information. In
accordance with previous research, we observe that HC performs best, whereas PC misses
many correct edges and LiNGAM places many wrong edges [8]. For RQ2, we can say that
background information is generally helpful in all settings but that HC and LiNGAM benefit
more from it.
Figure 3 (b) answers RQ3. We can see that most of the problematic and unproblematic variables
are found. This positive result does not hold for the other two classes. However, looking deeper
into the data, HC actually finds roughly 67% of all blocked variables when information is
available, whereas PC and LiNGAM do much worse and push down the average. HC also
finds more than 75% of all problematic variables when information is available; LiNGAM
performs above average, too. Finding situations that are indicative of opposing effects
variables is tough for all methods, although HC still does much better than average (roughly
35% when information is available). Looking deeper into our results, we observed that
these variables are often misclassified as problematic. In this way, at least, attention is drawn to
potential problems with them. Most importantly, Figure 3 (b) shows that providing information
helps classify sensitive variables, confirming RQ3. The difference is not very large, though.</p>
        <p>In general, using the background information helps with learning more correct DAGs and
with classifying the sensitive variables. HC and LiNGAM benefit greatly from background
information; PC-Stable does so less. The above evaluation only serves to show that background
information improves the correctness of DAGs. Because of space constraints, we did not add
an evaluation of how well the problematic structures are detected (i.e., the paths along which
problematic influences might exist). This evaluation will be added in future work. The
correctness of the DAGs can be further improved by focusing on score-based methods and potentially
adding even more background knowledge (e.g., some sensitive variables may be influenced
by other sensitive variables). We want to highlight that causalfair allows users to specify further
background knowledge, such as whether specific variables must be connected.</p>
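<p>One common form of background knowledge in fairness settings is that sensitive variables have no causes among the measured variables and that the prediction target causes nothing. The sketch below encodes this as a set of forbidden edges that could be handed to a structure learning method; it is an illustration with hypothetical names, not causalfair's actual API.</p>

```python
def fairness_background_knowledge(sensitive, target, variables):
    """Derive forbidden directed edges from the fairness setting:
    no edge may point into a sensitive variable, and no edge may
    point out of the prediction target.

    Returns a set of (parent, child) tuples that the structure
    learner must not insert.
    """
    forbidden = set()
    for v in variables:
        # Nothing in the data causes a sensitive variable.
        for s in sensitive:
            if v != s:
                forbidden.add((v, s))
        # The target is a sink: it causes no other variable.
        if v != target:
            forbidden.add((target, v))
    return forbidden
```

With sensitive variable S, target Y, and one further variable X, this forbids X→S, Y→S, and Y→X, which is exactly the kind of restriction discussed in the evaluation above.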
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Limitations</title>
      <sec id="sec-6-1">
        <title>6.1. Limitations</title>
        <p>As a general limitation, we want to highlight that the learned DAGs may not reflect real causal
relationships. Either important predictive variables (e.g., as highlighted by the discussion of
structure (3b) in section 3) or sensitive variables that have an effect might simply not be in the
data. While this is a general problem of measuring fairness, it is important to stress that our
CBNs do not necessarily provide a complete picture of the causal mechanisms producing the
target.</p>
        <p>Moreover, structure learning algorithms have limitations when the data is extremely
imbalanced, contains many missing values, or when the relationships between variables are non-linear
and complex. In other words, real data could pose a challenge. Real data has rarely been used to
evaluate structure learning algorithms in general [8], but doing so is, of course, very important.
Thus, future research should also focus on a real-life evaluation.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Conclusion and Outlook</title>
        <p>In this paper, we introduced a classification of sensitive variables into four classes depending
on whether and how they are involved in causal structures that could be problematic in the
ML context. Additionally, we showed that we can improve the data-driven learning of DAGs
by using background knowledge we naturally have in fairness settings. These contributions
are implemented in our Python package causalfair. We hope researchers and practitioners
use this package to evaluate whether they have problematic relationships in their data before
learning ML models. In the future, we plan to add more structure learning methods (particularly
score-based) to the package. Furthermore, we believe that future research should focus on
performing more targeted bias mitigation that can also handle cases where only some but
not all paths from a sensitive variable to a target are considered problematic. Chiappa and Isaac discuss a
technique to estimate the path-specific effects of variables [5], and we believe this is a good
starting point. Moreover, we believe that more effort should be put into constructing accurate
DAGs. We show in this paper that background knowledge helps immensely in learning better
DAGs, and we believe that further advancing the learning of DAGs using background knowledge
should be a future research endeavor.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Lea Cohausz is funded by the grant “Consequences of Artificial Intelligence for Urban Societies
(CAIUS),” by the Volkswagen Foundation.</p>
      <p>[5] S. Chiappa, W. S. Isaac, A causal Bayesian networks viewpoint on fairness, Privacy and
Identity Management. Fairness, Accountability, and Transparency in the Age of Big Data:
13th IFIP WG 9.2, 9.6/11.7, 11.6/SIG 9.2. 2 International Summer School, Vienna, Austria,
August 20-24, 2018, Revised Selected Papers 13 (2019) 3–20.
[6] C. Ashurst, R. Carey, S. Chiappa, T. Everitt, Why fair labels can yield unfair predictions:
Graphical conditions for introduced unfairness, in: Proceedings of the AAAI Conference
on Artificial Intelligence, volume 36, 2022, pp. 9494–9503.
[7] L. Cohausz, J. Kappenberger, H. Stuckenschmidt, What fairness metrics can really tell you:
a case study in the educational domain (2024).</p>
      <p>
[8] M. Scutari, C. E. Graafland, J. M. Gutiérrez, Who learns better bayesian network structures:
Accuracy and speed of structure learning algorithms, International Journal of Approximate
Reasoning 115 (2019) 235–253.
[9] M. Scanagatta, A. Salmerón, F. Stella, A survey on bayesian network structure learning
from data, Progress in Artificial Intelligence 8 (2019) 425–439.
[10] J. Pearl, Causality, Cambridge University Press, 2009.
[11] R. S. Baker, L. Esbenshade, J. Vitale, S. Karumbaiah, et al., Using demographic data as
predictor variables: a questionable choice, Journal of Educational Data Mining 15 (2023)
22–52.
[12] L. Cohausz, A. Tschalzev, C. Bartelt, H. Stuckenschmidt, Investigating the importance of
demographic features for edm-predictions (2023).
[13] L. Grinsztajn, E. Oyallon, G. Varoquaux, Why do tree-based models still outperform deep
learning on typical tabular data?, Advances in neural information processing systems 35
(2022) 507–520.
[14] B. Hicks, K. Kitto, L. Payne, S. Buckingham Shum, Thinking with causal models: A visual
formalism for collaboratively crafting assumptions, in: LAK22: 12th International Learning
Analytics and Knowledge Conference, 2022, pp. 250–259.
[15] N. K. Kitson, A. C. Constantinou, Z. Guo, Y. Liu, K. Chobtham, A survey of bayesian
network structure learning, Artificial Intelligence Review (2023) 1–94.
[16] A. C. Constantinou, Z. Guo, N. K. Kitson, The impact of prior knowledge on causal
structure learning, Knowledge and Information Systems (2023) 1–50.
[17] S. Shimizu, Lingam: Non-gaussian methods for estimating causal structures,
Behaviormetrika 41 (2014) 65–98.
[18] M. Scutari, Package ‘bnlearn’: Bayesian network structure learning, parameter learning
and inference, R package version 4 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ueda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kakinuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fujita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kamagata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fushimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nozaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakaura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fujima</surname>
          </string-name>
          , et al.,
          <article-title>Fairness of artificial intelligence in healthcare: review and recommendations</article-title>
          ,
          <source>Japanese Journal of Radiology</source>
          <volume>42</volume>
          (
          <year>2024</year>
          )
          <fpage>3</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hawn</surname>
          </string-name>
          , Algorithmic bias in education,
          <source>International Journal of Artificial Intelligence in Education</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Birzhandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Application of fairness to healthcare, organizational justice, and finance: a survey</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>216</volume>
          (
          <year>2023</year>
          )
          <fpage>119465</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castelnovo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Crupi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Regoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. G.</given-names>
            <surname>Penco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Cosentini</surname>
          </string-name>
          ,
          <article-title>A clarification of the nuances in the fairness metrics landscape</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <fpage>4209</fpage>
          . URL: https://www.nature.com/articles/s41598-022-07939-1. doi:10.1038/s41598-022-07939-1.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>