Marginal causal consistency in constraint-based causal learning


      Anna Roumpelaki              Giorgos Borboudakis          Sofia Triantafillou           Ioannis Tsamardinos
                                               Computer Science Dept.
                                                  University of Crete
                                   Voutes University Campus, Heraklion 70013, Crete


                        Abstract                               of m-separation. MAGs have several attractive properties:
                                                               They are closed under marginalization, and they are pair-
                                                               wise Markov: Every missing edge in the graph corresponds
     Maximal Ancestral Graphs (MAGs) are proba-
                                                               to a conditional independence in the distribution.
     bilistic graphical models that can model the dis-
     tribution and causal properties of a set of vari-         The independence model of a joint probability distribu-
     ables in the presence of latent confounders. They         tion is the set of conditional independencies entailed by the
     are closed under marginalization. Invariant pair-         distribution. The set of MAGs that entail the same inde-
     wise features of a class of Markov equivalent             pendence model define a Markov equivalence class. All
     MAGs can be learnt from observational data sets           invariant pairwise features of a Markov equivalence class
     using the FCI algorithm and its variations (such          of MAGs can be represented using a Partial Ancestral
     as conservative FCI and order independent FCI).           Graph (PAG) [14]. FCI [14, 15] is the first sound and
     We investigate the consistency of causal features         complete algorithm that identifies a PAG given a data set
     (causal ancestry relations) obtained by FCI in            over a set of possibly confounded variables V and a test of
     different marginals of a single data set. In prin-        conditional independence.
     ciple, the causal relationships identified by FCI
                                                               Since MAGs are closed under marginalization, FCI can be
     on a data set D measuring a set of variables V
                                                               also used in any subset V \ L of V to obtain the corre-
     should not conflict the output of FCI on marginal
                                                               sponding marginal PAG. Naturally, the causal features of
     data sets including only subsets of V. In prac-
                                                               marginal PAGs should not contradict those of the PAG on
     tice, however, FCI is prone to error propagation,
                                                               the full set of variables: a disagreement between a PAG and
     and running FCI in different marginals results
                                                               a marginal PAG can only be a result of the sensitivity of
     in inconsistent causal predictions. We introduce
                                                               FCI to statistical errors. We use the term marginal causal
     the term of marginal causal consistency to de-
                                                               consistency to describe the degree of agreement of causal
     note the consistency of causal relationships when
                                                               relationships among a PAG and its marginals. To the best
     learning marginal distributions, and investigate
                                                               of our knowledge, this type of consistency of constraint-
     the marginal causal consistency of different FCI
                                                               based causal discovery has not been examined before. We
     variations.Results indicate that marginal causal
                                                               examine the outputs of FCI [14], order-independent FCI
     consistency varies for different algorithms, and
                                                               [3], and conservative FCI with the majority rule heuristic
     is also sensitive to network density and marginal
                                                               for collider orientations [11, 3].
     size.
                                                               In a simulated setting, we found that the algorithms’ con-
                                                               sistency is sensitive to network density and number of vari-
                                                               ables that are marginalized out. To examine whether the
1   INTRODUCTION
                                                               consistency of causal relationships correlates with the cor-
                                                               rectness of the induced causal features, we ranked them ac-
Maximal Ancestral Graphs (MAGs) [12] can represent the
                                                               cording to their frequency of appearance in randomly se-
causal relationships among a set of measured variables, as
                                                               lected marginals, and compared the resulting AUCs with
well as the conditional independence of their joint proba-
                                                               bootstrapping. While marginal consistency measures the
bility distribution, in the presence of latent confounders.
                                                               sensitivity of an algorithm to a specific choice of mea-
Under the causal Markov and Faithfulness assumptions           sured variables, bootstrapping measures the sensitivity of
[14], every conditional independence that holds in the dis-    an algorithm to the specific sample. Results show that
tribution can be identified in the graph using the criterion   marginal consistency can help identify accurate causal fea-
tures. However, bootstrapping outperforms marginal-based         circles on the path into appropriate tails or arrowheads.
ranking in all cases.
                                                                 Under the causal Markov and Faithfulness assumptions
The rest of this paper is structured as follows: Section 2       [14], the conditional independencies that hold in a joint
introduces basic MAG notions and notation. Section 3 de-         probability distribution can be identified in the correspond-
fines marginal causal consistency in FCI outputs, presents       ing causal graph according to the graphical criterion of m-
a sound and complete method for identifying all pairwise         separation [12]. Constraint-based methods query the data
causal ancestry relationships in a PAG, and compares in-         to identify the independence model, and then try to find
ternal consistency of outputs of different FCI variations.       the causal MAGs that satisfy all and only the observed in-
Section 4.1 describes work related to obtaining confidence       dependencies. A class of Markov equivalent MAGs, that
estimates for causal ancestry relations. Section 4 describes     differ in a subset of edge orientations, will in general sat-
an algorithm for ranking causal relationships for a given        isfy the observed constraints. PAGs have the same adjacen-
data set, and compares the AUC of the proposed approach          cies and all invariant orientations shared by all MAGs in a
to confidence estimates obtained with bootstrapping. Con-        Markov equivalence class. Specifically, an edge end-point
clusions and future directions are discussed in Section 5.       is oriented as an arrowhead (’>’) or tail (’-’) in a PAG if
                                                                 and only if it is invariant in all MAGs represented by it,
2   PRELIMINARIES                                                and is left as a circle (’o’) otherwise.
                                                                 FCI is a sound and complete algorithm for discovering
We use V to denote random variables, and V to denote a           PAGs from observational data, but it is prone to statisti-
set of variables. A graph G, denoted as G = (V, E) is            cal errors. Several extensions have been proposed that aim
defined over a set of variables V with edges E. A path p is a    to tackle the sensitivity of FCI to error propagation: or-
sequence of adjacent edges, without repetition. A directed       der independent FCI, conservative FCI and majority rule
path is a path where all edges are directed and have the         FCI among others. Order independent FCI (denoted iFCI
same direction. We use X 99K Y to denote that there exists       in the rest of this paper), proposed by [3], outputs a PAG
a directed path from X to Y (X is an ancestor of Y in G).        that does not depend on the order the variables are given.
We use X ⊥ ⊥Y |Z to denote variables X and Y to be inde-         Conservative FCI [11] checks all unshielded triplets in the
pendent given a set Z. In a graph G a vertex V is a collider     following way: for every unshielded triple X − Y − Z
on a path u if and only if there are two distinct edges on u     check all subsets of X’s potential parents and all subsets of
containing V as an endpoint and both are into V. Otherwise,      Z’s potential parents. If Y is not in any subsets, then ori-
V is a non-collider on U. In a graph G, a triplet X − V − Y      ent the triplet as a collider; if Y is in all subsets leave the
is unshielded if X and Y are not adjacent in G.                  triplet as a non-collider; otherwise tag the triple as unfaith-
                                                                 ful. Majority rule FCI (denoted mFCI in the rest of this
A Bayesian network is represented by a Directed Acyclic          paper) is slightly less strict than conservative FCI. In this
Graph (DAG) G over a set of variables V and a joint proba-       extension, an unshielded triple is oriented as a collider if Y
bility distribution P. A directed edge denotes a direct causal   is in less than 50% of the subsets and as a non-collider if
relationship. We assume that the following two conditions        it is in more than 50% of the subsets. To avoid unfaithful
hold: the Causal Markov Condition (CMC), which states            triples in case of ties, we leave the triple as a non-collider.
that every variable is independent of its non-descendants        We examine and compare the marginal consistency of FCI,
given it’s parents, and the Faithfulness Condition (FC),         order-independent FCI and majority rule FCI in simulated
which states that all the conditional independencies that        data.
hold in P stem from G and the CMC.
MAGs are mixed graphs, i.e. they can have both directed
                                                                 3   MARGINAL CAUSAL CONSISTENCY
and bi-directed edges. A directed edge X → Y indicates
that X causes Y , while a bi-directed edge X ↔ Y indi-               IN PAGS
cates that X and Y are confounded. Two nodes can be con-
nected with only one type of edge, and ancestry has prece-       We now define the problem of marginal causal consistency
dence over confounding: if X causes Y and the two are            of a constraint-based algorithm. Intuitively, we are inter-
also confounded, only the directed edge X → Y is present         ested in how much marginalizing out variables and rerun-
in the MAG. MAGs can also be used to represent selection         ning FCI (or a variation) preserves causal relationships.
bias via undirected edges. For this paper, we assume no
                                                                 To help define the problem, we use AnP to be the set of all
selection bias and only consider MAGs with directed and
                                                                 ancestral relationships that hold in the Markov equivalence
bi-directed edges.
                                                                 class [P] entailed by P. Notice that this set may be differ-
A path is called uncovered if every consecutive triple on the    ent than the set of ancestral relationships T RP that can be
path is unshielded [15]. Also, a path is potentially directed    identified directly from P by taking the transitive closure
if it can be oriented into a directed path by changing the       of the directed edges: It may include additional pairs that
are connected by a different causal path in every member               path p2 = hX, V, V2 , . . . Vm , Y i of length m with U 6= V
of [P], so that there is no fully oriented path in P. Fig-             must also exist. Assume there is no such path. If for every
ure 1 shows an example where an invariant causal relation-             MAG in [P], X → U , then p1 is a directed path in P (since
ship does not correspond to a single directed path in the              every triple on the path is a definite non collider), and X ∈
PAG: while T RP for the PAG of Figure 1a is the empty                  AnP (Y ), which is a contradiction. Thus, X            U in P,
set, AnP = {(X, Y )}.                                                  and there exists a MAG in [P] where X           U and p1 is not
                                                                       a directed path. If p1 , is the only p.d. path from X to Y in
Identifying AnP is not trivial. [1] use a method to im-
                                                                       P, then X 99K Y 6∈ [P], which is a contradiction. Thus,
plicitly enumerate all MAGs in a Markov equivalence class
                                                                       ∃p2 = hX, V, V2 , . . . Vm , Y i with U 6= V .
represented by P. The same techniques can be used to
identify AnP , by identifying the invariant ancestral rela-            Next we show that hV, X, U i form an unshielded definite
tions present in all such MAGs1 .                                      non-collider in P:
                                                                       hV, X, U i is not a collider in P (trivially since p1 and p2 are
However, in this work we show that all causal relationships
                                                                       potentially directed paths). We must also show that U, V
that are invariant in P but are not in T RP correspond to
                                                                       are not adjacent. If U        V in , then there exists a MAG
a specific pattern, illustrated in Figure 1. The pattern can
                                                                       in [P] such that U → X in P, which is inconsistent since
be easily found in P to identify all additional relationships
                                                                       p1 is a p.d. path.                                            
that are causal in P but not present in T RP . Theorem 3.1
proves soundness and completeness of this rule.                        Apart from (positive) causal relations, a PAG can also have
                                                                       negative and ambiguous causal relations. X and Y share
Theorem 3.1 Let P be the PAG over a set of variables V                 a negative causal relationship in P if X cannot be a cause
representing the Markov equivalence class of MAGs [P].                 of Y in P. This happens if (X, Y ) 6∈ AnP , and there can
Then, if (X, Y ) 6∈ T RP , X 99K Y ∈ AnP if and only if                be no directed path from X to Y in P (i.e. X           Y in
∃U, V, U 6= V such that                                                P, or there is no potentially directed path from X to Y in
                                                                       P). Naturally, this does not mean that Y causes X. An
  1. hX, U, . . . Y i and hX, V, . . . Y i form uncovered po-          ambiguous causal relationship occurs when a relationship
     tentially directed paths and                                      is neither positive nor negative.
  2. hU, X, V i is an unshielded definite non collider in P.           We use the notation N AnP to denote the set of negative
                                                                       causal relationships in a PAG P. The conditions mentioned
Proof (⇐) Let U, V be distinct variables such that                     above for membership in N AnP can easily be tested in P:
hX, U, . . . Y i and hX, V, . . . Y i are uncovered p.d paths,         To rule out the existence of a possible directed path, only
with hU, X, V i an unshielded definite non-collider in P.              uncovered possibly directed paths need to be checked [15].
Let U        X in some MAG in [P], where ? is used as                  Notice that the set of ancestral and non-ancestral relations
a meta-symbol denoting any plausible orientation. Then                 in a PAG are by no means complementary, since ambiguous
X → V (since U, X, V form a definite non collider) and                 relations also exist.
V 99K Y (since every triple on the path is a definite non              We are interested in constraint-based algorithms’ consis-
collider). Else if U ← X, then U 99K Y (since every                    tency to marginal ancestral sets: Let D be a data set over
triple on the path is a definite non collider). Thus, (X, Y )          a set of possibly confounded variables V, and let G be the
in AnP .                                                               PAG output of a sound and complete constraint-based algo-
(⇒) We will first prove that if there is a directed path from          rithm. P defines an ancestral set AnP and a non-ancestral
X to Y in all MAGs in [P], and there is no such directed               set N AnP . Also, let P[L , L ⊂ V, be the marginal PAG
path in P then there exist at least two potentially directed           obtained using the same algorithm (with the same hyper-
paths (and thus, two uncovered potentially directed paths)             parameters) on the restriction of data set D on variables in
in P:                                                                  L. Each marginal PAG P[L defines a marginal ancestral set
                                                                       AnP[L and a marginal negative ancestral set N AnP[L .
If there is a directed path from X to Y in every MAG in
[P], then there is a p.d. path from X to Y in P. By Lemma              Assuming no statistical errors, any marginal ancestral set is
B.1 in [15] there is an uncovered p.d. path from X to Y                a subset of the ancestral set. Thus, there is no pair (X, Y )
in P. Let p1 = hX, U, U2 , . . . Un , Y i, be such a path of           present in any AnP [L that is not present in AnP . On the
length n. We want to show that another uncovered p.d.                  contrary, some ancestral relationships that can be identified
                                                                       in the full data set may not be identifiable in a marginal,
    1
      Although theoretically possible, the algorithm assumes that      due to (a) the fact that some variables are not included in
the input PAG and separating sets are correct, that is, that the de-   the marginal and (b) the loss of information from exclud-
pendencies and independencies encoded in the PAG are the ones
implied by the separating sets. In practice however, as suggested      ing variables. Therefore, members of AnP are possibly
by anecdotal experiments, FCI and its variants rarely produce          not present in some AnP[L . Similarly, any marginal non-
such output.                                                           ancestral set is a subset of the non-ancestral set, while some
          (a)                             (b)                            (c)                             (d)

Figure 1: (a) Example of a PAG in which each Markov equivalent MAG contains a directed path from X to Y . (b,c)
Orienting the edge between U and X as U      X creates the directed path X → V → Y , while orienting it as U ← X
creates the directed path X → U → Y . (d) The general pattern as described in Theorem 3.1. Dashed edges correspond to
potentially directed paths.

      E              F                                                             F


 A              B              C                D               A              B                C                  D


                         (a)                                                           (b)

Figure 2: An example of ancestral and non-ancestral sets for a PAG P and a marginal PAG P[E .
Ancestral and non-ancestral sets: AnP = {(B, {C, D}), (C, D)},
N AnP = {(A, {E, F }), (B, {A, E, F }), (C, {A, B, E, F }), (D, {A, B, C, E, F }), (E, {A, F }), (F, {A, B, E})}.
The corresponding marginal sets are AnP[E = {(C, D)},
N AnP[E = {(A, F ), (B, F ), (C, {A, B, F }), (D, {A, B, C, F }), (F, {A, B})}


members of N AnP may not be present in some N AnP[L .
                                                                                               Marginal PAG P[L
In addition, the PAGs should not encode any conflicting
causal information: Members of AnP cannot also be mem-                                   Ancestral Non ancestral
bers of N AnP[L , and members of N AnP cannot also be                      Ancestral            p              c
members of AnP[L .
                                                                PAG P Non ancestral             d           n
For finite samples, however, statistical errors propagate in
both the skeleton identification and the orientation steps of             Ambiguous             e           f
constraint-based algorithms, and can result in conflicting                                   |AnP[L |   |N AnP[L |
orientations. Since the algorithms are run on marginal ver-
sions of the same data set, conflicts in the marginal ances-    Table 1: Confusion matrix for (non) ancestral relationships
tral sets can be viewed as a measure of robustness of the       for a marginal of P.
algorithm to statistical errors. We are therefore interested
in comparing the marginal causal consistency of different
FCI variations. Marginal consistency can also been as a                                        Marginal PAG P[L
measure of the sensitivity of an algorithm to the specific
                                                                                         Ancestral Non ancestral
choice of observed variables.
                                                                           Ancestral            2              0
For a given algorithm, we are interested in how often pos-
itive and negative ancestral relationships entailed by the      PAG P Non ancestral             0          11
marginal outputs P[L agree or disagree with the output P
                                                                          Ambiguous             0              0
over the full set of variables. This information is entailed
in the confusion matrix described in Table 1.                                                   2          11
Some remarks:
                                                                    Table 2: Confusion matrix for the PAGs in Figure 2.
  • For perfect statistical knowledge, c = d = e = f = 0,       Due to statistical errors, the output of the FCI algorithm
    p = |AnP[L |, n = |N AnP[L |. Notice that an-               is also not necessarily a valid PAG. A very common prob-
    cestral and non-ancestral relationships identified in a     lem is the creation of cycles or almost cycles. To tackle
    marginal PAG are not expected to be ambiguous in the        this problem, we added the option to aggressively prevent
    PAG over the complete set of variables V.                   cycles, as implemented in TETRAD [13]. This functional-
                                                                ity is applied in the phase of orienting edges, where every
  • The sum p + d + c + n + e + f is different (smaller)        attempted orientation is checked for creating an (almost)
    than the number of possible causal relationships in         cycle. If that is the case, then that specific orientation is not
    the marginal data set, since P[L also has ambiguous         performed, and the orientation rules move on to the next
    edges. We do not take these edges into account here,        possible orientation. We have to note that we do not use
    because they are not an indication of consistency or        the orientation rules that aim at recovering undirected edges
    inconsistency of the marginal. Edges that are (non)         (selection bias).
    ancestral in P can often be ambiguous in P[L , even if
    the endpoints are present in the marginal.                  The results of the experiments are shown in Figures 3 and
                                                                4. The ratios were computed by summing over all numer-
                                                                ators and dividing by the sum of all denominators (for ex-
An example of (non) ancestral relations in a PAG and a cor-     ample, for |AnpP[ | we summed over all correctly identified
responding marginal PAG is shown in Figure 2. The corre-                         L
                                                                ancestral relations p and divided by the total number of pre-
sponding confusion matrix is shown in Table 2 (assuming
                                                                dicted ancestral relations |AnP[L | over all marginals and
perfect statistical knowledge).
                                                                datasets). This guarantees that each bar sums to one and
Notice however, that the confusion matrix in Table 1 only       avoids divisions by zero, in case no relation is predicted.
measures the robustness of an algorithm in terms of causal
                                                                For all algorithms, the consistency drops for both denser
predictions. Other characteristics and uncertainties of FCI
                                                                networks and smaller marginals. For networks with den-
outputs are not taken into account.
                                                                sity 0.1 all algorithms have more than 50% consistent pre-
                                                                dictions for both 18 and 15-variable marginals. For denser
3.1   Marginal consistency of FCI variations in                 networks however, the performance of iFCI and FCI drops
      simulated data.                                           below 0.2. mFCI has the largest ratio of consistent rela-
                                                                tionships, and retains a consistency ratio of 0.60 for 18-
We used simulated data to examine the internal consistency      variable marginals. However, its performance drops to 0.38
of the FCI algorithm. We used the pcalg package [8] to          for 15-variable marginals. For all algorithms, the majority
simulate random DAGs with 20 variables. We tried two            of causal relationships are found non-ancestral in P, and a
different graph densities, 0.1 and 0.2 (corresponding to 1.9    small ratio is found ambiguous. It is worth noting, how-
and 3.8 neighbors per variable on average, respectively).       ever, that the algorithms typically output very few positive
We use linear Gaussian parametrization, with coefficient        causal relationships.
of each model sampled using the default parameters in the
pcalg package. For each graph density we generated 50           Non-ancestral causal relationships on the other hand are
DAGs and for each such network we simulated data with           much more consistent, as shown in Figures 5 and 6. The
1000 samples.                                                   majority of non-ancestral relationships are consistent (blue
                                                                bars, |AnnP[ | ). Overall, mFCI has the highest ratio of con-
We used different variations of the FCI algorithm to retrieve               L
                                                                sistent negative relationships.
a causal network with significance threshold α = 0.05 and
unconstrained maximum conditioning set. We also created         Overall, results show that (a) the performance of constraint-
100 randomly selected marginal data sets of size 18 and         based algorithms heavily depends on the graph density, par-
15 for each DAG, and ran all FCI variations with the same       ticularly for algorithms that are less conservative, and thus
parameters.                                                     more sensitive to error propagation and (b) The causal pre-
                                                                dictions of the algorithms are sensitive to marginalization.
The variations of the FCI algorithm we used are the follow-
                                                                Even for mFCI, removing two out of 20 variables results in
ing: order independent FCI (iFCI), FC, FCI with majority
                                                                30-40% relations that are not validated in the marginal data
rule (mFCI). We did not include the conservative FCI, as
                                                                set.
the existence of ambiguous triples results in outputs that
are not complete PAGs. Hence, Theorem 3.1 can not be ap-
plied. Instead, we included the majority rule FCI, which is
inspired by conservative FCI, but in which the triplet’s ori-
entation is dictated by a majority vote on the corresponding
conditional independence tests. To guarantee that the out-
put is a valid PAG, ambiguous triplets (occurring in a tie
vote) are marked as definite non-colliders.
                              (a)                                                          (b)

Figure 3: Barplots showing the ratios of positive relations in (a) 18-variable marginals, (b) 15-variable marginals in the
graphs of density 0.1. Blue: Consistent ancestral relationships. Red: Inconsistent ancestral relationships (found non-
ancestral in P). Pink: Inconsistent ancestral relationships (found ambiguous in P).


                              (a)                                                          (b)

Figure 4: Barplots showing the ratios of positive relations in (a) 18-variable marginals, (b) 15-variable marginals in the
graphs of density 0.2. Blue: Consistent ancestral relationships. Red: Inconsistent ancestral relationships (found non-
ancestral in P). Pink: Inconsistent ancestral relationships (found ambiguous in P).


                              (a)                                                          (b)

Figure 5: Barplots showing the ratios of negative relations in (a) 18-variable marginals, (b) 15-variable marginals in the
graphs of density 0.1. Blue: Consistent non-ancestral relationships. Red: Inconsistent non-ancestral relationships (found
ancestral in P). Pink: Inconsistent non-ancestral relationships (found ambiguous in P). The majority of non-causal
predictions are consistent (blue).


4   RANKING CAUSAL RELATIONSHIPS                                they are not consistent with the output of the algorithm on
    BASED ON MARGINAL CAUSAL                                    the whole data set.
    CONSISTENCY
                                                                4.1   Related Work
Calculating the marginal ancestral sets indicates a way of
                                                                Alternative approaches for ranking causal ancestry rela-
ranking pairwise causal relationships according to their fre-
                                                                tions can be categorized into (a) Bayesian model averaging
quency of appearance in AnP[L for different marginals.
                                                                methods and (b) resampling-based methods.
The idea is that causal relationships that frequently appear
in marginal PAGs will tend to be true more often, even if       Bayesian model averaging methods compute the posterior
                             (a)                                                                (b)

Figure 6: Barplots showing the ratios of negative relations in (a) 18-variable marginals, (b) 15-variable marginals in the
graphs of density 0.2. Blue: Consistent non-ancestral relationships. Red: Inconsistent ancestral relationships (found
ancestral in P). Pink: Inconsistent non-ancestral relationships (found ambiguous in P). The majority of non-causal
predictions are consistent (blue).


                   (a)                                         (b)                                         (c)

Figure 7: ROC curves of the variations of the FCI algorithm in graphs with originally 20 variables and density 0.1. (a)
18-variable marginals (b) 15-variable marginals (c) bootstrapping.


                   (a)                                         (b)                                         (c)

Figure 8: ROC curves of the variations of the FCI algorithm in graphs with originally 20 variables and density 0.2. (a)
18-variable marginals (b) 15-variable marginals (c) bootstrapping.


probability of network features by averaging over network            Resampling-based methods repeatedly apply a learning
structures. This can be either approximated using MCMC               method on resampled datasets and estimate the confidence
[6] or by using exact methods [10, 2]. Apart from their high         of network features as the proportion of induced networks
computational cost, such methods are not very general and            in which they appeared. Such methods include parametric
are only applicable to cases where network scores can be             and non-parametric bootstraps [5], as well as stability se-
computed. Although various scores exist for Bayesian net-            lection [9]. The main advantage is that they are general and
work structures [7], scores for MAGs have only recently              thus also applicable in the case of PAGs.
been explored [12] and require more expensive fitting pro-
                                                                     In this section we compare our approach with the non-
cedures [4].
parametric bootstrap by [5].                                    sible unmeasured variables. It is also possible that confi-
                                                                dence metrics computed based on marginals could be more
4.2   Experiments                                               helpful in situations where you have ”outlier” variables that
                                                                do not satisfy the algorithm’s assumptions (e.g. they create
We produced rankings for all possible pairs of variables        cycles, do not satisfy the distributional assumptions of con-
according to (a) the frequency of appearance in the cor-        ditional independence tests etc). In such cases, it is pos-
responding marginal ancestral sets AnP[L and (b) the fre-       sible that taking random marginals could improve the al-
quency of appearance in AnP for FCI outputs on differ-          gorithms’ performance (similar to the way bootstrapping is
ent bootstrap samples of the initial data set D. Since we       beneficial when you have outlier samples).
only use 100 different marginals per iteration, for each pair   We must also point out that PAGs encode much richer in-
(X, Y ), the frequency of appearance in AnP[L was divided       formation than the causal ancestry relations examined here.
by the number of data sets in which X and Y are both            Exploring different types of marginal consistency is also an
present (i.e. we excluded from the calculation marginals        area of interest.
in which the causal ancestry could not be found). For ev-
ery pair of variables the true label is 1 if the relationship
                                                                Acknowledgments
is ancestral in the DAG the data was sampled from, and 0
otherwise.                                                      We would like to thank the anonymous reviewers for their
Based on these ranking and the status of the relationship       comments, particularly reviewer 2 who helped us identify
in the ground truth network (used to simulate the data), we     a problem in the simulations in the submitted version of
calculated ROC curves and the corresponding AUCs. Fig-          this paper. This work was funded by European Research
ures 7(a,b) and 8(a,b) show the performance of the algo-        Council (ERC) and is part of the CAUSALPATH - Next
rithms using marginals of 15 and 18 variables for densities     Generation Causal Analysis project, No 617393.
0.1 and 0.2 respectively. For all settings, ROC curves are
significantly better than random guessing. Again, the re-       References
sults are better for sparser networks, where AUC ranges
from 82.25-91.84%. The AUCs drop for denser networks,            [1] G. Borboudakis and I. Tsamardinos. Incorporat-
ranging from 60.78-83.57%. mFCI has the highest AUCs                 ing causal prior knowledge as path-constraints in
for all settings.                                                    Bayesian networks and maximal ancestral graphs.
                                                                     Proceedings of the 29th International Conference on
We compared our method against bootstrapping. For each
                                                                     Machine Learning, pages 1799–1806, 2012.
DAG, 100 data sets with 1000 samples were resampled
with replacement from the original data set. FCI was ran         [2] Y. Chen, L. Meng, and J. Tian. Exact bayesian learn-
on the new data sets, each time calculating the ancestral set        ing of ancestor relations in bayesian networks. In Pro-
AnP on the output PAG. Each order pair of variables was              ceedings of the Eighteenth International Conference
then ranked according to the frequency of appearance in              on Artificial Intelligence and Statistics, 2015.
the ancestral sets. The results are shown in figures 7(c) and    [3] D. Colombo and M. H. Maathuis. Order-independent
8(c), where we can see that bootstrapping outperforms the            constraint-based causal structure learning. Journal of
marginal-based ranking in all cases. Again, the majority             Machine Learning Research, 15(1):3741–3782, 2014.
rule FCI outperforms the rest of the algorithms.
                                                                 [4] M. Drton, M. Eichler, and T. S. Richardson. Comput-
                                                                     ing maximum likelihood estimates in recursive linear
5     DISCUSSION                                                     models with correlated errors. Journal of Machine
                                                                     Learning Research, 10:2329–2348, 2009.
We defined and examined the problem of marginal causal           [5] N. Friedman, M. Goldszmidt, and A. Wyner. Data
consistency in constraint-based causal discovery. We pre-            analysis with bayesian networks: A bootstrap ap-
sented a way to identify all invariant causal relationships          proach. In Proceedings of the Fifteenth Conference
entailed in a complete PAG.                                          on Uncertainty in Artificial Intelligence, 1999.
We examined how well causal and non-causal relationships         [6] N. Friedman and D. Koller. Being Bayesian about
predicted by three different variations of FCI are preserved         network structure. In Proceedings of the Sixteenth
when you marginalize out variables. Results indicate that            Conference Annual Conference on Uncertainty in Ar-
constraint-based learning methods for causal networks are            tificial Intelligence, 2000.
sensitive to the selected marginal, particularly for dense
                                                                 [7] D. Heckerman, D. Geiger, and D. M. Chickering.
networks.
                                                                     Learning Bayesian networks: The combination of
The results are important because in most real-life applica-         knowledge and statistical data. Machine Learning,
tions researchers may have limited knowledge on the pos-             20(3):197–243, 1995.
 [8] M. Kalisch, M. Mächler, D. Colombo, M. H.
     Maathuis, and P. Bühlmann. Causal inference using
     graphical models with the r package pcalg. Journal of
     Statistical Software, 47(11):1–26, 2012.
 [9] N. Meinshausen and P. Bühlmann. Stability selec-
     tion. Journal of the Royal Statistical Society, Series
     B, 72:417–473, 2010.
[10] P. Parviainen and M. Koivisto. Ancestor relations in
     the presence of unobserved variables. In Machine
     Learning and Knowledge Discovery in Databases,
     volume 6912 of Lecture Notes in Computer Science,
     pages 581–596. Springer, 2011.
[11] J. Ramsey, J. Zhang, and P. Spirtes. Adjacency-
     faithfulness and conservative causal inference. In
     Proceedings of the Twenty-Second Conference on Un-
     certainty in Artificial Intelligence, 2006.
[12] T. Richardson and P. Spirtes. Ancestral graph Markov
     models. Annals of Statistics, 30(4):962–1030, 2002.
[13] R. Scheines, P. Spirtes, C. Glymour, C. Meek, and
     T. Richardson. The tetrad project: Constraint based
     aids to causal model specification. Multivariate Be-
     havioral Research, 33(1):65–117, 1998.
[14] P. Spirtes, C. Glymour, and R. Scheines. Causation,
     Prediction, and Search. MIT Press, Cambridge, MA,
     2nd edition, 2000.
[15] J. Zhang. On the completeness of orientation rules
     for causal discovery in the presence of latent con-
     founders and selection bias. Artificial Intelligence,
     172(16-17):1873–1896, 2008.