Score-based vs constraint-based causal learning in the presence of confounders

Sofia Triantafillou
Computer Science Dept.
University of Crete
Voutes Campus, 700 13 Heraklion, Greece

Ioannis Tsamardinos
Computer Science Dept.
University of Crete
Voutes Campus, 700 13 Heraklion, Greece

Abstract

We compare score-based and constraint-based learning in the presence of latent confounders. We use a greedy search strategy to identify the best-fitting maximal ancestral graph (MAG) from continuous data, under the assumption of multivariate normality. Scoring maximal ancestral graphs is based on (a) residual iterative conditional fitting [Drton et al., 2009] for obtaining maximum likelihood estimates for the parameters of a given MAG, and (b) factorization and score decomposition results for mixed causal graphs [Richardson, 2009, Nowzohour et al., 2015]. We compare the score-based approach in simulated settings with two standard constraint-based algorithms, FCI and conservative FCI. The results show a promising performance of the greedy search algorithm.

1 INTRODUCTION

Causal graphs can capture the probabilistic and causal properties of multivariate distributions. Under the assumptions of the causal Markov condition and faithfulness, the graph induces a factorization of the joint probability distribution, and a graphical criterion (d-separation) can be used to identify all and only the conditional independencies that hold in the joint probability distribution.

The simplest case of a causal graph is a directed acyclic graph (DAG). A causal DAG G and a faithful probability distribution P constitute a causal Bayesian network (CBN) [Pearl, 2000]. Edges in the graph of a CBN have a straightforward interpretation: a directed edge X → Y denotes a causal relationship that is direct in the context of the variables included in the DAG. In general, CBNs are considered in settings where causal sufficiency holds, i.e., where there are no latent confounders. This is restrictive, since in most cases we cannot, or do not, observe all the variables that participate in the causal mechanism of a multivariate system.

We consider a representation for an equivalence class of models based on maximal ancestral graphs (MAGs) [Richardson and Spirtes, 2002]. MAGs are extensions of CBNs that also consider latent confounders. Latent confounders are represented with bi-directed edges. The set of conditional independencies that hold in a faithful probability distribution can be identified from the graph with the graphical criterion of m-separation. The causal semantics of edges in MAGs are more complicated: directed edges denote causal ancestry, but the relationship is not necessarily direct. Bi-directed edges denote latent common causes. However, each pair of variables can share only one edge, and causal ancestry takes precedence over confounding: if X is a causal ancestor of Y and the two are also confounded, then X → Y in the MAG. MAGs have several attractive properties: they are closed under marginalization, and every non-adjacency corresponds to a conditional independence.

There exist two main approaches for learning causal graphs from data. Constraint-based approaches infer the conditional independencies imprinted in the data and search for a DAG/MAG that entails all (and only) these independencies according to d/m-separation. Score-based approaches try to find the graph G that maximizes the likelihood of the data given G (or the posterior), according to the factorization imposed by G. In general, a class of causal graphs, called Markov equivalent graphs, fit the data equally well. Constraint-based approaches are more efficient and output a single graph with clear semantics, but give no indication of the relative confidence in the model. Moreover, they have been shown to be sensitive to error propagation [Spirtes, 2010]. Score-based methods, on the other hand, do not have this problem, and they also provide a metric of confidence in the entire output model. Hybrid methods that exploit the best of both worlds have therefore proved successful in learning causal graphs from data [Tsamardinos et al., 2006].

Numerous constraint-based and score-based algorithms exist that learn causal DAGs (classes of Markov equivalent DAGs) from data. Learning MAGs, on the other hand, is typically done with constraint-based algorithms. A score-based method for mixed causal graphs (not necessarily MAGs) has recently been proposed [Nowzohour et al., 2015], based on factorization results for mixed graphs [Tian and Pearl, 2003, Richardson, 2009].

Using these decomposition results, we implemented a simple greedy search for learning MAGs from data. We compare the results of this approach with the outputs of FCI [Spirtes et al., 2000, Zhang, 2008] and conservative FCI [Ramsey et al., 2006]. Greedy search performs slightly worse in most settings in terms of structural Hamming distance, and better than FCI in terms of precision and recall.

Based on these results, we believe that the score-based approach can be used to improve the learning of causal graphs in the presence of confounders. The algorithm implementation and code for the detailed results are available at https://github.com/striantafillou.

The rest of the paper is organized as follows: Section 2 briefly reviews causal graphs with and without causal sufficiency. Section 3 gives an overview of constraint-based and score-based methods for DAGs and MAGs. Section 4 describes a greedy search algorithm for learning MAGs. Related work is discussed in Section 5. Section 6 compares the performance of the algorithm against FCI and CFCI. Conclusions and future work are presented in Section 7.
2 CAUSAL GRAPHS

We begin with some graphical notation. A mixed graph (MG) is a collection of nodes (interchangeably, variables) V, along with a collection of edges E. Edges can be directed (X → Y) or bi-directed (X ↔ Y). A path is a sequence of adjacent edges (without repetition). The first and last nodes of a path are called the endpoints of the path. A bi-directed path is a path on which every edge is bi-directed. A directed path is a path on which every edge is directed and oriented in the same direction. We use X ⇢ Y to symbolize a directed path from X to Y. A directed cycle occurs when there exists a directed path X ⇢ X. An almost directed cycle occurs when X ↔ Y and X ⇢ Y. A triplet ⟨X, Y, Z⟩ of consecutive nodes on a path forms a collider if X → Y ← Z. If X and Z are not adjacent, the triplet is an unshielded collider.

A mixed graph is called ancestral if it has no directed or almost directed cycles. An ancestral graph without bi-directed edges is a DAG. X is a parent of Y in an MG G if X → Y in G. We use the notation Pa_G(X) and An_G(X) to denote the sets of parents and ancestors of X in G.

Under causal sufficiency, DAGs can be used to model causal relationships: for a graph G over a set of variables V, X → Y in G if X causes Y directly (no variables in V mediate this relationship). Under the causal Markov condition and faithfulness [Pearl, 2000], G is connected to the joint probability distribution P over V through the criterion of d-separation (defined below). Equivalently, the causal Markov condition imposes a simple factorization of the joint probability distribution:

    P(V) = ∏_{V ∈ V} P(V | Pa_G(V))    (1)

Thus, the parameters of the joint probability distribution describe the probability density function of each variable given its parents in the graph. An interesting property of CBNs, which constitutes the basis of constraint-based learning, is the following: every missing edge in the DAG of a CBN corresponds to a conditional independence. Hence, if X is independent of Y given Z (symb. X ⊥⊥ Y | Z) in P, then X and Y are not adjacent in G.
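To make the factorization of Equation 1 concrete in the Gaussian case used later in this paper, the following is a minimal illustrative sketch (ours, not part of the original text): each node contributes the log-likelihood of its linear regression on its parents, so the total score decomposes node-wise and a local change to the graph only touches the affected terms. The helper names (node_loglik, dag_loglik) are purely illustrative.

    import numpy as np

    def node_loglik(data, child, parents):
        """Gaussian log-likelihood of one node given its parents:
        OLS regression of `child` on `parents`, residual variance by MLE."""
        y = data[:, child]
        n = len(y)
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        var = resid @ resid / n                      # MLE of the residual variance
        return -0.5 * n * (np.log(2 * np.pi * var) + 1)

    def dag_loglik(data, dag):
        """Decomposed log-likelihood of Eq. (1): one term per node.
        `dag` maps each node index to the list of its parents."""
        return sum(node_loglik(data, v, ps) for v, ps in dag.items())

    # toy example on the chain 0 -> 1 -> 2
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=500)
    x1 = 0.8 * x0 + rng.normal(size=500)
    x2 = -0.5 * x1 + rng.normal(size=500)
    data = np.column_stack([x0, x1, x2])
    print(dag_loglik(data, {0: [], 1: [0], 2: [1]}))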
In general, a class of Markov equivalent DAGs fit the data equally well. DAGs in a Markov equivalence class share the same skeleton and unshielded colliders. A Pattern DAG (PDAG) can be used to represent the Markov equivalence class of DAGs: it has the same edges as every DAG in the Markov equivalence class, and the orientations that are shared by all DAGs in the Markov equivalence class.

Confounded relationships cannot be represented in DAGs, and mixed causal graphs were introduced to tackle this problem. The most straightforward approach is that of semi-Markov causal models (SMCMs) [Tian and Pearl, 2003]. The graphs of semi-Markov causal models are acyclic directed mixed graphs (ADMGs). Bi-directed edges are used to denote confounded variables, and directed edges denote direct causation. A pair of variables can share up to two edges (one directed, one bi-directed). The conditional independencies that hold in a faithful distribution are represented through the criterion of m-separation:

Definition 2.1 (m-connection, m-separation). In a mixed graph G = (E, V), a path p between A and B is m-connecting given (conditioned on) a set of nodes Z, Z ⊆ V \ {A, B}, if

1. every non-collider on p is not a member of Z, and

2. every collider on p is an ancestor of some member of Z.

A and B are said to be m-separated by Z if there is no m-connecting path between A and B relative to Z. Otherwise, they are said to be m-connected given Z. For graphs without bi-directed edges, m-separation reduces to the d-separation criterion.
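Definition 2.1 lends itself to a reachability check over (node, incoming-mark) states: a node on a path is a collider exactly when both edge marks incident to it on the path are arrowheads. The following is a minimal sketch (ours, not the paper's implementation); it searches over walks, which is equivalent to searching over paths for this criterion. Directed edges are encoded as (a, b) pairs for a → b, and bi-directed edges as two-element frozensets.

    from collections import deque

    def ancestors(nodes, directed):
        """Nodes with a directed path into `nodes`, including `nodes` themselves."""
        anc, stack = set(nodes), list(nodes)
        while stack:
            v = stack.pop()
            for a, b in directed:
                if b == v and a not in anc:
                    anc.add(a)
                    stack.append(a)
        return anc

    def m_connected(x, y, z, directed, bidirected):
        """True iff x and y are m-connected given the set z (Definition 2.1)."""
        anc_z = ancestors(z, directed)

        def steps(v):
            # yields (neighbor, arrowhead-at-v, arrowhead-at-neighbor)
            for a, b in directed:
                if a == v:
                    yield b, False, True    # v -> b: tail at v, arrowhead at b
                elif b == v:
                    yield a, True, False    # a -> v: arrowhead at v, tail at a
            for e in bidirected:
                if v in e:
                    (u,) = e - {v}
                    yield u, True, True     # arrowheads at both endpoints

        frontier = deque((u, head_u) for u, _, head_u in steps(x))
        seen = set(frontier)
        while frontier:
            v, entered_with_head = frontier.popleft()
            if v == y:
                return True
            for u, head_at_v, head_at_u in steps(v):
                collider = entered_with_head and head_at_v  # both marks at v are arrowheads
                passable = v in anc_z if collider else v not in z
                if passable and (u, head_at_u) not in seen:
                    seen.add((u, head_at_u))
                    frontier.append((u, head_at_u))
        return False

    # e.g., on the chain A -> B -> C: m-connected given {}, m-separated given {B}
    assert m_connected('A', 'C', set(), {('A', 'B'), ('B', 'C')}, set())
    assert not m_connected('A', 'C', {'B'}, {('A', 'B'), ('B', 'C')}, set())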
Markov equivalence classes of semi-Markov causal models do not have a simple characterization, because Markov equivalent SMCMs do not necessarily share the same edges: the absence of an edge in a SMCM does not necessarily correspond to an m-separation. Figure 1 shows an example of two SMCMs that encode the same m-separations but do not have the same edges (figure taken from [Triantafillou and Tsamardinos, 2015]).

Maximal ancestral graphs are also used to model causality and conditional independencies in causally insufficient systems. MAGs are mixed ancestral graphs, which means that they can have no directed or almost directed cycles. Every pair of variables X, Y in an ancestral graph is joined by at most one edge. The orientation of this edge represents (non-)causal ancestry. A directed edge X → Y denotes that X is an ancestor of Y, but the relation is not necessarily direct in the context of the modeled variables (see, for example, the edge A → D in MAG M1 of Figure 1). Moreover, X and Y may also be confounded (e.g., the edge B → D in MAG M1 of Figure 1). A bi-directed edge X ↔ Y denotes that X and Y are confounded.

[Figure 1: four rows of graphs over the nodes A, B, C, D — the DAGs G1 (which also includes the latent node L) and G2, the SMCMs S1 and S2, the MAGs M1 and M2, and the PAGs P1 and P2.]

Figure 1: An example of two different DAGs and the corresponding mixed causal graphs over the observed variables. From the top: DAGs G1 over variables {A, B, C, D, L} (left) and G2 over variables {A, B, C, D} (right); below each DAG, the respective SMCMs S1 and S2 over {A, B, C, D}; the respective MAGs M1 (the MAG of G1 with L marginalized out) and M2 = G2 over variables {A, B, C, D}; and, finally, the respective PAGs P1 and P2. Notice that M1 and M2 are identical, despite representing different underlying causal structures.

Like SMCMs, ancestral graphs encode the conditional independencies of a faithful distribution according to the criterion of m-separation. Maximal ancestral graphs are graphs in which every missing edge (non-adjacency) corresponds to a conditional independence. Every ancestral graph can be extended to a maximal ancestral graph by adding some bi-directed edges [Richardson and Spirtes, 2002]. Thus, Markov equivalence classes of maximal ancestral graphs share the same edges and unshielded colliders, as well as some additional shielded colliders, discussed in Zhang [2008], Ali et al. [2009]. Partial ancestral graphs (PAGs) are used to represent the Markov equivalence classes of MAGs.

Figure 1 illustrates some differences between SMCMs and MAGs that represent the same marginal of a DAG. For example, A is a causal ancestor of D in DAG G1, but not a direct cause (in the context of the observed variables). Therefore, the two are not adjacent in the corresponding SMCM S1 over {A, B, C, D}. However, the two cannot be rendered independent given any subset of {B, C}, and therefore A → D is in the respective MAG M1.

In the same DAG, B is another causal ancestor (but not a direct cause) of D. The two variables also share the common cause L. Thus, in the corresponding SMCM S1 over {A, B, C, D}, B ↔ D is present. However, a bi-directed edge between B and D is not allowed in MAG M1, since it would create an almost directed cycle. Thus, B → D is in M1.

Overall, a SMCM has a subset of the adjacencies of its MAG counterpart. The extra adjacencies in MAGs correspond to pairs of variables that cannot be m-separated given any subset of observed variables, even though neither directly causes the other and the two are not confounded. These adjacencies can be checked in a SMCM using a special type of path, called an inducing path [Richardson and Spirtes, 2002].
3 LEARNING THE STRUCTURE OF CAUSAL GRAPHS

As mentioned above, there are two main approaches for learning causal networks from data, constraint-based and score-based. Constraint-based methods estimate from the data which conditional independencies hold, using appropriate tests of conditional independence. Each conditional independence corresponds to an m(d)-separation constraint. Constraint-based algorithms try to eliminate all graphs that are inconsistent with the observed constraints, and ultimately return only the statistically equivalent graphs consistent with all the tests.

Notice that the number of possible conditional independencies is exponential in the number of variables. For graphs that are maximal, i.e., in which every missing edge corresponds to a conditional independence (DAGs and MAGs, but not SMCMs), there exist efficient procedures that can return the skeleton and invariant orientations of the Markov equivalence class of graphs consistent with a data set, using only a subset of the conditional independencies.

The PC algorithm [Spirtes et al., 2000] is a prototypical, asymptotically correct constraint-based algorithm that identifies the PDAG consistent with a data set. FCI [Spirtes et al., 2000, Zhang, 2008] is the first asymptotically correct constraint-based algorithm that identifies the PAG consistent with a data set. The algorithms work in two stages. The first stage is the skeleton identification stage: starting from the full graph, the algorithm tries to identify a conditional independence X ⊥⊥ Y | Z for each pair of variables X, Y. The corresponding edge is then removed, and the separating set Z is cached. The second stage is the orientation stage, where the cached conditioning sets are employed to orient the edges.
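As an illustration of the skeleton-identification stage, here is a minimal PC-style sketch (a simplification of ours: FCI additionally revisits edges using a possible-d-separating set, which is omitted here). The function ci_test(x, y, z) is a placeholder for any conditional-independence test, e.g. a Fisher-z partial-correlation test.

    from itertools import combinations

    def skeleton(variables, ci_test):
        """Stage 1 of PC/FCI: start from the complete graph and remove the
        edge X - Y once a separating set Z with X _||_ Y | Z is found;
        cache Z for the orientation stage."""
        adj = {v: set(variables) - {v} for v in variables}
        sepset = {}
        k = 0
        while any(len(adj[v]) > k for v in variables):
            for x, y in combinations(variables, 2):
                if y not in adj[x]:
                    continue
                for z in combinations(sorted(adj[x] - {y}), k):
                    if ci_test(x, y, set(z)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(z)
                        break
            k += 1
        return adj, sepset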
Given faithfulness, the subset of conditional independencies identified during the skeleton identification stage is sufficient to make all invariant orientations and return the PDAG or PAG that represents the Markov equivalence class of causal graphs consistent with all and only the cached conditional independencies. In practice, the orientation stage is sensitive to error propagation [Spirtes, 2010]. Conservative PC (CPC) [Ramsey et al., 2006] proposes a modification of PC that results in more robust orientations: the algorithm performs additional conditional independence tests during the orientation stage, and performs only a subset of robust orientations that are consistent with multiple conditional (in)dependencies. We use the term conservative FCI (CFCI) to describe FCI with the same conservative extension.

Score-based methods, on the other hand, search over the space of possible graphs, trying to maximize a score that reflects how well the graph fits the data. This score is typically related to the likelihood of the graph given the data, P(D|G). For multinomial and Gaussian parametrizations, respectively, BDe [Heckerman et al., 1995] and BGe [Geiger and Heckerman, 1994] are Bayesian scores that integrate over all possible parameters. These scores are employed in most DAG/PDAG-learning algorithms. Other criteria like BIC or MDL can be used to score a graph with specific parameters [Bouckaert, 1995].

The number of possible graphs is super-exponential in the number of variables. For DAGs, efficient search-and-score learning is based on the factorization in Equation 1. Thus, the likelihood of the graph can be decomposed into a product of individual likelihoods of each node given its parents. This can make greedy search very efficient: in each step of the search, all the graphs that can be produced by a single change to the current graph are considered. Using the score decomposition, one only needs to recompute the scores of the nodes that are affected by the change (i.e., whose set of parents changes). Unfortunately, the factorization presented in Equation 1 does not apply to probability distributions that are faithful to mixed graphs. This happens because variables connected with bi-directed edges (confounded variables) are no longer independent given their parents in the graph. Thus, SMCMs and MAGs do not admit a simple factorization in which each node has a single contribution to the likelihood.

However, the joint probability of a set of variables V according to an ADMG G can be factorized based on sets of variables called the c-components [Tian and Pearl, 2003], or districts [Richardson, 2009], of the graph. The c-components correspond to the connected components of the bi-directed part of G (denoted G↔), the graph stemming from G after the removal of all directed edges.

Parametrizations of the set of distributions obeying the conditional independence relations given by an ADMG are available for multivariate discrete [Richardson, 2009] and multivariate Gaussian distributions [Richardson and Spirtes, 2002, Drton et al., 2009]. Gaussian parametrizations for SMCMs are not always identifiable, but they have been shown to be almost everywhere identifiable for ADMGs without bows (pairs of variables sharing both a directed and a bi-directed edge, called bows in the structural equation model literature) [Brito and Pearl, 2002]. If, in addition, an ADMG has no almost directed cycles (i.e., it is ancestral), the parametrization is everywhere identifiable [Richardson and Spirtes, 2002].

4 GSMAG: GREEDY SEARCH FOR MAXIMAL ANCESTRAL GRAPHS

Let V = {V_i : i = 1, …, V} be a random vector of V variables following a multivariate normal distribution N(0, Σ) with positive definite covariance matrix Σ. Let G be a MAG. Then the graph G defines a system of linear equations:

    V_i = Σ_{j ∈ Pa_G(V_i)} β_ij V_j + ε_i,   i ∈ {1, …, V}    (2)

Let B(G) be the collection of all real V × V matrices B = (β_ij) such that (i) β_ij = 0 when j → i is not in G, and (ii) (I − B) is invertible. Let Ω(G) be the set of all V × V matrices Ω = (ω_ij) such that (i) Ω is positive definite and (ii) ω_ij = 0 if j ↔ i is not in G.

Then the system of linear equations (2) can be written as V = BV + ε, and for B ∈ B(G) and Cov(ε) = Ω ∈ Ω(G), it has a unique solution that is a multivariate normal vector with covariance matrix Σ = (I − B)⁻¹ Ω (I − B)⁻ᵀ, where the superscript −T denotes the inverse transpose. The family of distributions with covariance matrix in the set {Σ = (I − B)⁻¹ Ω (I − B)⁻ᵀ} is called the normal linear model associated with G (symb. N(G)). For MAGs, the normal linear model is everywhere identifiable.

Let D be a V × N matrix of observations for the variables V. Then the empirical covariance matrix is defined as

    S = (1/N) D Dᵀ.

For N ≥ V + 1, S is almost surely positive definite. For a MAG G, the log-likelihood of the model is

    l_G(B, Ω | S) = −(N/2) ( ln|2πΩ| + ((N−1)/N) tr[(I − B)ᵀ Ω⁻¹ (I − B) S] )    (3)
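A direct transcription of Equation 3 and of the implied covariance matrix may be useful for concreteness; this sketch evaluates the objective only — it is not the RICF fitting procedure introduced next.

    import numpy as np

    def implied_covariance(B, Omega):
        """Sigma = (I - B)^-1 Omega (I - B)^-T for the normal linear model N(G)."""
        A_inv = np.linalg.inv(np.eye(B.shape[0]) - B)
        return A_inv @ Omega @ A_inv.T

    def mag_loglik(B, Omega, S, N):
        """Gaussian MAG log-likelihood of Eq. (3) at parameters (B, Omega),
        given the empirical covariance S of N samples."""
        A = np.eye(B.shape[0]) - B
        _, logdet = np.linalg.slogdet(2 * np.pi * Omega)  # Omega is positive definite
        trace = np.trace(A.T @ np.linalg.inv(Omega) @ A @ S)
        return -0.5 * N * (logdet + (N - 1) / N * trace)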
Maximum likelihood estimates B̂, Ω̂ that maximize (3) can be found using the residual iterative conditional fitting (RICF) algorithm presented in Drton et al. [2009], and the corresponding implied covariance matrix is Σ̂ = (I − B̂)⁻¹ Ω̂ (I − B̂)⁻ᵀ.

Based on the factorization of MAGs presented in Richardson [2009], the likelihood can be decomposed according to the c-components of G as follows [Nowzohour et al., 2015]:

    l_G(Σ̂ | S) = −(N/2) Σ_k ( |C_k| ln(2π) + ln( |Σ_{G_k}| / ∏_{j ∈ Pa_{G_k}} σ²_{kj} ) + ((N−1)/N) ( tr[Σ_{G_k}⁻¹ S_{G_k}] − |Pa_G(C_k) \ C_k| ) )    (4)

where G_k is the graph consisting only of the nodes in C_k ∪ Pa_G(C_k), without any edges among the variables in Pa_G(C_k) \ C_k; the subscript G_k denotes the restriction of a matrix to the rows and columns participating in G_k, and σ²_kj denotes the diagonal entry of Σ_{G_k} corresponding to parent node j. The log-likelihood is now a sum of c-component scores.

The scoring function is typically the negative log-likelihood regularized by a penalty on the number of parameters, to avoid over-fitting. The BIC score for MAGs is:

    BIC(Σ̂, G) = −2 l_G(Σ̂ | S) + ln(N)(2V + E),    (5)

where l_G(Σ̂ | S) is the log-likelihood of the graph G with the MLE parameters B̂, Ω̂, and E is the number of edges in G. BIC is an asymptotically correct criterion for selecting among Gaussian ancestral graph models [Richardson and Spirtes, 2002].
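The districts that drive the decomposition in Equation 4 are simply the connected components of the bi-directed part G↔, and Equation 5 adds the parameter penalty to the summed component contributions. A minimal sketch of both (the per-component likelihood term is abstracted as component_logliks, e.g. values obtained from an RICF fit of each C_k ∪ Pa_G(C_k) subgraph):

    import math

    def c_components(nodes, bidirected):
        """Connected components of the bi-directed part of the graph
        (bidirected: a set of two-element frozensets)."""
        components, unvisited = [], set(nodes)
        while unvisited:
            comp, stack = set(), [unvisited.pop()]
            while stack:
                v = stack.pop()
                comp.add(v)
                for e in bidirected:
                    if v in e:
                        (u,) = e - {v}
                        if u in unvisited:
                            unvisited.discard(u)
                            stack.append(u)
            components.append(comp)
        return components

    def bic(component_logliks, N, n_vars, n_edges):
        """Eq. (5): -2 * summed c-component log-likelihoods + ln(N) penalty."""
        return -2 * sum(component_logliks) + math.log(N) * (2 * n_vars + n_edges)

    # a graph with A -> B, B <-> C and isolated D has c-components {A}, {B, C}, {D}
    print(c_components({'A', 'B', 'C', 'D'}, {frozenset(('B', 'C'))}))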
A simple greedy strategy starts from a MAG G with a score sc and then checks the local neighborhood (i.e., the graphs that stem from the current graph after a single edge change) for the lowest-scoring network. The algorithm continues this "hill-climbing" until no single edge change reduces the score.

Algorithm 1 describes this simple greedy search strategy for learning MAG structure. It begins with the empty graph, where each node is a separate component. At every subsequent step, every possible edge change is considered: for absent edges, the possible actions are addLeft, addRight and addBidirected. For directed edges, the possible actions are reverse, orientBidirected and remove. For bi-directed edges, the possible actions are orientLeft, orientRight and remove. The score decomposition described in Equation 4 is used to avoid re-fitting the entire MAG: only the likelihoods of the c-components affected by the change need to be re-estimated.

    Algorithm 1: GSMAG
    input : Data set D over V with N samples, tolerance tol
    output: MAG G, score sc
    S ← corr(D);
    G ← empty graph;
    C ← {{V} : V ∈ V};
    foreach C_k ∈ C do
        s_k ← scoreContrib(C_k, 1, N);
    end
    curScore ← −2 Σ_k s_k + ln(N)(2V + E);
    minScore ← curScore;
    repeat
        foreach pair (X, Y) ∈ V do
            foreach action in {addLeft, addRight, addBidirected, orientLeft,
                               orientRight, orientBidirected, remove, reverse} do
                if action is applicable and does not create directed or almost
                directed cycles then
                    (s′, C′, G′) ← updateScores(X, Y, action, s, C, G, tol, N);
                    curScore ← −2 Σ_k s′_k + ln(N)(2V + E);
                    if curScore < minScore then
                        (s, C, G) ← (s′, C′, G′);
                        minScore ← curScore;
                    end
                end
            end
        end
    until no action reduces curScore;
    sc ← curScore;
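Stripped of the incremental bookkeeping, the outer loop of Algorithm 1 is plain hill-climbing; the following compact sketch (ours) makes that explicit. Here neighbors_of is assumed to enumerate the applicable single-edge actions listed above, skipping those that would create directed or almost directed cycles, and score is assumed to evaluate the BIC of Equation 5 using RICF estimates.

    def gsmag(data, empty_graph, neighbors_of, score):
        """Greedy hill-climbing over MAGs: repeatedly apply the single-edge
        change that lowers the score the most; stop when none does."""
        current = empty_graph
        current_score = score(current, data)
        while True:
            best, best_score = None, current_score
            for candidate in neighbors_of(current):
                s = score(candidate, data)
                if s < best_score:
                    best, best_score = candidate, s
            if best is None:          # no single-edge change reduces the score
                return current, current_score
            current, current_score = best, best_score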
Only actions that do not create directed or almost directed cycles are attempted. To efficiently check for cycle creation, a matrix of the ancestral relationships¹ of the current MAG is cached. Edge removals can never create directed cycles. Using the cached ancestral matrix, it is straightforward to check whether the addition of a directed edge will create a directed cycle, or whether the addition of a bi-directed edge will create an almost directed cycle. Almost directed cycles can also be created when adding directed edges: for each edge X ↔ Y, adding the edge J → I will create an almost directed cycle if I is an ancestor of X (respectively Y) and Y (respectively X) is an ancestor of J.

¹ Changing an edge orientation is equivalent to removing the edge and then adding it re-oriented. To test for possible cycles efficiently, a matrix of all the non-trivial ancestral relationships (more than one variable on the path) is also cached. Reversing an edge X → Y creates a directed cycle only if X is a non-trivial ancestor of Y.

At the end of each iteration, the matrix of ancestral relationships is updated. If a previously missing edge is added, the update takes O(V²) time. If an edge is removed, the matrix is recomputed using Warshall's algorithm for transitive closure [Warshall, 1962].

The c-components are only updated when a bi-directed edge is added or altered in any way. When adding a bi-directed edge, the corresponding c-components of the endpoints are merged if they are separate. When an existing bi-directed edge is removed (completely, or by becoming directed), the corresponding c-component C_k is divided into the new connected components. The scores of the affected components are recomputed using new RICF estimates. In any other case, the c-components remain the same, and only the score of the c-component whose corresponding graph G_k is affected by the change is recomputed. This procedure is described in Algorithm 2.

    Algorithm 2: updateScores
    input : Pair X, Y, action action, c-components C, scores s, MAG G,
            covariance matrix S, tolerance tol, sample size N
    output: c-components C′, scores s′, MAG G′
    G′ ← action(X, Y, G);
    if action ∈ {orientBidirected, addBidirected} then
        m ← the index m such that X ∈ C_m;
        l ← the index l such that Y ∈ C_l;
        C_m ← C_m ∪ C_l;
        C′ ← C \ C_l;
        Σ̂_m ← RICF(G′_m, S_m, tol);
        s′_m ← scoreContrib(C_m, Σ̂_m, N);
    else if X ↔ Y in G then
        m ← the index m such that X, Y ∈ C_m;
        C_new ← connectedComponents(G′_m);
        C′ ← (C \ {C_m}) ∪ C_new;
        foreach C ∈ C_new do
            m ← index of C in C′;
            Σ̂_m ← RICF(G′_m, S_m, tol);
            s′_m ← scoreContrib(C_m, Σ̂_m, N);
        end
    else
        m ← the index m such that G′_m ≠ G_m;
        Σ̂_m ← RICF(G′_m, S_m, tol);
        s′_m ← scoreContrib(C_m, Σ̂_m, N);
    end

When no single-edge change improves the current score, the algorithm terminates and the current network is returned. Greedy hill-climbing procedures can get stuck in local optima (minima). To tackle this problem, they are often augmented with meta-heuristics such as random restarts, TABU lists or simulated annealing. For the scope of this work we use no such heuristic. In preliminary experiments, however, we found that augmenting Algorithm 1 with a TABU heuristic did not significantly improve performance.
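A sketch of the ancestral-matrix bookkeeping described above (illustrative, following the description rather than the released code): Warshall's transitive closure for the full recomputation after removals, and the O(V²) incremental update when a directed edge is added.

    import numpy as np

    def transitive_closure(adj):
        """Warshall's algorithm: anc[i, j] iff there is a directed path i -> ... -> j."""
        anc = adj.astype(bool).copy()
        for k in range(anc.shape[0]):
            anc |= np.outer(anc[:, k], anc[k, :])
        return anc

    def add_directed_edge(anc, x, y):
        """O(V^2) update after adding x -> y: x and its ancestors become
        ancestors of y and of all of y's descendants."""
        up = anc[:, x].copy();  up[x] = True      # x and its ancestors
        down = anc[y, :].copy(); down[y] = True   # y and its descendants
        anc |= np.outer(up, down)
        return anc

    # adding x -> y would create a directed cycle iff anc[y, x] already holds;
    # adding x <-> y would create an almost directed cycle iff anc[x, y] or anc[y, x]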
5 RELATED WORK

Several constraint-based algorithms exist for learning a Markov equivalence class of MAGs from an observational data set: FCI [Spirtes et al., 2000, Zhang, 2008] is a sound and complete algorithm that returns the complete PAG. RFCI [Colombo et al., 2012] and FCI+ [Claassen et al., 2013] are modifications of FCI that try to avoid the computationally expensive possible-d-separating stage in the skeleton search of FCI. Conservative FCI [Ramsey et al., 2006] is a modification of FCI that makes fewer, but more robust, orientations to avoid error propagation.

Nowzohour et al. [2015] propose a greedy search with random restarts for learning "bow-free" ADMGs from data and introduce the score decomposition shown in Equation 4. Since Markov equivalence for ADMGs that are not MAGs has not yet been characterized, they use a greedy strategy for obtaining the empirical Markov equivalence class, based on score similarity. The authors use the estimated ADMGs to compute causal effects and show promising results. However, since they do not necessarily find maximal ancestral graphs, they do not compare against constraint-based methods or evaluate the accuracy of the learnt graphs.

Marginalizing out variables from causal DAGs results in some additional equality constraints that are not conditional independencies. Nested Markov models [Shpitser et al., 2013] extend SMCMs and are used to also model these additional constraints. Shpitser et al. [2012] use a penalized likelihood score and a greedy search with a TABU list to identify a nested Markov model from discrete data.

6 COMPARISON OF GSMAG WITH FCI, CFCI

We compared the performance of Algorithm 1 against FCI and CFCI on simulated data. We simulated 100 random DAGs over 10, 20 and 50 variables. To control the sparseness of the DAGs, we set the maximum number of parents of each node. We present results for sparse networks, where each variable was allowed to have up to 3 parents, and denser networks, where each variable was allowed to have up to 5 parents. For each DAG, 10% of the variables were marginalized out (1, 2 and 5 variables, respectively). The ground truth PAG P_GT was then created for each marginal DAG. Data sets with 100, 1000 and 5000 samples were simulated for each DAG, with random parameters with absolute values in [0.1, 0.9]. The corresponding marginal data sets were input to Algorithm 1, FCI and CFCI. FCI and CFCI were run with a significance threshold of 0.05 and a maximum conditioning set size of 5. Algorithm 1 outputs a MAG; to compare the outputs of the algorithms, the corresponding PAG was created for each output MAG. We use P_FCI, P_CFCI and P_GS to denote the outputs of FCI, CFCI and Algorithm 1, respectively.

Summarizing PAG differences is not trivial, and many different approaches are used in the literature. As a general metric of how different two PAGs are, we use the structural Hamming distance (SHD) for PAGs, defined as follows: let P̂ be the output PAG and P be the ground truth PAG. For each change (edge addition, edge deletion, change of an arrowhead, change of a tail) required to transform P̂ into P, the SHD is increased by 1.
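Under an endpoint-matrix encoding of PAGs, the SHD above can be computed directly. The following sketch uses one reasonable reading of the definition (ours): 0 encodes "no edge" and distinct codes encode circle, arrowhead and tail marks; an insertion or deletion counts once, and otherwise each differing endpoint mark counts once.

    def pag_shd(P_hat, P):
        """Structural Hamming distance between two PAGs over the same nodes.
        Q[x][y] is the mark at y on the edge x - y (0 when x, y are not adjacent)."""
        nodes = sorted(P)
        shd = 0
        for i, x in enumerate(nodes):
            for y in nodes[i + 1:]:
                here = (P_hat[x][y], P_hat[y][x])
                there = (P[x][y], P[y][x])
                if here == there:
                    continue
                if here == (0, 0) or there == (0, 0):
                    shd += 1    # one edge addition or deletion
                else:
                    shd += sum(a != b for a, b in zip(here, there))  # mark changes
        return shd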
We also use precision and recall, as described in Tillman and Spirtes [2011]: precision is defined as the number of edges in the output PAG with the correct orientations, divided by the number of edges in the output PAG. Recall is defined as the number of edges in the output PAG with correct orientations, divided by the number of edges in the ground truth PAG. These metrics are very conservative, since they penalize even small differences. For example, an edge that is → in the ground truth but has a different edge type in the output PAG will be classified as a false positive.
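With the same endpoint-matrix encoding as in the SHD sketch above (and assuming adjacency is stored symmetrically), precision and recall reduce to counting output edges whose two endpoint marks both match the ground truth; a short illustrative sketch:

    def pag_edges(P):
        """Map each adjacent (ordered) pair x < y to its pair of endpoint marks."""
        return {(x, y): (P[x][y], P[y][x])
                for x in P for y in P if x < y and P[x][y] != 0}

    def pag_precision_recall(P_hat, P):
        est, truth = pag_edges(P_hat), pag_edges(P)
        correct = sum(marks == truth.get(pair) for pair, marks in est.items())
        precision = correct / len(est) if est else 1.0
        recall = correct / len(truth) if truth else 1.0
        return precision, recall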
Figures 2, 3, and 4 show the performance results for networks of 10, 20 and 50 variables, respectively. Mean values over 100 iterations are presented for all experiments. All algorithms perform better in sparser networks. Greedy search has larger structural Hamming distances than FCI and CFCI. More specifically, out of 900 cases (over all variable and sample sizes), FCI outperforms GSMAG in 657 cases for sparse networks and in 715 cases for dense networks, while CFCI outperforms GSMAG in 682 cases for sparse networks and in 710 cases for dense networks. In terms of precision, CFCI is again the best of the three (it outperforms GSMAG in 601 and 659 out of 900 cases for sparse and dense networks, respectively). FCI has the poorest precision: it outperforms GSMAG in only 318 and 221 cases for sparse and dense networks, respectively. Finally, GSMAG has the best recall of all the algorithms, with CFCI being second. Specifically, in terms of recall, FCI outperforms GSMAG in 139 and 83 cases, while CFCI outperforms GSMAG in 357 and 249 cases, for sparse and dense networks, respectively. Naturally, greedy search is much slower than both conservative and plain FCI.

[Figure 2 (plots): Performance of FCI, CFCI and GSMAG for networks with 9 observed variables, with (top) 3 maximum parents per variable and (bottom) 5 maximum parents per variable.]

[Figure 3 (plots): Performance of FCI, CFCI and GSMAG for networks with 18 observed variables, with (top) 3 maximum parents per variable and (bottom) 5 maximum parents per variable.]

[Figure 4 (plots): Performance of FCI, CFCI and GSMAG for networks with 45 observed variables, with (top) 3 maximum parents per variable and (bottom) 5 maximum parents per variable.]

Intriguingly, GSMAG's performance declines for the largest attempted sample size (5000 samples), particularly for larger networks. This happens because greedy search tends to include many false positive edges. It is possible that this is related to the naive greedy search, and could be improved by adding some heuristic for escaping local minima, or by adjusting the scoring criterion.

[Figure 5 (plots): Score divided by number of samples for FCI, GS and the ground truth network for sparse networks (3 maximum parents), for 9 (left), 18 (middle) and 45 (right) observed variables.]

Figure 6 shows the score of the output MAG for Algorithm 1 and the ground truth MAG. To compare also with FCI, we used the method presented in Zhang [2008] to obtain a MAG from P_FCI. Notice that this cannot be applied to the output of CFCI, since it is not a complete PAG (due to unfaithful triplets). Greedy search typically scores closer to the ground truth, particularly for denser networks.

[Figure 6 (plots): Score divided by number of samples for FCI, GS and the ground truth network for dense networks (5 maximum parents), for 9 (left), 18 (middle) and 45 (right) observed variables.]

7 FUTURE WORK

We present an implementation of a greedy search algorithm for learning MAGs from observational data, and compare it to FCI and CFCI. To the best of our knowledge, this is the first comparison of score-based and constraint-based search in the presence of confounders.

The algorithm uses the decomposition presented in Nowzohour et al. [2015] for bow-free SMCMs. Compared to SMCMs, MAGs are less expressive in terms of causal statements. However, since they have no almost directed cycles, fitting procedures for obtaining maximum likelihood estimates always converge. Semi-Markov causal models that are Markov equivalent to the output MAG could be identified as a post-processing step.

Heuristic procedures for escaping local minima could also be explored, to improve the performance of GSMAG. Algorithm efficiency could possibly be improved as well, by updating rather than recomputing inverse matrices where applicable.

Other interesting directions include taking weighted averages for specific PAG features, or using both constraint-based and score-based techniques for hybrid learning. Greedy search in the space of PAGs instead of MAGs could also be explored, since a transformational characterization of Markov equivalent MAGs exists [Zhang and Spirtes, 2005].

Acknowledgments

This work was funded by the European Research Council (ERC) and is part of the CAUSALPATH - Next Generation Causal Analysis project, No 617393.

References

RA Ali, TS Richardson, and P Spirtes. Markov equivalence for ancestral graphs. The Annals of Statistics, 37(5B):2808-2837, 2009.

RR Bouckaert. Bayesian belief networks: from construction to inference. PhD thesis, University of Utrecht, 1995.

C Brito and J Pearl. A new identification condition for recursive models with correlated errors. Structural Equation Modeling, 9(4):459-474, 2002.

T Claassen, JM Mooij, and T Heskes. Learning sparse causal models is not NP-hard. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013.

D Colombo, MH Maathuis, M Kalisch, and TS Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1):294-321, 2012.

M Drton, M Eichler, and TS Richardson. Computing maximum likelihood estimates in recursive linear models with correlated errors. Journal of Machine Learning Research, 10:2329-2348, 2009.

D Geiger and D Heckerman. Learning Gaussian networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, 1994.

D Heckerman, D Geiger, and DM Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243, 1995.

C Nowzohour, M Maathuis, and P Bühlmann. Structure learning with bow-free acyclic path diagrams. arXiv preprint arXiv:1508.01717, 2015.

J Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

J Ramsey, P Spirtes, and J Zhang. Adjacency faithfulness and conservative causal inference. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 2006.

TS Richardson. A factorization criterion for acyclic directed mixed graphs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

TS Richardson and P Spirtes. Ancestral graph Markov models. The Annals of Statistics, 30(4):962-1030, 2002.

I Shpitser, TS Richardson, JM Robins, and R Evans. Parameter and structure learning in nested Markov models. In UAI Workshop on Causal Structure Learning, 2012.

I Shpitser, R Evans, TS Richardson, and JM Robins. Sparse nested Markov models with log-linear parameters. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013.

P Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11:1643-1662, 2010.

P Spirtes, C Glymour, and R Scheines. Causation, Prediction, and Search. MIT Press, second edition, 2000.

J Tian and J Pearl. On the identification of causal effects. Technical Report R-290-L, UCLA Cognitive Systems Laboratory, 2003.

RE Tillman and P Spirtes. Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

S Triantafillou and I Tsamardinos. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16:2147-2205, 2015.

I Tsamardinos, LE Brown, and CF Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31-78, 2006.

S Warshall. A theorem on Boolean matrices. Journal of the ACM, 9(1):11-12, 1962.

J Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873-1896, 2008.

J Zhang and P Spirtes. A transformational characterization of Markov equivalence for directed acyclic graphs with latent variables. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pages 667-674, 2005.