<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Score-based vs constraint-based causal learning in the presence of confounders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sofia Triantafillou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Tsamardinos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Dept., University of Crete</institution>
          ,
          <addr-line>Voutes Campus, 700 13 Heraklion</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We compare score-based and constraint-based learning in the presence of latent confounders. We use a greedy search strategy to identify the best-fitting maximal ancestral graph (MAG) from continuous data, under the assumption of multivariate normality. Scoring maximal ancestral graphs is based on (a) residual iterative conditional fitting [Drton et al., 2009] for obtaining maximum likelihood estimates for the parameters of a given MAG and (b) factorization and score decomposition results for mixed causal graphs [Richardson, 2009, Nowzohour et al., 2015]. We compare the score-based approach in simulated settings with two standard constraint-based algorithms: FCI and conservative FCI. Results show promising performance for the greedy search algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Causal graphs can capture the probabilistic and causal
properties of multivariate distributions. Under the
assumptions of the causal Markov condition and faithfulness, the
graph induces a factorization of the joint probability
distribution, and a graphical criterion (d-separation) can be used
to identify all and only the conditional independencies that
hold in the joint probability distribution.</p>
      <p>
        The simplest case of a causal graph is a directed acyclic
graph (DAG). A causal DAG G and a faithful probability
distribution P constitute a causal Bayesian network (CBN)
[<xref ref-type="bibr" rid="ref10">Pearl, 2000</xref>]. Edges in the graph of a CBN have a
straightforward interpretation: A directed edge X → Y denotes
a causal relationship that is direct in the context of the
variables included in the DAG. In general, CBNs are
considered in settings where causal sufficiency holds, i.e. in the
absence of latent confounders. This is restrictive, since in
most cases we cannot, or do not, observe all variables that
participate in the causal mechanism of a multivariate system.
We consider a representation for an equivalence class
of models based on maximal ancestral graphs (MAGs)
[<xref ref-type="bibr" rid="ref13">Richardson and Spirtes, 2002</xref>]. MAGs are extensions of
CBNs that also consider latent confounders. Latent
confounders are represented with bi-directed edges. The set of
conditional independencies that hold in a faithful
probability distribution can be identified from the graph with the
graphical criterion of m-separation. The causal semantics
of edges in MAGs are more complicated: Directed edges
denote causal ancestry, but the relationship is not
necessarily direct. Bi-directed edges denote latent common causes.
However, each pair of variables can share only one edge,
and causal ancestry takes precedence over confounding: If X
is a causal ancestor of Y and the two are also confounded,
then X → Y is in the MAG. MAGs have several attractive
properties: They are closed under marginalization, and
every non-adjacency corresponds to a conditional
independence.
      </p>
      <p>
        There exist two main approaches for learning causal graphs
from data. Constraint-based approaches infer the
conditional independencies imprinted in the data and search
for a DAG/MAG that entails all (and only) these
independencies according to d/m-separation. Score-based
approaches try to find the graph G that maximizes the
likelihood of the data given G (or the posterior), according
to the factorization imposed by G. In general, a class of
causal graphs, called Markov equivalent, fit the
data equally well. Constraint-based approaches are more
efficient and output a single graph with clear semantics,
but give no indication of the relative confidence in the model.
Moreover, they have been shown to be sensitive to error
propagation [<xref ref-type="bibr" rid="ref16">Spirtes, 2010</xref>]. Score-based methods, on the
other hand, do not have this problem, and they also provide
a metric of confidence in the entire output model. Hybrid
methods that exploit the best of both worlds have
therefore proved successful in learning causal graphs from data
[<xref ref-type="bibr" rid="ref21">Tsamardinos et al., 2006</xref>].
      </p>
      <p>Numerous constraint-based and score-based algorithms
exist that learn causal DAGs (classes of Markov
equivalent DAGs) from data. Learning MAGs on the other
hand is typically done with constraint-based algorithms.
A score-based method for mixed causal graphs (not
necessarily MAGs) has recently been proposed [Nowzohour
et al., 2015], based on related factorization results [Tian
and Pearl, 2003, Richardson, 2009].</p>
      <p>
        Using these decomposition results, we implemented a
simple greedy search for learning MAGs from data. We
compare the results of this approach with the outputs of FCI
[<xref ref-type="bibr" rid="ref17">Spirtes et al., 2000</xref>, Zhang, 2008] and conservative FCI
[<xref ref-type="bibr" rid="ref11">Ramsey et al., 2006</xref>]. Greedy search performs slightly worse in
most settings in terms of structural Hamming distance, and
better than FCI in terms of precision and recall.
Based on these results, we believe that the score-based
approach can be used to improve learning causal graphs
in the presence of confounders. The algorithm
implementation and code for the detailed results are available at
https://github.com/striantafillou.
      </p>
      <p>The rest of the paper is organized as follows: Section 2
briefly reviews causal graphs with and without causal
sufficiency. Section 3 gives an overview of constraint-based
and score-based methods for DAGs and MAGs. Section
4 describes a greedy search algorithm for learning MAGs.
Related work is discussed in Section 5. Section 6 compares
the performance of the algorithm against FCI and CFCI.
Conclusions and future work are presented in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>CAUSAL GRAPHS</title>
      <p>We begin with some graphical notation: A mixed graph
(MG) is a collection of nodes (interchangeably, variables)
V, along with a collection of edges E. Edges can be
directed (X → Y) or bi-directed (X ↔ Y). A path is a
sequence of adjacent edges (without repetition). The first
and last node of a path are called its endpoints.
A bi-directed path is a path where every edge is bi-directed.
A directed path is a path where every edge is directed and
oriented in the same direction. We use X ⇢ Y to
symbolize a directed path from X to Y. A directed cycle occurs
when there exists a directed path X ⇢ X. An almost
directed cycle occurs when X ↔ Y and X ⇢ Y. A
triplet ⟨X, Y, Z⟩ of consecutive nodes on a path forms a
collider if X → Y ← Z. If X and Z are not adjacent, the
triplet is an unshielded collider.</p>
      <p>
        A mixed graph is called ancestral if it has no directed or
almost directed cycles. An ancestral graph without
bi-directed edges is a DAG. X is a parent of Y in a MG G
if X → Y is in G. We use the notation Pa_G(X), An_G(X) to
denote the set of parents and ancestors of X in G, respectively.
Under causal sufficiency, DAGs can be used to model
causal relationships: For a graph G over a set of variables
V, X → Y is in G if X causes Y directly (no variables in
V mediate this relationship). Under the causal Markov
condition and faithfulness [<xref ref-type="bibr" rid="ref10">Pearl, 2000</xref>], G is connected to
the joint probability distribution P over V through the
criterion of d-separation (defined below). Equivalently, the
causal Markov condition imposes a simple factorization of
the joint probability distribution:
      </p>
      <p>P(V) = ∏_{V ∈ V} P(V | Pa_G(V))   (1)</p>
      <p>Thus, the parameters of the joint probability distribution
describe the probability density function of each variable
given its parents in the graph. An interesting property of
CBNs, which constitutes the basis of constraint-based
learning, is the following: Every missing edge in the DAG of a
CBN corresponds to a conditional independence. Hence, if
X is independent of Y given Z (symb. X ⊥ Y | Z) in P,
then X and Y are not adjacent in G.</p>
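      <p>To make the decomposability concrete, the following sketch (Python with numpy; the helper names are ours and the data are assumed linear Gaussian) scores a DAG as a sum of per-node Gaussian log-likelihoods, one term per factor of Equation (1):</p>
      <preformat><![CDATA[
import numpy as np

def node_loglik(data, i, parents):
    """Gaussian log likelihood of node i given its parents, with ML
    (OLS) estimates of regression coefficients and residual variance."""
    y = data[:, i]
    n = len(y)
    X = np.column_stack([np.ones(n)] + [data[:, j] for j in parents])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                 # ML variance estimate
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def dag_loglik(data, parent_sets):
    """Decomposable score: one contribution per node (Equation 1).
    parent_sets maps a node index to the list of its parents,
    e.g. {0: [], 1: [0], 2: [1]} for the chain 0 -> 1 -> 2."""
    return sum(node_loglik(data, i, pa) for i, pa in parent_sets.items())
]]></preformat>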
      <p>In general, a class of Markov equivalent DAGs fit the data
equally well. DAGs in a Markov equivalence class share the
same skeleton and unshielded colliders. A pattern DAG
(PDAG) can be used to represent the Markov equivalence
class of DAGs: It has the same edges as every DAG in
the Markov equivalence class, and the orientations that are
shared by all DAGs in the Markov equivalence class.
Confounded relationships cannot be represented in DAGs,
and mixed causal graphs were introduced to tackle this
problem. The most straightforward approach is with
semi-Markov causal models (SMCMs) [<xref ref-type="bibr" rid="ref18">Tian and Pearl, 2003</xref>].
The graphs of semi-Markov causal models are acyclic
directed mixed graphs (ADMGs). Bi-directed edges are used
to denote confounded variables, and directed edges denote
direct causation. A pair of variables can share up to two
edges (one directed, one bi-directed). The conditional
independencies that hold in a faithful distribution are
represented through the criterion of m-separation:</p>
      <sec id="sec-2-1">
        <title>Definition 2.1 (m-connection, m-separation)</title>
        <p>In a mixed graph G = (V, E), a path p between A and B is
m-connecting given (conditioned on) a set of nodes Z ⊆
V \ {A, B} if
1. every non-collider on p is not a member of Z, and
2. every collider on p is an ancestor of some
member of Z.</p>
        <p>A and B are said to be m-separated by Z if there is no
m-connecting path between A and B relative to Z.
Otherwise, they are said to be m-connected given Z. For graphs
without bi-directed edges, m-separation reduces to the
d-separation criterion.</p>
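        <p>For illustration, m-connection can be checked naively by enumerating simple paths and testing the two conditions of Definition 2.1 directly. The following Python sketch (with a toy mixed-graph representation of our own choosing, practical only for small graphs) does exactly this:</p>
        <preformat><![CDATA[
from itertools import chain

def ancestors(node, directed):
    """node plus all its ancestors via directed (tail, head) edges."""
    anc, stack = {node}, [node]
    while stack:
        cur = stack.pop()
        for t, h in directed:
            if h == cur and t not in anc:
                anc.add(t); stack.append(t)
    return anc

def arrow_at(x, y, directed, bidirected):
    """True iff the edge between x and y has an arrowhead at y."""
    return (x, y) in directed or frozenset((x, y)) in bidirected

def neighbors(x, directed, bidirected):
    nb = {h for t, h in directed if t == x} | {t for t, h in directed if h == x}
    nb |= set(chain.from_iterable(e for e in bidirected if x in e)) - {x}
    return nb

def m_connected(a, b, Z, directed, bidirected):
    """Depth-first search for an m-connecting path between a and b given Z."""
    anc_Z = set().union(*(ancestors(z, directed) for z in Z)) if Z else set()

    def dfs(path):
        cur = path[-1]
        if cur == b:
            return True
        for nxt in neighbors(cur, directed, bidirected) - set(path):
            if len(path) >= 2:
                prev = path[-2]
                collider = (arrow_at(prev, cur, directed, bidirected)
                            and arrow_at(nxt, cur, directed, bidirected))
                if collider and cur not in anc_Z:
                    continue   # a collider must be an ancestor of Z
                if not collider and cur in Z:
                    continue   # a non-collider must not be in Z
            if dfs(path + [nxt]):
                return True
        return False

    return dfs([a])

# Toy MAG: A -> B, B -> D, B <-> C.
directed, bidirected = {('A', 'B'), ('B', 'D')}, {frozenset(('B', 'C'))}
print(m_connected('A', 'D', set(), directed, bidirected))   # True
print(m_connected('A', 'D', {'B'}, directed, bidirected))   # False
]]></preformat>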
        <p>
          Markov equivalence classes of semi-Markov causal
models do not have a simple characterization, because Markov
equivalent SMCMs do not necessarily share the same
edges: The absence of an edge in a SMCM does not necessarily
correspond to an m-separation. Figure 1 shows an example
of two SMCMs that encode the same m-separations but do
not have the same edges
          <xref ref-type="bibr" rid="ref20">(figure taken from [Triantafillou
and Tsamardinos, 2015])</xref>
          .
        </p>
        <p>Maximal ancestral graphs are also used to model causality
and conditional independencies in causally insufficient
systems. MAGs are ancestral mixed graphs, which means that
they have no directed or almost directed cycles. Every
pair of variables X, Y in an ancestral graph is joined by
at most one edge. The orientation of this edge represents
(non) causal ancestry. A directed edge X → Y denotes that
X is an ancestor of Y, but the relation is not necessarily
direct in the context of the modeled variables (see for example
edge A → D in MAG M1 of Figure 1). Moreover, X and
Y may also be confounded (e.g. edge B → D in MAG
M1 of Figure 1). A bi-directed edge X ↔ Y denotes that
X and Y are confounded.</p>
        <p>
          Like SMCMs, ancestral graphs encode the conditional
independencies of a faithful distribution according to the
criterion of m-separation. Maximal ancestral graphs are
graphs in which every missing edge (non-adjacency)
corresponds to a conditional independence. Every ancestral
graph can be extended to a maximal ancestral graph by
adding some bi-directed edges [<xref ref-type="bibr" rid="ref13">Richardson and Spirtes,
2002</xref>]. Thus, Markov equivalence classes of maximal
ancestral graphs share the same edges and unshielded
colliders, as well as some additional shielded colliders, discussed
in Zhang [2008], Ali et al. [2009]. Partial ancestral
graphs (PAGs) are used to represent the Markov
equivalence classes of MAGs.
        </p>
        <p>Figure 1 illustrates some differences between SMCMs and MAGs
that represent the same marginal of a DAG. For example,
A is a causal ancestor of D in DAG G1, but not a direct
cause (in the context of the observed variables). Therefore, the
two are not adjacent in the corresponding SMCM S1 over
{A, B, C, D}. However, the two cannot be rendered
independent given any subset of {B, C}, and therefore A → D
is in the respective MAG M1.</p>
        <p>In the same DAG, B is another causal ancestor (but not
a direct cause) of D. The two variables also share the
common cause L. Thus, in the corresponding SMCM S1 over
{A, B, C, D}, B ↔ D is present. However, a bi-directed
edge between B and D is not allowed in MAG M1, since
it would create an almost directed cycle. Thus, B → D is
in M1.</p>
        <p>
          Overall, a SMCM has a subset of the adjacencies of its
MAG counterpart. The extra adjacencies in MAGs
correspond to pairs of variables that cannot be m-separated given
any subset of observed variables, even though neither directly causes
the other and the two are not confounded. These
adjacencies can be identified in a SMCM using a special type of
path, called an inducing path [<xref ref-type="bibr" rid="ref13">Richardson and Spirtes, 2002</xref>].
        </p>
        <p>[Figure 1: the SMCMs S1, S2, MAGs M1, M2, and PAGs P1, P2 over the variables A, B, C, D.]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>LEARNING THE STRUCTURE OF CAUSAL GRAPHS</title>
      <p>As mentioned above, there are two main approaches for
learning causal networks from data, constraint-based and
score-based. Constraint-based methods estimate from the
data which conditional independencies hold, using
appropriate tests of conditional independence. Each
conditional independence corresponds to an m(d)-separation
constraint. Constraint-based algorithms try to eliminate
all graphs that are inconsistent with the observed
constraints, and ultimately return only the statistically
equivalent graphs consistent with all the tests.</p>
      <p>
        Notice that the number of possible conditional
independencies is exponential in the number of variables. For graphs
that are maximal, i.e. where every missing edge corresponds to a
conditional independence (DAGs and MAGs but not
SMCMs), there exist efficient procedures that can return the
skeleton and invariant orientations of the Markov
equivalence class of graphs consistent with a data set,
using only a subset of the conditional independencies.
The PC algorithm [<xref ref-type="bibr" rid="ref17">Spirtes et al., 2000</xref>] is a
prototypical, asymptotically correct constraint-based algorithm that
identifies the PDAG consistent with a data set. FCI [<xref ref-type="bibr" rid="ref17">Spirtes
et al., 2000</xref>, Zhang, 2008] is the first asymptotically correct
constraint-based algorithm that identifies the PAG
consistent with a data set. Both algorithms work in two stages.
The first stage is the skeleton identification stage: Starting
from the full graph, the algorithm tries to identify a
conditional independence X ⊥ Y | Z for each pair of variables
X, Y. The corresponding edge is then removed, and the
separating set Z is cached. The second stage is the
orientation stage, where the cached conditioning sets are
employed to orient the edges.
      </p>
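      <p>A minimal sketch of the skeleton identification stage (in Python, with a user-supplied conditional independence test; FCI's additional possible-d-separating stage is omitted here) could look as follows:</p>
      <preformat><![CDATA[
from itertools import combinations

def skeleton_stage(variables, ci_test, max_k=5):
    """Start from the complete graph and remove the edge X - Y whenever
    some subset Z of the current neighbors of X renders X and Y
    conditionally independent; the separating set Z is cached.
    ci_test(x, y, z) is a user-supplied conditional independence test."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    for k in range(max_k + 1):            # grow conditioning set size
        for x in variables:
            for y in list(adj[x]):
                for z in combinations(adj[x] - {y}, k):
                    if ci_test(x, y, set(z)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(z)
                        break
    return adj, sepset
]]></preformat>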
      <p>
        Given faithfulness, the subset of conditional
independencies identified during the skeleton
identification stage is sufficient to make all invariant orientations and
return the PDAG or PAG that represents the Markov
equivalence class of causal graphs consistent with all
and only the cached conditional independencies. In
practice, the orientation stage is sensitive to error propagation
[<xref ref-type="bibr" rid="ref16">Spirtes, 2010</xref>]. Conservative PC (CPC) [<xref ref-type="bibr" rid="ref11">Ramsey et al.,
2006</xref>] proposes a modification of PC that results in more
robust orientations: The algorithm performs additional
conditional independence tests during the orientation stage, and
performs only the subset of robust orientations that are
consistent with multiple conditional (in)dependencies. We use
the term conservative FCI (CFCI) to describe FCI with the
same conservative extension.
      </p>
      <p>
        Score-based methods, on the other hand, search over the
space of possible graphs trying to maximize a score that
reflects how well the graph fits the data. This score is
typically related to the likelihood of the graph given the
data, P(D|G). For multinomial and Gaussian
parametrizations, respectively, BDe [<xref ref-type="bibr" rid="ref8">Heckerman et al., 1995</xref>] and
BGe [<xref ref-type="bibr" rid="ref7">Geiger and Heckerman, 1994</xref>] are Bayesian scores
that integrate over all possible parameters. These scores are
employed in most DAG/PDAG-learning algorithms. Other
criteria like BIC or MDL can be used to score a graph with
specific parameters [<xref ref-type="bibr" rid="ref2">Bouckaert, 1995</xref>].
      </p>
      <p>
        The number of possible graphs is super-exponential in the
number of variables. For DAGs, efficient search-and-score
learning is based on the factorization in Equation 1: the
likelihood of the graph decomposes into a
product of individual likelihoods of each node given its parents.
This can make greedy search very efficient: In each step
of the search, all the graphs that result from single changes
to the current graph are considered. Using the score
decomposition, one only needs to recompute the scores of
nodes that are affected by the change (i.e. whose set of
parents changes). Unfortunately, the factorization presented
in Equation 1 does not apply to probability distributions
that are faithful to mixed graphs. This happens because
variables connected with bi-directed edges (confounded)
are no longer independent given their parents in the graph.
Thus, SMCMs and MAGs do not admit a simple
factorization where each node has a single contribution to the
likelihood. However, the joint probability of a set of variables
V according to an ADMG G can be factorized based on
sets of variables, called the c-components [<xref ref-type="bibr" rid="ref18">Tian and Pearl,
2003</xref>], or districts [<xref ref-type="bibr" rid="ref12">Richardson, 2009</xref>], of the graph. The
c-components correspond to the connected components of
the bi-directed part of G (denoted G_↔), the graph stemming
from G after the removal of all directed edges.
      </p>
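      <p>As an illustration, the c-components of a mixed graph are just the connected components over its bi-directed edges; a small Python sketch (with a toy edge representation of our own choosing):</p>
      <preformat><![CDATA[
def c_components(nodes, bidirected):
    """Connected components of the bi-directed part of a mixed graph
    (all directed edges removed), i.e. its c-components/districts."""
    components, seen = [], set()
    for v in nodes:
        if v in seen:
            continue
        stack, comp = [v], {v}
        while stack:
            cur = stack.pop()
            for e in bidirected:               # e.g. frozenset({'B', 'D'})
                if cur in e:
                    (other,) = e - {cur}
                    if other not in comp:
                        comp.add(other)
                        stack.append(other)
        seen |= comp
        components.append(comp)
    return components

# c_components(['A', 'B', 'C', 'D'], [frozenset(('B', 'D'))])
# -> [{'A'}, {'B', 'D'}, {'C'}]
]]></preformat>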
      <p>
        Parametrizations of the set of distributions obeying the
conditional independence relations given by an ADMG
are available for multivariate discrete [<xref ref-type="bibr" rid="ref12">Richardson, 2009</xref>]
and multivariate Gaussian distributions [<xref ref-type="bibr" rid="ref13">Richardson and
Spirtes, 2002</xref>, Drton et al., 2009]. Gaussian
parametrizations for SMCMs are not always identifiable, but they have
been shown to be almost everywhere identifiable for
ADMGs without bows, i.e. pairs of variables connected by both a
directed and a bi-directed edge (the term comes from the structural
equation model literature) [<xref ref-type="bibr" rid="ref3">Brito and Pearl, 2002</xref>]. If, in
addition, an ADMG has no almost directed cycles (i.e.
is ancestral), the parametrization is everywhere identifiable
[<xref ref-type="bibr" rid="ref13">Richardson and Spirtes, 2002</xref>].
      </p>
    </sec>
    <sec id="sec-5">
      <title>GSMAG: GREEDY SEARCH FOR MAXIMAL ANCESTRAL GRAPHS</title>
      <p>Let V = {V_i : i = 1, ..., V} be a random vector of V
variables following a multivariate normal distribution N(0, Σ)
with positive definite covariance matrix Σ, and let G be a
MAG. Then the graph G defines a system of linear equations:</p>
      <p>V_i = ∑_{j ∈ Pa_G(V_i)} β_ij V_j + ε_i,   i ∈ {1, ..., V}   (2)</p>
      <p>Let B(G) be the collection of all real V × V matrices B =
(β_ij) such that (i) β_ij = 0 when j → i is not in G, and (ii)
(I − B) is invertible. Let Ω(G) be the set of all V × V matrices
Ω = (ω_ij) such that (i) Ω is positive definite and (ii) ω_ij =
0 if j ↔ i is not in G.</p>
      <p>Then the system of linear equations (2) can be written as
V = BV + ε, and for B ∈ B(G), Cov(ε) = Ω ∈ Ω(G)
it has a unique solution that is a multivariate normal
vector with covariance matrix Σ = (I − B)^{-1} Ω (I − B)^{-T},
where the superscript −T denotes the transposed inverse.
The family of distributions with covariance matrix in the
set {Σ = (I − B)^{-1} Ω (I − B)^{-T}} is called the normal
linear model associated with G (symb. N(G)). For MAGs,
the normal linear model is everywhere identifiable.
Let D be a V × N matrix of observations for the variables
V. Then the empirical covariance matrix is defined as
S = (1/N) D D^T. For N ≥ V + 1, S is almost surely positive
definite. For a MAG G, the log likelihood of the model is</p>
      <p>l_G(B, Ω | S) = −(N/2) ln|2πΣ| − (N/2) tr[(I − B)^T Ω^{-1} (I − B) S]   (3)</p>
      <p>
        We write l_G(Σ̂ | S) for the likelihood of the graph G with the
MLE parameters B̂, Ω̂. BIC is an asymptotically
correct criterion for selecting among Gaussian ancestral graph
models [<xref ref-type="bibr" rid="ref13">Richardson and Spirtes, 2002</xref>].
      </p>
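      <p>The following sketch (Python with numpy; the parameter values and helper names are ours, for illustration only) builds the implied covariance matrix Σ = (I − B)^{-1} Ω (I − B)^{-T}, samples data from the normal linear model, and evaluates the log likelihood of Equation (3):</p>
      <preformat><![CDATA[
import numpy as np

def implied_covariance(B, Omega):
    """Sigma = (I - B)^{-1} Omega (I - B)^{-T} of the normal linear model."""
    IB_inv = np.linalg.inv(np.eye(B.shape[0]) - B)
    return IB_inv @ Omega @ IB_inv.T

def log_likelihood(B, Omega, S, N):
    """Equation (3): l_G(B, Omega | S)."""
    IB = np.eye(B.shape[0]) - B
    _, logdet = np.linalg.slogdet(2 * np.pi * implied_covariance(B, Omega))
    return -0.5 * N * logdet \
           - 0.5 * N * np.trace(IB.T @ np.linalg.inv(Omega) @ IB @ S)

# Toy MAG over (V1, V2) with the single edge V1 -> V2 (beta_21 = 0.8).
B = np.array([[0.0, 0.0],
              [0.8, 0.0]])
Omega = np.diag([1.0, 0.5])        # no bi-directed edges in this toy graph
rng = np.random.default_rng(0)
D = rng.multivariate_normal(np.zeros(2), implied_covariance(B, Omega),
                            size=1000)           # rows are samples here
S = D.T @ D / len(D)               # empirical covariance matrix
print(log_likelihood(B, Omega, S, N=len(D)))
]]></preformat>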
      <p>A simple greedy strategy starts from a MAG G with a score
sc and then checks the local neighborhood (i.e. the graphs
that stem from the current graph after making a single edge
change) for the lowest-scoring network. The algorithm
continues this “hill-climbing” until no single edge change
reduces the score.</p>
      <p>Algorithm 1 begins with the empty graph, where each node
is its own component. At every subsequent step, every possible
edge change is considered: For absent edges, the possible
actions are addLeft, addRight, addBidirected. For directed
edges, the possible actions are reverse, orientBidirected,
remove. For bi-directed edges, the possible actions are
orientLeft, orientRight, remove.</p>
      <p>[Algorithm 1 pseudocode: single-edge changes are applied greedily until no action reduces curScore; sc is set to curScore on termination.]</p>
      <sec id="sec-6-1">
        <title>Algorithm 1: GSMAG</title>
        <p>
          Maximum likelihood estimates B̂, Ω̂ that maximize (3)
can be found using the residual iterative conditional
fitting (RICF) algorithm presented in <xref ref-type="bibr" rid="ref6">Drton et al. [2009</xref>],
and the corresponding implied covariance matrix is Σ̂ =
(I − B̂)^{-1} Ω̂ (I − B̂)^{-T}.
        </p>
        <p>
          Based on the factorization of MAGs presented in
          <xref ref-type="bibr" rid="ref12">Richardson [2009</xref>
          ], the likelihood can be decomposed according to
the c-components of G as follows [Nowzohour et al., 2015]:
        </p>
        <p>l_G(Σ̂ | S) = −(N/2) ∑_k [ |C_k| ln(2π) + ln(|Σ̂_{G_k}| / ∏_{j ∈ Pa_G(C_k)\C_k} σ̂²_{kj}) + tr[Σ̂_{G_k}^{-1} S_{G_k}] − |Pa_G(C_k) \ C_k| ]   (4)</p>
        <p>where G_k is the graph consisting only of the nodes in C_k ∪
Pa_G(C_k), without any edges among the variables in Pa_G(C_k) \
C_k; the subscript G_k denotes the restriction of a matrix to
the rows and columns participating in G_k, and σ̂²_{kj} denotes
the diagonal entry of Σ̂_{G_k} corresponding to parent node j.
The log likelihood is now a sum of c-component scores.
The scoring function is typically the negative log likelihood,
penalized for model complexity as in BIC.</p>
        <p>The score decomposition described in Equation 4 is used to
avoid re-fitting the entire MAG after each change: only the likelihood
of the c-components affected by the change needs to be
re-estimated. Algorithm 1 describes this simple greedy search
strategy for learning the MAG structure.</p>
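        <p>In outline, the search loop looks as follows (a Python sketch; empty_mag, score and neighbors are hypothetical helpers standing in for the MAG data structure, the penalized likelihood score of Equation 4, and the generator of valid single-edge changes):</p>
        <preformat><![CDATA[
def gsmag(data, score, neighbors):
    """Greedy hill-climbing over MAGs, in the spirit of Algorithm 1.
    score(mag) returns a penalized negative log likelihood (lower is
    better); neighbors(mag) yields every MAG reachable by one valid
    single-edge change (addition, removal, reorientation)."""
    mag = empty_mag(data)              # hypothetical constructor
    cur = score(mag)
    while True:
        best, best_score = None, cur
        for cand in neighbors(mag):
            s = score(cand)            # in practice, only the affected
            if s < best_score:         # c-components are re-scored
                best, best_score = cand, s
        if best is None:               # no action reduces curScore
            return mag, cur
        mag, cur = best, best_score
]]></preformat>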
        <p>Only actions that do not create directed or almost directed
cycles are attempted. To check efficiently for cycle
creation, a matrix of the ancestral relationships1 of the current
MAG is cached. Edge removals can never create directed
cycles. Using the cached ancestral matrix, it is
straightforward to check whether the addition of a directed edge will
create a directed cycle, or whether the addition of a bi-directed
edge will create an almost directed cycle. Almost directed
cycles can also be created when adding directed edges: For
each edge X ↔ Y, adding the edge J → I creates an
almost directed cycle if I is an ancestor of X (Y) and Y (X) is an
ancestor of J.</p>
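        <p>A sketch of these checks, assuming a cached boolean ancestral matrix anc with anc[x][y] true iff x is an ancestor of y (the representation is ours, for illustration):</p>
        <preformat><![CDATA[
def directed_edge_creates_cycle(anc, x, y):
    """Adding x -> y creates a directed cycle iff y is already an
    ancestor of x."""
    return anc[y][x]

def bidirected_edge_creates_cycle(anc, x, y):
    """Adding x <-> y creates an almost directed cycle iff one endpoint
    is an ancestor of the other."""
    return anc[x][y] or anc[y][x]

def directed_edge_creates_almost_cycle(anc, x, y, bidirected):
    """Adding x -> y creates an almost directed cycle if, for some
    bi-directed edge a <-> b, there are directed paths a 99K x (or a = x)
    and y 99K b (or y = b)."""
    for e in bidirected:
        i, j = tuple(e)
        for a, b in ((i, j), (j, i)):
            if (a == x or anc[a][x]) and (y == b or anc[y][b]):
                return True
    return False
]]></preformat>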
        <p>
          At the end of each iteration, the matrix of ancestral
relationships is updated. If a previously missing edge is added,
the update takes O(V^2) time. If an edge is removed, the
matrix is recomputed using Warshall's algorithm for
transitive closure [<xref ref-type="bibr" rid="ref22">Warshall, 1962</xref>].
        </p>
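        <p>For reference, a minimal implementation of Warshall's transitive closure over the directed part of the graph:</p>
        <preformat><![CDATA[
def transitive_closure(adj):
    """Warshall's algorithm. adj[i][j] is True iff i -> j is an edge;
    the returned matrix anc has anc[i][j] True iff a directed path
    i 99K j exists."""
    n = len(adj)
    anc = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            if anc[i][k]:
                for j in range(n):
                    if anc[k][j]:
                        anc[i][j] = True
    return anc
]]></preformat>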
        <p>The c-components are only updated when a bi-directed
edge is added or altered in some way. When a bi-directed edge
is added, the c-components of its
endpoints are merged if they are separate. When an existing bi-directed
edge is removed (completely, or by becoming directed), the
corresponding c-component C_k is divided into the new
connected components. The scores of the affected components
are recomputed using new RICF estimates. In any other
case, the c-components remain the same, and only the score of
the c-component whose corresponding graph G_k is affected
by the change is recomputed. This procedure is described
in Algorithm 2.</p>
        <p>When no single-edge change improves the current score,
the algorithm terminates and the current network is
returned. Greedy hill-climbing procedures can get stuck in
local optima (here, minima). To tackle this problem, they are often
augmented with meta-heuristics such as random restarts,
TABU lists or simulated annealing. For the scope of this
work we use no such heuristic. In preliminary experiments,
however, we found that augmenting Algorithm 1 with a
TABU heuristic did not significantly improve performance.</p>
        <p>1Changing an edge orientation is equivalent to removing the
edge and then adding it re-oriented. To test for possible cycles
efficiently, a matrix of all the non-trivial ancestral relationships
(more than one variable on the path) is also cached. Reversing an
edge X → Y creates a directed cycle only if X is a non-trivial
ancestor of Y.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>RELATED WORK</title>
      <p>
        Several constraint-based algorithms exist for learning a
Markov equivalence class of MAGs from an observational
data set: FCI [<xref ref-type="bibr" rid="ref17">Spirtes et al., 2000</xref>, Zhang, 2008] is a sound
and complete algorithm that returns the complete PAG.
RFCI [<xref ref-type="bibr" rid="ref5">Colombo et al., 2012</xref>] and FCI+ [<xref ref-type="bibr" rid="ref4">Claassen et al.,
2013</xref>] are modifications of FCI that try to avoid the
computationally expensive possible-d-separating stage in the
skeleton search of FCI. Conservative FCI [<xref ref-type="bibr" rid="ref11">Ramsey et al.,
2006</xref>] is a modification of FCI that makes fewer, but more
robust, orientations to avoid error propagation.
      </p>
      <p>Nowzohour et al. [2015] propose a greedy search with
random restarts for learning “bow-free” ADMGs from data
and introduce the score decomposition shown in
Equation 4. Since Markov equivalence for ADMGs that are not
MAGs has not yet been characterized, they use a greedy
strategy for obtaining the empirical Markov equivalence
class, based on score similarity. The authors use the
estimated ADMGs to compute causal effects and show
promising results. However, since they do not necessarily find
maximal ancestral graphs, they do not compare against
constraint-based methods or evaluate the accuracy of the
learnt graphs.</p>
      <p>
        Marginalizing out variables from causal DAGs results in
some additional equality constraints that are not
conditional independencies. Nested Markov models [<xref ref-type="bibr" rid="ref15">Shpitser
et al., 2013</xref>] extend SMCMs and are used to also model
these additional constraints. <xref ref-type="bibr" rid="ref14">Shpitser et al. [2012</xref>] use a
penalized likelihood score and a greedy search with a TABU
list to identify a nested Markov model from discrete data.
      </p>
    </sec>
    <sec id="sec-8">
      <title>COMPARISON OF GSMAG WITH FCI, CFCI</title>
      <p>We compared the performance of Algorithm 1 against FCI
and CFCI on simulated data. We simulated 100 random
DAGs over 10, 20 and 50 variables. To control the
sparseness of the DAGs, we set the maximum number of parents per
node. We present results for sparse networks, where each
variable was allowed to have up to 3 parents, and denser
networks, where each variable was allowed to have up
to 5 parents. For each DAG, 10% of the variables were
marginalized out (1, 2 and 5 variables, respectively). The
ground truth PAG P_GT was then created for each marginal
DAG.</p>
      <p>Data sets with 100, 1000 and 5000 samples were simulated
for each DAG, with random parameters with absolute values
in {0.1, 0.9}. The corresponding marginal data sets were
given as input to Algorithm 1, FCI and CFCI. FCI and CFCI were
run with a significance threshold of 0.05 and a maximum
conditioning set size of 5. Algorithm 1 outputs a MAG. To
compare the outputs of the algorithms, the corresponding PAG
was created for each output MAG. We use P_FCI, P_CFCI
and P_GS to denote the outputs of FCI, CFCI and Algorithm
1, respectively.</p>
      <p>Summarizing PAG differences is not trivial, and many
different approaches are used in the literature. As a general
metric of how different two PAGs are, we use the structural
Hamming distance (SHD) for PAGs, defined as follows: Let
P̂ be the output PAG and P be the ground truth PAG. For
each change (edge addition, edge deletion, change of an
arrowhead, change of a tail) required to transform P̂ into P,
the SHD is increased by 1.</p>
      <p>We also use precision and recall, as described in Tillman
and Spirtes [2011]: Precision is defined as the number of
edges in the output PAG with the correct orientations,
divided by the number of edges in the output PAG. Recall
is defined as the number of edges in the output PAG with
correct orientations, divided by the number of edges in the
ground truth PAG. These metrics are very conservative,
since they penalize even small differences. For example,
an edge with a single different endpoint mark (e.g. → in the
ground truth but o→ in the output PAG) will be classified as
a false positive.</p>
      <p>Greedy search has larger structural Hamming distances than
FCI and CFCI. More specifically, out of 900 cases (over all
variable and sample sizes), FCI outperforms GSMAG in
657 cases for sparse networks and in 715 cases for dense
networks, while CFCI outperforms GSMAG in 682 cases
for sparse networks and in 710 cases for dense networks.
In terms of precision, CFCI is again the best of the three
(it outperforms GSMAG in 601 and 659 out of 900 cases for
sparse and dense networks, respectively). FCI has the
poorest precision: It outperforms GSMAG in 318 and 221 cases
for sparse and dense networks, respectively. Finally,
GSMAG has the best recall of all three algorithms, with CFCI
being second. Specifically, in terms of recall, FCI
outperforms GSMAG in 139 and 83 cases, while CFCI
outperforms GSMAG in 357 and 249 cases, for sparse
and dense networks, respectively. Naturally, greedy search
is much slower than both conservative and plain FCI.</p>
      <p>Intriguingly, GSMAG's performance declines for the
largest attempted sample size (5000 samples), particularly
for larger networks. This happens because greedy search
tends to include many false positive edges. It is possible
that this is related to the naive greedy search, and could be
improved by adding some kind of heuristic for
escaping local minima, or by adjusting the scoring criterion.</p>
      <p>Figure 6 shows the score of the output MAG for Algorithm
1 and the ground truth MAG. To compare also with FCI,
we used the method presented in Zhang [2008] to obtain
a MAG from P_FCI. Notice that this cannot be applied to
the output of CFCI, since it is not a complete PAG (due to
unfaithful triplets). Greedy search typically scores closer
to the ground truth, particularly for denser networks.</p>
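      <p>For concreteness, these metrics can be computed from explicit PAG representations as follows (a Python sketch; the encoding of edges and endpoint marks is our own choice for illustration, not a standard):</p>
      <preformat><![CDATA[
def pag_metrics(est, truth):
    """SHD, precision and recall for PAGs. Each PAG is a dict mapping an
    ordered pair (X, Y), X < Y, to the pair of endpoint marks
    (mark at X, mark at Y), with marks in {'>', '-', 'o'}. An edge is
    correct only if it exists in both PAGs with identical marks."""
    shd = 0
    for e in set(est) | set(truth):
        if e not in est or e not in truth:
            shd += 1                                   # addition/deletion
        else:                                          # per-mark changes
            shd += sum(m1 != m2 for m1, m2 in zip(est[e], truth[e]))
    correct = sum(1 for e in est if e in truth and est[e] == truth[e])
    precision = correct / len(est) if est else 0.0
    recall = correct / len(truth) if truth else 0.0
    return shd, precision, recall

# truth = {('A', 'B'): ('-', '>')}   # A --> B
# est   = {('A', 'B'): ('o', '>')}   # A o-> B
# pag_metrics(est, truth)            # -> (1, 0.0, 0.0)
]]></preformat>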
    </sec>
    <sec id="sec-11">
      <title>FUTURE WORK</title>
      <p>We present an implementation of a greedy search algorithm
for learning MAGs from observations, and compare it to
FCI and CFCI. To the best of our knowledge, this is the first
comparison of score-based and constraint-based search in
the presence of confounders.</p>
      <p>The algorithm uses the decomposition presented in
Nowzohour et al. [2015] for bow-free SMCMs. Compared to
SMCMs, MAGs are less expressive in terms of causal
statements. However, since they have no almost directed cycles,
fitting procedures for obtaining maximum likelihood
estimates always converge. Semi-Markov causal models that
are Markov equivalent to the output MAG could be
identified as a post-processing step.</p>
      <p>Heuristic procedures for escaping local minima could also
be explored, to improve the performance of GSMAG.
Algorithm efficiency could possibly be improved by
updating, rather than recomputing, inverse matrices where
applicable.</p>
      <p>Other interesting directions include taking weighted
averages for specific PAG features, or using both
constraint-based and score-based techniques for hybrid learning.
Greedy search in the space of PAGs instead of MAGs could
also be explored, since a transformational characterization
of Markov equivalent MAGs exists [Zhang and Spirtes,
2005].</p>
      <sec id="sec-11-1">
        <title>Acknowledgments</title>
        <p>This work was funded by the European Research Council
(ERC) and is part of the CAUSALPATH - Next Generation
Causal Analysis project, No. 617393.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>RA</given-names>
            <surname>Ali</surname>
          </string-name>
          , TS Richardson, and
          <string-name>
            <given-names>P</given-names>
            <surname>Spirtes</surname>
          </string-name>
          .
          <article-title>Markov equivalence for ancestral graphs</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>37</volume>
          (5B):
          <fpage>2808</fpage>
          -
          <lpage>2837</lpage>
          ,
          <year>October 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>RR</given-names>
            <surname>Bouckaert</surname>
          </string-name>
          .
          <article-title>Bayesian belief networks: from construction to inference</article-title>
          .
          <source>PhD thesis</source>
          , University of Utrecht,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>C</given-names>
            <surname>Brito</surname>
          </string-name>
          and
          <string-name>
            <given-names>J</given-names>
            <surname>Pearl</surname>
          </string-name>
          .
          <article-title>A new identification condition for recursive models with correlated errors</article-title>
          .
          <source>Structural Equation Modeling</source>
          ,
          <volume>9</volume>
          (
          <issue>4</issue>
          ):
          <fpage>459</fpage>
          -
          <lpage>474</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>T</given-names>
            <surname>Claassen</surname>
          </string-name>
          , JM Mooij, and
          <string-name>
            <given-names>T</given-names>
            <surname>Heskes</surname>
          </string-name>
          .
          <article-title>Learning sparse causal models is not NP-hard</article-title>
          .
          <source>In Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>D</given-names>
            <surname>Colombo</surname>
          </string-name>
          , MH Maathuis,
          <string-name>
            <given-names>M</given-names>
            <surname>Kalisch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>TS</given-names>
            <surname>Richardson</surname>
          </string-name>
          .
          <article-title>Learning high-dimensional directed acyclic graphs with latent and selection variables</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>40</volume>
          (
          <issue>1</issue>
          ):
          <fpage>294</fpage>
          -
          <lpage>321</lpage>
          ,
          <year>02 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>M</given-names>
            <surname>Drton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Eichler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>TS</given-names>
            <surname>Richardson</surname>
          </string-name>
          .
          <article-title>Computing maximum likelihood estimates in recursive linear models with correlated errors</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>10</volume>
          :
          <fpage>2329</fpage>
          -
          <lpage>2348</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>D</given-names>
            <surname>Geiger</surname>
          </string-name>
          and
          <string-name>
            <given-names>D</given-names>
            <surname>Heckerman</surname>
          </string-name>
          .
          <article-title>Learning Gaussian networks</article-title>
          .
          <source>In Proceedings of the 10th Annual Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>D</given-names>
            <surname>Heckerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D</given-names>
            <surname>Geiger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>DM</given-names>
            <surname>Chickering</surname>
          </string-name>
          .
          <article-title>Learning bayesian networks: The combination of knowledge and statistical data</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>20</volume>
          (
          <issue>3</issue>
          ):
          <fpage>197</fpage>
          -
          <lpage>243</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>C</given-names>
            <surname>Nowzohour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Maathuis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P</given-names>
            <surname>Bühlmann</surname>
          </string-name>
          .
          <article-title>Structure learning with bow-free acyclic path diagrams</article-title>
          .
          <source>arXiv preprint arXiv:1508.01717</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>J</given-names>
            <surname>Pearl</surname>
          </string-name>
          .
          <source>Causality: Models, Reasoning and Inference</source>
          , volume
          <volume>113</volume>
          of Hardcover. Cambridge University Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>J</given-names>
            <surname>Ramsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P</given-names>
            <surname>Spirtes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Adjacency faithfulness and conservative causal inference</article-title>
          .
          <source>In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>TS</given-names>
            <surname>Richardson</surname>
          </string-name>
          .
          <article-title>A factorization criterion for acyclic directed mixed graphs</article-title>
          .
          <source>In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>TS</given-names>
            <surname>Richardson</surname>
          </string-name>
          and
          <string-name>
            <given-names>P</given-names>
            <surname>Spirtes</surname>
          </string-name>
          .
          <article-title>Ancestral graph Markov models</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>30</volume>
          (
          <issue>4</issue>
          ):
          <fpage>962</fpage>
          -
          <lpage>1030</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>I</given-names>
            <surname>Shpitser</surname>
          </string-name>
          , TS Richardson, JM Robins, and
          <string-name>
            <given-names>R</given-names>
            <surname>Evans</surname>
          </string-name>
          .
          <article-title>Parameter and structure learning in nested Markov models</article-title>
          .
          <source>In UAI Workshop on Causal Structure Learning</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>I</given-names>
            <surname>Shpitser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>TS</given-names>
            <surname>Richardson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>JM</given-names>
            <surname>Robins</surname>
          </string-name>
          .
          <article-title>Sparse nested Markov models with log-linear parameters</article-title>
          .
          <source>In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>P</given-names>
            <surname>Spirtes</surname>
          </string-name>
          .
          <article-title>Introduction to causal inference</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>11</volume>
          :
          <fpage>1643</fpage>
          -
          <lpage>1662</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>P</given-names>
            <surname>Spirtes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C</given-names>
            <surname>Glymour</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R</given-names>
            <surname>Scheines</surname>
          </string-name>
          .
          <source>Causation, Prediction, and Search</source>
          . MIT Press, second edition,
          <year>January 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>J</given-names>
            <surname>Tian</surname>
          </string-name>
          and
          <string-name>
            <given-names>J</given-names>
            <surname>Pearl</surname>
          </string-name>
          .
          <article-title>On the identification of causal effects</article-title>
          .
          <source>Technical Report</source>
          R-290-L, UCLA Cognitive Systems Laboratory,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>RE</given-names>
            <surname>Tillman</surname>
          </string-name>
          and
          <string-name>
            <given-names>P</given-names>
            <surname>Spirtes</surname>
          </string-name>
          .
          <article-title>Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables</article-title>
          .
          <source>In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Sofia</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Tsamardinos</surname>
          </string-name>
          .
          <article-title>Constraintbased causal discovery from multiple interventions over overlapping variable sets</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>16</volume>
          :
          <fpage>2147</fpage>
          -
          <lpage>2205</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>I</given-names>
            <surname>Tsamardinos</surname>
          </string-name>
          , LE Brown, and CF Aliferis.
          <article-title>The max-min hill-climbing Bayesian network structure learning algorithm</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>65</volume>
          (
          <issue>1</issue>
          ):
          <fpage>31</fpage>
          -
          <lpage>78</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>S</given-names>
            <surname>Warshall</surname>
          </string-name>
          .
          <article-title>A theorem on boolean matrices</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>11</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>1962</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>J</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>172</volume>
          (
          <fpage>16</fpage>
          -17):
          <fpage>1873</fpage>
          -
          <lpage>1896</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>J</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>P</given-names>
            <surname>Spirtes</surname>
          </string-name>
          .
          <article-title>A transformational characterization of Markov equivalence for directed acyclic graphs with latent variables</article-title>
          .
          <source>In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, UAI</source>
          <year>2005</year>
          , pages
          <fpage>667</fpage>
          -
          <lpage>674</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>