<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Robust reconstruction of causal graphical models based on conditional 2-point and 3-point information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>S´everine A↵eldt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Herv´e Isambert</string-name>
          <email>herve.isambert@curie.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut Curie, Research Center</institution>
          ,
          <addr-line>CNRS, UMR168, 26 rue d'Ulm, 75005</addr-line>
          ,
          <institution>Paris France; and Universit ́e Pierre et Marie Curie</institution>
          ,
          <addr-line>4 Place Jussieu, 75005, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <volume>14</volume>
      <fpage>465</fpage>
      <lpage>471</lpage>
      <abstract>
        <p>We report a novel network reconstruction method, which combines constraintbased and Bayesian frameworks to reliably reconstruct graphical models despite inherent sampling noise in finite observational datasets. The approach is based on an information theory result tracing back the existence of colliders in graphical models to negative conditional 3-point information between observed variables. In turn, this provides a confident assessment of structural independencies in causal graphs, based on the ranking of their most likely contributing nodes with (significantly) positive conditional 3-point information. Starting from a complete undirected graph, dispensible edges are progressively pruned by iteratively “taking o↵” the most likely positive conditional 3-point information from the 2-point (mutual) information between each pair of nodes. The resulting network skeleton is then partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. This “3o↵2” network reconstruction approach is shown to outperform constraint-based, search-and-score and earlier hybrid methods on a range of benchmark networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The prospect of learning the direction of causal
dependencies from mere correlations in observational data
has long defied practical implementations
(Reichenbach, 1956). The fact that causal relationships can,
to some extent, be inferred from nontemporal
statistical data is now known to hinge on the unique
statistical imprint of colliders in causal graphical models,
provided that certain assumptions are made about the
underlying process of data generation, such as its
faithfulness to a tree structure (Rebane and Pearl, 1988) or
a directed acyclic graph model (Spirtes, Glymour, and
Scheines, 2000; Pearl, 2009).</p>
      <p>
        These early findings led to the developments of two
types of network reconstruction approaches; on the
one hand, search and score methods (Cooper and
Herskovits, 1992; Heckerman, Geiger, and Chickering,
1995; Chickering, 2002) need heuristic strategies, such
as hill-climbing algorithms, to sample network space,
on the other hand, constraint-based methods, such as
the PC (Spirtes and Glymour, 1991) and IC (Pearl
and Verma, 1991) algorithms, rely on the identification
of structural independencies, that correspond to edges
to be removed from the underlying network (Spirtes,
Glymour, and Scheines, 2000; Pearl, 2009). Yet, early
errors in removing edges from the complete graph
often lead to the accumulation of compensatory errors
later on in the pruning process. Hence, despite recent,
more stable implementations intending to overcome
order-dependency in the pruning process
        <xref ref-type="bibr" rid="ref2">(Colombo
and Maathuis, 2014)</xref>
        , constraint-based methods are
not robust to sampling noise in finite datasets.
In this paper, we present a more robust
constrainedbased method and corresponding 3o↵2 algorithm. It
is directly inspired by the PC anc IC algorithms but
relies on a quantitative information theoretic
framework to reliably uncover conditional independencies in
finite datasets and subsequently orient and propagate
edge directions between connected variables.
2.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>RESULTS</title>
      <p>UNCOVERING CAUSALITY FROM A
STABLE / FAITHFUL DISTRIBUTION
Consider a network G = (V, E) and a stable (or
faithful) distribution P (X) over V , implying that each
structural independency (i.e. missing edge XY in G)
corresponds to a vanishing conditional 2-point
(mutual) information and reciprocally as,
(X? ?</p>
      <p>Y |{Ui})G
()
()
(X? ?</p>
      <p>Y |{Ui})P
I(X; Y |{Ui}) = 0
(1)
(2)
Eq.1 assumes, in particular, that P (X) is a
theoretical distribution, defined by a formal expression of its
variables X = {X, Y, U1, U2, . . .}. Note, however, that
no such expression is known a priori, in general, and
P (X) must typically be estimated from the available
data. In principle, an infinite amount of data would be
necessary to infer an ‘exact’ stable distribution P (X)
consistent with Eq.1. In the following, we will first
assume that such an infinite amount of data is
available and distributed as a stable P (X) to establish how
causality can be inferred statistically from conditional
2-point and 3-point information. We will then
consider the more realistic situation for which P (X) is
not known exactly and must be estimated from a
finite amount of data.</p>
      <p>Let us first recall the generic decomposition of a
conditional 2-point (or mutual) information I(X; Y |{Ui})
by the introduction of a third node Z and the
conditional 3-point information I(X; Y ; Z|{Ui}),</p>
      <p>I(X;Y |{Ui}) = I(X;Y;Z|{Ui}) + I(X;Y |{Ui}, Z) (3)
This relation can be taken as the definition of
conditional 3-point information I(X; Y ; Z|{Ui}) which is in
fact symmetric in X, Y and Z,</p>
      <p>I(X; Y ; Z|{Ui}) = I(X; Y |{Ui})
= I(X; Z|{Ui})
= I(Y ; Z|{Ui})</p>
      <p>I(X; Y |{Ui}, Z)
I(X; Z|{Ui}, Y )
I(Y ; Z|{Ui}, X)
Note that Eq.3 is always valid, regardless of any
assumption on the underlying graphical model and of
the amount of data available to estimate conditional
2-point and 3-point information terms. Eq.3 will be
used to prove the following lemmas and propositions,
which trace back the origin of necessary causal
relationships in a graphical model to the existence of a
negative conditional 3-point information between three
variables {X, Y, Z}, I(X; Y ; Z|{Ui}) &lt; 0, where {Ui}
accounts for a structural independency between two of
them, e.g. I(X; Y |{Ui}) = 0 (see Theorem 4).
Lemma 1. Given a stable distribution P (X) on V ,
8 X, Y 2 V not adjacent in G, 9{ Ui} ✓ V\{X,Y }
s.t. I(X; Y |{Ui}) = 0 and 8 Z 6= X, Y, {Ui},
I(X; Y ; Z|{Ui}) 6 0.</p>
      <p>Proof. If X, Y 2 V are not adjacent in G, this
corresponds to a structural independency, i.e. 9{ Ui} ✓
V\{X,Y } s.t. I(X; Y |{Ui}) = 0. Then 8 Z 6= X, Y, {Ui}
Eq.3 implies I(X; Y ; Z|{Ui}) = I(X; Y |{Ui}, Z) 6 0,
as conditional mutual information is always positive. ⇤
Corollary 2 (3-point contribution). 8 X, Y, Z 2 V
and 8{ Ui} ✓ V\{X,Y,Z} s.t. I(X; Y ; Z|{Ui}) &gt; 0, then
I(X; Y |{Ui}) &gt; 0 (as well as I(X; Z|{Ui}) &gt; 0 and
I(Y ; Z|{Ui}) &gt; 0 by symmetry of I(X; Y ; Z|{Ui})).
Corollary 2, which is a direct consequence of Eq.3
and the positivity of mutual information, will be the
basis of the 3o↵2 causal network reconstruction
algorithm, which iteratively “takes o↵” 3-point
information from 2-point information, as I(X; Y |{Ui})
I(X; Y ; Z|{Ui}) = I(X; Y |{Ui}, Z), and update
{Ui} { Ui} + Z as long as there remains some Z 2 V
with (significantly) positive conditional 3-point
information I(X; Y ; Z|{Ui}) &gt; 0.</p>
      <p>Lemma 3 (vanishing conditional 2-point and
3point information in undirected networks). If
G is an undirected (Markov) network, 8 X, Y 2 V and
8{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0, then 8 Z 6=
X, Y, {Ui}, I(X; Y ; Z|{Ui}) = 0.</p>
      <p>Proof. If G is a Markov network, 8 X, Y 2 V and
8{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0, then 8 Z 6=
X, Y, {Ui}, I(X; Y |{Ui}, Z) = 0 as conditioning
observation cannot induce correlations in Markov
networks (Koller and Friedman, 2009). This implies that
I(X; Y ; Z|{Ui}) = 0 through Eq.3. ⇤
Note, however, that the converse of Lemma 3 is not
true. Namely, (partially) directed networks can also
have vanishing conditional 3-point information
associated to all their structural independencies. In
particular, tree-like bayesian networks without colliders
(i.e. without v-structures, X ! Z Y ) present
only vanishing 3-point information associated to their
structural independencies, i.e. I(X; Y ; Z|{Ui}) = 0,
8 X, Y, Z, {Ui} 2 V s.t. I(X; Y |{Ui}) = 0.
However, such a directed network must be Markov
equivalent to an undirected network corresponding to the
same structural independencies but lacking any trace
of causal relationships (i.e. no directed edges). The
probability distributions faithful to such directed
networks do not contain evidence of obligate causality;
i.e. no directed edges can be unambiguously oriented.
The following Theorem 4 establishes the existence of
negative conditional 3-point information as statistical
evidence of obligate causality in graphical models. For
the purpose of generality in this section, we do not
exclude the possibility that unobserved ‘latent’
variables might mediate the causal relationships among
observed variables. However, this requires
dissociating the labelling of the two endpoints of each edges.
Let us first introduce three di↵erent endpoint marks
associated to such edges in mixed graphs: they are the
tail ( ), the head (&gt;) and the unspecified ( ) endpoint
marks. In addition, we will use the asterisk symbol (⇤ )
as a wild card denoting any of the three marks.
Theorem 4 (negative conditional 3-point
information as statistical evidence of
causality). If 9 X, Y, Z 2 V and {Ui} ✓ V\{X,Y,Z} s.t.
I(X; Y |{Ui}) = 0 and I(X; Y ; Z|{Ui}) &lt; 0 then, G
is (partially) directed, i.e. some variables in G are
causally linked, either directly or indirectly through
other variables, including possibly unknown, ‘latent’
variables unobserved in G.</p>
      <p>Proof. Theorem 4 is the contrapositive of Lemma 3,
with the additional use of Lemma 1. ⇤
Proposition 5 (origin of causality at unshielded
triples with negative conditional 3-point
information). for all unshielded triple, X ⇤ Z ⇤ Y ,
9{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0, if Z 2/ { Ui}
then I(X; Y ; Z|{Ui}) &lt; 0 and the unshielded triple
should be oriented as X !⇤ Z ⇤ Y .</p>
      <p>Proof. if I(X; Y |{Ui}) = 0 with Z 2/{ Ui}, the
unshielded triple has to be a collider and I(X; Y |{Ui}, Z) &gt; 0,
by faithfulness, hence, I(X; Y ; Z|{Ui}) &lt; 0 by Eq.3. ⇤
Hence, the origin of causality manifests itself in the
form of colliders or v-structures in graphical models
which reveal ‘genuine’ causations (X! Z or Y ! Z) or,
alternatively, ‘possible’ causations (X ! Z or Y !
Z), provided that the corresponding correlations are
not due to unobserved ‘latent’ variables L or L0 as,
X L99 L 99K Z or Y L99 L0 99K Z.</p>
      <p>Following the rationale of constraint-based
approaches, it is then possible to ‘propagate’ further the
orientations downstream of colliders, through positive
(conditional) 3-point information, if one assumes that
the underlying distribution P (X) is faithful to an
ancestral graph G on V . An ancestral graph is a mixed
graph, that is, with three types of edges, undirected
( ), directed ( or ! ) or bidirectional ($ ), but with
i.) no directed cycle, ii.) no almost directed cycle
(including one bidirectional edge) and iii.) no undirected
edge with incoming arrowhead (such as X !⇤ Z Y ).
In particular, Directed Acyclic Graphs (DAG) are
subclasses of ancestral graphs (i.e. without undirected nor
bidirectional edges).</p>
      <p>Proposition 6 (‘propagation’ of causality
at unshielded triples with positive
conditional 3-pt information). Given a
distribution P (X) faithful to an ancestral graph G on V ,
for all unshielded triple with already one
converging orientation, X !⇤ Z ⇤ Y , 9{ Ui} ✓ V\{X,Y }
s.t. I(X; Y |{Ui}) = 0, if Z 2 { Ui} then
I(X; Y ; Z|{Ui}\Z ) &gt; 0 and the first orientation should
be ‘propagated’ to the second edge as X !⇤ Z ! Y .
Proof. if I(X; Y |{Ui}) = 0 with Z 2 { Ui}, the
unshielded triple cannot be a collider and, since G is
assumed to be an ancestral graph, the edge Z Y cannot
LD|G
LD|G\X!Y = e NI(X;Y |{PaY }\X )</p>
      <p>Z(G, D)
Z(G\X! Y , D)
where I(X; Y |{PaY }\X ) = H(Y |{PaY }\X )
H(Y |{PaY }). However, Eq.6 cannot be used as such
be an undirected edge either. Hence, it has to be a
directed edge, Z ! Y and I(X; Y ; Z|{Ui}\Z ) &gt; 0 by
faithfulness and Eq.3. ⇤
Note that the propagation rule of Proposition 6 can
be applied iteratively to successive unshielded triples
corresponding to positive conditional 3-point
information. Yet, all arrowhead orientations can be ultimately
traced back to a negative conditional 3-point
information, Theorem 4 and Proposition 5.
2.2</p>
      <p>ROBUST RECONSTRUCTION OF
CAUSAL GRAPHS FROM FINITE</p>
      <p>DATASETS
We now turn to the more practically relevant
situation of finite datasets consisting of N independent data
points. The associated sampling noise will
instrinsically limit the accuracy of causal network
reconstruction. In particular, conditional independencies cannot
be exactly achieved (I(X; Y |{Ui}) = 0) but can be
reliably established using statistical criteria that depend
on the number of data points N .</p>
      <p>Given N independent datapoints from the available
data D, let us introduce the maximum likelihood,
LD|G , that they might have been generated by the
graphical model G (Sanov, 1957),</p>
      <p>LD|G =
e NH(G,D)</p>
      <p>Z(G, D)
=
eN P{xi} p({xi}) log(q({xi}))</p>
      <p>Z(G, D)
(4)
where H(G, D) = P{xi} p({xi}) log(q({xi})) is the
cross entropy between the “true” probability
distribution p({xi}) of the data D and the theoretical
probability distribution q({xi}) of the model G and Z(G, D)
is a data- and model-dependent factor ensuring proper
normalization condition. The structural constraints of
the model G can be included a priori in the
factorization form of the theoretical probability distribution,
q({xi}). In particular, if we assume a Bayesian
network as underlying graphical model, q({xi}) factorizes
as q({xi}) = Qi p(xi|{paxi }), where {paxi } denote the
values of the parents of node Xi, {PaXi }, and leads to
the following maximum likelihood expression,
(5)
(6)
LD|G =
e N Pi H(Xi|{PaXi })</p>
      <p>Z(G, D)
The model G can then be compared to the alternative
model G\X! Y with one additional missing edge X !
Y using the maximum likelihood ratio,
to learn the underlying graphical model, as it assumes
that the order between the nodes and their parents is
already known (see however (de Campos, 2006)). Yet,
following the rationale of constraint-based approaches,
Eq.6 can be reformulated by replacing the parent
nodes with an unknown separation set {Ui} to be
learnt simultaneously with the missing edge candidate
XY ,</p>
      <p>
        LG
LG\XY |{Ui} = e NI(X;Y |{Ui})+kX;Y |{Ui}
(7)
kX;Y |{Ui} = log Z(G, D)/Z(G\XY |{Ui}, D)
where the factor kX;Y |{Ui} &gt; 0 tends to limit the
complexity of the models by favoring fewer edges.
Namely, the condition, I(X; Y |{Ui}) &lt; kX;Y |{Ui}/N ,
implies that simpler models compatible with the
structural independency, X? ? Y |{Ui}, are more likely than
model G, given the finite available dataset. This
replaces the ‘perfect’ conditional independency
condition, I(X; Y |{Ui}) = 0, valid in the limit of an infinite
dataset, N ! 1 . A common complexity criteria in
model selection is the Bayesian Information Criteria
(BIC) or Minimal Description Length (MDL) criteria
        <xref ref-type="bibr" rid="ref3">(Rissanen, 1978; Hansen and Yu, 2001)</xref>
        ,
where rx, ry and rui are the number of levels of the
corresponding variables. The MDL complexity, Eq.8,
is simply related to the normalisation constant of the
normal distribution reached in the asymptotic limit
of a large dataset N ! 1 (Central Limit
Theorem). However, such a central limit distribution is
only reached for very large datasets in practice.
Alternatively, the normalisation of the maximum likelihood
can also be done over all possible datasets including
the same number of data points to yield a
(universal) Normalized Maximum Likelihood (NML)
criteria
        <xref ref-type="bibr" rid="ref7">(Shtarkov, 1987; Rissanen and Tabus, 2005)</xref>
        and its
decomposable
        <xref ref-type="bibr" rid="ref4">(Kontkanen and Myllym¨aki, 2007; Roos
NML
et al., 2008)</xref>
        and XY -symmetric version, kX;Y |{Ui},
defined in the Supplementary Methods.
      </p>
      <p>Then, instead of exploring the combinatorics of sepset
composition {Ui} for each missing edge candidate XY
as in traditional constraint-based approaches, we
propose that Eq.7 can be used to iteratively extend a
likely sepset using the maximum likelihood ratios
between two successive sepset candidates, i.e. between
the already ascertained {Ui} and the possible extended
{Ui} + Z, as,
using Eq.3 for I(X; Y ; Z|{Ui}) and introducing a
similar 3-point complexity conditioned on {Ui} as,
kX;Y ;Z|{Ui} = kX;Y |{Ui},Z
kX;Y |{Ui}
(10)
where kX;Y ;Z|{Ui} &gt; 0, unlike 3-point information,
I(X; Y ; Z|{Ui}) which can be positive or negative.
Introducing also the shifted 2-point and 3-point
information for finite datasets as,
=
=
I0(X; Y |{Ui})</p>
      <p>I(X; Y |{Ui})
I0(X; Y ; Z|{Ui})</p>
      <p>I(X; Y ; Z|{Ui}) +
kX;Y |{Ui}</p>
      <p>N
kX;Y ;Z|{Ui}</p>
      <p>N
leads to maximum likelihood ratios equivalent to Eqs.7
and 9,
As will become apparent in the following discussion,
learning, iteratively, the most likely edge to be
removed XY and its corresponding separation set {Ui}
will imply to simultaneously minimize 2-point
information (Eq.11) while maximizing 3-point information
(Eq.12).</p>
      <p>We start the discussion with 3-point information,
Eq.12. The sign and magnitude of shifted
conditional 3-point information I0(X; Y ; Z|{Ui}) determine
the probability that Z should be included in or
excluded from the sepset candidate {Ui},
• If I0(X; Y ; Z|{Ui}) &gt; 0, Z is more likely to be
included in {Ui} with probability,</p>
      <p>Pnv(X; Y ; Z|{Ui}) =</p>
      <p>LD|G\XY |{Ui},Z
=</p>
      <p>LD|G\XY |{Ui} + LD|G\XY |{Ui},Z</p>
      <p>1
1 + e NI0(X;Y ;Z|{Ui})
(13)
• If I0(X; Y ; Z|{Ui}) &lt; 0, Z is more likely to be
excluded from {Ui}, suggesting obligatory causal
relationships in the form of a v-structure or collider
between X, Y, Z with probability,</p>
      <p>Pv(X; Y ; Z|{Ui}) = 1
=</p>
      <p>Pnv(X; Y ; Z|{Ui})</p>
      <p>1
1 + eNI0(X;Y ;Z|{Ui})
(14)
But, in the case I0(X; Y ; Z|{Ui}) &gt; 0, Eq.12 can
also be interpreted as quantifying the likelihood
increase that the edge XY should be removed
from the model by extending the candidate sepset
from {Ui} to {Ui} + Z, i.e. LD|G\XY |{Ui},Z =
LD|G\XY |{Ui} ⇥ exp(N I0(X; Y ; Z|{Ui})), with
exp(N I0(X; Y ; Z|{Ui})) &gt; 1. Yet, as the 3-point
information, I0(X; Y ; Z|{Ui}), is actually symmetric
with respect to the variables, X, Y and Z, the factor
exp(N I0(X; Y ; Z|{Ui})) &gt; 1 provides in fact the same
likelihood increase for the removal of the three edges
XY , XZ and ZY , conditioned on the same initial set
of nodes {Ui}, namely,
However, despite this symmetry of 3-point
information, I0(X; Y ; Z|{Ui}), the likelihoods that the edges
XY , XZ and ZY should be removed are not the
same, as they depend on di↵erent 2-point
information, I0(X; Y |{Ui}), I0(X; Z|{Ui}) and I0(Z; Y |{Ui}),
Eq.11. In particular, the likelihood ratio between the
removals of the alternative edges XY and XZ is given
by,</p>
      <p>LD|G\XY |{Ui},Z = LD|G\XY |{Ui} =
LD|G\XZ|{Ui},Y</p>
      <p>LD|G\XZ|{Ui}</p>
      <p>Hence, for XY to be the most likely edge to be
removed conditioned on the sepset {Ui} + Z, not only
Z should contribute through I0(X; Y ; Z|{Ui}) &gt; 0
with probability Pnv(X; Y ; Z|{Ui}) (Eq.13), but XY
must also correspond to the ‘weakest’ edge of XY ,
XZ and ZY conditioned on {Ui}, as given by the
lowest conditioned 2-point information, Eq.15. Note
that removing the edge XY with the lowest
conditional 2-point information is consistent, as expected,
with the Data Processing Inequality, I(X; Y |{Ui}) 6
min(I(X; Z|{Ui}), I(Z; Y |{Ui})), in the limit of large
datasets. However, quite frequently, XZ or ZY might
also have low conditional 2-point information, so that
the edge removal associated with the symmetric
contribution I(X; Y ; Z|{Ui}) will only be consistent with
the Data Processing Inequality (DPI) with probability,
Pdpi(XY ; Z|{Ui}) =
=
=
nificantly improves the results obtained by
relying solely on the ‘non-v-structure’ probability
Pnv(X; Y ; Z|{Ui}). Conversely, the DPI-consistency
probability Pdpi(XY ; Z|{Ui}) is not sucient on its
own to uncover causal relationships between
variables, which require to compute 3-point information
I(X; Y ; Z|{Ui}) and the probability Pnv(X; Y ; Z|{Ui})
(see Proposition 7 and Proposition 8, below).
To optimize the likelihood that the edge XY can be
accounted for by the additional contribution of Z
conditioned on previously selected {Ui}, we propose to
combine the maximum of 3-point information (Eq.13)
and the minimum of 2-point information (Eq.16) by
defining the score Slb(Z; XY |{Ui}) as the lower bound
of Pnv(X; Y ; Z|{Ui}) and Pdpi(XY ; Z|{Ui}), since both
conditions need to be fulfilled to warrant that edge
XY is likely to be absent from the model G,
Slb(Z;XY |{Ui}) =</p>
      <p>h
= min Pnv(X; Y ; Z|{Ui}), Pdpi(XY ; Z|{Ui})
i
Hence, the pair of nodes XY with the most likely
contribution from a third node Z and likely to be absent
from the model can be ordered according to their rank
R(XY ; Z|{Ui}) defined as,</p>
      <p>R(XY ; Z|{Ui}) = max Slb(Z; XY |{Ui})</p>
      <p>Z
(17)
Then, Z can be iteratively added to the set of
contributing nodes (i.e. {Ui} { Ui} + Z) of the top edge
XY = argmaxXY R(XY ; Z|{Ui}) to progressively
recover the most significant indirect contributions to all
pairwise mutual information in a causal graph.
Implementing this local optimization scheme, the 3o↵2
algorithm eventually learns the network skeleton by
collecting the nodes of the separation sets one-by-one,
instead of exploring the full combinatorics of sepset
composition without any likelihood guidance. Indeed,
the 3o↵2 scheme amounts to identify {Ui} by “taking
o↵” iteratively the “most likely” conditional 3-point
information from each 2-point information as,
I(X; Y |{Ui}n) = I(X; Y )
I(X; Y ; U1)
I(X; Y ; U2|U1) · · ·
I(X; Y ; Un|{Ui}n 1)
or equivalently between the shifted 2-point and 3-point
information terms,</p>
      <p>I0(X; Y |{Ui}n) = I0(X; Y )
I0(X; Y ; U1)
I0(X; Y ; U2|U1) · · ·
I0(X; Y ; Un|{Ui}n 1)
In practice, taking into account this DPI-consistency
probability Pdpi(XY ; Z|{Ui}), as detailed below,
sigThis leads to the following Algorithm 1 for the
reconstruction of the graph skeleton using the 3o↵2 scheme.
while 9 XY edge with R(XY ; Z|{Ui}) &gt; 1/2 do
for edge XY with highest rank R(XY ; Z|{Ui}) do
expand contributing set {Ui}
{ Ui} + Z
else
end
if I0(X; Y |{Ui}) &lt; 0 then</p>
      <p>XY edge is non-essential and removed
separation set of XY : SepXY = {Ui}
find next most contributing node Z
neighbor of X or Y and compute new
3o↵2 rank: R(XY ; Z|{Ui})
sort the 3o↵2 rank list R(XY ; Z|{Ui})
else
end
end</p>
      <sec id="sec-2-1">
        <title>Iteration</title>
        <p>end
end
Then, given the skeleton obtained from Algorithm 1,
Eqs.13 and 14 lead to the following Proposition 7 and
Proposition 8 for the orientation and propagation rules
of unshielded triples, which are equivalent to
Proposition 5 and Proposition 6 but for underlying DAG
models (assuming no latent variables) and for finite
datasets with the corresponding probabilities for the
initiation/propagation of orientations.</p>
        <p>Proposition 7 (Significantly negative
condiNote, in particular, that the 3o↵2 scheme to
reconstruct graph skeleton is solely based on identifying
structural independencies, which can also be applied
to graphical models for undirected Markov networks.</p>
        <sec id="sec-2-1-1">
          <title>Algorithm 1:</title>
          <p>3o↵2</p>
          <p>Skeleton Reconstruction
In: observational data of finite size N
Out: skeleton of causal graph G</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Initiation</title>
        <p>Start with complete undirected graph
forall edges XY do
if I0(X; Y ) &lt; 0 then</p>
        <p>XY edge is non-essential and removed
separation set of XY : SepXY = ;
find the most contributing node Z
neighbor of X or Y and compute 3o↵2 rank,
R(XY ; Z|; )
tional 3-point information as robust statistical
evidence of causality in finite datasets).
Assuming that the underlying graphical model is
a DAG G on V , 8 X, Y, Z 2 V and 8{ Ui} ✓
V\{X,Y,Z} s.t. I0(X; Y |{Ui}) &lt; 0 (i.e. no XY edge)
and I0(X; Y ; Z|{Ui}) &lt; 0 then,
i. if X, Y, Z form an unshielded triple, X
then it should be oriented as X ! Z
probabilities,
Z Y ,
Y , with
PX! Z = PY ! Z = 1 + 3eNI0(X;Y ;Z|{Ui})
1 + eNI0(X;Y ;Z|{Ui})
ii. similarly, if X, Y, Z form an unshielded triple,
with one already known converging arrow,
X ! Z Y , with probability PX! Z &gt; PX! Z ,
then the second edge should be oriented to form a
v-structure, X ! Z Y , with probability,
PY ! Z = PX! Z
✓</p>
        <p>1
1 + eNI0(X;Y ;Z|{Ui})
1 ◆
2
+
Proof. The implications (i.) and (ii.) rely on Eq.14
to estimate the probability that the two edges form a
collider. We start proving (ii.) using the probability
decomposition formula:</p>
        <p>PX! Z Y
PY ! Z = PX! Z PX! Z Y + PX! Z! Y</p>
        <p>+ (1
Following the rationale of constraint-based
approaches, it is then possible to ‘propagate’ further the
orientations downstream of colliders, using Eq.13 for
positive (conditional) 3-point information. For
simplicity and consistency, we only implement the
propagation of orientation based on likelihood ratios, which
can be quantified for finite datasets as proposed in
the following Proposition 8. In particular, we do not
extend the propagation rules (Meek, 1995) to inforce
acyclic constraints that are necessary to have a
complete reconstruction of the Markov equivalent class of
the underlying DAG model.</p>
        <p>Proposition 8 (robust ‘propagation’ of
causality at unshielded triples with significantly
positive conditional 3-pt information).
Assuming that the underlying graphical model is a DAG
G on V , 8 X, Y, Z 2 V and 8{ Ui} ✓ V\{X,Y,Z}
s.t. I0(X; Y |{Ui}, Z) &lt; 0 (i.e. no XY edge) and
I0(X; Y ; Z|{Ui}) &gt; 0, then if X, Y, Z form an
unshielded triple with one already known converging
orientation, X ! Z⇤ Y , with probability PX! Z &gt; 1/2,
this orientation should be ‘propagated’ to the second
edge as X ! Z ! Y , with probability,</p>
        <p>PZ! Y = PX! Z
✓</p>
        <p>1
1 + e NI0(X;Y ;Z|{Ui})
1 ◆
2
+
Proof. This results is shown using the probability
decomposition formula,</p>
        <p>PX! Z! Y
PZ! Y = PX! Z PX! Z Y + PX! Z! Y</p>
        <p>
          + (1
We have tested the 3o↵2 method on a range of
benchmark networks of 50 nodes with up to 160
edges generated with the causal modeling tool Tetrad
IV (http://www.phil.cmu.edu/tetrad). The
average connectivity hki of these benchmark networks
ranges between 1.6 to 6.4, and the average
maximal in/out-degree between 3.2 to 8.8 (see
Table S1 for a detailed description). The
evaluation metrics are the Precision, P rec = T P/(T P +
F P ), the Recall, Rec = T P/(T P + F N ) and
the F score = 2P rec.Rec/(P rec + Rec). However,
in order to take into account the
orientation/nonorientation of edges in the predicted networks and
compare them with the CPDAG of the benchmark
graphs, we define orientation-dependent counts as,
T P 0 = T P T Pmisorient and F P 0 = F P + T Pmisorient,
where T Pmisorient corresponds to all true positive
edges of the skeleton with di↵erent
orientation/nonorientation status as in the CPDAG reference.
The first methods used for comparison with 3o↵2
are the PC-stable algorithm
          <xref ref-type="bibr" rid="ref2">(Colombo and Maathuis,
2014)</xref>
          with conservative (Ramsey et al, 2006) or
majority orientation rules, implemented in the pcalg
package (Kalisch et al., 2012; Kalisch and Bu¨hlmann, 2008)
and the hybrid method MMHC combining
constraintbased skeleton and Bayesian orientation
(Tsamardinos, Brown, and Aliferis, 2006), implemented in the
Algorithm 2: 3o↵2 Orientation / Propagation Step
In: Graph skeleton from Algorithm 1 and
corresponding conditional 3-point information I0(X; Y ; Z|{Ui}).
Out: Partially oriented causal graph G with edge
orientation probabilities.
3o↵2 Orientation / Propagation Step
sort list of unshielded triples, Lc = {hX, Z, Y iX 6 Y },
in decreasing order of their orientation/propagation
probability initialized at 1/2 and computed from:
- (i.) Proposition 7, if I0(X; Y ; Z|{Ui}) &lt; 0, or
- (ii.) Proposition 8, if I0(X; Y ; Z|{Ui}) &gt; 0
repeat
        </p>
        <p>Take hX, Z, Y iX 6 Y 2 L c with highest orientation
/ propagation probability &gt; 1/2.
if I0(X; Y ; Z|{Ui}) &lt; 0 then
else
end</p>
        <p>Orient/propagate edge direction(s) to form a
v-structure X ! Z Y with probabilities
PX! Z and PY ! Z given by Proposition 7.
Propagate second edge direction to form a
non-v-structure X ! Z ! Y assigning
probability PZ! Y from Proposition 8.</p>
        <p>Apply new orientation(s) and sort remaining list
of unshielded triples Lc L c\hX, Z, Y iX 6 Y after
updating propagation probabilities.
until no additional orient./propa. probability &gt; 1/2 ;
bnlearn package (Scutari, 2010). Figs. 1-5 give the
average CPDAG comparison results over 100 dataset
replicates from 5 di↵erent benchmark networks
(Table S1). The causal graphical models predicted by the
3o↵2 method are obtained using either the MDL/BIC
or the NML complexities (see Supplementary
Methods). Figs. S1-S6 provide additional results on the
prediction of the network skeletons and execution times.
The PC and MMHC results are shown, Figs. 1-5, for
an independence test parameter ↵ = 0.1, as
reducing ↵ tends to worsen the CPDAG F-score for
benchmark networks with hki &gt; 1.6 (Figs. S7-S18). All in
all, we found that the 3o↵2 method outperforms both
PC-stable and MMHC methods on all tested datasets,
Figs. 1-5.</p>
        <p>Additional comparisons were obtained with Bayesian
inference implemented in the bnlearn package
(Scutari, 2010), using AIC, BDe and BIC/MDL scores
and hill-climbing heuristics with 30 to 100 random
restarts, Figs. S19-S30. 3o↵2 reaches equivalent or
significantly better F-scores than Bayesian hill-climbing
for all dataset sizes on benchmark networks up to 120
edges (hki 6 4.8). In particular, 3o↵2 with MDL
scores reaches excellent F-scores on sparse networks
(Figs. S19 &amp; S20) and keeps one of the best F-scores
over all sample sizes for less sparse networks when
combined to NML complexity (Figs. S21 &amp; S22). For
somewhat denser networks (hki ' 5), the 3o↵2
Fscore appears slightly lower than for Bayesian
inference methods, Fig. S23, although it eventually
becomes equivalent for large datasets (N &gt; 1000).
On denser networks (hki &gt; 5 6), Bayesian inference
exhibits better F-scores than 3o↵2 , in particular with
AIC score, Fig. S24. However, the good performance
with AIC strongly relies on its high Recall (but low
Precision), due to its very small penalty term on large
datasets, which makes it favor more complex networks
(Figs. S24) but perform very poorly on sparse graphs
(Figs. S19-S21). By contrast, the reconstruction of
dense networks is impeded with the 3o↵2 scheme, as it
is not always possible to uncover structural
independencies, I(X; Y |{Ui}n) ' 0, in dense graphs through an
ordered set {Ui}n with only positive conditional 3-point
information, I0(X; Y ; Uk|{Ui}k 1) &gt; 0. Indeed in
complex graphs, there are typically many indirect paths
X ! Uj ! Y between unconnected node pairs (X, Y ).
At the beginning of the pruning process, this is prone
to suggest likely v-structures X ! Y Uj , instead
of the correct non-v-structures, X ! Uj ! Y (for
instance if I(X; Uj ) ⌧ I(X; Y ), I(X; Uj ) ⌧ I(Uj ; Y )
and I(X; Uj ) I(X; Uj |Y ) = I(X; Y ; Uj ) &lt; 0, for all
j). Such elimination of F N edge X ! Uj and
conservation of F P X ! Y tend to decrease both
Precision and Recall, although 3o↵2 remains significantly
more robust than PC and MMHC, Fig. 5. Besides,
for most practical applications on real life data,
interpretable causal models should remain relatively sparse
and avoid to display multiple indirected paths between
unconnected nodes.</p>
        <p>Finally, 3o↵2 running times on these benchmark
networks are similar to MMHC and Bayesian hill-climbing
heuristic methods (with 100 restarts) and 10 to 100
times faster than PC for large datasets, Figs. S1-S30.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>DISCUSSION</title>
      <p>In this paper, we propose to combine constraint-based
and score-based frameworks to improve network
reconstruction. Earlier hybrid methods, including MMHC,
have also attempted to exploit the best of these two
types of inference approaches by combining the
robustness of Bayesian scores with the attractive
conceptual features of constraint-based approaches (Dash
1.0
0.8
0.6
0.4
0.2
0.0
1.0
0.8
0.6
0.4
0.2
0.0
1.0
0.8
0.6
0.4
0.2
0.0
1.0
0.8
0.6
0.2
0.0
1.0
0.8
0.6
0.4
0.2
1.0
0.8
0.6
0.4
0.2
0.0
100 1000 10000 50000
50n. 40e. Recall TP/(TP+FN)
100
1000
0.0
100 1000 10000 50000
50n. 80e. Recall TP/(TP+FN)
100
1000
and Druzdzel, 1999; Tsamardinos, Brown, and
Aliferis, 2006; Cano, Gomez-Olmedo, and Moral, 2008;
Claassen and Heskes, 2012). In particular, (Dash
and Druzdzel, 1999) have proposed to exploit an
intrinsic weakness of the PC algorithm, its sensitivity
to the order in which conditional independencies are
tested on finite data, to rank these di↵erent
orderdependent PC predictions with Bayesian scores. More
recently, (Claassen and Heskes, 2012) have also
combined constraint-based and Bayesian approaches to
improve the reliability of causal inference. They proposed
to use Bayesian scores to directly assess the reliability
of conditional independencies by summing the
likelihoods over compatible graphs. By contrast, we
propose to use Bayesian scores to progressively uncover
the best supported conditional independencies, by
iteratively “taking o↵” the most likely indirect
contributions of conditional 3-point information from every
2-point (mutual) information of the causal graph. In
addition, using likelihood ratios (Eqs.11 &amp; 12) instead
of likelihood sums (Claassen and Heskes, 2012)
circumvents the need to score conditional independencies over
a potentially intractable number of compatible graphs.
All in all, we found that 3o↵2 outperforms
constraintbased, search-and-score and earlier hybrid methods on
a range of benchmark networks, while displaying
similar running times as hill-climbing heuristic methods.</p>
      <sec id="sec-3-1">
        <title>Acknowledgements</title>
        <p>S.A. acknowledges support from Ministry of Research
and Association pour la Recherche contre le Cancer.</p>
        <p>Rissanen, J. 1978. Modeling by shortest data
descrip</p>
        <p>tion. Automatica vol. 14:465–471.</p>
        <p>Roos, T.; Silander, T.; Kontkanen, P.; and Myllym¨aki,</p>
        <p>P. 2008. Bayesian network structure learning using
factorized nml universal models. In Proc. 2008
Information Theory and Applications Workshop
(ITA2008). IEEE Press. invited paper.</p>
        <p>Sanov, I. 1957. On the probability of large deviations</p>
        <p>of random variables. Mat. Sbornik 42:11–44.</p>
        <p>Scutari, M. 2010. Learning Bayesian Networks with
the bnlearn R Package. Journal of Statistical
Software 35(3):1–22.</p>
        <p>Shtarkov, Y. M. 1987. Universal sequential coding of
single messages. Problems of Information
Transmission (Translated from) 23(3):3–17.</p>
        <p>Spirtes, P., and Glymour, C. 1991. An algorithm for
fast recovery of sparse causal graphs. Social Science</p>
        <p>Computer Review 9:62–72.</p>
        <p>Spirtes, P.; Glymour, C.; and Scheines, R. 2000.
Causation, Prediction, and Search. The MIT Press,</p>
        <p>Cambridge, Massachusetts, 2nd edition.</p>
        <p>Tsamardinos, I.; Brown, L. E.; and Aliferis, C. F.</p>
        <p>2006. The Max-Min Hill-Climbing Bayesian
Network Structure Learning Algorithm. Machine</p>
        <p>Learning 65(1):31–78.</p>
        <p>SUPPLEMENTARY INFORMATION
for the manuscript “Robust reconstruction of causal graphical models
based on conditional 2-point and 3-point information”</p>
        <p>S´everine A↵eldt, Herv´e Isambert
Institut Curie, Research Center, UMR168, 26 rue d’Ulm, 75005, Paris France;
and Universit´e Pierre et Marie Curie, 4 Place Jussieu, 75005, Paris, France</p>
        <p>herve.isambert@curie.fr
SUPPLEMENTARY</p>
        <p>METHODS
Complexity of graphical models
The complexity kG,D of a graphical model is related to
the normalization constant Z(G, D) of its maximum
likelihood as kG,D = log Z(G, D),</p>
        <p>LG =</p>
        <p>Z(G, D)</p>
        <p>= e NH(G,D) kG,D
For Bayesian networks with decomposable entropy,
i.e. H(G, D) = Pi H(Xi|{PaXi }), it is convenient to
use decomposable complexities, kG,D = Pi kXi|{PaXi },
such that the comparison between alternative models
G and G\X! Y (i.e. G with one missing edge X ! Y )
leads to a simple local increment of the score,</p>
        <p>LG\X!Y = e</p>
        <p>LG</p>
        <p>NI(X;Y |{PaY }\X )+ kY |{PaY }\X
I(X; Y |{PaY }\X ) = H(Y |{PaY }\X )</p>
        <p>H(Y |{PaY }) &gt; 0
kY |{PaY }\X = kY |{PaY }</p>
        <p>
          kY |{PaY }\X &gt; 0
A common complexity criteria in model selection is
the Bayesian Information Criteria (BIC) or Minimal
Description Length (MDL) criteria
          <xref ref-type="bibr" rid="ref3">(Rissanen, 1978;
Hansen and Yu, 2001)</xref>
          ,
where rx, ry and rj are the number of levels of each
variable, x, y and j. The MDL complexity, Eq.4, is
simply related to the normalisation constant of the
        </p>
        <p>MDL
kY |{PaY } =</p>
        <p>
          MDL 1
kY |{PaY }\X = 2
normal distribution reached in the asymptotic limit of
a large dataset N ! 1 (Central Limit Theorem). The
MDL complexity can also be derived from the
Stirling approximation on the Bayesian measure
          <xref ref-type="bibr" rid="ref1">(Schwarz,
1978; Bouckaert, 1993)</xref>
          . Yet, in practice, this
central limit distribution is only reached for very large
datasets, as some of the least-likely (ry 1) Qj rj
combinations of states of variables are in fact rarely (if
ever) sampled in typical finite datasets. As a result,
the MDL complexity criteria tends to underestimate
the relevance of edges connecting variables with many
levels, ri, leading to the removal of false negative edges.
        </p>
        <p>
          To avoid such biases with finite datasets, the
normalisation of the maximum likelihood can be done over
all possible datasets with the same number N of data
points. This corresponds to the (universal)
Normalized Maximum Likelihood (NML) criteria
          <xref ref-type="bibr" rid="ref4 ref7">(Shtarkov,
1987; Rissanen and Tabus, 2005; Kontkanen and
Myllym¨aki, 2007; Roos et al., 2008)</xref>
          ,
        </p>
        <p>e NH(G,D)</p>
        <p>
          LG = P|D0|=N e NH(G,D0) = e NH(G,D) kGNM,DL
We introduce here the factorized version of the NML
criteria
          <xref ref-type="bibr" rid="ref4">(Kontkanen and Myllym¨aki, 2007; Roos et
al., 2008)</xref>
          which corresponds to a decomposable NML
score, kGNM, DL = PXi kXNMiL|{PaXi }, defined as,
        </p>
        <p>qy
kYNM|L{PaY } = X log CNryyj</p>
        <p>j</p>
        <p>NML
kY |{PaY }\X =
qy
X log CNryyj
j
qy/rx
X
j0
log CNryyj0
(6)
(7)
(8)
where Nyj is the number of data points
corresponding to the jth state of the parents of Y , {PaY }, and
Nyj0 the number of data points corresponding to the
j0th state of the parents of Y , excluding X, {PaY }\X .</p>
        <p>Hence, the factorized NML score for each node Xi
corresponds to a separate normalisation for each state
where Nijk corresponds to the number of data points
for which the ith node is in its kth state and its parents
in their jth state, with Nij = Prki Nijk. The universal
normalization constant Cnr is then obtained by
averaging over all possible partitions of the n data points
into a maximum of r subsets, `1 + `2 + · · · + `r = n
with `k &gt; 0,</p>
        <p>Cnr =</p>
        <p>X
`1+`2+···+`r=n `1!`2! · · · `r! k=1</p>
        <p>n
n!
r ✓ `k ◆`k</p>
        <p>
          Y
which can in fact be computed in linear-time using the
following recursion
          <xref ref-type="bibr" rid="ref4">(Kontkanen and Myllym¨aki, 2007)</xref>
          ,
        </p>
        <p>Cnr = Cn
r 1 +
r
n</p>
        <p>r 2
2 Cn
with C0r = 1 for all r, Cn1 = 1 for all n and applying the
general formula Eq.12 for r = 2,
h=0
n ✓n◆ ✓ h ◆h ✓ n
Cn2 = X
h
n
n</p>
        <p>
          h ◆n h
or its Szpankowski approximation for large n (needed
for n &gt; 1000 in practice)
          <xref ref-type="bibr" rid="ref5 ref6">(Szpankowski, 2001;
Kontkanen et al., 2003; Kontkanen, 2009)</xref>
          ,
        </p>
        <p>Cn2 =
'
r n⇡</p>
        <p>2
r n⇡
2</p>
        <p>1 +
exp
2 r 2
3 n⇡
r 8
9n⇡
+
+</p>
        <p>1
12n + O
✓ 1 ◆!</p>
        <p>n3/2
3⇡
36n⇡
j = 1, ..., qi of its parents and involving exactly Nij
data points of the finite dataset,
into a {Ui}-dependent complexity term, kX;Y |{Ui},
with the same XY -symmetry as I(X; Y |{Ui}),
Note, in particular, that the MDL complexity term
in Eq.18 is readily obtained from Eq.5 due to the
Markov equivalence of the MDL score, corresponding
to its XY -symmetry whenever {PaY }\X = {PaX }\Y .</p>
        <p>By contrast, the factorized NML score, Eq.7, is
not a Markov-equivalent score (although its
nonfactorized version, Eq.6, is Markov equivalent by
definition). To circumvent this non-equivalence of
factorized NML score, we propose to recover the expected
XY -symmetry of kXNM;LY |{Ui} through the simple XY
symmetrization of Eq.8, leading to Eq.19.</p>
        <p>Then, following the rationale of constraint-based
approaches, we can reformulate the likelihood ratio of
Eq. 3 by replacing the parent nodes {PaY }\X in the
conditional mutual information, I(X; Y |{PaY }\X ),
with an unknown separation set {Ui} to be learnt
simultaneously with the missing edge candidate XY ,</p>
        <p>LG\XY |{Ui} = e NI(X;Y |{Ui})+kX;Y |{Ui}</p>
        <p>LG
where we have also transformed the asymmetric
parent-dependent complexity di↵erence,</p>
        <p>kY |{PaY }\X ,</p>
        <p>Schwarz, G. 1978. Estimating the dimension of a</p>
        <p>model. The Annals of Statistics 6:461–464.</p>
        <p>Shtarkov, Y. M. 1987. Universal sequential coding of
single messages. Problems of Information
Transmission (Translated from) 23(3):3–17.</p>
        <p>Szpankowski, W. 2001. Average case analysis of
algo</p>
        <p>rithms on sequences. John Wiley &amp; Sons.
hki
0.8
50
50
50
50
50
40
0.2
0.0</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
100
1000
0.2
0.0</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
100
1000
0.2
0.0</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
100
1000</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
50n. 160e. Recall TP/(TP+FN)</p>
        <p>Execution time (sec.)
100
1000
5000
100
1000
5000</p>
        <p>1.0
0.8</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)</p>
        <p>1.0
0.8
Figure S15: MMHC, e↵ect of independence test parameter ↵ . 50 node, 60 edge benchmark networks generated using
Tetrad. hki = 2.4, hkminaxi = 4.6 and hkmouatxi = 3.6. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and
Aliferis, 2006).</p>
        <p>MMHC [BDe, 100 rst.]
50n. 80e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)</p>
        <p>1.0
0.8
Figure S17: MMHC, e↵ect of independence test parameter ↵ . 50 node, 120 edge benchmark networks generated using
Tetrad. hki = 4.8, hkminaxi = 8.8 and hkmouatxi = 7.2. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and
Aliferis, 2006).</p>
        <p>MMHC [BDe, 100 rst.]
50n. 160e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)</p>
        <p>1.0
0.8
1000
5000
100
1000</p>
        <p>5000
50n. 20e. Recall TP/(TP+FN)</p>
        <p>PC 1e-1 cpdag
HC AIC cpdag
HC BDe cpdag
HC BIC cpdag
3off2 MDL cpdag
3off2 NML cpdag
PC 1e-1 cpdag
HC AIC cpdag
HC BDe cpdag
HC BIC cpdag
3off2 MDL cpdag
3off2 NML cpdag
100
1000</p>
        <p>5000</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
100
1000
5000
100
1000</p>
        <p>5000
50n. 40e. Recall TP/(TP+FN)
100
1000</p>
        <p>5000
1000
5000
100
1000</p>
        <p>5000
50n. 60e. Recall TP/(TP+FN)</p>
        <p>PC 1e-1 cpdag
HC AIC cpdag
HC BDe cpdag
HC BIC cpdag
3off2 MDL cpdag
3off2 NML cpdag
PC 1e-1 cpdag
HC AIC cpdag
HC BDe cpdag
HC BIC cpdag
3off2 MDL cpdag
3off2 NML cpdag
100
1000</p>
        <p>5000
50n. 80e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
100
1000
5000
100
1000</p>
        <p>5000
50n. 80e. Recall TP/(TP+FN)
100
1000</p>
        <p>5000
1000
5000
100
1000</p>
        <p>5000
50n. 120e. Recall TP/(TP+FN)</p>
        <p>PC 1e-1 cpdag
HC AIC cpdag
HC BDe cpdag
HC BIC cpdag
3off2 MDL cpdag
3off2 NML cpdag
PC 1e-1 cpdag
HC AIC cpdag
HC BDe cpdag
HC BIC cpdag
3off2 MDL cpdag
3off2 NML cpdag
100
1000</p>
        <p>5000
50n. 160e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
100
1000
5000
100
1000</p>
        <p>5000
50n. 160e. Recall TP/(TP+FN)
100
1000</p>
        <p>5000
50n. 20e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
50n. 40e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
AIC 30rst. skeleton
AIC 100rst. skeleton
BDe 30rst. skeleton
BDe 100rst. skeleton
BIC 30rst. skeleton
BIC 100rst. skeleton
AIC 30rst. cpdag
AIC 100rst. cpdag
BDe 30rst. cpdag
BDe 100rst. cpdag
BIC 30rst. cpdag
BIC 100rst. cpdag
100
1000
5000
100
1000
5000
50n. 20e. Recall TP/(TP+FN)</p>
        <p>Execution time (sec.)
100
1000
5000
100
1000
5000
50n. 40e. Recall TP/(TP+FN)</p>
        <p>Execution time (sec.)
100
1000
5000
100
1000
5000
50n. 60e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
50n. 80e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
AIC 30rst. skeleton
AIC 100rst. skeleton
BDe 30rst. skeleton
BDe 100rst. skeleton
BIC 30rst. skeleton
BIC 100rst. skeleton
AIC 30rst. cpdag
AIC 100rst. cpdag
BDe 30rst. cpdag
BDe 100rst. cpdag
BIC 30rst. cpdag
BIC 100rst. cpdag
100
1000
5000
100
1000
5000
50n. 60e. Recall TP/(TP+FN)</p>
        <p>Execution time (sec.)
100
1000
5000
100
1000
5000
50n. 80e. Recall TP/(TP+FN)</p>
        <p>Execution time (sec.)
100
1000
5000
100
1000
5000
50n. 120e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
1e4
1e3
1e2
1e1
1e3
1e2
1e1
50n. 160e. Precision TP/(TP+FP)</p>
        <p>Fscore 2.Prec.Rec./(Prec.+Rec.)
AIC 30rst. skeleton
AIC 100rst. skeleton
BDe 30rst. skeleton
BDe 100rst. skeleton
BIC 30rst. skeleton
BIC 100rst. skeleton
AIC 30rst. cpdag
AIC 100rst. cpdag
BDe 30rst. cpdag
BDe 100rst. cpdag
BIC 30rst. cpdag
BIC 100rst. cpdag
AIC 30rst. skeleton
AIC 100rst. skeleton
BDe 30rst. skeleton
BDe 100rst. skeleton
BIC 30rst. skeleton
BIC 100rst. skeleton
AIC 30rst. cpdag
AIC 100rst. cpdag
BDe 30rst. cpdag
BDe 100rst. cpdag
BIC 30rst. cpdag
BIC 100rst. cpdag
50n. 120e. Recall TP/(TP+FN)</p>
        <p>Execution time (sec.)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bouckaert</surname>
            ,
            <given-names>R. R.</given-names>
          </string-name>
          <year>1993</year>
          .
          <article-title>Probabilistic network construction using the minimum description length principle</article-title>
          .
          <source>in Symbolic and Quantitative Approaches to Reasoning</source>
          and
          <string-name>
            <surname>Uncertainty (Clarke</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kruse</surname>
            <given-names>R</given-names>
          </string-name>
          , Moral S, eds)
          <volume>747</volume>
          :
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Colombo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Maathuis</surname>
            ,
            <given-names>M. H.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Order-independent constraint-based causal structure learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          :
          <fpage>3741</fpage>
          -
          <lpage>3782</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>M. H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Model selection and the principle of minimum description length</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>96</volume>
          :
          <fpage>746</fpage>
          -
          <lpage>774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Kontkanen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and Myllym¨aki, P.
          <year>2007</year>
          .
          <article-title>A linear-time algorithm for computing the multinomial stochastic complexity</article-title>
          .
          <source>Inf. Process. Lett</source>
          .
          <volume>103</volume>
          (
          <issue>6</issue>
          ):
          <fpage>227</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Kontkanen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Buntine</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Myllym¨aki, P.;
          <string-name>
            <surname>Rissanen</surname>
            , J.; and Tirri,
            <given-names>H.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Ecient computation of stochastic complexity</article-title>
          . in: C.
          <string-name>
            <surname>Bishop</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          Frey (Eds.)
          <source>Proceedings of the Ninth International Conference on Artificial Intelligence and Statistics, Society for Artificial Intelligence and Statistics</source>
          <volume>103</volume>
          :
          <fpage>233</fpage>
          -
          <lpage>238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Kontkanen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Computationally Ecient Methods for MDL-Optimal Density Estimation and Data Clustering</article-title>
          .
          <source>Ph.D. Dissertation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Rissanen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Tabus</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Kolmogorovs structure function in mdl theory and lossy data compression</article-title>
          .
          <source>In Adv. Min. Descrip. Length Theory Appl</source>
          . MIT Press. Chap.
          <volume>10</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>Figure S14: MMHC, e↵ect of independence test parameter ↵ . 50 node, 40 edge benchmark networks generated using Tetrad. hki = 1.6, hkminaxi = 3.2 and hkmouatxi = 3.6. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and</article-title>
          <string-name>
            <surname>Aliferis</surname>
          </string-name>
          ,
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>Figure S16: MMHC, e↵ect of independence test parameter ↵ . 50 node, 80 edge benchmark networks generated using Tetrad. hki = 3.2, hkminaxi = 4.8 and hkmouatxi = 5.6. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and</article-title>
          <string-name>
            <surname>Aliferis</surname>
          </string-name>
          ,
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>Figure S18: MMHC, e↵ect of independence test parameter ↵ . 50 node, 160 edge benchmark networks generated using Tetrad. hki = 6.4, hkminaxi = 8.6 and hkmouatxi = 8.6. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and</article-title>
          <string-name>
            <surname>Aliferis</surname>
          </string-name>
          ,
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>