INTRODUCTION

Robust reconstruction of causal graphical models based on conditional 2-point and 3-point information

S´everine A↵eldt

Herv´e Isambert

herve.isambert@curie.fr 0 0 Institut Curie, Research Center , CNRS, UMR168, 26 rue d'Ulm, 75005 , Paris France; and Universit ́e Pierre et Marie Curie , 4 Place Jussieu, 75005, Paris , France

2008

14 465 471

We report a novel network reconstruction method, which combines constraintbased and Bayesian frameworks to reliably reconstruct graphical models despite inherent sampling noise in finite observational datasets. The approach is based on an information theory result tracing back the existence of colliders in graphical models to negative conditional 3-point information between observed variables. In turn, this provides a confident assessment of structural independencies in causal graphs, based on the ranking of their most likely contributing nodes with (significantly) positive conditional 3-point information. Starting from a complete undirected graph, dispensible edges are progressively pruned by iteratively “taking o↵” the most likely positive conditional 3-point information from the 2-point (mutual) information between each pair of nodes. The resulting network skeleton is then partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. This “3o↵2” network reconstruction approach is shown to outperform constraint-based, search-and-score and earlier hybrid methods on a range of benchmark networks.

INTRODUCTION

The prospect of learning the direction of causal dependencies from mere correlations in observational data has long defied practical implementations (Reichenbach, 1956). The fact that causal relationships can, to some extent, be inferred from nontemporal statistical data is now known to hinge on the unique statistical imprint of colliders in causal graphical models, provided that certain assumptions are made about the underlying process of data generation, such as its faithfulness to a tree structure (Rebane and Pearl, 1988) or a directed acyclic graph model (Spirtes, Glymour, and Scheines, 2000; Pearl, 2009).

These early findings led to the developments of two types of network reconstruction approaches; on the one hand, search and score methods (Cooper and Herskovits, 1992; Heckerman, Geiger, and Chickering, 1995; Chickering, 2002) need heuristic strategies, such as hill-climbing algorithms, to sample network space, on the other hand, constraint-based methods, such as the PC (Spirtes and Glymour, 1991) and IC (Pearl and Verma, 1991) algorithms, rely on the identification of structural independencies, that correspond to edges to be removed from the underlying network (Spirtes, Glymour, and Scheines, 2000; Pearl, 2009). Yet, early errors in removing edges from the complete graph often lead to the accumulation of compensatory errors later on in the pruning process. Hence, despite recent, more stable implementations intending to overcome order-dependency in the pruning process (Colombo and Maathuis, 2014) , constraint-based methods are not robust to sampling noise in finite datasets. In this paper, we present a more robust constrainedbased method and corresponding 3o↵2 algorithm. It is directly inspired by the PC anc IC algorithms but relies on a quantitative information theoretic framework to reliably uncover conditional independencies in finite datasets and subsequently orient and propagate edge directions between connected variables. 2.1

RESULTS

UNCOVERING CAUSALITY FROM A STABLE / FAITHFUL DISTRIBUTION Consider a network G = (V, E) and a stable (or faithful) distribution P (X) over V , implying that each structural independency (i.e. missing edge XY in G) corresponds to a vanishing conditional 2-point (mutual) information and reciprocally as, (X? ?

Y |{Ui})G () () (X? ?

Y |{Ui})P I(X; Y |{Ui}) = 0 (1) (2) Eq.1 assumes, in particular, that P (X) is a theoretical distribution, defined by a formal expression of its variables X = {X, Y, U1, U2, . . .}. Note, however, that no such expression is known a priori, in general, and P (X) must typically be estimated from the available data. In principle, an infinite amount of data would be necessary to infer an ‘exact’ stable distribution P (X) consistent with Eq.1. In the following, we will first assume that such an infinite amount of data is available and distributed as a stable P (X) to establish how causality can be inferred statistically from conditional 2-point and 3-point information. We will then consider the more realistic situation for which P (X) is not known exactly and must be estimated from a finite amount of data.

Let us first recall the generic decomposition of a conditional 2-point (or mutual) information I(X; Y |{Ui}) by the introduction of a third node Z and the conditional 3-point information I(X; Y ; Z|{Ui}),

I(X;Y |{Ui}) = I(X;Y;Z|{Ui}) + I(X;Y |{Ui}, Z) (3) This relation can be taken as the definition of conditional 3-point information I(X; Y ; Z|{Ui}) which is in fact symmetric in X, Y and Z,

I(X; Y ; Z|{Ui}) = I(X; Y |{Ui}) = I(X; Z|{Ui}) = I(Y ; Z|{Ui})

I(X; Y |{Ui}, Z) I(X; Z|{Ui}, Y ) I(Y ; Z|{Ui}, X) Note that Eq.3 is always valid, regardless of any assumption on the underlying graphical model and of the amount of data available to estimate conditional 2-point and 3-point information terms. Eq.3 will be used to prove the following lemmas and propositions, which trace back the origin of necessary causal relationships in a graphical model to the existence of a negative conditional 3-point information between three variables {X, Y, Z}, I(X; Y ; Z|{Ui}) < 0, where {Ui} accounts for a structural independency between two of them, e.g. I(X; Y |{Ui}) = 0 (see Theorem 4). Lemma 1. Given a stable distribution P (X) on V , 8 X, Y 2 V not adjacent in G, 9{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0 and 8 Z 6= X, Y, {Ui}, I(X; Y ; Z|{Ui}) 6 0.

Proof. If X, Y 2 V are not adjacent in G, this corresponds to a structural independency, i.e. 9{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0. Then 8 Z 6= X, Y, {Ui} Eq.3 implies I(X; Y ; Z|{Ui}) = I(X; Y |{Ui}, Z) 6 0, as conditional mutual information is always positive. ⇤ Corollary 2 (3-point contribution). 8 X, Y, Z 2 V and 8{ Ui} ✓ V\{X,Y,Z} s.t. I(X; Y ; Z|{Ui}) > 0, then I(X; Y |{Ui}) > 0 (as well as I(X; Z|{Ui}) > 0 and I(Y ; Z|{Ui}) > 0 by symmetry of I(X; Y ; Z|{Ui})). Corollary 2, which is a direct consequence of Eq.3 and the positivity of mutual information, will be the basis of the 3o↵2 causal network reconstruction algorithm, which iteratively “takes o↵” 3-point information from 2-point information, as I(X; Y |{Ui}) I(X; Y ; Z|{Ui}) = I(X; Y |{Ui}, Z), and update {Ui} { Ui} + Z as long as there remains some Z 2 V with (significantly) positive conditional 3-point information I(X; Y ; Z|{Ui}) > 0.

Lemma 3 (vanishing conditional 2-point and 3point information in undirected networks). If G is an undirected (Markov) network, 8 X, Y 2 V and 8{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0, then 8 Z 6= X, Y, {Ui}, I(X; Y ; Z|{Ui}) = 0.

Proof. If G is a Markov network, 8 X, Y 2 V and 8{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0, then 8 Z 6= X, Y, {Ui}, I(X; Y |{Ui}, Z) = 0 as conditioning observation cannot induce correlations in Markov networks (Koller and Friedman, 2009). This implies that I(X; Y ; Z|{Ui}) = 0 through Eq.3. ⇤ Note, however, that the converse of Lemma 3 is not true. Namely, (partially) directed networks can also have vanishing conditional 3-point information associated to all their structural independencies. In particular, tree-like bayesian networks without colliders (i.e. without v-structures, X ! Z Y ) present only vanishing 3-point information associated to their structural independencies, i.e. I(X; Y ; Z|{Ui}) = 0, 8 X, Y, Z, {Ui} 2 V s.t. I(X; Y |{Ui}) = 0. However, such a directed network must be Markov equivalent to an undirected network corresponding to the same structural independencies but lacking any trace of causal relationships (i.e. no directed edges). The probability distributions faithful to such directed networks do not contain evidence of obligate causality; i.e. no directed edges can be unambiguously oriented. The following Theorem 4 establishes the existence of negative conditional 3-point information as statistical evidence of obligate causality in graphical models. For the purpose of generality in this section, we do not exclude the possibility that unobserved ‘latent’ variables might mediate the causal relationships among observed variables. However, this requires dissociating the labelling of the two endpoints of each edges. Let us first introduce three di↵erent endpoint marks associated to such edges in mixed graphs: they are the tail ( ), the head (>) and the unspecified ( ) endpoint marks. In addition, we will use the asterisk symbol (⇤ ) as a wild card denoting any of the three marks. Theorem 4 (negative conditional 3-point information as statistical evidence of causality). If 9 X, Y, Z 2 V and {Ui} ✓ V\{X,Y,Z} s.t. I(X; Y |{Ui}) = 0 and I(X; Y ; Z|{Ui}) < 0 then, G is (partially) directed, i.e. some variables in G are causally linked, either directly or indirectly through other variables, including possibly unknown, ‘latent’ variables unobserved in G.

Proof. Theorem 4 is the contrapositive of Lemma 3, with the additional use of Lemma 1. ⇤ Proposition 5 (origin of causality at unshielded triples with negative conditional 3-point information). for all unshielded triple, X ⇤ Z ⇤ Y , 9{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0, if Z 2/ { Ui} then I(X; Y ; Z|{Ui}) < 0 and the unshielded triple should be oriented as X !⇤ Z ⇤ Y .

Proof. if I(X; Y |{Ui}) = 0 with Z 2/{ Ui}, the unshielded triple has to be a collider and I(X; Y |{Ui}, Z) > 0, by faithfulness, hence, I(X; Y ; Z|{Ui}) < 0 by Eq.3. ⇤ Hence, the origin of causality manifests itself in the form of colliders or v-structures in graphical models which reveal ‘genuine’ causations (X! Z or Y ! Z) or, alternatively, ‘possible’ causations (X ! Z or Y ! Z), provided that the corresponding correlations are not due to unobserved ‘latent’ variables L or L0 as, X L99 L 99K Z or Y L99 L0 99K Z.

Following the rationale of constraint-based approaches, it is then possible to ‘propagate’ further the orientations downstream of colliders, through positive (conditional) 3-point information, if one assumes that the underlying distribution P (X) is faithful to an ancestral graph G on V . An ancestral graph is a mixed graph, that is, with three types of edges, undirected ( ), directed ( or ! ) or bidirectional ($ ), but with i.) no directed cycle, ii.) no almost directed cycle (including one bidirectional edge) and iii.) no undirected edge with incoming arrowhead (such as X !⇤ Z Y ). In particular, Directed Acyclic Graphs (DAG) are subclasses of ancestral graphs (i.e. without undirected nor bidirectional edges).

Proposition 6 (‘propagation’ of causality at unshielded triples with positive conditional 3-pt information). Given a distribution P (X) faithful to an ancestral graph G on V , for all unshielded triple with already one converging orientation, X !⇤ Z ⇤ Y , 9{ Ui} ✓ V\{X,Y } s.t. I(X; Y |{Ui}) = 0, if Z 2 { Ui} then I(X; Y ; Z|{Ui}\Z ) > 0 and the first orientation should be ‘propagated’ to the second edge as X !⇤ Z ! Y . Proof. if I(X; Y |{Ui}) = 0 with Z 2 { Ui}, the unshielded triple cannot be a collider and, since G is assumed to be an ancestral graph, the edge Z Y cannot LD|G LD|G\X!Y = e NI(X;Y |{PaY }\X )

Z(G, D) Z(G\X! Y , D) where I(X; Y |{PaY }\X ) = H(Y |{PaY }\X ) H(Y |{PaY }). However, Eq.6 cannot be used as such be an undirected edge either. Hence, it has to be a directed edge, Z ! Y and I(X; Y ; Z|{Ui}\Z ) > 0 by faithfulness and Eq.3. ⇤ Note that the propagation rule of Proposition 6 can be applied iteratively to successive unshielded triples corresponding to positive conditional 3-point information. Yet, all arrowhead orientations can be ultimately traced back to a negative conditional 3-point information, Theorem 4 and Proposition 5. 2.2

ROBUST RECONSTRUCTION OF CAUSAL GRAPHS FROM FINITE

DATASETS We now turn to the more practically relevant situation of finite datasets consisting of N independent data points. The associated sampling noise will instrinsically limit the accuracy of causal network reconstruction. In particular, conditional independencies cannot be exactly achieved (I(X; Y |{Ui}) = 0) but can be reliably established using statistical criteria that depend on the number of data points N .

Given N independent datapoints from the available data D, let us introduce the maximum likelihood, LD|G , that they might have been generated by the graphical model G (Sanov, 1957),

LD|G = e NH(G,D)

Z(G, D) = eN P{xi} p({xi}) log(q({xi}))

Z(G, D) (4) where H(G, D) = P{xi} p({xi}) log(q({xi})) is the cross entropy between the “true” probability distribution p({xi}) of the data D and the theoretical probability distribution q({xi}) of the model G and Z(G, D) is a data- and model-dependent factor ensuring proper normalization condition. The structural constraints of the model G can be included a priori in the factorization form of the theoretical probability distribution, q({xi}). In particular, if we assume a Bayesian network as underlying graphical model, q({xi}) factorizes as q({xi}) = Qi p(xi|{paxi }), where {paxi } denote the values of the parents of node Xi, {PaXi }, and leads to the following maximum likelihood expression, (5) (6) LD|G = e N Pi H(Xi|{PaXi })

Z(G, D) The model G can then be compared to the alternative model G\X! Y with one additional missing edge X ! Y using the maximum likelihood ratio, to learn the underlying graphical model, as it assumes that the order between the nodes and their parents is already known (see however (de Campos, 2006)). Yet, following the rationale of constraint-based approaches, Eq.6 can be reformulated by replacing the parent nodes with an unknown separation set {Ui} to be learnt simultaneously with the missing edge candidate XY ,

LG LG\XY |{Ui} = e NI(X;Y |{Ui})+kX;Y |{Ui} (7) kX;Y |{Ui} = log Z(G, D)/Z(G\XY |{Ui}, D) where the factor kX;Y |{Ui} > 0 tends to limit the complexity of the models by favoring fewer edges. Namely, the condition, I(X; Y |{Ui}) < kX;Y |{Ui}/N , implies that simpler models compatible with the structural independency, X? ? Y |{Ui}, are more likely than model G, given the finite available dataset. This replaces the ‘perfect’ conditional independency condition, I(X; Y |{Ui}) = 0, valid in the limit of an infinite dataset, N ! 1 . A common complexity criteria in model selection is the Bayesian Information Criteria (BIC) or Minimal Description Length (MDL) criteria (Rissanen, 1978; Hansen and Yu, 2001) , where rx, ry and rui are the number of levels of the corresponding variables. The MDL complexity, Eq.8, is simply related to the normalisation constant of the normal distribution reached in the asymptotic limit of a large dataset N ! 1 (Central Limit Theorem). However, such a central limit distribution is only reached for very large datasets in practice. Alternatively, the normalisation of the maximum likelihood can also be done over all possible datasets including the same number of data points to yield a (universal) Normalized Maximum Likelihood (NML) criteria (Shtarkov, 1987; Rissanen and Tabus, 2005) and its decomposable (Kontkanen and Myllym¨aki, 2007; Roos NML et al., 2008) and XY -symmetric version, kX;Y |{Ui}, defined in the Supplementary Methods.

Then, instead of exploring the combinatorics of sepset composition {Ui} for each missing edge candidate XY as in traditional constraint-based approaches, we propose that Eq.7 can be used to iteratively extend a likely sepset using the maximum likelihood ratios between two successive sepset candidates, i.e. between the already ascertained {Ui} and the possible extended {Ui} + Z, as, using Eq.3 for I(X; Y ; Z|{Ui}) and introducing a similar 3-point complexity conditioned on {Ui} as, kX;Y ;Z|{Ui} = kX;Y |{Ui},Z kX;Y |{Ui} (10) where kX;Y ;Z|{Ui} > 0, unlike 3-point information, I(X; Y ; Z|{Ui}) which can be positive or negative. Introducing also the shifted 2-point and 3-point information for finite datasets as, = = I0(X; Y |{Ui})

I(X; Y |{Ui}) I0(X; Y ; Z|{Ui})

I(X; Y ; Z|{Ui}) + kX;Y |{Ui}

N kX;Y ;Z|{Ui}

N leads to maximum likelihood ratios equivalent to Eqs.7 and 9, As will become apparent in the following discussion, learning, iteratively, the most likely edge to be removed XY and its corresponding separation set {Ui} will imply to simultaneously minimize 2-point information (Eq.11) while maximizing 3-point information (Eq.12).

We start the discussion with 3-point information, Eq.12. The sign and magnitude of shifted conditional 3-point information I0(X; Y ; Z|{Ui}) determine the probability that Z should be included in or excluded from the sepset candidate {Ui}, • If I0(X; Y ; Z|{Ui}) > 0, Z is more likely to be included in {Ui} with probability,

Pnv(X; Y ; Z|{Ui}) =

LD|G\XY |{Ui},Z =

LD|G\XY |{Ui} + LD|G\XY |{Ui},Z

1 1 + e NI0(X;Y ;Z|{Ui}) (13) • If I0(X; Y ; Z|{Ui}) < 0, Z is more likely to be excluded from {Ui}, suggesting obligatory causal relationships in the form of a v-structure or collider between X, Y, Z with probability,

Pv(X; Y ; Z|{Ui}) = 1 =

Pnv(X; Y ; Z|{Ui})

1 1 + eNI0(X;Y ;Z|{Ui}) (14) But, in the case I0(X; Y ; Z|{Ui}) > 0, Eq.12 can also be interpreted as quantifying the likelihood increase that the edge XY should be removed from the model by extending the candidate sepset from {Ui} to {Ui} + Z, i.e. LD|G\XY |{Ui},Z = LD|G\XY |{Ui} ⇥ exp(N I0(X; Y ; Z|{Ui})), with exp(N I0(X; Y ; Z|{Ui})) > 1. Yet, as the 3-point information, I0(X; Y ; Z|{Ui}), is actually symmetric with respect to the variables, X, Y and Z, the factor exp(N I0(X; Y ; Z|{Ui})) > 1 provides in fact the same likelihood increase for the removal of the three edges XY , XZ and ZY , conditioned on the same initial set of nodes {Ui}, namely, However, despite this symmetry of 3-point information, I0(X; Y ; Z|{Ui}), the likelihoods that the edges XY , XZ and ZY should be removed are not the same, as they depend on di↵erent 2-point information, I0(X; Y |{Ui}), I0(X; Z|{Ui}) and I0(Z; Y |{Ui}), Eq.11. In particular, the likelihood ratio between the removals of the alternative edges XY and XZ is given by,

LD|G\XY |{Ui},Z = LD|G\XY |{Ui} = LD|G\XZ|{Ui},Y

LD|G\XZ|{Ui}

Hence, for XY to be the most likely edge to be removed conditioned on the sepset {Ui} + Z, not only Z should contribute through I0(X; Y ; Z|{Ui}) > 0 with probability Pnv(X; Y ; Z|{Ui}) (Eq.13), but XY must also correspond to the ‘weakest’ edge of XY , XZ and ZY conditioned on {Ui}, as given by the lowest conditioned 2-point information, Eq.15. Note that removing the edge XY with the lowest conditional 2-point information is consistent, as expected, with the Data Processing Inequality, I(X; Y |{Ui}) 6 min(I(X; Z|{Ui}), I(Z; Y |{Ui})), in the limit of large datasets. However, quite frequently, XZ or ZY might also have low conditional 2-point information, so that the edge removal associated with the symmetric contribution I(X; Y ; Z|{Ui}) will only be consistent with the Data Processing Inequality (DPI) with probability, Pdpi(XY ; Z|{Ui}) = = = nificantly improves the results obtained by relying solely on the ‘non-v-structure’ probability Pnv(X; Y ; Z|{Ui}). Conversely, the DPI-consistency probability Pdpi(XY ; Z|{Ui}) is not sucient on its own to uncover causal relationships between variables, which require to compute 3-point information I(X; Y ; Z|{Ui}) and the probability Pnv(X; Y ; Z|{Ui}) (see Proposition 7 and Proposition 8, below). To optimize the likelihood that the edge XY can be accounted for by the additional contribution of Z conditioned on previously selected {Ui}, we propose to combine the maximum of 3-point information (Eq.13) and the minimum of 2-point information (Eq.16) by defining the score Slb(Z; XY |{Ui}) as the lower bound of Pnv(X; Y ; Z|{Ui}) and Pdpi(XY ; Z|{Ui}), since both conditions need to be fulfilled to warrant that edge XY is likely to be absent from the model G, Slb(Z;XY |{Ui}) =

h = min Pnv(X; Y ; Z|{Ui}), Pdpi(XY ; Z|{Ui}) i Hence, the pair of nodes XY with the most likely contribution from a third node Z and likely to be absent from the model can be ordered according to their rank R(XY ; Z|{Ui}) defined as,

R(XY ; Z|{Ui}) = max Slb(Z; XY |{Ui})

Z (17) Then, Z can be iteratively added to the set of contributing nodes (i.e. {Ui} { Ui} + Z) of the top edge XY = argmaxXY R(XY ; Z|{Ui}) to progressively recover the most significant indirect contributions to all pairwise mutual information in a causal graph. Implementing this local optimization scheme, the 3o↵2 algorithm eventually learns the network skeleton by collecting the nodes of the separation sets one-by-one, instead of exploring the full combinatorics of sepset composition without any likelihood guidance. Indeed, the 3o↵2 scheme amounts to identify {Ui} by “taking o↵” iteratively the “most likely” conditional 3-point information from each 2-point information as, I(X; Y |{Ui}n) = I(X; Y ) I(X; Y ; U1) I(X; Y ; U2|U1) · · · I(X; Y ; Un|{Ui}n 1) or equivalently between the shifted 2-point and 3-point information terms,

I0(X; Y |{Ui}n) = I0(X; Y ) I0(X; Y ; U1) I0(X; Y ; U2|U1) · · · I0(X; Y ; Un|{Ui}n 1) In practice, taking into account this DPI-consistency probability Pdpi(XY ; Z|{Ui}), as detailed below, sigThis leads to the following Algorithm 1 for the reconstruction of the graph skeleton using the 3o↵2 scheme. while 9 XY edge with R(XY ; Z|{Ui}) > 1/2 do for edge XY with highest rank R(XY ; Z|{Ui}) do expand contributing set {Ui} { Ui} + Z else end if I0(X; Y |{Ui}) < 0 then

XY edge is non-essential and removed separation set of XY : SepXY = {Ui} find next most contributing node Z neighbor of X or Y and compute new 3o↵2 rank: R(XY ; Z|{Ui}) sort the 3o↵2 rank list R(XY ; Z|{Ui}) else end end

Iteration

end end Then, given the skeleton obtained from Algorithm 1, Eqs.13 and 14 lead to the following Proposition 7 and Proposition 8 for the orientation and propagation rules of unshielded triples, which are equivalent to Proposition 5 and Proposition 6 but for underlying DAG models (assuming no latent variables) and for finite datasets with the corresponding probabilities for the initiation/propagation of orientations.

Proposition 7 (Significantly negative condiNote, in particular, that the 3o↵2 scheme to reconstruct graph skeleton is solely based on identifying structural independencies, which can also be applied to graphical models for undirected Markov networks.

Algorithm 1:

3o↵2

Skeleton Reconstruction In: observational data of finite size N Out: skeleton of causal graph G

Initiation

Start with complete undirected graph forall edges XY do if I0(X; Y ) < 0 then

XY edge is non-essential and removed separation set of XY : SepXY = ; find the most contributing node Z neighbor of X or Y and compute 3o↵2 rank, R(XY ; Z|; ) tional 3-point information as robust statistical evidence of causality in finite datasets). Assuming that the underlying graphical model is a DAG G on V , 8 X, Y, Z 2 V and 8{ Ui} ✓ V\{X,Y,Z} s.t. I0(X; Y |{Ui}) < 0 (i.e. no XY edge) and I0(X; Y ; Z|{Ui}) < 0 then, i. if X, Y, Z form an unshielded triple, X then it should be oriented as X ! Z probabilities, Z Y , Y , with PX! Z = PY ! Z = 1 + 3eNI0(X;Y ;Z|{Ui}) 1 + eNI0(X;Y ;Z|{Ui}) ii. similarly, if X, Y, Z form an unshielded triple, with one already known converging arrow, X ! Z Y , with probability PX! Z > PX! Z , then the second edge should be oriented to form a v-structure, X ! Z Y , with probability, PY ! Z = PX! Z ✓

1 1 + eNI0(X;Y ;Z|{Ui}) 1 ◆ 2 + Proof. The implications (i.) and (ii.) rely on Eq.14 to estimate the probability that the two edges form a collider. We start proving (ii.) using the probability decomposition formula:

PX! Z Y PY ! Z = PX! Z PX! Z Y + PX! Z! Y

+ (1 Following the rationale of constraint-based approaches, it is then possible to ‘propagate’ further the orientations downstream of colliders, using Eq.13 for positive (conditional) 3-point information. For simplicity and consistency, we only implement the propagation of orientation based on likelihood ratios, which can be quantified for finite datasets as proposed in the following Proposition 8. In particular, we do not extend the propagation rules (Meek, 1995) to inforce acyclic constraints that are necessary to have a complete reconstruction of the Markov equivalent class of the underlying DAG model.

Proposition 8 (robust ‘propagation’ of causality at unshielded triples with significantly positive conditional 3-pt information). Assuming that the underlying graphical model is a DAG G on V , 8 X, Y, Z 2 V and 8{ Ui} ✓ V\{X,Y,Z} s.t. I0(X; Y |{Ui}, Z) < 0 (i.e. no XY edge) and I0(X; Y ; Z|{Ui}) > 0, then if X, Y, Z form an unshielded triple with one already known converging orientation, X ! Z⇤ Y , with probability PX! Z > 1/2, this orientation should be ‘propagated’ to the second edge as X ! Z ! Y , with probability,

PZ! Y = PX! Z ✓

1 1 + e NI0(X;Y ;Z|{Ui}) 1 ◆ 2 + Proof. This results is shown using the probability decomposition formula,

PX! Z! Y PZ! Y = PX! Z PX! Z Y + PX! Z! Y

+ (1 We have tested the 3o↵2 method on a range of benchmark networks of 50 nodes with up to 160 edges generated with the causal modeling tool Tetrad IV (http://www.phil.cmu.edu/tetrad). The average connectivity hki of these benchmark networks ranges between 1.6 to 6.4, and the average maximal in/out-degree between 3.2 to 8.8 (see Table S1 for a detailed description). The evaluation metrics are the Precision, P rec = T P/(T P + F P ), the Recall, Rec = T P/(T P + F N ) and the F score = 2P rec.Rec/(P rec + Rec). However, in order to take into account the orientation/nonorientation of edges in the predicted networks and compare them with the CPDAG of the benchmark graphs, we define orientation-dependent counts as, T P 0 = T P T Pmisorient and F P 0 = F P + T Pmisorient, where T Pmisorient corresponds to all true positive edges of the skeleton with di↵erent orientation/nonorientation status as in the CPDAG reference. The first methods used for comparison with 3o↵2 are the PC-stable algorithm (Colombo and Maathuis, 2014) with conservative (Ramsey et al, 2006) or majority orientation rules, implemented in the pcalg package (Kalisch et al., 2012; Kalisch and Bu¨hlmann, 2008) and the hybrid method MMHC combining constraintbased skeleton and Bayesian orientation (Tsamardinos, Brown, and Aliferis, 2006), implemented in the Algorithm 2: 3o↵2 Orientation / Propagation Step In: Graph skeleton from Algorithm 1 and corresponding conditional 3-point information I0(X; Y ; Z|{Ui}). Out: Partially oriented causal graph G with edge orientation probabilities. 3o↵2 Orientation / Propagation Step sort list of unshielded triples, Lc = {hX, Z, Y iX 6 Y }, in decreasing order of their orientation/propagation probability initialized at 1/2 and computed from: - (i.) Proposition 7, if I0(X; Y ; Z|{Ui}) < 0, or - (ii.) Proposition 8, if I0(X; Y ; Z|{Ui}) > 0 repeat

Take hX, Z, Y iX 6 Y 2 L c with highest orientation / propagation probability > 1/2. if I0(X; Y ; Z|{Ui}) < 0 then else end

Orient/propagate edge direction(s) to form a v-structure X ! Z Y with probabilities PX! Z and PY ! Z given by Proposition 7. Propagate second edge direction to form a non-v-structure X ! Z ! Y assigning probability PZ! Y from Proposition 8.

Apply new orientation(s) and sort remaining list of unshielded triples Lc L c\hX, Z, Y iX 6 Y after updating propagation probabilities. until no additional orient./propa. probability > 1/2 ; bnlearn package (Scutari, 2010). Figs. 1-5 give the average CPDAG comparison results over 100 dataset replicates from 5 di↵erent benchmark networks (Table S1). The causal graphical models predicted by the 3o↵2 method are obtained using either the MDL/BIC or the NML complexities (see Supplementary Methods). Figs. S1-S6 provide additional results on the prediction of the network skeletons and execution times. The PC and MMHC results are shown, Figs. 1-5, for an independence test parameter ↵ = 0.1, as reducing ↵ tends to worsen the CPDAG F-score for benchmark networks with hki > 1.6 (Figs. S7-S18). All in all, we found that the 3o↵2 method outperforms both PC-stable and MMHC methods on all tested datasets, Figs. 1-5.

Additional comparisons were obtained with Bayesian inference implemented in the bnlearn package (Scutari, 2010), using AIC, BDe and BIC/MDL scores and hill-climbing heuristics with 30 to 100 random restarts, Figs. S19-S30. 3o↵2 reaches equivalent or significantly better F-scores than Bayesian hill-climbing for all dataset sizes on benchmark networks up to 120 edges (hki 6 4.8). In particular, 3o↵2 with MDL scores reaches excellent F-scores on sparse networks (Figs. S19 & S20) and keeps one of the best F-scores over all sample sizes for less sparse networks when combined to NML complexity (Figs. S21 & S22). For somewhat denser networks (hki ' 5), the 3o↵2 Fscore appears slightly lower than for Bayesian inference methods, Fig. S23, although it eventually becomes equivalent for large datasets (N > 1000). On denser networks (hki > 5 6), Bayesian inference exhibits better F-scores than 3o↵2 , in particular with AIC score, Fig. S24. However, the good performance with AIC strongly relies on its high Recall (but low Precision), due to its very small penalty term on large datasets, which makes it favor more complex networks (Figs. S24) but perform very poorly on sparse graphs (Figs. S19-S21). By contrast, the reconstruction of dense networks is impeded with the 3o↵2 scheme, as it is not always possible to uncover structural independencies, I(X; Y |{Ui}n) ' 0, in dense graphs through an ordered set {Ui}n with only positive conditional 3-point information, I0(X; Y ; Uk|{Ui}k 1) > 0. Indeed in complex graphs, there are typically many indirect paths X ! Uj ! Y between unconnected node pairs (X, Y ). At the beginning of the pruning process, this is prone to suggest likely v-structures X ! Y Uj , instead of the correct non-v-structures, X ! Uj ! Y (for instance if I(X; Uj ) ⌧ I(X; Y ), I(X; Uj ) ⌧ I(Uj ; Y ) and I(X; Uj ) I(X; Uj |Y ) = I(X; Y ; Uj ) < 0, for all j). Such elimination of F N edge X ! Uj and conservation of F P X ! Y tend to decrease both Precision and Recall, although 3o↵2 remains significantly more robust than PC and MMHC, Fig. 5. Besides, for most practical applications on real life data, interpretable causal models should remain relatively sparse and avoid to display multiple indirected paths between unconnected nodes.

Finally, 3o↵2 running times on these benchmark networks are similar to MMHC and Bayesian hill-climbing heuristic methods (with 100 restarts) and 10 to 100 times faster than PC for large datasets, Figs. S1-S30. 3

DISCUSSION

In this paper, we propose to combine constraint-based and score-based frameworks to improve network reconstruction. Earlier hybrid methods, including MMHC, have also attempted to exploit the best of these two types of inference approaches by combining the robustness of Bayesian scores with the attractive conceptual features of constraint-based approaches (Dash 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.2 0.0 1.0 0.8 0.6 0.4 0.2 1.0 0.8 0.6 0.4 0.2 0.0 100 1000 10000 50000 50n. 40e. Recall TP/(TP+FN) 100 1000 0.0 100 1000 10000 50000 50n. 80e. Recall TP/(TP+FN) 100 1000 and Druzdzel, 1999; Tsamardinos, Brown, and Aliferis, 2006; Cano, Gomez-Olmedo, and Moral, 2008; Claassen and Heskes, 2012). In particular, (Dash and Druzdzel, 1999) have proposed to exploit an intrinsic weakness of the PC algorithm, its sensitivity to the order in which conditional independencies are tested on finite data, to rank these di↵erent orderdependent PC predictions with Bayesian scores. More recently, (Claassen and Heskes, 2012) have also combined constraint-based and Bayesian approaches to improve the reliability of causal inference. They proposed to use Bayesian scores to directly assess the reliability of conditional independencies by summing the likelihoods over compatible graphs. By contrast, we propose to use Bayesian scores to progressively uncover the best supported conditional independencies, by iteratively “taking o↵” the most likely indirect contributions of conditional 3-point information from every 2-point (mutual) information of the causal graph. In addition, using likelihood ratios (Eqs.11 & 12) instead of likelihood sums (Claassen and Heskes, 2012) circumvents the need to score conditional independencies over a potentially intractable number of compatible graphs. All in all, we found that 3o↵2 outperforms constraintbased, search-and-score and earlier hybrid methods on a range of benchmark networks, while displaying similar running times as hill-climbing heuristic methods.

Acknowledgements

S.A. acknowledges support from Ministry of Research and Association pour la Recherche contre le Cancer.

Rissanen, J. 1978. Modeling by shortest data descrip

tion. Automatica vol. 14:465–471.

Roos, T.; Silander, T.; Kontkanen, P.; and Myllym¨aki,

P. 2008. Bayesian network structure learning using factorized nml universal models. In Proc. 2008 Information Theory and Applications Workshop (ITA2008). IEEE Press. invited paper.

Sanov, I. 1957. On the probability of large deviations

of random variables. Mat. Sbornik 42:11–44.

Scutari, M. 2010. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software 35(3):1–22.

Shtarkov, Y. M. 1987. Universal sequential coding of single messages. Problems of Information Transmission (Translated from) 23(3):3–17.

Spirtes, P., and Glymour, C. 1991. An algorithm for fast recovery of sparse causal graphs. Social Science

Computer Review 9:62–72.

Spirtes, P.; Glymour, C.; and Scheines, R. 2000. Causation, Prediction, and Search. The MIT Press,

Cambridge, Massachusetts, 2nd edition.

Tsamardinos, I.; Brown, L. E.; and Aliferis, C. F.

2006. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine

Learning 65(1):31–78.

SUPPLEMENTARY INFORMATION for the manuscript “Robust reconstruction of causal graphical models based on conditional 2-point and 3-point information”

S´everine A↵eldt, Herv´e Isambert Institut Curie, Research Center, UMR168, 26 rue d’Ulm, 75005, Paris France; and Universit´e Pierre et Marie Curie, 4 Place Jussieu, 75005, Paris, France

herve.isambert@curie.fr SUPPLEMENTARY

METHODS Complexity of graphical models The complexity kG,D of a graphical model is related to the normalization constant Z(G, D) of its maximum likelihood as kG,D = log Z(G, D),

LG =

Z(G, D)

= e NH(G,D) kG,D For Bayesian networks with decomposable entropy, i.e. H(G, D) = Pi H(Xi|{PaXi }), it is convenient to use decomposable complexities, kG,D = Pi kXi|{PaXi }, such that the comparison between alternative models G and G\X! Y (i.e. G with one missing edge X ! Y ) leads to a simple local increment of the score,

LG\X!Y = e

NI(X;Y |{PaY }\X )+ kY |{PaY }\X I(X; Y |{PaY }\X ) = H(Y |{PaY }\X )

H(Y |{PaY }) > 0 kY |{PaY }\X = kY |{PaY }

kY |{PaY }\X > 0 A common complexity criteria in model selection is the Bayesian Information Criteria (BIC) or Minimal Description Length (MDL) criteria (Rissanen, 1978; Hansen and Yu, 2001) , where rx, ry and rj are the number of levels of each variable, x, y and j. The MDL complexity, Eq.4, is simply related to the normalisation constant of the

MDL kY |{PaY } =

MDL 1 kY |{PaY }\X = 2 normal distribution reached in the asymptotic limit of a large dataset N ! 1 (Central Limit Theorem). The MDL complexity can also be derived from the Stirling approximation on the Bayesian measure (Schwarz, 1978; Bouckaert, 1993) . Yet, in practice, this central limit distribution is only reached for very large datasets, as some of the least-likely (ry 1) Qj rj combinations of states of variables are in fact rarely (if ever) sampled in typical finite datasets. As a result, the MDL complexity criteria tends to underestimate the relevance of edges connecting variables with many levels, ri, leading to the removal of false negative edges.

To avoid such biases with finite datasets, the normalisation of the maximum likelihood can be done over all possible datasets with the same number N of data points. This corresponds to the (universal) Normalized Maximum Likelihood (NML) criteria (Shtarkov, 1987; Rissanen and Tabus, 2005; Kontkanen and Myllym¨aki, 2007; Roos et al., 2008) ,

e NH(G,D)

LG = P|D0|=N e NH(G,D0) = e NH(G,D) kGNM,DL We introduce here the factorized version of the NML criteria (Kontkanen and Myllym¨aki, 2007; Roos et al., 2008) which corresponds to a decomposable NML score, kGNM, DL = PXi kXNMiL|{PaXi }, defined as,

qy kYNM|L{PaY } = X log CNryyj

NML kY |{PaY }\X = qy X log CNryyj j qy/rx X j0 log CNryyj0 (6) (7) (8) where Nyj is the number of data points corresponding to the jth state of the parents of Y , {PaY }, and Nyj0 the number of data points corresponding to the j0th state of the parents of Y , excluding X, {PaY }\X .

Hence, the factorized NML score for each node Xi corresponds to a separate normalisation for each state where Nijk corresponds to the number of data points for which the ith node is in its kth state and its parents in their jth state, with Nij = Prki Nijk. The universal normalization constant Cnr is then obtained by averaging over all possible partitions of the n data points into a maximum of r subsets, `1 + `2 + · · · + `r = n with `k > 0,

Cnr =

X `1+`2+···+`r=n `1!`2! · · · `r! k=1

n n! r ✓ `k ◆`k

Y which can in fact be computed in linear-time using the following recursion (Kontkanen and Myllym¨aki, 2007) ,

Cnr = Cn r 1 + r n

r 2 2 Cn with C0r = 1 for all r, Cn1 = 1 for all n and applying the general formula Eq.12 for r = 2, h=0 n ✓n◆ ✓ h ◆h ✓ n Cn2 = X h n n

h ◆n h or its Szpankowski approximation for large n (needed for n > 1000 in practice) (Szpankowski, 2001; Kontkanen et al., 2003; Kontkanen, 2009) ,

Cn2 = ' r n⇡

2 r n⇡ 2

1 + exp 2 r 2 3 n⇡ r 8 9n⇡ + +

1 12n + O ✓ 1 ◆!

n3/2 3⇡ 36n⇡ j = 1, ..., qi of its parents and involving exactly Nij data points of the finite dataset, into a {Ui}-dependent complexity term, kX;Y |{Ui}, with the same XY -symmetry as I(X; Y |{Ui}), Note, in particular, that the MDL complexity term in Eq.18 is readily obtained from Eq.5 due to the Markov equivalence of the MDL score, corresponding to its XY -symmetry whenever {PaY }\X = {PaX }\Y .

By contrast, the factorized NML score, Eq.7, is not a Markov-equivalent score (although its nonfactorized version, Eq.6, is Markov equivalent by definition). To circumvent this non-equivalence of factorized NML score, we propose to recover the expected XY -symmetry of kXNM;LY |{Ui} through the simple XY symmetrization of Eq.8, leading to Eq.19.

Then, following the rationale of constraint-based approaches, we can reformulate the likelihood ratio of Eq. 3 by replacing the parent nodes {PaY }\X in the conditional mutual information, I(X; Y |{PaY }\X ), with an unknown separation set {Ui} to be learnt simultaneously with the missing edge candidate XY ,

LG\XY |{Ui} = e NI(X;Y |{Ui})+kX;Y |{Ui}

LG where we have also transformed the asymmetric parent-dependent complexity di↵erence,

kY |{PaY }\X ,

Schwarz, G. 1978. Estimating the dimension of a

model. The Annals of Statistics 6:461–464.

Shtarkov, Y. M. 1987. Universal sequential coding of single messages. Problems of Information Transmission (Translated from) 23(3):3–17.

Szpankowski, W. 2001. Average case analysis of algo

rithms on sequences. John Wiley & Sons. hki 0.8 50 50 50 50 50 40 0.2 0.0

Fscore 2.Prec.Rec./(Prec.+Rec.) 100 1000 0.2 0.0

Fscore 2.Prec.Rec./(Prec.+Rec.) 100 1000

Fscore 2.Prec.Rec./(Prec.+Rec.)

Fscore 2.Prec.Rec./(Prec.+Rec.) 50n. 160e. Recall TP/(TP+FN)

Execution time (sec.) 100 1000 5000 100 1000 5000

1.0 0.8

Fscore 2.Prec.Rec./(Prec.+Rec.)

1.0 0.8 Figure S15: MMHC, e↵ect of independence test parameter ↵ . 50 node, 60 edge benchmark networks generated using Tetrad. hki = 2.4, hkminaxi = 4.6 and hkmouatxi = 3.6. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and Aliferis, 2006).

MMHC [BDe, 100 rst.] 50n. 80e. Precision TP/(TP+FP)

Fscore 2.Prec.Rec./(Prec.+Rec.)

1.0 0.8 Figure S17: MMHC, e↵ect of independence test parameter ↵ . 50 node, 120 edge benchmark networks generated using Tetrad. hki = 4.8, hkminaxi = 8.8 and hkmouatxi = 7.2. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and Aliferis, 2006).

MMHC [BDe, 100 rst.] 50n. 160e. Precision TP/(TP+FP)

Fscore 2.Prec.Rec./(Prec.+Rec.)

1.0 0.8 1000 5000 100 1000

5000 50n. 20e. Recall TP/(TP+FN)

PC 1e-1 cpdag HC AIC cpdag HC BDe cpdag HC BIC cpdag 3off2 MDL cpdag 3off2 NML cpdag PC 1e-1 cpdag HC AIC cpdag HC BDe cpdag HC BIC cpdag 3off2 MDL cpdag 3off2 NML cpdag 100 1000

5000

Fscore 2.Prec.Rec./(Prec.+Rec.) 100 1000 5000 100 1000

5000 50n. 40e. Recall TP/(TP+FN) 100 1000

5000 1000 5000 100 1000

5000 50n. 60e. Recall TP/(TP+FN)

PC 1e-1 cpdag HC AIC cpdag HC BDe cpdag HC BIC cpdag 3off2 MDL cpdag 3off2 NML cpdag PC 1e-1 cpdag HC AIC cpdag HC BDe cpdag HC BIC cpdag 3off2 MDL cpdag 3off2 NML cpdag 100 1000

5000 50n. 80e. Precision TP/(TP+FP)

Fscore 2.Prec.Rec./(Prec.+Rec.) 100 1000 5000 100 1000

5000 50n. 80e. Recall TP/(TP+FN) 100 1000

5000 1000 5000 100 1000

5000 50n. 120e. Recall TP/(TP+FN)

PC 1e-1 cpdag HC AIC cpdag HC BDe cpdag HC BIC cpdag 3off2 MDL cpdag 3off2 NML cpdag PC 1e-1 cpdag HC AIC cpdag HC BDe cpdag HC BIC cpdag 3off2 MDL cpdag 3off2 NML cpdag 100 1000

5000 50n. 160e. Precision TP/(TP+FP)

Fscore 2.Prec.Rec./(Prec.+Rec.) 100 1000 5000 100 1000

5000 50n. 160e. Recall TP/(TP+FN) 100 1000

5000 50n. 20e. Precision TP/(TP+FP)

Fscore 2.Prec.Rec./(Prec.+Rec.) 50n. 40e. Precision TP/(TP+FP)

Execution time (sec.) 100 1000 5000 100 1000 5000 50n. 40e. Recall TP/(TP+FN)

Execution time (sec.) 100 1000 5000 100 1000 5000 50n. 60e. Precision TP/(TP+FP)

Fscore 2.Prec.Rec./(Prec.+Rec.) 50n. 80e. Precision TP/(TP+FP)

Execution time (sec.) 100 1000 5000 100 1000 5000 50n. 80e. Recall TP/(TP+FN)

Execution time (sec.) 100 1000 5000 100 1000 5000 50n. 120e. Precision TP/(TP+FP)

Fscore 2.Prec.Rec./(Prec.+Rec.) 1e4 1e3 1e2 1e1 1e3 1e2 1e1 50n. 160e. Precision TP/(TP+FP)

Fscore 2.Prec.Rec./(Prec.+Rec.) AIC 30rst. skeleton AIC 100rst. skeleton BDe 30rst. skeleton BDe 100rst. skeleton BIC 30rst. skeleton BIC 100rst. skeleton AIC 30rst. cpdag AIC 100rst. cpdag BDe 30rst. cpdag BDe 100rst. cpdag BIC 30rst. cpdag BIC 100rst. cpdag AIC 30rst. skeleton AIC 100rst. skeleton BDe 30rst. skeleton BDe 100rst. skeleton BIC 30rst. skeleton BIC 100rst. skeleton AIC 30rst. cpdag AIC 100rst. cpdag BDe 30rst. cpdag BDe 100rst. cpdag BIC 30rst. cpdag BIC 100rst. cpdag 50n. 120e. Recall TP/(TP+FN)

Execution time (sec.)

Bouckaert , R. R.

1993 . Probabilistic network construction using the minimum description length principle . in Symbolic and Quantitative Approaches to Reasoning and Uncertainty (Clarke

, Kruse

, Moral S, eds) 747 : 41 - 48 .

Colombo , D. , and Maathuis , M. H. 2014 . Order-independent constraint-based causal structure learning . Journal of Machine Learning Research 15 : 3741 - 3782 .

Hansen , M. H. , and Yu , B. 2001 . Model selection and the principle of minimum description length . Journal of the American Statistical Association 96 : 746 - 774 .

Kontkanen , P. , and Myllym¨aki, P. 2007 . A linear-time algorithm for computing the multinomial stochastic complexity . Inf. Process. Lett . 103 ( 6 ): 227 - 233 .

Kontkanen , P. ; Buntine , W. ; Myllym¨aki, P.; Rissanen , J.; and Tirri, H. 2003 . Ecient computation of stochastic complexity . in: C. Bishop , B. Frey (Eds.) Proceedings of the Ninth International Conference on Artificial Intelligence and Statistics, Society for Artificial Intelligence and Statistics 103 : 233 - 238 .

Kontkanen , P.

2009 . Computationally Ecient Methods for MDL-Optimal Density Estimation and Data Clustering . Ph.D. Dissertation.

Rissanen , J. , and Tabus , I. 2005 . Kolmogorovs structure function in mdl theory and lossy data compression . In Adv. Min. Descrip. Length Theory Appl . MIT Press. Chap. 10 .

Figure S14: MMHC, e↵ect of independence test parameter ↵ . 50 node, 40 edge benchmark networks generated using Tetrad. hki = 1.6, hkminaxi = 3.2 and hkmouatxi = 3.6. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and

Aliferis , 2006 ).

Figure S16: MMHC, e↵ect of independence test parameter ↵ . 50 node, 80 edge benchmark networks generated using Tetrad. hki = 3.2, hkminaxi = 4.8 and hkmouatxi = 5.6. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and

Aliferis , 2006 ).

Figure S18: MMHC, e↵ect of independence test parameter ↵ . 50 node, 160 edge benchmark networks generated using Tetrad. hki = 6.4, hkminaxi = 8.6 and hkmouatxi = 8.6. G2 independence test; MMHC, BDe score (Tsamardinos, Brown, and

Aliferis , 2006 ).