Toward Learning Graphical and Causal Process Models Christopher Meek Microsoft Research One Microsoft Way Redmond, WA 98052 meek@microsoft.com Abstract graphs called δ*separation which plays an analogous role as d-separation in the work of Verma and Pearl We describe an approach to learning causal (1990) and Spirtes, Glymour, and Scheines (2001). models that leverages temporal information. Another key ingredient is the process independence We posit the existence of a graphical de- statement that plays an analogous role to the inde- scription of a causal process that generates pendence statement. Conceptually, we assume that observations through time. We explore as- we can test whether a process independence state- sumptions connecting the graphical descrip- ments about observable quantities holds by observing tion with the statistical process and what one the process and that these observation provide insight can infer about the causal structure of the into the causal structure governing the process. In process under these assumptions. particular, we posit the existence of a graphical de- scription of a causal process and make assumptions that connect δ*separation with observable process in- 1 Introduction dependence statements. We explore what can be in- ferred about the causal structure of the process un- Data that measure the temporal dynamics of systems der various observability assumptions. While the ulti- is pervasive. The goal of this paper is to describe an mate goal is to create a sound and complete method approach to the development of a sound approach to for causal inference for observations from a stochas- causal inference for dynamic systems. One of the pop- tic dynamic system, this paper represents some initial ular extant approaches is Granger causality (Granger steps towards this ultimate goal. In particular, the re- 1969) which fails to be sound in the presence of la- sults in Section 3.2 can be viewed sufficient conditions tent variables. Granger causality is typically applied in for Granger causality and, in Section 3.3, we present discrete-time continuous valued time-series. Roughly sufficient conditions under which we can make sound speaking, in a multivariate time series X a set of vari- inferences about causal relationships including the ex- ables are the Granger-causes of Xj if the historical val- istence of causal relationships and the existence and ues of this set of variables (including Xj ) are necessary non-existence of latent common causal relationships. and sufficient for optimal prediction. Unfortunately a As presented in Section 3, our causal discovery algo- variable deemed a Granger-cause can arise due to ei- rithm assumes the existence of an oracle for process in- ther a latent common cause or as a result of a direct dependence statements. Our approach of abstracting causal relationship and thus the approach cannot be away the details of how one connects process indepen- used to determine causal relationships if one does not dence statements with particular statistical processes exclude the possibility of latent variables. allows us to simultaneously make progress on the In this paper, we explore how one can leverage the causal discovery problem for multiple distinct statis- assumption that causes must precede effects to in- tical processes such as marked point processes, Gaus- form causal conclusions drawn from observations of sian processes and dynamic Bayesian networks. In Sec- a temporal statistical process. The approach taken tion 4, we discuss two particular statistical processes here is similar to the approach developed by Verma and their associated process independence statements. and Pearl (1990) and Spirtes, Glymour, and Scheines In Section 5, we discuss some related work and open (2001) for atemporal causal discovery. One key in- research questions. gredient in our approach is a new asymmetric graph- ical separation criterion for directed (possibly cyclic) 2 Graphical Separation separation for directed acyclic graphs, we would like a graphical separation criterion for directed graphs to answer questions about how past observations influ- We use G = hL, Ei to denote a directed graph where ence future observations. Due in part to the fact that L is a set of vertices and E ⊆ hL × Li is a set of edges a directed graph does not explicitly encode temporal represented as ordered pairs. We write a → b if ha, bi ∈ information we cannot simply apply d*separation on E and say that a is a parent of b and b is the child of the directed graph. Instead, we define δ*separation a. Note that, in addition to allowing cycles, we also which extends the graphical δ-separation of Didelez allow that a vertex can be its own parent and child (2008) to handle self-edges. For sets A, B, C ⊂ L (i.e., a self-edge a → a). We use the shorthand a ↔ b where A ∩ C = ∅ we say that that B is δ*separated to indicate that a → b and b → a. from A given C (or simply δ(A, C, B)) in G if an A path in G is a sequence hl1 , . . . , ln i where there is only if B is d-separated from A given C in the B- an edge between successive pairs of vertices in G. The historical dependency graph G B where G B = L, E B length of a path p = hl1 , . . . , ln i is |p| = n and a path and E B = E\{hb, ai ∈ E|b ∈ B, a 6= b}. Note that p is termed a trivial path if |p| = 1. A vertex li on δ*separation is not symmetric in the first and third path p = hl1 , . . . , ln i is a collider on p if li−1 → li and arguments due to the use of the graph G B . li ← li+1 and a non-collider otherwise. A directed path in graph G = hL, Ei is a sequence of vertices hl1 , . . . , ln i such that hli , li+1 i ∈ E. The source of a directed path 3 Learning the Structure of a Causal is the first vertex in the path. We denote the set of Process ancestors for a set A by An(A). The ancestor relation is reflexive and thus A ⊆ An(A). Our aim is to connect statistical processes with causal graphs and to learn the causal graph governing a sys- We define a graph separation criterion called tem of observed events. We assume that there is a d*separation for directed graphs which is an exten- statistical process governing what and when events sion of d-separation (Pearl 1988). An extension of d- happen. We denote a statistical process for a set separation is required as a pure vertex separation cri- of observation types L by PL . We also assume that terion like d-separation cannot separate a vertex from we can observe the process to determine the whether itself which is required to appropriately handle self- process independence statements hold. We will write edges in directed graphs. A path p d*connects ver- P I(A, C, B) to indicate that the process associated tices a and b given the set of vertices C in graph G if with observations of type B does not depend on the every collider on p is in An(C) and every non-collider history of observations of type A given the history of on p is not in C. For sets of vertices A, B, C ⊆ L observations of type C in a given process PL (where where A ∩ C = ∅ we say that B is d*separated from A ∩ C = ∅). We write ¬P I(A, C, B) if this is not the A by C in graph G if and only if there does not exist case. We call such statements process independence a non-trivial d*connecting path between some a ∈ A statements. We note that process independence state- and some b ∈ B given C in G. ments need not correspond to statistical independence There are two key differences from Pearl’s d-separation statements and, as with δ*separation, there is no ex- that allow us to appropriately handle cyclic directed pectation that such process independence statements graphs. First, we restrict d*separation statements to ought to be symmetric. In this section, we assume the sets in which A ∩ C is the empty set but allow the existence of a process independence oracle for the rel- sets A and B to overlap. Second, d*connecting paths evant statistical process. In Section 4, we discuss par- must be non-trivial. These modifications enable us ticular statistical processes and the problem of testing to use d*separation statements to distinguish between process independence statements for those processes. graphs in which there is a self-edge (a → a) and one A process PL satisfies the Causal Factorization As- in which there is not. sumption with respect to a causal process graph G = We use directed graphs to represent temporal statis- hL, Ei if and only if for all A, B, C ⊂ L where A∩B = ∅ tical processes. We associate the vertices L with a it is the case that δ(A, B, C) ⇒ P I(A, B, C) set of possible observation types (i.e., things that can A process PL satisfies the Causal Dependence Assump- happen). The edges denote potential dependencies tion with respect to a causal process graph G = hL, Ei between observations and the absence of a directed if and only if for all A, B, C ⊂ L where A ∩ B = ∅ it edge from observation type a to observation type b is the case that P I(A, B, C) ⇒ δ(A, B, C) indicates that the process that generates observations of type b does not directly depend on the history of The Causal Analysis (CA) Algorithm (Algorithm 1) observations of type a. Analogous to the use of d- uses a process independence oracle to construct a di- rected graph. We use πlG to denote the parents of l process graph G. Our causal factorization and depen- in graph G and |B| to denote the cardinality of the dence assumptions allow us to focus on δ*separation set B. The basic idea is to use process independence in G by assuming that the observed process indepen- statements to remove edges from an initially complete dence statements accurately reflect the δ*separation graph. This algorithm is analogous to the PC Algo- statements about G for the observed observation types. rithm of Spirtes, Glymour and Scheines (2001) but In order to understand and interpret the output of the does not have an orientation phase. CA algorithm we need to understand the conditions that lead to edges in the final output. We begin by Note that the output of the CA algorithm is a directed defining the concept of vertex blockability relative to graph and that any edges presented do not necessarily a set of observed event types. indicate a causal relationship. In the remainder of this section we explore the interpretation of the output of We say that a vertex a is b-unblockable relative to O in the CA algorithm under various assumptions. Recall G if and only if for all C ⊆ O \ {a, b} ¬δ(a, C, b) is true that a ↔ b simply indicates that a → b and b → a and of G. Otherwise the vertex is said to be b-blockable not the existence of a latent common cause. relative to O. Note that if b → b then if b ∈ O b is b-unblockable relative to O. Input: A set of events L and a process PL Output: A directed graph G We say that l is a direct cause of l0 relative to O for Let G = hL, Ei be a complete directed graph.; causal process graph G if and only if there exists a foreach l ∈ L do directed path hl1 , . . . , ln i where l1 = l and ln = l0 and Let n = 0; li 6∈ O for (1 < i < n). We call the path in the foreach l0 ∈ πlG do definition of direct cause a witnessing path that l is a foreach B ⊆ πlG \ {l0 } where |B| = n do direct cause of l0 . We let Db denote the set of observed if P I(l0 , B, l) holds in PL then direct causes of the variable b relative to O, that is, E = E \ hl0 , li members of O that are direct causes of b relative to O. end Example 1. Let E = {a → c, c → b}, L = {a, b, c} end and O = {a, b}. The vertex a is b-unblockable relative Let n = n + 1; to O for G = hL, Ei but the vertex b is a-blockable end relative to O. In this example, a is a direct cause of b end relative to O in graph G and a → c → b is a witnessing Return G = hL, Ei; path for this fact. Algorithm 1: The Causal Analysis (CA) Algorithm Lemma 3. If l0 is a direct cause of l relative to O in G then l0 is l-unblockable relative to O in G. Theorem 1 (Complete Observations). If PL satisfies both the causal dependence and factorization assump- The following lemma allows us to make causal infer- tions with respect to G then algorithm CA(L, PL ) re- ences using the causal analysis algorithm about the turns G 0 = G. absence of a direct causal relationship. Lemma 1. If PL satisfies the causal dependence as- Lemma 4. If PL satisfies the causal dependence as- sumption for G = hL, Ei and algorithm CA(L, PL ) re- sumption with respect to G then, in the graph G 0 output turns G 0 = hL, E 0 i then if l0 → l ∈ E then l0 → l ∈ E 0 . by CA(O,PL ), the set of parents for each event type include all of its direct causes relative to O. Lemma 2. If PL satisfies both the causal depen- In particular, if the algorithm finds that an event type dence and factorization assumptions for G = hL, Ei a is not a parent of event type b then a is not a direct and algorithm CA(L, PL ) returns G 0 = hL, E 0 i then if cause of b. l0 → l 6∈ E then l0 → l 6∈ E 0 . Proof of Theorem 1: The theorem follows from 3.2 Causal sufficiency Lemmas 1 and 2. In the section, we restrict the type of unobserved event types which enables us to make strong inferences about 3.1 Absence of a direct causal relationship the causal structure of a process. In particular we as- sume causal sufficiency which is essentially an assump- Next we consider the case in which some of the event tion that there are no latent confounding processes. types in the system are not observed. We let O ⊆ L be the set of observed event types. In this case we will A set of event types O ⊂ L is causally sufficient with assume that the causal factorization and dependence respect to a graph G = hL, Ei if and only if every assumptions hold for a process PL and some causal common cause of l, l0 ∈ O is in the set of event types O. We say that a is a cause of b in G and if there is a directed path from a to b in G. A directed graph G 0 = hO, E 0 i is causally correct with respect to a graph G = hL, Ei if for every edge ha, bi ∈ We aim to find common features of all graphs that are E 0 a is a direct cause of b with respect to O in G. consistent with the observed pattern of process inde- Theorem 2 (Causal Sufficiency). If PL satisfies both pendence statements. Latent processes, however, can the causal dependence and factorization assumptions mask the causal nature of the observed pattern of de- for G = hL, Ei and O ⊆ L is causally sufficient with pendencies. respect to G then the graph G 0 returned by algorithm For a pair of vertices a, b and graph G we say that there CA(O, PL ) is causally correct with respect to G and is a potential indirect inducing path into b relative to O O. if and only if (1) there is a vertex c1 ∈ O \ {a, b} such Lemma 5. If PL satisfies the causal dependence and that a → b in G and (2) there is a sequence of vertices factorization assumptions with respect to G and O is c1 , . . . , cn ⊆ O \ {a, c} such that ci ↔ ci+1 and cn ↔ b causally sufficient for G then the output of the CA al- in G. gorithm removes the edge a → b if a is not a direct Lemma 7. For any set of observed variable O, if a cause of b relative to O. graph has an inducing path between observed variables a, b into b containing another observed variable then 3.3 Causal insufficiency the output of the CA algorithm will contain a potential indirect inducing path into b. We have shown that the CA algorithm can provide Theorem 3 (Sufficient Cause). If PL satisfies both causally accurate information under the assumptions the causal dependence and factorization assumptions of causal sufficiency, causal factorization and causal for G = hL, Ei then if CA produces G 0 with vertices dependence. In this section we consider removing the O ⊆ L for which the subgraph over {a, b} is a → b and assumption of causal sufficiency. G 0 contains no potential inducing path between a, b into Example 2. Let E = {a ← c, c → b}, L = {a, b, c} b then a is a cause of b in G. and O = {a, b}. The observed event types O are not Lemma 8. If PL satisfies both the causal dependence causally sufficient for the graph G = hL, Ei. In ad- and factorization assumptions for G = hL, Ei and CA dition, the CA algorithm fails to provide output that produces G 0 with vertices O ⊆ L for which the subgraph is causally correct. In particular, the CA algorithm over {a, b, c} is a ↔ b ↔ c then yields the graph in which a → b and b → a despite the fact that neither is a a cause of b in G nor is b a cause • if P I(a, ∅, c) and P I(c, ∅, a) then there is a latent of a. common causes of a, b and a (possibly distinct) Our aim is to graphically characterize vertex separa- latent cause of b, c and b is not a direct cause of c bility. We do so using the idea of an inducing path in a and b is not a direct cause of a. directed graph that was introduced for directed acyclic • if P I(a, b, c) then there is no latent common graphs by Verma and Pearl (1990). For a pair of ver- causes of b, c, b is a cause of c in G. tices a, b, we define Aab = An({a}) ∪ An({b}) \ {a, b}. A path p between ha, bi is an inducing path relative to O if and only if (1) every vertex on p ∈ O is a collider 4 Statistical Processes and Process on p and (2) Every collider on p is in Aab . An induc- Independence ing path p = hl1 = a, . . . , ln = bi from a to b is into b if ln−1 → ln . An inducing path p = hl1 = a, . . . , ln = bi Our approach to causal discovery through the obser- from a to b is out of a if l1 → l2 . vation of a dynamic process is applicable to different temporal statistical processes. The key connection re- Lemma 6. For a directed graph G the following three quired is a connection between process independence statements are equivalent: statements and the observations from a particular sta- tistical process. In this section we consider two distinct (a) A vertex a is b-unblockable relative to O in graph statistical processes and discuss process independence G for these processes. (b) There is an inducing path between a and b relative to O in graph G b . Note this inducing path must 4.1 Dynamic Bayesian Networks be into b. Dynamic Bayesian networks (DBNs) are a popular (c) ¬δ(a, O ∩ Aab , b) in G. discrete-time model that can capture temporal dy- namics of a statistical process. A DBN is a statis- tical model of an infinite set of variables indexed by distribution for D in which the timestamps are con- time. A variable Xit denotes the ith variable at time tinuous random variables can be written in this form. t. We use X = X1 , . . . , Xn to denote the set of vari- For more details see [1, 2]. Despite the fact that the able types in the DBN, that is, a variable with an un- modeling assumptions are weak, these models offer a specified time component and X t to denote the set of powerful approach for decomposing the dependencies variables at time t. The DBN specifies the evolution of different event types on the past. In particular, this of X t as a stochastic function of the value of previous per label conditional specification allows one to model variables X t−i (i > 0). In particular, the variable Xit detailed label-specific dependence on past events. is a stochastic function of the value of its parents in Next we define a graphical conditional intensity model a graph. The causal process graph associated with a that we call a graphical event model (GEM). A fil- causal DBN is a graph over the variable types of the tered history for A ⊆ L as [h]A = {(ti , li )|(ti , li ) ∈ DBN X where there is an edge Xi → Xj if there ex- h ∧ li ∈ A}. A GEM is a pair < G, θ >, where ists a t, i such that there is an edge Xit−i → Xj in the G =< L, E > is a directed graph over a set of event DBN. Thus, the parent relationship of the causal pro- types and edges in E represent potential dependencies cess graph captures the dependence of a variable type among event types. The parameters θ = {θl }l∈L pa- on the history of other variable types. Furthermore, rameterize the intensity functions for each event type. process independence statements P I(Xi , C, Xj ) corre- In particular, λl (t|ht , θl ) = λl (t|[ht ]πl , θl ) where πl is spond to a set of independence statements of the form the set of parents for l in G. As in the case of the I(Xi1 , . . . , Xit−1 , XC1 , . . . , XCt−1 , Xjt ). Without further DBN, a process independence statement correspond to assumptions, testing process independence would be testing a dependence of an event type on set of event unfeasible but if we focus on stationary processes with histories. One potential approach to testing a process finite temporal dependency we can potentially test independence P I(a, C, b) is to estimate/learn an inten- process independence statements. sity function for b using the event histories for {a} ∪ C and see if the intensity model depends on the event 4.2 Graphical Event Models history for a. The work by Gunawardana et al (2011) on learning piecewise continuous intensity models is a In this section, we define Conditional Intensity Mod- good starting point for this approach. els and Graphical Event Models (GEMs) and con- nect these models with previous work on the class of Piecewise-Constant Conditional Intensity Models and 5 Discussion Poisson Networks. We assume that events of differ- ent types are distinguished by labels l drawn from One of the goals for the research direction described a finite alphabet L. An event is then composed of in this paper is the development a sound approach a non-negative time-stamp t and a label l. A his- to causal inference for dynamic systems. One of the n tory is an event sequence h = {(ti , li )}i=1 where popular extant approaches is that of Granger causal- 0 < t1 < · · · < tn , and our data is a specific history ity which fails on this account. This approach is typi- denoted by D. Given data D, we define the history cally applied in a discrete-time continuous valued time- at time t as h(t, D) = {(ti , li ) | (ti , li ) ∈ D, ti ≤ t}. We series and, thus, can be viewed as a dynamic Bayesian suppress D from h(t, D) when clear from context and network. Roughly speaking, in a multivariate time se- write hi = h(ti−1 ). By convention t0 = 0. We define ries X a set of variables are the Granger-causes of Xj if the ending time t(h) of a history h as the time of the the historical values of this set of variables (including last event in h: t(h) = max(t,l)∈h t so that t(hi ) = ti−1 . Xj ) are necessary and sufficient for optimal predic- A Conditional Intensity Model (CIM) is a set of non- tion. Unfortunately this approach does not appropri- negative conditional intensity functions indexed by la- ately handle latent common causes. In particular, for bel {λl (t|h; θ)}l∈L . The data likelihood for this model both of the scenarios described in Lemma 8 it is the is case that each of the variables is a Granger cause of its neighbors while this relationships need not be causal n YY as the lemma demonstrates. In fact, it is easy to con- p(D|θ) = λl (ti |hi , θ)1l (li ) e−Λl (ti |hi ;θ) (1) struct stochastic processes with latent factors which l∈L i=1 demonstrate that the inferential approach to Granger Rt causality is not sound with respect to causal relations. where Λl (t|h; θ) = −∞ λl (τ |h; θ)dτ and the function 1l (l0 ) is one if l0 = l and zero otherwise. The condi- There has been much work related to causal discov- tional intensities are assumed to satisfy λl (t|h; θ) = 0 ery and the estimation of causal effects in time-series. for t ≤ t(h) to ensure that ti > ti−1 = t(hi ). These As discussed above, the work on Granger causality modeling assumptions are quite weak. In fact, any (Granger 1969) is the most well known. The short- comings of this approach are also well known (e.g., References Eichler 2007) and there has been some work in trying to address these known short comings. For instance, [1] D. J. Daley and D. Vere-Jones. An Introduction to Eichler (2007) proposes a similar approach to the ap- the Theory of Point Processes: Elementary The- proach described here but differs in that it allows for ory and Methods, volume I. Springer, second edi- the possibility of “simultaneous correlation” which re- tion, 2002. quires the use of an alternative definition of separa- [2] Vanessa Didelez. Graphical models for marked tion. In addition, while providing definitions of cause point processes based on local independence. and spurious cause, sufficient conditions for the identi- JRSS-B, 70(1):245–264, 2008. fication of causal relationships are not presented. The work of Entner and Hoyer (2010) considers the prob- [3] Michael Eichler. Granger causality and path di- lem of causal discovery from time series data using agrams for multivariate time series. Journal of limited dependence vector autoregressive models and Econometrics, 137:334–353, 2007. the FCI algorithm that uses conditional independence tests to identify the structure. Our approach of using [4] Michael Eichler and Vanessa Didelez. Causal rea- δ*separation is inspired by the work of Didelez (2008) soning in graphical time series models. In Un- who defined δ-separation and shows the connection be- certainty in Artificial Intelligence, pages 109–116, tween that graphical separation criterion and local in- 2007. dependence of marked point processes. Our extension [5] Doris Entner and Patrik O. Hoyer. On causal dis- to δ*separation allows for the appropriate treatment of covery from time series data using FCI. In Prob- self-edges which are essential in any self-excitatory or abilistic Graphical Models, pages 121–128, 2010. self-inhibitory dynamic process. Another more loosely connected work is that of Eichler and Didelez (2007) [6] C.W.J. Granger. Investigating causal relations by that considers the estimation of causal effects based econometric models and cross-spectral methods. on an intervention in a time-series. Econometrica, 37:424–438, 1969. While the results described in this paper offer hope [7] Asela Gunawardana, Christopher Meek, and for developing a methodologically sound approach to Puyang Xu. A model for temporal dependencies causal inference for dynamic systems, there is much in event streams. In Advances in Neural Informa- work that needs to be done. Here are some of the tion Processing Systems, 2011. open research questions. [8] C. Meek. Strong completeness and faithfulness • Non-parametric tests for process independence for in Bayesian networks. In Proceedings of Eleventh various type of temporal statistical processes Conference on Uncertainty in Artificial Intelli- gence, Montreal, QU, pages 411–418. Morgan Kaufmann, August 1995. • Soundness and completeness results for δ*separation analogous to those provided by [9] P. Spirtes, C. Glymour, and R. Scheines. Cau- Pearl (1988), Meek (1995) and Spirtes et al sation, Prediction, and Search, Second Edition. (2001) for d-separation. Note that Didelez (2008) MIT Press, Cambridge, MA, second edition, 2001. has shown the soundness of δ-separation for a family of marked point processes related to [10] T. Verma and J. Pearl. Equivalence and synthe- GEMs. sis of causal models. In Proceedings of Sixth Con- ference on Uncertainty in Artificial Intelligence, • A representation for equivalence classes of causal Boston, MA, pages 220–227. Morgan Kaufmann, graphs with respect to δ*separation in the case of July 1990. causal insufficiency (O ⊂ L) analogous to those developed by Verma and Pearl (1990) and Spirtes et al (2001) that captures the common casual as- pects of the set of graphs in the equivalence class. Acknowledgments Thanks to Asela Gunawardana and two anonymous reviewers for their comments on an earlier draft of this paper.