=Paper= {{Paper |id=Vol-1274/paper8 |storemode=property |title=Toward Learning Graphical and Causal Process Models |pdfUrl=https://ceur-ws.org/Vol-1274/uai2014ci_paper8.pdf |volume=Vol-1274 |dblpUrl=https://dblp.org/rec/conf/uai/Meek14 }} ==Toward Learning Graphical and Causal Process Models== https://ceur-ws.org/Vol-1274/uai2014ci_paper8.pdf

Toward Learning Graphical and Causal Process Models

Christopher Meek
Microsoft Research
One Microsoft Way
Redmond, WA 98052
meek@microsoft.com

Abstract graphs called δ*separation which plays an analogous
role as d-separation in the work of Verma and Pearl
We describe an approach to learning causal (1990) and Spirtes, Glymour, and Scheines (2001).
models that leverages temporal information. Another key ingredient is the process independence
We posit the existence of a graphical de- statement that plays an analogous role to the inde-
scription of a causal process that generates pendence statement. Conceptually, we assume that
observations through time. We explore as- we can test whether a process independence state-
sumptions connecting the graphical descrip- ments about observable quantities holds by observing
tion with the statistical process and what one the process and that these observation provide insight
can infer about the causal structure of the into the causal structure governing the process. In
process under these assumptions. particular, we posit the existence of a graphical de-
scription of a causal process and make assumptions
that connect δ*separation with observable process in-
1 Introduction dependence statements. We explore what can be in-
ferred about the causal structure of the process un-
Data that measure the temporal dynamics of systems der various observability assumptions. While the ulti-
is pervasive. The goal of this paper is to describe an mate goal is to create a sound and complete method
approach to the development of a sound approach to for causal inference for observations from a stochas-
causal inference for dynamic systems. One of the pop- tic dynamic system, this paper represents some initial
ular extant approaches is Granger causality (Granger steps towards this ultimate goal. In particular, the re-
1969) which fails to be sound in the presence of la- sults in Section 3.2 can be viewed sufficient conditions
tent variables. Granger causality is typically applied in for Granger causality and, in Section 3.3, we present
discrete-time continuous valued time-series. Roughly sufficient conditions under which we can make sound
speaking, in a multivariate time series X a set of vari- inferences about causal relationships including the ex-
ables are the Granger-causes of Xj if the historical val- istence of causal relationships and the existence and
ues of this set of variables (including Xj ) are necessary non-existence of latent common causal relationships.
and sufficient for optimal prediction. Unfortunately a
As presented in Section 3, our causal discovery algo-
variable deemed a Granger-cause can arise due to ei-
rithm assumes the existence of an oracle for process in-
ther a latent common cause or as a result of a direct
dependence statements. Our approach of abstracting
causal relationship and thus the approach cannot be
away the details of how one connects process indepen-
used to determine causal relationships if one does not
dence statements with particular statistical processes
exclude the possibility of latent variables.
allows us to simultaneously make progress on the
In this paper, we explore how one can leverage the causal discovery problem for multiple distinct statis-
assumption that causes must precede effects to in- tical processes such as marked point processes, Gaus-
form causal conclusions drawn from observations of sian processes and dynamic Bayesian networks. In Sec-
a temporal statistical process. The approach taken tion 4, we discuss two particular statistical processes
here is similar to the approach developed by Verma and their associated process independence statements.
and Pearl (1990) and Spirtes, Glymour, and Scheines In Section 5, we discuss some related work and open
(2001) for atemporal causal discovery. One key in- research questions.
gredient in our approach is a new asymmetric graph-
ical separation criterion for directed (possibly cyclic)
2 Graphical Separation separation for directed acyclic graphs, we would like
a graphical separation criterion for directed graphs to
answer questions about how past observations influ-
We use G = hL, Ei to denote a directed graph where ence future observations. Due in part to the fact that
L is a set of vertices and E ⊆ hL × Li is a set of edges a directed graph does not explicitly encode temporal
represented as ordered pairs. We write a → b if ha, bi ∈ information we cannot simply apply d*separation on
E and say that a is a parent of b and b is the child of the directed graph. Instead, we define δ*separation
a. Note that, in addition to allowing cycles, we also which extends the graphical δ-separation of Didelez
allow that a vertex can be its own parent and child (2008) to handle self-edges. For sets A, B, C ⊂ L
(i.e., a self-edge a → a). We use the shorthand a ↔ b where A ∩ C = ∅ we say that that B is δ*separated
to indicate that a → b and b → a. from A given C (or simply δ(A, C, B)) in G if an
A path in G is a sequence hl1 , . . . , ln i where there is only if B is d-separated from A given C in the B-
an edge between successive pairs of vertices in G. The historical dependency graph G B where G B = L, E B
length of a path p = hl1 , . . . , ln i is |p| = n and a path and E B = E\{hb, ai ∈ E|b ∈ B, a 6= b}. Note that
p is termed a trivial path if |p| = 1. A vertex li on δ*separation is not symmetric in the first and third
path p = hl1 , . . . , ln i is a collider on p if li−1 → li and arguments due to the use of the graph G B .
li ← li+1 and a non-collider otherwise. A directed path
in graph G = hL, Ei is a sequence of vertices hl1 , . . . , ln i
such that hli , li+1 i ∈ E. The source of a directed path
3 Learning the Structure of a Causal
is the first vertex in the path. We denote the set of Process
ancestors for a set A by An(A). The ancestor relation
is reflexive and thus A ⊆ An(A). Our aim is to connect statistical processes with causal
graphs and to learn the causal graph governing a sys-
We define a graph separation criterion called
tem of observed events. We assume that there is a
d*separation for directed graphs which is an exten-
statistical process governing what and when events
sion of d-separation (Pearl 1988). An extension of d-
happen. We denote a statistical process for a set
separation is required as a pure vertex separation cri-
of observation types L by PL . We also assume that
terion like d-separation cannot separate a vertex from
we can observe the process to determine the whether
itself which is required to appropriately handle self-
process independence statements hold. We will write
edges in directed graphs. A path p d*connects ver-
P I(A, C, B) to indicate that the process associated
tices a and b given the set of vertices C in graph G if
with observations of type B does not depend on the
every collider on p is in An(C) and every non-collider
history of observations of type A given the history of
on p is not in C. For sets of vertices A, B, C ⊆ L
observations of type C in a given process PL (where
where A ∩ C = ∅ we say that B is d*separated from
A ∩ C = ∅). We write ¬P I(A, C, B) if this is not the
A by C in graph G if and only if there does not exist
case. We call such statements process independence
a non-trivial d*connecting path between some a ∈ A
statements. We note that process independence state-
and some b ∈ B given C in G.
ments need not correspond to statistical independence
There are two key differences from Pearl’s d-separation statements and, as with δ*separation, there is no ex-
that allow us to appropriately handle cyclic directed pectation that such process independence statements
graphs. First, we restrict d*separation statements to ought to be symmetric. In this section, we assume the
sets in which A ∩ C is the empty set but allow the existence of a process independence oracle for the rel-
sets A and B to overlap. Second, d*connecting paths evant statistical process. In Section 4, we discuss par-
must be non-trivial. These modifications enable us ticular statistical processes and the problem of testing
to use d*separation statements to distinguish between process independence statements for those processes.
graphs in which there is a self-edge (a → a) and one
A process PL satisfies the Causal Factorization As-
in which there is not.
sumption with respect to a causal process graph G =
We use directed graphs to represent temporal statis- hL, Ei if and only if for all A, B, C ⊂ L where A∩B = ∅
tical processes. We associate the vertices L with a it is the case that δ(A, B, C) ⇒ P I(A, B, C)
set of possible observation types (i.e., things that can
A process PL satisfies the Causal Dependence Assump-
happen). The edges denote potential dependencies
tion with respect to a causal process graph G = hL, Ei
between observations and the absence of a directed
if and only if for all A, B, C ⊂ L where A ∩ B = ∅ it
edge from observation type a to observation type b
is the case that P I(A, B, C) ⇒ δ(A, B, C)
indicates that the process that generates observations
of type b does not directly depend on the history of The Causal Analysis (CA) Algorithm (Algorithm 1)
observations of type a. Analogous to the use of d- uses a process independence oracle to construct a di-
rected graph. We use πlG to denote the parents of l process graph G. Our causal factorization and depen-
in graph G and |B| to denote the cardinality of the dence assumptions allow us to focus on δ*separation
set B. The basic idea is to use process independence in G by assuming that the observed process indepen-
statements to remove edges from an initially complete dence statements accurately reflect the δ*separation
graph. This algorithm is analogous to the PC Algo- statements about G for the observed observation types.
rithm of Spirtes, Glymour and Scheines (2001) but In order to understand and interpret the output of the
does not have an orientation phase. CA algorithm we need to understand the conditions
that lead to edges in the final output. We begin by
Note that the output of the CA algorithm is a directed
defining the concept of vertex blockability relative to
graph and that any edges presented do not necessarily
a set of observed event types.
indicate a causal relationship. In the remainder of this
section we explore the interpretation of the output of We say that a vertex a is b-unblockable relative to O in
the CA algorithm under various assumptions. Recall G if and only if for all C ⊆ O \ {a, b} ¬δ(a, C, b) is true
that a ↔ b simply indicates that a → b and b → a and of G. Otherwise the vertex is said to be b-blockable
not the existence of a latent common cause. relative to O. Note that if b → b then if b ∈ O b is
b-unblockable relative to O.
Input: A set of events L and a process PL
Output: A directed graph G We say that l is a direct cause of l0 relative to O for
Let G = hL, Ei be a complete directed graph.; causal process graph G if and only if there exists a
foreach l ∈ L do directed path hl1 , . . . , ln i where l1 = l and ln = l0 and
Let n = 0; li 6∈ O for (1 < i < n). We call the path in the
foreach l0 ∈ πlG do definition of direct cause a witnessing path that l is a
foreach B ⊆ πlG \ {l0 } where |B| = n do direct cause of l0 . We let Db denote the set of observed
if P I(l0 , B, l) holds in PL then direct causes of the variable b relative to O, that is,
E = E \ hl0 , li members of O that are direct causes of b relative to O.
end Example 1. Let E = {a → c, c → b}, L = {a, b, c}
end and O = {a, b}. The vertex a is b-unblockable relative
Let n = n + 1; to O for G = hL, Ei but the vertex b is a-blockable
end relative to O. In this example, a is a direct cause of b
end relative to O in graph G and a → c → b is a witnessing
Return G = hL, Ei; path for this fact.
Algorithm 1: The Causal Analysis (CA) Algorithm Lemma 3. If l0 is a direct cause of l relative to O in
G then l0 is l-unblockable relative to O in G.
Theorem 1 (Complete Observations). If PL satisfies
both the causal dependence and factorization assump- The following lemma allows us to make causal infer-
tions with respect to G then algorithm CA(L, PL ) re- ences using the causal analysis algorithm about the
turns G 0 = G. absence of a direct causal relationship.
Lemma 1. If PL satisfies the causal dependence as- Lemma 4. If PL satisfies the causal dependence as-
sumption for G = hL, Ei and algorithm CA(L, PL ) re- sumption with respect to G then, in the graph G 0 output
turns G 0 = hL, E 0 i then if l0 → l ∈ E then l0 → l ∈ E 0 . by CA(O,PL ), the set of parents for each event type
include all of its direct causes relative to O.
Lemma 2. If PL satisfies both the causal depen- In particular, if the algorithm finds that an event type
dence and factorization assumptions for G = hL, Ei a is not a parent of event type b then a is not a direct
and algorithm CA(L, PL ) returns G 0 = hL, E 0 i then if cause of b.
l0 → l 6∈ E then l0 → l 6∈ E 0 .

Proof of Theorem 1: The theorem follows from 3.2 Causal sufficiency
Lemmas 1 and 2. In the section, we restrict the type of unobserved event
types which enables us to make strong inferences about
3.1 Absence of a direct causal relationship the causal structure of a process. In particular we as-
sume causal sufficiency which is essentially an assump-
Next we consider the case in which some of the event
tion that there are no latent confounding processes.
types in the system are not observed. We let O ⊆ L
be the set of observed event types. In this case we will A set of event types O ⊂ L is causally sufficient with
assume that the causal factorization and dependence respect to a graph G = hL, Ei if and only if every
assumptions hold for a process PL and some causal common cause of l, l0 ∈ O is in the set of event types
O. We say that a is a cause of b in G and if there is a
directed path from a to b in G.
A directed graph G 0 = hO, E 0 i is causally correct with
respect to a graph G = hL, Ei if for every edge ha, bi ∈ We aim to find common features of all graphs that are
E 0 a is a direct cause of b with respect to O in G. consistent with the observed pattern of process inde-
Theorem 2 (Causal Sufficiency). If PL satisfies both pendence statements. Latent processes, however, can
the causal dependence and factorization assumptions mask the causal nature of the observed pattern of de-
for G = hL, Ei and O ⊆ L is causally sufficient with pendencies.
respect to G then the graph G 0 returned by algorithm For a pair of vertices a, b and graph G we say that there
CA(O, PL ) is causally correct with respect to G and is a potential indirect inducing path into b relative to O
O. if and only if (1) there is a vertex c1 ∈ O \ {a, b} such
Lemma 5. If PL satisfies the causal dependence and that a → b in G and (2) there is a sequence of vertices
factorization assumptions with respect to G and O is c1 , . . . , cn ⊆ O \ {a, c} such that ci ↔ ci+1 and cn ↔ b
causally sufficient for G then the output of the CA al- in G.
gorithm removes the edge a → b if a is not a direct Lemma 7. For any set of observed variable O, if a
cause of b relative to O. graph has an inducing path between observed variables
a, b into b containing another observed variable then
3.3 Causal insufficiency the output of the CA algorithm will contain a potential
indirect inducing path into b.
We have shown that the CA algorithm can provide
Theorem 3 (Sufficient Cause). If PL satisfies both
causally accurate information under the assumptions
the causal dependence and factorization assumptions
of causal sufficiency, causal factorization and causal
for G = hL, Ei then if CA produces G 0 with vertices
dependence. In this section we consider removing the
O ⊆ L for which the subgraph over {a, b} is a → b and
assumption of causal sufficiency.
G 0 contains no potential inducing path between a, b into
Example 2. Let E = {a ← c, c → b}, L = {a, b, c} b then a is a cause of b in G.
and O = {a, b}. The observed event types O are not Lemma 8. If PL satisfies both the causal dependence
causally sufficient for the graph G = hL, Ei. In ad- and factorization assumptions for G = hL, Ei and CA
dition, the CA algorithm fails to provide output that produces G 0 with vertices O ⊆ L for which the subgraph
is causally correct. In particular, the CA algorithm over {a, b, c} is a ↔ b ↔ c then
yields the graph in which a → b and b → a despite the
fact that neither is a a cause of b in G nor is b a cause
• if P I(a, ∅, c) and P I(c, ∅, a) then there is a latent
of a.
common causes of a, b and a (possibly distinct)
Our aim is to graphically characterize vertex separa- latent cause of b, c and b is not a direct cause of c
bility. We do so using the idea of an inducing path in a and b is not a direct cause of a.
directed graph that was introduced for directed acyclic • if P I(a, b, c) then there is no latent common
graphs by Verma and Pearl (1990). For a pair of ver- causes of b, c, b is a cause of c in G.
tices a, b, we define Aab = An({a}) ∪ An({b}) \ {a, b}.
A path p between ha, bi is an inducing path relative to
O if and only if (1) every vertex on p ∈ O is a collider
4 Statistical Processes and Process
on p and (2) Every collider on p is in Aab . An induc- Independence
ing path p = hl1 = a, . . . , ln = bi from a to b is into b if
ln−1 → ln . An inducing path p = hl1 = a, . . . , ln = bi Our approach to causal discovery through the obser-
from a to b is out of a if l1 → l2 . vation of a dynamic process is applicable to different
temporal statistical processes. The key connection re-
Lemma 6. For a directed graph G the following three
quired is a connection between process independence
statements are equivalent:
statements and the observations from a particular sta-
tistical process. In this section we consider two distinct
(a) A vertex a is b-unblockable relative to O in graph statistical processes and discuss process independence
G for these processes.
(b) There is an inducing path between a and b relative
to O in graph G b . Note this inducing path must 4.1 Dynamic Bayesian Networks
be into b.
Dynamic Bayesian networks (DBNs) are a popular
(c) ¬δ(a, O ∩ Aab , b) in G. discrete-time model that can capture temporal dy-
namics of a statistical process. A DBN is a statis-
tical model of an infinite set of variables indexed by distribution for D in which the timestamps are con-
time. A variable Xit denotes the ith variable at time tinuous random variables can be written in this form.
t. We use X = X1 , . . . , Xn to denote the set of vari- For more details see [1, 2]. Despite the fact that the
able types in the DBN, that is, a variable with an un- modeling assumptions are weak, these models offer a
specified time component and X t to denote the set of powerful approach for decomposing the dependencies
variables at time t. The DBN specifies the evolution of different event types on the past. In particular, this
of X t as a stochastic function of the value of previous per label conditional specification allows one to model
variables X t−i (i > 0). In particular, the variable Xit detailed label-specific dependence on past events.
is a stochastic function of the value of its parents in
Next we define a graphical conditional intensity model
a graph. The causal process graph associated with a
that we call a graphical event model (GEM). A fil-
causal DBN is a graph over the variable types of the
tered history for A ⊆ L as [h]A = {(ti , li )|(ti , li ) ∈
DBN X where there is an edge Xi → Xj if there ex-
h ∧ li ∈ A}. A GEM is a pair < G, θ >, where
ists a t, i such that there is an edge Xit−i → Xj in the
G =< L, E > is a directed graph over a set of event
DBN. Thus, the parent relationship of the causal pro-
types and edges in E represent potential dependencies
cess graph captures the dependence of a variable type
among event types. The parameters θ = {θl }l∈L pa-
on the history of other variable types. Furthermore,
rameterize the intensity functions for each event type.
process independence statements P I(Xi , C, Xj ) corre-
In particular, λl (t|ht , θl ) = λl (t|[ht ]πl , θl ) where πl is
spond to a set of independence statements of the form
the set of parents for l in G. As in the case of the
I(Xi1 , . . . , Xit−1 , XC1 , . . . , XCt−1 , Xjt ). Without further
DBN, a process independence statement correspond to
assumptions, testing process independence would be
testing a dependence of an event type on set of event
unfeasible but if we focus on stationary processes with
histories. One potential approach to testing a process
finite temporal dependency we can potentially test
independence P I(a, C, b) is to estimate/learn an inten-
process independence statements.
sity function for b using the event histories for {a} ∪ C
and see if the intensity model depends on the event
4.2 Graphical Event Models history for a. The work by Gunawardana et al (2011)
on learning piecewise continuous intensity models is a
In this section, we define Conditional Intensity Mod-
good starting point for this approach.
els and Graphical Event Models (GEMs) and con-
nect these models with previous work on the class of
Piecewise-Constant Conditional Intensity Models and 5 Discussion
Poisson Networks. We assume that events of differ-
ent types are distinguished by labels l drawn from One of the goals for the research direction described
a finite alphabet L. An event is then composed of in this paper is the development a sound approach
a non-negative time-stamp t and a label l. A his- to causal inference for dynamic systems. One of the
n
tory is an event sequence h = {(ti , li )}i=1 where popular extant approaches is that of Granger causal-
0 < t1 < · · · < tn , and our data is a specific history ity which fails on this account. This approach is typi-
denoted by D. Given data D, we define the history cally applied in a discrete-time continuous valued time-
at time t as h(t, D) = {(ti , li ) | (ti , li ) ∈ D, ti ≤ t}. We series and, thus, can be viewed as a dynamic Bayesian
suppress D from h(t, D) when clear from context and network. Roughly speaking, in a multivariate time se-
write hi = h(ti−1 ). By convention t0 = 0. We define ries X a set of variables are the Granger-causes of Xj if
the ending time t(h) of a history h as the time of the the historical values of this set of variables (including
last event in h: t(h) = max(t,l)∈h t so that t(hi ) = ti−1 . Xj ) are necessary and sufficient for optimal predic-
A Conditional Intensity Model (CIM) is a set of non- tion. Unfortunately this approach does not appropri-
negative conditional intensity functions indexed by la- ately handle latent common causes. In particular, for
bel {λl (t|h; θ)}l∈L . The data likelihood for this model both of the scenarios described in Lemma 8 it is the
is case that each of the variables is a Granger cause of its
neighbors while this relationships need not be causal
n
YY as the lemma demonstrates. In fact, it is easy to con-
p(D|θ) = λl (ti |hi , θ)1l (li ) e−Λl (ti |hi ;θ) (1)
struct stochastic processes with latent factors which
l∈L i=1
demonstrate that the inferential approach to Granger
Rt causality is not sound with respect to causal relations.
where Λl (t|h; θ) = −∞ λl (τ |h; θ)dτ and the function
1l (l0 ) is one if l0 = l and zero otherwise. The condi- There has been much work related to causal discov-
tional intensities are assumed to satisfy λl (t|h; θ) = 0 ery and the estimation of causal effects in time-series.
for t ≤ t(h) to ensure that ti > ti−1 = t(hi ). These As discussed above, the work on Granger causality
modeling assumptions are quite weak. In fact, any (Granger 1969) is the most well known. The short-
comings of this approach are also well known (e.g., References
Eichler 2007) and there has been some work in trying
to address these known short comings. For instance, [1] D. J. Daley and D. Vere-Jones. An Introduction to
Eichler (2007) proposes a similar approach to the ap- the Theory of Point Processes: Elementary The-
proach described here but differs in that it allows for ory and Methods, volume I. Springer, second edi-
the possibility of “simultaneous correlation” which re- tion, 2002.
quires the use of an alternative definition of separa-
[2] Vanessa Didelez. Graphical models for marked
tion. In addition, while providing definitions of cause
point processes based on local independence.
and spurious cause, sufficient conditions for the identi-
JRSS-B, 70(1):245–264, 2008.
fication of causal relationships are not presented. The
work of Entner and Hoyer (2010) considers the prob- [3] Michael Eichler. Granger causality and path di-
lem of causal discovery from time series data using agrams for multivariate time series. Journal of
limited dependence vector autoregressive models and Econometrics, 137:334–353, 2007.
the FCI algorithm that uses conditional independence
tests to identify the structure. Our approach of using [4] Michael Eichler and Vanessa Didelez. Causal rea-
δ*separation is inspired by the work of Didelez (2008) soning in graphical time series models. In Un-
who defined δ-separation and shows the connection be- certainty in Artificial Intelligence, pages 109–116,
tween that graphical separation criterion and local in- 2007.
dependence of marked point processes. Our extension
[5] Doris Entner and Patrik O. Hoyer. On causal dis-
to δ*separation allows for the appropriate treatment of
covery from time series data using FCI. In Prob-
self-edges which are essential in any self-excitatory or
abilistic Graphical Models, pages 121–128, 2010.
self-inhibitory dynamic process. Another more loosely
connected work is that of Eichler and Didelez (2007) [6] C.W.J. Granger. Investigating causal relations by
that considers the estimation of causal effects based econometric models and cross-spectral methods.
on an intervention in a time-series. Econometrica, 37:424–438, 1969.
While the results described in this paper offer hope [7] Asela Gunawardana, Christopher Meek, and
for developing a methodologically sound approach to Puyang Xu. A model for temporal dependencies
causal inference for dynamic systems, there is much in event streams. In Advances in Neural Informa-
work that needs to be done. Here are some of the tion Processing Systems, 2011.
open research questions.
[8] C. Meek. Strong completeness and faithfulness
• Non-parametric tests for process independence for in Bayesian networks. In Proceedings of Eleventh
various type of temporal statistical processes Conference on Uncertainty in Artificial Intelli-
gence, Montreal, QU, pages 411–418. Morgan
Kaufmann, August 1995.
• Soundness and completeness results for
δ*separation analogous to those provided by [9] P. Spirtes, C. Glymour, and R. Scheines. Cau-
Pearl (1988), Meek (1995) and Spirtes et al sation, Prediction, and Search, Second Edition.
(2001) for d-separation. Note that Didelez (2008) MIT Press, Cambridge, MA, second edition, 2001.
has shown the soundness of δ-separation for
a family of marked point processes related to [10] T. Verma and J. Pearl. Equivalence and synthe-
GEMs. sis of causal models. In Proceedings of Sixth Con-
ference on Uncertainty in Artificial Intelligence,
• A representation for equivalence classes of causal Boston, MA, pages 220–227. Morgan Kaufmann,
graphs with respect to δ*separation in the case of July 1990.
causal insufficiency (O ⊂ L) analogous to those
developed by Verma and Pearl (1990) and Spirtes
et al (2001) that captures the common casual as-
pects of the set of graphs in the equivalence class.

Acknowledgments

Thanks to Asela Gunawardana and two anonymous
reviewers for their comments on an earlier draft of this
paper.