=Paper=
{{Paper
|id=Vol-1504/uai2015aci_paper6
|storemode=property
|title=Learning the Structure of Causal Models with Relational and Temporal Dependence
|pdfUrl=https://ceur-ws.org/Vol-1504/uai2015aci_paper6.pdf
|volume=Vol-1504
|dblpUrl=https://dblp.org/rec/conf/uai/MarazopoulouMJ15
}}
==Learning the Structure of Causal Models with Relational and Temporal Dependence==
Katerina Marazopoulou Marc Maier David Jensen
kmarazo@cs.umass.edu maier@cs.umass.edu jensen@cs.umass.edu
College of Information and Computer Sciences
University of Massachusetts Amherst
Amherst, MA 01003
Abstract

Many real-world domains are inherently relational and temporal—they consist of heterogeneous entities that interact with each other over time. Effective reasoning about causality in such domains requires representations that explicitly model relational and temporal dependence. In this work, we provide a formalization of temporal relational models. We define temporal extensions to abstract ground graphs—a lifted representation that abstracts paths of dependence over all possible ground graphs. Temporal abstract ground graphs enable a sound and complete method for answering d-separation queries on temporal relational models. These methods provide the foundation for a constraint-based algorithm, TRCD, that learns causal models from temporal relational data. We provide experimental evidence that demonstrates the need to explicitly represent time when inferring causal dependence. We also demonstrate the expressive gain of TRCD compared to earlier algorithms that do not explicitly represent time.

1 INTRODUCTION

Recent work in artificial intelligence has devoted increasing attention to learning and reasoning with causal knowledge. Causality is central to understanding the behavior of complex systems and to selecting actions that will achieve particular outcomes. Thus, causality is implicitly or explicitly central to mainstream AI areas such as planning, cognitive modeling, computational sustainability, game playing, multiagent systems, and robotics. Causal inference is also central to many areas beyond AI, including medicine, public policy, and nearly all areas of science.

Substantial research advances have been made over the past several decades that allow the structure and parameters of causal graphical models to be learned from observational data. As early as the 1990s, researchers developed constraint-based algorithms, such as the IC [Pearl and Verma, 1991] and PC [Spirtes et al., 2000] algorithms, that leverage the connection between the graphical criterion of d-separation and the conditional independencies that are inferred to hold in the underlying distribution, in order to learn the structure of causal graphical models from data.

In this work, we significantly extend the expressiveness of the models learnable with constraint-based algorithms. Specifically, we provide a formalization of temporal relational models, an expressive class of models that can capture probabilistic dependencies between variables on different types of entities within and across time points. We extend the notion of abstract ground graphs [Maier et al., 2013b]—a lifted representation that allows reasoning about the conditional independencies implied by a relational model—to temporal relational models, and we show that temporal abstract ground graphs are a sound and complete abstraction for ground graphs of temporal relational models. Temporal abstract ground graphs can be used to answer d-separation queries for temporal relational models. We also extend an existing constraint-based algorithm for inferring causal dependence in relational data—the relational causal discovery (RCD) algorithm—to incorporate time, thus providing a constraint-based method that learns causal models from temporal relational data.

2 RELATIONAL MODELS

Propositional representations, such as Bayesian networks, describe domains containing a single entity class. Many real-world systems comprise instances from multiple entity classes whose variables are interdependent. Such domains are often referred to as relational. In this section, we introduce basic concepts of relational representations that exclude time.

A relational schema S = (E, R, A, card) specifies the types of entities, relationships, and attributes that exist in a domain. It includes a cardinality function that constrains
the number of times an entity instance can participate in a relationship. A relational schema can be depicted with an Entity-Relationship (ER) diagram. Figure 1 shows an example ER diagram (in gray) that describes a simplified university domain. The domain consists of two entity classes (Student and Course) and one relationship class (Takes). The entity class Student has one attribute, gpa, and the entity class Course has one attribute, difficulty. The cardinality constraints are shown with crow's-feet notation—students can enroll in multiple courses and a course can be taken by multiple students.

Figure 1: Relational model for the example domain. The underlying relational schema (ER diagram) is shown in gray.

A relational skeleton is a partial instantiation of a relational schema. It specifies the entity and relationship instances that exist in the domain. Figure 2 shows an example relational skeleton for the relational schema of Figure 1. The skeleton consists of two Student instances, Alice and Bob, and two Course instances, CS101 and CS201. Alice is taking CS101 and Bob is taking both courses.

Figure 2: Example relational skeleton for the schema shown in Figure 1.

Given a relational schema, we can specify relational paths, which intuitively correspond to ways of traversing the schema. For the schema shown in Figure 1, possible paths include [Student, Takes, Course] (the courses a student takes), as well as [Student, Takes, Course, Takes, Student] (other students that take the same courses). Relational variables consist of a relational path and an attribute that can be reached through that path. For example, the relational variable [Student, Takes, Course].difficulty corresponds to the difficulty of the courses that a student takes. Probabilistic dependencies can be defined between relational variables. Dependencies are said to be in canonical form when the path of the effect (or outcome) relational variable is a single item. For canonical dependencies, the path of the cause (or treatment) relational variable describes how dependence is induced. As an example, consider the following relational dependency

[Course, Takes, Student].gpa → [Course].difficulty

which states that the difficulty of a course is affected by the gpa of the students taking that course. Presumably, instructors adjust the difficulty of the course based on the grade-point average of enrolled students.

A relational model M = (S, D, Θ) is a collection of relational dependencies defined over a single relational schema, along with their parameterizations (a conditional probability distribution for each attribute given its parents). The structure of a relational model can be depicted by superimposing the dependencies on the ER diagram of the relational schema, as shown in Figure 1, and labeling each arrow with the corresponding relational dependency.

Given a relational model M and a relational skeleton σ, we can construct a ground graph GG_Mσ by applying the relational dependencies as specified in the model to the specific instances of the relational skeleton. Figure 3 shows the ground graph for the model of Figure 1 applied on the relational skeleton of Figure 2. In this work, we restrict our attention to relational models that do not contain the kind of relational autocorrelation that gives rise to cycles in the ground graph.

Figure 3: Ground graph for the model of Figure 1 applied on the relational skeleton of Figure 2.

3 TEMPORAL RELATIONAL MODELS

Relational models can be extended with a temporal dimension to model probabilistic dependencies over time. Such an extension is similar to the way in which dynamic Bayesian networks (DBNs) extend Bayesian networks [Murphy, 2002]. A temporal relational model can be thought of as a sequence of time points, each of which is associated with a (non-temporal) relational model, and a set of dependencies that cross time points. Because dependencies in this model have a causal interpretation, dependencies across time points are only directed from the past to the future.

In this section, we extend the relational notions presented in Section 2 to include time. We assume that (1) time is discrete; (2) the schema is static; (3) relational dependencies do not change over time; (4) the temporal relational skeleton is given a priori; (5) the first-order Markov assumption holds (i.e., treatment and outcome can be at most one time point apart); and (6) all entities participating in a relationship are contemporaneous with the relationship.

Under these assumptions, the structure of a temporal relational model can be represented by using only two time points, as shown in Figure 4. Every time point has the same relational schema, shown in gray. A temporal relational skeleton provides a partial instantiation of a temporal relational schema. It specifies the entity and relationship instances that exist in each time point. Note that different sets of entity and relationship instances may be present in each time step. An example temporal relational skeleton is shown in Figure 5. In this case, both time points have the same set of entity instances (Alice and Bob are instances of the Student entity; CS101 and CS201 are instances of the Course entity). However, the instances of the relationship Takes differ. In t = 0, Alice takes CS101 and Bob takes CS201. In t = 1, Alice takes CS201 and Bob takes no classes.

Figure 4: Example structure of a temporal relational model, with the dependencies [Course^t, Takes^t, Student^t].gpa → [Course^t].difficulty, [Student^{t+1}, Student^t, Takes^t, Course^t].difficulty → [Student^{t+1}].gpa, and [Course^{t+1}, Takes^{t+1}, Student^{t+1}].gpa → [Course^{t+1}].difficulty. The underlying relational schema is shown in gray.

Figure 5: Example temporal relational skeleton for the schema shown in Figure 4.

Temporal relational paths capture possible ways to traverse the temporal schema; therefore, they can cross time points. For example, a temporal relational path is [Student^{t+1}, Student^t, Takes^t, Course^t], which describes the classes that a student took in the previous semester. More formally, a temporal relational path is a sequence of non-temporal relational paths (relational paths within a time point) and "jumps" between neighboring time points. These jumps can happen at both entities and relationships because each choice encodes a distinct semantics. For example, the relational path [Student^{t+1}, Student^t, Takes^t, Course^t] describes the courses that a student took in the previous semester, while the path [Student^{t+1}, Takes^{t+1}, Course^{t+1}, Course^t] first finds the courses that a student is taking this semester, and then finds those courses in the previous semester. In the example skeleton of Figure 5, for Alice, the first path would reach CS101 at t = 0, while the second path would reach CS201 at t = 0.

Definition 1. A temporal relational path P is a sequence of non-temporal relational paths P_0^{t_0}, ..., P_k^{t_k} (k ≥ 0) such that for any two consecutive paths P_i^{t_i}, P_j^{t_j} in P the following hold:

1. |t_i − t_j| = 1.
2. The last item class of P_i^{t_i} is the same as the first item class of P_j^{t_j}.
3. No subpath of P is of the form [I_k^t, ..., I_k^t], where all relations in the subpath are one-to-one.

We use the notation time(P) to denote the set of all time points that appear in path P.

Temporal relational variables consist of a temporal relational path and an attribute that can be reached through that path. Temporal relational dependencies define probabilistic dependencies between two temporal relational variables. Temporal probabilistic dependencies are never directed backwards in time; therefore, at the model level, there are no dependencies going back in time. However, the temporal constraints associated with a dependency are also implicitly encoded by the temporal relational path that annotates the dependency. To account for this, we forbid the temporal relational path of the treatment to go through any time points later than the time point of the outcome.

Definition 2. A temporal relational dependency consists of two temporal relational variables with a common base item, [I_1^t, ..., I_k^{t'}].V_k → [I_1^t].V_1, such that

max(time([I_1^t, ..., I_k^{t'}])) ≤ t.

The first-order Markov assumption for temporal causal models implies that for every probabilistic dependency, if the treatment is in time point t, then the outcome is either in t or in t + 1. In the case of relational domains, the relational path of the treatment carries some temporal information since it can contain multiple time points. For the first-order Markov condition to hold, we require the relational path of the treatment to only go through the current and the next time points. More formally, a temporal relational dependency [I_1^t, ..., I_k^{t'}].V_k → [I_1^t].V_1 follows the first-order Markov assumption if the following two conditions hold:

t = t' or t = t' + 1   (1)

time([I_1^t, ..., I_k^{t'}]) ⊆ {t, t − 1}   (2)

The structure of a 2-slice temporal relational model is defined as a set of temporal relational dependencies over a relational schema.
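Conditions (1) and (2) above are mechanical enough to check in code. The sketch below is our own illustration (not part of the TRCD implementation); it represents a temporal relational path as a list of (item class, time point) pairs, with the base item first:

```python
def follows_first_order_markov(treatment_path, outcome_time):
    """Check conditions (1) and (2) for a temporal relational
    dependency [I_1^t, ..., I_k^{t'}].V_k -> [I_1^t].V_1:
      (1) t = t' or t = t' + 1
      (2) time(path) is a subset of {t, t - 1}

    treatment_path: list of (item_class, time) pairs, base item first.
    outcome_time:   t, the time point of the outcome variable.
    """
    t = outcome_time
    t_prime = treatment_path[-1][1]  # time point of the terminal item
    path_times = {time for _, time in treatment_path}
    return (t == t_prime or t == t_prime + 1) and path_times <= {t, t - 1}

# The cross-time dependency of Figure 4, anchored at t + 1 = 1:
# [Student^1, Student^0, Takes^0, Course^0].difficulty -> [Student^1].gpa
path = [("Student", 1), ("Student", 0), ("Takes", 0), ("Course", 0)]
print(follows_first_order_markov(path, outcome_time=1))  # True
```

A dependency whose treatment path reaches more than one time point into the past (e.g., the same path with outcome_time=2) violates condition (2) and is rejected.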
Definition 3. The structure of a 2-slice temporal relational model is a pair M = ⟨S, D_T⟩, where S is a relational schema and D_T is a set of temporal relational dependencies that adhere to the first-order Markov assumption.

Figure 4 shows the structure of a 2-slice temporal relational model for the university domain. The temporal dependency shows that the difficulty of a course in the fall semester affects the spring GPA of the students that took this class in the fall. If, for example, a student takes many difficult classes, it is more likely that the student's GPA will drop in the next semester. Finally, given a temporal relational skeleton σ_T and a temporal relational model M, we can construct a temporal ground graph GG_MσT in the same way as in the non-temporal case.

4 EXPRESSIVENESS OF TEMPORAL RELATIONAL MODELS

The temporal relational model described so far subsumes propositional directed networks (Bayesian networks), propositional temporal directed networks (dynamic Bayesian networks), and relational non-temporal models. This added expressivity comes at the cost of increased complexity, which raises an obvious and important question: What is the value of this added expressivity?

As an example of the practical utility of this added expressivity, we consider a real dataset and show how a path with temporal jumps leads to different terminal sets. Specifically, we used the Reality Mining dataset [Eagle and Pentland, 2006]. The dataset contains two entities, Tower and Cellphone, and one relationship, Connects. The relational schema for this domain is shown in Figure 6. For this dataset, entity instances (i.e., the set of towers and cellphones) do not change over time. However, the set of relationship instances (the connections) is different at every time point. In total, there are 95 cellphones, 32,579 towers, and 3,308,709 connections. Every connection is time-stamped with a precision of 1 second. For our work, the time granularity was coarsened to a day.

Figure 6: Relational schema for the reality mining dataset.

We computed the terminal sets for the following three paths:

P1: [Tower^{t+1}, Tower^t, Connects^t, Cellphone^t]
P2: [Tower^{t+1}, Connects^{t+1}, Connects^t, Cellphone^t]
P3: [Tower^{t+1}, Connects^{t+1}, Cellphone^{t+1}, Cellphone^t]

The first path corresponds to the cellphones that were connected to a tower in the previous timestep. The second path corresponds to the cellphones that connected to a tower both in the current and in the previous timestep. The third path corresponds to the cellphones that connected to a tower in the current timestep, and gets the state of those cellphones in the previous timestep.

We randomly selected 100 dates from the dataset. For each of these dates and for each tower that was used on these dates, we computed the terminal sets for the above paths. For a given tower and date, we computed the Jaccard distance between the different terminal sets. The Jaccard distance between two sets A and B is defined as J(A, B) = 1 − |A ∩ B| / |A ∪ B|. Intuitively, this quantifies the overlap of two sets, while accounting for the size of both. Table 1 shows the average Jaccard distance between the terminal sets, averaged across the dates and the towers. The results indicate that, on average, the terminal sets reached through the three paths will be different; therefore, this more expressive representation could be of use in real data.

Table 1: Jaccard distance between the terminal sets of the different paths for the reality mining dataset (100 sample dates, distance is averaged across dates and across towers).

Jaccard distance | P1 vs. P3 | P1 vs. P2 | P2 vs. P3
mean             | 0.47      | 0.31      | 0.31
min              | 0         | 0         | 0
max              | 1         | 1         | 1
median           | 0.5       | 0         | 0

5 TEMPORAL ABSTRACT GROUND GRAPHS

An abstract ground graph is a lifted representation that abstracts paths of dependence over all possible ground graphs for a given relational model [Maier et al., 2013b]. Abstract ground graphs are shown to be sound and complete in the sense that every edge in the abstract ground graph corresponds to an edge in some ground graph, and every edge in an arbitrary ground graph is represented by an edge in an abstract ground graph. In this section we adapt the definition of abstract ground graphs for the case of a 2-slice temporal relational model.

Definition 4. A temporal abstract ground graph tAGG_MBh = ⟨V, E⟩ for a 2-slice temporal relational model M = ⟨S, D_T⟩, perspective B ∈ E ∪ R, and hop threshold h ∈ N_0 is an abstraction of the dependencies D_T for all possible ground graphs GG_MσT of M on arbitrary temporal skeletons σ_T. The temporal abstract ground graph is a directed graph with the following nodes and edges:

1. V = RV ∪ IV, where
(a) RV is the set of temporal relational variables with a path of length at most h + 1:

RV = {[B^t, ..., I_j^{t'}].V | length([B^t, ..., I_j^{t'}]) ≤ h + 1}

(b) IV is the set of intersection variables between pairs of temporal relational variables that could intersect¹:

IV = {X ∩ Y | X, Y ∈ RV and X = [B^t, ..., I_k^{t''}, ..., I_j^{t'}].V and Y = [B^t, ..., I_l^{t'''}, ..., I_j^{t'}].V and I_k^{t''} ≠ I_l^{t'''}}

2. E = RVE ∪ IVE, where

(a) RVE ⊂ RV × RV are the relational variable edges:

RVE = {[B^t, ..., I_k^{t'}].V_k → [B^t, ..., I_j^{t''}].V_j | [I_j^{t''}, ..., I_k^{t'}].V_k → [I_j^{t''}].V_j ∈ D_T and [B^t, ..., I_k^{t'}] ∈ extend([B^t, ..., I_j^{t''}], [I_j^{t''}, ..., I_k^{t'}])}

(b) IVE ⊂ (IV × RV) ∪ (RV × IV) are the intersection variable edges. This is the set of edges that intersection variables "inherit" from the relational variables from which they were created.

¹ Only relational variables in the same time point can intersect.

The extend method converts dependencies of the model, specified in canonical form, into dependencies from the perspective of the abstract ground graph.

Temporal abstract ground graphs can be shown to be a correct abstraction over all possible temporal ground graphs. The proof follows the one provided for non-temporal abstract ground graphs, as presented by Maier et al. [2013b].

The temporal abstract ground graph for a model on the student-courses domain with the dependency

[Student^{t+1}, Student^t, Takes^t, Course^t].difficulty → [Student^{t+1}].gpa

is shown in Figure 7. This abstract ground graph is from the perspective of Student and for hop threshold h = 4. Disconnected nodes are omitted.

Figure 7: Temporal abstract ground graph from the perspective of Student and hop threshold h = 4 for a model with one dependency: [Student^{t+1}, Student^t, Takes^t, Course^t].difficulty → [Student^{t+1}].gpa. INT denotes intersection variables between pairs of temporal relational variables.

6 d-SEPARATION IN TEMPORAL ABSTRACT GROUND GRAPHS

The rules of d-separation provide a graphical criterion that specifies whether two sets of variables in a directed acyclic graph are conditionally independent given a third set of variables. Specifically, let X, Y, Z be disjoint sets of variables in a directed acyclic graph. X is d-separated from Y given Z if every path between variables in X and Y is not a d-connecting path given Z. A path is d-connecting given Z if for every collider W on the path, either W ∈ Z or a descendant of W is in Z, and every non-collider on the path is not in Z. Geiger and Pearl [1988] and Verma and Pearl [1988] showed that d-separation is sound and complete.

Ground graphs (and temporal ground graphs) are directed acyclic graphs; therefore, the rules of d-separation can be applied to them. In the case of domains where the first-order Markov assumption holds, a d-separating set for variables in the same time point can be found by examining only one time point in the past.

Proposition 5. Let G be a temporal directed acyclic graph that follows the first-order Markov assumption. (i) If X^{t+1} and Y^{t+1} are conditionally independent given some set Z, then there exists a separating set W such that time(W) ⊆ {t, t + 1}. (ii) Similarly, if X^t and Y^{t+1} are conditionally independent given some set Z, then there exists a separating set W such that time(W) ⊆ {t, t + 1}.

Proof. For each case, we will construct an appropriate separating set:

(i) Let W = parents(X^{t+1}) ∪ parents(Y^{t+1}). Because of the first-order Markov assumption, time(W) ⊆ {t, t + 1}. Moreover, conditioning on W renders X^{t+1} and Y^{t+1} independent because of the local Markov property (either X^{t+1} is a descendant of Y^{t+1} or vice versa, or neither is a descendant of the other).

(ii) In this case, let W = parents(Y^{t+1}). Because of the temporal semantics, X^t is a non-descendant of Y^{t+1}; therefore, Y^{t+1} is conditionally independent of its non-descendants (which include X^t) given W.

The significance of the above proposition is that, given a model where the first-order Markov assumption holds, we
could infer conditional independencies (and use them to learn the structure of the model) by only considering consecutive time points.

Maier et al. [2013b] showed that d-separation cannot be applied directly to a relational model. To correct for that, they introduced relational d-separation, a graphical criterion that can be applied to abstract ground graphs and used to infer conditional independencies that hold across all possible ground graphs of the model. Here, we show that the notion of relational d-separation can be generalized to temporal abstract ground graphs as well. In the following definition, X|b denotes the terminal set of X starting at b, i.e., the set of X instances that can be reached if we start from instance b and follow the relational path of X on the relational skeleton.

Definition 6 (Temporal relational d-separation). Let X, Y, and Z be three disjoint sets of temporal relational variables from perspective B for a 2-slice temporal relational model M such that not both X and Y contain variables in t. Then X and Y are d-separated by Z if and only if, for any temporal skeleton σ_T, X|b and Y|b are d-separated by Z|b in ground graph GG_MσT for all b ∈ σ_T(B).

The following theorem shows that temporal relational d-separation is sound and complete up to a specified hop threshold. We use the notation X̄ to denote the set of relational variables X augmented with the set of intersection variables they participate in.

Theorem 7. Let X, Y, and Z be three disjoint sets of temporal relational variables for perspective B such that not both X and Y contain variables in t. Then, for any temporal skeleton σ_T and for all b ∈ σ_T(B), X|b and Y|b are d-separated by Z|b up to h in ground graph GG_MσT if and only if X̄ and Ȳ are d-separated by Z̄ on the abstract ground graph tAGG_MBh.

Proof. We prove this by defining a non-temporal relational model that is equivalent to the given temporal model. Since d-separation is sound and complete for non-temporal relational models, the result extends to the equivalent model, and therefore to the temporal relational model. Given a 2-slice temporal relational model M_T = ⟨S, D_T⟩, construct a relational model M = ⟨S', D⟩ as follows:

• For every item in S, add an item in S' with superscript t and one with superscript t + 1: S' = {I^t, I^{t+1} | I ∈ S}.
• For every entity E ∈ S, add in S' a one-to-one relationship between E^t and E^{t+1}.
• The set of relational dependencies is the set of temporal relational dependencies: D = D_T.

It can be shown that these two models are equivalent in the sense that there is a one-to-one correspondence between valid paths in M_T and M. Leveraging the temporal constraints of d-separation described by Proposition 5 and the soundness and completeness of d-separation for abstract ground graphs, we conclude that d-separation is sound and complete (up to a hop threshold) in temporal abstract ground graphs.

7 TEMPORAL RCD

The theory of temporal relational d-separation allows us to derive all conditional independence facts that are consistent with the structure of a temporal relational model, in the same way that d-separation connects the structure of a Bayesian network and the conditional independencies of the underlying distribution. This is precisely the connection that constraint-based algorithms leverage in order to learn the structure of models. Thus, temporal relational d-separation (and temporal abstract ground graphs) enable a constraint-based algorithm, TRCD², that learns the structure of temporal and relational causal models from data.

TRCD extends RCD [Maier et al., 2013a] to operate over a 2-slice temporal relational model. More specifically, it constructs a set of temporal abstract ground graphs, one from the perspective of each entity, and uses the theory of temporal relational d-separation on temporal abstract ground graphs to decide conditional independence facts. TRCD uses the temporal abstract ground graphs to determine which conditional independence facts should be checked in the data and which dependencies (edges) in the temporal relational model are implied by those facts. As in the case of RCD, for practical reasons, the space of potential dependencies is limited by a domain-specific hop threshold.

TRCD operates in two phases. Phase I learns a set of undirected dependencies, and Phase II employs a set of orientation rules to orient those dependencies. Phase II of TRCD uses the same orientation rules as RCD—collider detection, known non-colliders (KNC), cycle avoidance (CA), Meek rule 3 (MR3), and relational bivariate orientation (RBO). Additionally, TRCD orients dependencies that cross time points from the past to the future.

RCD was shown to be sound and complete in the sample limit (i.e., with perfect tests of conditional independence) and for infinite hop threshold, under the standard assumptions of the causal Markov condition, faithfulness, and causal sufficiency for relational domains. By leveraging the soundness and completeness of temporal relational d-separation, TRCD can be shown to be sound and complete (under the same assumptions).

² Code available at kdl.cs.umass.edu/trcd.
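To make the two-phase structure concrete, here is a heavily simplified, hypothetical sketch of a constraint-based Phase I followed by the temporal orientation step. It ignores relational paths and abstract ground graphs entirely, and `ci_test` stands in for the d-separation oracle or a statistical test; none of these names come from the TRCD codebase:

```python
from itertools import combinations

def learn_skeleton(variables, ci_test, max_cond_size=2):
    """PC-style Phase I sketch: start fully connected and remove an
    edge X-Y when some conditioning set Z (drawn from X's remaining
    neighbors) renders X and Y independent."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    for size in range(max_cond_size + 1):
        for x, y in combinations(variables, 2):
            if y not in adj[x]:
                continue
            for z in combinations(sorted(adj[x] - {y}), size):
                if ci_test(x, y, set(z)):
                    adj[x].discard(y)
                    adj[y].discard(x)
                    sepset[frozenset((x, y))] = set(z)
                    break
    return adj, sepset

def orient_temporal(adj, time_of):
    """Orient every cross-time edge from the past to the future,
    as TRCD does before applying the remaining orientation rules."""
    return {(x, y) for x in adj for y in adj[x] if time_of[x] < time_of[y]}

# Toy oracle for the chain X^t -> Y^t -> Y^{t+1}:
# X^t is independent of Y^{t+1} only when conditioning on Y^t.
def oracle(x, y, z):
    return {x, y} == {"X_t", "Y_t1"} and "Y_t" in z

adj, _ = learn_skeleton(["X_t", "Y_t", "Y_t1"], oracle)
time_of = {"X_t": 0, "Y_t": 0, "Y_t1": 1}
print(orient_temporal(adj, time_of))  # {('Y_t', 'Y_t1')}
```

The real algorithm differs in important ways (it tests relational variables on temporal abstract ground graphs and then applies CD, KNC, CA, MR3, and RBO), but the skeleton-then-orient control flow is the same.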
Figure 8: Precision and recall (with standard error of the mean) for TRCD after Phase I (unoriented) and after Phase II (oriented) when using an oracle for answering d-separation queries. Panels show schemas with 1, 2, and 3 entities; the x-axis is the number of dependencies (1-10), and the curves show skeleton precision, skeleton recall, oriented precision, and oriented recall. The y-axis values start at 0.8.
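The precision and recall reported in Figure 8 can be computed from edge sets in a few lines. This helper is our own sketch (not taken from the TRCD code); it treats oriented edges as ordered pairs and the skeleton as unordered pairs:

```python
def precision_recall(true_edges, learned_edges):
    """Precision/recall over edge sets. Pass tuples for oriented
    metrics; pass frozensets to ignore orientation (skeleton)."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    tp = len(true_edges & learned_edges)
    precision = tp / len(learned_edges) if learned_edges else 1.0
    recall = tp / len(true_edges) if true_edges else 1.0
    return precision, recall

true = {("X", "Y"), ("Y", "Z")}
learned = {("X", "Y"), ("Z", "Y")}   # one edge oriented the wrong way

# Oriented metrics: only one of the two learned edges is correct.
print(precision_recall(true, learned))  # (0.5, 0.5)

# Skeleton metrics: orientation ignored, both adjacencies recovered.
skel = lambda edges: {frozenset(e) for e in edges}
print(precision_recall(skel(true), skel(learned)))  # (1.0, 1.0)
```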
8 EXPERIMENTS Table 2: Frequency of the most-used orientation rules dur-
ing Phase II of TRCD for the oracle experiments. For the
8.1 ORACLE EXPERIMENTS rest of the rules, the frequency was less than 1%. Temporal
dependencies are not oriented by the orientation rules.
The goal of this set of experiments is twofold. First, we
evaluate the theoretical performance of TRCD in the large Number Collider KNC RBO Percent of
sample limit. Towards that end, we employ an oracle for of detection temporal
answering d-separation queries in the place of statistical entities dependencies
tests of independence. Second, these experiments provide experimental evidence about the correctness of temporal relational d-separation.

We generated synthetic schemas with the number of entities varying from 1 to 3, the number of relationships fixed to one less than the number of entities, and the number of attributes for each item drawn from Poisson(λ = 1) + 1. The cardinalities were selected uniformly at random. For each number of entities specified, we generated 500 different schemas. This generating process yielded 1,500 different schemas. For a given schema, we generated 10 different models with the number of dependencies ranging from 1 to 10. Specifically, we generated all possible dependencies up to hop threshold h = 3. Then, we chose the desired number of dependencies from that space at random, subject to the following constraints: each relational variable has at most 3 parents, and the generated model contains no cycles. Moreover, every model must contain at least one dependency with a temporal relational path that contains more than one time point. This procedure resulted in a total of 15,000 models.

We ran TRCD with a d-separation oracle for each model. Figure 8 shows average precision and recall after Phase I (unoriented dependencies) and after Phase II (partially oriented dependencies). As expected, given that these experiments use an oracle, the algorithm always learns the correct set of unoriented dependencies (unoriented precision and recall are always 1). The algorithm makes no mistakes in the orientation (oriented precision is always 1); however, it is not possible to orient all dependencies (oriented recall is lower than 1). Note that comparing to an oracle version of RCD is not straightforward. The oracle requires a true model that is fully directed. Converting the true temporal model to a non-temporal one would often result in cycles or undirected edges, and the relational d-separation oracle cannot be used.

Table 2 (orientation rule usage in Phase II, by number of entities):
1 entity: 71% 28% 0% 66%
2 entities: 66% 11% 23% 68%
3 entities: 53% 11% 36% 65%

Table 2 shows how often each orientation rule was used in Phase II. Collider detection, known non-colliders (KNC), and relational bivariate orientation (RBO) orient the majority of the edges. As expected, in the case of propositional models (one entity), the relational bivariate orientation rule, a rule for edge orientation that is unique to relational data, is never used. We observe that the other two rules, cycle avoidance and Meek's rule 3, do not fire often. That can be explained by the fact that those rules would never fire in the presence of a temporal edge. Consider cycle avoidance in the propositional case: if the pattern X → Y → Z together with the undirected edge X − Z is encountered, then cycle avoidance orients X → Z. If either of the dependencies in X → Y → Z crosses time points, then X and Z would be in different time points and the edge between them would be oriented based on temporal precedence. A similar argument can be made for the case of Meek's rule 3.
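The dependency-sampling procedure described above can be sketched in code. This is a minimal illustration under stated assumptions, not the implementation used in the paper: candidate dependencies are assumed to be given as (cause, effect) pairs (the enumeration up to the hop threshold is elided), and both function names are hypothetical.

```python
import random
from collections import defaultdict

def creates_cycle(edges, cause, effect):
    """Check whether adding cause -> effect would close a directed cycle."""
    graph = defaultdict(list)
    for c, e in edges:
        graph[c].append(e)
    # A cycle appears iff cause is already reachable from effect.
    stack, seen = [effect], set()
    while stack:
        node = stack.pop()
        if node == cause:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

def sample_model(candidates, num_deps, max_parents=3, rng=random):
    """Pick num_deps dependencies at random from the candidate space,
    subject to the max-parents and acyclicity constraints."""
    chosen, parents = [], defaultdict(int)
    pool = list(candidates)
    rng.shuffle(pool)
    for cause, effect in pool:
        if len(chosen) == num_deps:
            break
        if parents[effect] >= max_parents:
            continue
        if creates_cycle(chosen, cause, effect):
            continue
        chosen.append((cause, effect))
        parents[effect] += 1
    return chosen
```

Rejecting candidates that violate a constraint, rather than restarting, mirrors the description above: dependencies are drawn from the candidate space at random until the desired number is reached or no admissible candidate remains.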
8.2 EXPERIMENTS ON SYNTHETIC DATA

This experiment showcases the use of TRCD on data, i.e., without the use of an oracle to decide conditional independence. Towards that end, we generated synthetic models, generated data from these models, and applied TRCD to them. The use of synthetic data allows us to have access to the ground truth, so we can measure the accuracy of the learned models. The data-generating process is described in detail below.

Using the same process as described in Section 8.1, we generated 5 synthetic schemas with 2 entities and 5 synthetic schemas with 3 entities. For each schema, we generated 10 models with the number of dependencies ranging from 1 to 10. This resulted in 100 different models. For each model we created 3 different relational skeletons over 300 time points. The number of entity instances at each time point was drawn from Poisson(λ) + 1, where λ ∼ U(5, 10). The degree distribution for the relationship instances was drawn from Poisson(λ) + 1, where λ ∼ U(1, 5). Regarding the parameters of the graphical model, the marginals were parameterized as normal distributions N(µ, σ), where µ ∼ U(0, 5) and σ ∼ U(0, 1). The conditional distribution for a relational variable X was Σ_{Y ∈ parents(X)} avg(Y) + 0.1 · N(0, 1). This resulted in 300 datasets, each over 300 time points.

In order to assess statistical independence, we fitted a standard linear least-squares regression equation to the outcome variable, using the treatment and the variables in the conditioning set as covariates. For relational variables in the conditioning set, we used the average as the aggregation function. Then, we directly used the t-test of the coefficient of the treatment variable to assess independence (p > 0.05 or effect size < 0.01). Figure 9 shows average precision and recall of TRCD after Phase I and Phase II when applied to the synthetic datasets. While precision after Phase I is more than 0.75 in most cases, the recall after Phase I is relatively low. That implies that we concluded independence (and therefore removed an edge) more often than we should have. This corresponds to Type II errors and can be attributed to the lack of good conditional independence tests for relational data.

Finally, to demonstrate the difference in expressiveness between TRCD and RCD (the only constraint-based algorithm for relational data), we ran RCD on a "temporally flattened"³ version of the synthetic data. The true model and the model that TRCD learned are shown in Figure 10. The model learned by RCD is shown in Figure 11. TRCD correctly learns and orients two of the edges, with the correct path specification. Those dependencies cannot even be expressed in the space of dependencies for RCD.

Figure 10: True temporal relational model over entities A and B, relationship AB, and attributes X1–X3, Y1–Y2, XY1–XY3. The dotted edges were not learned by TRCD, while the dashed edge was left unoriented.

Figure 11: Model learned by RCD for data generated from the temporal model of Figure 10. The learned dependencies shown are [B, AB, A, AB].XY3 → [B].Y1 and [B].Y2 → [B].Y1.

9 RELATED WORK

For the propositional case, there are several approaches for learning the structure of temporal probabilistic models. Most of them are not constraint-based methods, but follow the search-and-score paradigm. Friedman et al. [1998] present an algorithm to learn the structure of DBNs from complete data using a search-and-score algorithm, and from incomplete data using structural EM. Lähdesmäki and Shmulevich [2008] use Bayesian methods to learn the structure of a DBN from steady-state measurements or from time-series data and steady-state measurements. Robinson and Hartemink [2010] present an MCMC algorithm to learn the structure of a DBN that changes over time. A different approach that learns a causal temporal model from time-series data is the difference-based causality learner [Voortman et al., 2010]. This framework is based on dynamic Structural Equation Models.

A constraint-based method for temporal propositional domains is presented by Entner and Hoyer [2010], who extend the FCI algorithm [Spirtes et al., 2000] to temporal domains. FCI relaxes the causal sufficiency assumption, i.e., it allows the presence of latent common causes. Our approach is the first that uses a constraint-based algorithm for data that is both temporal and relational (although under the assumption of causal sufficiency).
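The regression-based conditional independence test of Section 8.2 can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes the relational variables in the conditioning set have already been aggregated (e.g., averaged) into numeric columns, and it substitutes a normal approximation for the exact t-distribution, which is reasonable when the sample size is much larger than the number of covariates.

```python
import math
import numpy as np

def independent(treatment, outcome, conditioning,
                p_thresh=0.05, effect_thresh=0.01):
    """Regression-based CI test: regress the outcome on the treatment and
    the (pre-aggregated) conditioning variables, then inspect the treatment
    coefficient. Independence is declared when p > p_thresh or the effect
    size |coefficient| < effect_thresh, mirroring the thresholds above."""
    n = len(outcome)
    # Design matrix: intercept, treatment, conditioning covariates.
    X = np.column_stack([np.ones(n), treatment] + list(conditioning))
    beta, _, _, _ = np.linalg.lstsq(X, outcome, rcond=None)
    resid = outcome - X @ beta
    dof = n - X.shape[1]
    sigma2 = resid @ resid / dof
    # Standard error of the treatment coefficient (column index 1).
    cov = sigma2 * np.linalg.inv(X.T @ X)
    se = math.sqrt(cov[1, 1])
    t_stat = beta[1] / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return p_value > p_thresh or abs(beta[1]) < effect_thresh
```

An exact test would replace the normal approximation with the t-distribution on `dof` degrees of freedom, but for the sample sizes used in the experiments (hundreds of instances) the two are essentially indistinguishable.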
³ RCD ignores temporal information; an instance is uniquely identified by its instance id and time point.

Figure 9: TRCD on synthetic temporal data. Average precision and recall (with standard error of the mean) for schemas with 2 and 3 entities, as a function of the number of dependencies; the curves show skeleton precision, skeleton recall, oriented precision, and oriented recall.

Another widely used method for inferring causal relationships from temporal data is Granger causality [Granger, 1969]. The main idea underlying Granger causality is that a cause should improve the predictive accuracy of its effect, compared to predictions based solely on the effect's past values. There has been work on extending Granger causality to multivariate settings, as well as on combining Granger causality with graphical models [Eichler, 2012]. Liu et al. [2010] propose a regularized Hidden Markov Random Field regression to learn the structure of a temporal causal graph from multivariate time-series data. However, the methods used for learning the structure of the causal model are not constraint-based.

Another line of work on learning temporal and relational models stems from combining first-order logic with probabilistic frameworks. Logical Hidden Markov Models (LHMMs) extend Hidden Markov Models to handle relational (non-flat) data [Kersting et al., 2006]. Kersting and Raiko [2005] provide an EM-based algorithm to learn the structure of LHMMs.

In terms of representation, Manfredotti [2009] introduces relational dynamic Bayesian networks (RDBNs), a first-order logic-based extension of dynamic Bayesian networks. RDBNs are similar to the relational model we define; however, we provide an explicit characterization of the space of relational paths and, subsequently, of the space of relational dependencies. To be more specific, temporal relational paths in our framework are restricted to conjunctions of predicates that correspond to possible traversals of the relational schema. This restriction, together with the domain-specific hop threshold, allows us to enumerate the space of potential dependencies to learn.

10 CONCLUSIONS AND FUTURE WORK

In this paper we presented a formalization of temporal relational models, and we extended the theory of relational d-separation to the temporal domain. We presented a constraint-based algorithm, TRCD, that leverages the notion of temporal relational d-separation to learn the causal structure of temporal relational models from data. We showed that the algorithm is sound and complete, and we provided experimental evidence that showcases the correctness of TRCD. Finally, we showed the improvement that TRCD achieves compared to RCD when applied to domains with a temporal component.

TRCD makes certain simplifying assumptions. Future work could focus on relaxing some of those assumptions, specifically, allowing the structure of the causal model to change over time (change-point detection for relational data). Another avenue for future research is relaxing the causal sufficiency assumption by employing techniques such as blocking [Rattigan et al., 2011] for temporal relational domains. Finally, an important issue that arises when modeling time as a discrete quantity is how to choose the appropriate granularity of time points. Ribeiro et al. [2012] provide an analysis of how aggregating at a given time granularity affects the characterization of the underlying temporal process. Continuous time Bayesian networks [Nodelman et al., 2002, 2003] provide a way around this by allowing each variable to be modeled at a different time granularity.

Acknowledgements

Funding was provided by the U.S. Army Research Office (ARO) and the Defense Advanced Research Projects Agency (DARPA) under Contract Number W911NF-11-C-0088. The content of the information in this document does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
References

Nathan Eagle and Alex (Sandy) Pentland. Reality mining: Sensing complex social systems. Personal and Ubiquitous Computing, 10(4):255–268, March 2006.

Michael Eichler. Graphical modelling of multivariate time series. Probability Theory and Related Fields, 153(1-2):233–268, 2012.

Doris Entner and Patrik Hoyer. On causal discovery from time series data using FCI. In Proceedings of the 5th European Workshop on Probabilistic Graphical Models (PGM-2010), pages 121–128, 2010.

Nir Friedman, Kevin Murphy, and Stuart Russell. Learning the structure of dynamic probabilistic networks. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 139–147, 1998.

Dan Geiger and Judea Pearl. On the logic of causal models. In Proceedings of the Fourth Annual Conference on Uncertainty in Artificial Intelligence, pages 136–147, 1988.

Clive W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, July 1969.

Kristian Kersting and Tapani Raiko. "Say EM" for selecting probabilistic models for logical sequences. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 300–307, 2005.

Kristian Kersting, Luc De Raedt, and Tapani Raiko. Logical hidden Markov models. Journal of Artificial Intelligence Research, 25:425–456, 2006.

Harri Lähdesmäki and Ilya Shmulevich. Learning the structure of dynamic Bayesian networks from time series and steady state measurements. Machine Learning, 71(2-3):185–217, June 2008.

Yan Liu, Alexandru Niculescu-Mizil, Aurelie C. Lozano, and Yong Lu. Learning temporal causal graphs for relational time-series analysis. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 687–694, 2010.

Marc Maier, Katerina Marazopoulou, David Arbour, and David Jensen. A sound and complete algorithm for learning causal models from relational data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 371–380, 2013.

Marc Maier, Katerina Marazopoulou, and David Jensen. Reasoning about independence in probabilistic models of relational data. arXiv:1302.4381, 2013.

Cristina E. Manfredotti. Modeling and Inference with Relational Dynamic Bayesian Networks. PhD thesis, University of Milano, 2009.

Kevin P. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.

Uri Nodelman, Christian R. Shelton, and Daphne Koller. Continuous time Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 378–387, 2002.

Uri Nodelman, Christian R. Shelton, and Daphne Koller. Learning continuous time Bayesian networks. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pages 451–458, 2003.

Judea Pearl and Thomas Verma. A theory of inferred causation. In J. Allen, R. Fikes, and E. Sandewall, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, pages 441–452. Morgan Kaufmann, 1991.

Matthew J. H. Rattigan, Marc Maier, and David Jensen. Relational blocking for causal discovery. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 145–151, 2011.

Bruno F. Ribeiro, Nicola Perra, and Andrea Baronchelli. Quantifying the effect of temporal resolution in time-varying networks. CoRR, abs/1211.7052, 2012.

Joshua W. Robinson and Alexander J. Hartemink. Learning non-stationary dynamic Bayesian networks. Journal of Machine Learning Research, 11:3647–3680, December 2010.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.

Thomas Verma and Judea Pearl. Causal networks: Semantics and expressiveness. In Proceedings of the Fourth Annual Conference on Uncertainty in Artificial Intelligence, pages 352–359, 1988.

Mark Voortman, Denver Dash, and Marek J. Druzdzel. Learning why things change: The difference-based causality learner. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 641–650, 2010.