Dynamical Kinds and their Discovery


                                                    Benjamin C. Jantzen
                                                      Philosophy Dept.
                                                        Virginia Tech
                                                    Blacksburg, VA 24061


                        Abstract                                population, the learning problem is, roughly, to ascertain the
                                                                set of direct, possibly stochastic causal relations amongst
     We demonstrate the possibility of classifying              variables and, often, a model of the explicit functional form
     causal systems into kinds that share a common              of those causal dependencies. Similarly, there is vast lit-
     structure without first constructing an explicit dy-       erature on system identification that focuses on producing
     namical model or using prior knowledge of the              a model of a given complex dynamical system that is suit-
     system dynamics. The algorithmic ability to de-            able for prediction and control (Norton, 1986; Ljung, 2010).
     termine whether arbitrary systems are governed             While these lines of research differ in terminology, methods,
     by causal relations of the same form offers sig-           and emphasis – system identification tends to focus on deter-
     nificant practical applications in the development         ministic systems, while the GCM formalism is designed for
     and validation of dynamical models. It is also             representing stochastic causal relations – they both provide
     of theoretical interest as an essential stage in the       methods for discovering causal structure. They also share
     scientific inference of laws from empirical data.          an emphasis on the individual system – particular systems
     The algorithm presented is based on the dynam-             of causal relations are grouped together into categories only
     ical symmetry approach to dynamical kinds. A               after the fact, and generally without a consistent aim or
     dynamical symmetry with respect to time is an              motivation.
     intervention on one or more variables of a system          This exclusive focus on individual systems, however, leaves
     that commutes with the time evolution of the sys-          out a significant element of scientific inference. Much sci-
     tem. A dynamical kind is a class of systems shar-          entific work, especially in the early exploration of a new
     ing a set of dynamical symmetries. The algorithm           phenomena, involves grouping distinct physical systems into
     presented classifies deterministic, time-dependent         categories. These categories are then targets for law-like
     causal systems by directly comparing their ex-             generalization. In other words, a primary goal of scientific
     hibited symmetries. Using simulated, noisy data            research is the identification of laws of kinds of system, not
     from a variety of nonlinear systems, we show               just models of one-off systems. So, for example, we seek
     that this algorithm correctly sorts systems into           laws of chemical kinetics, not just models of the particu-
     dynamical kinds. It is robust under significant            lar reaction in some specific beaker on a lab bench. We
     sampling error, is immune to violations of nor-            seek laws of motion, not just a model of how this particular
     mality in sampling error, and fails gracefully with        pendulum moves given a variable arm length. This sort
     increasing dynamical similarity. The algorithm             of inference to kinds is not reflected in the causal learning
     we demonstrate is the first to address this aspect         or system identification algorithms that have been devel-
     of automated scientific discovery.                         oped – both approaches to learning causal structure allow
                                                                for classification only after model creation.
                                                                But what if such classification could be done prior to learn-
1   The problem                                                 ing detailed models of the causal structure of individual
                                                                systems? In that case, a number of inferential tasks would
Recent decades have seen remarkable progress in addressing
                                                                become substantially easier. For example, if one knew in
the problem of learning causal structure from observational
                                                                advance that multiple systems are governed by a common
or experimental data. One especially rich approach deploys
                                                                type of causal structure – whether representable by, say, a
the framework of graphical causal models (GCMs) (Pearl,
                                                                parametric class of differential equations or an equivalence
2000; Spirtes et al., 2000; Spirtes and Zhang, 2016). Given
                                                                class of GCMs – then one could pool data from multiple
a particular set of variables describing a specific system or
systems in order to determine the form of the differential       system or evolves the system and then applies σk . Note that
equation or an equivalence class of GCMs governing sys-          this is equivalent to the condition that dσk (xA (t))/dt =
tems of that sort. Or consider the problem of model valida-      rA σk (xA (t)). Even if these transformations were unknown
tion. For sufficiently complex computer models, validation       to us, there is a simple way to ascertain whether the σk for
can be exceedingly difficult because the dynamical relations     species A are different from the corresponding transforma-
amongst high-level variables in the simulation are unknown       tions, τ , that commute with the time evolution of species
(Mattingly and Warwick, 2009). But if it were possible to        B. To do so, we can read these transformations – at least
ascertain directly – without an explicit model of the high-      over a finite interval – directly off of our experimental data.
level dynamics in either the simulation or the target system     Suppose we make simultaneous measurements of each pop-
– that the dynamics are of the same type, this would provide     ulation at m times, t0 , t1 , . . . , tm . Suppose further that, for
a tractable route to validation. Finally, one could assess       each species, two experiments are conducted. In the first,
whether or not a system with unknown dynamics exhibits           an intervention sets the initial populations to xA (t0 ) and
transitions between different dynamical regimes, such as the     xB (t0 ) for species A and B, respectively, and in the sec-
transition from laminar to turbulent flow in fluid dynamics.     ond we set the initial populations to x̃A (t0 ) and x̃B (t0 ).
In such transitions, one and the same system exhibits a dif-     For each population, the resulting pair of trajectories must
ferent dynamical form or causal structure at different times     be connected by one of the special time-commuting trans-
or in different circumstances.                                   formations characteristic of that system. Empirical curves
                                                                 for σ and τ can thus be constructed by plotting x̃A (ti ) as
For these reasons, a method for assessing the sameness
                                                                 a function of xA (ti ) and x̃B (ti ) as a function of xB (ti ).
or difference of dynamical structure in the absence of any
                                                                 To decide whether the dynamics of the two systems have
explicit dynamical model would be both practically and
                                                                 the same form it is only necessary to determine whether
theoretically valuable. For such a direct classification to be
                                                                 σ(xA ) ≡ x̃A (xA ) = x̃B (xB ) ≡ τ (xB ). If species A really
possible, two things are required. First, a clear and rigorous
                                                                 grows exponentially, then x̃A (xA ) = kxA , and detecting a
definition of dynamical kind, and second, an empirical test
                                                                 difference in dynamics depends only on detecting deviation
for sameness of kind. We suggest that the first component –
                                                                 from a straight line in the sample of x̃B (xB ).
a rigorous definition – has been provided and well-motivated
by the theoretical work of Jantzen (2014). To meet the sec-
ond need, we present here both a test derived from this          2     Theoretical background and related work
notion of dynamical kind and an explicit algorithm for au-
tomating that test that is accurate and robust in the presence   2.1   Theory of dynamical kinds
of significant measurement error in empirical data.
                                                                 The preceding example appealed to a feature of the dynam-
                                                                 ics of all exponential growth systems. That is, dynamical
1.1   Directly discerning dynamical difference: an               systems were grouped into categories on the basis of high-
      example                                                    level features of their dynamics – in this case, the form of
                                                                 the governing differential equation – rather than features of
To get a sense for how such a test might work, consider two      specific trajectories or solutions of that equation. This is
different species of bacteria, A and B, in dilute solutions of   the essence of a dynamical kind – a class of causal systems
nutrient-rich media. One might be interested in knowing          that share some salient higher-order features though they
whether the growth of both species is governed by the same       may differ dramatically in the specific relations amongst
dynamics under these conditions, as suggested by the “First      variables.
Law of Ecology” (Turchin, 2001) . Specifically, one might
                                                                 Of course, there are indefinitely many higher-order features
be interested in whether both systems are well-represented
                                                                 of causal structure that could be used for classification in
by differential equations of the same form, whatever that
                                                                 this way. We have chosen to focus on the variety of higher-
form may be. The question can be approached directly by
                                                                 order properties that were used to sort dynamical kinds in
considering the sorts of properties shared by two systems
                                                                 the bacterial growth example. Specifically, the notion of
governed by equations of the same form. One such property
                                                                 dynamical kind we propose to exploit for purposes of auto-
is the behavior of the system under intervention. More
                                                                 mated discovery is that proposed by Jantzen (2014). The
specifically, for any given dynamics there is a special class
                                                                 motivation is two-fold. First, Jantzen’s proposal picks out
of transformations of the system state that commute with
                                                                 classes of causal system that, if identified in the absence
the evolution of that system state through time.
                                                                 of detailed individual models, would provide the sort of
To be concrete, suppose species A grows exponentially,           practical benefits for data-pooling, validation, and regime-
such that dxA (t)/dt = rA xA (t). In that case, all and only     detection described above. Second, as Jantzen argues in
scaling transformations of the form σk (xA (t)) = kxA (t)        detail, the classes carved out in this way coincide well with
commute with the time-evolution of the system so that it         existing scientific practice. Thus, automating the discov-
doesn’t matter whether one applies σk and then evolves the       ery of dynamical kinds in Jantzen’s sense would go a long
way toward filling a gap in automated scientific inference,        way, it’s clear that most symmetries of interest to physicists
namely the absence of algorithmic methods for determining          are dynamical symmetries with respect to time. Assuming
classes of system appropriate for framing laws.                    that the dynamics is deterministic and well-described by a
                                                                   system of ordinary differential equations in time, we can
In Jantzen’s view, what makes the dynamics of two distinct
                                                                   also restate the condition in terms of flows (Logemann and
systems members of the same kind is a higher-order prop-
                                                                   Ryan, 2014). These are mappings φ : X × R → X where
erty that they share: the structured set of transformations of
                                                                   X is the space of sates of the system at times represented
each variable that commute with incrementation of another
                                                                   by the reals and under the conditions that φ(x, 0) = 0 and
variable. These transformations are called ‘dynamical sym-
                                                                   φ(φ(x, t), s) = φ(x, t + s). Dynamical symmetries in the
metries’ and are defined as follows (Jantzen, 2014, p3633):
                                                                   framework correspond to functions σ such that, for all x
Definition 1 (Dynamical symmetry). Let V be a set of               and t:
variables. Let σ be an intervention on the variables in
Int ⊂ V . The transformation σ is a dynamical symmetry
with respect to some index variable X ∈ V −Int if and only                           φ(σ(x), t) = σ(φ(x, t))                  (1)
if σ has the following property: for all xi and xf , the final
state of the system is the same whether σ is applied when          Whatever formalism is appropriate for expressing dynamical
X = xi and then an intervention on X makes it such that            symmetries in a particular application, it is not necessarily
X = xf , or the intervention on X is applied first, changing       true that all functions satisfying a given formal condition
its value from xi to xf , and then σ is applied.                   are proper dynamical symmetries. The latter are defined
                                                                   in terms of interventions on variables – interactions with
Notice that this definition is substantive rather than mathe-      a system that change its measured quantities in a way that
matical in that it appeals to physical interactions of a certain   is at least physically possible and which does not destroy
sort, namely interventions in the GCM sense. As such, this         the causal connections amongst them. Some formal trans-
definition of dynamical symmetry is not tied to any partic-        formations may represent unrealizable interventions such
ular formalism or application. The notion of commuting             as, perhaps, time reversal. For deterministic dynamics rep-
interventions is certainly at home in the GCM framework,           resented by a system of ODEs in time, every dynamical
but it is also easily translated into the language of differen-    symmetry can be represented by a mapping satisfying the
tial equations as the commutation of system evolution and          condition expressed by Equation 1. But if we take Jantzen’s
the manipulation of initial or boundary conditions. It is the      definition seriously, then it may not be the case that every
latter framework we focus on in the remainder of this paper.       such σ represents a dynamical symmetry. However, since
In particular, we will be concerned with classes of system         it is impossible to give a general account of what these ex-
that are well-described by differential equations. In fact,        ceptions are, we will make the idealizing assumption that,
we’ll further restrict attention to systems with explicit time     for the varieties of system under analysis here, there is a
dependence. In this special case of temporal dynamics, time        one-to-one correspondence between dynamical symmetries
is a uniquely natural “index variable” – in a sense, one can’t     and functions satisfying Equation 1.
help but “intervene” to increment time in any given experi-
ment. Thus, we are primarily interested in interventions on        In Jantzen’s sense, a bare set of dynamical symmetries is
subsets of variables which are dynamical symmetries with           insufficient to characterize a dynamical kind. Further dis-
respect to time (Jantzen, 2014, p3632):                            tinctions can be made if we note that for any given system
                                                                   of causally connected variables, the corresponding set of
Definition 2 (Dynamical symmetry with respect to time).
                                                                   dynamical symmetries exhibits structure under composition.
Let t be the variable representing time, and let V be a set
                                                                   In other words, one dynamical symmetry applied after an-
of additional dynamical variables such that t ∈      / V . Let
                                                                   other is also a dynamical symmetry, and the equivalence
σ be an intervention on the variables in Int ⊂ V . The
                                                                   relations amongst such compositions constitute a symmetry
transformation σ is a dynamical symmetry with respect to
                                                                   structure (Jantzen, 2014, p3634-3635):
time if and only if for all intervals ∆t, the final state of the
system is the same whether σ is applied at some time t0 and        Definition 3 (Nontrivial symmetry structure). The sym-
the system evolved until t0 + ∆t, or the system first allowed      metry structure of a collection of dynamical symmetries,
to evolve from t0 to t0 + ∆t and then σ is applied.                Σ = {σi |i = 1, 2, . . . } is given by the composition function
                                                                   ◦ : Σ × Σ → Σ. A nontrivial symmetry structure is one that
To put this in terms familiar from mathematical physics            contains the identity (i.e., “do nothing”) transformation and,
(Rosen, 1995, 2008), let Λt,s (x) be the propagator for a          for at least one variable Y relative to some index variable
given system, i.e., the function Λt,s : X → X for which            X, is not isomorphic to the group of mappings from Y to
Λt,s (x) is the state of the system at time t given that x ∈ X     itself (i.e., the set of all transformations of Y ).
is the state at time s. A function σ : X → X represents a
dynamical symmetry with respect to time if and only if for         Jantzen identifies these nontrivial symmetry structures with
every s, t > s, and x, Λt,s (σ(x)) = σ(Λt,s (x)). Put this         the dynamical kinds of interest (2014, p3635):
Definition 4 (Dynamical kind). A dynamical kind is a class
of systems of variables that share a set of dynamical symme-                         t         v0    v1           t       v0      v1
                                                                                                                                  
tries that are related by a non-trivial symmetry structure.                   t0              a00   a01     t0           b00     b01
                                                                             t1              a10   a11 
                                                                                                          t1           b10     b11 
                                                                                                          
                                                                                                                                    
                                                                             t2              a20   a21 
                                                                                                          t2           b20     b21 
                                                                                                          
2.2   Related work
                                                                               ..              ..    ..      ..           ..      ..
                                                                                                                                    
                                                                                .               .     .       .            .       .
As suggested above, Jantzen’s theory of dynamical kinds is                                              w
                                                                                                        
closely connected to the axiomatic treatment of causation
that underwrites the increasingly diverse and powerful set              
                                                                                v0       v1         ṽ 0
                                                                                                             
                                                                                                                  v0     v1            ṽ
                                                                                                                                       1
                                                                                                                                         
of causal discovery algorithms. In particular, Jantzen shows       *   a00               a01        b00         a00     a01         b01 +
that a system of variables possesses a nontrivial symmetry            a10               a11        b10       a10     a11         b11 
                                                                                                          ,
                                                                                                                                      
structure S (one that is not isomorphic to the set of automor-        a20               a21        b20       a20     a21         b21 
                                                                        ..                ..         ..          ..      ..          ..
                                                                                                                                      
phisms on the states of S) if and only if it has a non-trivial
causal structure (one for which at least one variable is a di-           .                 .          .           .       .           .
rect cause of another)(2014, p3635-3636). This is the sense
in which dynamical symmetries are high-level properties          Figure 1: Schematic showing how samples from two time-
of the causal structure of a system. There is also precedent     series are restructured to obtain an implicit model of the
for Jantzen’s appeal to symmetries in the machine learning       dynamical symmetry that maps one trajectory into the other.
literature. Schmidt and Lipson (2009) developed an auto-
mated discovery algorithm that learns free-form laws by
seeking invariants of the motion – functions of the dynam-
ical variables whose values remain constant through time.         result, for each system, is a sample of p vector functions,
As Noether (Noether and Tavel, 1971) demonstrated, the            each with n components. In the second phase, these sam-
presence of symmetries is intimately connected with the           ples are transformed in order to serve as samples of the
presence of such invariants. But to the best of our knowl-        symmetry transformations to be compared. These dynam-
edge, there have been no prior attempts to use empirical          ical symmetries correspond to the p − 1 vector functions,
                                                                  ˜j (x~1 (t)) which map the initial curve ~x1 (t) into each of the
                                                                 ~x
measurements of dynamical symmetries to sort systems in
the absence of detailed causal models.                            other sampled curves (so j ∈ 2, . . . , p). Any given symme-
                                                                  try function can thus be decomposed into n functions of n
                                                                 variables: hx̃j,1 (x1,1 , . . . , x1,n ), . . . , x̃j,n (xj,1 , . . . , xj,n )i .
2.3   Learning dynamical kinds                                   To estimate these components for each symmetry, the sam-
                                                                  pled data is rearranged as follows. Using time to match up
Though it is not sufficient, sharing a set of dynamical sym-
                                                                  corresponding samples, the values of hx1,1 (t), . . . , x1,n (t)i
metries is a necessary condition for two systems to belong to
the same dynamical kind. Given that the collection of possi-      are paired separately with each component of x~˜i (t). This
ble dynamical symmetries is uncountably large as is the set       is shown schematically for n = p = 2 in Figure 1. The
of interventions one could perform on the system, verifying       result is a set of n matrices for each of the p − 1 dynamical
that any two systems have all of their dynamical symmetries       symmetries implicitly measured in the experiments.
in common is not possible. However, it is possible to verify     In the final phase, the sets of transformed data representing
with confidence that two systems do not share all of their       specific dynamical symmetries are compared across sys-
symmetries, and thus differ with respect to dynamical kind       tems. The output is a judgment that one or more dynamical
(and ultimately causal structure). The algorithm described       symmetries differ between System 1 and System 2 (the pos-
in the next section uses empirical data from systems that        itive case), or that the set of symmetries is indistinguishable
are presumed to be governed by a deterministic temporal          (the negative case). To make this judgment, 10-fold cross-
dynamics in order to assess whether or not the systems be-       validation is used to estimate the error of two competing
ing compared differ with respect to one or more dynamical        models, sep and joint. For the sep model, data from System
symmetry with respect to time.                                   1 and System 2 are treated separately. Each of the n com-
                                                                 ponents of the p − 1 symmetry transformations for System
3     The algorithm                                              1 are fit with polynomial surfaces independently of the n
                                                                 components of the p − 1 symmetries of System 2. For the
                                                                 joint model, the data from System 1 and System 2 for each
Comparing dynamical systems proceeds in three phases:
                                                                 of the p − 1 symmetries is combined and treated as a single
(i) sampling, (ii) transformation, and (iii) comparison. In
                                                                 sample from the same function.
phase (i), the variables v1 , . . . , vn are measured at times
t1 , . . . , tm for both System 1 and System 2, starting from    For both models, polynomial surfaces are used to provide a
p ≥ 2 distinct initial conditions (the set of initial condi-     global fit of each component function, ~x˜j,i (~x1 (t)) . These
tions is presumed to be identical for the two systems). The      surfaces are fit using singular-value decomposition to find
the least-squares parameter values for a polynomial in n
variables and order d (Golub and Reinsch, 1970). The or-
der of the surface is determined by 10-fold cross-validation.
Specifically, the error for a fit of order 1 is estimated via
10-fold cross-validation. Higher orders continue to be inves-
tigated until the relative reduction in estimated error fails to
exceed a tunable threshold, , at which point the previous
order considered is used to fit the entire dataset. For the
results reported here,  was fixed at 10−4 , though as Figure      Algorithm 1 Dynamical Type Comparison
4.5 (e)-(f) indicates, the performance of the algorithm is         Input: Aσ,v – for each transformation, σ and target vari-
largely insensitive to the value of epsilon over ten orders of         able, vi , measured for System 1, contains an m×(n+1)
magnitude.                                                             matrix with m samples (rows) of v1 , . . . , vn and the
                                                                       value of vi to which σ maps the system state
The estimated error of sep and joint cannot be directly com-           Bσ,v – for each transformation, σ and target variable, v,
pared. If in fact the two systems are of the same dynamical            measured for System 2, contains an m × (n + 1) matrix
type and thus have all of their symmetries in common, then             with m samples (rows) of v1 , . . . , vn and the value of v
the two models are expected to have the same error. How-               to which σ maps the system state
ever, due to random noise in the data and random variation         Output: a boolean value indicating whether the data belong
in how the data is partitioned, cross-validation will provide          to systems of different dynamical types
slightly different estimates. To meaningfully compare the           1: for each σ and v, randomize the rows of Aσ,v , Bσ,v
competing models, it must be determined how much of the             2: P artition1σ,v ← divide the rows of Aσ,v into 10 seg-
difference in error is attributable to noise alone. In order           ments
to do so, the data from System 1 is divided in two, and             3: P artition2σ,v ← divide the rows of Bσ,v into 10 seg-
the same 10-fold cross-validation routine is used to esti-             ments
mate the errors of joint and combined models. In this case,         4: for i = 1 to 10 do
since all of the data comes from the same system, any dif-          5:    T rain1σ,v ← S{P artition1σ,v [j]|j 6= i}
                                                                                        S
ference in estimated error is due to noise in the data. Call        6:    T rain2σ,v ← {P artition2σ,v [j]|j 6= i}
the absolute value of this difference, δ1 . The procedure           7:    T rainJointσ,v ← T rain1σ,v ∪ T rain2σ,v
is repeated with the data from System 2 to find δ2 . Then           8:    M odelσ,v1
                                                                                       ← FitPolynomial(T rain1σ,v )
δ = max(δ1 , δ2 ) is used as a threshold to compare sep and         9:    M odelσ,v ← FitPolynomial(T rain2σ,v )
                                                                                   2
joint. Specifically, if the mean squared error of the joint        10:    M odelσ,v1,2
                                                                                       ← FitPolynomial(T rain1σ,v ∪T rain2σ,v )
model as estimated by cross-validation exceeds the mean
square error estimate for the sep model by more than δ, i.e.       11:    SE 1 ←
                                                                                      S
                                                                                         {for all σ, v, the squared errors of
M SEjoint > M SEsep + δ, then the hypotheses of shared                            1
                                                                          M odelσ,v                            1
symmetries is rejected and the systems are judged to be of                    2
                                                                                      S respect to P artitionσ,v [i]}
                                                                                     with
                                                                   12:    SE ←           {for all σ, v, the squared errors of
different kinds. The overall algorithm for the comparison                         2
                                                                          M odelσ,v                            2
phase is shown in Algorithm 1.                                                1,2
                                                                                     withSrespect to P artitionσ,v [i]}
                                                                   13:    SE        ←       {for all σ, v, the squared errors
                                                                                     1,2
                                                                          of M odelσ,v     with respect to P artition1σ,v [i] ∪
4     Assessing performance                                               P artition2σ,v [i]}
                                                                   14: end for
4.1   General considerations                                       15: M SEsep ← mean(SE 1 ∪ SE 2 )
                                                                   16: M SEjoint ← mean(SE 1,2 )
The probability that the algorithm correctly detects a differ-     17: if M SEjoint > M SEsep + threshold(Aσ,v , Bσ,v )
ence in measured symmetries depends upon the details of                then
the dynamics underlying the two systems, the number of ini-        18:    return TRUE
tial conditions sampled (i.e., the number of transformations       19: else
applied), the time over which the systems were observed,           20:    return FALSE
and the amount of measurement noise present. All of these          21: end if
factors interact with one another, and there is no tractable
way to represent, let alone compute, the performance of
the algorithm in all conceivable circumstances. Instead, we
undertook experiments with simulated data for a couple of
nonlinear systems with two aims in mind: (i) to demonstrate
that the algorithm does in fact correctly identify whether
systems belong to the same dynamical kinds, and (ii) to
show how the algorithm behaves as sampling error and the
       120                     (a)                            120                    (d)                                   180                 (a)                 180                        (d)
       100                                                    100                                                          160                                     160
        80                                                     80                                                          140                                     140
                                                                                                                           120                                     120


                                                                                                                   population


                                                                                                                                                                     population
population


                                                       population
        60                                                     60                                                          100                                     100
        40                                                     40                                                           80                                      80
        20              r = 1. 5, K = 100, α = 1,              20           r = 1. 5, K = 100, α = 1,                       60                                      60
                         β = 1, γ = 1                                        β = 1, γ = 1                                   40                                      40
         0                                                      0                                                           20                                      20
        20                                                     20                                                            0                                       0
             0      2      4         6      8          10           0   2        4         6       8          10             0.0   0.2   0.4         0.6   0.8 1.0   0.0             0.2   0.4         0.6   0.8 1.0
                                                time                                                   time                                                   time                                              time

       120                     (b)                            120                    (e)                                   160             (b)                     180                           (e)
       100                                                    100                                                          140                                     160
                                                                                                                           120                                     140
        80                                                     80                                                                                                  120


                                                                                                                   population


                                                                                                                                                                     population
                                                                                                                           100
population


        60                                             population
                                                               60                                                                                                  100
                                                                                                                            80                                      80
        40                                                     40                                                           60
                        r = 1. 5, K = 100, α = 1,                           r = 5, K = 100, α = 1,                                                                  60
        20                                                     20                                                           40                                      40
         0               β = 0. 6, γ = 1                        0            β = 1, γ = 1                                   20                                      20
        20                                                     20                                                            0                                       0
             0      2      4         6      8          10           0   2        4         6       8          10             0.0   0.2   0.4         0.6   0.8 1.0   0.0             0.2   0.4         0.6   0.8 1.0
                                                time                                                   time                                                   time                                              time

                               (c)                                                   (f)                                                       (c)                                               (f)
       100                                                    100                                                          100                                             100
        80                                                     80                                                           80                                              80


                                                                                                                     Error (%)


                                                                                                                                                                     Error (%)
                                                                                                                            60                                              60
 Error (%)


                                                        Error (%)


        60                                                     60
        40                                                     40                                                           40                                              40
        20                                                     20                                                           20                                              20
         0                                                      0                                                            0                                               0
                 0.0 0.2 0.4 0.6 0.8 1.0                            0   1    2         3       4                             0.0 0.5 1.0 1.5 2.0 2.5 3.0                          1.0 1.5 2.0 2.5 3.0 3.5 4.0
                                                 β                                                      r5                                                  r2 /r1                                             r1

Figure 2: (a) A representative sample of data from the sim-                                                        Figure 3: (a) Representative sample of data from the sim-
ulated logistic growth system with the indicated parameter                                                         ulated two-species competitive Lotka-Volterra model with
values and Gaussian noise with a standard deviation of 3.                                                          noise drawn from a Gaussian distribution with standard de-
The two curves correspond to initial populations of 1 (red                                                         viation of 5. The trajectories marked with pluses correspond
pluses) and 25 (blue dots). (b) Data sampled from a growth                                                         to both competing species with initial populations of 1, and
system belonging to a different dynamical kind. (c) The                                                            the trajectories marked with dots indicate the growth of both
error rate of the algorithm as the value of β is varied in one                                                     species starting from initial populations of 25 each. (b) Data
of the models compared, while fixed at β = 1 for the other.                                                        sampled from a system belonging to a different kind, (c)
(d) and (e) Representative samples of growth systems of the                                                        The error rate of the model as the ratio of growth rates r2 /r1
same dynamical kind. (f) The error rate of the algorithm                                                           is varied in one of the systems compared, while the other
as a function of the value of r in the underlying general-                                                         system is held fixed at r1 = r2 = 2. (d), (e) Representative
ized logistic growth model, with one system held fixed at                                                          samples from a pair of systems of the same dynamical kind
r = 1.5.                                                                                                           but with distinct growth rates. (f) The error rate of the algo-
                                                                                                                   rithm as the value of the growth rates r1 and r2 are varied
                                                                                                                   in a fixed ratio r2 /r1 = 2 for one of the systems compared.
                                                                                                                   The horizontal axis shows the value of r1 . The other system
degree of similarity of the underlying dynamics are varied                                                         is held fixed at r1 = 2, r2 = 4.
in isolation.
                                                                                                                   4.2.1           Biological growth

                                                                                                                   The first class of systems we considered includes a subset
4.2              Synthetic data                                                                                    of a theoretically important family of nonlinear models of
                                                                                                                   biological growth, the so-called “logistic growth models”
                                                                                                                   (Tsoularis and Wallace, 2002). Models in this family are
We tested the algorithm against two classes of simulated
                                                                                                                   described by a differential equation of the form:
system for which we could be certain of the symmetries
of the underlying dynamics and thus of whether or not any                                                                                                   γ
                                                                                                                                     ẋ = rxα 1 − (x/K)β                  (2)
two systems are of the same dynamical type. The use of
simulations also offered control of sampling noise, allowing                                                       The variable x refers to the size of a population, while the
us to test the robustness of the algorithm against a range of                                                      parameters r and K are generally taken to correspond to
noise distributions. We focused on systems of importance                                                           an intrinsic growth rate and carrying capacity, respectively.
to theoretical ecology, a field in which system identification                                                     For each value of α, β and γ, Equation 2 yields a distinct
is notoriously difficult.                                                                                          class of models of growth dynamics. There is significant
controversy over which, if any, of these models describes             governed by different members of the family of logistic
actual biological populations, in part because it is so difficult     growth models. Specifically, we examined the accuracy of
to distinguish solutions of the different models from one             the algorithm for systems with α = γ = 1 and different
another with standard statistical approaches. We focused              values of β. For each of 10 values of β 6= 1, we ran 100
on the set of models for which α = β = γ = 1. In                      trials comparing a system with this value against one in
this case, Equation 2 reduces to the well-known Verhulst              which β = 1. In all cases, the systems belonged to different
logistic equation, ẋ = rx (1 − (x/K)). This equation has             dynamical kinds. Of course, the point of the algorithm is not
closed-form solutions which we used to simulate growing               merely to recognize when dynamical systems are different,
populations. Importantly, it is straightforward to solve for          but to group systems that, despite appearances, belong to the
the set of dynamical symmetries of population with respect            same dynamical kind. So in a second set of experiments, we
to time. For a transformation, σ to be a dynamical symmetry           examined comparisons between two systems governed by
of the Verhulst equation according to Definition 2, it must           the Verhulst equation (α = β = γ = 1) that differed to vary-
satisfy:                                                              ing degrees in their respect growth rates, r. One system was
                                                                      fixed at r = 1.5. We ran 100 trials with the second system
            σ̇(x(t)) = rσ(x(t)) (1 − σ(x(t))/K)                (3)    at each of 10 different values of r. According to Equation
                                                                      4, all of these systems belong to the same dynamical kind.
This results in a one-parameter family of symmetries:                 In all of these experiments, we imposed measurement noise
                                                                      that was normally distributed with a standard deviation of 5.
           σp (x) = Kx/ (1 − e−p )x + e−p K
                                              
                                                      (4)
                                                                      The results of these experiments are displayed in Figure 2.
Note that for different values of K, a different family of            We conducted an analogous battery of tests with simulated
symmetries and thus a distinct dynamical kind is obtained.            Lotka-Volterra systems. These results are shown in Figure
However, two systems with equal K values are in the same              3.
kind no matter how much r differs between them.                       The principal lesson to be drawn is that the algorithm works
                                                                      – almost every time it declares two systems to belong to
4.2.2    Species competition                                          different kinds, they in fact do. Of course, as the error
The second class of systems we simulated can be viewed                rate curves demonstrate, the more similar the trajectories
as a generalized logistic growth model that aims to cap-              produced by the dynamics underlying two systems are –
ture the interaction of two or more species competing for             such as when β = 1 in one system and β ≈ 1 in the other
resources. Specifically, we looked at the two-species com-            – the more likely the algorithm is to miss the difference.
petitive Lotka-Volterra model (Pastor, 2011):                         On the other hand, it’s very seldom fooled into declaring
                                                                      distinct kinds when two systems – no matter how divergent
             ẋ1 = r1 x1 (1 − (x1 + α12 x2 )/K1 )                     their apparent behavior – really do belong to the same kind.
                                                               (5)
             ẋ2 = r2 x2 (1 − (x2 + α21 x1 )/K2 )
                                                                      4.4   Noise and normality
In this case, it’s much more complicated to find closed form
expressions for symmetries, in part because the symme-                We characterized the robustness of the comparison algo-
tries are vector functions, σ (x) = hσ1 (x1 , x2 ), σ2 (x1 , x2 )i.   rithm under increasing noise in the single-variable case by
However, it is straightforward to show that the symmetry              testing it against pairs of samples from two kinds of Verhulst
condition implies that the symmetries of the system are func-         logarithmic growth system: System 1 for which K = 100
tions only of the ratio of the growth rates, r2 /r1 . We thus         and System 2 for which K = 90 (these are in different dy-
tested whether the algorithm correctly classifies systems             namical kinds since the family of symmetries is a function
with the same growth rate ratio in the same dynamical kind            of K). For each of the four possible pairings of System 1
despite widely different values of r1 and r2 . Conversely,            with System 2, we conducted 25 experiments (for a total
we examined the efficiency of the algorithm at discriminat-           of 50 experiments in which the systems belonged to the
ing systems with distinct growth rate ratios despite similar          same dynamical kind, and 50 in which they belonged in
values for all parameters.                                            different kinds). In each experiment, Gaussian noise was
                                                                      added to measurements of the state of each simulated system.
4.3     Classification                                                This was repeated for 10 different values of the standard
                                                                      deviation, σ, of the distribution of added noise. A positive
To verify that the algorithm does in fact correctly deter-            example is one in which the two systems actually differ, a
mine whether systems belong to the same dynamical kind                negative is when systems of the same type are paired to-
and to explore its limitations in doing so, we performed              gether. Only 8 out of the 100 trials returned a false positive;
experiments with both logistic growth and Lotka-Volterra              nearly all errors (by design) were false negatives. The ac-
simulations. For growth models, we compared systems gov-              curacy of the classifier is plotted in Figure 4 as a function
erned by the Verhulst equation (α = β = γ = 1) to systems             of the standard deviation of the noise. The algorithm is
> 80% accurate until the standard deviation of the noise          fit for each sampled symmetry. To examine the effect of
reaches more than 12% of the overall range of the dynamical       the value of  on the classifier error rate, we tested pairs
variable measured.                                                of samples from two kinds of Verhulst logarithmic growth
                                                                  system: System 1 for which K = 100 and System 2 for
We conducted analogous experiments in the multi-variable
                                                                  which K = 90 (these are in different dynamical kinds).
case by testing pairs of samples from two kinds of Lotka-
                                                                  For each of the four possible pairings of System 1 with
Volterra system, specifically one for which r1 = 2, r2 = 4
                                                                  System 2, we conducted 25 experiments (for a total of 50
and one for which r1 = r2 = 3. Again, we ran 100 trials
                                                                  experiments in which the systems belonged to the same
of which 50 involved systems of the same dynamical kind
                                                                  dynamical kind, and 50 in which they belonged in different
and 50 involved systems of different kinds. We saw com-
                                                                  kinds). These experiments were repeated for each of 10
parable or better performance with respect to the preceding
                                                                  different values of , spanning 10 orders of magnitude. As is
experiment (see Figure 4 (c)). There is, however, a subtle
                                                                  clear from Figure 4.5 (e), there is no significant dependence
property of the algorithm’s behavior in this case that bears
                                                                  of the overall error rate on the value chosen. We conducted
mentioning. At very low values of noise, it performs no
                                                                  analogous experiments in the multi-variable case by testing
better than at high values of noise. The reason for this is
                                                                  pairs of samples from two kinds of Lotka-Volterra system,
that for two or more variables, there is no precise overlap
                                                                  specifically one for which r1 = 2, r2 = 4 and one for
between the initial, untransformed trajectories of two sys-
                                                                  which r1 = r2 = 3. Again, we ran 100 trials of which
tems that aren’t exactly alike in parameter values. So when
                                                                  50 involved systems of the same dynamical kind and 50
the noise is low enough, the algorithm can always fit a joint
                                                                  involved systems of different kinds. Again, there was no
model essentially by tying together two separate models of
                                                                  significant dependence of the overall error rate upon the
functions with disjoint support. However, a little bit of noise
                                                                  value of , as indicated in Figure 4.5 (f).
blurs the trajectories enough to produce overlap, and the
statistics of the joint versus separate fits become dominated
by the shapes of the curves, rather than their exact location                 100                    (a)                   100                       (b)
in phase space. Fortunately, the conservative nature of the                    80                                           80
                                                                  Error (%)


algorithm allows this deficiency to be corrected by injecting                  60                                           60
noise into a sample and watching to see if the algorithm                       40                                           40
changes its mind from “same kind” to “different”.                              20                                           20
                                                                                0                                            0
Finally, to get a sense for the fragility of the algorithm                          0        5   10 15 20 25 30                  0        5       10 15 20 25 30
                                                                                                                   σ                                                  α
under violations of normality in the sample noise, we re-
peated the same experiments, except that a Gaussian noise                     100                    (c)                   100                       (d)
source was replaced with the skew normal distribution (Az-                     80                                           80
                               √          (x−µ)2
                                                                  Error (%)


zalini, 1985). If φ(x) = 1/ 2ω 2 πe− 2ω2 is the stan-                          60                                           60
dard normal probability density function    with mean µ and                    40                                           40
                                                                               20                                           20
                                  Rx
standard deviation ω, Φ(x) = −∞ φ(x0 )dx0 is it’s cor-
responding cumulative distribution function, and T (h, α)                       0                                            0
                                                                                    0        5   10 15 20 25 30                  0        5       10 15 20 25 30
                                                                                                                   σ                                                  α
is Owen’s T -function (Owen, 1956), then the cumula-
tive distribution function of the skew normal is given by                     100                    (e)                   100                          (f)
Ψ(x) = Φ( (x−µ)            (x−µ)
               ω ) − 2T ( ω , α). We used the method                           80                                           80
                                                                  Error (%)


of transformation to sample from this skew normal distri-                      60                                           60
bution. When the shape parameter α = 0, the skew normal                        40                                           40
distribution reduces to an ordinary Gaussian with mean µ                       20                                           20
and standard deviation, ω. We tested the algorithm in exper-                    0                                            0
                                                                                        10       8   6     4   2       0             10       8     6         4   2       0
iments in which µ = 0 and α was varied from 0 to 30 while                                                      log(²)                                             log(²)
keeping the standard deviation of the distribution fixed at 15
by varying ω appropriately. The noise was thus increasingly       Figure 4: (a) Error rate as a function of standard deviation
right-skewed. Over this range, however, the skew had no           of normally distributed noise for logistic growth models.
impact on the accuracy of the classifier, as can be seen in       (b) Error rate as a function of the α-parameter of the skew
Figure 4 (b) and (d).                                             normal distribution for logistic growth systems. (c) Error
                                                                  rate versus standard deviation of normally distributed noise
4.5   Dependence on tunable parameters                            for two-species Lotka-Volterra systems. (d) Error rate versus
                                                                  α for Lotka-Volterra systems. (e) Error rate as a function
As indicated in Section 3 we also examined the sensitivity        of log() for logistic growth systems. (f) Error rate as a
of the algorithm to the tunable parameter, , used in a cross-    function of log() for two Lotka-Volterra systems.
validation routine to choose the order of the polynomial
5   Discussion                                                    function of y0 , then Equation 6 is an instance of the sort
                                                                  of (possibly nonlinear) model considered by Hoyer et al.
The experiments reported here demonstrate that it is possi-       (2009).
ble to recognize dynamical kinds without any knowledge
                                                                  According to this model, we can express the joint distribu-
of the underlying dynamics, and to do so in the face of
                                                                  tion over x and y as:
significant noise. By directly assessing whether systems
share a common set of dynamical symmetries, it is possible
to pool systems for model identification or to validate dy-
                                                                     p(x, y) = px (x)pη (y|x) = px (x)pη (y − f (x; y0 ))      (7)
namical models without explicitly fitting models to the data.
Perhaps most importantly, we have demonstrated that an
important sort of scientific inference widely neglected in the    Let’s consider x as the index variable and y the target. Then
literature on automated discovery is amenable to algorithmic      according to the extended definition, a transformation σ(y)
treatment.                                                        is a dynamical symmetry just if the joint distribution, p(x, y)
                                                                  is the same whether we (i) first apply σ when the expected
But the work reported here only begins to exploit the frame-      value E[x] = x0 and then intervene to bring the expected
work of dynamical kinds. The algorithm described above            value of x to E[x] = x0 + δ (changing the probability
was designed with a particular class of system in mind,           density over x to p̃x (x)); or (ii) bring the expected value of
namely those for which the deterministic causal relations         x to E[x] = x0 + δ and then apply σ to y. Given the model
are well-described by a differential equation in time. In other   expressed by Equation 6, this condition is satisfied only if:
words, it was assumed that systems of unknown dynamical
type are governed by a deterministic, time-dependent dy-
namics, and that all random variation observed is noise in the                  p̃x (x)pη (y − f (x; σ(y0 ))) =
measurement process. But the notion of a dynamical kind                                                                        (8)
                                                                                        p̃x (x)pη (y − σ(f (x; y0 )))
as reflected in Definition 2 does not limit one to considering
temporal dynamics. And with only modest modification,             If we assume that p̃x (x) > 0, this entails that
one can accommodate the general case of stochastic dy-
namics, that is, causal relations that are inherently noisy,                         f (x; σ(y0 )) = σ(f (x; y0 ))             (9)
regardless of the measurement process. The requisite modi-
fication is reflected by the following definition:                This is exactly the same condition that would have obtained
Definition 5 (Dynamical symmetry). Let V be a set of              without the additive noise in the model. In general, determin-
variables. Let σ be an intervention on the variables in           istic systems governed by first-order differential equations
Int ⊂ V . The transformation σ is a dynamical symmetry            can be cast in the form of Equation 6 without the additive
with respect to some index variable X ∈ V − Int if and            noise (and it’s not difficult to generalize that form for higher-
only if σ has the following property: for all xi and xf , the     order dynamics). By adding the noise term, we can consider
final probability distribution over V is the same whether σ       stochastic versions of the dynamical systems considered
is applied when E[X] = xi and then an intervention on X           above. For example, we can introduce a stochastic logistic
makes it such that E[X] = xf , or the intervention on X is        growth model this way:
applied first, changing its expected value from xi to xf , and
then σ is applied.                                                                        (K − x0 )y0 x
                                                                              y :=                             +η         (10)
                                                                                   (K − y0 )x0 + (y0 − x0 )x
To see how this extended definition can be applied in the
GCM framework, consider the class of two-variable models          The variable y represents the population at t + δ which is
in which x → y, x is distributed according to px (x), and         dependent upon x, the population at t. The variable y0 is the
the value of y is determined as a function f of the value of      population at t + δ that results from a population of x0 at t.
x and an additive noise variable:                                 Though it is perhaps not obvious from the form, note that
                                                                  Equation 10 reduces to the deterministic model when the
                                                                  noise term is set to 0. The symmetry condition expressed
                     y := f (x; y0 ) + η                   (6)    by Equation 9 entails in this case that
                                                                                                           
The noise variable, η is distributed according to pη (η) with                         (K − x0 )y0 x
                                                                           σ                                  =
a mean value of 0. The function f may be any function                          (K − y0 )x0 + (y0 − x0 )x
from values of x to values of y (not necessarily linear). The                              (K − x0 )σ(y0 )x
parameter y0 is the expected value of y when the expected                         (K − σ(y0 ))x0 + (σ(y0 ) − x0 )x
value of x is x0 . It’s a way of capturing the ‘memory’ and
dependence on initial conditions reflected in a differential      which is in turn satisfied by
equation without unfolding such dependence as an infinite                                             Ky
chain of discrete time slices. Note that when f is a constant                    σp (y) =
                                                                                             (1 − e−p )y + e−p K
This is exactly the same set of symmetries that characterize      Mattingly, J. and W. Warwick (2009, August). Projectible
the deterministic dynamical kind (see Equation 4). In other        Predicates in Analogue and Simulated Systems. Syn-
words, the extended definition of dynamical symmetry picks         these 169(3), 465–482. ArticleType: research-article
out an expanded class of systems that includes as a proper         / Full publication date: Aug., 2009 / Copyright 2009
subset the systems governed by a deterministic Verhulst            Springer.
equation with a particular value of the carrying capacity, K.     Noether, E. and M. A. Tavel (1971). Invariant Variation
But it also includes systems with stochastic causal relations       Problems. Transport Theory and Statistical Physics 1(3),
as described by Equation 10.                                       186–207.
Despite this overlap in conditions for kind membership,           Norton, J. P. (1986). An introduction to identification. Lon-
detecting sameness or difference of dynamical kind in sys-          don ; New York: Academic Press. OCLC: 12237750.
tems exhibiting stochastic causation requires an approach
                                                                  Owen, D. B. (1956). Tables for Computing Bivariate Nor-
different from the algorithm described above. That’s be-
                                                                   mal Probabilities. The Annals of Mathematical Statis-
cause individual trajectories are random walks biased by the
                                                                   tics 27(4), 1075–1090.
function f rather than noisy instantiations of deterministic
trajectories. We leave the development of methods appropri-       Pastor, J. (2011). Mathematical ecology of populations and
ate for stochastic causation to future work. But we would           ecosystems. John Wiley & Sons.
like to reiterate the generality of the theory that has allowed   Pearl, J. (2000). Causality : models, reasoning, and infer-
us to frame the problem of kind discovery in this extended          ence. Cambridge, U.K.; New York: Cambridge Univer-
context. We have already demonstrated an algorithm for the          sity Press.
causal systems of the sort that are the focus of most work
                                                                  Rosen, J. (1995). Symmetry in science : an introduction to
in system identification. The point we wish to stress is that
                                                                    the general theory. New York: Springer-Verlag.
the theoretical framework of dynamical kinds extends to
the relations of stochastic causation at the focus of work in     Rosen, J. (2008). Symmetry rules how science and nature
causal discovery. Exploiting the framework in this context          are founded on symmetry. Berlin: Springer.
in order to infer scientifically salient kinds, validate mod-     Schmidt, M. and H. Lipson (2009, April). Distilling
els, and pool data for model construction will require only         Free-Form Natural Laws from Experimental Data. Sci-
modest algorithmic innovation.                                      ence 324(5923), 81–85.
                                                                  Spirtes, P., C. N. Glymour, and R. Scheines (2000). Cau-
Acknowledgements
                                                                    sation, prediction, and search (2nd ed ed.). Adaptive
This material is based upon work supported by the National          computation and machine learning. Cambridge, Mass:
Science Foundation under Grant No. 1454190.                         MIT Press.
                                                                  Spirtes, P. and K. Zhang (2016, February). Causal discov-
References                                                          ery and inference: concepts and recent methodological
                                                                    advances. Applied Informatics 3(1), 1–28.
Azzalini, A. (1985). A Class of Distributions Which In-
                                                                  Tsoularis, A. and J. Wallace (2002, July). Analysis of lo-
  cludes the Normal Ones. Scandinavian Journal of Statis-
                                                                    gistic growth models. Mathematical Biosciences 179(1),
  tics 12(2), 171–178.
                                                                    21–55.
Golub, P. G. H. and D. C. Reinsch (1970, April). Sin-
                                                                  Turchin, P. (2001, July). Does population ecology have
  gular value decomposition and least squares solutions.
                                                                    general laws? Oikos 94(1), 17–26.
 Numerische Mathematik 14(5), 403–420.
Hoyer, P. O., D. Janzing, J. M. Mooij, J. Peters, and
  B. Schlkopf (2009). Nonlinear causal discovery with
  additive noise models. In D. Koller, D. Schuurmans,
 Y. Bengio, and L. Bottou (Eds.), Advances in Neural In-
  formation Processing Systems 21, pp. 689–696. Curran
 Associates, Inc.
Jantzen, B. C. (2014, December). Projection, symmetry,
  and natural kinds. Synthese 192(11), 3617–3646.
Ljung, L. (2010, April). Perspectives on system identifica-
  tion. Annual Reviews in Control 34(1), 1–12.
Logemann, H. and E. P. Ryan (2014). Ordinary Differential
  Equations: Analysis, Qualitative Theory and Control.
  London: Springer-Verlag.