=Paper=
{{Paper
|id=Vol-2964/article_169
|storemode=property
|title=Learning Dynamical Systems across Environments
|pdfUrl=https://ceur-ws.org/Vol-2964/article_169.pdf
|volume=Vol-2964
|authors=Yuan Yin,Ibrahim Ayed,Emmanuel de Bézenac,Patrick Gallinari
|dblpUrl=https://dblp.org/rec/conf/aaaiss/YinABG21
}}
==Learning Dynamical Systems across Environments==
Yuan Yin¹, Ibrahim Ayed¹·², Emmanuel de Bézenac¹, Patrick Gallinari¹·³
¹ Sorbonne Université, CNRS, LIP6, Paris, France
² Theresis Lab, Thales
³ Criteo AI Lab, Paris, France
{yuan.yin, ibrahim.ayed, emmanuel.de-bezenac, patrick.gallinari}@lip6.fr
Abstract

Learning the behavior of natural phenomena automatically from data has gained much traction in recent years. However, in most real-world scenarios, the environment in which the data samples are acquired varies and may not be the same for each sample. This is due to differing circumstances, e.g. acquisition in different spatial locations, or simply experimental settings which slightly differ. This severely hinders the training process and makes the standard learning framework inapplicable. In this work, we propose a novel framework for modeling physical systems in this context, where we are able to leverage the data across different environments in order to learn the underlying dynamical systems, ensuring generalization without compromising the model's expressiveness and predictive performance. We instantiate our framework on two different families of dynamical systems, showing that our approach yields superior results over the classical learning approach as well as against competitive baselines. Finally, we show that we are also able to accelerate and improve learning for environments that have never been seen before.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

Often, natural phenomena may be difficult to understand and comprehend due to the complex and nonlinear interactions between their composing elements, making it cumbersome to derive a mathematical model describing them. In this context, a data-driven approach arises as a powerful alternative to classical modeling methods, as an unknown model can be learned automatically from the data. Recently, much effort has been focused in this direction (Giannakis and Majda 2012; Mangan et al. 2017), with a particular emphasis on using neural networks (Raissi, Perdikaris, and Karniadakis 2019; Chen et al. 2018; Ayed et al. 2019) for treating cases where the underlying processes are largely unknown. Despite promising results, these methods usually postulate an idealized setting where the data is abundant and the environment in which it is acquired is always the same. However, in practice, this is never the case, as obtaining real-world data samples may be expensive. Perhaps more importantly, the environment in which they are acquired may vary. These changes can be caused by different factors: for example, in climate modeling there are external forces, such as the Coriolis force, that vary across spatial locations (Madec et al. 2019), and in cardiac computational models parameters need to be personalized for each patient (Neic et al. 2017).

The classical learning paradigm in this context is to treat all the data as independent and identically distributed, thus disregarding the discrepancies between the environments. As this assumption is not valid, it leads to a biased solution and results in an average model that performs poorly. Conversely, one may choose to avoid making this assumption by splitting the data from different environments and learning one dynamical system per environment, separately. However, this ignores the similarities between environments and severely affects generalization performance, specifically in settings where per-environment data is limited.

In this work, our goal is to take into account the differences between environments while making use of the similarities across them. We thus propose the LEarning Across Dynamical Systems (LEADS) framework, a novel learning methodology where the dynamics are decomposed into two components: one shared across all environments, and another that captures the dynamics that cannot be expressed by the shared component, and only those. This allows us to leverage the data from similar environments automatically, without compromising the expressiveness of the model. We demonstrate the effectiveness of our framework on two standard examples of dynamics given by differential equations: the Lotka-Volterra predator-prey model, expressed as an ODE, and the Gray-Scott reaction-diffusion equations, expressed as PDEs. Finally, we also show that our method accelerates and improves learning for similar unseen environments.

Approach

Problem Setting

We consider the problem of learning unknown physical processes with data acquired from different environments. For each environment e ∈ E, we assume that the data is generated from an unknown governing differential equation:

dX_t/dt = f_e(X_t)    (1)
defined over a finite time interval [0, T], where the state X is either vector-valued, i.e. X_t ∈ R^d (Lotka-Volterra equations in the Experiments section), or a d-dimensional vector field over a bounded spatial domain Ω ⊂ R^k, i.e. for t ∈ [0, T] and x ∈ Ω, X_t(x) ∈ R^d. As stated above, modifications in the environment have an impact on the dynamics of the system, and thus the evolution terms f_e are expected to be different. Nevertheless, we do assume they share some form of similarity between environments: as we will see in the following, this is not a necessary condition for our framework to be applicable, but it is what allows us to leverage the data from the other environments.

As in Arjovsky et al. (2020), we choose not to discard the information about where the data was collected. We construct our training set with training samples (e, {X^{e,i}}_{i=1,...,N_e}) ∈ D. Each sample is thus composed of the environment identifier e as well as a set of trajectories, where each X^{e,i}, denoting the i-th trajectory in environment e, is a function verifying Equation 1.

Related Work

To make prediction performance invariant across environments, IRM (Arjovsky et al. 2020) aims at finding a classifier that retains the correlations independent of the environment while excluding spurious environment-related ones. However, in the context of dynamical systems, modeling the bias in each environment is as important as modeling the invariant information, as both are indispensable for prediction. This makes IRM incompatible with our setting. Spieckermann et al. (2015) and Bird and Williams (2019) use RNNs conditioned on an environment code to perform biased learning in different environments. Nonetheless, the similarity between environments is not explicitly exploited as common invariant dynamical information.

In terms of robustness at test time, our formulation with a common term is related to Multi-Task Learning (MTL) and Distributionally Robust Optimization (DRO). Baxter (2000) suggests that jointly learning related tasks in MTL can potentially result in better generalization than models learned individually from each task. DRO approaches such as Bietti et al. (2019) and Staib and Jegelka (2019) suggest that, in general loss minimization, imposing certain norm penalties on neural networks (or other models) can encourage better generalization.

The Proposed Framework: LEADS

As the dynamical systems in Equation 1 are unknown, we learn them from the data by parametrizing the evolution terms f_e with neural networks, as in Ayed et al. (2019) and Chen et al. (2018). The problem now lies in how these terms are instantiated. We consider decomposing the dynamics into two components: one, g ∈ F, shared across environments, and another, h_e ∈ F, environment-dependent, such that if F is large enough, there should exist a couple (g, h_e) ∈ F² whose sum recovers the dynamics for environment e, i.e.

∀e ∈ E, f_e = g + h_e    (2)

The general idea here is that, as g is the same for each environment, it can be learned using all data points, across all environments. However, this decomposition yields a potentially infinite number of solutions, in particular the trivial solution obtained by setting g to the null function: in this case, data across environments cannot be leveraged.

In order to avoid this trivial solution, we would like the shared function g to explain the dynamics as much as possible, and in turn make the environment-dependent function h_e as small as possible. The following constrained optimization problem embeds this general idea:

min_{g, h_e ∈ F}  Σ_e ||h_e||²   subject to   ∀(e, X_t^{e,i}) ∈ D,  dX_t^{e,i}/dt = (g + h_e)(X_t^{e,i})    (3)

Let us consider the limit case where the dynamics are the same across environments, i.e. ∀e ∈ E, f_e = f: this objective will then yield as solution the couple (g = f, h_e = 0), meaning that the common information, which is all there is, will be entirely captured by g, as expected. This benefits its generalization performance, as all the data will be used, even the data from different environments.

We now instantiate our method, providing a practical implementation to solve the previous objective. In practice, we do not have access to the data trajectories at every instant t, but only to a finite number of snapshots {X^i_{kΔt}}_{0≤k≤T/Δt} at a temporal resolution Δt. We consider the Lagrangian formulation of the proposed objective as our training loss. Instead of comparing the evolution terms as in Equation 3, we directly compare the trajectories they induce¹·²:

L(g, h, λ) = Σ_{e∈E} ( (1/λ) ||h_e||² + Σ_{i=1}^{N_e} Σ_{k=1}^{K} ||X^{e,i}_{kΔt} − X̃^{e,i}_{kΔt}||² )    (4)

where X̃^{e,i}_{kΔt} = X_0^{e,i} + ∫_0^{kΔt} (g + h_e)(X̃_s^{e,i}) ds are the trajectory states starting from X_0^{e,i}, obtained with a differential equation solver applied to g + h_e up to t = kΔt. Note that λ is treated as a divisor of ||h_e||² rather than as a multiplier of the constraints. This is equivalent to optimizing the original Lagrangian, but is friendlier to gradient-descent-based methods when λ is very large. With an adequate algorithm optimizing g, h and λ, we should arrive at the optimal g and h_e when λ → +∞. However, solving such an optimization problem is difficult, as a varying λ constantly changes the loss surface, which makes learning difficult in the context of dynamical systems. We therefore treat λ as a hyperparameter for each experiment; this should not affect the non-nullity of g, even though in this case the constraints in Equation 3 will not be perfectly satisfied at the optimum.

It is important to note that LEADS is actually independent of the choice of the function space F. We choose neural networks here for their expressiveness, in order to validate our framework. One can apply LEADS to any feasible function space expressed by other data-driven methods.

¹ Note that both are equivalent when Δt tends to 0.
² Directly comparing the (approximate) evolution terms is possible using finite differences, but led to worse results.
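To make the decomposition of Equation 2 and the penalized trajectory loss of Equation 4 concrete, here is a minimal NumPy sketch. It is an illustration under strong assumptions, not the paper's implementation: g and each h_e are linear maps rather than neural networks, the differentiable solver is replaced by an explicit Euler rollout, ||h_e||² is taken as the squared Frobenius norm of the parameters, and all sizes, values, and names (`rollout`, `leads_loss`, d, K, λ) are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_envs, K, dt, lam = 2, 3, 20, 0.5, 1e3  # illustrative sizes and lambda

# Illustrative parametrization: g and each h_e are linear maps R^d -> R^d.
g = rng.normal(scale=0.1, size=(d, d))          # shared component
h = rng.normal(scale=0.1, size=(n_envs, d, d))  # one residual h_e per environment

def rollout(A, x0, K, dt):
    """Explicit-Euler trajectory of dX/dt = A @ X, returning K + 1 states."""
    traj = [x0]
    for _ in range(K):
        traj.append(traj[-1] + dt * A @ traj[-1])
    return np.stack(traj)

# Toy "observed" snapshots, generated here from known per-environment dynamics f_e.
f = [g + h[e] + rng.normal(scale=0.01, size=(d, d)) for e in range(n_envs)]
x0 = np.ones(d)
data = [rollout(f[e], x0, K, dt) for e in range(n_envs)]

def leads_loss(g, h, data, lam):
    """Equation 4: per-env norm penalty (divided by lambda) plus trajectory error."""
    total = 0.0
    for e, X in enumerate(data):
        X_tilde = rollout(g + h[e], X[0], K, dt)  # prediction with g + h_e
        total += (h[e] ** 2).sum() / lam          # ||h_e||^2 / lambda
        total += ((X - X_tilde) ** 2).sum()       # sum_k ||X_k - X~_k||^2
    return total

loss = leads_loss(g, h, data, lam)
```

Minimizing this loss over g and h with gradient descent (in practice with neural networks and a differentiable ODE solver, as the paper does) drives each h_e toward the smallest residual compatible with the trajectories; letting λ → +∞ removes the penalty, which corresponds to the LEADS no min. variant discussed later.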
Method         |  L-V (#E = 4)              |  L-V (#E = 10)             |  G-S (#E = 3)
               |  MSE train   MSE test      |  MSE train   MSE test      |  MSE train   MSE test
Env. Indep.    |  4.79e-1     7.00±1.71 e-1 |  4.57e-1     5.08±0.56 e-1 |  1.55e-2     1.43±0.15 e-2
Env. Dep. Sum  |  6.87e-6     1.26±1.21 e-2 |  7.32e-6     1.22±1.68 e-2 |  8.48e-5     6.43±3.42 e-3
LEADS no min.  |  4.89e-6     3.33±3.14 e-3 |  3.28e-6     3.07±2.58 e-3 |  7.65e-5     5.53±3.43 e-3
LEADS          |  8.15e-6     2.36±2.16 e-3 |  7.63e-6     1.77±1.58 e-3 |  8.60e-5     3.38±3.31 e-3

Table 1: Comparison between LEADS and baselines for Lotka-Volterra (in 4 and 10 envs.) and Gray-Scott equations (in 3 envs.).

Figure 1: Comparison between test trajectories (blue) and ground truth (red), shown in phase space. Blue trajectories are predicted by (a) Env. Dep. Sum and (b) LEADS for Lotka-Volterra in 4 environments (env. 1 to 4, from left to right).
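The multi-environment data generation described in the experiments below (one parameter vector θ_e = (α_e, γ_e) per environment, β = δ = 1, a shared initial condition, step size Δt = 0.5) can be sketched in a few lines. This is a hedged reconstruction, not the authors' code: the θ_e values, the initial condition, and the use of a fixed-step RK4 integrator are illustrative assumptions.

```python
import numpy as np

def lotka_volterra(state, alpha, gamma, beta=1.0, delta=1.0):
    """Evolution term: dx/dt = a*x - b*x*y, dy/dt = d*x*y - g*y."""
    x, y = state
    return np.array([alpha * x - beta * x * y, delta * x * y - gamma * y])

def rk4_step(f, state, dt):
    """One classical Runge-Kutta 4 step for dX/dt = f(X)."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(theta, x0, dt=0.5, K=20):
    """Trajectory of K + 1 snapshots at resolution dt for one environment."""
    alpha, gamma = theta
    f = lambda s: lotka_volterra(s, alpha, gamma)
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(K):
        traj.append(rk4_step(f, traj[-1], dt))
    return np.stack(traj)  # shape (K + 1, 2)

# One environment = one (alpha_e, gamma_e); the initial condition is shared.
thetas = [(0.5, 0.5), (0.5, 1.0), (1.0, 0.5), (1.0, 1.0)]  # illustrative values
x0 = (1.5, 1.5)
dataset = {e: simulate(theta, x0) for e, theta in enumerate(thetas)}
```

Each entry of `dataset` plays the role of one trajectory X^{e,i}; in the paper's setting a single such trajectory per environment is used for training, with K = 20 snapshots so that T = KΔt = 10.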
Experiments

We conduct our experiments on two complex nonlinear dynamical systems. The first is an ODE-driven biological dynamical system; the second is a PDE-driven reaction-diffusion model in which many complex behaviors arise.

Lotka-Volterra Equations  We consider this classical model (Lotka 1926), frequently used to describe the dynamics of the interaction between a predator and a prey in an ecosystem. The dynamics follow the equations:

dx/dt = αx − βxy,    dy/dt = δxy − γy

where x, y are the quantities of the prey and the predator respectively, and α, β, γ, δ define how the two species interact. In fact, by a proper rescaling one can absorb β and δ into x and y. We therefore keep β, δ constant by setting β = δ = 1 across all environments, and let α, γ depend on the environment. The nonlinear interaction between the two species is therefore the environment-independent component, and the linear terms are linked to the environments.

We thus define the parameters θ_e = (α_e, γ_e) for each environment e. Note that choosing θ_e consequently determines the second fixed point of the system, (γ_e, α_e), around which the trajectories orbit. The system state is X_t = (x_t, y_t). The initial conditions are fixed across the environments, i.e. ∀e, X_0^{e,i} = X_0^i. Starting from the same initial condition X_0^i = (x_0^i, y_0^i), we simulate only 1 trajectory per environment for training and 32 for test. Note that the test set is much larger than the training one. The step size is Δt = 0.5 and the dataset horizon is T = KΔt = 10. The experiments are conducted in 4 and 10 environments.

Gray-Scott Equations  This reaction-diffusion model is famous for its Turing patterns and complex behaviors despite its simplistic equations (Pearson 1993). The governing PDE is:

∂u/∂t = D_u Δu − uv² + F(1 − u)
∂v/∂t = D_v Δv + uv² − (F + k)v

where X_t^e = (u_t^e, v_t^e) is the state over a given spatial domain Ω, with periodic boundary conditions. D_u, D_v denote the diffusion coefficients for u and v respectively, which are constant (Pearson 1993). F and k together define the type of the corresponding patterns and behaviors. This means that the diffusion and reaction terms are respectively the environment-independent and environment-dependent components.

We therefore choose parameters θ_e = (F_e, k_e) for each environment e to simulate data. As for the Lotka-Volterra equations, the initial conditions are shared across environments, and we simulate one trajectory per environment for training and 32 trajectories for test. The step size is Δt = 20 and the horizon is T = KΔt = 200. The experiments are conducted in 3 environments.

Figure 2: Comparison of trajectories from (a) Env. Dep. Sum and (b) LEADS with (c) the ground truth for the Gray-Scott equation. Each row represents an environment. We show the state of channel u at t = 0, ..., 5ΔT, accompanied by maps of the prediction error at the rightmost timestep for (d) Env. Dep. Sum and (e) LEADS. The larger the error, the brighter the pixel at the corresponding coordinates.

Training Details  Within the experiments for each equation, the functions g, h_e are NNs with the same architecture. We use 4-layer MLPs for Lotka-Volterra and 4-layer ConvNets for Gray-Scott. We apply Swish as the default activation function (Ramachandran, Zoph, and Le 2017). These networks are integrated in time using the differentiable solver implemented by Chen et al. (2018); basic backpropagation through the internals of the solver is used. We apply an exponential Scheduled Sampling (Lamb et al. 2016) with exponent 0.99 to stabilize training. Across all experiments we use the Adam optimizer (Kingma and Ba 2015) with the same learning rate of 1 × 10⁻³ and (β₁, β₂) = (0.9, 0.999). For the operator norm acting on h_e, we opt for max_{i,k} ||h_e(X^{e,i}_{kΔt})||² / ||X^{e,i}_{kΔt}||², where the X^{e,i} are training-sample trajectories. In order for the estimate of this norm on the test data not to deviate
too much from its value on the training data, we also penalize the sum of the spectral norms of the weights at each layer, Σ_{l=1}^{L} ||W_l^{h_e}||², an upper bound on the associated Lipschitz constant, as suggested in Bietti et al. (2019).

Baselines  We introduce the following baselines to compare with the proposed formulation:
• Env. Indep.: the sum of two environment-independent neural networks, g + h, learned with the standard ERM learning principle, as in Ayed et al. (2019)³;
• Env. Dep. Sum: the sum of two environment-dependent NNs, g_e + h_e;
• LEADS no min.: our proposal without the norm penalty, equivalent to LEADS with λ = +∞.

³ We have opted for the sum as it allows for a proper comparison with our method.

We show the results in Table 1. For the Lotka-Volterra systems, we first confirm that the entire dataset cannot be fit with a single pair of NNs (Env. Indep.). Compared with the other baselines, our method LEADS reduces the test MSE by nearly 4/5 relative to Env. Dep. Sum and by 1/3 relative to LEADS no min. when there are #E = 4 environments. Figure 1 shows samples of predicted test trajectories: LEADS almost overlaps the ground-truth trajectories, while Env. Dep. Sum underperforms in most environments. When the number of environments is increased to #E = 10, the error reduction is over 85% w.r.t. Env. Dep. Sum and over 40% w.r.t. LEADS no min.

We observe the same improving tendency for the Gray-Scott systems. The error of LEADS is around 1/2 of the Env. Dep. Sum test MSE and 60% of the LEADS no min. test MSE. In Figure 2(a)-(c), the states obtained with our method are qualitatively closer to the ground truth. With the help of the error maps in Figure 2(d) and (e), we see that at the rightmost end-time frames the errors are systematically reduced across all environments. This shows that LEADS accumulates fewer errors through the integration, which suggests that LEADS alleviates overfitting on the support.

Learning in Unknown Environments

We demonstrate how the learned invariant dynamics can boost fitting in new, similar environments. We suppose now that we have an invariant function ĝ learned with LEADS from L-V (#E = 4). We then generate another Lotka-Volterra dataset in new environments E_new, still with 1 trajectory per environment in the training set and 32 in the test set.

Let us consider the following adaptation strategies:
• No adapt.: a sanity check to ensure that the new dynamics cannot be predicted by ĝ without further adaptation;
• Env. Dep. Sum from scratch: the sum of two environment-dependent NNs, trained from scratch, without boosting by ĝ;
• Env. Dep. Single from scratch: a single environment-dependent NN, trained from scratch, without boosting by ĝ;
• LEADS boosted Env. Dep. Single: an environment-dependent NN h_e boosted by the learned ĝ.

Adaptation                      |          MSE test at iteration
                                |  50      250       500       10000
No adapt.                       |  ——————— 0.36 ———————
Env. Dep. Sum from scratch      |  0.23    5.02e-2   0.25      3.05e-3
Env. Dep. Single from scratch   |  1.65    18.3      8.87e-2   4.13e-3
LEADS boosted Env. Dep. Single  |  0.73    2.06e-3   1.84e-3   1.11e-3

Table 2: Comparison of different adaptation strategies in 2 new environments of Lotka-Volterra at different iterations.

Table 2 contains the adaptation results at training iterations from 50 to 10000. With No adapt., we first show that ĝ alone is not able to predict in any of these new environments, even though they are closely related to the original ones. At iteration 50, we observe that the last three adaptation strategies perform poorly, as expected, since they are at an early stage of training. As soon as iteration 250, LEADS boosted Env. Dep. Single already surpasses the best performance that the from-scratch methods (Env. Dep. Sum and Env. Dep. Single from scratch) reach at iteration 10000. This clearly shows that the learned shared dynamics improves and accelerates learning in new environments.

Conclusion

We introduce LEADS, a data-driven framework to learn dynamics from data collected from a set of similar yet different dynamical systems. Demonstrated on two complex families of systems, our framework significantly improves test performance in every environment, especially when the number of available trajectories is limited. We finally show that the dynamics extracted by LEADS can boost learning in similar new environments, which leads us towards a more flexible framework for prediction and generalization in new environments.

Acknowledgements

This work was partially funded by Locust ANR-15-CE23-0027 and Chaires de recherche et d'enseignement en intelligence artificielle (Chaires IA), DL4Clim project (PG).

References

Arjovsky, M.; Bottou, L.; Gulrajani, I.; and Lopez-Paz, D. 2020. Invariant Risk Minimization. arXiv:1907.02893 [cs, stat].

Ayed, I.; de Bézenac, E.; Pajot, A.; Brajard, J.; and Gallinari, P. 2019. Learning Dynamical Systems from Partial Observations. CoRR abs/1902.11136.

Baxter, J. 2000. A Model of Inductive Bias Learning. J. Artif. Int. Res. 12(1): 149–198. ISSN 1076-9757.

Bietti, A.; Mialon, G.; Chen, D.; and Mairal, J. 2019. A Kernel Perspective for Regularizing Deep Neural Networks. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 664–674. Long Beach, California, USA: PMLR.

Bird, A.; and Williams, C. K. I. 2019. Customizing Sequence Generation with Multi-Task Dynamical Systems. CoRR abs/1910.05026.

Chen, R. T. Q.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. K. 2018. Neural Ordinary Differential Equations. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 31, 6571–6583. Curran Associates, Inc.

Giannakis, D.; and Majda, A. J. 2012. Nonlinear Laplacian spectral analysis for time series with intermittency and low-frequency variability. Proceedings of the National Academy of Sciences 109(7): 2222–2227. ISSN 0027-8424. doi:10.1073/pnas.1118984109. URL https://www.pnas.org/content/109/7/2222.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Bengio, Y.; and LeCun, Y., eds., 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Lamb, A.; Goyal, A.; Zhang, Y.; Zhang, S.; Courville, A.; and Bengio, Y. 2016. Professor Forcing: A New Algorithm for Training Recurrent Networks. arXiv:1610.09038 [cs, stat].

Lotka, A. J. 1926. Elements of Physical Biology. Science Progress in the Twentieth Century (1919-1933) 21(82): 341–343. ISSN 20594941. URL http://www.jstor.org/stable/43430362.

Madec, G.; Bourdallé-Badie, R.; Chanut, J.; Clementi, E.; Coward, A.; Ethé, C.; Iovino, D.; Lea, D.; Lévy, C.; Lovato, T.; Martin, N.; Masson, S.; Mocavero, S.; Rousset, C.; Storkey, D.; Vancoppenolle, M.; Müeller, S.; Nurser, G.; Bell, M.; and Samson, G. 2019. NEMO ocean engine. Add SI3 and TOP reference manuals.

Mangan, N. M.; Kutz, J. N.; Brunton, S. L.; and Proctor, J. L. 2017. Model selection for dynamical systems via sparse regression and information criteria. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473(2204): 20170009. doi:10.1098/rspa.2017.0009.

Neic, A.; Campos, F. O.; Prassl, A. J.; Niederer, S. A.; Bishop, M. J.; Vigmond, E. J.; and Plank, G. 2017. Efficient computation of electrograms and ECGs in human whole heart simulations using a reaction-eikonal model. Journal of Computational Physics 346: 191–211. ISSN 0021-9991.

Pearson, J. E. 1993. Complex Patterns in a Simple System. Science 261(5118): 189–192. ISSN 0036-8075. doi:10.1126/science.261.5118.189.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378: 686–707.

Ramachandran, P.; Zoph, B.; and Le, Q. V. 2017. Searching for Activation Functions. CoRR abs/1710.05941.

Spieckermann, S.; Düll, S.; Udluft, S.; Hentschel, A.; and Runkler, T. 2015. Exploiting similarity in system identification tasks with recurrent neural networks. Neurocomputing 169: 343–349. ISSN 0925-2312.

Staib, M.; and Jegelka, S. 2019. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems, 9134–9144.