<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Learning Dynamical Systems across Environments</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yuan</forename><surname>Yin</surname></persName>
							<email>yuan.yin@lip6.fr</email>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">Sorbonne Université</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">LIP6</orgName>
								<address>
									<settlement>Paris</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ibrahim</forename><surname>Ayed</surname></persName>
							<email>ibrahim.ayed@lip6.fr</email>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">Sorbonne Université</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">LIP6</orgName>
								<address>
									<settlement>Paris</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">Theresis Lab</orgName>
								<orgName type="institution">Thales</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Emmanuel</forename><surname>De Bézenac</surname></persName>
							<email>emmanuel.de-bezenac@lip6.fr</email>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">Sorbonne Université</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">LIP6</orgName>
								<address>
									<settlement>Paris</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Patrick</forename><surname>Gallinari</surname></persName>
							<email>patrick.gallinari@lip6.fr</email>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">Sorbonne Université</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">LIP6</orgName>
								<address>
									<settlement>Paris</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Criteo AI Lab</orgName>
								<address>
									<settlement>Paris</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Learning Dynamical Systems across Environments</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D1E8C7DF2D622D5126F4711B5DA000E6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Learning the behavior of natural phenomena automatically from data has gained much traction in recent years. However, in most real-world scenarios, the environment in which the data samples are acquired varies and may not be the same for each sample. This is due to different circumstances, e.g. acquisition at different spatial locations, or simply experimental settings that slightly differ. This severely hinders the training process and makes the standard learning framework inapplicable. In this work, we propose a novel framework for modeling physical systems in this context, in which we leverage the data across different environments in order to learn the underlying dynamical systems, ensuring generalization without compromising the model's expressiveness and predictive performance. We instantiate our framework on two different families of dynamical systems, showing that our approach yields superior results over the classical learning approach as well as against competitive baselines. Finally, we show that we are also able to accelerate and improve learning for environments that have never been seen before.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Natural phenomena are often difficult to understand due to the complex and nonlinear interactions between their composing elements, making it cumbersome to derive a mathematical model describing them. In this context, a data-driven approach arises as a powerful alternative to classical modeling methods, as an unknown model can be learned automatically from the data. Recently, much effort has been focused in this direction <ref type="bibr" target="#b7">(Giannakis and Majda 2012;</ref><ref type="bibr" target="#b11">Mangan et al. 2017)</ref>, with a particular emphasis on using neural networks <ref type="bibr" target="#b13">(Raissi, Perdikaris, and Karniadakis 2019;</ref><ref type="bibr" target="#b5">Chen et al. 2018;</ref><ref type="bibr" target="#b1">Ayed et al. 2019)</ref> for treating cases where the underlying processes are largely unknown. Despite promising results, these methods usually postulate an idealized setting where data is abundant and the environment in which it is acquired is always the same. In practice, however, this is rarely the case, as obtaining real-world data samples may be expensive. Perhaps more importantly, the environment in which they are acquired may vary. These changes can be caused by different factors: for example, in climate modeling, external forces such as the Coriolis force vary across spatial locations <ref type="bibr" target="#b10">(Madec et al. 2019)</ref>, while in cardiac computational models, parameters need to be personalized for each patient <ref type="bibr">(Neic et al. 2017)</ref>.</p><p>The classical learning paradigm in this context is to treat all the data as independent and identically distributed, thus disregarding the discrepancies between the environments. As this assumption does not hold, it leads to a biased solution and results in an average model that performs poorly. Conversely, one may choose to avoid this assumption altogether by splitting the data from the different environments and learning one dynamical system per environment, separately. However, this ignores the similarities between environments and severely affects generalization performance, especially in settings where per-environment data is limited.</p><p>In this work, our goal is to account for the differences between environments while exploiting the similarities across them. To this end, we propose the LEarning Across Dynamical Systems (LEADS) framework, a novel learning methodology where the dynamics are decomposed into two components: one shared across all environments, and another that captures the dynamics that cannot be expressed by the shared component, and only those. This allows us to leverage the data from similar environments automatically, without compromising the expressiveness of the model. We demonstrate the effectiveness of our framework on two standard examples of dynamics given by differential equations: the Lotka-Volterra predator-prey model, expressed as an ODE, and the Gray-Scott reaction-diffusion equations, expressed as PDEs. Finally, we also show that our method accelerates and improves learning for similar unseen environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Approach: Problem Setting</head><p>We consider the problem of learning unknown physical processes from data acquired in different environments. For each environment e ∈ E, we assume that the data is generated by an unknown governing differential equation:</p><formula xml:id="formula_0">dX t dt = f e (X t )<label>(1)</label></formula><p>defined over a finite time interval [0, T ], where the state X is either vector-valued, i.e. X t ∈ R d (Lotka-Volterra equations in the Experiments section), or a d-dimensional vector field over a bounded spatial domain Ω ⊂ R k , i.e. for t ∈ [0, T ] and x ∈ Ω, X t (x) ∈ R d . As stated above, modifications in the environment have an impact on the dynamics of the system, and the evolution terms f e are thus expected to differ. Nevertheless, we do assume some form of similarity between environments: as we will see in the following, this is not a necessary condition for our framework to be applicable, but it is what allows us to leverage the data from the other environments.</p><p>As in <ref type="bibr" target="#b0">Arjovsky et al. (2020)</ref>, we choose not to discard the information about where the data was collected. We construct our training set D from samples (e, {X e,i } i=1,...,Ne ), each composed of the environment identifier e together with a set of trajectories, where X e,i , the i-th trajectory in environment e, is a function satisfying Equation <ref type="formula" target="#formula_0">1</ref>.</p></div>
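As an illustration of this data setup, the following minimal numpy sketch (all names and toy dynamics here are hypothetical, not the paper's code) builds a training set keyed by the environment identifier e, with a set of trajectory snapshots per environment:

```python
import numpy as np

def euler_trajectory(f, x0, dt, n_steps):
    """Integrate dX/dt = f(X) with explicit Euler, returning snapshots X_{k*dt}."""
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        traj.append(traj[-1] + dt * f(traj[-1]))
    return np.stack(traj)  # shape (n_steps + 1, d)

def build_dataset(env_fns, x0s, dt=0.5, n_steps=20):
    """D keeps the environment identifier e next to its trajectories {X^{e,i}}."""
    return {e: np.stack([euler_trajectory(f_e, x0, dt, n_steps) for x0 in x0s])
            for e, f_e in env_fns.items()}

# toy per-environment dynamics: same functional form, environment-specific parameters
env_fns = {
    0: lambda x: np.array([1.0 * x[0] - x[0] * x[1], x[0] * x[1] - 1.0 * x[1]]),
    1: lambda x: np.array([0.5 * x[0] - x[0] * x[1], x[0] * x[1] - 0.5 * x[1]]),
}
D = build_dataset(env_fns, x0s=[np.array([1.0, 1.0]), np.array([1.5, 1.0])])
```

Each entry of `D` thus pairs an environment identifier with an array of shape (trajectories, snapshots, state dimension), mirroring the samples (e, {X e,i }) described above.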
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>To make prediction performance invariant across environments, IRM <ref type="bibr" target="#b0">(Arjovsky et al. 2020</ref>) aims at finding a classifier that retains the correlations that hold independently of the environment, excluding spurious environment-related ones. However, in the context of dynamical systems, modeling the bias of each environment is as important as modeling the invariant information, as both are indispensable for prediction. This makes IRM incompatible with our setting. <ref type="bibr" target="#b15">Spieckermann et al. (2015)</ref> and <ref type="bibr" target="#b4">Bird and Williams (2019)</ref> use RNNs conditioned on an environment code to perform biased learning in different environments. Nonetheless, the similarity between environments is not explicitly exploited as common invariant dynamical information.</p><p>In terms of robustness at test time, our formulation with a common term is related to Multi-Task Learning (MTL) and Distributionally Robust Optimization (DRO). <ref type="bibr" target="#b2">Baxter (2000)</ref> suggests that jointly learning related tasks in MTL can result in better generalization than learning a model individually for each task. DRO approaches such as <ref type="bibr" target="#b3">Bietti et al. (2019)</ref> and <ref type="bibr" target="#b16">Staib and Jegelka (2019)</ref> suggest that, in general loss minimization, imposing a certain norm penalty on neural networks (or other models) can encourage better generalization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Proposed Framework: LEADS</head><p>As the dynamical systems in equation (<ref type="formula" target="#formula_0">1</ref>) are unknown, we learn them from the data by parametrizing the evolution terms f e with neural networks, as in <ref type="bibr" target="#b1">Ayed et al. (2019)</ref> and <ref type="bibr" target="#b5">Chen et al. (2018)</ref>. The problem now lies in how these terms are instantiated. We consider decomposing the dynamics into two components: one g ∈ F shared across environments, and another, environment-dependent component h e ∈ F, such that if F is large enough, there should exist a couple (g, h e ) ∈ F<ref type="foot" target="#foot_1">2</ref> whose sum recovers the dynamics of environment e, i.e.</p><p>∀e ∈ E, f e = g + h e</p><p>(2)</p><p>The general idea is that, since g is the same for each environment, it can be learned using all data points, across all environments. However, this decomposition admits a potentially infinite number of solutions, in particular the trivial one obtained by setting g to the null function: in that case, data across environments cannot be leveraged.</p><p>In order to avoid this trivial solution, we would like the shared function g to explain as much of the dynamics as possible, and in turn make the environment-dependent function h e as small as possible. The following constrained optimization problem embeds this idea:</p><formula xml:id="formula_1">min g,he∈F e h e 2 subject to ∀(e, X e,i t ) ∈ D, dX e,i t dt = (g + h e )(X e,i t )<label>(3)</label></formula><p>Let us consider the limit case where the dynamics are the same across environments, i.e. ∀e ∈ E, f e = f : this objective then yields the couple (g = f, h = 0), meaning that the common information, which is all there is, is entirely captured by g, as expected. 
This will benefit its generalization performance, as all the data is used, including data from different environments.</p><p>We now instantiate our method, providing a practical implementation of the previous objective. In practice, we do not have access to the data trajectories at every instant t but only to a finite number of snapshots {X i k∆t } 0≤k≤ T /∆t at a temporal resolution ∆t. We take the Lagrangian formulation of the proposed objective as our training loss. Instead of comparing the evolution terms as in Equation <ref type="formula" target="#formula_1">3</ref>, we directly compare the trajectories they induce<ref type="foot" target="#foot_0">1</ref><ref type="foot" target="#foot_1">2</ref>:</p><formula xml:id="formula_2">L(g, h, λ) = e∈E 1 λ h e 2 + Ne i=1 K k=1 X e,i k∆t − Xe,i k∆t 2</formula><p>(4) where Xe,i k∆t = X e,i 0 + k∆t 0 (g + h e )( Xe,i s ) ds is the trajectory state obtained by integrating g + h e from X e,i 0 up to t = k∆t with a DE solver. Note that λ is treated as a divisor of h e 2 rather than as a multiplier of the constraint term. This is equivalent to optimizing the original Lagrangian, but is more amenable to gradient-descent-based methods when λ is very large. With an adequate algorithm optimizing g, h and λ in practice, we should arrive at the optimal g and h e as λ → +∞. However, solving such an optimization problem is difficult, as a varying λ constantly changes the loss surface, which hinders learning in the context of dynamical systems. We therefore treat λ as a hyperparameter for each experiment; this should not affect the non-nullity of g, even though the constraints in Equation 3 will then not be perfectly satisfied at the optimum.</p><p>It is important to note that LEADS is independent of the choice of the function space F. We choose neural networks here for their expressiveness, in order to validate our framework. One can apply LEADS to any feasible function space expressed by other data-driven methods.</p></div>
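To make the training objective concrete, here is a minimal numerical sketch of the loss in Equation 4. This is a hedged illustration, not the paper's implementation: linear maps stand in for the neural networks g and h_e, a crude Euler scheme replaces the differentiable DE solver, and all parameter values are toy choices.

```python
import numpy as np

def rollout(W, x0, dt, n_steps):
    """Euler rollout of the linear dynamics dX/dt = W X (stand-in for a neural net)."""
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        traj.append(traj[-1] + dt * W @ traj[-1])
    return np.stack(traj)

def leads_loss(g, h, data, dt, lam):
    """L(g, h, lam) = sum_e (1/lam)||h_e||^2 + sum_{i,k} ||X_{k*dt} - X~_{k*dt}||^2,
    with X~ integrated from X_0 under the summed dynamics g + h_e (cf. Eq. 4)."""
    loss = 0.0
    for e, trajs in data.items():
        loss += np.sum(h[e] ** 2) / lam              # penalty on the env-specific part
        for true_traj in trajs:
            pred = rollout(g + h[e], true_traj[0], dt, len(true_traj) - 1)
            loss += np.sum((true_traj - pred) ** 2)  # trajectory data-fit term
    return loss

# toy data generated from a shared part G plus small env-specific residuals H_e
rng = np.random.default_rng(0)
G = np.array([[0.0, -1.0], [1.0, 0.0]])
H = {0: 0.1 * np.eye(2), 1: -0.1 * np.eye(2)}
data = {e: np.stack([rollout(G + H[e], rng.standard_normal(2), 0.1, 10)
                     for _ in range(3)]) for e in H}

loss_true = leads_loss(G, H, data, dt=0.1, lam=100.0)          # only the penalty remains
loss_null = leads_loss(G, {e: np.zeros((2, 2)) for e in H}, data, dt=0.1, lam=100.0)
```

At the generating parameters the data-fit term vanishes and only the (1/λ)‖h_e‖² penalty remains, whereas dropping the environment-specific parts leaves a much larger trajectory error, which is exactly the trade-off the objective encodes.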
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments</head><p>We conduct our experiments on two complex nonlinear dynamical systems. The first is an ODE-driven biological dynamical system, and the second a PDE-driven reaction-diffusion model exhibiting many complex behaviors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lotka-Volterra Equation</head><p>We consider this classical model <ref type="bibr" target="#b9">(Lotka 1926)</ref>, frequently used to describe the dynamics of the interaction between a predator and a prey in an ecosystem. The dynamics follow the equations:</p><formula xml:id="formula_4">dx dt = αx − βxy, dy dt = δxy − γy</formula><p>where x, y are the quantities of the prey and the predator, and α, β, γ, δ define how the two species interact. In fact, by a proper rescaling one can absorb β and δ into x and y. We therefore keep β, δ constant by setting β = δ = 1 across all environments and let α, γ depend on the environment. The nonlinear interaction between the two species is therefore the environment-independent component, while the linear terms are linked to the environments. We thus define the parameter θ e = (α e , γ e ) for each environment e. Note that choosing θ e consequently determines the second fixed point of the system, (γ e , α e ), around which the trajectories orbit. The system state is X t = (x t , y t ). The initial conditions are fixed across environments, i.e. ∀e, X e,i 0 = X i 0 . Starting from the same initial condition X i 0 = (x i 0 , y i 0 ), we simulate only 1 trajectory per environment for training and 32 for test. Note that the test set is much larger than the training one. The step size is ∆t = 0.5 and the dataset horizon is T = K∆t = 10. The experiments are conducted with 4 and 10 environments.</p></div>
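For concreteness, the data-generation process described above can be sketched as follows. This is an illustrative reimplementation, not the paper's simulation code: a standard RK4 integrator, one hypothetical environment θ_e = (α_e, γ_e), and a finer step than the dataset's ∆t. For β = δ = 1, the first integral V = x − γ ln x + y − α ln y is constant along exact orbits, giving a simple sanity check on the integrator.

```python
import numpy as np

def lv_rhs(x, alpha, gamma):
    """Lotka-Volterra with beta = delta = 1: dx/dt = a*x - x*y, dy/dt = x*y - g*y."""
    return np.array([alpha * x[0] - x[0] * x[1], x[0] * x[1] - gamma * x[1]])

def rk4_step(f, x, dt):
    """One classical Runge-Kutta 4 step."""
    k1 = f(x)
    k2 = f(x + dt / 2 * k1)
    k3 = f(x + dt / 2 * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(theta, x0, dt=0.01, n_steps=1000):
    alpha, gamma = theta
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        xs.append(rk4_step(lambda x: lv_rhs(x, alpha, gamma), xs[-1], dt))
    return np.stack(xs)

# one hypothetical environment; trajectories orbit the fixed point (gamma_e, alpha_e)
traj = simulate(theta=(0.8, 1.2), x0=(1.5, 1.5))
# conserved quantity along exact orbits: V = x - gamma*ln(x) + y - alpha*ln(y)
V = traj[:, 0] - 1.2 * np.log(traj[:, 0]) + traj[:, 1] - 0.8 * np.log(traj[:, 1])
```

With dt = 0.01 the drift of V over the horizon is negligible, confirming that the simulated orbits indeed cycle around (γ_e, α_e) as stated above.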
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Gray-Scott Equation</head><p>This reaction-diffusion model is famous for its Turing patterns and for the complexity of its behaviors relative to the simplicity of its equations <ref type="bibr" target="#b12">(Pearson 1993)</ref>. The governing PDE is:</p><formula xml:id="formula_5">∂u ∂t = D u ∆u − uv 2 + F (1 − u) ∂v ∂t = D v ∆v + uv 2 − (F + k)v</formula><p>where X e t = (u e t , v e t ) is the state over a given spatial domain Ω, with periodic boundary conditions. D u and D v denote the diffusion coefficients of u and v respectively, which are constant <ref type="bibr" target="#b12">(Pearson 1993)</ref>. F and k together define the type of the corresponding patterns and behaviors. The diffusion terms are therefore the environment-independent component and the reaction terms the environment-dependent one.</p><p>We therefore choose parameters θ e = (F e , k e ) for each environment e to simulate data. As for the Lotka-Volterra equations, the initial conditions are shared across environments and we simulate one trajectory per environment for training and 32 trajectories for test. The step size is ∆t = 20 and the horizon is T = K∆t = 200. The experiments are conducted with 3 environments.</p><p>Training Details Within the experiments for each equation, the functions g, h are neural networks with the same architecture. We use 4-layer MLPs for Lotka-Volterra and 4-layer ConvNets for Gray-Scott. We apply Swish as the default activation function <ref type="bibr" target="#b14">(Ramachandran, Zoph, and Le 2017)</ref>. These networks are integrated in time using the differentiable solver implemented by <ref type="bibr" target="#b5">Chen et al. (2018)</ref>, with basic backpropagation through the internals of the solver rather than the adjoint method. We apply an exponential Scheduled Sampling <ref type="bibr">(Lamb et al. 2016</ref>) with exponent 0.99 to stabilize the training. 
We use the Adam optimizer <ref type="bibr" target="#b8">(Kingma and Ba 2015)</ref> across all experiments, with the same learning rate of 1 × 10 −3 and (β 1 , β 2 ) = (0.9, 0.999). For the operator norm acting on h e , we opt for max i,k h e (X e,i k∆t ) 2 / X e,i k∆t 2 , where the X e,i are the training sample trajectories. In order for the estimate of this norm not to deviate on test data, we also control an upper bound on the associated Lipschitz constant, as suggested in <ref type="bibr" target="#b3">Bietti et al. (2019)</ref>.</p><p>Baselines We introduce the following baselines to compare with the proposed formulation:</p><p>• Env. Indep.: the sum of two environment-independent neural networks g + h, learned with the standard ERM learning principle, as in <ref type="bibr" target="#b1">Ayed et al. (2019)</ref> <ref type="foot" target="#foot_2">3</ref> ,</p><p>• Env. Dep. Sum: the sum of two environment-dependent NNs g e + h e ,</p><p>• LEADS no min.: our proposal without the norm penalty, equivalent to LEADS with λ = +∞.</p><p>We show the results in Table <ref type="table" target="#tab_0">1</ref>. For the Lotka-Volterra systems, we first confirm that the entire dataset cannot be fit with a single pair of NNs (Env. Indep.). Compared with the other baselines, LEADS reduces the test MSE of Env. Dep. Sum by nearly 4/5 and that of LEADS no min. by 1/3 when there are #E = 4 environments. Figure <ref type="figure" target="#fig_0">1</ref> shows samples of predicted test trajectories: LEADS almost overlaps the ground-truth trajectory, while Env. Dep. Sum underperforms in most environments. When the number of environments is increased to #E = 10, the error reduction is over 85% w.r.t. Env. Dep. Sum and over 40% w.r.t. LEADS no min.</p><p>We observe the same improving tendency for the Gray-Scott systems. The error of LEADS is around 1/2 of the Env. Dep. Sum test MSE and 60% of the LEADS no min. test MSE. In Figure <ref type="figure" target="#fig_1">2(a)-(c</ref>), the states obtained with our method are qualitatively closer to the ground truth. 
With the help of the error maps in Figure <ref type="figure" target="#fig_1">2</ref>(d) and (e), we see that at the rightmost, final-time frames, the errors are systematically reduced across all environments. This shows that LEADS accumulates fewer errors through the integration, which suggests that LEADS alleviates overfitting on the data support. </p></div>
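For reference, the Gray-Scott dynamics studied above can be reproduced with a simple explicit finite-difference scheme. This is a minimal sketch with a periodic 5-point Laplacian; the grid size, time step, and parameter values are illustrative choices, not the paper's simulation settings.

```python
import numpy as np

def laplacian(Z):
    """5-point Laplacian with periodic boundary conditions (unit grid spacing)."""
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
            np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4.0 * Z)

def gray_scott_step(u, v, Du, Dv, F, k, dt):
    """One explicit Euler step of the Gray-Scott reaction-diffusion equations."""
    uvv = u * v * v
    u_new = u + dt * (Du * laplacian(u) - uvv + F * (1.0 - u))
    v_new = v + dt * (Dv * laplacian(v) + uvv - (F + k) * v)
    return u_new, v_new

# illustrative parameters: Du, Dv shared across environments, theta_e = (F_e, k_e) per env
Du, Dv, F, k = 0.16, 0.08, 0.035, 0.065
n = 32
u = np.ones((n, n))
v = np.zeros((n, n))
u[12:20, 12:20] = 0.50   # seed a local perturbation so patterns can develop
v[12:20, 12:20] = 0.25
for _ in range(200):
    u, v = gray_scott_step(u, v, Du, Dv, F, k, dt=1.0)
```

The environment-dependent reaction parameters (F, k) select which pattern family emerges, while the shared diffusion coefficients are kept fixed, mirroring the decomposition used in the experiments.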
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Learning in Unknown Environments</head><p>We demonstrate how the learned invariant dynamics can boost fitting in new, similar environments. We now suppose that we have an invariant function ĝ learned with LEADS from L-V (#E = 4). We then generate another Lotka-Volterra dataset in new environments E new , still with 1 trajectory per environment in the training set and 32 in the test set.</p><p>Let us consider the following adaptation strategies:</p><p>• No adapt.: a sanity check to ensure that the new dynamics cannot be predicted by ĝ without further adaptation,</p><p>• Env. Dep. Sum from scratch: the sum of two environment-dependent NNs, trained from scratch, no boosting by ĝ,</p><p>• Env. Dep. Single from scratch: an environment-dependent NN, trained from scratch, no boosting by ĝ,</p><p>• LEADS boosted Env. Dep. Single: an environment-dependent NN h e trained with the boost of the learned ĝ.</p><p>Table <ref type="table" target="#tab_1">2</ref> contains the adaptation results at training iterations from 50 to 10000. With No adapt., we first show that ĝ alone is not able to predict in any of these new environments, even though they are closely related to the original ones. At iteration 50, we observe that the last three adaptations perform poorly, as expected, since they are at an early stage of training. As soon as iteration 250, LEADS boosted Env. Dep. Single already surpasses the best performance of the trained-from-scratch methods (Env. Dep. Sum and Env. Dep. Single from scratch) at iteration 10000. This clearly shows that the learned shared dynamics improve and accelerate learning in new environments.</p></div>
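The boosted adaptation strategy can be sketched numerically as follows. This is a toy illustration under strong simplifying assumptions, not the paper's procedure: linear maps stand in for the networks, the frozen shared part plays the role of ĝ, and only the environment-specific residual h_e is fitted by gradient descent on a one-step prediction loss.

```python
import numpy as np

dt = 0.1
G_hat = np.array([[0.0, -1.0], [1.0, 0.0]])    # frozen shared dynamics (role of g-hat)
H_true = np.array([[-0.1, 0.0], [0.0, -0.2]])  # unknown residual of the new environment

# one training trajectory from the new environment (Euler snapshots)
xs = [np.array([1.5, 1.0])]
for _ in range(50):
    xs.append(xs[-1] + dt * (G_hat + H_true) @ xs[-1])
X, Y = np.stack(xs[:-1]), np.stack(xs[1:])     # one-step (input, target) pairs

def one_step_mse(H):
    """Mean squared one-step prediction error under dynamics g-hat + h_e."""
    pred = X + dt * X @ (G_hat + H).T
    return np.mean(np.sum((Y - pred) ** 2, axis=1))

# adapt: gradient descent on h_e only; g-hat stays frozen throughout
H = np.zeros((2, 2))
lr = 5.0
for _ in range(2000):
    residual = (X + dt * X @ (G_hat + H).T) - Y
    grad = 2.0 * dt * residual.T @ X / len(X)  # gradient of one_step_mse w.r.t. H
    H -= lr * grad
```

Since only the small residual has to be estimated while the shared structure is given, the adapted model reaches a low error with a single short trajectory, which is the mechanism behind the fast convergence reported in Table 2.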
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>We introduce LEADS, a data-driven framework to learn dynamics from data collected from a set of similar yet different dynamical systems. As demonstrated on two complex families of systems, our framework significantly improves test performance in every environment, especially when the number of available trajectories is limited. We finally show that the shared dynamics extracted by LEADS can boost learning in similar new environments, leading us towards a more flexible framework for prediction and generalization in new environments.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Comparison between test trajectories (blue) and ground truth (red), shown in phase space. Blue trajectories are predicted by (a) Env. Dep. Sum and (b) LEADS for Lotka-Volterra in 4 environments (env. 1 to 4 from left to right).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Comparison of trajectories from (a) Env. Dep. Sum and (b) LEADS with (c) the ground truth for the Gray-Scott equation. Each row represents an environment. We show the state of channel u at t = 0, . . . , 5∆T . They are accompanied by the maps of prediction error at the rightmost timestep by (d) Env. Dep. Sum and (e) LEADS. The larger the error, the brighter the pixel at the corresponding coordinates.</figDesc><graphic coords="4,54.00,63.96,141.12,71.19" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Comparison between LEADS and baselines for Lotka-Volterra (in 4 and 10 envs.) and Gray-Scott equations (in 3 envs.).</figDesc><table><row><cell/></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Comparison of different adaptation strategies in 2 new environments of Lotka-Volterra at different iterations.</figDesc><table><row><cell>Adaptation</cell><cell></cell><cell cols="2">MSE test at iteration</cell></row><row><cell></cell><cell>50</cell><cell>250</cell><cell>500</cell><cell>10000</cell></row><row><cell>No adapt.</cell><cell></cell><cell cols="2">-0.36 -</cell></row><row><cell>Env. Dep. Sum from scratch</cell><cell cols="2">0.23 5.02e-2</cell><cell>0.25</cell><cell>3.05e-3</cell></row><row><cell>Env. Dep. Single from scratch</cell><cell>1.65</cell><cell>18.3</cell><cell cols="2">8.87e-2 4.13e-3</cell></row><row><cell>LEADS boosted Env. Dep. Single</cell><cell cols="4">0.73 2.06e-3 1.84e-3 1.11e-3</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Note that both are equivalent when ∆t tends to 0.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Directly comparing the (approximate) evolution terms is possible using finite differences, but led to worse results.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">We have opted for the sum as it allows for a proper comparison with our method.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work was partially funded by Locust ANR-15-CE23-0027 and Chaires de recherche et d'enseignement en intelligence artificielle (Chaires IA), DL4Clim project (PG).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Arjovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gulrajani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.02893</idno>
		<title level="m">Invariant Risk Minimization</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Learning Dynamical Systems from Partial Observations</title>
		<author>
			<persName><forename type="first">I</forename><surname>Ayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>De Bézenac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pajot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Brajard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gallinari</surname></persName>
		</author>
		<idno>CoRR abs/1902.11136</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A Model of Inductive Bias Learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Baxter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Artif. Int. Res</title>
		<idno type="ISSN">1076-9757</idno>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="149" to="198" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A Kernel Perspective for Regularizing Deep Neural Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bietti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mialon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mairal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</editor>
		<meeting>the 36th International Conference on Machine Learning<address><addrLine>Long Beach, California, USA</addrLine></address></meeting>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page" from="664" to="674" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Customizing Sequence Generation with Multi-Task Dynamical Systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bird</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
		<idno>CoRR abs/1910.05026</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Neural Ordinary Differential Equations</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Rubanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bettencourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Duvenaud</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Grauman</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Cesa-Bianchi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc.</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="6571" to="6583" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Nonlinear Laplacian spectral analysis for time series with intermittency and low-frequency variability</title>
		<author>
			<persName><forename type="first">D</forename><surname>Giannakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Majda</surname></persName>
		</author>
		<idno type="DOI">10.1073/pnas.1118984109</idno>
		<ptr target="https://www.pnas.org/content/109/7/2222" />
	</analytic>
	<monogr>
		<title level="j">Proceedings of the National Academy of Sciences</title>
		<idno type="ISSN">0027-8424</idno>
		<imprint>
			<biblScope unit="volume">109</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="2222" to="2227" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Adam: A Method for Stochastic Optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">3rd International Conference on Learning Representations, ICLR 2015</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</editor>
		<meeting><address><addrLine>San Diego, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-05-07">May 7-9, 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8a">
	<monogr>
		<title level="m" type="main">Professor Forcing: A New Algorithm for Training Recurrent Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lamb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1610.09038</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Elements of Physical Biology</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Lotka</surname></persName>
		</author>
		<ptr target="http://www.jstor.org/stable/43430362" />
	</analytic>
	<monogr>
		<title level="j">Science Progress in the Twentieth Century</title>
		<idno type="ISSN">20594941</idno>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">82</biblScope>
			<biblScope unit="page" from="341" to="343" />
			<date type="published" when="1926">1926</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Madec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bourdallé-Badie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chanut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Clementi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Coward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ethé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Iovino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lévy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lovato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Masson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mocavero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rousset</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Storkey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vancoppenolle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Müeller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Nurser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Samson</surname></persName>
		</author>
		<title level="m">NEMO ocean engine. Add SI3 and TOP reference manuals</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Model selection for dynamical systems via sparse regression and information criteria</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Mangan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">N</forename><surname>Kutz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Brunton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Proctor</surname></persName>
		</author>
		<idno type="DOI">10.1098/rspa.2017.0009</idno>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences</title>
		<imprint>
			<biblScope unit="volume">473</biblScope>
			<biblScope unit="issue">2204</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11a">
	<analytic>
		<title level="a" type="main">Efficient computation of electrograms and ECGs in human whole heart simulations using a reaction-eikonal model</title>
		<author>
			<persName><forename type="first">A</forename><surname>Neic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">O</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Prassl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Niederer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Bishop</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">J</forename><surname>Vigmond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Plank</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Computational Physics</title>
		<idno type="ISSN">0021-9991</idno>
		<imprint>
			<biblScope unit="volume">346</biblScope>
			<biblScope unit="page" from="191" to="211" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Complex Patterns in a Simple System</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Pearson</surname></persName>
		</author>
		<idno type="DOI">10.1126/science.261.5118.189</idno>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<idno type="ISSN">0036-8075</idno>
		<imprint>
			<biblScope unit="volume">261</biblScope>
			<biblScope unit="issue">5118</biblScope>
			<biblScope unit="page" from="189" to="192" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations</title>
		<author>
			<persName><forename type="first">M</forename><surname>Raissi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perdikaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Karniadakis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Computational Physics</title>
		<imprint>
			<biblScope unit="volume">378</biblScope>
			<biblScope unit="page" from="686" to="707" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Searching for Activation Functions</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ramachandran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno>CoRR abs/1710.05941</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Exploiting similarity in system identification tasks with recurrent neural networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Spieckermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Düll</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Udluft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hentschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Runkler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">169</biblScope>
			<biblScope unit="page" from="343" to="349" />
		</imprint>
	</monogr>
	<note>ESANN 2014 Industrial Data Processing and Analysis; Learning for Visual Semantic Understanding in Big Data</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Distributionally robust optimization and generalization in kernel methods</title>
		<author>
			<persName><forename type="first">M</forename><surname>Staib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jegelka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="9134" to="9144" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
