Published in CEUR Workshop Proceedings Vol-2964 (https://ceur-ws.org/Vol-2964/article_169.pdf).
Learning Dynamical Systems across Environments

Yuan Yin¹, Ibrahim Ayed¹ ², Emmanuel de Bézenac¹, Patrick Gallinari¹ ³
¹ Sorbonne Université, CNRS, LIP6, Paris, France
² Theresis Lab, Thales
³ Criteo AI Lab, Paris, France
{yuan.yin, ibrahim.ayed, emmanuel.de-bezenac, patrick.gallinari}@lip6.fr



Abstract

Learning the behavior of natural phenomena automatically from data has gained much traction in recent years. However, in most real-world scenarios, the environment in which the data samples are acquired varies and may not be the same for each data sample. This is due to different circumstances, e.g. acquisition at different spatial locations, or simply experimental settings that slightly differ. This severely hinders the training process and makes the standard learning framework inapplicable. In this work, we propose a novel framework for modeling physical systems in this context, where we are able to leverage the data across different environments in order to learn the underlying dynamical systems, ensuring generalization without compromising the model's expressiveness and predictive performance. We instantiate our framework on two different families of dynamical systems, showing that our approach yields superior results over the classical learning approach as well as over competitive baselines. Finally, we also show that we are able to accelerate and improve learning for environments that have never been seen before.

Introduction

Often, natural phenomena may be difficult to understand due to the complex and nonlinear interactions between their composing elements, making it cumbersome to derive a mathematical model describing them. In this context, a data-driven approach arises as a powerful alternative to classical modeling methods, as an unknown model can be learned automatically from the data. Recently, much effort has been focused in this direction (Giannakis and Majda 2012; Mangan et al. 2017), with a particular emphasis on using neural networks (Raissi, Perdikaris, and Karniadakis 2019; Chen et al. 2018; Ayed et al. 2019) for treating cases where the underlying processes are largely unknown. Despite promising results, these methods usually postulate an idealized setting where data is abundant and the environment in which it is acquired is always the same. However, in practice, this is never the case, as obtaining real-world data samples may be expensive. Perhaps more importantly, the environment in which they are acquired may vary. These changes can be caused by different factors: for example, in climatic modeling, there are external forces such as the Coriolis force that vary across spatial locations (Madec et al. 2019), and in cardiac computational models, parameters need to be personalized for each patient (Neic et al. 2017).

The classical learning paradigm in this context is to treat all the data as independent and identically distributed, thus disregarding the discrepancies between the environments. As this assumption is not valid, it leads to a biased solution and results in an average model that performs poorly. Conversely, one may choose to avoid making this assumption by splitting the data from the different environments and learning one dynamical system per environment, separately. However, this ignores the similarities between environments and would severely affect generalization performance, especially in settings where per-environment data is limited.

In this work, our goal is to take into account the differences between environments while making use of the similarities across them. We thus propose the LEarning Across Dynamical Systems (LEADS) framework, a novel learning methodology where the dynamics are decomposed into two components: one shared across all environments, and another that captures the dynamics that cannot be expressed by the shared component, and only those. This allows us to leverage the data from similar environments automatically, without compromising the expressiveness of the model. We demonstrate the effectiveness of our framework on two standard examples of dynamics given by differential equations: the Lotka-Volterra predator-prey model, expressed as an ODE, and the Gray-Scott reaction-diffusion equations, expressed as PDEs. Finally, we also show that our method accelerates and improves learning for similar unseen environments.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Approach

Problem Setting

We consider the problem of learning unknown physical processes with data acquired from different environments. For each environment e ∈ E, we assume that the data is generated by an unknown governing differential equation:

    dX_t/dt = f_e(X_t)                                        (1)
defined over a finite time interval [0, T], where the state X is either vector-valued, i.e. X_t ∈ R^d (the Lotka-Volterra equations in the Experiments section), or a d-dimensional vector field over a bounded spatial domain Ω ⊂ R^k, i.e. for t ∈ [0, T] and x ∈ Ω, X_t(x) ∈ R^d. As stated above, modifications of the environment have an impact on the dynamics of the system, and thus the evolution terms f_e are expected to differ. Nevertheless, we do assume they share some form of similarity between environments: as we will see in the following, this is not a necessary condition for our framework to be applicable, but it is what allows us to leverage the data from the other environments.

As in Arjovsky et al. (2020), we choose not to discard the information about where the data was collected. We construct our training set from samples (e, {X^{e,i}}_{i=1,...,N_e}) ∈ D. Each sample is thus composed of the environment identifier e as well as a set of trajectories, where each X^{e,i}, denoting the i-th trajectory in environment e, is a function satisfying Equation 1.

Related Work

To make prediction performance invariant across environments, IRM (Arjovsky et al. 2020) aims at finding a classifier that retains the correlations that are independent of the environment, excluding spurious environment-related ones. However, in the context of dynamical systems, modeling the bias of each environment is as important as modeling the invariant information, as both are indispensable for prediction. This makes IRM incompatible with our setting. Spieckermann et al. (2015) and Bird and Williams (2019) use RNNs conditioned on an environment code to perform biased learning in different environments. Nonetheless, the similarity between environments is not explicitly exploited as common invariant dynamical information.

In terms of robustness at test time, our formulation with a common term is related to Multi-Task Learning (MTL) and Distributionally Robust Optimization (DRO). Baxter (2000) suggests that jointly learning related tasks in MTL can potentially result in better generalization than models learned individually from each task. DRO approaches such as Bietti et al. (2019) and Staib and Jegelka (2019) suggest that, in general loss minimization, imposing certain norm penalties on neural networks (or other models) can encourage better generalization.

The Proposed Framework: LEADS

As the dynamical systems in Equation 1 are unknown, we learn them from data by parametrizing the evolution terms f_e with neural networks, as in Ayed et al. (2019) and Chen et al. (2018). The problem now lies in how these terms are instantiated. We consider decomposing the dynamics into two components: one g ∈ F shared across environments, and another environment-dependent component h_e ∈ F, such that if F is large enough, there should exist a couple (g, h_e) ∈ F² whose sum recovers the dynamics of environment e, i.e.

    ∀e ∈ E,  f_e = g + h_e                                    (2)

The general idea is that, as g is the same for each environment, it can be learned using all data points across all environments. However, this decomposition yields a potentially infinite number of solutions, and in particular the trivial solution obtained by setting g to be the null function: in this case, data across environments cannot be leveraged.

In order to avoid this trivial solution, we would like the shared function g to explain the dynamics as much as possible and, in turn, the environment-dependent function h_e to be as small as possible. The following constrained optimization problem embeds this general idea:

    min_{g, h_e ∈ F}  Σ_e ||h_e||²   subject to
    ∀(e, X^{e,i}_t) ∈ D,  dX^{e,i}_t/dt = (g + h_e)(X^{e,i}_t)    (3)

Let us consider the limit case where the dynamics are the same across environments, i.e. ∀e ∈ E, f_e = f: this objective then yields as solution the couple (g = f, h_e = 0), meaning that the common information, which is all there is, is entirely captured by g, as expected. This benefits its generalization performance, as all the data is used, even that from different environments.

We now instantiate our method, providing a practical implementation to solve the previous objective. In practice, we do not have access to the data trajectories at every instant t but only to a finite number of snapshots {X^i_{kΔt}}_{0≤k≤T/Δt} at a temporal resolution Δt. We consider the Lagrangian formulation of the proposed objective as our training loss. Instead of comparing the evolution terms as in Equation 3, we directly compare the trajectories they induce:¹ ²

    L(g, h, λ) = Σ_{e∈E} ( (1/λ) ||h_e||² + Σ_{i=1}^{N_e} Σ_{k=1}^{K} ||X^{e,i}_{kΔt} − X̃^{e,i}_{kΔt}||² )    (4)

where X̃^{e,i}_{kΔt} = X^{e,i}_0 + ∫_0^{kΔt} (g + h_e)(X̃^{e,i}_s) ds are the trajectory states starting from X^{e,i}_0, computed by a DE solver with g + h_e up to t = kΔt. Note that λ is treated as a divisor of ||h_e||² rather than as a multiplier of the constraints. This is equivalent to optimizing the original Lagrangian, but is friendlier to gradient-descent-based methods when λ is very large. With an adequate algorithm optimizing g, h and λ, we would arrive at the optimal g and h_e as λ → +∞. However, solving such an optimization problem is difficult, as a varying λ constantly changes the loss surface, which makes learning difficult in the context of dynamical systems. We therefore treat λ as a hyperparameter for each experiment; this should not affect the non-nullity of g, even though in this case the constraints in Equation 3 will not be perfectly satisfied at the optimum.

It is important to note that LEADS is independent of the choice of the function space F. We choose neural networks here for their expressiveness, in order to validate our framework. One can apply LEADS to any feasible function space expressed by other data-driven methods.

¹ Note that both are equivalent when Δt tends to 0.
² Directly comparing the (approximate) evolution terms is possible using finite differences, but led to worse results.
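To make the objective concrete, here is a minimal, self-contained sketch of the loss in Equation 4 for scalar linear dynamics, with an explicit Euler rollout standing in for the DE solver. The rates, the two toy environments, and the value of λ are illustrative choices of ours, not the paper's settings; the paper parametrizes g and h_e with neural networks and a differentiable solver.

```python
import math

def simulate(rate, x0, dt, n_steps):
    # Exact snapshots of dx/dt = rate * x, sampled every dt.
    return [x0 * math.exp(rate * k * dt) for k in range(n_steps + 1)]

def rollout(rate, x0, dt, n_steps):
    # Explicit-Euler rollout of the learned dynamics g + h_e.
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * rate * xs[-1])
    return xs

def leads_loss(g, h, data, dt, lam):
    # Eq. (4): per-environment penalty ||h_e||^2 / lambda plus the
    # squared error between data snapshots and rolled-out trajectories.
    total = 0.0
    for e, snapshots in data.items():
        total += h[e] ** 2 / lam
        pred = rollout(g + h[e], snapshots[0], dt, len(snapshots) - 1)
        total += sum((x - p) ** 2 for x, p in zip(snapshots[1:], pred[1:]))
    return total

# Two toy environments sharing a common rate a = -0.5, residuals +/- 0.1.
a, residuals, dt, K, lam = -0.5, {0: 0.1, 1: -0.1}, 0.1, 20, 10.0
data = {e: simulate(a + b, 1.0, dt, K) for e, b in residuals.items()}

# Trivial decomposition (g = 0) vs. the intended one (g = shared part):
loss_trivial = leads_loss(0.0, {e: a + b for e, b in residuals.items()}, data, dt, lam)
loss_shared = leads_loss(a, residuals, data, dt, lam)
assert loss_shared < loss_trivial  # the penalty favors the shared solution
```

Both decompositions fit the data equally well, since g + h_e is identical in the two cases; only the penalty term separates them, which is exactly the mechanism that drives the common dynamics into g.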
                   |       L-V (#E = 4)        |       L-V (#E = 10)       |       G-S (#E = 3)
  Method           | MSE train |   MSE test    | MSE train |   MSE test    | MSE train |   MSE test
  Env. Indep.      |  4.79e-1  | 7.00±1.71 e-1 |  4.57e-1  | 5.08±0.56 e-1 |  1.55e-2  | 1.43±0.15 e-2
  Env. Dep. Sum    |  6.87e-6  | 1.26±1.21 e-2 |  7.32e-6  | 1.22±1.68 e-2 |  8.48e-5  | 6.43±3.42 e-3
  LEADS no min.    |  4.89e-6  | 3.33±3.14 e-3 |  3.28e-6  | 3.07±2.58 e-3 |  7.65e-5  | 5.53±3.43 e-3
  LEADS            |  8.15e-6  | 2.36±2.16 e-3 |  7.63e-6  | 1.77±1.58 e-3 |  8.60e-5  | 3.38±3.31 e-3

Table 1: Comparison between LEADS and baselines for Lotka-Volterra (in 4 and 10 envs.) and Gray-Scott equations (in 3 envs.).
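For concreteness, multi-environment Lotka-Volterra data of the kind compared above can be produced along the following lines. The per-environment parameters θ_e, the initial condition, and the fixed-step RK4 integrator are our illustrative assumptions; the paper's actual simulation settings are given in the Experiments section.

```python
def lotka_volterra(state, alpha, gamma, beta=1.0, delta=1.0):
    # Lotka-Volterra vector field with beta = delta = 1 fixed.
    x, y = state
    return (alpha * x - beta * x * y, delta * x * y - gamma * y)

def rk4_step(state, dt, alpha, gamma):
    # One classical Runge-Kutta 4 step.
    f = lambda s: lotka_volterra(s, alpha, gamma)
    k1 = f(state)
    k2 = f(tuple(s + dt / 2 * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + dt / 2 * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6 * (p + 2 * q + 2 * r + w)
                 for s, p, q, r, w in zip(state, k1, k2, k3, k4))

def simulate_env(theta, x0, dt=0.5, horizon=10.0, substeps=50):
    # Snapshots every dt up to T = horizon, with finer internal steps.
    alpha, gamma = theta
    traj, state = [x0], x0
    for _ in range(int(horizon / dt)):
        for _ in range(substeps):
            state = rk4_step(state, dt / substeps, alpha, gamma)
        traj.append(state)
    return traj

# Hypothetical per-environment parameters theta_e = (alpha_e, gamma_e),
# one trajectory per environment from a shared initial condition.
thetas = [(0.5, 0.5), (0.5, 1.0), (1.0, 0.5), (1.0, 1.0)]
dataset = {e: simulate_env(th, (1.5, 1.5)) for e, th in enumerate(thetas)}
assert all(len(traj) == 21 for traj in dataset.values())  # T/dt + 1 snapshots
```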




Figure 1: Comparison between test trajectories (blue) and ground truth (red), shown in phase space. Blue trajectories are predicted by (a) Env. Dep. Sum and (b) LEADS for Lotka-Volterra in 4 environments (env. 1 to 4, from left to right).


Experiments

We conduct our experiments on two complex nonlinear dynamical systems. The first is an ODE-driven biological dynamical system, and the second is a PDE-driven reaction-diffusion model exhibiting many complex behaviors.

Lotka-Volterra Equation  We consider this classical model (Lotka 1926), frequently used to describe the dynamics of interaction between a predator and a prey in an ecosystem. The dynamics follow the equations:

    dx/dt = αx − βxy,    dy/dt = δxy − γy

where x and y are the quantities of the prey and the predator, respectively, and α, β, γ, δ define how the two species interact. In fact, by a proper rescaling one can absorb β and δ into x and y. We therefore keep β and δ constant, setting β = δ = 1 across all environments, and let α, γ depend on the environment. The nonlinear interaction between the two species is therefore the non-environment component, and the linear terms are linked to the environments.

We thus define the parameters θ_e = (α_e, γ_e) for each environment e. Note that choosing θ_e consequently determines the second fixed point of the system, (γ_e, α_e), around which the trajectories orbit. The system state is X_t = (x_t, y_t). The initial conditions are fixed across the environments, i.e. ∀e, X^{e,i}_0 = X^i_0. Starting from the same initial condition X^i_0 = (x^i_0, y^i_0), we simulate only 1 trajectory per environment for training and 32 for test. Note that the test set is much larger than the training one. The step size is Δt = 0.5 and the dataset horizon is T = KΔt = 10. The experiments are conducted with 4 and with 10 environments.

Gray-Scott Equation  This reaction-diffusion model is famous for its Turing patterns and for the complex behaviors it exhibits despite its simplistic equation (Pearson 1993). The governing PDE is:

    ∂u/∂t = D_u Δu − uv² + F(1 − u)
    ∂v/∂t = D_v Δv + uv² − (F + k)v

where X^e_t = (u^e_t, v^e_t) is the state over a given spatial domain Ω, with periodic boundary conditions. D_u and D_v denote the diffusion coefficients of u and v, respectively, which are constant (Pearson 1993). F and k together define the type of the corresponding patterns and behaviors. This means that the diffusion and reaction terms are respectively the non-environment and the environment components.

We therefore choose parameters θ_e = (F_e, k_e) for each environment e to simulate data. As for the Lotka-Volterra equation, the initial conditions are shared across environments, and we simulate one trajectory per environment for training and 32 trajectories for test. The step size is Δt = 20 and the horizon is T = KΔt = 200. The experiments are conducted with 3 environments.

Training Details  Within the experiments for each equation, the functions g, h are NNs with the same architecture. We use 4-layer MLPs for Lotka-Volterra and 4-layer ConvNets for Gray-Scott. We apply Swish as the default activation function (Ramachandran, Zoph, and Le 2017). These networks are integrated in time using the differentiable solver implemented by Chen et al. (2018); basic backpropagation through the internals of the solver is used instead of the adjoint method. We apply an exponential Scheduled Sampling (Lamb et al. 2016) with exponent 0.99 to stabilize training. Across all experiments, we use the Adam optimizer (Kingma and Ba 2015) with the same learning rate of 1 × 10⁻³ and (β₁, β₂) = (0.9, 0.999). For the operator norm acting on h_e, we opt for max_{i,k} ||h_e(X^{e,i}_{kΔt})||² / ||X^{e,i}_{kΔt}||², where the X^{e,i} are training sample trajectories. In order for the estimate of this norm on the test data not to deviate
too much from its value on the training data, we also penalize the sum of the spectral norms of the weights at each layer, Σ_{l=1}^{L} ||W^{h_e}_l||², an upper bound on the associated Lipschitz constant, as suggested in Bietti et al. (2019).

Baselines  We introduce the following baselines to compare with the proposed formulation:
• Env. Indep.: the sum of two environment-independent neural networks g + h, learned with the standard ERM learning principle, as in Ayed et al. (2019);³
• Env. Dep. Sum: the sum of two environment-dependent NNs g_e + h_e;
• LEADS no min.: our proposal without the norm penalty, equivalent to LEADS with λ = +∞.

³ We have opted for the sum as it allows for a proper comparison with our method.

We show the results in Table 1. For the Lotka-Volterra systems, we confirm first that the entire dataset cannot be fit with a single pair of NNs (Env. Indep.). Compared with the other baselines, our method LEADS reduces the test MSE of Env. Dep. Sum by nearly 4/5, and that of LEADS no min. by 1/3, when there are #E = 4 environments. Figure 1 shows samples of predicted test trajectories: LEADS almost overlaps the ground-truth trajectory, while Env. Dep. Sum underperforms in most environments. When the number of environments is increased to #E = 10, the error cut is over 85% w.r.t. Env. Dep. Sum and over 40% w.r.t. LEADS no min.

We observe the same improving tendency for the Gray-Scott systems. The error of LEADS is around 1/2 of the Env. Dep. Sum test MSE and 60% of the LEADS no min. test MSE. In Figure 2(a)-(c), the states obtained with our method are qualitatively closer to the ground truth. With the help of the error maps in Figure 2(d) and (e), we see that at the rightmost end-time frames, the errors are systematically reduced across all environments. This shows that LEADS accumulates fewer errors through the integration, which suggests that LEADS alleviates overfitting on the support.

Figure 2: Comparison of trajectories from (a) Env. Dep. Sum and (b) LEADS with (c) the ground truth for the Gray-Scott equation. Each row represents an environment. We show the state of channel u at t = 0, …, 5Δt. They are accompanied by maps of the prediction error at the rightmost timestep for (d) Env. Dep. Sum and (e) LEADS. The larger the error, the brighter the pixel at the corresponding coordinates.

Learning in Unknown Environments

We demonstrate how the learned invariant dynamics can boost fitting in new, similar environments. We suppose now that we have an invariant function ĝ learned with LEADS from L-V (#E = 4). We then generate another Lotka-Volterra dataset in new environments E_new, still with 1 trajectory per environment in the training set and 32 in the test set.

Let us consider the following adaptation strategies:
• No adapt.: a sanity check to ensure that the new dynamics cannot be predicted by ĝ without further adaptation;
• Env. Dep. Sum from scratch: the sum of two environment-dependent NNs, trained from scratch;
• Env. Dep. Single from scratch: an environment-dependent NN, trained from scratch, with no boosting by ĝ;
• LEADS boosted Env. Dep. Single: an environment-dependent NN h_e trained boosted by the learned ĝ.

                                  |          MSE test at iteration
  Adaptation                      |   50    |   250   |   500   |  10000
  No adapt.                       |            0.36 (all iterations)
  Env. Dep. Sum from scratch      |  0.23   | 5.02e-2 |  0.25   | 3.05e-3
  Env. Dep. Single from scratch   |  1.65   |  18.3   | 8.87e-2 | 4.13e-3
  LEADS boosted Env. Dep. Single  |  0.73   | 2.06e-3 | 1.84e-3 | 1.11e-3

Table 2: Comparison of different adaptation strategies in 2 new environments of Lotka-Volterra at different iterations.

Table 2 contains the adaptation results at training iterations from 50 to 10000. With No adapt., we first show that ĝ alone is not able to predict in any of these new environments, even though they are closely related to the original ones. At iteration 50, we observe that the three last adaptations perform poorly, as expected, since they are at an early stage of training. As soon as iteration 250, LEADS boosted Env. Dep. Single already surpasses the best performance that the trained-from-scratch methods (Env. Dep. Sum and Env. Dep. Single from scratch) reach at iteration 10000. This clearly shows that the learned shared dynamics improves and accelerates learning in new environments.
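A scalar toy makes the boosted strategy explicit: the learned shared component ĝ is frozen, and only an environment-specific residual is fit to the new data. The closed-form least-squares fit on finite-difference targets below is our simplification for illustration (the paper trains h_e through the differentiable solver instead), and all rates are invented values.

```python
import math

# Toy new environment: true dynamics dx/dt = (a + b) * x, where the
# shared rate a is assumed already captured by a learned g_hat, and b
# is the environment-specific residual left to adapt.
a, b, dt = -0.5, 0.3, 0.01
g_hat = lambda x: a * x  # frozen shared component

# Snapshots of the exact solution x_k = x0 * exp((a + b) * k * dt).
xs = [1.0 * math.exp((a + b) * k * dt) for k in range(200)]

# Fit a linear residual h(x) = c * x by least squares on the
# finite-difference targets (x_{k+1} - x_k)/dt - g_hat(x_k).
num = sum(((x1 - x0) / dt - g_hat(x0)) * x0 for x0, x1 in zip(xs, xs[1:]))
den = sum(x0 * x0 for x0 in xs[:-1])
c = num / den

assert abs(c - b) < 1e-2  # the residual rate is recovered (up to O(dt))
```

Only the residual has to be learned from the scarce new data, which is why the boosted adaptation converges in far fewer iterations than training from scratch.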
                                                                    for Training Recurrent Networks. arXiv:1610.09038 [cs,
                        Conclusion                                  stat] ArXiv: 1610.09038.
We introduce a data-driven framework LEADS to learn dy-             Lotka, A. J. 1926. ELEMENTS OF PHYSICAL BIOL-
namics from the data that is collected from a set of simi-          OGY. Science Progress in the Twentieth Century (1919-
lar yet different dynamical systems. Demonstrated with two          1933) 21(82): 341–343. ISSN 20594941. URL http://www.
complex families of systems, our framework can signifi-             jstor.org/stable/43430362.
cantly improve the test performance in every environment,           Madec, G.; Bourdallé-Badie, R.; Chanut, J.; Clementi, E.;
especially when the number of available trajectories is lim-        Coward, A.; Ethé, C.; Iovino, D.; Lea, D.; Lévy, C.; Lo-
ited. We finally show that the extracted dynamics by LEADS          vato, T.; Martin, N.; Masson, S.; Mocavero, S.; Rousset,
can boost the learning in similar new environments, which           C.; Storkey, D.; Vancoppenolle, M.; Müeller, S.; Nurser, G.;
leads us towards a more flexible framework for prediction           Bell, M.; and Samson, G. 2019. NEMO ocean engine. Add
and generalization in new environments.                             SI3 and TOP reference manuals.
                                                                    Mangan, N. M.; Kutz, J. N.; Brunton, S. L.; and Proctor, J. L.
                   Acknowledgements                                 2017. Model selection for dynamical systems via sparse re-
This work was partially funded by Locust ANR-15-CE23-               gression and information criteria. Proceedings of the Royal
0027 and Chaires de recherche et d’enseignement en intelli-         Society A: Mathematical, Physical and Engineering Sci-
gence artificielle (Chaires IA), DL4Clim project (PG).              ences 473(2204): 20170009. doi:10.1098/rspa.2017.0009.
                                                                    Neic, A.; Campos, F. O.; Prassl, A. J.; Niederer, S. A.;
                        References                                  Bishop, M. J.; Vigmond, E. J.; and Plank, G. 2017. Efficient
Arjovsky, M.; Bottou, L.; Gulrajani, I.; and Lopez-Paz, D.          computation of electrograms and ECGs in human whole
2020. Invariant Risk Minimization. arXiv:1907.02893 [cs,            heart simulations using a reaction-eikonal model. Journal of
stat] ArXiv: 1907.02893.                                            Computational Physics 346: 191 – 211. ISSN 0021-9991.
Ayed, I.; de Bézenac, E.; Pajot, A.; Brajard, J.; and Gallinari,   Pearson, J. E. 1993. Complex Patterns in a Simple Sys-
P. 2019. Learning Dynamical Systems from Partial Obser-             tem. Science 261(5118): 189–192. ISSN 0036-8075. doi:
vations. CoRR abs/1902.11136.                                       10.1126/science.261.5118.189.
Baxter, J. 2000. A Model of Inductive Bias Learning. J.             Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019.
Artif. Int. Res. 12(1): 149–198. ISSN 1076-9757.                    Physics-informed neural networks: A deep learning frame-
Bietti, A.; Mialon, G.; Chen, D.; and Mairal, J. 2019. A            work for solving forward and inverse problems involving
Kernel Perspective for Regularizing Deep Neural Networks.           nonlinear partial differential equations. Journal of Compu-
In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings          tational Physics 378: 686–707.
of the 36th International Conference on Machine Learning,           Ramachandran, P.; Zoph, B.; and Le, Q. V. 2017. Searching
volume 97 of Proceedings of Machine Learning Research,              for Activation Functions. CoRR abs/1710.05941.
664–674. Long Beach, California, USA: PMLR.                         Spieckermann, S.; Düll, S.; Udluft, S.; Hentschel, A.; and
Bird, A.; and Williams, C. K. I. 2019. Customizing Se-              Runkler, T. 2015. Exploiting similarity in system identifica-
quence Generation with Multi-Task Dynamical Systems.                tion tasks with recurrent neural networks. Neurocomputing
CoRR abs/1910.05026.                                                169: 343 – 349. ISSN 0925-2312. Learning for Visual Se-
                                                                    mantic Understanding in Big Data ESANN 2014 Industrial
Chen, R. T. Q.; Rubanova, Y.; Bettencourt, J.; and Duve-
                                                                    Data Processing and Analysis.
naud, D. K. 2018. Neural Ordinary Differential Equations.
In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.;            Staib, M.; and Jegelka, S. 2019. Distributionally robust opti-
Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neu-           mization and generalization in kernel methods. In Advances
ral Information Processing Systems, volume 31, 6571–6583.           in Neural Information Processing Systems, 9134–9144.
Curran Associates, Inc.
Giannakis, D.; and Majda, A. J. 2012. Nonlinear Lapla-
cian spectral analysis for time series with intermittency
and low-frequency variability. Proceedings of the National
Academy of Sciences 109(7): 2222–2227. ISSN 0027-8424.
doi:10.1073/pnas.1118984109. URL https://www.pnas.org/
content/109/7/2222.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for
Stochastic Optimization. In Bengio, Y.; and LeCun, Y., eds.,
3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Confer-
ence Track Proceedings.