Variational Autoencoders for Learning Nonlinear Dynamics of Physical Systems

Ryan Lopez 3, Paul J. Atzberger 1,2,+ *
1 Department of Mathematics, University of California Santa Barbara (UCSB).
2 Department of Mechanical Engineering, University of California Santa Barbara (UCSB).
3 Department of Physics, University of California Santa Barbara (UCSB).
+ atzberg@gmail.com http://atzberger.org/

* Work supported by grants DOE Grant ASCR PHILMS DE-SC0019246 and NSF Grant DMS-1616353.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We develop data-driven methods for incorporating physical information into priors to learn parsimonious representations of nonlinear systems arising from parameterized PDEs and mechanics. Our approach is based on Variational Autoencoders (VAEs) for learning nonlinear state space models from observations. We develop ways to incorporate geometric and topological priors through general manifold latent space representations. We investigate the performance of our methods for learning low dimensional representations for the nonlinear Burgers equation and constrained mechanical systems.

Introduction

The general problem of learning dynamical models from a time series of observations has a long history spanning many fields [51, 67, 15, 35], including dynamical systems [8, 67, 68, 47, 50, 52, 32, 19, 23], control [9, 51, 60, 63], statistics [1, 48, 26], and machine learning [15, 35, 46, 58, 3, 73]. Referred to as system identification in control and engineering, many approaches have been developed starting with linear dynamical systems (LDS). These include the Kalman Filter and extensions [39, 22, 28, 70, 71], Proper Orthogonal Decomposition (POD) [12, 49], and more recently Dynamic Mode Decomposition (DMD) [63, 45, 69] and Koopman Operator approaches [50, 20, 42]. These successful and widely-used approaches rely on assumptions on the model structure, most commonly, that a time-invariant LDS provides a good local approximation or that noise is Gaussian.

There also has been research on more general nonlinear system identification [1, 65, 15, 35, 66, 47, 48, 51]. Nonlinear systems pose many open challenges and fewer unified approaches given the rich behaviors of nonlinear dynamics. For classes of systems and specific application domains, methods have been developed which make different levels of assumptions about the underlying structure of the dynamics. Methods for learning nonlinear dynamics include the NARX and NOE approaches with function approximators based on neural networks and other model classes [51, 67], sparse symbolic dictionary methods that are linear-in-parameters such as SINDy [9, 64, 67], and dynamic Bayesian networks (DBNs), such as Hidden Markov Models (HMMs) and Hidden-Physics Models [58, 54, 62, 5, 43, 26].

A central challenge in learning non-linear dynamics is to obtain representations that are not only capable of reproducing outputs similar to those observed directly in the training dataset, but that also infer structures providing stable longer-term extrapolation over multiple future steps and input states. In this work, we develop learning methods aiming to obtain robust non-linear models by providing ways to incorporate more structure and information about the underlying system related to smoothness, periodicity, topology, and other constraints. We focus particularly on developing Probabilistic Autoencoders (PAEs) that incorporate noise-based regularization and priors to learn lower dimensional representations from observations. This provides the basis of non-linear state space models for prediction. We develop methods for incorporating into such representations geometric and topological information about the system. This facilitates capturing qualitative features of the dynamics to enhance robustness and to aid in interpretability of results. We demonstrate and perform investigations of our methods to obtain models for reductions of parameterized PDEs and for constrained mechanical systems.

Learning Nonlinear Dynamics with Variational Autoencoders (VAEs)

We develop data-driven approaches based on a Variational Autoencoder (VAE) framework [40]. We learn from observation data a set of lower dimensional representations that are used to make predictions for the dynamics. In practice, data can include experimental measurements, large-scale computational simulations, or solutions of complicated dynamical systems for which we seek reduced models. Reductions aid in gaining insights for a class of inputs or physical regimes into the underlying mechanisms generating the observed behaviors. Reduced descriptions are also helpful in many optimization problems in design and in development of controllers [51].

Standard autoencoders can result in encodings that yield unstructured, scattered, disconnected coding points for system features z. VAEs provide probabilistic encoders and decoders where noise provides regularizations that promote more connected encodings, smoother dependence on inputs, and more disentangled feature components [40]. As we shall discuss, we also introduce other regularizations into our methods to help aid in interpretation of the learned latent representations.

Figure 1: Learning Nonlinear Dynamics. Data-driven methods are developed for learning robust models to predict from u(x, t) the non-linear evolution to u(x, t+τ) for PDEs and other dynamical systems. Probabilistic Autoencoders (PAEs) are utilized to learn representations z of u(x, t) in low dimensional latent spaces with prescribed geometric and topological properties. The model makes predictions using learnable maps that (i) encode an input u(x, t) ∈ U as z(t) in latent space (top), (ii) evolve the representation z(t) → z(t + τ) (top-right), and (iii) decode the representation z(t + τ) to predict û(x, t + τ) (bottom-right).

Figure 2: Variational Autoencoder (VAE). VAEs [40] are used to learn representations of the nonlinear dynamics. Deep Neural Networks (DNNs) are trained (i) to serve as feature extractors to represent functions u(x, t) and their evolution in a low dimensional latent space as z(t) (encoder ∼ qθe), and (ii) to serve as approximators that can construct predictions u(x, t+τ) using features z(t+τ) (decoder ∼ pθd).

We learn VAE predictors using a Maximum Likelihood Estimation (MLE) approach for the Log Likelihood (LL) L_LL = log(pθ(X, x)). For dynamics of u(s), let X = u(t) and x = u(t+τ). We base pθ on the autoencoder framework in Figures 1 and 2. We use variational inference to approximate the LL by the Evidence Lower Bound (ELBO) [7] to train a model with parameters θ using encoders and decoders based on minimizing the loss function

θ* = arg min_{θe, θd} −L_B(θe, θd, θℓ; X(i), x(i)),
L_B = L_RE + L_KL + L_RR,        (1)
L_RE = E_{qθe(z|X(i))} [ log pθd(x(i) | z′) ],
L_KL = −β D_KL( qθe(z|X(i)) ‖ p̃θd(z) ),
L_RR = γ E_{qθe(z′|x(i))} [ log pθd(x(i) | z′) ].

The qθe denotes the encoding probability distribution and pθd the decoding probability distribution. The loss ℓ = −L_B provides a regularized form of MLE.
The terms L_RE and L_KL arise from the ELBO variational bound L_LL ≥ L_RE + L_KL when β = 1 [7]. This provides a way to estimate the log likelihood that the encoder-decoder reproduce the observed data sample pairs (X(i), x(i)) using the codes z′ and z. Here, we include a latent-space mapping z′ = fθℓ(z) parameterized by θℓ, which we can use to characterize the evolution of the system or for further processing of features. The X(i) is the input and x(i) is the output prediction. For the case of dynamical systems, we take X(i) ∼ u^i(t), a sample of the initial state function u^i(t), and the output x(i) ∼ u^i(t + τ), the predicted state function u^i(t + τ). We discuss the specific distributions used in more detail below.

The L_KL term involves the Kullback-Leibler Divergence [44, 18] acting similar to a Bayesian prior on latent space to regularize the encoder conditional probability distribution so that for each sample this distribution is similar to p̃θd. We take p̃θd = η(0, σ0²), a multi-variate Gaussian with independent components. This serves (i) to disentangle the features from each other to promote independence, (ii) to provide a reference scale and localization for the encodings z, and (iii) to promote parsimonious codes utilizing smaller dimensions than d when possible.

The L_RR term gives a regularization that promotes retaining information in z so the encoder-decoder pair can reconstruct functions. As we shall discuss, this also promotes organization of the latent space for consistency over multi-step predictions and aids in model interpretability.

We use for the specific encoder probability distributions conditional Gaussians z ∼ qθe(z|x(i)) = a(X(i), x(i)) + η(0, σe²), where η is a Gaussian with variance σe² (i.e. E_X^i[z] = a, Var_X^i[z] = σe²). One can think of the learned mean function a in the VAE as corresponding to a typical encoder a(X(i), x(i); θe) = a(X(i); θe) = z(i), with the variance function σe² = σe²(θe) providing control of a noise source to further regularize the encoding. Among other properties, this promotes connectedness of the ensemble of latent space codes. For the VAE decoder distribution, we take x ∼ pθd(x|z(i)) = b(z(i)) + η(0, σd²). The learned mean function b(z(i); θd) corresponds to a typical decoder and the variance function σd² = σd²(θd) controls the source of regularizing noise.

The terms to be learned in the VAE framework are (a, σe, fθℓ, b, σd), which are parameterized by θ = (θe, θd, θℓ). In practice, it is useful to treat the variances σ(·) initially as hyper-parameters. We learn predictors for the dynamics by training over samples of evolution pairs {(u^i_n, u^i_{n+1})}_{i=1}^m, where i denotes the sample index and u^i_n = u^i(t_n) with t_n = t_0 + nτ for a time-scale τ.

To make predictions, the learned models use the following stages: (i) extract from u(t) the features z(t), (ii) evolve z(t) → z(t + τ), and (iii) predict using z(t + τ) the û(t + τ), summarized in Figure 1. By composition of the latent evolution map the model makes multi-step predictions of the dynamics.
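To make the role of the three loss terms concrete, the following is a minimal sketch (not the authors' implementation) of how the objective in equation (1) could be assembled for Gaussian encoder and decoder distributions with a latent evolution map z′ = fθℓ(z). The function names, the single-sample Monte Carlo estimates, and the closed-form KL term for isotropic Gaussians are illustrative assumptions.

import math
import torch

def gaussian_log_prob(x, mean, sigma):
    # log density of an isotropic Gaussian, summed over components
    return (-0.5 * ((x - mean) / sigma) ** 2
            - torch.log(sigma) - 0.5 * math.log(2 * math.pi)).sum(-1)

def vae_loss(encoder, decoder, latent_map, X, x,
             sigma_e, sigma_d, sigma_0, beta=1.0, gamma=0.5):
    sigma_e, sigma_d, sigma_0 = map(torch.as_tensor, (sigma_e, sigma_d, sigma_0))
    d = X.shape[-1] if encoder(X).dim() == 0 else encoder(X).shape[-1]

    a_X = encoder(X)                                   # mean of q(z | X)
    z = a_X + sigma_e * torch.randn_like(a_X)          # reparameterized sample z
    z_prime = latent_map(z)                            # z' = f_theta_ell(z)
    L_RE = gaussian_log_prob(x, decoder(z_prime), sigma_d)

    # Closed-form KL( N(a_X, sigma_e^2 I) || N(0, sigma_0^2 I) )
    kl = 0.5 * ((sigma_e / sigma_0) ** 2 * d
                + (a_X ** 2).sum(-1) / sigma_0 ** 2
                - d
                + 2 * d * torch.log(sigma_0 / sigma_e))
    L_KL = -beta * kl

    # Reconstruction regularization: encode the output x and reconstruct it
    a_x = encoder(x)
    z_prime_rr = a_x + sigma_e * torch.randn_like(a_x)
    L_RR = gamma * gaussian_log_prob(x, decoder(z_prime_rr), sigma_d)

    return -(L_RE + L_KL + L_RR).mean()                # minimize -L_B

In a training loop, encoder, decoder, and latent_map would be DNN modules whose parameters are optimized jointly; the variances can be kept fixed initially, as discussed above.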
Related Work

Many variants of autoencoders have been developed for making predictions of sequential data, including those based on Recurrent Neural Networks (RNNs) with LSTMs and GRUs [34, 29, 16]. While RNNs provide a rich approximation class for sequential data, they pose for dynamical systems challenges for interpretability and for training to obtain predictions stable over many steps with robustness against noise in the training dataset. Autoencoders have also been combined with symbolic dictionary learning for latent dynamics in [11], providing some advantages for interpretability and robustness, but requiring specification in advance of a sufficiently expressive dictionary. Neural networks incorporating physical information have also been developed that impose stability conditions during training [53, 46, 24]. The work of [17] investigates combining RNNs with VAEs to obtain more robust models for sequential data and considered tasks related to processing speech and handwriting.

In our work we learn dynamical models making use of VAEs to obtain probabilistic encoders and decoders between euclidean and non-euclidean latent spaces to provide additional regularizations that help promote parsimoniousness, disentanglement of features, robustness, and interpretability. Prior VAE methods used for dynamical systems include [31, 55, 27, 13, 55, 59]. These works use primarily euclidean latent spaces and consider applications including human motion capture and ODE systems. Approaches for incorporating topological information into latent variable representations include the early works by Kohonen on Self-Organizing Maps (SOMs) [41] and by Bishop on Generative Topographic Maps (GTMs) based on density networks providing a generative approach [6]. More recently, VAE methods using non-euclidean latent spaces include [37, 38, 25, 14, 21, 2]. These incorporate the role of geometry by augmenting the prior distribution p̃θd(z) on latent space to bias toward a manifold. In the recent work [57], an explicit projection procedure is introduced, but in the special case of a few manifolds having an analytic projection map.

In our work we develop further methods for more general latent space representations, including non-orientable manifolds, and applications to parameterized PDEs and constrained mechanical systems. We introduce more general methods for non-euclidean latent spaces in terms of point-cloud representations of the manifold along with local gradient information that can be utilized within general backpropagation frameworks, see Appendix A. This also allows for the case of manifolds that are non-orientable and have complex shapes. Our methods provide flexible ways to design and control both the topology and the geometry of the latent space by merging or subtracting shapes or stretching and contracting regions. We also consider additional types of regularizations for learning dynamical models facilitating multi-step predictions and more interpretable state space models. In our work, we also consider reduced models for non-linear PDEs, such as Burgers Equation, and learning representations for more general constrained mechanical systems. We also investigate the role of non-linearities making comparisons with other data-driven models.

Learning with Manifold Latent Spaces

Roles of Non-Euclidean Geometry and Topology

For many systems, parsimonious representations can be obtained by working with non-euclidean manifold latent spaces, such as a torus for doubly periodic systems or even non-orientable manifolds, such as a klein bottle as arises in imaging and perception studies [10]. For this purpose, we learn encoders E over a family of mappings to a prescribed manifold M of the form

z = Eφ(x) = Λ(Ẽφ(x)) = Λ(w),   w = Ẽφ(x).

We take the map Ẽφ(x) : x → w, where we represent a smooth closed manifold M of dimension m in R^2m, as supported by the Whitney Embedding Theorem [72]. The Λ maps (projects) points w ∈ R^2m to the manifold representation z ∈ M ⊂ R^2m. In practice, we accomplish this in two ways: (i) we provide an analytic mapping Λ to M, or (ii) we provide a high resolution point-cloud representation of the target manifold along with local gradients and use for Λ a quantized mapping to the nearest point on M. We provide more details in Appendix A.
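As a concrete example of option (i), an analytic projection Λ onto a torus S¹ × S¹ embedded in R⁴ can be written by normalizing consecutive coordinate pairs of w, matching the product-of-circles map used later for the arm mechanism. Composing it with an unconstrained network output w = Ẽθ(x) as below is an illustrative sketch, not the authors' released code.

import torch

def lambda_torus(w, eps=1e-8):
    # w: (..., 4) -> z on the torus {(z1,z2,z3,z4): z1^2+z2^2 = z3^2+z4^2 = 1}
    w1, w2 = w[..., :2], w[..., 2:]
    z1 = w1 / (w1.norm(dim=-1, keepdim=True) + eps)   # project first pair to S^1
    z2 = w2 / (w2.norm(dim=-1, keepdim=True) + eps)   # project second pair to S^1
    return torch.cat([z1, z2], dim=-1)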
This allows us to learn VAEs with latent spaces for z with general specified topologies and controllable geometric structures. The topologies of the sphere, torus, and klein bottle are intrinsically different than R^n. This allows for new types of priors, such as uniform distributions on compact manifolds or distributions with more symmetry. As we shall discuss, additional latent space structure also helps in learning more robust representations less sensitive to noise, since we can unburden the encoder and decoder from having to learn the embedding geometry and avoid the potential for them making erroneous use of extra latent space dimensions. We also have statistical gains since the decoder now only needs to learn a mapping from the manifold M for reconstructions of x. These more parsimonious representations also aid identifiability and interpretability of models.

Results

Burgers' Equation of Fluid Mechanics: Learning Nonlinear PDE Dynamics

We consider the nonlinear viscous Burgers' equation

u_t = −u u_x + ν u_xx,        (2)

where ν is the viscosity [4, 36]. We consider periodic boundary conditions on Ω = [0, 1]. Burgers equation is motivated as a mechanistic model for the fluid mechanics of advective transport and shocks, and serves as a widely used benchmark for analysis and computational methods.

The nonlinear Cole-Hopf Transform CH can be used to relate Burgers equation to the linear diffusion equation φ_t = ν φ_xx [36]. This provides a representation of the solution u

φ(x, t) = CH[u] = exp( −(1/(2ν)) ∫_0^x u(x′, t) dx′ ),
u(x, t) = CH⁻¹[φ] = −2ν (∂/∂x) ln φ(x, t).        (3)

This can be represented by the Fourier expansion

φ(x, t) = Σ_{k=−∞}^{∞} φ̂_k(0) exp(−4π²k²νt) · exp(i2πkx).

The φ̂_k(0) = F_k[φ(x, 0)] and φ(x, t) = F⁻¹[{φ̂_k(0) exp(−4π²k²νt)}], with F the Fourier transform. This provides an analytic representation of the solution of the viscous Burgers equation u(x, t) = CH⁻¹[φ(x, t)], where φ̂(0) = F[CH[u(x, 0)]]. In general, for nonlinear PDEs with initial conditions within a class of functions U, we aim to learn models that provide predictions u(t + τ) = S_τ u(t) approximating the evolution operator S_τ over time-scale τ. For the Burgers equation, the CH provides an analytic way to obtain a reduced order model by truncating the Fourier expansion to |k| ≤ n_f/2. This provides for the Burgers equation a benchmark model against which to compare our learned models. For general PDEs comparable analytic representations are not usually available, motivating development of data-driven approaches.
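The analytic Cole-Hopf representation above can serve as a benchmark solver. The sketch below is one way such a solver could be implemented with a truncated Fourier expansion; the uniform grid, trapezoidal quadrature of the integral, and spectral differentiation are illustrative choices and are not taken from the paper.

import numpy as np

def burgers_cole_hopf(u0, nu, t, n_f=None):
    # u0: samples of u(x, 0) on a uniform periodic grid over [0, 1)
    n = u0.shape[0]
    x = np.linspace(0.0, 1.0, n, endpoint=False)
    dx = x[1] - x[0]
    # phi(x, 0) = exp(-(1/(2 nu)) * int_0^x u(x', 0) dx')  (trapezoid rule)
    integral = np.concatenate([[0.0], np.cumsum(0.5 * (u0[1:] + u0[:-1]) * dx)])
    phi0 = np.exp(-integral / (2.0 * nu))
    k = np.fft.fftfreq(n, d=dx)                    # integer wavenumbers on [0, 1)
    phi_hat = np.fft.fft(phi0)
    if n_f is not None:                            # reduced-order truncation |k| <= n_f/2
        phi_hat[np.abs(k) > n_f / 2] = 0.0
    phi_hat = phi_hat * np.exp(-4.0 * np.pi**2 * k**2 * nu * t)   # exact heat decay
    phi = np.real(np.fft.ifft(phi_hat))
    # u = -2 nu d/dx log(phi), via spectral differentiation of log(phi)
    dlogphi = np.real(np.fft.ifft(2j * np.pi * k * np.fft.fft(np.log(phi))))
    return -2.0 * nu * dlogphi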
We develop VAE methods for learning reduced order models for the responses of the nonlinear Burgers Equation when the initial conditions are from a collection of functions U. We learn VAE models that extract from u(x, t) latent variables z(t) to predict u(x, t + τ). Given the non-uniqueness of representations, and to promote interpretability of the model, we introduce the inductive bias that the evolution dynamics in latent space for z is linear of the form ż = −λ0 z, giving exponential decay rate λ0. For discrete times, we take z_{n+1} = fθℓ(z_n) = exp(−λ0 τ) · z_n, where θℓ = (λ0). We still consider general nonlinear mappings for the encoders and decoders, which are represented by deep neural networks. We train the model on the pairs (u(x, t), u(x, t + τ)) by drawing m samples of u^i(x, t_i) ∈ S_{t_i} U, which generates the evolved state under Burgers equation u^i(x, t_i + τ) over time-scale τ. We perform VAE studies with parameters ν = 2 × 10⁻², τ = 2.5 × 10⁻¹, with VAE Deep Neural Networks (DNNs) with layer sizes (in)-400-400-(out), ReLU activations, γ = 0.5, β = 1, and initial standard deviations σd = σe = 4 × 10⁻³. We show results of our VAE model predictions in Figure 3 and Table 1.
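A minimal sketch of the latent evolution map with the single learnable parameter θℓ = (λ0) introduced above is given below; parameterizing λ0 through its logarithm to keep the decay rate positive is an illustrative choice, not a detail stated in the paper.

import torch

class ExpDecayLatentMap(torch.nn.Module):
    # z_{n+1} = exp(-lambda0 * tau) * z_n, applied n_steps times for multi-step prediction
    def __init__(self, tau, lambda0_init=1.0):
        super().__init__()
        self.tau = tau
        self.log_lambda0 = torch.nn.Parameter(torch.log(torch.tensor(lambda0_init)))

    def forward(self, z, n_steps=1):
        lambda0 = self.log_lambda0.exp()
        return torch.exp(-lambda0 * self.tau * n_steps) * z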
Figure 3: Burgers' Equation: Prediction of Dynamics. We consider responses for U1 = {u | u(x, t; α) = α sin(2πx) + (1−α) cos³(2πx)}. Predictions are made for the evolution u over the time-scale τ satisfying equation 2 with initial conditions in U1. We find our nonlinear VAE methods are able to learn the dynamics with 2 latent dimensions with errors < 1%. Methods such as DMD [63, 69] with 3 modes, which are only able to use a single linear space to approximate the initial conditions and prediction, encounter challenges in approximating the nonlinear evolution. We find our linear VAE method with 2 modes provides some improvements, by allowing for using different linear spaces for representing the input and output functions, but at the cost of additional computations. Results are summarized in Table 1.

We show the importance of the non-linear approximation properties of our VAE methods in capturing system behaviors by making comparisons with Dynamic Mode Decomposition (DMD) [63, 69], Proper Orthogonal Decomposition (POD) [12], and a linear variant of our VAE approach. Recent CNN-AEs have also studied related advantages of non-linear approximations [46]. Some distinctions in our work are the use of VAEs to further regularize AEs and the use of topological latent spaces to facilitate further capturing of structure. The DMD and POD are widely used and successful approaches that aim to find an optimal linear space on which to project the dynamics and learn a linear evolution law for system behaviors. DMD and POD have been successful in obtaining models for many applications, including steady-state fluid mechanics and transport problems [69, 63]. However, given their inherent linear approximations they can encounter well-known challenges related to translational and rotational invariances, as arise in advective phenomena and other settings [8]. Our comparison studies can be found in Table 1.

Table 1: Burgers' Equation: Prediction Accuracy. The reconstruction L1-relative errors in predicting u(x, t) for our VAE methods, Dynamic Mode Decomposition (DMD), Proper Orthogonal Decomposition (POD), and reduction by Cole-Hopf (CH), over multiple steps and number of latent dimensions (Dim) (top). Results when varying the strength of the reconstruction regularization γ and prior β (bottom).

Method          Dim   0.25s     0.50s     0.75s     1.00s
VAE Nonlinear   2     4.44e-3   5.54e-3   6.30e-3   7.26e-3
VAE Linear      2     9.79e-2   1.21e-1   1.17e-1   1.23e-1
DMD             3     2.21e-1   1.79e-1   1.56e-1   1.49e-1
POD             3     3.24e-1   4.28e-1   4.87e-1   5.41e-1
Cole-Hopf-2     2     5.18e-1   4.17e-1   3.40e-1   1.33e-1
Cole-Hopf-4     4     5.78e-1   6.33e-2   9.14e-3   1.58e-3
Cole-Hopf-6     6     1.48e-1   2.55e-3   9.25e-5   7.47e-6

γ      0.00s       0.25s       0.50s       0.75s       1.00s
0.00   1.600e-01   6.906e-03   1.715e-01   3.566e-01   5.551e-01
0.50   1.383e-02   1.209e-02   1.013e-02   9.756e-03   1.070e-02
2.00   1.337e-02   1.303e-02   9.202e-03   8.878e-03   1.118e-02

β      0.00s       0.25s       0.50s       0.75s       1.00s
0.00   1.292e-02   1.173e-02   1.073e-02   1.062e-02   1.114e-02
0.50   1.190e-02   1.126e-02   1.072e-02   1.153e-02   1.274e-02
1.00   1.289e-02   1.193e-02   7.903e-03   7.883e-03   9.705e-03
4.00   1.836e-02   1.677e-02   8.987e-03   8.395e-03   8.894e-03

Figure 4: Burgers' Equation: Latent Space Representations and Extrapolation Predictions. We show the latent space representation z of the dynamics for the input functions u(·, t; α) ∈ U1. The VAE organizes for u the learned representations z(α, t) in the parameter α (blue-green) into circular arcs that are concentric in the time parameter t (yellow-orange) (left). The reconstruction regularization with γ aligns subsequent time-steps of the dynamics in latent space, facilitating multi-step predictions. The learned VAE model exhibits a level of extrapolation to predict dynamics even for some inputs u ∉ U1 beyond the training dataset (right).

We also considered how our VAE methods performed when adjusting the parameter β for the strength of the prior p̃, as in β-VAEs [33], and γ for the strength of the reconstruction regularization. The reconstruction regularization has a significant influence on how the VAE organizes representations in latent space and on the accuracy of predictions of the dynamics, especially over multiple steps, see Figure 4 and Table 1. The regularization serves to align representations consistently in latent space, facilitating multi-step compositions. We also found our VAE learned representations capable of some level of extrapolation beyond the training dataset. When varying β, we found that larger values improved the multiple step accuracy whereas small values improved the single step accuracy, see Table 1.

Constrained Mechanics: Learning with Non-Euclidean Latent Spaces

To learn more parsimonious and robust representations of physical systems, we develop methods for latent spaces having geometries and topologies more general than euclidean space. This is helpful in capturing inherent structure such as periodicities or other symmetries. We consider physical systems with constrained mechanics, such as the arm mechanism for reaching for objects in Figure 5. The observations are taken to be the two locations x1, x2 ∈ R², giving x = (x1, x2) ∈ R⁴. When the segments are rigidly constrained these configurations lie on a manifold (torus). We can also allow the segments to extend and consider more exotic constraints, such as that the two points x1, x2 must be on a klein bottle in R⁴. Related situations arise in other areas of imaging and mechanics, such as in pose estimation and in studies of visual perception [56, 10, 61]. For the arm mechanics, we can use this prior knowledge to construct a torus latent space represented by the product space of two circles S¹ × S¹.
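For illustration, one way such a torus-constrained dataset could be generated is by sampling the two joint angles of the arm and recording the segment end locations. The segment lengths and sampling scheme below are assumptions, not the paper's data-generation procedure.

import numpy as np

def arm_configurations(n_samples, l1=1.0, l2=1.0, seed=0):
    # Two rigid segments with joint angles theta1, theta2; observations are the
    # end locations x1, x2 in R^2, giving x = (x1, x2) in R^4 on a torus.
    rng = np.random.default_rng(seed)
    theta1 = rng.uniform(0.0, 2 * np.pi, n_samples)
    theta2 = rng.uniform(0.0, 2 * np.pi, n_samples)
    x1 = np.stack([l1 * np.cos(theta1), l1 * np.sin(theta1)], axis=1)
    x2 = x1 + np.stack([l2 * np.cos(theta2), l2 * np.sin(theta2)], axis=1)
    return np.concatenate([x1, x2], axis=1)    # (n_samples, 4)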
To obtain a learnable class of manifold encoders, we use the family of maps Eθ = Λ(Ẽθ(x)), with Ẽθ(x) into R⁴ and Λ(w) = Λ(w1, w2, w3, w4) = (z1, z2, z3, z4) = z, where (z1, z2) = (w1, w2)/‖(w1, w2)‖ and (z3, z4) = (w3, w4)/‖(w3, w4)‖; see the VAE Section and Appendix A. For the case of klein bottle constraints, we use our point-cloud representation of the non-orientable manifold with the parameterized embedding in R⁴

z1 = (a + b cos(u2)) cos(u1),        z2 = (a + b cos(u2)) sin(u1),
z3 = b sin(u2) cos(u1/2),            z4 = b sin(u2) sin(u1/2),

with u1, u2 ∈ [0, 2π]. The Λ(w) is taken to be the map to the nearest point of the manifold M, which we compute numerically along with the needed gradients for backpropagation as discussed in Appendix A.
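A short sketch of building a point-cloud representation of this Klein bottle embedding, to which the nearest-point map Λ of Appendix A can then be applied, is given below; the radii a, b and grid resolution are illustrative choices.

import numpy as np

def klein_bottle_point_cloud(n_u=100, a=2.0, b=1.0):
    # Sample the parameterized embedding above on a uniform grid in (u1, u2)
    u1, u2 = np.meshgrid(np.linspace(0, 2 * np.pi, n_u, endpoint=False),
                         np.linspace(0, 2 * np.pi, n_u, endpoint=False))
    u1, u2 = u1.ravel(), u2.ravel()
    z = np.stack([(a + b * np.cos(u2)) * np.cos(u1),
                  (a + b * np.cos(u2)) * np.sin(u1),
                  b * np.sin(u2) * np.cos(u1 / 2.0),
                  b * np.sin(u2) * np.sin(u1 / 2.0)], axis=1)
    return z    # (n_u * n_u, 4) samples of the manifold M embedded in R^4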
Our VAE methods are trained with encoder and decoder DNNs having layers of sizes (in)-100-500-100-(out) with Leaky-ReLU activations with s = 1e-6, with results reported in Figure 5 and Table 2. We find learning representations is improved by use of the manifold latent spaces, in these trials even showing a slight edge over R⁴. When the wrong topology is used, such as in R², we find in both cases a significant deterioration in the reconstruction accuracy, see Table 2. This arises since the encoder must be continuous and hedge against the noise regularizations. This results in an incurred penalty for a subset of configurations. The encoder exhibits non-injectivity and a rapid doubling back over the space to accommodate the decoder by lining up nearby configurations in the topology of the input space manifold to handle noise perturbations in z from the probabilistic nature of the encoding. We also studied robustness when training with noise for X̃ = X + ση(0, 1) and measuring accuracy for reconstruction relative to the target X. As the noise increases, we see that the manifold latent spaces improve reconstruction accuracy, acting as a filter through restricting the representation. The probabilistic decoder will tend to learn to estimate the mean over samples of a common underlying configuration, and with the manifold latent space restrictions it is more likely to use a common latent representation. For R^d with d > 2, the extraneous dimensions in the latent space can result in overfitting of the encoder to the noise. We see that as d becomes larger the reconstruction accuracy decreases, see Table 2. These results demonstrate how geometric priors can aid learning in constrained mechanical systems.

Figure 5: VAE Representations of Motions using Manifold Latent Spaces. We learn from observations representations for constrained mechanical systems using general non-euclidean manifold latent spaces M. The arm mechanism has configurations x = (x1, x2) ∈ R⁴. For rigid segments, the motions are constrained to be on a manifold (torus) M ⊂ R⁴. For extendable segments, we can also consider more exotic constraints, such as requiring x1, x2 to be on a klein bottle in R⁴ (top). Results of our VAE methods for learned representations for motions under these constraints are shown. The VAE learns the segment length constraint and two nearly decoupled coordinates for the torus dataset that mimic the roles of angles. The VAE learns for the klein bottle dataset two segment motions to generate configurations (middle and bottom).

Table 2: Manifold Latent Variable Model: VAE Reconstruction Errors. The L2-relative errors of reconstruction for our VAE methods. The final is the lowest value during training. The manifold latent spaces show improved learning. When an incompatible topology is used, such as R², this can result in deterioration in learned representations. With noise in the input X̃ = X + ση(0, 1) and reconstructing the target X, the manifold latent spaces also show improvements for learning.

Torus
method            epoch 1000   epoch 2000   epoch 3000   final
VAE 2-Manifold    6.6087e-02   6.6564e-02   6.6465e-02   6.6015e-02
VAE R2            1.6540e-01   1.2931e-01   9.9903e-02   8.0648e-02
VAE R4            8.0006e-02   7.6302e-02   7.5875e-02   7.5626e-02
VAE R10           8.3411e-02   8.4569e-02   8.4673e-02   8.4143e-02

with noise σ      0.01         0.05         0.1          0.5
VAE 2-Manifold    6.7099e-02   8.0608e-02   1.1198e-01   4.1988e-01
VAE R2            8.5879e-02   9.7220e-02   1.2867e-01   4.5063e-01
VAE R4            7.6347e-02   9.0536e-02   1.2649e-01   4.9187e-01
VAE R10           8.4780e-02   1.0094e-01   1.3946e-01   5.2050e-01

Klein Bottle
method            epoch 1000   epoch 2000   epoch 3000   final
VAE 2-Manifold    5.7734e-02   5.7559e-02   5.7469e-02   5.7435e-02
VAE R2            1.1802e-01   9.0728e-02   8.0578e-02   7.1026e-02
VAE R4            6.9057e-02   6.5593e-02   6.4047e-02   6.3771e-02
VAE R10           6.8899e-02   6.9802e-02   7.0953e-02   6.8871e-02

with noise σ      0.01         0.05         0.1          0.5
VAE 2-Manifold    5.9816e-02   6.9934e-02   9.6493e-02   4.0121e-01
VAE R2            1.0120e-01   1.0932e-01   1.3154e-01   4.8837e-01
VAE R4            6.3885e-02   7.6096e-02   1.0354e-01   4.5769e-01
VAE R10           7.4587e-02   8.8233e-02   1.2082e-01   4.8182e-01

Conclusions

We developed VAEs for robustly learning nonlinear dynamics of physical systems by introducing methods for latent representations utilizing general geometric and topological structures. We demonstrated our methods for learning the non-linear dynamics of PDEs and constrained mechanical systems. We expect our methods can also be used in other physics-related tasks and problems to leverage prior geometric and topological knowledge for improving learning for nonlinear systems.

Acknowledgments

The authors' research was supported by grants DOE Grant ASCR PHILMS DE-SC0019246 and NSF Grant DMS-1616353. R.N.L. also acknowledges support from a donor to the UCSB CCS SURF program. The authors also acknowledge the UCSB Center for Scientific Computing NSF MRSEC (DMR1121053) and UCSB MRL NSF CNS-1725797. P.J.A. would also like to acknowledge a hardware grant from Nvidia.

References

governing equations. Proceedings of the National Academy of Sciences 116(45): 22445–22451. ISSN 0027-8424. doi:10.1073/pnas.1906995116. URL https://www.pnas.org/content/116/45/22445.

[1] Archer, E.; Park, I. M.; Buesing, L.; Cunningham, J.; and Paninski, L. 2015. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367. URL https://arxiv.org/abs/1511.07367.

[2] Arvanitidis, G.; Hansen, L. K.; and Hauberg, S. 2018. Latent Space Oddity: on the Curvature of Deep Generative Models. In International Conference on Learning Representations. URL https://openreview.net/forum?id=SJzRZ-WCZ.

[12] Chatterjee, A. 2000. An introduction to the proper orthogonal decomposition. Current Science 78(7): 808–817. ISSN 00113891. URL http://www.jstor.org/stable/24103957.

[13] Chen, N.; Karl, M.; and Van Der Smagt, P. 2016. Dynamic movement primitives in latent space of time-dependent variational autoencoders.
In 2016 IEEE- [3] Azencot, O.; Yin, W.; and Bertozzi, A. 2019. Con- RAS 16th International Conference on Humanoid sistent dynamic mode decomposition. SIAM Jour- Robots (Humanoids), 629–636. IEEE. URL https: nal on Applied Dynamical Systems 18(3): 1565– //ieeexplore.ieee.org/document/7803340. 1585. URL https://www.math.ucla.edu/∼bertozzi/ [14] Chen, N.; Klushyn, A.; Ferroni, F.; Bayer, J.; and Van papers/CDMD SIADS.pdf. Der Smagt, P. 2020. Learning Flat Latent Manifolds [4] Bateman, H. 1915. Some Recent Researches on the with VAEs. In III, H. D.; and Singh, A., eds., Pro- Motion of Fluids. Monthly Weather Review 43(4): ceedings of the 37th International Conference on Ma- 163. doi:10.1175/1520-0493(1915)43h163:SRROTMi chine Learning, volume 119 of Proceedings of Ma- 2.0.CO;2. chine Learning Research, 1587–1596. Virtual: PMLR. [5] Baum, L. E.; and Petrie, T. 1966. Statistical Infer- URL http://proceedings.mlr.press/v119/chen20i.html. ence for Probabilistic Functions of Finite State Markov [15] Chiuso, A.; and Pillonetto, G. 2019. Sys- Chains. Ann. Math. Statist. 37(6): 1554–1563. doi: tem Identification: A Machine Learning Perspec- 10.1214/aoms/1177699147. URL https://doi.org/10. tive. Annual Review of Control, Robotics, and 1214/aoms/1177699147. Autonomous Systems 2(1): 281–304. doi:10.1146/ [6] Bishop, C. M.; Svensén, M.; and Williams, C. annurev-control-053018-023744. URL https://doi.org/ K. I. 1996. GTM: A Principled Alternative to 10.1146/annurev-control-053018-023744. the Self-Organizing Map. In Mozer, M.; Jordan, [16] Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bah- M. I.; and Petsche, T., eds., Advances in Neu- danau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. ral Information Processing Systems 9, NIPS, Den- 2014. Learning Phrase Representations using RNN ver, CO, USA, December 2-5, 1996, 354–360. MIT Encoder–Decoder for Statistical Machine Translation. Press. URL http://papers.nips.cc/paper/1207-gtm-a- In Proceedings of the 2014 Conference on Empirical principled-alternative-to-the-self-organizing-map. Methods in Natural Language Processing (EMNLP), [7] Blei, D. M.; Kucukelbir, A.; and McAuliffe, J. D. 2017. 1724–1734. Doha, Qatar: Association for Computa- Variational Inference: A Review for Statisticians. Jour- tional Linguistics. doi:10.3115/v1/D14-1179. URL nal of the American Statistical Association 112(518): https://www.aclweb.org/anthology/D14-1179. 859–877. doi:10.1080/01621459.2017.1285773. URL [17] Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, https://doi.org/10.1080/01621459.2017.1285773. A. C.; and Bengio, Y. 2015. A Recurrent Latent Vari- [8] Brunton, S. L.; and Kutz, J. N. 2019. Reduced Or- able Model for Sequential Data. Advances in neural der Models (ROMs), 375–402. Cambridge University information processing systems abs/1506.02216. URL Press. doi:10.1017/9781108380690.012. http://arxiv.org/abs/1506.02216. [9] Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2016. [18] Cover, T. M.; and Thomas, J. A. 2006. Elements of In- Discovering governing equations from data by sparse formation Theory (Wiley Series in Telecommunications identification of nonlinear dynamical systems. Pro- and Signal Processing). USA: Wiley-Interscience. ceedings of the National Academy of Sciences 113(15): ISBN 0471241954. 3932–3937. ISSN 0027-8424. doi:10.1073/pnas. 1517384113. URL https://www.pnas.org/content/113/ [19] Crutchfield, J.; and McNamara, B. S. 1987. Equations 15/3932. of Motion from a Data Series. Complex Syst. 1. 
[10] Carlsson, G.; Ishkhanov, T.; de Silva, V.; and Zomoro- [20] Das, S.; and Giannakis, D. 2019. Delay-Coordinate dian, A. 2008. On the Local Behavior of Spaces of Maps and the Spectra of Koopman Operators 175: Natural Images. International Journal of Computer 1107–1145. ISSN 0022-4715. doi:10.1007/s10955- Vision 76(1): 1–12. ISSN 1573-1405. URL https: 019-02272-w. //doi.org/10.1007/s11263-007-0056-x. [21] Davidson, T. R.; Falorsi, L.; Cao, N. D.; Kipf, T.; [11] Champion, K.; Lusch, B.; Kutz, J. N.; and Brunton, and Tomczak, J. M. 2018. Hyperspherical Variational S. L. 2019. Data-driven discovery of coordinates and Auto-Encoders URL https://arxiv.org/abs/1804.00891. [22] Del Moral, P. 1997. Nonlinear filtering: Interacting [32] Hesthaven, J. S.; Rozza, G.; and Stamm, B. 2016. Re- particle resolution. Comptes Rendus de l’Académie des duced Basis Methods 27–43. ISSN 2191-8198. doi: Sciences - Series I - Mathematics 325(6): 653 – 658. 10.1007/978-3-319-22470-1 3. ISSN 0764-4442. doi:https://doi.org/10.1016/S0764- [33] Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, 4442(97)84778-7. URL http://www.sciencedirect. X.; Botvinick, M. M.; Mohamed, S.; and Lerchner, A. com/science/article/pii/S0764444297847787. 2017. beta-VAE: Learning Basic Visual Concepts with [23] DeVore, R. A. 2017. Model Reduction and Approx- a Constrained Variational Framework. In ICLR. URL imation: Theory and Algorithms, chapter Chapter 3: https://openreview.net/forum?id=Sy2fzU9gl. The Theoretical Foundation of Reduced Basis Meth- [34] Hochreiter, S.; and Schmidhuber, J. 1997. Long Short- ods, 137–168. SIAM. doi:10.1137/1.9781611974829. Term Memory. Neural Comput. 9(8): 1735–1780. ch3. URL https://epubs.siam.org/doi/abs/10.1137/1. ISSN 0899-7667. doi:10.1162/neco.1997.9.8.1735. 9781611974829.ch3. URL https://doi.org/10.1162/neco.1997.9.8.1735. [24] Erichson, N. B.; Muehlebach, M.; and Mahoney, [35] Hong, X.; Mitchell, R.; Chen, S.; Harris, C.; Li, K.; M. W. 2019. Physics-informed autoencoders for and Irwin, G. 2008. Model selection approaches for Lyapunov-stable fluid flow prediction. arXiv preprint non-linear system identification: a review. Interna- arXiv:1905.10866 . tional Journal of Systems Science 39(10): 925–946. doi:10.1080/00207720802083018. URL https://doi. [25] Falorsi, L.; Haan, P. D.; Davidson, T.; Cao, N. D.; org/10.1080/00207720802083018. Weiler, M.; Forré, P.; and Cohen, T. 2018. Explo- rations in Homeomorphic Variational Auto-Encoding. [36] Hopf, E. 1950. The partial differential equation ut + ArXiv abs/1807.04689. URL https://arxiv.org/pdf/ uux = µxx . Comm. Pure Appl. Math. 3, 201-230 1807.04689.pdf. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/ cpa.3160030302. [26] Ghahramani, Z.; and Roweis, S. T. 1998. Learn- ing Nonlinear Dynamical Systems Using an EM [37] Jensen, K. T.; Kao, T.-C.; Tripodi, M.; and Hennequin, Algorithm. In Kearns, M. J.; Solla, S. A.; and G. 2020. Manifold GPLVMs for discovering non- Cohn, D. A., eds., Advances in Neural Informa- Euclidean latent structure in neural data URL https: tion Processing Systems 11, [NIPS Conference, //arxiv.org/abs/2006.07429. Denver, Colorado, USA, November 30 - Decem- [38] Kalatzis, D.; Eklund, D.; Arvanitidis, G.; and Hauberg, ber 5, 1998], 431–437. The MIT Press. URL S. 2020. Variational Autoencoders with Rieman- http://papers.nips.cc/paper/1594-learning-nonlinear- nian Brownian Motion Priors. arXiv e-prints dynamical-systems-using-an-em-algorithm. arXiv:2002.05227. URL https://arxiv.org/abs/2002. 05227. 
[27] Girin, L.; Leglaive, S.; Bie, X.; Diard, J.; Hueber, T.; and Alameda-Pineda, X. 2020. Dynamical Variational [39] Kalman, R. E. 1960. A New Approach to Linear Fil- Autoencoders: A Comprehensive Review . tering and Prediction Problems. Journal of Basic Engi- neering 82(1): 35–45. ISSN 0021-9223. doi:10.1115/ [28] Godsill, S. 2019. Particle Filtering: the First 25 Years 1.3662552. URL https://doi.org/10.1115/1.3662552. and beyond. In Proc. Speech and Signal Processing (ICASSP) ICASSP 2019 - 2019 IEEE Int. Conf. Acous- [40] Kingma, D. P.; and Welling, M. 2014. Auto-Encoding tics, 7760–7764. Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, [29] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Canada, April 14-16, 2014, Conference Track Pro- Deep Learning. The MIT Press. ISBN 0262035618. ceedings. URL http://arxiv.org/abs/1312.6114. URL https://www.deeplearningbook.org/. [41] Kohonen, T. 1982. Self-organized formation of topo- [30] Gross, B.; Trask, N.; Kuberry, P.; and Atzberger, P. logically correct feature maps. Biological cybernetics 2020. Meshfree methods on manifolds for hydrody- 43(1): 59–69. URL https://link.springer.com/article/ namic flows on curved surfaces: A Generalized Mov- 10.1007/BF00337288. ing Least-Squares (GMLS) approach. Journal of [42] Korda, M.; Putinar, M.; and Mezić, I. 2020. Data- Computational Physics 409: 109340. ISSN 0021- driven spectral analysis of the Koopman operator. 9991. doi:https://doi.org/10.1016/j.jcp.2020.109340. Applied and Computational Harmonic Analy- URL http://www.sciencedirect.com/science/article/pii/ sis 48(2): 599 – 629. ISSN 1063-5203. doi: S0021999120301145. https://doi.org/10.1016/j.acha.2018.08.002. URL [31] Hernández, C. X.; Wayment-Steele, H. K.; Sultan, http://www.sciencedirect.com/science/article/pii/ M. M.; Husic, B. E.; and Pande, V. S. 2018. Varia- S1063520318300988. tional encoding of complex dynamics. Physical Re- [43] Krishnan, R. G.; Shalit, U.; and Sontag, D. A. view E 97(6). ISSN 2470-0053. doi:10.1103/physreve. 2017. Structured Inference Networks for Nonlin- 97.062412. URL http://dx.doi.org/10.1103/PhysRevE. ear State Space Models. In Singh, S. P.; and 97.062412. Markovitch, S., eds., Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February [54] Pawar, S.; Ahmed, S. E.; San, O.; and Rasheed, A. 4-9, 2017, San Francisco, California, USA, 2101– 2020. Data-driven recovery of hidden physics in re- 2109. AAAI Press. URL http://aaai.org/ocs/index.php/ duced order modeling of fluid flows 32: 036602. ISSN AAAI/AAAI17/paper/view/14215. 1070-6631. doi:10.1063/5.0002051. [44] Kullback, S.; and Leibler, R. A. 1951. On Informa- [55] Pearce, M. 2020. The Gaussian Process Prior VAE for tion and Sufficiency. Ann. Math. Statist. 22(1): 79–86. Interpretable Latent Dynamics from Pixels. volume doi:10.1214/aoms/1177729694. URL https://doi.org/ 118 of Proceedings of Machine Learning Research, 10.1214/aoms/1177729694. 1–12. PMLR. URL http://proceedings.mlr.press/v118/ [45] Kutz, J. N.; Brunton, S. L.; Brunton, B. W.; pearce20a.html. and Proctor, J. L. 2016. Dynamic Mode De- [56] Perea, J. A.; and Carlsson, G. 2014. A Klein-Bottle- composition. Philadelphia, PA: Society for In- Based Dictionary for Texture Representation. In- dustrial and Applied Mathematics. doi:10.1137/1. ternational Journal of Computer Vision 107(1): 75– 9781611974508. URL https://epubs.siam.org/doi/abs/ 97. ISSN 1573-1405. URL https://doi.org/10.1007/ 10.1137/1.9781611974508. s11263-013-0676-2. 
[46] Lee, K.; and Carlberg, K. T. 2020. Model reduc- [57] Perez Rey, L. A.; Menkovski, V.; and Portegies, J. tion of dynamical systems on nonlinear manifolds 2020. Diffusion Variational Autoencoders. In Bessiere, using deep convolutional autoencoders. Journal of C., ed., Proceedings of the Twenty-Ninth International Computational Physics 404: 108973. ISSN 0021- Joint Conference on Artificial Intelligence, IJCAI-20, 9991. doi:https://doi.org/10.1016/j.jcp.2019.108973. 2704–2710. International Joint Conferences on Arti- URL http://www.sciencedirect.com/science/article/pii/ ficial Intelligence Organization. doi:10.24963/ijcai. S0021999119306783. 2020/375. URL https://arxiv.org/pdf/1901.08991.pdf. [47] Lusch, B.; Kutz, J. N.; and Brunton, S. L. 2018. Deep [58] Raissi, M.; and Karniadakis, G. E. 2018. Hidden learning for universal linear embeddings of nonlinear physics models: Machine learning of nonlinear par- dynamics. Nature Communications 9(1): 4950. ISSN tial differential equations. Journal of Computational 2041-1723. URL https://doi.org/10.1038/s41467-018- Physics 357: 125 – 141. ISSN 0021-9991. URL 07210-0. https://arxiv.org/abs/1708.00588. [48] Mania, H.; Jordan, M. I.; and Recht, B. 2020. Ac- [59] Roeder, G.; Grant, P. K.; Phillips, A.; Dalchau, N.; and tive learning for nonlinear system identification with Meeds, E. 2019. Efficient Amortised Bayesian Infer- guarantees. arXiv preprint arXiv:2006.10277 URL ence for Hierarchical and Nonlinear Dynamical Sys- https://arxiv.org/pdf/2006.10277.pdf. tems URL https://arxiv.org/abs/1905.12090. [49] Mendez, M. A.; Balabane, M.; and Buchlin, J. M. 2018. Multi-scale proper orthogonal decomposition [60] Samuel H. Rudy, J. Nathan Kutz, S. L. B. 2018. Deep (mPOD) doi:10.1063/1.5043720. learning of dynamics and signal-noise decomposition with time-stepping constraints. arXiv:1808:02578 [50] Mezić, I. 2013. Analysis of Fluid Flows via Spec- URL https://doi.org/10.1016/j.jcp.2019.06.056. tral Properties of the Koopman Operator. Annual Re- view of Fluid Mechanics 45(1): 357–378. doi:10.1146/ [61] Sarafianos, N.; Boteanu, B.; Ionescu, B.; and Kaka- annurev-fluid-011212-140652. URL https://doi.org/ diaris, I. A. 2016. 3D Human pose estimation: A 10.1146/annurev-fluid-011212-140652. review of the literature and analysis of covariates. Computer Vision and Image Understanding 152: 1 – [51] Nelles, O. 2013. Nonlinear system identification: 20. ISSN 1077-3142. doi:https://doi.org/10.1016/ from classical approaches to neural networks and j.cviu.2016.09.002. URL http://www.sciencedirect. fuzzy models. Springer Science & Business Me- com/science/article/pii/S1077314216301369. dia. URL https://play.google.com/books/reader?id= tyjrCAAAQBAJ&hl=en&pg=GBS.PR3. [62] Saul, L. K. 2020. A tractable latent variable model for nonlinear dimensionality reduction. Proceed- [52] Ohlberger, M.; and Rave, S. 2016. Reduced Ba- ings of the National Academy of Sciences 117(27): sis Methods: Success, Limitations and Future Chal- 15403–15408. ISSN 0027-8424. doi:10.1073/pnas. lenges. Proceedings of the Conference Algoritmy 1916012117. URL https://www.pnas.org/content/117/ 1–12. URL http://www.iam.fmph.uniba.sk/amuc/ojs/ 27/15403. index.php/algoritmy/article/view/389. [63] Schmid, P. J. 2010. Dynamic mode decomposition of [53] Parish, E. J.; and Carlberg, K. T. 2020. Time-series numerical and experimental data. Journal of Fluid Me- machine-learning error models for approximate solu- chanics 656: 5–28. doi:10.1017/S0022112010001217. tions to parameterized dynamical systems. 
Computer URL https://doi.org/10.1017/S0022112010001217. Methods in Applied Mechanics and Engineering 365: 112990. ISSN 0045-7825. doi:https://doi.org/10.1016/ [64] Schmidt, M.; and Lipson, H. 2009. Distilling Free- j.cma.2020.112990. URL http://www.sciencedirect. Form Natural Laws from Experimental Data 324: 81– com/science/article/pii/S0045782520301742. 85. ISSN 0036-8075. doi:10.1126/science.1165893. [65] Schoukens, J.; and Ljung, L. 2019. Nonlinear Sys- tem Identification: A User-Oriented Road Map. IEEE Control Systems Magazine 39(6): 28–99. doi:10.1109/ MCS.2019.2938121. [66] Schön, T. B.; Wills, A.; and Ninness, B. 2011. System identification of nonlinear state-space mod- els. Automatica 47(1): 39 – 49. ISSN 0005- 1098. doi:https://doi.org/10.1016/j.automatica.2010. 10.013. URL http://www.sciencedirect.com/science/ article/pii/S0005109810004279. [67] Sjöberg, J.; Zhang, Q.; Ljung, L.; Benveniste, A.; Delyon, B.; Glorennec, P.-Y.; Hjalmarsson, H.; and Juditsky, A. 1995. Nonlinear black- box modeling in system identification: a unified overview. Automatica 31(12): 1691 – 1724. ISSN 0005-1098. doi:https://doi.org/10.1016/0005- 1098(95)00120-8. URL http://www.sciencedirect. com/science/article/pii/0005109895001208. Trends in System Identification. [68] Talmon, R.; Mallat, S.; Zaveri, H.; and Coifman, R. R. 2015. Manifold Learning for Latent Variable Inference in Dynamical Systems. IEEE Transactions on Sig- nal Processing 63(15): 3843–3856. doi:10.1109/TSP. 2015.2432731. [69] Tu, J. H.; Rowley, C. W.; Luchtenburg, D. M.; Brun- ton, S. L.; and Kutz, J. N. 2014. On dynamic mode decomposition: Theory and applications. Journal of Computational Dynamics URL http://aimsciences.org/ /article/id/1dfebc20-876d-4da7-8034-7cd3c7ae1161. [70] Van Der Merwe, R.; Doucet, A.; De Freitas, N.; and Wan, E. 2000. The Unscented Particle Filter. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS’00, 563–569. Cambridge, MA, USA: MIT Press. [71] Wan, E. A.; and Van Der Merwe, R. 2000. The un- scented Kalman filter for nonlinear estimation. In Pro- ceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No.00EX373), 153–158. doi:10.1109/ASSPCC. 2000.882463. [72] Whitney, H. 1944. The Self-Intersections of a Smooth n-Manifold in 2n-Space. Annals of Mathematics 45(2): 220–246. ISSN 0003486X. URL http://www.jstor.org/ stable/1969265. [73] Yang, Y.; and Perdikaris, P. 2018. Physics- informed deep generative models. arXiv preprint arXiv:1812.03511 . Appendix A: Backpropogation of Encoders for where Φk (u, w) = 21 kw − σ k (u)k22 . The w is the input and Non-Euclidean Latent Spaces given by u∗ , k ∗ is the solution sought. For smooth parameterizations, the optimal solution satisfies General Manifolds We develop methods for using backpropogation to learn en- G = ∇z Φk∗ (u∗ , w) = 0. coder maps from Rd to general manifolds M. We perform During learning we need gradients ∇w Λ(w) = ∇w z when learning using the family of manifold encoder maps of the w is varied characterizing variations of points on the mani- form Eθ = Λ(Ẽθ (x)). This allows for use of latent spaces fold z = Λ(w). We derive these expressions by considering having general topologies and geometries. We represent the variations w = w(γ) for a scalar parameter γ. We can ob- manifold as an embedding M ⊂ R2m and computationally tain the needed gradients by determining the variations of use point-cloud representations along with local gradient in- u∗ = u∗ (γ). 
formation, see Figure 6. To allow for Eθ to be learnable, we develop approaches for incorporating our maps into general backpropagation frameworks.

Figure 6: Learnable Mappings to Manifold Surfaces. We develop methods based on point cloud representations embedded in R^n for learning latent manifold representations having general geometries and topologies.

For a manifold M of dimension m, we can represent it by an embedding within R^2m, as supported by the Whitney Embedding Theorem [72]. We let z = Λ(w) be a mapping with w ∈ R^2m to points on the manifold z ∈ M. This allows for learning within the family of manifold encoders w = Ẽθ(x) any function from R^d to R^2m. This facilitates use of deep neural networks and other function classes. In practice, we shall take z = Λ(w) to map to the nearest location on the manifold. We can express this as the optimization problem

z* = arg min_{z ∈ M} (1/2) ‖w − z‖₂².

We can always express a smooth manifold using local coordinate charts σ^k(u), for example, by using a local Monge-Gauge quadratic fit to the point cloud [30]. We can express z* = σ^k(u*) for some chart k*. In terms of the coordinate charts {U_k} and local parameterizations {σ^k(u)} we can express this as

u*, k* = arg min_{k, u ∈ U_k} (1/2) ‖w − σ^k(u)‖₂².

We can express the needed gradients ∇_w Λ(w) using the Implicit Function Theorem as

0 = (d/dγ) G(u*(γ), w(γ)) = ∇_u G (du*/dγ) + ∇_w G (dw/dγ).

This implies

du*/dγ = −[∇_u G]⁻¹ ∇_w G (dw/dγ).

As long as we can evaluate at u these local gradients ∇_u G, ∇_w G, dw/dγ, we only need to determine computationally the solution u*. For the backpropagation framework, we use these to assemble the needed gradients for our manifold encoder maps Eθ = Λ(Ẽθ(x)) as follows.

We first find numerically the closest point in the manifold z* ∈ M and represent it as z* = σ(u*) = σ^{k*}(u*) for some chart k*. In this chart, the gradients can be expressed as

G = ∇_u Φ(u, w) = −(w − σ(u))ᵀ ∇_u σ(u).

We take here a column vector convention with ∇_u σ(u) = [σ_{u1} | ... | σ_{uk}]. We next compute

∇_u G = ∇_{uu} Φ = ∇_u σᵀ ∇_u σ − (w − σ(u))ᵀ ∇_{uu} σ(u)

and

∇_w G = ∇_{w,u} Φ = −I ∇_u σ(u).

For implementation it is useful to express this in more detail component-wise as

[G]_i = −Σ_k (w_k − σ_k(u)) ∂_{u_i} σ_k(u),

[∇_u G]_{i,j} = [∇_{uu} Φ]_{i,j} = Σ_k ∂_{u_j} σ_k(u) ∂_{u_i} σ_k(u) − Σ_k (w_k − σ_k(u)) ∂²_{u_i u_j} σ_k(u),

[∇_w G]_{i,j} = [∇_{w,u} Φ]_{i,j} = −Σ_k ∂_{w_j} w_k ∂_{u_i} σ_k(u) = −∂_{u_i} σ_j(u).

The final gradient is given by

dΛ(w)/dγ = dz*/dγ = ∇_u σ (du*/dγ) = −∇_u σ [∇_u G]⁻¹ ∇_w G (dw/dγ).

In summary, once we determine the point z* = Λ(w), we need only evaluate the above expressions to obtain the needed gradient for learning via backpropagation

∇_θ Eθ(x) = ∇_w Λ(w) ∇_θ Ẽθ(x),   w = Ẽθ(x).

The ∇_w Λ is determined by dΛ(w)/dγ using γ = w1, ..., wn. In practice, the Ẽθ(x) is represented by a deep neural network from R^d to R^2m. In this way, we can learn general encoder mappings Eθ(x) from x ∈ R^d to general manifolds M.
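A minimal sketch of how such a quantized nearest-point map with the gradients above could be exposed to an automatic-differentiation framework is given below. It assumes the point cloud is supplied together with an (approximate) tangent basis J = ∇_u σ at each sample, for instance from a local Monge-Gauge or PCA fit, and it neglects the curvature term in ∇_u G, so that the backward pass reduces to the projector J(JᵀJ)⁻¹Jᵀ onto the local tangent space. This is a simplified illustration under those assumptions, not the authors' implementation.

import torch

class ManifoldProjection(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, points, tangents):
        # w: (batch, n); points: (N, n); tangents: (N, n, m)
        d = torch.cdist(w, points)              # pairwise distances (batch, N)
        idx = d.argmin(dim=1)                   # nearest point-cloud sample
        z = points[idx]                         # quantized projection z* = Lambda(w)
        J = tangents[idx]                       # local tangent frames (batch, n, m)
        ctx.save_for_backward(J)
        return z

    @staticmethod
    def backward(ctx, grad_z):
        (J,) = ctx.saved_tensors
        # First-order chain rule: dz*/dw ~= J (J^T J)^{-1} J^T (tangent projector),
        # i.e. the curvature term in grad_u G is dropped.
        JtJ = J.transpose(1, 2) @ J
        P = J @ torch.linalg.solve(JtJ, J.transpose(1, 2))
        grad_w = (P @ grad_z.unsqueeze(-1)).squeeze(-1)
        return grad_w, None, None

def project(w, points, tangents):
    return ManifoldProjection.apply(w, points, tangents)

Used after a network output w = Ẽθ(x), this makes the composed encoder Eθ(x) = Λ(Ẽθ(x)) trainable end-to-end, since gradients flow through the projection onto the point-cloud manifold.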