Enforcing constraints for time series prediction in supervised, unsupervised and reinforcement learning

Panos Stinis
Advanced Computing, Mathematics and Data Division
Pacific Northwest National Laboratory, Richland, WA 99354

Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We assume that we are given a time series of data from a dynamical system and our task is to learn the flow map of the dynamical system. We present a collection of results on how to enforce constraints coming from the dynamical system in order to accelerate the training of deep neural networks to represent the flow map of the system, as well as increase their predictive ability. In particular, we provide ways to enforce constraints during training for all three major modes of learning, namely supervised, unsupervised and reinforcement learning. In general, the dynamic constraints need to include terms which are analogous to memory terms in model reduction formalisms. Such memory terms act as a restoring force which corrects the errors committed by the learned flow map during prediction.

For supervised learning, the constraints are added to the objective function. For the case of unsupervised learning, in particular generative adversarial networks, the constraints are introduced by augmenting the input of the discriminator. Finally, for the case of reinforcement learning and in particular actor-critic methods, the constraints are added to the reward function. In addition, for the reinforcement learning case, we present a novel approach based on homotopy of the action-value function in order to stabilize and accelerate training. We use numerical results for the Lorenz system to illustrate the various constructions.

Introduction

Scientific machine learning, which combines the strengths of scientific computing with those of machine learning, is becoming a rather active area of research. Several related priority research directions were stated in the recently published report (Baker et al. 2019). In particular, two priority research directions are: (i) how to leverage scientific domain knowledge in machine learning (e.g. physical principles, symmetries, constraints); and (ii) how machine learning can enhance scientific computing (e.g. reduced-order or sub-grid physics models, parameter optimization in multiscale simulations).

Our aim in the current work is to present a collection of results that contribute to both of the aforementioned priority research directions. On the one hand, we provide ways to enforce constraints coming from a dynamical system during the training of a neural network to represent the flow map of the system. Thus, prior domain knowledge is incorporated in the neural network training. On the other hand, as we will show, the accurate representation of the dynamical system flow map through a neural network is equivalent to constructing a temporal integrator for the dynamical system modified to account for unresolved temporal scales. Thus, machine learning can enhance scientific computing.

We assume that we are given data in the form of a time series of the states of a dynamical system (a training trajectory). Our task is to train a neural network to learn the flow map of the dynamical system. This means to optimize the parameters of the neural network so that when it is presented with the state of the system at one instant, it will predict accurately the state of the system at another instant which is a fixed time interval apart. If we want to use the data alone to train a neural network to represent the flow map, then it is easy to construct simple examples where the trained flow map has rather poor predictive ability (Stinis et al. 2019). The reason is that the given data train the flow map to learn how to respond accurately as long as the state of the system is on the trajectory. However, at every timestep, when we invoke the flow map to predict the estimate of the state at the next timestep, we commit an error. After some steps, the predicted trajectory veers into parts of phase space where the neural network has not trained. When this happens, the neural network's predictive ability degrades rapidly.

One way to aid the neural network in its training task is to provide data that account for this inevitable error. In (Stinis et al. 2019), we advanced the idea of using a noisy version of the training data, i.e. a noisy version of the training trajectory. In particular, we attach a noise cloud around each point on the training trajectory. During training, the neural network learns how to take as input points from the noise cloud, and map them back to the noiseless trajectory at the next time instant. This is an implicit way of encoding a restoring force in the parameters of the neural network. We have found that this modification can improve the predictive ability of the trained neural network, but only up to a point.

We want to aid the neural network further by enforcing constraints that we know the state of the system satisfies. In particular, we assume that we have knowledge of the differential equations that govern the evolution of the system
(our constructions work also if we assume algebraic constraints, see e.g. (Stinis et al. 2019)). Enforcing the differential equations directly at the continuum level can be effected for supervised and reinforcement learning, but it is more involved for unsupervised learning. Here we have opted to enforce constraints in discrete time. We want to incorporate the discretized dynamics into the training process of the neural network. The purpose of such an attempt can be explained in two ways: (i) we want to aid the neural network so that it does not have to discover the dynamics (physics) from scratch; and (ii) we want the constraints to act as regularizers for the optimization problem which determines the parameters of the neural network.

Closer inspection of the concepts of noisy data and of enforcing the discretized constraints reveals that they can be combined. However, this needs to be done with care. Recall that when we use noisy data we train the neural network to map a point from the noise cloud back to the noiseless point at the next time instant. Thus, we cannot enforce the discretized constraints as they are, because the dynamics have been modified. In particular, the use of noisy data requires that the discretized constraints be modified to account explicitly for the restoring force. We have called the modification of the discretized constraints the explicit error-correction.

The meaning of the restoring force is analogous to that of memory terms in model reduction formalisms (Chorin and Stinis 2006). Note that the memory here is not because we are only resolving part of the system's variables (see e.g. (Ma, Wang, and E 2018; Harlim et al. 2019)) but due to the use of a finite timestep. The timescales that are smaller than the timestep used are not resolved explicitly. However, their effect on the resolved timescales cannot be ignored. In fact, it is what causes the inevitable error at each application of the flow map. The restoring force that we include in the modified constraints is there to remedy this error, i.e. to account for the unresolved timescales, albeit in a simplified manner. This is precisely the role played by memory terms in model reduction formalisms. In the current work we have restricted attention to linear error-correction terms. The linear terms come with coefficients whose magnitude is optimized as part of the training. In this respect, optimizing the error-correction term coefficients becomes akin to temporal renormalization. This means that the coefficients depend on the temporal scale at which we probe the system (Goldenfeld 1992; Barenblatt 2003). Finally, we note that the error-correction term can be more complex than linear. In fact, it can be modeled by a separate neural network. It can also involve not just the previous state but also states further back in time. Results for such more elaborate error-correction terms will be presented elsewhere.

We have implemented constraint enforcing in all three major modes of learning. For supervised learning, the constraints are added to the objective function. For the case of unsupervised learning, in particular generative adversarial networks (GANs) (Goodfellow et al. 2014), the constraints are introduced by augmenting the input of the discriminator (Stinis et al. 2019). Finally, for the case of reinforcement learning and in particular actor-critic methods (Sutton et al. 1999), the constraints are added to the reward function. In addition, for the reinforcement learning case, we have developed a novel approach based on homotopy of the action-value function in order to stabilize and accelerate training.

In recent years, there has been considerable interest in the development of methods that utilize data and physical constraints in order to train predictors for dynamical systems and differential equations, e.g. see (Berry, Giannakis, and Harlim 2015; Raissi, Perdikaris, and Karniadakis 2018; Chen et al. 2018; Han, Jentzen, and E 2018; Sirignano and Spiliopoulos 2018; Felsberger and Koutsourelakis 2018; Wan et al. 2018; Ma et al. 2018) and references therein. Our approach is different: it introduces the novel concept of training on purpose with modified (noisy) data in order to incorporate (implicitly or explicitly) a restoring force in the dynamics learned by the neural network flow map. We have also provided the connection between the incorporation of such restoring forces and the concept of memory in model reduction.

Due to space limitations, we cannot expand on the details of how to enforce constraints for the three major modes of learning (please see Sections 1 and 2 in (Stinis 2019) for a detailed discussion of all the constructions). Instead, we focus on the presentation of numerical results for the Lorenz system to showcase the performance of the proposed approach. Also, we note that we have not included results which show how enforcing constraints, implicitly or explicitly, is better than not enforcing constraints at all (please see (Stinis et al. 2019) and (Stinis 2019) for such results).

Numerical results

The Lorenz system is given by

$$\frac{dx_1}{dt} = \sigma (x_2 - x_1), \qquad (1)$$

$$\frac{dx_2}{dt} = \rho x_1 - x_2 - x_1 x_3, \qquad (2)$$

$$\frac{dx_3}{dt} = x_1 x_2 - \beta x_3, \qquad (3)$$

where σ, ρ and β are positive. We have chosen for the numerical experiments the commonly used values σ = 10, ρ = 28 and β = 8/3. For these values of the parameters the Lorenz system is chaotic and possesses an attractor for almost all initial points. We have chosen the initial condition x_1(0) = 0, x_2(0) = 1 and x_3(0) = 0.

We have used as training data the trajectory that starts from the specified initial condition and is computed by the Euler scheme with timestep δt = 10^−4. In particular, we have used data from a trajectory for t ∈ [0, 3]. For all three modes of learning, we have trained the neural network to represent the flow map with timestep ∆t = 1.5 × 10^−2, i.e. 150 times larger than the timestep used to produce the training data. After we trained the neural network that represents the flow map, we used it to predict the solution for t ∈ [0, 9]. Thus, the trained flow map's task is to predict (through iterative application) the whole training trajectory for t ∈ [0, 3] starting from the given initial condition and then keep producing predictions for t ∈ (3, 9].
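To make the setup concrete, the following is a minimal sketch of how such training data could be generated (the function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def lorenz_rhs(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system (1)-(3)."""
    x1, x2, x3 = x
    return np.array([sigma * (x2 - x1),
                     rho * x1 - x2 - x1 * x3,
                     x1 * x2 - beta * x3])

def euler_trajectory(x0, dt=1e-4, t_end=3.0):
    """Training trajectory: forward Euler with timestep dt."""
    n_steps = int(round(t_end / dt))
    traj = np.empty((n_steps + 1, 3))
    traj[0] = x0
    for n in range(n_steps):
        traj[n + 1] = traj[n] + dt * lorenz_rhs(traj[n])
    return traj

# Ground truth with dt = 1e-4; the flow map is trained on pairs of
# points Delta t = 1.5e-2 apart, i.e. 150 Euler steps apart.
traj = euler_trajectory(np.array([0.0, 1.0, 0.0]))
stride = 150
inputs, targets = traj[:-stride:stride], traj[stride::stride]
```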
This is a severe test of the learned flow map's predictive abilities for four reasons. First, due to the chaotic nature of the Lorenz system there is no guarantee that the flow map can correct its errors so that it can follow closely the training trajectory even for the interval [0, 3] used for training. Second, by extending the interval of prediction beyond the one used for training we want to check whether the neural network has actually learned the flow map of the Lorenz system and is not just overfitting the training data. Third, we have chosen an initial condition that is far away from the attractor, but our integration interval is long enough so that the system does reach the attractor and then evolves on it. In other words, we want the neural network to learn both the evolution of the transient and the evolution on the attractor. Fourth, we have chosen to train the neural network to represent the flow map corresponding to a much larger timestep than the one used to produce the training trajectory in order to check the ability of the error-correcting term to account for a significant range of unresolved timescales (relative to the training trajectory).

We performed experiments with different values for the various parameters that enter in our constructions. We present here indicative results for the case of N = 2 × 10^4 samples (N/3 for training, N/3 for validation and N/3 for testing). We have chosen N_cloud = 100 for the cloud of points around each input. Thus, the timestep is ∆t = 1.5 × 10^−2. This is because there are 20000/100 = 200 time instants in the interval [0, 3], at a distance ∆t = 3/200 = 1.5 × 10^−2 apart.

The noise cloud for the neural network at a point t was constructed using the point x_i(t), for i = 1, 2, 3, on the training trajectory and adding random disturbances so that it becomes the collection x_i(t)(1 − R_range + 2 R_range × ξ_il), where l = 1, . . . , N_cloud. The random variables ξ_il ∼ U[0, 1] and R_range = 2 × 10^−2. As we have explained before, we want to train the neural network to map the input from the noise cloud at a time t to the noiseless point x_i(t + ∆t) (for i = 1, 2, 3) on the training trajectory at time t + ∆t.

We also need to motivate the value of R_range for the range of the noise cloud. Recall that the training trajectory was computed with the Euler scheme, which is a first-order scheme. For the interval ∆t = 1.5 × 10^−2 we expect the error committed by the flow map to be of similar magnitude, and thus we should accommodate this error by considering a cloud of points within this range. We found that taking R_range slightly larger and equal to 2 × 10^−2 helps the accuracy of the training.
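A sketch of this noise-cloud construction, under the same naming caveat as the previous snippet (`traj` is the Euler trajectory from that sketch):

```python
def noise_cloud(x, n_cloud=100, r_range=2e-2, rng=None):
    """Noise cloud around one trajectory point x: each component x_i
    becomes x_i * (1 - r_range + 2 * r_range * xi) with xi ~ U[0, 1],
    i.e. a relative perturbation of at most +/- r_range."""
    rng = rng or np.random.default_rng(0)
    xi = rng.uniform(size=(n_cloud, x.size))
    return x * (1.0 - r_range + 2.0 * r_range * xi)

# All 100 cloud points around x(t) share the same noiseless target
# x(t + Delta t); this is what encodes the implicit restoring force.
cloud_t0 = noise_cloud(traj[0])   # shape (n_cloud, 3)
```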
We denote by (F_1(z_j), F_2(z_j), F_3(z_j)) the neural network flow map prediction at t_j + ∆t for the input vector z_j = (z_j1, z_j2, z_j3) from the noise cloud at time t_j. Also, x_j^data = (x_1(t_j + ∆t), x_2(t_j + ∆t), x_3(t_j + ∆t)) is the point on the training trajectory computed by the Euler scheme with δt = 10^−4. For the mini-batch size we have chosen m = 1000 for the supervised and unsupervised cases and m = 33 for the reinforcement learning case.
We also need to specify the constraints that we want to enforce. Using the notation introduced above, we want to train the neural network flow map so that its output (F_1(z_j), F_2(z_j), F_3(z_j)) for an input data point z_j = (z_j1, z_j2, z_j3) from the noise cloud makes zero the residuals

$$\epsilon_{j1} = F_1(z_j) - z_{j1} - \Delta t \, [\sigma(z_{j2} - z_{j1})] + \Delta t \, a_1 z_{j1}, \qquad (4)$$

$$\epsilon_{j2} = F_2(z_j) - z_{j2} - \Delta t \, [\rho z_{j1} - z_{j2} - z_{j1} z_{j3}] + \Delta t \, a_2 z_{j2}, \qquad (5)$$

$$\epsilon_{j3} = F_3(z_j) - z_{j3} - \Delta t \, [z_{j1} z_{j2} - \beta z_{j3}] + \Delta t \, a_3 z_{j3}, \qquad (6)$$

where a_1, a_2 and a_3 are parameters to be optimized during training along with the parameters of the neural network flow map. The first three terms on the RHS of (4)-(6) are the forward Euler scheme, while the fourth is the diagonal linear error-correcting term. More elaborate error-correcting terms will appear elsewhere (see also (Stinis 2019)).
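The residuals (4)-(6) could be evaluated for a mini-batch as in the following sketch (here `F_z` stands for the flow-map outputs and `a` for the trainable error-correction coefficients; the names are illustrative):

```python
SIGMA, RHO, BETA, DT = 10.0, 28.0, 8.0 / 3.0, 1.5e-2

def residuals(F_z, z, a):
    """Residuals (4)-(6): flow-map output minus a forward Euler step,
    plus the diagonal linear error-correcting term Delta t * a_l * z_l.
    F_z and z have shape (m, 3); a = (a1, a2, a3) broadcasts over the
    mini-batch of size m."""
    z1, z2, z3 = z[:, 0], z[:, 1], z[:, 2]
    rhs = np.stack([SIGMA * (z2 - z1),
                    RHO * z1 - z2 - z1 * z3,
                    z1 * z2 - BETA * z3], axis=1)
    return F_z - z - DT * rhs + DT * a * z
```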
Supervised learning

The loss function used for enforcing constraints in supervised learning was

$$Loss = \frac{1}{m} \sum_{j=1}^{m} \sum_{l=1}^{3} \left[ (F_l(z_j) - x_{jl}^{data})^2 + \epsilon_{jl}^2 \right], \qquad (7)$$

where ε_jl are the residuals given by (4)-(6). The unconstrained loss function is given by (7) without the residuals.
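In code, the constrained loss (7) might read as follows (a sketch reusing the `residuals` helper from above; dropping the squared-residual term gives the unconstrained loss):

```python
def constrained_loss(F_z, x_data, z, a):
    """Mini-batch loss (7): squared data misfit plus squared
    residuals (4)-(6), averaged over the batch of size m."""
    eps = residuals(F_z, z, a)
    return np.mean(np.sum((F_z - x_data) ** 2 + eps ** 2, axis=1))
```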

We used a deep neural network for the representation of the flow map with 10 hidden layers of width 20. We note that because the solution of the Lorenz system acquires values outside of the range of the activation function, we have removed the activation function from the last layer of the generator (alternatively, we could have used batch normalization and kept the activation function). Fig. 1 compares the evolution of the prediction for x_1(t) of the neural network flow map, starting at t = 0 and computed with a timestep ∆t = 1.5 × 10^−2, to the ground truth (training trajectory) computed with the forward Euler scheme with timestep δt = 10^−4. We show plots only for x_1(t) since the results are similar for x_2(t) and x_3(t).

We make two observations. First, the prediction of the neural network flow map is able to follow with adequate accuracy the ground truth not only during the interval [0, 3] that was used for training, but also during the interval (3, 9]. Second, the explicit enforcing of constraints, i.e. the enforcing of the constraints (4)-(6) (see results in Fig. 1(b)), is better than the implicit enforcing of constraints.

Figure 1: Supervised learning. Comparison of ground truth for x_1(t) computed with the Euler scheme with timestep δt = 10^−4 (blue dots) and the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 (red crosses). Note that the timestep of the neural network flow map is 150 times larger than the timestep used to produce the training data. (a) noisy data without enforced constraints during training; (b) noisy data with enforced constraints during training (see text for details).
Unsupervised learning

For the case of unsupervised learning we have chosen GANs. To enforce constraints we consider a two-player minimax game with the modified value function V^const(D, G):

$$\min_G \max_D V^{const}(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, \epsilon_D(x))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z), \epsilon_G(z)))], \qquad (8)$$

where ε_D(x) ∼ N(0, (2δt)^2) is the constraint residual for the true sample (see the explanation in Section 2.4 in (Stinis et al. 2019)). Also, ε_G(z) is the constraint residual for the generator-created sample (see (4)-(6) above). The unconstrained value function is given by (8) without the residuals. Note that in our setup, the generator input distribution p_z(z) will be from the noise cloud around the training trajectory. On the other hand, the true data distribution p_data is the distribution of values of the (noiseless) training trajectory.
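A sketch of the corresponding input augmentation for the discriminator (illustrative names; `x_true` are samples of the noiseless trajectory and `G_z` the generator outputs for cloud inputs `z`):

```python
def discriminator_inputs(x_true, G_z, z, a, dt=1e-4, rng=None):
    """Augmented inputs for the value function (8): true samples are
    paired with a surrogate residual eps_D ~ N(0, (2*dt)^2); generated
    samples are paired with their actual residuals (4)-(6)."""
    rng = rng or np.random.default_rng(0)
    eps_d = rng.normal(0.0, 2.0 * dt, size=x_true.shape)  # std = 2*dt
    eps_g = residuals(G_z, z, a)                 # from the earlier sketch
    real = np.concatenate([x_true, eps_d], axis=1)  # fed as D(x, eps_D(x))
    fake = np.concatenate([G_z, eps_g], axis=1)     # fed as D(G(z), eps_G(z))
    return real, fake
```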
We have used for the GAN generator a deep neural network with 9 hidden layers of width 20 and for the discriminator a neural network with 2 hidden layers of width 20. The numbers of hidden layers both for the generator and the discriminator were chosen as the smallest that allowed the GAN training to reach its game-theoretic optimum without at the same time requiring large-scale computations. Fig. 2 compares the evolution of the prediction of the neural network flow map, starting at t = 0 and computed with a timestep ∆t = 1.5 × 10^−2, to the ground truth (training trajectory) computed with the forward Euler scheme with timestep δt = 10^−4.

Fig. 2(a) shows results for the implicit enforcing of constraints. We see that this is not enough to produce a neural network flow map with long-term predictive accuracy. Fig. 2(b) shows the significant improvement in the predictive accuracy when we enforce the constraints explicitly. The results for this specific example are not as good as in the case of supervised learning presented earlier. We note that training a GAN with or without constraints is a delicate numerical task, as explained in more detail in (Stinis et al. 2019). One needs to find the right balance between the expressive strengths of the generator and the discriminator (game-theoretic optimum) to avoid instabilities, but also to train the neural network flow map, i.e. the GAN generator, so that it has predictive accuracy. We also note that training with noiseless data is even more brittle. For the very few experiments where we avoided instability, the predicted solution from the trained GAN generator was not accurate at all.

Figure 2: Unsupervised learning (GAN). Comparison of ground truth for x_1(t) computed with the Euler scheme with timestep δt = 10^−4 (blue dots) and the neural network flow map (GAN generator) prediction with timestep ∆t = 1.5 × 10^−2 (red crosses). (a) noisy data without enforced constraints during training; (b) noisy data with enforced constraints during training (see text for details).

Reinforcement learning

The last case we examine is that of reinforcement learning (see (Stinis 2019) for notation and details about the constructions). In particular, we use a deterministic policy actor-critic method (Lillicrap et al. 2015). In our application we have identified the neural network flow map with the action policy. For the representation of the deterministic action policy, we used a deep neural network with 10 hidden layers of width 20. For the representation of the action-value function we used a deep neural network with 15 hidden layers of width 20. The task of learning an accurate representation of the action-value function is more difficult than that of finding the action policy. This justifies the need for a stronger network to represent the action-value function.

The training of actor-critic methods in their original form suffers from stability issues. Researchers have developed various modifications and tricks to stabilize training (see the review in (Pfau and Vinyals 2016)). The one that enabled us to stabilize results in the first place is that of target networks (Mnih et al. 2015; Lillicrap et al. 2015). The target network concept uses different networks to represent the action-value function and the action policy that appear in the expression for the target in the Bellman equation.
However, the predictive accuracy of the trained neural network flow map, i.e. the action policy, was extremely poor unless we also used our homotopy approach for the action-value function. This was true both when enforcing and when not enforcing constraints explicitly during training. With this in mind, we present results with and without the homotopy approach for the action-value function to highlight the accuracy improvement afforded by the use of homotopy.

After each iteration of the optimizer for the action-value function, the homotopy approach uses the quantity

$$\delta \times Q(s_t, \mu(s_t)) + (1 - \delta) \times [r_t + \gamma Q(s_{t+1}, \mu(s_{t+1}))] \qquad (9)$$

in the optimization for the action policy. Here, Q(s_t, µ(s_t)) is the action-value function, µ(s_t) is the action policy, r_t is the reward function, γ ∈ [0, 1] is the discount factor which expresses the degree of faith in future actions, and δ is the homotopy parameter (see Section 2.3 in (Stinis 2019)). We initialized the homotopy parameter δ at 0 and increased its value (until it reached 1) every 2000 training iterations.
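A sketch of this homotopy, with `Q` and `mu` standing for the critic and actor networks (illustrative names; the size of the δ increments is an assumption, since only the endpoints and the 2000-iteration cadence are stated above):

```python
def homotopy_quantity(Q, mu, s_t, s_next, r_t, delta, gamma=1.0):
    """Quantity (9): for delta = 1 this is the raw action value
    Q(s_t, mu(s_t)); for delta = 0 it is the Bellman target
    r_t + gamma * Q(s_{t+1}, mu(s_{t+1}))."""
    return (delta * Q(s_t, mu(s_t))
            + (1.0 - delta) * (r_t + gamma * Q(s_next, mu(s_next))))

def delta_schedule(iteration, n_stages=10, stage_len=2000):
    """delta starts at 0 and grows toward 1 in equal steps every
    stage_len = 2000 iterations (the step size is assumed)."""
    return min(1.0, (iteration // stage_len) / n_stages)
```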
We have set the discount factor to γ = 1, which is a difficult case. It corresponds to the case of a deterministic environment, which means that the same actions always produce the same rewards. This is the situation in our numerical experiments, where we are given a training trajectory that does not change. We have conducted more experiments for other values of γ, but a detailed presentation of those results will await a future publication.

Figure 3: Reinforcement learning (Actor-critic). Comparison of ground truth for x_1(t) computed with the Euler scheme with timestep δt = 10^−4 (blue dots), the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 with homotopy for the action-value function during training (red crosses) and the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 without homotopy for the action-value function during training (green triangles). (a) noisy data without enforced constraints during training; (b) noisy data with enforced constraints during training (see text for details).

The reward function (with constraints) for an input point z_j from the noise cloud at time t_j is

$$r(z_j, x_j^{data}) = -\sum_{l=1}^{3} \left[ (\mu_l(z_j) - x_{jl}^{data})^2 + \epsilon_{jl}^2 \right], \qquad (10)$$

where x_j^data is the noiseless point from the training trajectory at time t_j + ∆t. Also, µ_l(z_j) is the action at z_j, i.e. the prediction of the neural network flow map, and ε_jl is the constraint residual for the prediction (see (4)-(6) above).
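A sketch of the constrained reward (10), again reusing the `residuals` helper (illustrative names; `mu_z` is the action µ(z_j), i.e. the flow-map prediction):

```python
def constrained_reward(mu_z, x_data, z, a):
    """Reward (10): negative squared data misfit minus squared
    residuals (4)-(6); maximizing it drives both toward zero.
    Dropping eps**2 gives the unconstrained reward."""
    eps = residuals(mu_z, z, a)
    return -np.sum((mu_z - x_data) ** 2 + eps ** 2, axis=1)
```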
Fig. 3 presents results of the prediction performance of the neural network flow map when it was trained with and without the use of homotopy for the action-value function. In Fig. 3(a) we have results for the implicit enforcing of constraints, while in Fig. 3(b) for the explicit enforcing of constraints. We make two observations. First, both for implicit and explicit enforcing of the constraints, the use of homotopy leads to accurate results for long times. This is especially true for the case of explicit enforcing, which gave us some of the best results from all the numerical experiments we conducted for the different modes of learning. Second, if we do not use homotopy, the predictions are extremely poor both for implicit and explicit enforcing. Indeed, the green curve in Fig. 3(a), representing the prediction of x_1(t) for the case of implicit constraint enforcing without homotopy, is as inaccurate as it looks. It starts at 0 and within a few steps drops to a negative value and does not change much after that. The predictions for x_2(t) and x_3(t) are equally inaccurate.

Discussion and future work

We have presented a collection of results about the enforcing of known constraints for a dynamical system during the training of a neural network to represent the flow map of the system. We have provided ways that the constraints can be enforced in all three major modes of learning, namely supervised, unsupervised and reinforcement learning. In line with the law of scientific computing that one should build into an algorithm as much prior information as possible, we observe a striking improvement in performance when known constraints are enforced during training. There is an added benefit from training with noisy data, which corresponds to the incorporation of a restoring force in the dynamics of the system (see (Stinis et al. 2019) and (Stinis 2019) for more details). This restoring force is analogous to memory terms appearing in model reduction formalisms. In our framework, the reduction is in a temporal sense, i.e. it allows us to construct a flow map that remains accurate even though it is defined for large timesteps.

The model reduction connection opens an interesting avenue of research that makes contact with complex systems appearing in real-world problems. The use of larger timesteps for the neural network flow map than for the ground truth, without sacrificing too much accuracy, is important. We can imagine an online setting where observations come at sparsely placed time instants and are used to update the parameters of the neural network flow map. The use of sparse observations could be dictated by necessity, e.g. if it is hard
to obtain frequent measurements, or by efficiency, e.g. the local processing of data in field-deployed sensors can be costly. Thus, if the trained flow map is capable of accurate estimates using larger timesteps, then its successful updated training using only sparse observations becomes more probable.

The current approach approximates the flow map using a feed-forward neural network. It will be interesting to compare its performance with other approaches, most notably Recurrent Neural Networks, which have been used to model time series data (see e.g. the review (Bianchi et al. 2017)).

The constructions presented in the current work depend on a large number of details that can potentially affect their performance. A thorough study of the relative merits of enforcing constraints for the different modes of learning needs to be undertaken and will be presented in a future publication. We do believe though that the framework provides a promising research direction at the nexus of scientific computing and machine learning.
Acknowledgements

The author would like to thank Court Corley, Tobias Hagge, Nathan Hodas, George Karniadakis, Kevin Lin, Paris Perdikaris, Maziar Raissi, Alexandre Tartakovsky, Ramakrishna Tipireddy, Xiu Yang and Enoch Yeung for helpful discussions and comments. The work presented here was partially supported by the PNNL-funded "Deep Learning for Scientific Discovery Agile Investment" and the DOE-ASCR-funded "Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems (PhILMs)". Pacific Northwest National Laboratory is operated by Battelle Memorial Institute for DOE under Contract DE-AC05-76RL01830.
References

Baker, N.; Alexander, F.; Bremer, T.; Hagberg, A.; Kevrekidis, Y.; Najm, H.; Parashar, M.; Patra, A.; Sethian, J.; Wild, S.; and Willcox, K. 2019. Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence.

Barenblatt, G. I. 2003. Scaling. Cambridge University Press.

Berry, T.; Giannakis, D.; and Harlim, J. 2015. Nonparametric forecasting of low-dimensional dynamical systems. Phys. Rev. E 91:032915.

Bianchi, F. M.; Maiorino, E.; Kampffmeyer, M. C.; Rizzi, A.; and Jenssen, R. 2017. An overview and comparative analysis of recurrent neural networks for short term load forecasting. arXiv preprint arXiv:1705.04378.

Chen, R. T. Q.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. 2018. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366v3.

Chorin, A. J., and Stinis, P. 2006. Problem reduction, renormalization and memory. Communications in Applied Mathematics and Computational Science 1:1–27.

Felsberger, L., and Koutsourelakis, P. 2018. Physics-constrained, data-driven discovery of coarse-grained dynamics. arXiv preprint arXiv:1802.03824v1.

Goldenfeld, N. 1992. Lectures on Phase Transitions and the Renormalization Group. Perseus Books.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. Advances in neural information processing systems 2672–2680.

Han, J.; Jentzen, A.; and E, W. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115(34):8505–8510.

Harlim, J.; Jiang, S. W.; Liang, S.; and Yang, H. 2019. Machine learning for prediction with missing dynamics. arXiv preprint arXiv:1910.05861.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Ma, H.; Leng, S.; Aihara, K.; Lin, W.; and Chen, L. 2018. Randomly distributed embedding making short-term high-dimensional data predictable. Proceedings of the National Academy of Sciences 115(43):E9994–E10002.

Ma, C.; Wang, J.; and E, W. 2018. Model reduction with memory and the machine learning of dynamical systems. arXiv preprint arXiv:1808.04258v1.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.

Pfau, D., and Vinyals, O. 2016. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. 2018. Numerical Gaussian processes for time-dependent and nonlinear partial differential equations. SIAM J. Sci. Comput. 40:A172–A198.

Sirignano, J., and Spiliopoulos, K. 2018. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics 375:1339–1364.

Stinis, P.; Hagge, T.; Tartakovsky, A. M.; and Young, E. 2019. Enforcing constraints for interpolation and extrapolation in generative adversarial networks. Journal of Computational Physics 397.

Stinis, P. 2019. Enforcing constraints for time series prediction in supervised, unsupervised and reinforcement learning. arXiv preprint arXiv:1905.07501.

Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, 1057–1063. Cambridge, MA, USA: MIT Press.

Wan, Z.; Vlachas, P.; Koumoutsakos, P.; and Sapsis, T. 2018. Data-assisted reduced-order modeling of extreme events in complex dynamical systems. PLoS ONE 13:e0197704.