Towards Modeling Physically-Consistent, Chaotic Spatiotemporal Dynamics with Echo State Networks

Matthew Ziemann∗†, Alisha Sharma∗‡, Kaiyan Shi∗, and Yiling Qiao∗
∗ University of Maryland, College Park, MD 20742
† Army Research Laboratory, Adelphi, MD 20783
‡ Naval Research Laboratory, Washington, DC 20375
{mrziema2,ajsharma,kshi12,yilingq}@umd.edu

The first two authors contributed equally to this work.
Distribution A: Approved for public release; distribution unlimited. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
This study explores how echo state networks (ESNs) can be used in time-series forecasting of chaotic physics. We compare the performance of a basic ESN with two physics-informed variants, tested on the canonical Lorenz attractor. We then apply the ESN to a large-scale atmospheric model and a larger real-world weather dataset to test its ability to scale to large spatiotemporal systems. We find that a traditional ESN, when properly tuned, can outperform our equivalent physics-informed methods. We also find that the ESN is capable of accurately predicting the global evolution of the atmospheric primitive equations over short time frames (∼67 hrs), but struggles to accurately predict real-world data.

Introduction
Many useful scientific simulations, such as large-scale climate models (Fig. 1), contain computationally intractable subroutines that limit the pace and accessibility of research in critical areas. Neural networks are gaining popularity as a way of sidestepping these bottlenecks due to their reduced computational complexity, improved scalability, and low post-training cost (Frank, Drikakis, and Charissis 2020). However, they also have some significant disadvantages that prevent widespread adoption. Notably, they require massive amounts of data and computational resources to train. Furthermore, without explicit knowledge of physics, their predictions can be unreliable and break important physical laws, which can be disastrous in scientific simulations. This is particularly true of the difficult regimes that engineers often want to bypass, such as chaotic, stiff, or multiscale systems.

Figure 1: Velocity potential computed using the Climate Forecast System (CFS) (Saha et al. 2010, 2014).

Echo state networks (ESNs), a simplification of recurrent neural networks, address several of these challenges. ESNs are one of the most effective data-driven architectures for time-series forecasting, particularly for chaotic dynamical systems (Aggarwal 2018; Jaeger 2001). While they share the temporal invariance of standard recurrent neural networks, all parameters aside from the final fully-connected layer are fixed, making them extremely cheap to train. Unfortunately, while ESNs excel at modeling low-dimensional dynamics, they scale poorly to high-dimensional input (Aggarwal 2018), which poses a challenge for problems with many degrees of freedom, such as those with spatial parameters. Furthermore, while they can accurately model complex chaotic systems, they are purely data-driven: they have no qualms about breaking the laws of physics.

Researchers have suggested various methods to overcome these challenges. Studies suggest that informing neural networks with physics-based knowledge can yield reduced requirements for data and training time, as well as significantly improving overall performance. Two examples of these methods are the "Combined Hybrid/Parallel Prediction" (CHyPP) method (Wikner et al. 2020) and physics-informed loss constraints (Raissi, Perdikaris, and Karniadakis 2019; Doan, Polifke, and Magri 2019). We present a comparative study of these techniques to better understand their respective benefits and limitations on small-scale chaotic systems, and we extend these methods to explore their feasibility in large-scale dynamical systems.
Echo State Networks
Recurrent neural networks (RNNs) are a natural model for dynamical systems, but they can be difficult to train in practice. Echo state networks (ESNs) are a promising simplification of RNNs: they freeze most of the parameters, swapping a highly non-convex training process for a simple convex regression, while still retaining much of the sequential structure and expressive power of general RNNs (Jaeger 2001).

Figure 2: Diagram of a basic echo state network (ESN): the input u_t passes through an input-scaling layer W_in into the reservoir W (which carries the previous-step context), a skip connection feeds the input forward, and a trainable readout layer W_out produces the predicted next state û_{t+1}. The readout layer is trained by ridge regression to map the nonlinear reservoir projections to the target output state.

Conceptually, ESNs work similarly to other weighted-average methods: they use the architecture of a general RNN to (a) generate high-dimensional nonlinear projections of the inputs and (b) learn an optimal weighted average of these projections to predict the next state (Figure 2). The input scaling and reservoir layers are initialized according to the "echo state property" (Jaeger 2001) and are then frozen. Learning the readout layer is a convex minimization problem between the reservoir outputs and the output state; the global minimum can be found cheaply through closed-form ridge regression or by using a convex solver (Jaeger 2001).
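To make this recipe concrete, the following is a minimal, self-contained sketch of a basic ESN in Julia (the language used for the experiments below). The helper names and default hyperparameters are illustrative choices only, not the API of any particular reservoir computing package:

```julia
using LinearAlgebra, Random, SparseArrays

# Frozen input and reservoir weights: input scaling drawn from U(-σ, σ),
# sparse reservoir with average connectivity d, rescaled to spectral radius α.
function build_esn(n_in, n_res; σ = 0.1, d = 6, α = 0.9, rng = MersenneTwister(0))
    W_in = σ .* (2 .* rand(rng, n_res, n_in) .- 1)
    W = sprandn(rng, n_res, n_res, d / n_res)
    W *= α / maximum(abs, eigvals(Matrix(W)))
    return W_in, W
end

# Drive the reservoir with the input sequence U (columns are timesteps)
# and collect the high-dimensional nonlinear reservoir states.
function reservoir_states(W_in, W, U)
    R = zeros(size(W, 1), size(U, 2))
    r = zeros(size(W, 1))
    for t in 1:size(U, 2)
        r = tanh.(W * r .+ W_in * U[:, t])
        R[:, t] = r
    end
    return R
end

# Closed-form ridge regression for the readout layer:
# W_out = Y Rᵀ (R Rᵀ + βI)⁻¹ maps reservoir states R to targets Y.
ridge_readout(R, Y, β) = (Y * R') / (R * R' + β * I)
```

For one-step-ahead training, the target matrix Y is simply the input sequence shifted forward by one timestep; autonomous forecasting then feeds each prediction W_out * r back in as the next input. The input skip connection shown in Figure 2, as well as the leak and output transformations used in the experiments below, are omitted here for brevity.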
Physics Informed Echo State Networks
We focus on two mechanisms for embedding physics into the basic ESN architecture described above: a physical consistency loss function and a hybrid data-driven/knowledge-based predictor. These approaches are described in the following sections.

Physical Consistency Loss
A direct way to constrain the network is to explicitly embed the system's governing dynamics into the optimization of the readout layer. Through training, the ESN should learn to emulate the system's governing dynamics.
Raissi, Perdikaris, and Karniadakis (2019) introduced an interesting discrete mechanism for this using Runge-Kutta solvers, a family of iterative algorithms for numerically solving systems of ordinary differential equations. Given an initial value and a set of governing equations, Runge-Kutta solvers estimate the next timestep by taking a weighted average of several intermediate predictions called stages. This formulation fits naturally with the discrete nature of ESNs.
If we consider the ESN to be a surrogate Runge-Kutta solver, we can verify its prediction by applying the Runge-Kutta equations in reverse at each stage. If the ESN is a good surrogate, this "round trip" will end back at the initial state. Training minimizes the distance between the true and calculated initial states, providing a strong error signal and bringing the ESN predictions closer to the solver. This physical consistency term has the side effect of making training harder, turning a convex problem into a highly nonconvex one.

Hybrid Prediction Model
As many systems of interest have known reduced or approximate formulations, an alternate strategy for improving physical consistency is based on the idea that correcting similar (but incorrect) dynamics is easier than learning a system from scratch. By concatenating steps of a reduced-order or flawed model of the physical system to the reservoir output, we can reformulate the learning problem as correcting the flawed dynamics instead of predicting them from scratch (Wikner et al. 2020). As in the basic ESN, training remains convex; a minimal sketch of the resulting readout is shown below.
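The sketch below illustrates this bookkeeping in Julia, reusing the ridge_readout helper from the earlier ESN sketch. Here flawed_step stands in for whatever cheap approximate model is available; all names are our own illustrative choices, not the CHyPP reference implementation:

```julia
# Hybrid readout features: append the flawed model's one-step prediction
# to each reservoir state before fitting the (still convex) ridge readout.
# `flawed_step(u)` is any cheap approximate model of the true dynamics;
# R holds reservoir states and U the matching inputs (columns are timesteps).
function hybrid_features(R, U, flawed_step)
    G = reduce(hcat, [flawed_step(U[:, t]) for t in 1:size(U, 2)])
    return vcat(R, G)
end

# Training reuses the same closed-form solution as the basic ESN, e.g.
# W_out = ridge_readout(hybrid_features(R, U, flawed_step), Y, β)
```

Because the flawed model's estimate enters only as extra readout features, the fit remains a closed-form ridge regression, though the readout matrix grows with the size of the appended prediction.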
Preliminary Results
Three echo state network (ESN) configurations were tested: a basic ESN (BaseESN), an ESN trained with a physical consistency loss (PhyESN), and a hybrid numerical/data-driven approach in which a BaseESN was used to refine a reduced-order estimate (HyESN). Models were implemented in Julia using the SciML ecosystem (Martinuzzi 2020).
We first trained our three models to predict the spatiotemporal evolution of a simple, well-studied chaotic attractor: the Lorenz attractor. We then applied the best-performing ESN to two large-scale dynamical systems of increasing complexity to evaluate its ability to scale to more difficult problem sets.
The models were evaluated by two metrics. First, relative accuracy was measured by the time-averaged root mean squared error (RMSE) over a fixed prediction time. Second, the dynamic stability of a model was measured by the time horizon, the time for which the normalized error of the model predictions stays under a specified error tolerance ε_max (Wikner et al. 2020; Doan, Polifke, and Magri 2019).

Lorenz Attractor
The first set of experiments focused on the chaotic Lorenz attractor, a canonical problem in chaotic dynamics modeling. The Lorenz attractor is the chaotic regime of a dynamical system derived from an atmospheric surface convection model (Lorenz 1963). The governing equations are given below:

\[
\frac{\partial u}{\partial t} =
\begin{bmatrix}
\sigma (u_2 - u_1) \\
u_1 (\rho - u_3) - u_2 \\
u_1 u_2 - \beta u_3
\end{bmatrix}
\tag{1}
\]

System behavior is controlled by the parameters σ, ρ, and β; the parameter values most commonly chosen to study chaotic dynamics are σ = 10, ρ = 28, and β = 8/3.
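For illustration, a trajectory of this system can be generated by integrating Eq. (1) with a fixed-step fourth-order Runge-Kutta scheme at those parameter values. The step size and initial condition in the Julia sketch below are illustrative choices, not necessarily the settings used in our experiments:

```julia
# Lorenz-63 right-hand side with the standard chaotic parameters.
lorenz(u; σ = 10.0, ρ = 28.0, β = 8 / 3) =
    [σ * (u[2] - u[1]),
     u[1] * (ρ - u[3]) - u[2],
     u[1] * u[2] - β * u[3]]

# One classical fixed-step fourth-order Runge-Kutta step.
function rk4_step(f, u, dt)
    k1 = f(u)
    k2 = f(u .+ dt / 2 .* k1)
    k3 = f(u .+ dt / 2 .* k2)
    k4 = f(u .+ dt .* k3)
    return u .+ dt / 6 .* (k1 .+ 2 .* k2 .+ 2 .* k3 .+ k4)
end

# Integrate a trajectory and store it as a 3 × n_steps matrix of training data.
function lorenz_trajectory(u0, dt, n_steps)
    U = zeros(3, n_steps)
    u = copy(u0)
    for t in 1:n_steps
        u = rk4_step(lorenz, u, dt)
        U[:, t] = u
    end
    return U
end

U_train = lorenz_trajectory([1.0, 1.0, 1.0], 0.02, 5_000)  # illustrative step size and initial state
```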
Numerical results are summarized in Table 1; results are calculated for the best models. Each hyperparameter configuration was tested over several thousand random initializations for BaseESN and HyESN; in contrast, PhyESN was only tested over 5 initializations per configuration due to its high training cost. Results are reported in Lyapunov times, the characteristic timescale of a chaotic system, as in Wikner et al. (2020) and Doan, Polifke, and Magri (2019).

BaseESN. In the first experiment, we modeled the Lorenz system with a basic ESN (BaseESN). The basic ESN architecture includes an input layer initialized with a random uniform distribution with values in [−σ, σ], a random sparse reservoir with average connectivity d and spectral radius α, and a T1 nonlinear transformation of the reservoir output (Chattopadhyay, Hassanzadeh, and Subramanian 2020). The readout layer was trained using ridge regression with regularization weight β. A skip connection was included from the input to the reservoir output, and the reservoir leaked previous inputs to the readout layer with leak coefficient a.
The BaseESN predictions diverged from the true system after approximately 15 Lyapunov times, though the predictions continued to follow a similar chaotic pattern (Figure 3). BaseESN was also extremely fast to train and predict: training the network using closed-form ridge regression and predicting 1000 new points (approximately 20 Lyapunov times) each took ≈ 50 ms (Table 1). We also measured training time with a BFGS solver, an iterative nonlinear minimizer, to provide a performance baseline for PhyESN; this approach was nearly four orders of magnitude slower (≈ 3 s).

Figure 3: BaseESN Lorenz predictions and error.

PhyESN. The second experiment (PhyESN) took the same baseline ESN architecture and added a physical consistency loss. This physical consistency loss function was significantly more difficult to optimize than the ridge regression in BaseESN: not only is it nonlinear, it is nonconvex. To account for this, the readout layer was trained using BFGS. Qualitatively, the PhyESN results looked similar to BaseESN. However, despite the physics embedded in the loss function, the best PhyESN models had slightly worse accuracy and stability than the BaseESN. Furthermore, training took significantly more time: BFGS struggled to converge, taking 5-7 orders of magnitude longer (depending on the run) than the closed-form BaseESN training, and the final solution was a local (not global) minimum.

HyESN. The trained HyESN predictions diverged from the true system after approximately 9 Lyapunov times (Figure 4), which is worse than the results produced by BaseESN. However, HyESN frequently exhibits an interesting recovery behavior: in Figure 4, the dynamics appear to recover between 11 and 17 Lyapunov times. This can also be seen in the numerical results: while BaseESN has a substantially longer time horizon, the error over 20 Lyapunov times is 12% lower for HyESN. Training time for HyESN was consistently slower than that of BaseESN, though they were similar; while both are trained with closed-form ridge regression, the HyESN readout layer was larger due to the flawed dynamics input.

Figure 4: Hybrid model (HyESN) Lorenz predictions.

Table 1: Best results, Lorenz models. The best result in each column is marked with an asterisk (shown in bold in the original table).

Net                  | RMSE (20 λ_max t) | Time horizon (λ_max t) | Train time (s)
BaseESN (literature) | 1.69              | 5.10                   | 1.36e-3 (closed-form)
BaseESN (tuned)      | 1.09              | 15.13*                 | 1.08e-3 (closed-form)*; 2.98 (BFGS)
HyESN                | 8.91e-1*          | 8.97                   | 5.59 (closed-form)
PhyESN               | 1.8               | 7.2                    | 3.0e+2 (BFGS)

Weather Forecasting
We took a two-part approach to evaluating the behavior of ESNs on large-scale systems. First, we trained a BaseESN (our best-performing ESN) on data generated by a simple atmospheric model and evaluated its prediction error. Then we trained a BaseESN on real-world data from the National Oceanic and Atmospheric Administration (NOAA)'s Climate Forecast System (CFS) (Saha et al. 2010, 2014) and evaluated its prediction error.
We begin with the simple atmospheric model. The governing system of nonlinear partial differential equations for atmospheric dynamics, known as the primitive equations, was numerically solved to generate data for pressure (P), temperature (T), and the latitudinal and longitudinal components of wind velocity (u and v) (Ehrendorfer 2011). This system is commonly called the dynamical core, as it is the core of most numerical weather models. These four parameters were solved on the surface of the Earth on a grid of 64 longitudinal points and 32 latitudinal points. We solved u, v, and T at three altitude levels (surface, mid-, and high-altitude); pressure was solved only at the surface, bringing the total number of ESN input parameters to 20,480, a significant increase from the 3 input parameters of the Lorenz system. The model was trained over 300 days of data with 20-minute timesteps.
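For concreteness, this input count comes from flattening every gridded field into a single state vector per timestep. A small Julia sketch of that bookkeeping (array names are illustrative):

```julia
# Flatten the dynamical core fields into one ESN input vector per timestep.
# u, v, T are nlon × nlat × nlev arrays; P is the nlon × nlat surface pressure.
flatten_state(u, v, T, P) = vcat(vec(u), vec(v), vec(T), vec(P))

nlon, nlat, nlev = 64, 32, 3
# Three 3-level fields plus surface pressure give 10 values per grid point:
# 64 × 32 × 10 = 20,480 ESN input parameters, matching the count above.
n_inputs = length(flatten_state(zeros(nlon, nlat, nlev), zeros(nlon, nlat, nlev),
                                zeros(nlon, nlat, nlev), zeros(nlon, nlat)))  # 20480
```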
With some tuning, the ESN performed quite well in predicting the evolution of states. With our best-performing model, the ESN predicted over 200 timesteps (>67 hrs) while remaining below the maximum normalized error cutoff threshold (0.4), as seen in Figure 5. Notably, the ESN predictions gradually accumulate error over time and do not diverge significantly. When the results finally do diverge, the ESN continues to predict realistic values, though they no longer reflect the actual evolution of the system.

Figure 5: Plot of BaseESN normalized error over time for the primitive equations system.

We then trained the BaseESN on real-world data from the CFS Reanalysis (CFSR). We utilized hourly data from January 2005 to December 2006 on a 73×144 spatial grid. Due to memory constraints, we restricted the data to u, v, P, and T rather than utilizing the full 73 parameters of CFSR, and again used three altitude levels for u, v, and T. This increased the number of input parameters to 105,120, approximately 5 times more than required for the dynamical core system.
The BaseESN did not perform as well on the real-world data as it did on the dynamical core system. As seen in Figure 6, the predictions for the velocity components diverged from acceptable error levels within four timesteps, likely because the velocities evolve more quickly. Pressure and temperature both evolve much more slowly over time, so the BaseESN was able to remain within acceptable error levels for much longer. Despite the quick divergence from real-world values, the BaseESN continued to make realistic (if incorrect) predictions until around 300 timesteps, where it diverged quite suddenly from realistic values. It is worth noting that we were unable to perform significant hyperparameter tuning on the CFSR system, as the computational cost was quite high, requiring multiple days per training run.

Figure 6: Plot of BaseESN normalized error over time for the CFSR data.

Discussion
We found that a highly tuned BaseESN generally outperformed the physics-informed techniques (PhyESN and HyESN) on the Lorenz system when measured on prediction accuracy and dynamic stability. This result underscores the importance of hyperparameter tuning in deep learning approaches. The primary advantage of BaseESN is the low cost of finding the global minimum in low-dimensional systems. Hyperparameter selection and reservoir initialization significantly impacted model performance. The stark difference between the BaseESN (literature) and BaseESN (tuned) results in Table 1 illustrates this well: the tuned BaseESN is stable for approximately 3× longer than the baseline model, which uses hyperparameters commonly found in the literature (Wikner et al. 2020; Doan, Polifke, and Magri 2019). The BaseESN's cheap training and prediction cost allow for extensive parameter searches and repeated trials, even on modest hardware.

PhyESN does not share this benefit. Conceptually, the physics constraints were intended to improve the physical consistency (and thus reliability) of the model; however, the practical concerns overshadowed any benefit. PhyESN trains by minimizing the residual of a round-trip ODE solver step. In other words, it takes an easy problem and turns it into a very hard (nonconvex) one. There is no longer a clear closed-form solution, and optimizers are more likely to get stuck in bad local minima. Qualitatively, the loss landscape seemed rough and difficult to traverse: training times for the network varied wildly (though all were many orders of magnitude slower than ridge regression), and first-order gradient descent techniques failed to converge. This technique was a poor match for this experiment, but it may perform better in different network architectures or for systems with different dynamics.

Conceptually, HyESN keeps many of the benefits of BaseESN, including the convex formulation and cheap training cost; however, the hybrid approach did not improve the model's time horizon. This may be because the Lorenz system is relatively simple and thus does not benefit from the flawed model's "hints". One interesting feature of HyESN was its recovery ability. As noted previously, HyESN predictions frequently appeared to recover across reservoir initializations, which can be seen in the low long-term prediction error reported in Table 1. If this behavior persists in other systems, it could be a benefit in long-running simulations.
The results from our experiments with the primitive equations were promising and hint that the ESN can handle high-dimensional inputs better than we anticipated. However, as expected, the computational cost of training ESNs and generating predictions does not scale well with large numbers of inputs, which hurts their feasibility in practice. The use of ESNs for large-scale systems would benefit from research into feature reduction techniques, for example using a convolutional autoencoder to encode inputs and decode outputs. Faster training and prediction with ESNs would enable a greater degree of hyperparameter tuning for large systems, which may yield stronger results for real-world systems. It would also enable the use of more input data, which may lead to further improvements for real-world systems.

References
Aggarwal, C. C. 2018. Neural Networks and Deep Learning: A Textbook. Cham: Springer International Publishing. ISBN 978-3-319-94462-3, 978-3-319-94463-0. doi:10.1007/978-3-319-94463-0.
Chattopadhyay, A.; Hassanzadeh, P.; and Subramanian, D. 2020. Data-Driven Prediction of a Multi-Scale Lorenz 96 Chaotic System Using Deep Learning Methods: Reservoir Computing, ANN, and RNN-LSTM. Nonlinear Processes in Geophysics 27(3): 373–389. doi:10/ghnj74.
Doan, N. A. K.; Polifke, W.; and Magri, L. 2019. A Physics-Aware Machine to Predict Extreme Events in Turbulence. arXiv:1912.10994.
Ehrendorfer, M. 2011. Spectral Numerical Weather Prediction Models. Society for Industrial and Applied Mathematics. ISBN 978-1611971989.
Frank, M.; Drikakis, D.; and Charissis, V. 2020. Machine-Learning Methods for Computational Science and Engineering. Computation 8: 15.
Jaeger, H. 2001. The "Echo State" Approach to Analysing and Training Recurrent Neural Networks - with an Erratum Note. German National Research Center for Information Technology GMD Technical Report 148(34): 13.
Lorenz, E. N. 1963. Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences 20(2): 130–141.
Martinuzzi, F. 2020. SciML/ReservoirComputing.jl. SciML Open Source Scientific Machine Learning.
Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019. Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations. Journal of Computational Physics 378: 686–707. ISSN 0021-9991. doi:10/gfzbvx.
Saha, S.; Moorthi, S.; Pan, H.-L.; Wu, X.; Wang, J.; Nadiga, S.; Tripp, P.; Kistler, R.; Woollen, J.; Behringer, D.; et al. 2010. The NCEP Climate Forecast System Reanalysis. Bulletin of the American Meteorological Society 91(8): 1015–1058.
Saha, S.; Moorthi, S.; Wu, X.; Wang, J.; Nadiga, S.; Tripp, P.; Behringer, D.; Hou, Y.-T.; Chuang, H.-y.; Iredell, M.; et al. 2014. The NCEP Climate Forecast System Version 2. Journal of Climate 27(6): 2185–2208.
Wikner, A.; Pathak, J.; Hunt, B.; Girvan, M.; Arcomano, T.; Szunyogh, I.; Pomerance, A.; and Ott, E. 2020. Combining Machine Learning with Knowledge-Based Modeling for Scalable Forecasting and Subgrid-Scale Closure of Large, Complex, Spatiotemporal Systems. Chaos: An Interdisciplinary Journal of Nonlinear Science 30(5): 053111. ISSN 1054-1500. doi:10/ggxrjq.