Towards Modeling Physically-Consistent, Chaotic Spatiotemporal Dynamics with Echo State Networks

Matthew Ziemann∗†, Alisha Sharma∗‡, Kaiyan Shi∗, and Yiling Qiao∗
∗ University of Maryland, College Park, MD 20742
† Army Research Laboratory, Adelphi, MD 20783
‡ Naval Research Laboratory, Washington, DC 20375
{mrziema2,ajsharma,kshi12,yilingq}@umd.edu

The first two authors contributed equally to this work.
Distribution A: Approved for public release; distribution unlimited. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
This study explores how echo state networks (ESNs) can be used in time-series forecasting of chaotic physics. We compare the performance of a basic ESN with two physics-informed variants, tested on the canonical Lorenz attractor. We then apply the ESN to a large-scale atmospheric model and a larger real-world weather dataset to test its ability to scale to large spatiotemporal systems. We find that a traditional ESN, when properly tuned, can outperform our equivalent physics-informed methods. We also find that the ESN is capable of accurately predicting the global evolution of the atmospheric primitive equations over short time frames (∼67 hrs), but struggles to accurately predict real-world data.

Introduction
Many useful scientific simulations, such as large-scale climate models (Fig. 1), contain computationally intractable subroutines that limit the pace and accessibility of research in critical areas. Neural networks are gaining popularity as a way of sidestepping these bottlenecks due to their reduced computational complexity, improved scalability, and low post-training cost (Frank, Drikakis, and Charissis 2020). However, they also have some significant disadvantages that prevent widespread adoption. Notably, they require massive amounts of data and computational resources to train. Furthermore, without explicit knowledge of physics, their predictions can be unreliable and break important physical laws, which can be disastrous in scientific simulations. This is particularly true of the difficult regimes that engineers often want to bypass, such as chaotic, stiff, or multiscale systems.

Figure 1: Velocity potential computed using the Climate Forecast System (CFS) (Saha et al. 2010, 2014).

Echo state networks (ESNs), a simplification of recurrent neural networks, address several of these challenges. ESNs are one of the most effective data-driven architectures for time-series forecasting, particularly for chaotic dynamical systems (Aggarwal 2018; Jaeger 2001). While they share the temporal invariance of standard recurrent neural networks, all parameters aside from the final fully-connected layer are fixed, making them extremely cheap to train. Unfortunately, while ESNs excel at modeling low-dimensional dynamics, they scale poorly to high-dimensional input (Aggarwal 2018), which poses a challenge for problems with many degrees of freedom, such as those with spatial parameters. Furthermore, while they can accurately model complex chaotic systems, they are purely data-driven: they have no qualms about breaking the laws of physics.

Researchers have suggested various methods to overcome these challenges. Studies suggest that informing neural networks with physics-based knowledge can yield reduced requirements for data and training time, as well as significantly improving overall performance. Two examples of these methods are the "Combined Hybrid/Parallel Prediction" (CHyPP) method (Wikner et al. 2020) and physics-informed loss constraints (Raissi, Perdikaris, and Karniadakis 2019; Doan, Polifke, and Magri 2019). We present a comparative study of these techniques to better understand their respective benefits and limitations on small-scale chaotic systems, and we extend these methods to explore their feasibility in large-scale dynamical systems.
Echo State Networks
Recurrent neural networks (RNNs) are a natural model for dynamical systems, but they can be difficult to train in practice. Echo state networks (ESNs) are a promising simplification of RNNs: they freeze most of the parameters, swapping a highly non-convex training process for a simple convex regression, while still retaining much of the sequential structure and expressive power of general RNNs (Jaeger 2001).

Figure 2: Diagram of a basic echo state network (ESN): the input u_t passes through an input-scaling layer W_in into the reservoir W (which carries the previous-step context), a skip connection feeds the input forward, and a trainable readout layer W_out produces the predicted next state û_{t+1}. The readout layer is trained by ridge regression to map the nonlinear reservoir projections to the target output state.

Conceptually, ESNs work similarly to other weighted-average methods: they use the architecture of a general RNN to (a) generate high-dimensional nonlinear projections of the inputs and (b) learn an optimal weighted average of these projections to predict the next state (Figure 2). The input scaling and reservoir layers are initialized according to the "echo state property" (Jaeger 2001) and are then frozen. Learning the readout layer is a convex minimization problem between the reservoir outputs and the output state; the global minimum can be found cheaply through closed-form ridge regression or by using a convex solver (Jaeger 2001).
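To make this recipe concrete, the following is a minimal, self-contained sketch of a basic ESN in Julia (the language used for the experiments below). The helper names and default hyperparameters are illustrative choices only, not the API of any particular reservoir computing package:

```julia
using LinearAlgebra, Random, SparseArrays

# Frozen input and reservoir weights: input scaling drawn from U(-σ, σ),
# sparse reservoir with average connectivity d, rescaled to spectral radius α.
function build_esn(n_in, n_res; σ = 0.1, d = 6, α = 0.9, rng = MersenneTwister(0))
    W_in = σ .* (2 .* rand(rng, n_res, n_in) .- 1)
    W = sprandn(rng, n_res, n_res, d / n_res)
    W *= α / maximum(abs, eigvals(Matrix(W)))
    return W_in, W
end

# Drive the reservoir with the input sequence U (columns are timesteps)
# and collect the high-dimensional nonlinear reservoir states.
function reservoir_states(W_in, W, U)
    R = zeros(size(W, 1), size(U, 2))
    r = zeros(size(W, 1))
    for t in 1:size(U, 2)
        r = tanh.(W * r .+ W_in * U[:, t])
        R[:, t] = r
    end
    return R
end

# Closed-form ridge regression for the readout layer:
# W_out = Y Rᵀ (R Rᵀ + βI)⁻¹ maps reservoir states R to targets Y.
ridge_readout(R, Y, β) = (Y * R') / (R * R' + β * I)
```

For one-step-ahead training, the target matrix Y is simply the input sequence shifted forward by one timestep; autonomous forecasting then feeds each prediction W_out * r back in as the next input. The input skip connection shown in Figure 2, as well as the leak and output transformations used in the experiments below, are omitted here for brevity.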
Physics Informed Echo State Networks
We focus on two mechanisms for embedding physics into the basic ESN architecture described above: a physical consistency loss function and a hybrid data-driven/knowledge-based predictor. These approaches are described in the following sections.

Physical Consistency Loss
A direct way to constrain the network is to explicitly embed the system's governing dynamics into the optimization of the readout layer. Through training, the ESN should learn to emulate the system's governing dynamics.
Raissi, Perdikaris, and Karniadakis (2019) introduced an interesting discrete mechanism for this using Runge-Kutta solvers, a family of iterative algorithms for numerically solving systems of ordinary differential equations. Given an initial value and a set of governing equations, Runge-Kutta solvers estimate the next timestep by taking a weighted average of several intermediate predictions called stages. This formulation fits naturally with the discrete nature of ESNs.
If we consider the ESN to be a surrogate Runge-Kutta solver, we can verify its prediction by applying the Runge-Kutta equations in reverse at each stage. If the ESN is a good surrogate, this "round trip" will end back at the initial state. Training minimizes the distance between the true and calculated initial states, providing a strong error signal and bringing the ESN predictions closer to the solver. This physical consistency term has the side effect of making training harder, turning a convex problem into a highly nonconvex one.

Hybrid Prediction Model
As many systems of interest have known reduced or approximate formulations, an alternate strategy for improving physical consistency is based on the idea that correcting similar (but incorrect) dynamics is easier than learning a system from scratch. By concatenating steps of a reduced-order or flawed model of the physical system to the reservoir output, we can reformulate the learning problem as correcting the flawed dynamics instead of predicting them from scratch (Wikner et al. 2020). As in the basic ESN, training remains convex; a minimal sketch of the resulting readout is shown below.
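The sketch below illustrates this bookkeeping in Julia, reusing the ridge_readout helper from the earlier ESN sketch. Here flawed_step stands in for whatever cheap approximate model is available; all names are our own illustrative choices, not the CHyPP reference implementation:

```julia
# Hybrid readout features: append the flawed model's one-step prediction
# to each reservoir state before fitting the (still convex) ridge readout.
# `flawed_step(u)` is any cheap approximate model of the true dynamics;
# R holds reservoir states and U the matching inputs (columns are timesteps).
function hybrid_features(R, U, flawed_step)
    G = reduce(hcat, [flawed_step(U[:, t]) for t in 1:size(U, 2)])
    return vcat(R, G)
end

# Training reuses the same closed-form solution as the basic ESN, e.g.
# W_out = ridge_readout(hybrid_features(R, U, flawed_step), Y, β)
```

Because the flawed model's estimate enters only as extra readout features, the fit remains a closed-form ridge regression, though the readout matrix grows with the size of the appended prediction.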
Preliminary Results
Three echo state network (ESN) configurations were tested: a basic ESN (BaseESN), an ESN trained with a physical consistency loss (PhyESN), and a hybrid numerical/data-driven approach in which a BaseESN was used to refine a reduced-order estimate (HyESN). Models were implemented in Julia using the SciML ecosystem (Martinuzzi 2020).
We first trained our three models to predict the spatiotemporal evolution of a simple, well-studied chaotic attractor: the Lorenz attractor. We then applied the best-performing ESN to two large-scale dynamical systems of increasing complexity to evaluate its ability to scale to more difficult problem sets.
The models were evaluated by two metrics. First, relative accuracy was measured by the time-averaged root mean squared error (RMSE) over a fixed prediction time. Second, the dynamic stability of a model was measured by the time horizon, the time for which the normalized error of the model predictions stays under a specified error tolerance ε_max (Wikner et al. 2020; Doan, Polifke, and Magri 2019).

Lorenz Attractor
The first set of experiments focused on the chaotic Lorenz attractor, a canonical problem in chaotic dynamics modeling. The Lorenz attractor is the chaotic regime of a dynamical system derived from an atmospheric surface convection model (Lorenz 1963). The governing equations are given below:

\[
\frac{\partial u}{\partial t} =
\begin{bmatrix}
\sigma (u_2 - u_1) \\
u_1 (\rho - u_3) - u_2 \\
u_1 u_2 - \beta u_3
\end{bmatrix}
\tag{1}
\]

System behavior is controlled by the parameters σ, ρ, and β; the parameter values most commonly chosen to study chaotic dynamics are σ = 10, ρ = 28, and β = 8/3.
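For illustration, a trajectory of this system can be generated by integrating Eq. (1) with a fixed-step fourth-order Runge-Kutta scheme at those parameter values. The step size and initial condition in the Julia sketch below are illustrative choices, not necessarily the settings used in our experiments:

```julia
# Lorenz-63 right-hand side with the standard chaotic parameters.
lorenz(u; σ = 10.0, ρ = 28.0, β = 8 / 3) =
    [σ * (u[2] - u[1]),
     u[1] * (ρ - u[3]) - u[2],
     u[1] * u[2] - β * u[3]]

# One classical fixed-step fourth-order Runge-Kutta step.
function rk4_step(f, u, dt)
    k1 = f(u)
    k2 = f(u .+ dt / 2 .* k1)
    k3 = f(u .+ dt / 2 .* k2)
    k4 = f(u .+ dt .* k3)
    return u .+ dt / 6 .* (k1 .+ 2 .* k2 .+ 2 .* k3 .+ k4)
end

# Integrate a trajectory and store it as a 3 × n_steps matrix of training data.
function lorenz_trajectory(u0, dt, n_steps)
    U = zeros(3, n_steps)
    u = copy(u0)
    for t in 1:n_steps
        u = rk4_step(lorenz, u, dt)
        U[:, t] = u
    end
    return U
end

U_train = lorenz_trajectory([1.0, 1.0, 1.0], 0.02, 5_000)  # illustrative step size and initial state
```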
Numerical results are summarized in Table 1; results are calculated for the best models. Each hyperparameter configuration was tested over several thousand random initializations for BaseESN and HyESN; in contrast, PhyESN was only tested over 5 initializations per configuration due to its high training cost. Results are reported in Lyapunov times, the characteristic timescale of a chaotic system, as in Wikner et al. (2020) and Doan, Polifke, and Magri (2019).

BaseESN. In the first experiment, we modeled the Lorenz system with a basic ESN (BaseESN). The basic ESN architecture includes an input layer initialized with a random uniform distribution with values in [−σ, σ], a random sparse reservoir with average connectivity d and spectral radius α, and a T1 nonlinear transformation of the reservoir output (Chattopadhyay, Hassanzadeh, and Subramanian 2020). The readout layer was trained using ridge regression with regularization weight β. A skip connection was included from the input to the reservoir output, and the reservoir leaked previous inputs to the readout layer with leak coefficient a.
The BaseESN predictions diverged from the true system after approximately 15 Lyapunov times, though the predictions continued to follow a similar chaotic pattern (Figure 3). BaseESN was also extremely fast to train and predict: training the network using closed-form ridge regression and predicting 1000 new points (approximately 20 Lyapunov times) each took ≈ 50 ms (Table 1). We also measured training time with a BFGS solver, an iterative nonlinear minimizer, to provide a performance baseline for PhyESN; this approach was nearly four orders of magnitude slower (≈ 3 s).

Figure 3: BaseESN Lorenz predictions and error.

PhyESN. The second experiment (PhyESN) took the same baseline ESN architecture and added a physical consistency loss. This physical consistency loss function was significantly more difficult to optimize than the ridge regression in BaseESN: not only is it nonlinear, it is nonconvex. To account for this, the readout layer was trained using BFGS. Qualitatively, the PhyESN results looked similar to BaseESN. However, despite the physics embedded in the loss function, the best PhyESN models had slightly worse accuracy and stability than the BaseESN. Furthermore, training took significantly more time: BFGS struggled to converge, taking 5-7 orders of magnitude longer (depending on the run) than the closed-form BaseESN training, and the final solution was a local (not global) minimum.

HyESN. The trained HyESN predictions diverged from the true system after approximately 9 Lyapunov times (Figure 4), which is worse than the results produced by BaseESN. However, HyESN frequently exhibits an interesting recovery behavior: in Figure 4, the dynamics appear to recover between 11 and 17 Lyapunov times. This can also be seen in the numerical results: while BaseESN has a substantially longer time horizon, the error over 20 Lyapunov times is 12% lower for HyESN. Training time for HyESN was consistently slower than that of BaseESN, though they were similar; while both are trained with closed-form ridge regression, the HyESN readout layer was larger due to the flawed dynamics input.

Figure 4: Hybrid model (HyESN) Lorenz predictions.

Table 1: Best results, Lorenz models. The best result in each column is marked with an asterisk (shown in bold in the original table).

Net                  | RMSE (20 λ_max t) | Time horizon (λ_max t) | Train time (s)
BaseESN (literature) | 1.69              | 5.10                   | 1.36e-3 (closed-form)
BaseESN (tuned)      | 1.09              | 15.13*                 | 1.08e-3 (closed-form)*; 2.98 (BFGS)
HyESN                | 8.91e-1*          | 8.97                   | 5.59 (closed-form)
PhyESN               | 1.8               | 7.2                    | 3.0e+2 (BFGS)

Weather Forecasting
We took a two-part approach to evaluating the behavior of ESNs on large-scale systems. First, we trained a BaseESN (our best-performing ESN) on data generated by a simple atmospheric model and evaluated its prediction error. Then we trained a BaseESN on real-world data from the National Oceanic and Atmospheric Administration (NOAA)'s Climate Forecast System (CFS) (Saha et al. 2010, 2014) and evaluated its prediction error.
We begin with the simple atmospheric model. The governing system of nonlinear partial differential equations for atmospheric dynamics, known as the primitive equations, was numerically solved to generate data for pressure (P), temperature (T), and the latitudinal and longitudinal components of wind velocity (u and v) (Ehrendorfer 2011). This system is commonly called the dynamical core, as it is the core of most numerical weather models. These four parameters were solved on the surface of the Earth on a grid of 64 longitudinal points and 32 latitudinal points. We solved u, v, and T at three altitude levels (surface, mid-, and high-altitude); pressure was solved only at the surface, bringing the total number of ESN input parameters to 20,480, a significant increase from the 3 input parameters of the Lorenz system. The model was trained over 300 days of data with 20-minute timesteps.
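For concreteness, this input count comes from flattening every gridded field into a single state vector per timestep. A small Julia sketch of that bookkeeping (array names are illustrative):

```julia
# Flatten the dynamical core fields into one ESN input vector per timestep.
# u, v, T are nlon × nlat × nlev arrays; P is the nlon × nlat surface pressure.
flatten_state(u, v, T, P) = vcat(vec(u), vec(v), vec(T), vec(P))

nlon, nlat, nlev = 64, 32, 3
# Three 3-level fields plus surface pressure give 10 values per grid point:
# 64 × 32 × 10 = 20,480 ESN input parameters, matching the count above.
n_inputs = length(flatten_state(zeros(nlon, nlat, nlev), zeros(nlon, nlat, nlev),
                                zeros(nlon, nlat, nlev), zeros(nlon, nlat)))  # 20480
```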
With some tuning, the ESN performed quite well in predicting the evolution of states. With our best-performing model, the ESN predicted over 200 timesteps (>67 hrs) while remaining below the maximum normalized error cutoff threshold (0.4), as seen in Figure 5. Notably, the ESN predictions gradually accumulate error over time and do not diverge significantly. When the results finally do diverge, the ESN continues to predict realistic values, though they no longer reflect the actual evolution of the system.

Figure 5: Plot of BaseESN normalized error over time for the primitive equations system.

We then trained the BaseESN on real-world data from the CFS Reanalysis (CFSR). We utilized hourly data from January 2005 to December 2006 on a 73×144 spatial grid. Due to memory constraints, we restricted the data to u, v, P, and T rather than utilizing the full 73 parameters of CFSR, and again used three altitude levels for u, v, and T. This increased the number of input parameters to 105,120, approximately 5 times more than required for the dynamical core system.
The BaseESN did not perform as well on the real-world data as it did on the dynamical core system. As seen in Figure 6, the predictions for the velocity components diverged from acceptable error levels within four timesteps, likely because the velocities evolve more quickly. Pressure and temperature both evolve much more slowly over time, so the BaseESN was able to remain within acceptable error levels for much longer. Despite the quick divergence from real-world values, the BaseESN continued to make realistic (if incorrect) predictions until around 300 timesteps, where it diverged quite suddenly from realistic values. It is worth noting that we were unable to perform significant hyperparameter tuning on the CFSR system, as the computational cost was quite high, requiring multiple days per training run.

Figure 6: Plot of BaseESN normalized error over time for the CFSR data.

Discussion
We found that a highly tuned BaseESN generally outperformed the physics-informed techniques (PhyESN and HyESN) on the Lorenz system when measured on prediction accuracy and dynamic stability. This result underscores the importance of hyperparameter tuning in deep learning approaches. The primary advantage of BaseESN is the low cost of finding the global minimum in low-dimensional systems. Hyperparameter selection and reservoir initialization significantly impacted model performance. The stark difference between the BaseESN (literature) and BaseESN (tuned) results in Table 1 illustrates this well: the tuned BaseESN is stable for approximately 3× longer than the baseline model, which uses hyperparameters commonly found in the literature (Wikner et al. 2020; Doan, Polifke, and Magri 2019). The BaseESN's cheap training and prediction cost allow for extensive parameter searches and repeated trials, even on modest hardware.

PhyESN does not share this benefit. Conceptually, the physics constraints were intended to improve the physical consistency (and thus reliability) of the model; however, the practical concerns overshadowed any benefit. PhyESN trains by minimizing the residual of a round-trip ODE solver step. In other words, it takes an easy problem and turns it into a very hard (nonconvex) one. There is no longer a clear closed-form solution, and optimizers are more likely to get stuck in bad local minima. Qualitatively, the loss landscape seemed rough and difficult to traverse: training times for the network varied wildly (though all were many orders of magnitude slower than ridge regression), and first-order gradient descent techniques failed to converge. This technique was a poor match for this experiment, but it may perform better in different network architectures or for systems with different dynamics.

Conceptually, HyESN keeps many of the benefits of BaseESN, including the convex formulation and cheap training cost; however, the hybrid approach did not improve the model's time horizon. This may be because the Lorenz system is relatively simple and thus does not benefit from the flawed model's "hints". One interesting feature of HyESN was its recovery ability. As noted previously, HyESN predictions frequently appeared to recover across reservoir initializations, which can be seen in the low long-term prediction error reported in Table 1. If this behavior persists in other systems, it could be a benefit in long-running simulations.
The results from our experiments with the primitive equations were promising and hint that the ESN can handle high-dimensional inputs better than we anticipated. However, as expected, the computational cost of training ESNs and generating predictions does not scale well with large numbers of inputs, which hurts their feasibility in practice. The use of ESNs for large-scale systems would benefit from research into feature reduction techniques, for example using a convolutional autoencoder to encode inputs and decode outputs. Faster training and prediction with ESNs would enable a greater degree of hyperparameter tuning for large systems, which may yield stronger results for real-world systems. It would also enable the use of more input data, which may lead to further improvements for real-world systems.

References
Aggarwal, C. C. 2018. Neural Networks and Deep Learning: A Textbook. Cham: Springer International Publishing. ISBN 978-3-319-94462-3, 978-3-319-94463-0. doi:10.1007/978-3-319-94463-0.
Chattopadhyay, A.; Hassanzadeh, P.; and Subramanian, D. 2020. Data-Driven Prediction of a Multi-Scale Lorenz 96 Chaotic System Using Deep Learning Methods: Reservoir Computing, ANN, and RNN-LSTM. Nonlinear Processes in Geophysics 27(3): 373–389. doi:10/ghnj74.
Doan, N. A. K.; Polifke, W.; and Magri, L. 2019. A Physics-Aware Machine to Predict Extreme Events in Turbulence. arXiv:1912.10994.
Ehrendorfer, M. 2011. Spectral Numerical Weather Prediction Models. Society for Industrial and Applied Mathematics. ISBN 978-1611971989.
Frank, M.; Drikakis, D.; and Charissis, V. 2020. Machine-Learning Methods for Computational Science and Engineering. Computation 8: 15.
Jaeger, H. 2001. The "Echo State" Approach to Analysing and Training Recurrent Neural Networks - with an Erratum Note. German National Research Center for Information Technology GMD Technical Report 148(34): 13.
Lorenz, E. N. 1963. Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences 20(2): 130–141.
Martinuzzi, F. 2020. SciML/ReservoirComputing.jl. SciML Open Source Scientific Machine Learning.
Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019. Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations. Journal of Computational Physics 378: 686–707. ISSN 0021-9991. doi:10/gfzbvx.
Saha, S.; Moorthi, S.; Pan, H.-L.; Wu, X.; Wang, J.; Nadiga, S.; Tripp, P.; Kistler, R.; Woollen, J.; Behringer, D.; et al. 2010. The NCEP Climate Forecast System Reanalysis. Bulletin of the American Meteorological Society 91(8): 1015–1058.
Saha, S.; Moorthi, S.; Wu, X.; Wang, J.; Nadiga, S.; Tripp, P.; Behringer, D.; Hou, Y.-T.; Chuang, H.-y.; Iredell, M.; et al. 2014. The NCEP Climate Forecast System Version 2. Journal of Climate 27(6): 2185–2208.
Wikner, A.; Pathak, J.; Hunt, B.; Girvan, M.; Arcomano, T.; Szunyogh, I.; Pomerance, A.; and Ott, E. 2020. Combining Machine Learning with Knowledge-Based Modeling for Scalable Forecasting and Subgrid-Scale Closure of Large, Complex, Spatiotemporal Systems. Chaos: An Interdisciplinary Journal of Nonlinear Science 30(5): 053111. ISSN 1054-1500. doi:10/ggxrjq.