Accelerating Simulation of Stiff Nonlinear Systems using Continuous-Time Echo State Networks

Ranjan Anantharaman(1), Yingbo Ma(2), Shashi Gowda(1), Chris Laughman(3), Viral B. Shah(2), Alan Edelman(1), Chris Rackauckas(1,2)
(1) Massachusetts Institute of Technology  (2) Julia Computing  (3) Mitsubishi Electric Research Laboratories

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Modern design, control, and optimization often require multiple expensive simulations of highly nonlinear stiff models. These costs can be amortized by training a cheap surrogate of the full model, which can then be used repeatedly. Here we present a general data-driven method, the continuous-time echo state network (CTESN), for generating surrogates of nonlinear ordinary differential equations with dynamics at widely separated timescales. We empirically demonstrate the ability to accelerate a physically motivated scalable model of a heating system by 98x while maintaining relative error within 0.2%. We showcase the ability of this surrogate to accurately handle highly stiff systems which have been shown to cause training failures with common surrogate methods such as Physics-Informed Neural Networks (PINNs), Long Short Term Memory (LSTM) networks, and discrete echo state networks (ESNs). We show that our model captures fast transients as well as slow dynamics, while demonstrating that fixed time step machine learning techniques are unable to adequately capture the multi-rate behavior. Together this provides compelling evidence for the ability of CTESN surrogates to predict and accelerate highly stiff dynamical systems which cannot be directly handled by previous scientific machine learning techniques.

Introduction

Stiff nonlinear systems of ordinary differential equations are widely prevalent throughout science and engineering (Wanner and Hairer 1996; Shampine and Gear 1979) and are characterized by dynamics with widely separated time scales. These systems require highly stable numerical methods to use non-vanishing step sizes reliably (Gear 1971), and also tend to be computationally expensive to solve. Even with state-of-the-art simulation techniques, design, control, and optimization of these systems remains intractable in many realistic engineering applications (Benner, Gugercin, and Willcox 2015). To address these challenges, researchers have focused on techniques to obtain an approximation to a system (called a "surrogate") whose forward simulation time is relatively inexpensive while maintaining reasonable accuracy (Willard et al. 2020; Ratnaswamy et al. 2019; Zhang et al. 2020; Kim et al. 2020; van de Plassche et al. 2020).

A popular class of traditional surrogatization techniques is projection-based model order reduction, such as the proper orthogonal decomposition (POD) (Benner, Gugercin, and Willcox 2015). This method computes "snapshots" of the trajectory and uses the singular value decomposition of the linearization in order to construct a basis of a subspace of the snapshot space, and the model is remade with a change of basis. However, if the system is very nonlinear, the computational complexity of this linearization-based reduced model can be almost as high as that of the original model. One way to overcome this difficulty is through empirical interpolation methods (Nguyen et al. 2014). Other methods to produce surrogates generally utilize the structural information known about highly regular systems like partial differential equation discretizations (Frangos et al. 2010).

Many of these methods require a scientist to actively make choices about the approximations being performed to the system. In contrast, data-driven approaches like Physics-Informed Neural Networks (PINNs) (Raissi, Perdikaris, and Karniadakis 2019) and Long Short Term Memory (LSTM) networks (Chattopadhyay, Hassanzadeh, and Subramanian 2020) have gained popularity due to their apparent applicability to "all" ordinary and partial differential equations in a single automated form. However, numerical stiffness (Söderlind, Jay, and Calvo 2015) and multiscale dynamics represent an additional challenge.
Highly stiff differential equations can lead to gradient pathologies that make common surrogate techniques like PINNs hard to train (Wang, Teng, and Perdikaris 2020).

A classic way to create surrogates for stiff systems is to simply eliminate the stiffness. The computational singular perturbation (CSP) method (Hadjinicolaou and Goussis 1998) has been shown to decompose chemical reaction equations into fast and slow modes. The fast modes are then eliminated, resulting in a non-stiff system. Another option is to perform problem-specific variable transformations (Qian et al. 2020; Kramer and Willcox 2019) to a form more suited to model order reduction by traditional methods. These transformations are often problem specific and require a scientist's intervention at the equation level. Recent studies on PINNs have demonstrated that such variable elimination may be required to handle highly stiff equations because the stiffness leads to ill-conditioned optimization problems. For example, on the classic Robertson's equations (ROBER) (Robertson 1976) and Pollution model (POLLU) (Verwer 1994) stiff test problems it was shown that direct training of PINNs failed, requiring the authors to perform a quasi-steady state (QSS) assumption in order for accurate prediction to occur (Ji et al. 2020). However, many chemical reaction systems require transient activations to properly arrive at the overarching dynamics, making the QSS assumption only applicable to a subset of dynamical systems (Henry and Martin 2016; Flach and Schnell 2006; Eilertsen and Schnell 2020; Turanyi, Tomlin, and Pilling 1993; Schuster and Schuster 1989; Thomas, Straube, and Grima 2011). Thus, while demonstrating promising results on difficult equations, training on the QSS-approximated equations applies only to specific chemical reaction networks and requires the scientist to make approximation choices that are difficult to automate, which reduces the general applicability that PINNs were meant to give.

The purpose of this work is to introduce a general data-driven method, the CTESN, that is generally applicable and able to accurately capture highly nonlinear heterogeneous stiff time-dependent systems without requiring the user to train on non-stiff approximations. It is able to accurately train and predict on highly ill-conditioned models. We demonstrate these results (Figure 1) on the Robertson's equations, which PINNs, LSTM networks, and discrete-time machine learning techniques fail to handle. Our results showcase the ability to transform difficult stiff equations into non-stiff reservoir equations which are then integrated in place of the original system. Given the O(n^3) scaling behavior of general stiff ODE solvers due to their internal LU-factorizations, and the linear scaling of the surrogate with the number of outputs, we observe increasing accelerations as the system gets larger. With this scaling difference we demonstrate the ability to accelerate a large stiff system by 98x while achieving < 0.2% error (Figure 5).

Figure 1: Prediction of each surrogate on the Robertson's equations. Shown in each panel is the result of the data-driven algorithm's prediction at p = [0.04, 3 × 10^7, 1 × 10^4], a parameter set not in the training data. Ground truth, obtained by solving the ODE using the Rosenbrock23 solver with absolute tolerance of 10^-6, is in blue. The PINN was trained using a 3-layer multi-layer perceptron with the ADAM optimizer for 300,000 epochs with minibatching, and its prediction is in red. Both the ESN and CTESN were trained with a reservoir size of 3000 on a parameter space of [0.036, 0.044] × [2.7 × 10^7, 3.3 × 10^7] × [9 × 10^3, 1.1 × 10^4], from which 1000 sets of parameters were sampled using Sobol sampling. The predictions of the CTESN are generated by the radial basis function prediction of Wout(p) and are shown in green. Predictions from the ESN are in purple. The LSTM predictions, in gold, are generated by a network with 3 hidden LSTM layers and an output dense layer, after training for 2000 epochs. (A) A timeseries plot of the y_1(t) predictions. (B) The absolute error of the surrogate predictions on y_1(t). (C) A timeseries plot of the y_2(t) predictions. (D) The absolute error of the surrogate predictions on y_2(t). (E) The result of y_1(t) + y_2(t) + y_3(t) over time. By construction this quantity's theoretical value is 1 over the timeseries.
Continuous-Time Echo State Networks

Echo State Networks (ESNs) are a reservoir computing framework which projects signals from a higher-dimensional space defined by the dynamics of a fixed non-linear system called a "reservoir" (Ozturk, Xu, and Príncipe 2007). The ESN's mathematical formulation is as follows. For an NR-dimensional reservoir, the reservoir equation is given by:

    r_{n+1} = f(A r_n + Wfb x_n),    (1)

where f is a chosen activation function (like tanh or sigmoid), A is the NR × NR reservoir weight matrix, and Wfb is the NR × N feedback matrix, where N is the size of our original model. In order to arrive at a prediction of our original model, we take a projection of the reservoir:

    x̂_n = g(Wout r_n),    (2)

where g is the output activation function (generally the identity or sigmoid) and Wout is the N × NR projection matrix. In the training process of an ESN, the matrices A and Wfb are randomly chosen constants, meaning the Wout matrix is the only variable which needs to be approximated. Wout is calculated using a least squares fit against the model's time series, which then fully describes the prediction process.

This process of using a direct linear solve, such as a QR-factorization, to calculate Wout means that no gradient-based optimization is used in the training process. For this reason ESNs have traditionally been used as a stand-in for recurrent neural networks that overcomes the vanishing gradient problem (Jaeger et al. 2007; Lukoševičius and Jaeger 2009; Mattheakis and Protopapas 2019; Vlachas et al. 2020; Chattopadhyay et al. 2019; Grezes 2014; Evanusa, Aloimonos, and Fermüller 2020; Butcher et al. 2013). However, ESNs have also been applied to learning chaotic systems (Chattopadhyay et al. 2019; Doan, Polifke, and Magri 2019), nonlinear systems identification (Jaeger 2003), biosignal processing (Kudithipudi et al. 2016), and robot control (Polydoros, Nalpantidis, and Krüger 2015). These are all cases where the derivative calculations are unstable or, as in the case of chaotic equations, are not well-defined for long time spans.
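To make Eqs. (1)-(2) concrete, the following is a minimal Julia sketch of a fixed-time-step ESN readout fit. The snapshot matrix `x` (N × T, at uniform time steps), the reservoir size, and the weight scalings are illustrative assumptions, not the exact settings used in the paper.

```julia
# Minimal sketch of a discrete ESN readout fit (Eqs. 1-2).
# `x` is an N x T matrix of model snapshots at uniform time steps; the reservoir
# size and weight scalings below are illustrative choices, not the paper's.
using LinearAlgebra, SparseArrays, Random

function fit_esn_readout(x; N_R = 300, density = 0.01, seed = 0)
    rng  = MersenneTwister(seed)
    N, T = size(x)
    A    = 0.1 .* sprandn(rng, N_R, N_R, density)   # fixed sparse reservoir matrix
    Wfb  = 0.1 .* randn(rng, N_R, N)                # fixed feedback matrix
    r    = zeros(N_R, T)
    for n in 1:T-1
        r[:, n+1] = tanh.(A * r[:, n] + Wfb * x[:, n])   # Eq. (1) with f = tanh
    end
    Wout = x / r    # least-squares readout so that x ≈ Wout * r (Eq. 2, g = identity)
    return A, Wfb, Wout, r
end
```

Because Wout comes from a direct linear solve, no gradient-based optimization is involved, which is the property the discussion above leans on.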
This ability to handle problems with gradient pathologies gives the intuitive justification for exploring reservoir computing techniques for handling stiff equations. However, stiff systems generally have behavior spanning multiple timescales which are difficult to represent with uniformly-spaced prediction intervals. For example, in the ROBER problem we will showcase, an important transient occurs for less than 10 seconds of the 10,000 second simulation. However, this feature is important to capture the catalysis that kick-starts the long-term changes. Many more samples from t ∈ [0, 10] will be required than from t ∈ [10, 10^5] in order to accurately capture the dynamics of the system. These behaviors are the reason why all of the major software for handling stiff equations, such as CVODE (Hindmarsh et al. 2005), LSODA (Hindmarsh and Petzold 2005), and Radau (Hairer and Wanner 1999), are adaptive. In fact, this behavior is so crucial to the stable handling of stiff systems that robust implicit solvers tie the stepping behavior to the implicit handling of the system, with complex procedures for reducing time steps when Newton convergence rates are reduced (Wanner and Hairer 1996; Hosea and Shampine 1996; Hairer and Wanner 1999). For these reasons, we will demonstrate that the classic fixed time step reservoir computing methods from machine learning are unable to handle these highly stiff equations.

To solve these issues, we introduce a new variant of ESNs, which we call continuous-time echo state networks (CTESNs), which allows for an underlying adaptive time process while avoiding gradient pathologies in training. Let N be the dimension of our model, and let P be a Cartesian space of parameters under which the model is expected to operate. The CTESN with reservoir dimension NR is defined as

    r'(t) = f(A r(t) + Whyb x(p*, t)),    (3)
    x(t)  = g(Wout r(t)),                 (4)

where A is a fixed sparse random matrix of dimension NR × NR and Whyb is a fixed random dense matrix of dimensions NR × N. The term Whyb x(p*, t) represents a "hybrid" term that incorporates physics information into the reservoir (Pathak et al. 2018), namely a solution at some point in the parameter space of the dynamical system. Given these fixed values, the readout matrix Wout is the only trained portion of this network and is obtained through a least squares fit of the reservoir ODE solution against the original timeseries. We note that in this study we choose f = tanh and g = id (the identity) for all of our examples.

To obtain a surrogate that predicts the dynamics at new physical parameters, the reservoir projection Wout is fit against many solutions at parameters {p_1, ..., p_n}, where n is the number of training data points sampled from the parameter space. Using these fits, an interpolating function Wout(p) between the matrices can be trained. A prediction x̂(t) at physical parameters p̂ is thus given by:

    x̂(t) = Wout(p̂) r(t).    (5)

A strong advantage of our method is its ease of implementation and ease of training. Global L2 fitting via stabilized methods like the SVD is robust to ill-conditioning, alleviating many of the issues encountered when attempting to build neural surrogates of such equations. Also note that in this particular case, the readout matrix is fit against the same reservoir time series. This means that prediction does not need to simulate the reservoir, providing an extra acceleration.

Another advantage is the ability to use time stepping information from the solver during training. As noted before, not only are step sizes chosen adaptively based on minimizing a local error estimate to a specified tolerance (Shampine and Gear 1979), but they also adapt to concentrate around the most stiff and numerically difficult time points of the model by incorporating the Newton convergence into the rejection framework. These timestamps thus provide heuristic snapshots of the most important points for training the least squares fit, whereas snapshots from uniform time steps may skip over many crucial aspects of the dynamics.
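A minimal sketch of how Eqs. (3)-(4) can be set up and fit is given below, assuming a continuous solution `sol_star` of the full model at one training parameter p* (e.g. an ODESolution from DifferentialEquations.jl) and a snapshot matrix `X` at the time points `ts` chosen by the stiff solve. Reservoir size, weight scalings, and the solver choice for the reservoir are illustrative, not the paper's exact settings.

```julia
# Sketch of constructing and fitting a CTESN (Eqs. 3-4). `sol_star` is a continuous
# (interpolating) solution of the full model at a training parameter p*; `ts` and `X`
# are the adaptive time points and the N x length(ts) snapshot matrix being fit.
using OrdinaryDiffEq, LinearAlgebra, SparseArrays, Random

function ctesn_fit(sol_star, ts, X; N_R = 3000, density = 0.01, seed = 0)
    rng  = MersenneTwister(seed)
    N    = size(X, 1)
    A    = 0.1 .* sprandn(rng, N_R, N_R, density)   # fixed sparse reservoir matrix
    Whyb = 0.1 .* randn(rng, N_R, N)                # fixed "hybrid" driving matrix

    # Reservoir ODE (Eq. 3): driven by the interpolated full-model solution x(p*, t).
    reservoir!(dr, r, p, t) = (dr .= tanh.(A * r .+ Whyb * sol_star(t)))
    rprob = ODEProblem(reservoir!, zeros(N_R), (first(ts), last(ts)))
    rsol  = solve(rprob, Tsit5(), saveat = ts)      # the reservoir ODE is non-stiff

    R    = Array(rsol)                              # N_R x length(ts) reservoir states
    Wout = X / R                                    # least-squares readout (Eq. 4, g = id)
    return rsol, Wout
end
```

Here `X / R` is Julia's least-squares right division; the paper emphasizes that such stabilized global fits remain robust under the ill-conditioning that defeats gradient-based training.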
Training

In this section we describe the automated training procedure used to generate CTESN surrogates. An input parameter space P is first chosen. This could be a design space for the model or a range of operating conditions. Now n sets of parameters {p_1, ..., p_n} are sampled from this space using a sampling sequence that covers as much of the space as possible. The full model is now simulated at each sample in parallel, since each run is independent, generating a time series for each run. The choice of points in time used to generate the time series at each p comes from the numerical ODE solve at that p. The reservoir ODE is then constructed using a candidate solution at any one of the n parameters, x(p*, t), p* ∈ {p_1, ..., p_n}, and is then simulated, generating the reservoir time series. Since the reservoir ODE is non-stiff, this simulation is cheap compared to the cost of the full model. Least squares projections can now be calculated from each solution to the reservoir in parallel. Once all the least squares matrices are obtained, an interpolating function is trained to predict the least squares projection matrix. Both the least squares fitting and training the interpolating function are, in practice, much cheaper than the cost of simulating the model multiple times.

Prediction comprises two steps: predicting the least squares matrix, and simulating the reservoir time series (or, in this case, just using the pre-computed continuous solution, since the reservoir is fixed for every set of parameters). The final prediction is just the matrix multiplication of the two.

A strong advantage of the training is that it requires no manual investigation of the stiff model on the part of the researcher and can be called as an off-the-shelf routine. It allows the researcher to make a trade-off, computing a few parallelized runs of the full stiff model in order to generate a surrogate, which can then be plugged in and used repeatedly for design and optimization.

We implemented the training routines and the following models in the Julia programming language (Bezanson et al. 2017) to take advantage of its first class support for differential equation solvers (Rackauckas and Nie 2017) and scientific machine learning packages. For the examples in this paper, we have sampled the high-dimensional spaces using a Sobol low-discrepancy sequence (Sobol' et al. 2011) and interpolated the Wout matrices using a radial basis function provided by the Julia package Surrogates.jl (https://github.com/SciML/Surrogates.jl).
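The sketch below ties these steps together, assuming a user-supplied full-model constructor `make_prob(p)` (a hypothetical name) and a `ctesn_fit`-style routine like the one sketched earlier. The paper interpolates the readout matrices with a radial basis function from Surrogates.jl; a simple inverse-distance weighting stands in for that interpolation here.

```julia
# Sketch of the outer training loop over the parameter space. `make_prob(p)` is a
# hypothetical constructor for the full stiff ODEProblem at parameters p; `ctesn_fit`
# is the routine sketched above. Inverse-distance weighting is used as a stand-in
# for the radial basis interpolation of Wout(p) used in the paper.
using Sobol, OrdinaryDiffEq, LinearAlgebra

function train_surrogate(make_prob, lb, ub; n = 100)
    seq  = SobolSeq(lb, ub)
    ps   = [next!(seq) for _ in 1:n]                                        # Sobol samples
    sols = [solve(make_prob(p), Rosenbrock23(), abstol = 1e-6) for p in ps] # stiff solves (parallelizable)

    # Build the reservoir once, driven by the solution at p* = ps[1].
    rsol, _ = ctesn_fit(sols[1], sols[1].t, Array(sols[1]))
    R_at(ts) = reduce(hcat, rsol.(ts))                  # reservoir states at given times

    # One least-squares readout per training parameter (also parallelizable).
    Wouts = [Array(sols[i]) / R_at(sols[i].t) for i in 1:n]

    # Interpolate Wout over the parameter space and assemble the predictor (Eq. 5).
    function Wout_at(p̂)
        w = [1 / (norm(p̂ .- p)^2 + 1e-12) for p in ps]
        sum(w .* Wouts) / sum(w)
    end
    predict(p̂, ts) = Wout_at(p̂) * R_at(ts)
    return predict
end
```

The returned `predict(p̂, ts)` realizes Eq. (5): a predicted readout matrix times the pre-computed reservoir states at the requested times.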
Case Studies

In this section we describe two representative examples. We demonstrate that the CTESN can handle highly stiff behaviour through the ROBER example. We then discuss the performance of the surrogate on a scalable, physically-inspired heating system.

Robertson Equations and High Stiffness

We first consider Robertson's chemical reactions involving three reaction species A, B and C:

    A -> B           (rate 0.04)
    B + B -> C + B   (rate 3 × 10^7)
    B + C -> A + C   (rate 10^4)

which lead to the ordinary differential equations:

    y_1' = -0.04 y_1 + 10^4 y_2 y_3,                     (6)
    y_2' =  0.04 y_1 - 10^4 y_2 y_3 - 3 × 10^7 y_2^2,    (7)
    y_3' =  3 × 10^7 y_2^2,                              (8)

where y_1, y_2, and y_3 are the concentrations of A, B and C respectively. This system has widely separated reaction rates (0.04, 10^4, 3 × 10^7), and is well known to be very stiff (Gobbert 1996; Robertson and Williams 1975; Robertson 1976). It is commonly used as an example to evaluate integrators of stiff ODEs (Hosea and Shampine 1996). Finding an accurate surrogate for this system is difficult because it needs to capture both the stable slow reacting system and the fast transients. Additionally, the surrogate needs to be consistent with this system's implicit physical constraints, such as the conservation of matter (y_1 + y_2 + y_3 = 1) and positivity of the variables (y_i > 0), in order to provide a stable solution.
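For reference, the ROBER system above can be written and solved in DifferentialEquations.jl as follows. The Rosenbrock23 solver and the 10^-6 absolute tolerance match the ground-truth setup described in Figure 1; the initial condition, time span, and relative tolerance are standard choices assumed here rather than values spelled out in the text.

```julia
# The ROBER system (Eqs. 6-8) solved with a stiff solver, mirroring the ground-truth
# setup in Figure 1 (Rosenbrock23, abstol = 1e-6). Initial condition, time span, and
# reltol are standard/illustrative choices not stated explicitly in the paper.
using OrdinaryDiffEq

function rober!(du, u, p, t)
    y1, y2, y3 = u
    k1, k2, k3 = p                     # p = [0.04, 3e7, 1e4] in the paper's notation
    du[1] = -k1 * y1 + k3 * y2 * y3
    du[2] =  k1 * y1 - k3 * y2 * y3 - k2 * y2^2
    du[3] =  k2 * y2^2
    return nothing
end

prob = ODEProblem(rober!, [1.0, 0.0, 0.0], (0.0, 1e5), [0.04, 3e7, 1e4])
sol  = solve(prob, Rosenbrock23(), abstol = 1e-6, reltol = 1e-6)
```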
A surrogate was trained by sampling 1000 sets of parameters from the Cartesian parameter space [0.036, 0.044] × [2.7 × 10^7, 3.3 × 10^7] × [9 × 10^3, 1.1 × 10^4] using Sobol sampling so as to evenly cover the whole space. We train on the time series of the three states themselves as outputs. A least squares projection Wout was fit for each set of parameters, and then a radial basis function was used to interpolate between the matrices. The prediction workflow is as follows: given a set of parameters whose time series is desired, the radial basis function predicts the projection matrix. The pre-simulated reservoir is then sampled at the desired time points, and a matrix multiplication with the predicted Wout gives us the desired prediction. Figure 1 shows a comparison between the CTESN, ESN, PINN and LSTM methods. The PINN data is reproduced from (Ji et al. 2020), and the ESN was trained using 101 time points uniformly sampled from the time span, while the CTESN used 61 adaptively sampled time points informed by the ODE solver (Rosenbrock23 (Shampine 1982)).

The CTESN method is able to accurately capture both the slow and fast transients and respect the conservation of mass. The ESN is able to accurately predict at the time points it was trained on, but many features are missed. The uniform stepping completely misses the fast transient rise at t = 10^-4 because the uniform intervals do not sample points from that time scale. Additionally, the first sampled time point at t = 100 is far into the concentration drop of y_1, which leads to an inaccurate prediction before the system settles into near steady state behavior. As stated earlier, the CTESN uses information from a stiff ODE solver to choose the right points along the time span to accurately capture multi-scale behaviour with less training data than the ESN. In order to compare the discrete ESN to the continuous result, a cubic spline was fit to its 101 evenly spaced prediction points.

The PINN was trained by sampling 2500 logarithmically spaced points across the time span. The network used was a 3-layer multi-layer perceptron with 128 nodes per hidden layer and the Gaussian Error Linear Unit activation function (Hendrycks and Gimpel 2016). The layers were initialized using Xavier initialization (Glorot and Bengio 2010), and trained with the ADAM optimizer (Kingma and Ba 2019) at a learning rate of 10^-3 for 300,000 epochs with a mini batch size of 128. Figure 2 shows the convergence plot as the PINN trains on the ROBER equations. The LSTM network used a similar architecture to the PINN, but with LSTM hidden layers instead of fully connected layers. It used 2500 logarithmically spaced points and was trained for 2000 epochs until convergence.

Figure 2: Training a PINN on the Robertson's Equations. The PINN was trained for 300,000 epochs using the ADAM optimizer with a learning rate of 10^-3, by which time the loss seems to saturate. The hyperparameters of the PINN can be found in the Case Studies section.

Figure 1 highlights how the trained PINN fails to capture both the fast and the slow transients and does not respect mass conservation. Our collaborators investigated why PINNs fail to solve these equations in (Ji et al. 2020). The reason for the difficulty can be attributed to recently identified results on gradient pathologies in the training arising from stiffness (Wang, Teng, and Perdikaris 2020). With a highly ill-conditioned Hessian in the training process due to the stiffness of the equation, it is very unlikely for local optimization to find parameters which make an accurate prediction. We additionally note that stiff systems of this form may be hard to capture by neural networks directly, as neural networks show a bias towards low frequency functions (Rahaman et al. 2019).
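For concreteness, the baseline PINN body described above (three hidden layers of 128 GELU units, Xavier initialization, ADAM at 10^-3) could be written as the following Flux.jl sketch; the input/output shapes are assumptions, and the physics-informed residual loss of (Ji et al. 2020) is not reproduced here.

```julia
# Sketch of the baseline PINN network body: a 3-hidden-layer MLP with 128 GELU units
# per layer, Xavier (Glorot) initialization, trained with ADAM at lr = 1e-3. Only the
# architecture and optimizer are shown, not the physics-informed loss.
using Flux

pinn_body = Chain(
    Dense(1 => 128, gelu; init = Flux.glorot_uniform),    # input: time t
    Dense(128 => 128, gelu; init = Flux.glorot_uniform),
    Dense(128 => 128, gelu; init = Flux.glorot_uniform),
    Dense(128 => 3; init = Flux.glorot_uniform),           # outputs: (y1, y2, y3)
)
opt = Flux.Adam(1e-3)
```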
Stiffly-Aware Surrogates of HVAC Systems

Our second test problem is a scalable benchmark used in the engineering community (Casella 2015). It is a simplified, lumped-parameter model of a heating system with a central heater supplying heat to several rooms through a distribution network. Each room has an on-off controller with hysteresis which provides very fast localized action (Ranade and Casella 2014). The resulting system of equations is thus very stiff and unable to be solved by standard explicit time stepping methods.

The size of the heating system is scaled by a parameter N which refers to the number of users/rooms. Each room is governed by two equations corresponding to its temperature and the state of its on-off controller. The temperature of the fluid supplying heat to each room is governed by one equation. This produces a system with 2N + 1 coupled non-linear equations. This "scalability" lets us test how our CTESN surrogate scales. To train the surrogate, we define a parameter space P under which we expect it to operate. First, we assume the set point temperature of each room to be between 17°C and 23°C. Each room is warmed by a heat conducting fluid, whose set point is between 65°C and 75°C. Thus the parameter space over which we expect our surrogate to work is the rectangular space denoted by [17°C, 23°C] × [65°C, 75°C].

We used a reservoir size of 3000 and sampled 100 sets of parameters from this space using Sobol sampling, and fit least squares projection matrices Wout between each solution and the reservoir. For a system with N rooms, we train on N + 1 outputs, namely the temperature of each room and the temperature of the heat conducting fluid. Figure 3 demonstrates that the training technique is accurately able to find matrices Wout which capture the stiff system within 0.01% error on test parameters. We then fit an interpolating radial basis function Wout(p). Figure 4 demonstrates that the interpolated Wout(p) is able to adequately capture the dynamics throughout the trained parameter space. Lastly, Figure 5 demonstrates the O(N) cost of the surrogate evaluation, which in comparison to the O(N^3) cost of a general implicit ODE solver (due to the LU-factorizations) leads to an increasing gap in the solver performance as N increases. At the high end of our test, the surrogate accelerates an 801-dimensional stiff ODE system by approximately 98x.

Figure 3: Validating the surrogate of the scalable heating system with 10 rooms. When tested with parameters it has not seen in training, our surrogate is able to reproduce the behaviour of the system to within 0.01% error. The surrogate is trained on 100 points sampled from the space [17°C, 23°C] × [65°C, 75°C], where the ranges represent the set point temperature of each room and the set point of the fluid supplying heat to the rooms respectively. The test parameters validated here are [21°C, 71°C]. More details on training can be found in the Case Studies section.

Figure 4: Reliability of the surrogate through parameter space. We sampled our grid at over 500 grid points and plotted a heatmap of test error through our parameter space. We find our surrogate performs reliably even at the border of our space, with error within 0.1%.

Figure 5: Scaling performance of the surrogate on the heating system. We compare the time taken to simulate the full stiff model to the trained surrogate with 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200 and 400 rooms. We observe a speedup of up to 98x. The surrogate was trained by sampling 100 sets of parameters from our input space, with a reservoir size of 3000.
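The timing comparison behind Figure 5 can be sketched as below, assuming a hypothetical constructor `make_heating_prob(N, p)` for the full 2N+1-equation stiff model and a trained surrogate `predict(p, ts)` in the style sketched earlier; neither name comes from the paper's code, and the stiff solver choice is illustrative.

```julia
# Sketch of the scaling study in Figure 5: time the full implicit stiff solve against
# the trained surrogate at the test set points. `make_heating_prob` and `predict` are
# hypothetical stand-ins for the paper's model and trained CTESN surrogate.
using OrdinaryDiffEq, BenchmarkTools

p_test = [21.0, 71.0]                        # room and fluid set points from Figure 3
for N in (10, 100, 400)
    prob   = make_heating_prob(N, p_test)    # hypothetical full-model constructor
    ts     = range(prob.tspan[1], prob.tspan[2]; length = 200)
    t_full = @belapsed solve($prob, Rodas5())      # implicit solve: O(N^3) factorizations
    t_surr = @belapsed $predict($p_test, $ts)      # surrogate readout: O(N) in outputs
    println("N = $N: speedup ≈ $(round(t_full / t_surr; digits = 1))x")
end
```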
Conclusion & Future Work

We present CTESNs, a data-driven method for generating surrogates of nonlinear ordinary differential equations with dynamics at widely separated timescales. Our method maintains accuracy for different parameters in a chosen parameter space, and shows favourable scaling with system size on a physics-inspired scalable model. This method can be applied to any ordinary differential equation without requiring the scientist to simplify the model before surrogate application, greatly improving productivity.

In future work, we plan to extend the formulation to take in forcing functions. This entails that the reservoir needs to be simulated every single time a prediction is made, adding to running time, but we do note that numerically simulating the reservoir is quite fast in practice as it is non-stiff, and thus techniques which regenerate reservoirs on demand will likely not incur a major run time performance cost.

Our method utilizes the continuous nature of differential equation solutions. Hybrid dynamical systems, such as those with event handling (Ellison 1981), can introduce discontinuities into the system which will require extensions to our method. Further extensions to the method will handle both derivative discontinuities and events present in Filippov dynamical systems (Filippov 2013). Further opportunities could explore utilizing more structure within equations for building a more robust CTESN or decreasing the necessary size of the reservoir.

To train both the example problems in this paper, we required no knowledge of the physics. This presents an opportunity to train surrogates of black-box systems.

Acknowledgement

The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Numbers DE-AR0001222 and DE-AR0001211, and NSF grant OAC-1835443. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. The authors thank Francesco Martinuzzi for reviewing drafts of this paper.

References

Benner, P.; Gugercin, S.; and Willcox, K. 2015. A survey of projection-based model reduction methods for parametric dynamical systems. SIAM Review 57(4): 483–531.

Bezanson, J.; Edelman, A.; Karpinski, S.; and Shah, V. B. 2017. Julia: A fresh approach to numerical computing. SIAM Review 59(1): 65–98.

Butcher, J. B.; Verstraeten, D.; Schrauwen, B.; Day, C. R.; and Haycock, P. W. 2013. Reservoir computing and extreme learning machines for non-linear time-series data analysis. Neural Networks 38: 76–89.

Casella, F. 2015. Simulation of large-scale models in Modelica: State of the art and future perspectives. In Proceedings of the 11th International Modelica Conference, 459–468.

Chattopadhyay, A.; Hassanzadeh, P.; and Subramanian, D. 2020. Data-driven predictions of a multiscale Lorenz 96 chaotic system using machine-learning methods: reservoir computing, artificial neural network, and long short-term memory network. Nonlinear Processes in Geophysics 27(3): 373–389.

Chattopadhyay, A.; Hassanzadeh, P.; Subramanian, D.; and Palem, K. 2019. Data-driven prediction of a multi-scale Lorenz 96 chaotic system using a hierarchy of deep learning methods: Reservoir computing, ANN, and RNN-LSTM.

Doan, N. A. K.; Polifke, W.; and Magri, L. 2019. Physics-informed echo state networks for chaotic systems forecasting. In International Conference on Computational Science, 192–198. Springer.

Eilertsen, J.; and Schnell, S. 2020. The quasi-steady-state approximations revisited: Timescales, small parameters, singularities, and normal forms in enzyme kinetics. Mathematical Biosciences 108339.

Ellison, D. 1981. Efficient automatic integration of ordinary differential equations with discontinuities. Mathematics and Computers in Simulation 23(1): 12–20.

Evanusa, M.; Aloimonos, Y.; and Fermüller, C. 2020. Deep Reservoir Computing with Learned Hidden Reservoir Weights using Direct Feedback Alignment. arXiv preprint arXiv:2010.06209.

Filippov, A. F. 2013. Differential Equations with Discontinuous Righthand Sides: Control Systems, volume 18. Springer Science & Business Media.

Flach, E. H.; and Schnell, S. 2006. Use and abuse of the quasi-steady-state approximation. IEE Proceedings - Systems Biology 153(4): 187–191.

Frangos, M.; Marzouk, Y.; Willcox, K.; and van Bloemen Waanders, B. 2010. Surrogate and reduced-order modeling: a comparison of approaches for large-scale statistical inverse problems [Chapter 7].

Gear, C. W. 1971. Numerical Initial Value Problems in Ordinary Differential Equations.

Glorot, X.; and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.

Gobbert, M. K. 1996. Robertson's example for stiff differential equations. Arizona State University, Technical report.

Grezes, F. 2014. Reservoir Computing. Ph.D. thesis, The City University of New York.

Hadjinicolaou, M.; and Goussis, D. A. 1998. Asymptotic solution of stiff PDEs with the CSP method: the reaction diffusion equation. SIAM Journal on Scientific Computing 20(3): 781–810.

Hairer, E.; and Wanner, G. 1999. Stiff differential equations solved by Radau methods. Journal of Computational and Applied Mathematics 111(1-2): 93–111.

Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Henry, A.; and Martin, O. C. 2016. Short relaxation times but long transient times in both simple and complex reaction networks. Journal of the Royal Society Interface 13(120): 20160388.

Hindmarsh, A.; and Petzold, L. 2005. LSODA, Ordinary Differential Equation Solver for Stiff or Non-stiff System.

Hindmarsh, A. C.; Brown, P. N.; Grant, K. E.; Lee, S. L.; Serban, R.; Shumaker, D. E.; and Woodward, C. S. 2005. SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software (TOMS) 31(3): 363–396.

Hosea, M.; and Shampine, L. 1996. Analysis and implementation of TR-BDF2. Applied Numerical Mathematics 20(1-2): 21–37.

Jaeger, H. 2003. Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems, 609–616.

Jaeger, H.; Lukoševičius, M.; Popovici, D.; and Siewert, U. 2007. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks 20(3): 335–352.

Ji, W.; Qiu, W.; Shi, Z.; Pan, S.; and Deng, S. 2020. Stiff-PINN: Physics-Informed Neural Network for Stiff Chemical Kinetics. arXiv preprint arXiv:2011.04520.

Kim, Y.; Choi, Y.; Widemann, D.; and Zohdi, T. 2020. A fast and accurate physics-informed neural network reduced order model with shallow masked autoencoder. arXiv preprint arXiv:2009.11990.

Kingma, D. P.; and Ba, J. 2019. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kramer, B.; and Willcox, K. E. 2019. Nonlinear model order reduction via lifting transformations and proper orthogonal decomposition. AIAA Journal 57(6): 2297–2307.
Kudithipudi, D.; Saleh, Q.; Merkel, C.; Thesing, J.; and Wysocki, B. 2016. Design and analysis of a neuromemristive reservoir computing architecture for biosignal processing. Frontiers in Neuroscience 9: 502.

Lukoševičius, M.; and Jaeger, H. 2009. Reservoir computing approaches to recurrent neural network training. Computer Science Review 3(3): 127–149.

Mattheakis, M.; and Protopapas, P. 2019. Recurrent Neural Networks: Exploding, Vanishing Gradients & Reservoir Computing.

Nguyen, V.; Buffoni, M.; Willcox, K.; and Khoo, B. 2014. Model reduction for reacting flow applications. International Journal of Computational Fluid Dynamics 28(3-4): 91–105.

Ozturk, M. C.; Xu, D.; and Príncipe, J. C. 2007. Analysis and design of echo state networks. Neural Computation 19(1): 111–138.

Pathak, J.; Wikner, A.; Fussell, R.; Chandra, S.; Hunt, B. R.; Girvan, M.; and Ott, E. 2018. Hybrid forecasting of chaotic processes: Using machine learning in conjunction with a knowledge-based model. Chaos: An Interdisciplinary Journal of Nonlinear Science 28(4): 041101.

Polydoros, A.; Nalpantidis, L.; and Krüger, V. 2015. Advantages and limitations of reservoir computing on model learning for robot control. In IROS Workshop on Machine Learning in Planning and Control of Robot Motion, Hamburg, Germany.

Qian, E.; Kramer, B.; Peherstorfer, B.; and Willcox, K. 2020. Lift & Learn: Physics-informed machine learning for large-scale nonlinear dynamical systems. Physica D: Nonlinear Phenomena 406: 132401.

Rackauckas, C.; and Nie, Q. 2017. DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia. Journal of Open Research Software 5(1).

Rahaman, N.; Baratin, A.; Arpit, D.; Draxler, F.; Lin, M.; Hamprecht, F.; Bengio, Y.; and Courville, A. 2019. On the spectral bias of neural networks. In International Conference on Machine Learning, 5301–5310. PMLR.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378: 686–707.

Ranade, A.; and Casella, F. 2014. Multi-rate integration algorithms: a path towards efficient simulation of object-oriented models of very large systems. In Proceedings of the 6th International Workshop on Equation-Based Object-Oriented Modeling Languages and Tools, 79–82.

Ratnaswamy, V.; Safta, C.; Sargsyan, K.; and Ricciuto, D. M. 2019. Physics-informed Recurrent Neural Network Surrogates for E3SM Land Model. AGUFM 2019: GC43D–1365.

Robertson, H. 1976. Numerical integration of systems of stiff ordinary differential equations with special structure. IMA Journal of Applied Mathematics 18(2): 249–263.

Robertson, H.; and Williams, J. 1975. Some properties of algorithms for stiff differential equations. IMA Journal of Applied Mathematics 16(1): 23–34.

Schuster, S.; and Schuster, R. 1989. A generalization of Wegscheider's condition. Implications for properties of steady states and for quasi-steady-state approximation. Journal of Mathematical Chemistry 3(1): 25–42.

Shampine, L. F. 1982. Implementation of Rosenbrock methods. ACM Transactions on Mathematical Software (TOMS) 8(2): 93–113.

Shampine, L. F.; and Gear, C. W. 1979. A user's view of solving stiff ordinary differential equations. SIAM Review 21(1): 1–17.

Sobol', I. M.; Asotsky, D.; Kreinin, A.; and Kucherenko, S. 2011. Construction and comparison of high-dimensional Sobol' generators. Wilmott 2011(56): 64–79.

Söderlind, G.; Jay, L.; and Calvo, M. 2015. Stiffness 1952–2012: Sixty years in search of a definition. BIT Numerical Mathematics 55(2): 531–558.

Thomas, P.; Straube, A. V.; and Grima, R. 2011. Communication: Limitations of the stochastic quasi-steady-state approximation in open biochemical reaction networks.

Turanyi, T.; Tomlin, A.; and Pilling, M. 1993. On the error of the quasi-steady-state approximation. The Journal of Physical Chemistry 97(1): 163–172.

van de Plassche, K.; Citrin, J.; Bourdelle, C.; Camenen, Y.; Casson, F.; Felici, F.; Horn, P.; Ho, A.; Van Mulders, S.; Koechl, F.; et al. 2020. Fast surrogate modelling of turbulent transport in fusion plasmas with physics-informed neural networks.

Verwer, J. G. 1994. Gauss–Seidel iteration for stiff ODEs from chemical kinetics. SIAM Journal on Scientific Computing 15(5): 1243–1250.

Vlachas, P. R.; Pathak, J.; Hunt, B. R.; Sapsis, T. P.; Girvan, M.; Ott, E.; and Koumoutsakos, P. 2020. Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Networks.
Wang, S.; Teng, Y.; and Perdikaris, P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536.

Wanner, G.; and Hairer, E. 1996. Solving Ordinary Differential Equations II. Springer Berlin Heidelberg.

Willard, J.; Jia, X.; Xu, S.; Steinbach, M.; and Kumar, V. 2020. Integrating physics-based modeling with machine learning: A survey. arXiv preprint arXiv:2003.04919.

Zhang, R.; Zen, R.; Xing, J.; Arsa, D. M. S.; Saha, A.; and Bressan, S. 2020. Hydrological Process Surrogate Modelling and Simulation with Neural Networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 449–461. Springer.