Accelerating Simulation of Stiff Nonlinear Systems using Continuous-Time Echo State Networks

Ranjan Anantharaman(1), Yingbo Ma(2), Shashi Gowda(1), Chris Laughman(3), Viral B. Shah(2), Alan Edelman(1), Chris Rackauckas(1,2)
(1) Massachusetts Institute of Technology  (2) Julia Computing  (3) Mitsubishi Electric Research Laboratories

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Modern design, control, and optimization often require multiple expensive simulations of highly nonlinear stiff models. These costs can be amortized by training a cheap surrogate of the full model, which can then be used repeatedly. Here we present a general data-driven method, the continuous-time echo state network (CTESN), for generating surrogates of nonlinear ordinary differential equations with dynamics at widely separated timescales. We empirically demonstrate the ability to accelerate a physically motivated scalable model of a heating system by 98x while maintaining relative error within 0.2%. We showcase the ability of this surrogate to accurately handle highly stiff systems which have been shown to cause training failures with common surrogate methods such as Physics-Informed Neural Networks (PINNs), Long Short Term Memory (LSTM) networks, and discrete echo state networks (ESNs). We show that our model captures fast transients as well as slow dynamics, while demonstrating that fixed time step machine learning techniques are unable to adequately capture the multi-rate behavior. Together this provides compelling evidence for the ability of CTESN surrogates to predict and accelerate highly stiff dynamical systems which cannot be directly handled by previous scientific machine learning techniques.

Introduction

Stiff nonlinear systems of ordinary differential equations are widely prevalent throughout science and engineering (Wanner and Hairer 1996; Shampine and Gear 1979) and are characterized by dynamics with widely separated time scales. These systems require highly stable numerical methods to use non-vanishing step sizes reliably (Gear 1971), and also tend to be computationally expensive to solve. Even with state-of-the-art simulation techniques, design, control, and optimization of these systems remains intractable in many realistic engineering applications (Benner, Gugercin, and Willcox 2015). To address these challenges, researchers have focused on techniques to obtain an approximation to a system (called a "surrogate") whose forward simulation time is relatively inexpensive while maintaining reasonable accuracy (Willard et al. 2020; Ratnaswamy et al. 2019; Zhang et al. 2020; Kim et al. 2020; van de Plassche et al. 2020).

A popular class of traditional surrogatization techniques is projection-based model order reduction, such as the proper orthogonal decomposition (POD) (Benner, Gugercin, and Willcox 2015). This method computes "snapshots" of the trajectory and uses the singular value decomposition of the linearization in order to construct a basis of a subspace of the snapshot space, and the model is remade with a change of basis. However, if the system is very nonlinear, the computational complexity of this linearization-based reduced model can be almost as high as that of the original model. One way to overcome this difficulty is through empirical interpolation methods (Nguyen et al. 2014). Other methods to produce surrogates generally utilize the structural information known about highly regular systems like partial differential equation discretizations (Frangos et al. 2010).

Many of these methods require a scientist to actively make choices about the approximations being performed to the system. In contrast, data-driven approaches like Physics-Informed Neural Networks (PINNs) (Raissi, Perdikaris, and Karniadakis 2019) and Long Short Term Memory (LSTM) networks (Chattopadhyay, Hassanzadeh, and Subramanian 2020) have gained popularity due to their apparent applicability to "all" ordinary and partial differential equations in a single automated form. However, numerical stiffness (Söderlind, Jay, and Calvo 2015) and multiscale dynamics represent an additional challenge.
Highly stiff differential equations can lead to gradient pathologies that make common surrogate techniques like PINNs hard to train (Wang, Teng, and Perdikaris 2020).

A classic way to create surrogates for stiff systems is to simply eliminate the stiffness. The computational singular perturbation (CSP) method (Hadjinicolaou and Goussis 1998) has been shown to decompose chemical reaction equations into fast and slow modes. The fast modes are then eliminated, resulting in a non-stiff system. Another option is to perform problem-specific variable transformations (Qian et al. 2020; Kramer and Willcox 2019) to a form more suited to model order reduction by traditional methods. These transformations are often problem specific and require a scientist's intervention at the equation level. Recent studies on PINNs have demonstrated that such variable elimination may be required to handle highly stiff equations because the stiffness leads to ill-conditioned optimization problems. For example, on the classic Robertson's equations (ROBER) (Robertson 1976) and Pollution model (POLLU) (Verwer 1994) stiff test problems it was shown that direct training of PINNs failed, requiring the authors to perform a quasi-steady state (QSS) assumption in order for accurate prediction to occur (Ji et al. 2020). However, many chemical reaction systems require transient activations to properly arrive at the overarching dynamics, making the QSS assumption only applicable to a subset of dynamical systems (Henry and Martin 2016; Flach and Schnell 2006; Eilertsen and Schnell 2020; Turanyi, Tomlin, and Pilling 1993; Schuster and Schuster 1989; Thomas, Straube, and Grima 2011). Thus, while demonstrating promising results on difficult equations, training on the QSS-approximated equations applies only to specific chemical reaction networks and requires the scientist to make approximation choices that are difficult to automate, which reduces the general applicability that PINNs were meant to give.

The purpose of this work is to introduce a general data-driven method, the CTESN, that is generally applicable and able to accurately capture highly nonlinear heterogeneous stiff time-dependent systems without requiring the user to train on non-stiff approximations. It is able to accurately train and predict on highly ill-conditioned models. We demonstrate these results (Figure 1) on the Robertson's equations, which PINNs, LSTM networks, and discrete-time machine learning techniques fail to handle. Our results showcase the ability to transform difficult stiff equations into non-stiff reservoir equations which are then integrated in place of the original system. Given the O(n^3) scaling behavior of general stiff ODE solvers due to their internal LU-factorizations, and the linear scaling of the surrogate with the number of outputs, we observe increasing accelerations as the system gets larger. With this scaling difference we demonstrate the ability to accelerate a large stiff system by 98x while achieving < 0.2% error (Figure 5).

Figure 1: Prediction of each surrogate on the Robertson's equations. Shown in each panel is the result of the data-driven algorithm's prediction at p = [0.04, 3 × 10^7, 1 × 10^4], a parameter set not in the training data. Ground truth, obtained by solving the ODE using the Rosenbrock23 solver with absolute tolerance of 10^-6, is in blue. The PINN was trained using a 3-layer multi-layer perceptron with the ADAM optimizer for 300,000 epochs with minibatching, and its prediction is in red. Both the ESN and CTESN were trained with a reservoir size of 3000 on a parameter space of [0.036, 0.044] × [2.7 × 10^7, 3.3 × 10^7] × [9 × 10^3, 1.1 × 10^4], from which 1000 sets of parameters were sampled using Sobol sampling. The predictions of the CTESN are generated by the radial basis function prediction of Wout(p) and are shown in green. Predictions from the ESN are in purple. The LSTM predictions, in gold, are generated by a network with 3 hidden LSTM layers and an output dense layer, after training for 2000 epochs. (A) A timeseries plot of the y_1(t) predictions. (B) The absolute error of the surrogate predictions on y_1(t). (C) A timeseries plot of the y_2(t) predictions. (D) The absolute error of the surrogate predictions on y_2(t). (E) The result of y_1(t) + y_2(t) + y_3(t) over time. By construction this quantity's theoretical value is 1 over the timeseries.
Continuous-Time Echo State Networks

Echo State Networks (ESNs) are a reservoir computing framework which projects signals from a higher-dimensional space defined by the dynamics of a fixed non-linear system called a "reservoir" (Ozturk, Xu, and Príncipe 2007). The ESN's mathematical formulation is as follows. For an NR-dimensional reservoir, the reservoir equation is given by:

    r_{n+1} = f(A r_n + Wfb x_n),    (1)

where f is a chosen activation function (like tanh or sigmoid), A is the NR × NR reservoir weight matrix, and Wfb is the NR × N feedback matrix, where N is the size of our original model. In order to arrive at a prediction of our original model, we take a projection of the reservoir:

    x̂_n = g(Wout r_n),    (2)

where g is the output activation function (generally the identity or sigmoid) and Wout is the N × NR projection matrix. In the training process of an ESN, the matrices A and Wfb are randomly chosen constants, meaning the Wout matrix is the only variable which needs to be approximated. Wout is calculated using a least squares fit against the model's time series, which then fully describes the prediction process.

This process of using a direct linear solve, such as a QR-factorization, to calculate Wout means that no gradient-based optimization is used in the training process. For this reason ESNs have traditionally been used as a stand-in for recurrent neural networks that overcomes the vanishing gradient problem (Jaeger et al. 2007; Lukoševičius and Jaeger 2009; Mattheakis and Protopapas 2019; Vlachas et al. 2020; Chattopadhyay et al. 2019; Grezes 2014; Evanusa, Aloimonos, and Fermüller 2020; Butcher et al. 2013). However, ESNs have also been applied to learning chaotic systems (Chattopadhyay et al. 2019; Doan, Polifke, and Magri 2019), nonlinear systems identification (Jaeger 2003), biosignal processing (Kudithipudi et al. 2016), and robot control (Polydoros, Nalpantidis, and Krüger 2015). These are all cases where the derivative calculations are unstable or, as in the case of chaotic equations, are not well-defined for long time spans.
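To make Eqs. (1)-(2) concrete, the following is a minimal Julia sketch of a fixed-time-step ESN readout fit. The snapshot matrix `x` (N × T, at uniform time steps), the reservoir size, and the weight scalings are illustrative assumptions, not the exact settings used in the paper.

```julia
# Minimal sketch of a discrete ESN readout fit (Eqs. 1-2).
# `x` is an N x T matrix of model snapshots at uniform time steps; the reservoir
# size and weight scalings below are illustrative choices, not the paper's.
using LinearAlgebra, SparseArrays, Random

function fit_esn_readout(x; N_R = 300, density = 0.01, seed = 0)
    rng  = MersenneTwister(seed)
    N, T = size(x)
    A    = 0.1 .* sprandn(rng, N_R, N_R, density)   # fixed sparse reservoir matrix
    Wfb  = 0.1 .* randn(rng, N_R, N)                # fixed feedback matrix
    r    = zeros(N_R, T)
    for n in 1:T-1
        r[:, n+1] = tanh.(A * r[:, n] + Wfb * x[:, n])   # Eq. (1) with f = tanh
    end
    Wout = x / r    # least-squares readout so that x ≈ Wout * r (Eq. 2, g = identity)
    return A, Wfb, Wout, r
end
```

Because Wout comes from a direct linear solve, no gradient-based optimization is involved, which is the property the discussion above leans on.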
This ability to handle problems with gradient pathologies gives the intuitive justification for exploring reservoir computing techniques for handling stiff equations. However, stiff systems generally have behavior spanning multiple timescales which are difficult to represent with uniformly-spaced prediction intervals. For example, in the ROBER problem we will showcase, an important transient occurs for less than 10 seconds of the 10,000 second simulation. However, this feature is important to capture the catalysis that kick-starts the long-term changes. Many more samples from t ∈ [0, 10] will be required than from t ∈ [10, 10^5] in order to accurately capture the dynamics of the system. These behaviors are the reason why all of the major software for handling stiff equations, such as CVODE (Hindmarsh et al. 2005), LSODA (Hindmarsh and Petzold 2005), and Radau (Hairer and Wanner 1999), are adaptive. In fact, this behavior is so crucial to the stable handling of stiff systems that robust implicit solvers tie the stepping behavior to the implicit handling of the system, with complex procedures for reducing time steps when Newton convergence rates are reduced (Wanner and Hairer 1996; Hosea and Shampine 1996; Hairer and Wanner 1999). For these reasons, we will demonstrate that the classic fixed time step reservoir computing methods from machine learning are unable to handle these highly stiff equations.

To solve these issues, we introduce a new variant of ESNs, which we call continuous-time echo state networks (CTESNs), which allows for an underlying adaptive time process while avoiding gradient pathologies in training. Let N be the dimension of our model, and let P be a Cartesian space of parameters under which the model is expected to operate. The CTESN with reservoir dimension NR is defined as

    r'(t) = f(A r(t) + Whyb x(p*, t)),    (3)
    x(t)  = g(Wout r(t)),                 (4)

where A is a fixed sparse random matrix of dimension NR × NR and Whyb is a fixed random dense matrix of dimensions NR × N. The term Whyb x(p*, t) represents a "hybrid" term that incorporates physics information into the reservoir (Pathak et al. 2018), namely a solution at some point in the parameter space of the dynamical system. Given these fixed values, the readout matrix Wout is the only trained portion of this network and is obtained through a least squares fit of the reservoir ODE solution against the original timeseries. We note that in this study we choose f = tanh and g = id (the identity) for all of our examples.

To obtain a surrogate that predicts the dynamics at new physical parameters, the reservoir projection Wout is fit against many solutions at parameters {p_1, ..., p_n}, where n is the number of training data points sampled from the parameter space. Using these fits, an interpolating function Wout(p) between the matrices can be trained. A prediction x̂(t) at physical parameters p̂ is thus given by:

    x̂(t) = Wout(p̂) r(t).    (5)

A strong advantage of our method is its ease of implementation and ease of training. Global L2 fitting via stabilized methods like the SVD is robust to ill-conditioning, alleviating many of the issues encountered when attempting to build neural surrogates of such equations. Also note that in this particular case, the readout matrix is fit against the same reservoir time series. This means that prediction does not need to simulate the reservoir, providing an extra acceleration.

Another advantage is the ability to use time stepping information from the solver during training. As noted before, not only are step sizes chosen adaptively based on minimizing a local error estimate to a specified tolerance (Shampine and Gear 1979), but they also adapt to concentrate around the most stiff and numerically difficult time points of the model by incorporating the Newton convergence into the rejection framework. These timestamps thus provide heuristic snapshots of the most important points for training the least squares fit, whereas snapshots from uniform time steps may skip over many crucial aspects of the dynamics.
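A minimal sketch of how Eqs. (3)-(4) can be set up and fit is given below, assuming a continuous solution `sol_star` of the full model at one training parameter p* (e.g. an ODESolution from DifferentialEquations.jl) and a snapshot matrix `X` at the time points `ts` chosen by the stiff solve. Reservoir size, weight scalings, and the solver choice for the reservoir are illustrative, not the paper's exact settings.

```julia
# Sketch of constructing and fitting a CTESN (Eqs. 3-4). `sol_star` is a continuous
# (interpolating) solution of the full model at a training parameter p*; `ts` and `X`
# are the adaptive time points and the N x length(ts) snapshot matrix being fit.
using OrdinaryDiffEq, LinearAlgebra, SparseArrays, Random

function ctesn_fit(sol_star, ts, X; N_R = 3000, density = 0.01, seed = 0)
    rng  = MersenneTwister(seed)
    N    = size(X, 1)
    A    = 0.1 .* sprandn(rng, N_R, N_R, density)   # fixed sparse reservoir matrix
    Whyb = 0.1 .* randn(rng, N_R, N)                # fixed "hybrid" driving matrix

    # Reservoir ODE (Eq. 3): driven by the interpolated full-model solution x(p*, t).
    reservoir!(dr, r, p, t) = (dr .= tanh.(A * r .+ Whyb * sol_star(t)))
    rprob = ODEProblem(reservoir!, zeros(N_R), (first(ts), last(ts)))
    rsol  = solve(rprob, Tsit5(), saveat = ts)      # the reservoir ODE is non-stiff

    R    = Array(rsol)                              # N_R x length(ts) reservoir states
    Wout = X / R                                    # least-squares readout (Eq. 4, g = id)
    return rsol, Wout
end
```

Here `X / R` is Julia's least-squares right division; the paper emphasizes that such stabilized global fits remain robust under the ill-conditioning that defeats gradient-based training.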
Training

In this section we describe the automated training procedure used to generate CTESN surrogates. An input parameter space P is first chosen. This could be a design space for the model or a range of operating conditions. Now n sets of parameters {p_1, ..., p_n} are sampled from this space using a sampling sequence that covers as much of the space as possible. The full model is now simulated at each sample in parallel, since each run is independent, generating a time series for each run. The choice of points in time used to generate the time series at each p comes from the numerical ODE solve at that p. The reservoir ODE is then constructed using a candidate solution at any one of the n parameters, x(p*, t), p* ∈ {p_1, ..., p_n}, and is then simulated, generating the reservoir time series. Since the reservoir ODE is non-stiff, this simulation is cheap compared to the cost of the full model. Least squares projections can now be calculated from each solution to the reservoir in parallel. Once all the least squares matrices are obtained, an interpolating function is trained to predict the least squares projection matrix. Both the least squares fitting and training the interpolating function are, in practice, much cheaper than the cost of simulating the model multiple times.

Prediction comprises two steps: predicting the least squares matrix, and simulating the reservoir time series (or, in this case, just using the pre-computed continuous solution, since the reservoir is fixed for every set of parameters). The final prediction is just the matrix multiplication of the two.

A strong advantage of the training is that it requires no manual investigation of the stiff model on the part of the researcher and can be called as an off-the-shelf routine. It allows the researcher to make a trade-off, computing a few parallelized runs of the full stiff model in order to generate a surrogate, which can then be plugged in and used repeatedly for design and optimization.

We implemented the training routines and the following models in the Julia programming language (Bezanson et al. 2017) to take advantage of its first class support for differential equation solvers (Rackauckas and Nie 2017) and scientific machine learning packages. For the examples in this paper, we have sampled the high-dimensional spaces using a Sobol low-discrepancy sequence (Sobol' et al. 2011) and interpolated the Wout matrices using a radial basis function provided by the Julia package Surrogates.jl (https://github.com/SciML/Surrogates.jl).
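The sketch below ties these steps together, assuming a user-supplied full-model constructor `make_prob(p)` (a hypothetical name) and a `ctesn_fit`-style routine like the one sketched earlier. The paper interpolates the readout matrices with a radial basis function from Surrogates.jl; a simple inverse-distance weighting stands in for that interpolation here.

```julia
# Sketch of the outer training loop over the parameter space. `make_prob(p)` is a
# hypothetical constructor for the full stiff ODEProblem at parameters p; `ctesn_fit`
# is the routine sketched above. Inverse-distance weighting is used as a stand-in
# for the radial basis interpolation of Wout(p) used in the paper.
using Sobol, OrdinaryDiffEq, LinearAlgebra

function train_surrogate(make_prob, lb, ub; n = 100)
    seq  = SobolSeq(lb, ub)
    ps   = [next!(seq) for _ in 1:n]                                        # Sobol samples
    sols = [solve(make_prob(p), Rosenbrock23(), abstol = 1e-6) for p in ps] # stiff solves (parallelizable)

    # Build the reservoir once, driven by the solution at p* = ps[1].
    rsol, _ = ctesn_fit(sols[1], sols[1].t, Array(sols[1]))
    R_at(ts) = reduce(hcat, rsol.(ts))                  # reservoir states at given times

    # One least-squares readout per training parameter (also parallelizable).
    Wouts = [Array(sols[i]) / R_at(sols[i].t) for i in 1:n]

    # Interpolate Wout over the parameter space and assemble the predictor (Eq. 5).
    function Wout_at(p̂)
        w = [1 / (norm(p̂ .- p)^2 + 1e-12) for p in ps]
        sum(w .* Wouts) / sum(w)
    end
    predict(p̂, ts) = Wout_at(p̂) * R_at(ts)
    return predict
end
```

The returned `predict(p̂, ts)` realizes Eq. (5): a predicted readout matrix times the pre-computed reservoir states at the requested times.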
Case Studies

In this section we describe two representative examples. We demonstrate that the CTESN can handle highly stiff behaviour through the ROBER example. We then discuss the performance of the surrogate on a scalable, physically-inspired heating system.

Robertson Equations and High Stiffness

We first consider Robertson's chemical reactions involving three reaction species A, B and C:

    A -> B           (rate 0.04)
    B + B -> C + B   (rate 3 × 10^7)
    B + C -> A + C   (rate 10^4)

which lead to the ordinary differential equations:

    y_1' = -0.04 y_1 + 10^4 y_2 y_3,                     (6)
    y_2' =  0.04 y_1 - 10^4 y_2 y_3 - 3 × 10^7 y_2^2,    (7)
    y_3' =  3 × 10^7 y_2^2,                              (8)

where y_1, y_2, and y_3 are the concentrations of A, B and C respectively. This system has widely separated reaction rates (0.04, 10^4, 3 × 10^7), and is well known to be very stiff (Gobbert 1996; Robertson and Williams 1975; Robertson 1976). It is commonly used as an example to evaluate integrators of stiff ODEs (Hosea and Shampine 1996). Finding an accurate surrogate for this system is difficult because it needs to capture both the stable slow reacting system and the fast transients. Additionally, the surrogate needs to be consistent with this system's implicit physical constraints, such as the conservation of matter (y_1 + y_2 + y_3 = 1) and positivity of the variables (y_i > 0), in order to provide a stable solution.
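For reference, the ROBER system above can be written and solved in DifferentialEquations.jl as follows. The Rosenbrock23 solver and the 10^-6 absolute tolerance match the ground-truth setup described in Figure 1; the initial condition, time span, and relative tolerance are standard choices assumed here rather than values spelled out in the text.

```julia
# The ROBER system (Eqs. 6-8) solved with a stiff solver, mirroring the ground-truth
# setup in Figure 1 (Rosenbrock23, abstol = 1e-6). Initial condition, time span, and
# reltol are standard/illustrative choices not stated explicitly in the paper.
using OrdinaryDiffEq

function rober!(du, u, p, t)
    y1, y2, y3 = u
    k1, k2, k3 = p                     # p = [0.04, 3e7, 1e4] in the paper's notation
    du[1] = -k1 * y1 + k3 * y2 * y3
    du[2] =  k1 * y1 - k3 * y2 * y3 - k2 * y2^2
    du[3] =  k2 * y2^2
    return nothing
end

prob = ODEProblem(rober!, [1.0, 0.0, 0.0], (0.0, 1e5), [0.04, 3e7, 1e4])
sol  = solve(prob, Rosenbrock23(), abstol = 1e-6, reltol = 1e-6)
```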
A surrogate was trained by sampling 1000 sets of parameters from the Cartesian parameter space [0.036, 0.044] × [2.7 × 10^7, 3.3 × 10^7] × [9 × 10^3, 1.1 × 10^4] using Sobol sampling so as to evenly cover the whole space. We train on the time series of the three states themselves as outputs. A least squares projection Wout was fit for each set of parameters, and then a radial basis function was used to interpolate between the matrices. The prediction workflow is as follows: given a set of parameters whose time series is desired, the radial basis function predicts the projection matrix. The pre-simulated reservoir is then sampled at the desired time points, and a matrix multiplication with the predicted Wout gives us the desired prediction. Figure 1 shows a comparison between the CTESN, ESN, PINN and LSTM methods. The PINN data is reproduced from (Ji et al. 2020), and the ESN was trained using 101 time points uniformly sampled from the time span, while the CTESN used 61 adaptively sampled time points informed by the ODE solver (Rosenbrock23 (Shampine 1982)).

The CTESN method is able to accurately capture both the slow and fast transients and respect the conservation of mass. The ESN is able to accurately predict at the time points it was trained on, but many features are missed. The uniform stepping completely misses the fast transient rise at t = 10^-4 because the uniform intervals do not sample points from that time scale. Additionally, the first sampled time point at t = 100 is far into the concentration drop of y_1, which leads to an inaccurate prediction before the system settles into near steady state behavior. As stated earlier, the CTESN uses information from a stiff ODE solver to choose the right points along the time span to accurately capture multi-scale behaviour with less training data than the ESN. In order to compare the discrete ESN to the continuous result, a cubic spline was fit to its 101 evenly spaced prediction points.

The PINN was trained by sampling 2500 logarithmically spaced points across the time span. The network used was a 3-layer multi-layer perceptron with 128 nodes per hidden layer and the Gaussian Error Linear Unit activation function (Hendrycks and Gimpel 2016). The layers were initialized using Xavier initialization (Glorot and Bengio 2010), and trained with the ADAM optimizer (Kingma and Ba 2019) at a learning rate of 10^-3 for 300,000 epochs with a mini batch size of 128. Figure 2 shows the convergence plot as the PINN trains on the ROBER equations. The LSTM network used a similar architecture to the PINN, but with LSTM hidden layers instead of fully connected layers. It used 2500 logarithmically spaced points and was trained for 2000 epochs until convergence.

Figure 2: Training a PINN on the Robertson's Equations. The PINN was trained for 300,000 epochs using the ADAM optimizer with a learning rate of 10^-3, by which time the loss seems to saturate. The hyperparameters of the PINN can be found in the Case Studies section.

Figure 1 highlights how the trained PINN fails to capture both the fast and the slow transients and does not respect mass conservation. Our collaborators investigated why PINNs fail to solve these equations in (Ji et al. 2020). The reason for the difficulty can be attributed to recently identified results on gradient pathologies in the training arising from stiffness (Wang, Teng, and Perdikaris 2020). With a highly ill-conditioned Hessian in the training process due to the stiffness of the equation, it is very unlikely for local optimization to find parameters which make an accurate prediction. We additionally note that stiff systems of this form may be hard to capture by neural networks directly, as neural networks show a bias towards low frequency functions (Rahaman et al. 2019).
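For concreteness, the baseline PINN body described above (three hidden layers of 128 GELU units, Xavier initialization, ADAM at 10^-3) could be written as the following Flux.jl sketch; the input/output shapes are assumptions, and the physics-informed residual loss of (Ji et al. 2020) is not reproduced here.

```julia
# Sketch of the baseline PINN network body: a 3-hidden-layer MLP with 128 GELU units
# per layer, Xavier (Glorot) initialization, trained with ADAM at lr = 1e-3. Only the
# architecture and optimizer are shown, not the physics-informed loss.
using Flux

pinn_body = Chain(
    Dense(1 => 128, gelu; init = Flux.glorot_uniform),    # input: time t
    Dense(128 => 128, gelu; init = Flux.glorot_uniform),
    Dense(128 => 128, gelu; init = Flux.glorot_uniform),
    Dense(128 => 3; init = Flux.glorot_uniform),           # outputs: (y1, y2, y3)
)
opt = Flux.Adam(1e-3)
```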
Stiffly-Aware Surrogates of HVAC Systems

Our second test problem is a scalable benchmark used in the engineering community (Casella 2015). It is a simplified, lumped-parameter model of a heating system with a central heater supplying heat to several rooms through a distribution network. Each room has an on-off controller with hysteresis which provides very fast localized action (Ranade and Casella 2014). The resulting system of equations is thus very stiff and unable to be solved by standard explicit time stepping methods.

The size of the heating system is scaled by a parameter N which refers to the number of users/rooms. Each room is governed by two equations corresponding to its temperature and the state of its on-off controller. The temperature of the fluid supplying heat to each room is governed by one equation. This produces a system with 2N + 1 coupled non-linear equations. This "scalability" lets us test how our CTESN surrogate scales. To train the surrogate, we define a parameter space P under which we expect it to operate. First, we assume the set point temperature of each room to be between 17°C and 23°C. Each room is warmed by a heat conducting fluid, whose set point is between 65°C and 75°C. Thus the parameter space over which we expect our surrogate to work is the rectangular space denoted by [17°C, 23°C] × [65°C, 75°C].

We used a reservoir size of 3000 and sampled 100 sets of parameters from this space using Sobol sampling, and fit least squares projection matrices Wout between each solution and the reservoir. For a system with N rooms, we train on N + 1 outputs, namely the temperature of each room and the temperature of the heat conducting fluid. Figure 3 demonstrates that the training technique is accurately able to find matrices Wout which capture the stiff system within 0.01% error on test parameters. We then fit an interpolating radial basis function Wout(p). Figure 4 demonstrates that the interpolated Wout(p) is able to adequately capture the dynamics throughout the trained parameter space. Lastly, Figure 5 demonstrates the O(N) cost of the surrogate evaluation, which in comparison to the O(N^3) cost of a general implicit ODE solver (due to the LU-factorizations) leads to an increasing gap in the solver performance as N increases. At the high end of our test, the surrogate accelerates an 801-dimensional stiff ODE system by approximately 98x.

Figure 3: Validating the surrogate of the scalable heating system with 10 rooms. When tested with parameters it has not seen in training, our surrogate is able to reproduce the behaviour of the system to within 0.01% error. The surrogate is trained on 100 points sampled from the space [17°C, 23°C] × [65°C, 75°C], where the ranges represent the set point temperature of each room and the set point of the fluid supplying heat to the rooms respectively. The test parameters validated here are [21°C, 71°C]. More details on training can be found in the Case Studies section.

Figure 4: Reliability of the surrogate through parameter space. We sampled our grid at over 500 grid points and plotted a heatmap of test error through our parameter space. We find our surrogate performs reliably even at the border of our space, with error within 0.1%.

Figure 5: Scaling performance of the surrogate on the heating system. We compare the time taken to simulate the full stiff model to the trained surrogate with 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200 and 400 rooms. We observe a speedup of up to 98x. The surrogate was trained by sampling 100 sets of parameters from our input space, with a reservoir size of 3000.
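The timing comparison behind Figure 5 can be sketched as below, assuming a hypothetical constructor `make_heating_prob(N, p)` for the full 2N+1-equation stiff model and a trained surrogate `predict(p, ts)` in the style sketched earlier; neither name comes from the paper's code, and the stiff solver choice is illustrative.

```julia
# Sketch of the scaling study in Figure 5: time the full implicit stiff solve against
# the trained surrogate at the test set points. `make_heating_prob` and `predict` are
# hypothetical stand-ins for the paper's model and trained CTESN surrogate.
using OrdinaryDiffEq, BenchmarkTools

p_test = [21.0, 71.0]                        # room and fluid set points from Figure 3
for N in (10, 100, 400)
    prob   = make_heating_prob(N, p_test)    # hypothetical full-model constructor
    ts     = range(prob.tspan[1], prob.tspan[2]; length = 200)
    t_full = @belapsed solve($prob, Rodas5())      # implicit solve: O(N^3) factorizations
    t_surr = @belapsed $predict($p_test, $ts)      # surrogate readout: O(N) in outputs
    println("N = $N: speedup ≈ $(round(t_full / t_surr; digits = 1))x")
end
```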
Conclusion & Future Work

We present CTESNs, a data-driven method for generating surrogates of nonlinear ordinary differential equations with dynamics at widely separated timescales. Our method maintains accuracy for different parameters in a chosen parameter space, and shows favourable scaling with system size on a physics-inspired scalable model. This method can be applied to any ordinary differential equation without requiring the scientist to simplify the model before surrogate application, greatly improving productivity.

In future work, we plan to extend the formulation to take in forcing functions. This entails that the reservoir needs to be simulated every single time a prediction is made, adding to running time, but we do note that numerically simulating the reservoir is quite fast in practice as it is non-stiff, and thus techniques which regenerate reservoirs on demand will likely not incur a major run time performance cost.

Our method utilizes the continuous nature of differential equation solutions. Hybrid dynamical systems, such as those with event handling (Ellison 1981), can introduce discontinuities into the system which will require extensions to our method. Further extensions to the method will handle both derivative discontinuities and events present in Filippov dynamical systems (Filippov 2013). Further opportunities could explore utilizing more structure within equations for building a more robust CTESN or decreasing the necessary size of the reservoir.

To train both the example problems in this paper, we required no knowledge of the physics. This presents an opportunity to train surrogates of black-box systems.

Acknowledgement

The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Numbers DE-AR0001222 and DE-AR0001211, and NSF grant OAC-1835443. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. The authors thank Francesco Martinuzzi for reviewing drafts of this paper.

References

Benner, P.; Gugercin, S.; and Willcox, K. 2015. A survey of projection-based model reduction methods for parametric dynamical systems. SIAM Review 57(4): 483–531.

Bezanson, J.; Edelman, A.; Karpinski, S.; and Shah, V. B. 2017. Julia: A fresh approach to numerical computing. SIAM Review 59(1): 65–98.

Butcher, J. B.; Verstraeten, D.; Schrauwen, B.; Day, C. R.; and Haycock, P. W. 2013. Reservoir computing and extreme learning machines for non-linear time-series data analysis. Neural Networks 38: 76–89.

Casella, F. 2015. Simulation of large-scale models in Modelica: State of the art and future perspectives. In Proceedings of the 11th International Modelica Conference, 459–468.

Chattopadhyay, A.; Hassanzadeh, P.; and Subramanian, D. 2020. Data-driven predictions of a multiscale Lorenz 96 chaotic system using machine-learning methods: reservoir computing, artificial neural network, and long short-term memory network. Nonlinear Processes in Geophysics 27(3): 373–389.

Chattopadhyay, A.; Hassanzadeh, P.; Subramanian, D.; and Palem, K. 2019. Data-driven prediction of a multi-scale Lorenz 96 chaotic system using a hierarchy of deep learning methods: Reservoir computing, ANN, and RNN-LSTM.

Doan, N. A. K.; Polifke, W.; and Magri, L. 2019. Physics-informed echo state networks for chaotic systems forecasting. In International Conference on Computational Science, 192–198. Springer.

Eilertsen, J.; and Schnell, S. 2020. The quasi-steady-state approximations revisited: Timescales, small parameters, singularities, and normal forms in enzyme kinetics. Mathematical Biosciences 108339.

Ellison, D. 1981. Efficient automatic integration of ordinary differential equations with discontinuities. Mathematics and Computers in Simulation 23(1): 12–20.

Evanusa, M.; Aloimonos, Y.; and Fermüller, C. 2020. Deep Reservoir Computing with Learned Hidden Reservoir Weights using Direct Feedback Alignment. arXiv preprint arXiv:2010.06209.

Filippov, A. F. 2013. Differential Equations with Discontinuous Righthand Sides: Control Systems, volume 18. Springer Science & Business Media.

Flach, E. H.; and Schnell, S. 2006. Use and abuse of the quasi-steady-state approximation. IEE Proceedings - Systems Biology 153(4): 187–191.

Frangos, M.; Marzouk, Y.; Willcox, K.; and van Bloemen Waanders, B. 2010. Surrogate and reduced-order modeling: a comparison of approaches for large-scale statistical inverse problems [Chapter 7].

Gear, C. W. 1971. Numerical Initial Value Problems in Ordinary Differential Equations.

Glorot, X.; and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.

Gobbert, M. K. 1996. Robertson's example for stiff differential equations. Arizona State University, Technical report.

Grezes, F. 2014. Reservoir Computing. Ph.D. thesis, The City University of New York.

Hadjinicolaou, M.; and Goussis, D. A. 1998. Asymptotic solution of stiff PDEs with the CSP method: the reaction diffusion equation. SIAM Journal on Scientific Computing 20(3): 781–810.

Hairer, E.; and Wanner, G. 1999. Stiff differential equations solved by Radau methods. Journal of Computational and Applied Mathematics 111(1-2): 93–111.

Hendrycks, D.; and Gimpel, K. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Henry, A.; and Martin, O. C. 2016. Short relaxation times but long transient times in both simple and complex reaction networks. Journal of the Royal Society Interface 13(120): 20160388.

Hindmarsh, A.; and Petzold, L. 2005. LSODA, Ordinary Differential Equation Solver for Stiff or Non-stiff System.

Hindmarsh, A. C.; Brown, P. N.; Grant, K. E.; Lee, S. L.; Serban, R.; Shumaker, D. E.; and Woodward, C. S. 2005. SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software (TOMS) 31(3): 363–396.

Hosea, M.; and Shampine, L. 1996. Analysis and implementation of TR-BDF2. Applied Numerical Mathematics 20(1-2): 21–37.

Jaeger, H. 2003. Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems, 609–616.

Jaeger, H.; Lukoševičius, M.; Popovici, D.; and Siewert, U. 2007. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks 20(3): 335–352.

Ji, W.; Qiu, W.; Shi, Z.; Pan, S.; and Deng, S. 2020. Stiff-PINN: Physics-Informed Neural Network for Stiff Chemical Kinetics. arXiv preprint arXiv:2011.04520.

Kim, Y.; Choi, Y.; Widemann, D.; and Zohdi, T. 2020. A fast and accurate physics-informed neural network reduced order model with shallow masked autoencoder. arXiv preprint arXiv:2009.11990.

Kingma, D. P.; and Ba, J. 2019. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kramer, B.; and Willcox, K. E. 2019. Nonlinear model order reduction via lifting transformations and proper orthogonal decomposition. AIAA Journal 57(6): 2297–2307.
Kudithipudi, D.; Saleh, Q.; Merkel, C.; Thesing, J.; and Wysocki, B. 2016. Design and analysis of a neuromemristive reservoir computing architecture for biosignal processing. Frontiers in Neuroscience 9: 502.

Lukoševičius, M.; and Jaeger, H. 2009. Reservoir computing approaches to recurrent neural network training. Computer Science Review 3(3): 127–149.

Mattheakis, M.; and Protopapas, P. 2019. Recurrent Neural Networks: Exploding, Vanishing Gradients & Reservoir Computing.

Nguyen, V.; Buffoni, M.; Willcox, K.; and Khoo, B. 2014. Model reduction for reacting flow applications. International Journal of Computational Fluid Dynamics 28(3-4): 91–105.

Ozturk, M. C.; Xu, D.; and Príncipe, J. C. 2007. Analysis and design of echo state networks. Neural Computation 19(1): 111–138.

Pathak, J.; Wikner, A.; Fussell, R.; Chandra, S.; Hunt, B. R.; Girvan, M.; and Ott, E. 2018. Hybrid forecasting of chaotic processes: Using machine learning in conjunction with a knowledge-based model. Chaos: An Interdisciplinary Journal of Nonlinear Science 28(4): 041101.

Polydoros, A.; Nalpantidis, L.; and Krüger, V. 2015. Advantages and limitations of reservoir computing on model learning for robot control. In IROS Workshop on Machine Learning in Planning and Control of Robot Motion, Hamburg, Germany.

Qian, E.; Kramer, B.; Peherstorfer, B.; and Willcox, K. 2020. Lift & Learn: Physics-informed machine learning for large-scale nonlinear dynamical systems. Physica D: Nonlinear Phenomena 406: 132401.

Rackauckas, C.; and Nie, Q. 2017. DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia. Journal of Open Research Software 5(1).

Rahaman, N.; Baratin, A.; Arpit, D.; Draxler, F.; Lin, M.; Hamprecht, F.; Bengio, Y.; and Courville, A. 2019. On the spectral bias of neural networks. In International Conference on Machine Learning, 5301–5310. PMLR.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378: 686–707.

Ranade, A.; and Casella, F. 2014. Multi-rate integration algorithms: a path towards efficient simulation of object-oriented models of very large systems. In Proceedings of the 6th International Workshop on Equation-Based Object-Oriented Modeling Languages and Tools, 79–82.

Ratnaswamy, V.; Safta, C.; Sargsyan, K.; and Ricciuto, D. M. 2019. Physics-informed Recurrent Neural Network Surrogates for E3SM Land Model. AGUFM 2019: GC43D–1365.

Robertson, H. 1976. Numerical integration of systems of stiff ordinary differential equations with special structure. IMA Journal of Applied Mathematics 18(2): 249–263.

Robertson, H.; and Williams, J. 1975. Some properties of algorithms for stiff differential equations. IMA Journal of Applied Mathematics 16(1): 23–34.

Schuster, S.; and Schuster, R. 1989. A generalization of Wegscheider's condition. Implications for properties of steady states and for quasi-steady-state approximation. Journal of Mathematical Chemistry 3(1): 25–42.

Shampine, L. F. 1982. Implementation of Rosenbrock methods. ACM Transactions on Mathematical Software (TOMS) 8(2): 93–113.

Shampine, L. F.; and Gear, C. W. 1979. A user's view of solving stiff ordinary differential equations. SIAM Review 21(1): 1–17.

Sobol', I. M.; Asotsky, D.; Kreinin, A.; and Kucherenko, S. 2011. Construction and comparison of high-dimensional Sobol' generators. Wilmott 2011(56): 64–79.

Söderlind, G.; Jay, L.; and Calvo, M. 2015. Stiffness 1952–2012: Sixty years in search of a definition. BIT Numerical Mathematics 55(2): 531–558.

Thomas, P.; Straube, A. V.; and Grima, R. 2011. Communication: Limitations of the stochastic quasi-steady-state approximation in open biochemical reaction networks.

Turanyi, T.; Tomlin, A.; and Pilling, M. 1993. On the error of the quasi-steady-state approximation. The Journal of Physical Chemistry 97(1): 163–172.

van de Plassche, K.; Citrin, J.; Bourdelle, C.; Camenen, Y.; Casson, F.; Felici, F.; Horn, P.; Ho, A.; Van Mulders, S.; Koechl, F.; et al. 2020. Fast surrogate modelling of turbulent transport in fusion plasmas with physics-informed neural networks.

Verwer, J. G. 1994. Gauss–Seidel iteration for stiff ODEs from chemical kinetics. SIAM Journal on Scientific Computing 15(5): 1243–1250.

Vlachas, P. R.; Pathak, J.; Hunt, B. R.; Sapsis, T. P.; Girvan, M.; Ott, E.; and Koumoutsakos, P. 2020. Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Networks.
Wang, S.; Teng, Y.; and Perdikaris, P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536.

Wanner, G.; and Hairer, E. 1996. Solving Ordinary Differential Equations II. Springer Berlin Heidelberg.

Willard, J.; Jia, X.; Xu, S.; Steinbach, M.; and Kumar, V. 2020. Integrating physics-based modeling with machine learning: A survey. arXiv preprint arXiv:2003.04919.

Zhang, R.; Zen, R.; Xing, J.; Arsa, D. M. S.; Saha, A.; and Bressan, S. 2020. Hydrological Process Surrogate Modelling and Simulation with Neural Networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 449–461. Springer.