Sparsely constrained neural networks for model discovery of PDEs

Gert-Jan Both,1 Gijs Vermariën,2 Remy Kusters1
1 Université de Paris, INSERM U1284, Center for Research and Interdisciplinarity (CRI), F-75006 Paris, France
2 Leiden Observatory, Leiden University, Leiden, The Netherlands
gert-jan.both@cri-paris.org, vermarien@strw.leidenuniv.nl, remy.kusters@cri-paris.org

Abstract

Sparse regression on a library of candidate features has developed into the prime method for discovering the partial differential equation underlying a spatio-temporal data-set. These features consist of higher-order derivatives, which limits model discovery to densely sampled data-sets with low noise. Neural network-based approaches circumvent this limit by constructing a surrogate model of the data, but have to date ignored advances in sparse regression algorithms. In this paper we present a modular framework that dynamically determines the sparsity pattern of a deep-learning based surrogate using any sparse regression technique. Using this new approach, we introduce a new constraint on the neural network and show how a different network architecture and sparsity estimator improve model discovery accuracy and convergence on several benchmark examples. Our framework is available at https://github.com/PhIMaL/DeePyMoD.

Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

Model discovery aims at finding interpretive models in the form of PDEs from large spatio-temporal data-sets. Most algorithms apply sparse regression to a predefined set of candidate terms, as initially proposed by Brunton et al. for ODEs with SINDy (Brunton, Proctor, and Kutz 2016) and by Rudy et al. for PDEs with PDE-find (Rudy et al. 2017). By writing the unknown differential equation as ∂t u = f(u, u_x, ...) and assuming the right-hand side is a linear combination of predefined terms, i.e. f(u, u_x, ...) = au + bu_x + ... = Θξ, model discovery reduces to finding a sparse coefficient vector ξ. Calculating the time derivative u_t and the function library Θ is notoriously hard for noisy and sparse data, since it involves calculating higher-order derivatives. The error in these terms is typically large due to the use of numerical differentiation techniques such as finite differences or spline interpolation, limiting classical model discovery to low-noise and densely sampled data-sets. Deep learning-based methods circumvent this issue by constructing a surrogate of the data and calculating the feature library Θ as well as the time derivative u_t from this digital twin using automatic differentiation. This approach significantly improves the accuracy of the time derivative and the library on noisy and sparse data-sets, but it suffers from convergence issues and, to date, does not leverage advanced sparse regression techniques.

In this paper we present a modular approach to combine deep-learning based models with state-of-the-art sparse regression techniques. Our framework consists of a neural network to model the data, from which we construct the function library. Key to our approach is that we dynamically apply a mask to select the active terms in the function library throughout training and constrain the network to solutions of the equation given by these active terms. To determine this mask, we can use any non-differentiable sparsity-promoting algorithm (see figure 1). This allows us to use a constrained neural network to model the data and construct an accurate function library, while an advanced sparsity-promoting algorithm dynamically discovers the equation from the output of the network. We present three experiments to show how varying these components improves the performance of model discovery. (I) We replace the gradient-based optimisation of the constraint by one based on ordinary least squares, leading to much faster convergence. (II) We show that using PDE-find to select the active components outperforms a threshold-based Lasso approach on highly noisy data-sets. (III) We demonstrate that using a SIREN (Sitzmann et al. 2020) instead of a standard feed-forward neural network allows us to discover equations from highly complex data-sets.
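To make the formulation u_t = Θξ concrete, the following minimal sketch applies sequentially thresholded least squares, the idea underlying SINDy and PDE-find, to a precomputed library Θ and time derivative u_t. It is illustrative only: the function name and the threshold value are ours, and implementations such as PDE-find's STRidge additionally use ridge regularisation and library normalisation.

```python
import numpy as np

def stlsq(theta, u_t, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares: solve u_t = theta @ xi and
    repeatedly zero out coefficients whose magnitude falls below `threshold`."""
    xi = np.linalg.lstsq(theta, u_t, rcond=None)[0]
    for _ in range(max_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        active = ~small
        if not active.any():
            break  # every term was eliminated
        xi[active] = np.linalg.lstsq(theta[:, active], u_t, rcond=None)[0]
    return xi
```

Here theta is an (N, M) array of candidate terms evaluated at N samples and u_t the corresponding time derivatives; the returned vector ξ is sparse, and its non-zero entries define the discovered equation.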
Related Work

Sparse regression. Sparse regression as a means to discover differential equations was pioneered by SINDy (Brunton, Proctor, and Kutz 2016) and PDE-find (Rudy et al. 2017). These methods have since been extended with automated hyperparameter tuning (Champion et al. 2019a; Maddu et al. 2019), a Bayesian approach to model discovery using Sparse Bayesian Learning (Yuan et al. 2019), model discovery for parametric differential equations (Rudy, Kutz, and Brunton 2019) and an evolutionary approach to PDE discovery (Maslyaev, Hvatov, and Kalyuzhnaya 2019).

Deep learning-based model discovery. With the advent of physics-informed neural networks (Raissi, Perdikaris, and Karniadakis 2017a,b), a neural network has become one of the prime approaches to create a surrogate of the data and then perform sparse regression on the network's prediction (Schaeffer 2017; Berg and Nyström 2019). Alternatively, Neural ODEs have been introduced to discover unknown governing equations from physical data-sets (Rackauckas et al. 2020). A different optimisation strategy, based on the method of alternating directions, is considered in (Chen, Liu, and Sun 2020), and graph-based approaches have been developed recently (Seo and Liu 2019; Sanchez-Gonzalez et al. 2018). (Greydanus, Dzamba, and Yosinski 2019) and (Cranmer et al. 2020) directly encode symmetries in neural networks using the Hamiltonian and Lagrangian frameworks, respectively. Finally, auto-encoders have been used to model PDEs and discover latent variables (Lu, Kim, and Soljačić 2019; Iten et al. 2020), but they do not yield an explicit equation and require large amounts of data.

Figure 1: Schematic overview of our framework. (I) A function approximator constructs a surrogate of the data, (II) from which a library of possible terms and the time derivative is constructed using automatic differentiation. (III) A sparsity estimator selects the active terms in the library using sparse regression and (IV) the function approximator is constrained to solutions allowed by the active terms by the constraint.

Deep-learning based model discovery with sparse regression

Deep learning-based model discovery uses a neural network to construct a surrogate model û of the data u. A library of candidate terms Θ is constructed from û using automatic differentiation, and the neural network is constrained to solutions allowed by this library (Both et al. 2019). The loss function of the network thus consists of two contributions: (i) a mean squared error to learn the mapping (x, t) → û and (ii) a term to constrain the network,

L = \frac{1}{N}\sum_{i=1}^{N}\left(u_i - \hat{u}_i\right)^2 + \frac{1}{N}\sum_{i=1}^{N}\left(\partial_t \hat{u}_i - \Theta_i \xi\right)^2.    (1)

The sparse coefficient vector ξ is learned concurrently with the network parameters and plays two roles: 1) determining the active (i.e. non-zero) components of the underlying PDE and 2) constraining the network according to these active terms. We propose to separate these two tasks by decoupling the constraint from the sparsity selection process itself. We first calculate a sparsity mask g and constrain the network only by the active terms in the mask: instead of constraining the neural network with ξ, we constrain it with ξ ∘ g, replacing eq. 1 with

L = \frac{1}{N}\sum_{i=1}^{N}\left(u_i - \hat{u}_i\right)^2 + \frac{1}{N}\sum_{i=1}^{N}\left(\partial_t \hat{u}_i - \Theta_i (\xi \circ g)\right)^2.    (2)

Training using eq. 2 requires two steps: first, we calculate g using a sparse estimator; next, we minimise the loss with respect to the network parameters using the masked coefficient vector. The sparsity mask g need not be calculated differentiably, so any classical, non-differentiable sparse estimator can be used. This approach has several additional advantages: i) it provides an unbiased estimate of the coefficient vector, since we do not apply l1 or l2 regularisation on ξ; ii) the sparsity pattern is determined from the full library Θ, rather than only from the remaining active terms, allowing active terms to be added and removed dynamically throughout training; and iii) we can use cross-validation in the sparse estimator to find the optimal hyperparameters for model selection. Finally, we note that the sparsity mask g mirrors the role of attention in transformers (Bahdanau, Cho, and Bengio 2016).
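As an illustration of eq. 2, the minimal sketch below builds the time derivative and a small library with automatic differentiation and evaluates the masked loss for a one-dimensional problem. This is a sketch only, assuming a scalar network û(x, t); the chosen library terms, tensor shapes and function name are ours and do not reflect the actual DeePyMoD modules.

```python
import torch

def masked_pde_loss(network, x, t, u, xi, mask):
    """Sketch of eq. 2 in one spatial dimension: build u_t and a small library
    Theta = [1, u, u_x, u_xx, u*u_x] with automatic differentiation, then add
    the data MSE and the residual of the masked constraint."""
    coords = torch.stack([x, t], dim=-1).requires_grad_(True)
    u_hat = network(coords).squeeze(-1)
    grads = torch.autograd.grad(u_hat, coords, torch.ones_like(u_hat), create_graph=True)[0]
    u_x, u_t = grads[:, 0], grads[:, 1]
    u_xx = torch.autograd.grad(u_x, coords, torch.ones_like(u_x), create_graph=True)[0][:, 0]
    theta = torch.stack([torch.ones_like(u_hat), u_hat, u_x, u_xx, u_hat * u_x], dim=-1)
    residual = u_t - theta @ (xi * mask)  # mask is a {0, 1} vector over the library terms
    return torch.mean((u - u_hat) ** 2) + torch.mean(residual ** 2)
```

Because the mask enters only as an element-wise product, any external algorithm can set it; the gradients of the loss with respect to the network parameters are unaffected by how g was computed.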
Using this change, we construct a general framework for deep learning-based model discovery using four modules (see figure 1). (I) A function approximator constructs a surrogate model of the data, (II) from which a library of possible terms and the time derivative is constructed using automatic differentiation. (III) A sparsity estimator constructs a sparsity mask to select the active terms in the library using some sparse regression algorithm, and (IV) a constraint restricts the function approximator to solutions allowed by the active terms obtained from the sparsity estimator.

Training

We typically calculate the sparsity mask g using an external, non-differentiable estimator. In this case, updating the mask at the right time is crucial: before the function approximator has reasonably approximated the data, updating the mask would adversely affect training, as it is likely to select the wrong terms. Vice versa, updating the mask too late risks using a function library from an overfitted network. We implement a procedure in the spirit of "early stopping" to decide when to update: the data-set is split into a train- and a test-set, and we update the mask once the mean squared error on the test-set reaches a minimum or changes by less than a preset value δ. We typically set δ = 10−6 to ensure the network has learned a good representation of the data. After the first update, we periodically update the mask using the sparsity estimator.
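A minimal sketch of such an update trigger is given below; the class name and interface are hypothetical, and real training code would typically smooth the test MSE over several evaluations before firing.

```python
class MaskUpdateTrigger:
    """Early-stopping style trigger for the first sparsity-mask update: fire once
    the test MSE no longer improves by more than `delta`, i.e. it has reached a
    minimum or changes less than the preset value."""

    def __init__(self, delta=1e-6):
        self.delta = delta
        self.best = float("inf")

    def __call__(self, test_mse):
        fire = (self.best - test_mse) < self.delta  # no meaningful improvement left
        self.best = min(self.best, test_mse)
        return fire
```

In a training loop the test MSE is evaluated periodically and passed to the trigger; after it fires for the first time, the mask is refreshed at a fixed interval.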
In figure 2 we demonstrate this training procedure on a Burgers data-set with 1500 samples and 2% white noise. It shows the losses on the train- and test-set in panel A, the coefficients of the constraint in panel B and the sparsity mask in panel C. In practice we observe that large, low-noise data-sets typically yield the correct PDE after only a single sparsity update, whereas noisy data-sets require several updates, removing only a few terms at a time. Final convergence is reached when the l1 norm of the coefficient vector remains constant.

Figure 2: A) MSE on the test-set and total loss on the train-set as a function of the number of epochs. The vertical line indicates the first time the sparsity mask is applied. B) The twelve coefficients as a function of the number of epochs. The two terms u_xx and uu_x need to be recovered. C) Dynamic sparsity mask during training. Yellow components are active, blue components are inactive.

Package

We provide our framework as a Python package at https://github.com/PhIMaL/DeePyMoD, with documentation and examples available at https://phimal.github.io/DeePyMoD/. Mirroring our approach, each model consists of four modules: a function approximator, a library, a constraint and a sparsity estimator module. Each module can be customised or replaced without affecting the other modules, allowing for quick experimentation. Our framework is built on PyTorch (Paszke et al. 2019) and any PyTorch model (e.g. recurrent neural networks) can be used as function approximator. The sparse estimator module follows the Scikit-learn API (Pedregosa et al.; Buitinck et al. 2013), i.e., all built-in Scikit-learn estimators, as well as those in PySINDy (de Silva et al. 2020) or sktime (Löning et al.), can be used.

Experiments

Constraint. The sparse coefficient vector ξ in eq. 1 is typically found by optimising it concurrently with the neural network parameters θ. Considering a network with parameter configuration θ*, the problem of finding ξ can be rewritten as arg min_ξ ||u_t(θ*) − Θ(θ*)ξ||². This can be solved analytically by least squares under mild assumptions; we calculate ξ by solving this problem every iteration, rather than optimising it with gradient descent. In figure 3 we compare the two constraining strategies on a Burgers data-set[1], training for 5000 epochs without updating the sparsity mask[2]. Panel A shows that the least-squares approach reaches a consistently lower loss. More strikingly, panel B shows that the mean absolute error of the coefficients is three orders of magnitude lower. We explain the difference as a consequence of the random initialisation of ξ: the network is initially constrained by incorrect coefficients, prolonging convergence. The random initialisation also causes the larger spread in results compared to the least-squares method, which does not suffer from this sensitivity and consistently converges.

[1] We solve u_t = u_xx + ν uu_x with a delta-peak initial condition for ν = 0.1, on x = [−3, 4], t = [0.5, 5]; we randomly sample 2000 points and add 10% white noise.
[2] All experiments use a network of 5 layers with 30 neurons per layer and a tanh activation function. The network is optimised using the ADAM optimiser with a learning rate of 2 · 10−3 and β = (0.99, 0.999).

Figure 3: A) Loss and B) mean absolute error of the coefficients obtained with the gradient descent and the least squares constraint, as a function of the number of epochs. Results have been averaged over twenty runs; the shaded area denotes the standard deviation.
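Returning to the constraint: a minimal sketch of the closed-form least-squares solve is shown below, assuming the library Θ and time derivative u_t have already been evaluated as tensors of shape (N, M) and (N,). Whether gradients are propagated through the solve is an implementation choice; here the inputs are detached for simplicity, and the function name is ours.

```python
import torch

def least_squares_xi(theta, u_t):
    """Constraint via ordinary least squares: solve argmin_xi ||u_t - Theta xi||^2
    in closed form at every iteration, instead of learning xi by gradient descent."""
    # theta: (N, M) library, u_t: (N,) time derivative; returns xi of shape (M, 1).
    return torch.linalg.lstsq(theta.detach(), u_t.detach().unsqueeze(-1)).solution
```

The resulting ξ can be plugged directly into the masked residual of eq. 2, so the network parameters are the only quantities updated by gradient descent.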
Sparsity estimator. Implementing the sparsity estimator separately from the neural network allows us to use any sparsity-promoting algorithm. Here we show that a classical method for PDE model discovery, PDE-find (Rudy et al. 2017), can be used together with neural networks to perform model discovery on highly sparse and noisy data-sets. We compare it in figure 4 with the thresholded Lasso approach[3] (Both et al. 2019) on a Burgers data-set[4] with varying amounts of noise. The PDE-find estimator discovers the correct equation in the majority of cases, even with up to 60%–80% noise, whereas the thresholded Lasso mostly fails at 40%. We emphasise that the modular approach we propose here allows classical and deep learning-based techniques to be combined; more advanced sparsity estimators such as SR3 (Champion et al. 2019b) can easily be included in this framework.

[3] We use a pre-set threshold of 0.1.
[4] See footnote 2, only with 1000 points randomly sampled.

Figure 4: Fraction of correctly discovered Burgers equations (averaged over 10 runs) as a function of the noise level, for the thresholded Lasso and the PDE-find sparsity estimator.
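For reference, a minimal sketch of the thresholded-Lasso baseline is given below. The normalisation and cross-validation settings are illustrative and may differ from those used in (Both et al. 2019); the 0.1 threshold mirrors footnote 3, and the function name is ours.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def thresholded_lasso_mask(theta, u_t, threshold=0.1):
    """Baseline sparsity estimator: fit a cross-validated Lasso on the
    column-normalised library and mark as active the terms whose
    coefficient magnitude exceeds `threshold`."""
    norms = np.linalg.norm(theta, axis=0)
    coeffs = LassoCV(fit_intercept=False).fit(theta / norms, u_t).coef_
    return np.abs(coeffs) > threshold
```

Because the estimator only has to return a boolean mask over the library terms, it can be swapped for PDE-find, SR3 or any other scikit-learn-style estimator without touching the rest of the training loop.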
Function approximator. We show in figure 5 that a tanh-based NN fails to converge on a data-set of the Kuramoto-Sivashinsky (KS) equation[5] (panels A and B). Consequently, the coefficient vectors are incorrect (panel D). As our framework is agnostic to the underlying function approximator, we instead use a SIREN[6], which is able to learn very sharp features in the underlying dynamics. Panel B shows that a SIREN is able to learn the complex dynamics of the KS equation, and panel C that it discovers the correct equation[7]. This example shows that the choice of function approximator can be a decisive factor in the success of neural network-based model discovery. Using our framework we can also explore RNNs, Neural ODEs (Rackauckas et al. 2020) or graph neural networks (Seo and Liu 2019).

[5] We solve ∂t u + uu_x + u_xx + u_xxxx = 0 between x = [0, 100], t = [0, 44], randomly sample 25000 points and add 5% white noise.
[6] Both networks use 8 layers with 50 neurons. We train the SIREN using ADAM with a learning rate of 2.5 · 10−4 and β = (0.999, 0.999).
[7] In bold; uu_x: green, u_xx: blue and u_xxxx: orange.

Figure 5: A) Solution of the KS equation; the lower panel shows the cross-section at the last time point, t = 44. B) MSE as a function of the number of epochs for both the tanh-based NN and the SIREN. Coefficients as a function of the number of epochs for C) the SIREN and D) the tanh-based NN. The bold curves in panels C and D are the terms of the KS equation; green: uu_x, blue: u_xx, orange: u_xxxx. Only the SIREN is able to discover the correct equation.
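For completeness, a minimal SIREN layer in the spirit of Sitzmann et al. (2020) is sketched below. The value ω0 = 30 is the default suggested in that paper rather than a value reported here, and a full SIREN stacks several such layers followed by a linear output layer.

```python
import numpy as np
import torch
from torch import nn

class SineLayer(nn.Module):
    """A single SIREN layer: a linear map followed by sin(omega_0 * x),
    with the uniform initialisation proposed by Sitzmann et al. (2020)."""

    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        with torch.no_grad():
            bound = 1.0 / in_features if is_first else np.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))
```

Since any PyTorch module can serve as the function approximator, replacing the tanh network by such a SIREN requires no changes to the library, constraint or sparsity estimator modules.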
Discussion and future work

In this paper we introduced a framework for model discovery that combines classical sparsity estimation with deep learning-based surrogates. Building on this, we showed that replacing the function approximator or the constraint, or dynamically applying the sparsity estimator during training, can extend model discovery to more complex data-sets, speed up convergence or make it more robust to noise. Each of the four components is decoupled from the rest and can be changed independently, making our approach a solid base for future research. Currently, the function approximator simply learns the solution using a feed-forward neural network. We suspect that adding more structure, for example by using recurrent, convolutional or graph neural networks, will improve the performance of model discovery. It might also be beneficial to regularise the constraint, for example by implementing Lasso or ridge regression. Updating the sparsity mask in a non-differentiable manner works because the neural network is able to learn a fairly accurate surrogate without imposing sparsity on the constraint. If the network is unable to learn an accurate representation, our approach breaks down. Updating the mask in a differentiable manner would not suffer from this drawback, and we intend to pursue this in future work.

Acknowledgments

This work received support from the CRI Research Fellowship attributed to Remy Kusters. We thank the Bettencourt Schueller Foundation long-term partnership and NVIDIA for supplying the GPU under the Academic Grant program. We would also like to thank the authors and contributors of NumPy (Harris et al. 2020), SciPy (Virtanen et al. 2020), Scikit-learn (Pedregosa et al.), Matplotlib (Hunter 2007), IPython (Perez and Granger 2007), and PyTorch (Paszke et al. 2019) for making our work possible through their open-source software. The authors declare no competing interest.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat]. URL http://arxiv.org/abs/1409.0473.

Berg, J.; and Nyström, K. 2019. Data-driven discovery of PDEs in complex datasets. Journal of Computational Physics 384: 239–252. doi:10.1016/j.jcp.2019.01.036.

Both, G.-J.; Choudhury, S.; Sens, P.; and Kusters, R. 2019. DeepMoD: Deep learning for Model Discovery in noisy data. arXiv:1904.09406 [physics, q-bio, stat]. URL http://arxiv.org/abs/1904.09406.

Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2016. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113(15): 3932–3937. doi:10.1073/pnas.1517384113.

Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; Layton, R.; Vanderplas, J.; Joly, A.; Holt, B.; and Varoquaux, G. 2013. API design for machine learning software: experiences from the scikit-learn project. arXiv:1309.0238 [cs]. URL http://arxiv.org/abs/1309.0238.

Champion, K.; Lusch, B.; Kutz, J. N.; and Brunton, S. L. 2019a. Data-driven discovery of coordinates and governing equations. arXiv:1904.02107 [stat]. URL http://arxiv.org/abs/1904.02107.

Champion, K.; Zheng, P.; Aravkin, A. Y.; Brunton, S. L.; and Kutz, J. N. 2019b. A unified sparse optimization framework to learn parsimonious physics-informed models from data. arXiv:1906.10612 [physics]. URL http://arxiv.org/abs/1906.10612.

Chen, Z.; Liu, Y.; and Sun, H. 2020. Deep learning of physical laws from scarce data. arXiv:2005.03448 [physics, stat]. URL http://arxiv.org/abs/2005.03448.

Cranmer, M.; Greydanus, S.; Hoyer, S.; Battaglia, P.; Spergel, D.; and Ho, S. 2020. Lagrangian Neural Networks. arXiv:2003.04630 [physics, stat]. URL http://arxiv.org/abs/2003.04630.

de Silva, B. M.; Champion, K.; Quade, M.; Loiseau, J.-C.; Kutz, J. N.; and Brunton, S. L. 2020. PySINDy: A Python package for the Sparse Identification of Nonlinear Dynamics from Data. arXiv:2004.08424 [physics]. URL http://arxiv.org/abs/2004.08424.

Greydanus, S.; Dzamba, M.; and Yosinski, J. 2019. Hamiltonian Neural Networks. arXiv:1906.01563 [cs]. URL http://arxiv.org/abs/1906.01563.

Harris, C. R.; Millman, K. J.; van der Walt, S. J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N. J.; Kern, R.; Picus, M.; Hoyer, S.; van Kerkwijk, M. H.; Brett, M.; Haldane, A.; del Río, J. F.; Wiebe, M.; Peterson, P.; Gérard-Marchant, P.; Sheppard, K.; Reddy, T.; Weckesser, W.; Abbasi, H.; Gohlke, C.; and Oliphant, T. E. 2020. Array programming with NumPy. Nature 585(7825): 357–362. doi:10.1038/s41586-020-2649-2.

Hunter, J. D. 2007. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9(3): 90–95. doi:10.1109/MCSE.2007.55.

Iten, R.; Metger, T.; Wilming, H.; del Rio, L.; and Renner, R. 2020. Discovering physical concepts with neural networks. Physical Review Letters 124(1): 010508. doi:10.1103/PhysRevLett.124.010508.

Löning, M.; Bagnall, A.; Ganesh, S.; and Kazakov, V. sktime: A Unified Interface for Machine Learning with Time Series.

Lu, P. Y.; Kim, S.; and Soljačić, M. 2019. Extracting Interpretable Physical Parameters from Spatiotemporal Systems using Unsupervised Learning. arXiv:1907.06011 [physics, stat]. URL http://arxiv.org/abs/1907.06011.

Maddu, S.; Cheeseman, B. L.; Sbalzarini, I. F.; and Müller, C. L. 2019. Stability selection enables robust learning of partial differential equations from limited noisy data. arXiv:1907.07810 [physics]. URL http://arxiv.org/abs/1907.07810.

Maslyaev, M.; Hvatov, A.; and Kalyuzhnaya, A. 2019. Data-driven PDE discovery with evolutionary approach. arXiv:1903.08011 [cs, math] 11540: 635–641. doi:10.1007/978-3-030-22750-0_61.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Köpf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 [cs, stat]. URL http://arxiv.org/abs/1912.01703.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; and Cournapeau, D. Scikit-learn: Machine Learning in Python.

Perez, F.; and Granger, B. E. 2007. IPython: A System for Interactive Scientific Computing. Computing in Science & Engineering 9(3): 21–29. doi:10.1109/MCSE.2007.53.

Rackauckas, C.; Ma, Y.; Martensen, J.; Warner, C.; Zubov, K.; Supekar, R.; Skinner, D.; and Ramadhan, A. 2020. Universal Differential Equations for Scientific Machine Learning. arXiv:2001.04385 [cs, math, q-bio, stat]. URL http://arxiv.org/abs/2001.04385.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2017a. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. arXiv:1711.10561 [cs, math, stat]. URL http://arxiv.org/abs/1711.10561.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2017b. Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations. arXiv:1711.10566 [cs, math, stat]. URL http://arxiv.org/abs/1711.10566.

Rudy, S. H.; Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2017. Data-driven discovery of partial differential equations. Science Advances 3(4): e1602614. doi:10.1126/sciadv.1602614.

Rudy, S. H.; Kutz, J. N.; and Brunton, S. L. 2019. Deep learning of dynamics and signal-noise decomposition with time-stepping constraints. Journal of Computational Physics 396: 483–506. doi:10.1016/j.jcp.2019.06.056.

Sanchez-Gonzalez, A.; Heess, N.; Springenberg, J. T.; Merel, J.; Riedmiller, M.; Hadsell, R.; and Battaglia, P. 2018. Graph networks as learnable physics engines for inference and control. arXiv:1806.01242 [cs, stat]. URL http://arxiv.org/abs/1806.01242.

Schaeffer, H. 2017. Learning partial differential equations via data discovery and sparse optimization. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473(2197): 20160446. doi:10.1098/rspa.2016.0446.

Seo, S.; and Liu, Y. 2019. Differentiable Physics-informed Graph Networks. arXiv:1902.02950 [cs, stat]. URL http://arxiv.org/abs/1902.02950.

Sitzmann, V.; Martel, J. N. P.; Bergman, A. W.; Lindell, D. B.; and Wetzstein, G. 2020. Implicit Neural Representations with Periodic Activation Functions. arXiv:2006.09661 [cs, eess]. URL http://arxiv.org/abs/2006.09661.

Virtanen, P.; Gommers, R.; Oliphant, T. E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; van der Walt, S. J.; Brett, M.; Wilson, J.; Millman, K. J.; Mayorov, N.; Nelson, A. R. J.; Jones, E.; Kern, R.; Larson, E.; Carey, C. J.; Polat, İ.; Feng, Y.; Moore, E. W.; VanderPlas, J.; Laxalde, D.; Perktold, J.; Cimrman, R.; Henriksen, I.; Quintero, E. A.; Harris, C. R.; Archibald, A. M.; Ribeiro, A. H.; Pedregosa, F.; van Mulbregt, P.; and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17(3): 261–272. doi:10.1038/s41592-019-0686-2.

Yuan, Y.; Li, J.; Li, L.; Jiang, F.; Tang, X.; Zhang, F.; Liu, S.; Goncalves, J.; Voss, H. U.; Li, X.; Kurths, J.; and Ding, H. 2019. Machine Discovery of Partial Differential Equations from Spatiotemporal Data. arXiv:1909.06730 [physics, stat]. URL http://arxiv.org/abs/1909.06730.