Generalized Physics-Informed Learning Through Language-Wide Differentiable Programming

Chris Rackauckas,1,2 Alan Edelman,1,3 Keno Fischer,3 Mike Innes,3 Elliot Saba,3 Viral B. Shah,3 Will Tebbutt4
1 Massachusetts Institute of Technology, 2 University of Maryland, Baltimore, 3 Julia Computing, 4 University of Cambridge
182 Memorial Dr, Cambridge, MA 02142
crackauc@mit.edu

Abstract

Scientific computing is increasingly incorporating the advancements in machine learning to allow for data-driven physics-informed modeling approaches. However, re-targeting existing scientific computing workloads to machine learning frameworks is both costly and limiting, as scientific simulations tend to use the full feature set of a general purpose programming language. In this manuscript we develop an infrastructure for incorporating deep learning into existing scientific computing code through Differentiable Programming (∂P). We describe a ∂P system that is able to take gradients of full Julia programs, making automatic differentiation a first class language feature and making compatibility with deep learning pervasive. Our system utilizes the one-language nature of Julia package development to augment the existing package ecosystem with deep learning, supporting almost all language constructs (control flow, recursion, mutation, etc.) while generating high-performance code without requiring any user intervention or refactoring to stage computations. We showcase several examples of physics-informed learning which directly utilize this extension to existing simulation code: neural surrogate models, machine learning on simulated quantum hardware, and data-driven stochastic dynamical model discovery with neural stochastic differential equations.

Introduction

A casual practitioner might think that scientific computing and machine learning are different scientific disciplines. Modern machine learning has made its mark through breakthroughs in neural networks. Their applicability towards solving a large class of difficult problems in computer science has led to the design of new hardware and software to process extreme amounts of labelled training data, while simultaneously deploying trained models in devices. Scientific computing, in contrast, a discipline that is perhaps as old as computing itself, tends to use a broader set of modelling techniques arising out of the underlying physical phenomena. Compared to the typical machine learning researcher, many computational scientists work with smaller volumes of data but with more computational complexity and range. However, recent results like Physics-Informed Neural Networks (PINNs) suggest that data-efficient machine learning for scientific applications can be found at the intersection of the two methodologies (Raissi, Perdikaris, and Karniadakis 2019). While a major advance, this technique requires re-implementing partial differential equation simulation techniques, such as Runge-Kutta methods, in the context of machine learning frameworks like TensorFlow. Likewise, it would be impractical to require every scientific simulation suite to re-target its extensive stiff differential equation solver and numerical linear algebra stacks to specific machine learning libraries for specific tasks. In order to truly scale physics-informed learning to big science applications, we see a need for efficiently incorporating neural networks into existing scientific simulation suites.

Differentiable Programming (∂P) has the potential to be the lingua franca that can further unite the worlds of scientific computing and machine learning. Here we present a ∂P system which allows for embedding deep neural networks into arbitrary existing scientific simulations, enabling these packages to automatically build surrogates for ML-acceleration and to learn missing functions from data. Previous work has shown that differentiable programming systems in domain specific languages for image processing can allow for machine learning integration into the domain applications in a programmer-friendly manner (Li et al. 2018; Li 2019). Supporting multiple languages within a single ∂P system causes an explosion in complexity, vastly increasing the developer effort required, and in practice excludes packages whose performance-critical components are written in auxiliary languages such as Cython or Rcpp. For this reason we extend the full Julia programming language (Bezanson et al. 2017) with differentiable programming capabilities in a way that allows existing packages to incorporate deep learning. By choosing the Julia language, we arrive at an abundance of pure-Julia packages for both machine learning and scientific computing, with both speed and automatic compatibility, allowing us to test our ideas on fairly large real-world applications.

Our system can be directly used on existing Julia packages, handling user-defined types, general control flow constructs, and plentiful scalar operations through source-to-source mixed-mode automatic differentiation, composing the reverse-mode capabilities of Zygote.jl and Tracker.jl with ForwardDiff.jl for a "best of both worlds" approach. In this paper we briefly describe how we achieve our goals for a ∂P system and showcase its ability to solve problems which mix machine learning and pre-existing scientific simulation packages.

Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
A simple sin example: Differentiate Programs not Formulas

We start out with a very simple example: differentiating sin(x) written as a program through its Taylor series:

    sin x = x − x³/3! + x⁵/5! − ⋯

Note that the number of terms will not be fixed, but will depend on x through a numerical convergence criterion. To run the example, install Julia v1.1 or higher, and install the Zygote.jl and ForwardDiff.jl packages with:

    using Pkg
    Pkg.add("Zygote")
    Pkg.add("ForwardDiff")
    using Zygote, ForwardDiff

    function s(x)
        t = 0.0
        sign = -1.0
        for i in 1:19
            if isodd(i)
                newterm = x^i / factorial(i)
                abs(newterm) < 1e-8 && return t
                sign = -sign
                t += sign * newterm
            end
        end
        return t
    end

While the Taylor series for sine could have been written more compactly in Julia, for purposes of illustrating more complex programs we purposefully used a loop, a conditional, and function calls to isodd and factorial, which are native Julia implementations. AD just works, and that is the powerful part of the Julia approach. Let's compute the gradient at x = 1.0 and check whether it matches cos(1.0):

    julia> ForwardDiff.derivative(s, 1.0)
    0.540302303791887

    julia> Zygote.gradient(s, 1.0)
    (0.5403023037918872,)

    julia> cos(1.0)
    0.5403023058681398
Implementation

Recent progress in tooling for automatic differentiation (AD) has been driven primarily by the machine learning community. Many state of the art reverse-mode AD tools such as Tracker.jl (Innes et al. 2018; Gandhi et al. 2019), PyTorch (PyTorch Team 2018), JAX (Johnson et al. 2018), and TensorFlow (Abadi et al. 2016) (in the recent Eager version) employ tracing methods to extract simplified program representations that are more easily amenable to AD transforms. These traces evaluate derivatives only at specific points in the program space. Unfortunately, this generally unrolls control flow (i.e. builds a tape that requires O(n) memory instead of keeping the loop construct intact) and requires compilation and optimization for every new input.

This choice has been driven largely by the fact that, as the JAX authors put it, "ML workloads often consist of large, accelerable, pure-and-statically-composed (PSC) operations" (Johnson et al. 2018). Indeed, for many ML models the per-executed-operation overhead (in both time and memory) incurred by tracing-based AD systems is immaterial, because the execution time and memory requirements of the operations dwarf any AD overhead.

However, this assumption does not hold for many scientific inverse problems, or even for the cutting edge of ML research. Instead, these problems require a ∂P system capable of:

1. Low overhead, independent of the size of the executed operation
2. Efficient support for control flow
3. Complete, efficient support for user defined data types
4. Customizability
5. Composability with existing code unaware of ∂P
6. Dynamism.

In particular, scientific programs tend to have adaptive algorithms whose control flow depends on error estimates and thus on the current state of the simulation; numerous scalar operations, whether defining large nonlinear models term by term or implementing specialized numerical linear algebra routines; and pervasive use of user-defined data structures to describe model components, which require efficient memory handling (stack allocation) in order for the problem to be computationally feasible.

To handle these kinds of problems, Zygote does not utilize the standard tracing methodology and instead generates a derivative function directly from the original source which is able to handle all input values. This is called a source-to-source transformation, a methodology with a long history (Baydin et al. 2018) going back at least to the ADIFOR source-to-source AD program for FORTRAN 77 (Bischof et al. 1996). Using this source-to-source formulation, Zygote can compile, heavily optimize, and re-use a single gradient definition for all input values. Significantly, this transformation keeps control flow intact: loops are not unrolled, and all possible branches are handled in a memory-efficient form. However, where prior source-to-source AD work has often focused on static languages, Zygote expands upon this idea by supporting a full, dynamic, high level language, Julia, in a way that allows its existing scientific and machine learning package ecosystem to benefit from this tool.
Generality, Flexibility, and Composability

One of the primary design decisions of a ∂P system is how these capabilities should be exposed to the user. One convenient way to do so is via a differential operator J that operates on first class functions and once again returns a first class function (by returning a function we automatically obtain higher order derivatives, through repeated application of J). There are several valid choices for this differential operator, but a convenient choice is

    J(f) := x → (f(x), z → J_f(x) z),

i.e. J(f)(x) returns the value of f at x, as well as a function which evaluates the jacobian-vector product between the jacobian J_f(x) and some vector of sensitivities z. From this primitive we can define the gradient of a scalar function g : Rⁿ → R as

    ∇g(x) := [J(g)(x)]₂(1),

where [·]₂ selects the second element of the tuple and 1 = ∂z/∂z is the initial sensitivity.

This choice of differential operator is convenient for several reasons: (1) the computation of the forward pass often computes values that can be re-used for the computation of the backwards pass, and by combining the two operations it is easy to re-use this work; (2) it can be used to recursively implement the chain rule (see figure 1).

    function J(f . g)(x)
        a, da = J(f)(x)
        b, db = J(g)(a)
        b, z -> da(db(z))
    end

Figure 1: The differential operator J is able to implement the chain rule through a local, syntactic, recursive transformation.

This second property also suggests the implementation strategy: hard code the operation of J on a set of primitive f's and let the AD system generate the rest by repeated application of the chain rule transform. This same general approach has been implemented in many systems (Pearlmutter and Siskind 2008; Wang et al. 2018) and a detailed description of how to perform this on Julia's SSA-form IR is available in earlier work (Innes 2018).

However, to achieve our extensibility and composability goals, we implement a slight twist on this scheme. We define a fully user extensible function ∂ that provides a default fallback as follows:

    ∂(f)(args...) = J(f)(args...),

where the implementation that is generated automatically by J recurses to ∂ rather than J, and can thus easily be intercepted using Julia's standard multiple dispatch system at any level of the stack. For example, we might make the following definitions:

    ∂(f::typeof(+))(a::IntOrFloat, b::IntOrFloat) = a + b, z → (z, z)
    ∂(f::typeof(*))(a::IntOrFloat, b::IntOrFloat) = a * b, z → (z * b, a * z)

i.e. declaring how to compute the value and partial derivatives of + and * for two integer- or float-valued numbers, while simultaneously leaving the same unconstrained for other functions and other types of values (which will thus fall back to applying the AD transform). With these two definitions, any program that is ultimately just a composition of + and * operations on real numbers will work. We show a simple example in figure 2. Here we used the user-defined Measurement type from the Measurements.jl package (Giordano 2016). We did not have to define how to differentiate the ^ function or how to differentiate + and * on a Measurement, nor did the Measurements.jl package have to be aware of the AD system in order to be differentiated. Thus standard user definitions of types are compatible with the differentiation system.

    julia> f(x) = x^2 + 3x + 1

    julia> gradient(f, 1/3)
    (3.6666666666666665,)

    julia> using Measurements;

    julia> gradient(f, 1/3 ± 0.01)
    (3.6666666666666665 ± 0.02,)

Figure 2: With two minimal definitions, Zygote is able to obtain derivatives of any function that only requires those definitions, even through custom data types (such as Measurement) and many layers of abstraction.

This extra, user-extensible layer of indirection has a number of important consequences:

• The AD system does not depend on, nor require any knowledge of, primitives on new types. By default we provide implementations of the differential operator for many common scalar mathematical and linear algebra operations, written against a scalar LLVM backend and BLAS-like linear algebra operations. This means that even when Julia builds an array type to target TPUs (Fischer and Saba 2018), its XLA IR primitives are able to be used and differentiated without fundamental modifications to our system.

• Custom gradients become trivial. Since all operations indirect through ∂, there is no difference between user-defined custom gradients and those provided by the system. They are written using the same mechanism, are co-optimized by the compiler, and can be finely targeted using Julia's multiple dispatch mechanism.

Since Julia solves the two language problem, its Base, standard library, and package ecosystem are almost entirely pure Julia. Thus, since our ∂P system does not require primitives to handle new types, almost all functions and types defined throughout the language are automatically supported by Zygote, and users can easily accelerate specific functions as they deem necessary.
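To make this recursion concrete, the following is a minimal, self-contained sketch of a J-style operator written by hand in plain Julia. This is not Zygote's implementation (Zygote derives the composite rule below automatically from Julia's SSA-form IR at compile time), and the names Jop and grad are ours; the sketch only illustrates how primitive rules plus the chain rule yield the gradient of figure 2's f(x) = x² + 3x + 1.

    # Primitive rules: Jop(op, a, b) returns the value of op(a, b) together with
    # a pullback mapping an output sensitivity z to input sensitivities.
    # These two rules play the role of the ∂ definitions for + and * above.
    Jop(::typeof(+), a, b) = a + b, z -> (z, z)
    Jop(::typeof(*), a, b) = a * b, z -> (z * b, a * z)

    # Figure 2's function, written with * instead of ^ so the two rules suffice.
    f(x) = x * x + 3 * x + 1

    # The composite rule obtained by applying the chain rule to each call in f,
    # threading the pullbacks in reverse order. Zygote generates code of this
    # shape automatically; here it is written out by hand for illustration.
    function Jop(::typeof(f), x)
        a, da = Jop(*, x, x)          # a = x * x
        b, db = Jop(*, 3, x)          # b = 3 * x
        c, dc = Jop(+, a, b)          # c = a + b
        y, dy = Jop(+, c, 1)          # y = c + 1
        function back(z)
            zc, _  = dy(z)            # sensitivity of c (the constant 1 is dropped)
            za, zb = dc(zc)           # sensitivities of a and b
            _, zx1 = db(zb)           # through b = 3 * x
            zx2, zx3 = da(za)         # through a = x * x (x fills both slots)
            zx1 + zx2 + zx3           # accumulate the total gradient w.r.t. x
        end
        return y, back
    end

    # ∇f(x) = [Jop(f, x)]₂(1): seed the pullback with the initial sensitivity 1.
    grad(f, x) = Jop(f, x)[2](1)

    grad(f, 1/3)   # 3.6666666666666665

Running grad(f, 1/3) reproduces the value shown in figure 2; Zygote's contribution is that the composite rule is derived for every Julia function automatically rather than written by hand.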
∂P in Practice

Surrogates for Realtime ML-Acceleration of Inverse Problems

Model-based reinforcement learning has advantages over model-agnostic methods, given that an effective agent must approximate the dynamics of its environment (Atkeson and Santamaria 1997). However, model-based approaches have been hindered by the inability to incorporate realistic environmental models into deep learning models. Previous work has had success re-implementing physics engines using machine learning frameworks (Degrave et al. 2019; de Avila Belbute-Peres et al. 2018), but this effort has a large engineering cost, has limitations compared to existing engines, and has limited applicability to other domains such as biology or meteorology. For example, one notable study showed that solving the chaotic three-body problem can be accelerated 100 million times by using a neural surrogate, but it required writing a simulator that was compatible with the chosen neural network framework (Breen et al. 2019).

Figure 3: Using a neural network surrogate to solve inverse problems.

Zygote can be used for control problems, incorporating the model into backpropagation with one call to gradient. We pick trebuchet dynamics as a motivating example. Instead of aiming at a single target, we optimize a neural surrogate that can aim the trebuchet given any target. The neural network takes two inputs, the target distance in metres and the current wind speed. The network outputs trebuchet settings (the mass of the counterweight and the angle of release) that are fed into a simulator which solves an ODE and calculates the achieved distance. We compare to our target and backpropagate through the entire chain to adjust the weights of the network. Our dataset is a randomly chosen set of targets and wind speeds. An agent that aims a trebuchet to a given target can thus be trained in a few minutes on a laptop CPU, resulting in a network which is a surrogate to the inverse problem and which allows for aiming in constant time. Given that the forward pass of this network is about 100× faster than performing the full optimization on the trebuchet system (Figure 3), this surrogate technique gives the ability to decouple real-time model-based control from the simulation cost through a pre-trained neural network surrogate to the inverse. We present the code for this and other common reinforcement learning examples such as the cartpole and inverted pendulum in (Innes, Joy, and Karmali 2019).
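The following is a minimal sketch of that training loop in Flux.jl, using the implicit-parameter API of the Flux and Zygote versions contemporary with this paper. The shoot function here is a placeholder stand-in for the ODE-based trebuchet simulator used in the full example, and the network size, dataset, and hyperparameters are illustrative rather than those used for Figure 3.

    using Flux

    # Placeholder for the differentiable trebuchet simulator: given the wind speed
    # and the two control settings, return the distance achieved. In the real
    # example this solves an ODE; here it is a smooth stand-in so the sketch is
    # self-contained and runnable.
    shoot(wind, angle, weight) = weight * sin(angle) + 2.0 * wind   # illustrative only

    # Neural surrogate for the inverse problem: (target, wind) -> (angle, weight).
    model = Chain(Dense(2, 16, tanh), Dense(16, 2))

    function loss(target, wind)
        settings = model([target, wind])
        angle, weight = settings[1], settings[2]
        hit = shoot(wind, angle, weight)      # differentiate through the simulator
        (hit - target)^2
    end

    data = [(25 + 75rand(), 5randn()) for _ in 1:1000]   # random targets and winds
    opt  = ADAM()
    ps   = Flux.params(model)
    for (target, wind) in data
        gs = gradient(() -> loss(target, wind), ps)
        Flux.Optimise.update!(opt, ps, gs)
    end

After training, model([target, wind]) returns trebuchet settings for a new target in a single forward pass, which is the constant-time aiming described above.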
Quantum Machine Learning

A promising potential application and research direction for Noisy Intermediate-Scale Quantum (NISQ) technology (Preskill 2018) is variational quantum circuits (Benedetti, Lloyd, and Sack 2019), in which a quantum circuit is parameterized by classical parameters in its quantum gates, which may place fewer requirements on the hardware. In many cases these classical parameters are optimized with classical gradient-based algorithms. Designing quantum circuits is hard, however, and it is potentially an interesting direction to explore using gradient-based methods, with AD support, to search for circuit architectures for a given task. One such state of the art simulator is the Yao.jl quantum simulator project (Luo et al. 2019). Yao.jl is implemented in Julia and is thus composable with our AD technology. There are a number of interesting applications of this combination (Mitarai et al. 2018).

A subtle application is to perform traditional AD of the quantum simulator itself. As a simple example of this capability, we consider a Variational Quantum Eigensolver (VQE) (Kandala et al. 2017). A variational quantum eigensolver is used to compute the lowest eigenvalue of some matrix H (generally the Hamiltonian of some quantum system for which the eigenvalue problem is hard to solve classically, but which is easily encoded into quantum hardware). This is done by using a variational quantum circuit Φ(θ) to prepare a quantum state |Ψ⟩ = Φ(θ)|0⟩, measuring the expectation value ⟨Ψ|H|Ψ⟩, and then using a classical optimizer to adjust θ to minimize the measured value. In our example we use a 4-site toy Hamiltonian corresponding to an anti-ferromagnetic Heisenberg chain:

    H = (1/4) Σ_{⟨i,j⟩} (σᵢˣ σⱼˣ + σᵢʸ σⱼʸ + σᵢᶻ σⱼᶻ)

We use a standard differentiable variational quantum circuit composed of layers (2 in our example) of parameterized rotations and CNOT entanglers, with randomly initialized rotation angles. Figure 4 shows the result of the minimization process as performed using gradients provided by Zygote.

Figure 4: Optimization progress over steps of the classical optimizer towards the ground state.
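The structure of that optimization loop is sketched below on a deliberately tiny stand-in problem: a single-qubit circuit Ry(θ) applied to |0⟩ with H = σz, rather than the 4-site Heisenberg chain simulated with Yao.jl (whose API we do not reproduce here). Only the shape of the computation carries over: simulate the state, measure the expectation value, differentiate it with Zygote, and update the circuit parameter with a classical optimizer.

    using Zygote, LinearAlgebra

    H = [1.0 0.0; 0.0 -1.0]                  # Pauli-Z "Hamiltonian" for the toy problem
    ψ(θ) = [cos(θ / 2), sin(θ / 2)]          # Ry(θ)|0⟩ (amplitudes are real here)
    energy(θ) = dot(ψ(θ), H * ψ(θ))          # expectation value ⟨Ψ|H|Ψ⟩

    θ = 0.3                                  # initial rotation angle
    for step in 1:200
        g, = Zygote.gradient(energy, θ)      # classical gradient through the simulator
        global θ -= 0.1 * g                  # plain gradient descent on the angle
    end
    energy(θ)                                # ≈ -1, the ground-state energy of σz

In the real example the circuit has many angles, the Hamiltonian acts on four sites, and the state preparation and measurement are done by Yao.jl, but the classical optimizer sees exactly this interface: a scalar energy and its Zygote-provided gradient.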
Data-Driven Stochastic Dynamical Model Discovery with Neural Stochastic Differential Equations

Neural latent differential equations (Chen et al. 2018; Álvarez, Luengo, and Lawrence 2009; Hu et al. 2013; Rackauckas et al. 2019) incorporate a neural network into the ODE derivative function. Recent results have shown that many deep learning architectures can be compacted and generalized through neural ODE descriptions (Chen et al. 2018; He et al. 2016; Dupont, Doucet, and Whye Teh 2019; Grathwohl et al. 2018). Latent differential equations have also seen use in time series extrapolation (Gao et al. 2008) and model reduction (Ugalde et al. 2013; Hartman and Mestha 2017; Bar-Sinai et al. 2018; Ordaz-Hernandez, Fischer, and Bennis 2008).

Here we demonstrate automatic construction of Langevin equations from data using neural stochastic differential equations (SDEs). Typically, Langevin equation models can be written in the form

    dXₜ = f(Xₜ) dt + g(Xₜ) dWₜ,

where f : Rⁿ → Rⁿ and g : Rⁿ → Rⁿˣᵐ, with Wₜ the m-dimensional Wiener process. To automatically learn such a physical model, we can replace the drift function f and the diffusion function g with neural networks and train them against time series data by solving the differential equation 10 times and using the average gradient of the cost function with respect to the parameters of the neural networks. Our simulator utilizes the high strong-order adaptive integration provided by DifferentialEquations.jl (Rackauckas and Nie 2017a; 2017b). Figure 5 depicts a two-dimensional neural SDE trained using the l2 normed distance between the solution and the data points. Included is a forecast of the neural SDE solution beyond the data to a final time of 1.2, showcasing a potential use case for time series extrapolation from a learned dynamical model.

Figure 5: Neural SDE training (panels: "Neural SDE: Before Training" and "Neural SDE: After Training"; horizontal axis: time). For the SDE solution X(t), the blue line shows X₁(t) while the orange line shows X₂(t). The green points show the fitting data for X₁ while the purple points show the fitting data for X₂. The ribbons show the 95 percentile bounds of the stochastic solutions.

The analytical formula for the adjoint of the strong solution of an SDE is difficult to calculate efficiently due to the lack of classical differentiability of the solution. However, Zygote still manages to calculate a useful derivative for optimization with respect to single solutions by treating the Brownian process as fixed and applying forward-mode automatic differentiation, showcasing Zygote's ability to efficiently optimize its AD through mixed-mode approaches (Rackauckas et al. 2018). Common numerical techniques require computing the gradient with respect to a difference over thousands of trajectories to receive an average cost, while our numerical experiments suggest that with Zygote it is sufficient to perform gradient descent on a neural SDE using only a single trajectory or a few trajectories, reducing the overall computational cost by that factor of thousands. This methodological advance, combined with GPU-accelerated high order adaptive SDE integrators in DifferentialEquations.jl, makes a whole new field of study computationally accessible.
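A minimal, self-contained sketch of this setup is given below. It substitutes hand-rolled two-layer networks for the Flux models, synthetic observations for the dataset of Figure 5, and a fixed-step Euler-Maruyama discretization with a pre-sampled Brownian path for the adaptive high strong-order solvers of DifferentialEquations.jl; all names and sizes are illustrative. Holding the Brownian path fixed is what makes the per-trajectory forward-mode gradient described above well defined.

    using ForwardDiff, Random

    # 2 -> 8 -> 2 tanh network whose weights live in p[o+1:o+42]; the drift f and
    # the (diagonal) diffusion g are two such networks packed into one vector.
    function net(x, p, o)
        W1 = reshape(p[o+1:o+16], 8, 2);  b1 = p[o+17:o+24]
        W2 = reshape(p[o+25:o+40], 2, 8); b2 = p[o+41:o+42]
        W2 * tanh.(W1 * x .+ b1) .+ b2
    end
    drift(u, p)     = net(u, p, 0)     # learned f(X_t)
    diffusion(u, p) = net(u, p, 42)    # learned diagonal g(X_t)

    # Synthetic observations of X1 and X2, stand-ins for the data in Figure 5.
    ts    = 0.0:0.05:1.0
    xdata = [sin.(2 .* ts)'; cos.(2 .* ts)']

    # One fixed Brownian path with increments matching the step size dt = 0.05.
    Random.seed!(1)
    dW = sqrt(0.05) .* randn(2, length(ts) - 1)

    # Solve the SDE by Euler-Maruyama on the fixed path and accumulate the
    # squared distance to the data.
    function loss(p)
        u = eltype(p).([0.0, 1.0])            # promote the state for dual numbers
        l = zero(eltype(p))
        for k in 1:length(ts)-1
            l += sum(abs2, u .- xdata[:, k])
            u  = u .+ 0.05 .* drift(u, p) .+ diffusion(u, p) .* dW[:, k]
        end
        l + sum(abs2, u .- xdata[:, end])
    end

    p = 0.1 .* randn(84)
    g = ForwardDiff.gradient(loss, p)   # forward-mode gradient for one trajectory
    # p can now be updated by any classical optimizer, resampling one or a few
    # Brownian paths per step.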
Example Code

A more extensive code listing for these examples is available at the following URL: https://github.com/MikeInnes/zygote-paper.

Acknowledgments

This work would not have been possible without the work of many people. We are particularly indebted to our users that contributed the examples in this paper. Thanks to Roger Luo and JinGuo Liu for the quantum experiments. Thanks also to Zenna Tavares, Jesse Bettencourt and Lyndon White for being early adopters of Zygote and its underlying technology, and shaping its development. Much credit is also due to the core Julia language and compiler team for supporting Zygote's development, including Jameson Nash, Jeff Bezanson and others. Thanks also to James Bradbury for helpful discussions on this paper.

References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.

Atkeson, C. G., and Santamaria, J. C. 1997. A comparison of direct and model-based reinforcement learning. In Proceedings of the International Conference on Robotics and Automation, volume 4, 3557–3564. IEEE.

Bar-Sinai, Y.; Hoyer, S.; Hickey, J.; and Brenner, M. P. 2018. Data-driven discretization: machine learning for coarse graining of partial differential equations. arXiv e-prints arXiv:1808.04930.

Baydin, A. G.; Pearlmutter, B. A.; Radul, A. A.; and Siskind, J. M. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research 18:1–43.

Benedetti, M.; Lloyd, E.; and Sack, S. 2019. Parameterized quantum circuits as machine learning models. arXiv preprint arXiv:1906.07682.

Bezanson, J.; Edelman, A.; Karpinski, S.; and Shah, V. B. 2017. Julia: A fresh approach to numerical computing. SIAM Review 59(1):65–98.

Bischof, C.; Khademi, P.; Mauer, A.; and Carle, A. 1996. ADIFOR 2.0: Automatic differentiation of Fortran 77 programs. IEEE Computational Science and Engineering 3(3):18–32.

Breen, P. G.; Foley, C. N.; Boekholt, T.; and Portegies Zwart, S. 2019. Newton vs the machine: solving the chaotic three-body problem using deep neural networks. arXiv e-prints arXiv:1910.07291.

Chen, T. Q.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. K. 2018. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 6571–6583.

de Avila Belbute-Peres, F.; Smith, K.; Allen, K.; Tenenbaum, J.; and Kolter, J. Z. 2018. End-to-end differentiable physics for learning and control. In Advances in Neural Information Processing Systems, 7178–7189.

Degrave, J.; Hermans, M.; Dambre, J.; and wyffels, F. 2019. A differentiable physics engine for deep learning in robotics. Frontiers in Neurorobotics 13.

Dupont, E.; Doucet, A.; and Whye Teh, Y. 2019. Augmented Neural ODEs. arXiv e-prints arXiv:1904.01681.

Fischer, K., and Saba, E. 2018. Automatic full compilation of Julia programs and ML models to cloud TPUs. CoRR abs/1810.09868.

Gandhi, D.; Innes, M.; Saba, E.; Fischer, K.; and Shah, V. 2019. Julia E Flux: Modernizando o Aprendizado de Máquina. 5.

Gao, P.; Honkela, A.; Rattray, M.; and Lawrence, N. D. 2008. Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities. Bioinformatics 24(16):i70–i75.

Giordano, M. 2016. Uncertainty propagation with functionally correlated quantities. arXiv e-prints.

Grathwohl, W.; Chen, R. T. Q.; Bettencourt, J.; Sutskever, I.; and Duvenaud, D. K. 2018. FFJORD: free-form continuous dynamics for scalable reversible generative models. CoRR abs/1810.01367.

Hartman, D., and Mestha, L. K. 2017. A deep learning framework for model reduction of dynamical systems. In 2017 IEEE Conference on Control Technology and Applications (CCTA), 1917–1922.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.

Hu, Y.; Boker, S.; Neale, M.; and Klump, K. 2013. Coupled latent differential equation with moderators: Simulation and application. Psychological Methods 19.
Innes, M.; Saba, E.; Fischer, K.; Gandhi, D.; Rudilosso, M. C.; Joy, N. M.; Karmali, T.; Pal, A.; and Shah, V. 2018. Fashionable modelling with Flux. CoRR abs/1811.01457.

Innes, M.; Joy, N. M.; and Karmali, T. 2019. Reinforcement learning vs. differentiable programming. https://fluxml.ai/2019/03/05/dp-vs-rl.html.

Innes, M. 2018. Don't unroll adjoint: Differentiating SSA-form programs. CoRR abs/1810.07951.

Johnson, M.; Frostig, R.; Maclaurin, D.; and Leary, C. 2018. JAX: Autograd and XLA. https://github.com/google/jax.

Kandala, A.; Mezzacapo, A.; Temme, K.; Takita, M.; Brink, M.; Chow, J. M.; and Gambetta, J. M. 2017. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549(7671):242.

Li, T.-M.; Gharbi, M.; Adams, A.; Durand, F.; and Ragan-Kelley, J. 2018. Differentiable programming for image processing and deep learning in Halide. ACM Trans. Graph. (Proc. SIGGRAPH) 37(4):139:1–139:13.

Li, T.-M. 2019. Differentiable Visual Computing. arXiv e-prints arXiv:1904.12228.

Mitarai, K.; Negoro, M.; Kitagawa, M.; and Fujii, K. 2018. Quantum circuit learning. Physical Review A 98(3):032309.

Ordaz-Hernandez, K.; Fischer, X.; and Bennis, F. 2008. Model reduction technique for mechanical behaviour modelling: Efficiency criteria and validity domain assessment. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 222(3):493–505.

Pearlmutter, B. A., and Siskind, J. M. 2008. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS) 30(2):7.

Preskill, J. 2018. Quantum computing in the NISQ era and beyond. Quantum 2:79.

PyTorch Team. 2018. The road to 1.0: production ready PyTorch. https://pytorch.org/blog/a-year-in/. Accessed: 2018-09-22.

Rackauckas, C., and Nie, Q. 2017a. DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia. Journal of Open Research Software 5(1).

Rackauckas, C., and Nie, Q. 2017b. Adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. Discrete and Continuous Dynamical Systems, Series B 22(7):2731–2761.

Rackauckas, C.; Ma, Y.; Dixit, V.; Guo, X.; Innes, M.; Revels, J.; Nyberg, J.; and Ivaturi, V. 2018. A comparison of automatic differentiation and continuous sensitivity analysis for derivatives of differential equation solutions. arXiv e-prints arXiv:1812.01892.

Rackauckas, C.; Innes, M.; Ma, Y.; Bettencourt, J.; White, L.; and Dixit, V. 2019. DiffEqFlux.jl – A Julia library for neural differential equations. CoRR abs/1902.02376.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378:686–707.

Ugalde, H. M. R.; Carmona, J.-C.; Alvarado, V. M.; and Reyes-Reyes, J. 2013. Neural network design and model reduction approach for black box nonlinear system identification with reduced number of parameters. Neurocomputing 101:170–180.

Wang, F.; Wu, X.; Essertel, G.; Decker, J.; and Rompf, T. 2018. Demystifying differentiable programming: Shift/reset the penultimate backpropagator. arXiv preprint arXiv:1803.10228.

Luo, X.-Z.; Liu, J.-G.; Zhang, P.; and Wang, L. 2019. Yao.jl: Extensible, efficient quantum algorithm design for humans. In preparation.

Álvarez, M.; Luengo, D.; and Lawrence, N. D. 2009. Latent force models. In van Dyk, D., and Welling, M., eds., Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, 9–16. Clearwater Beach, Florida, USA: PMLR.