                              Generalized Physics-Informed Learning
                        Through Language-Wide Differentiable Programming
                       Chris Rackauckas,1,2 Alan Edelman,1,3 Keno Fischer,3 Mike Innes3
                                  Elliot Saba,3 Viral B. Shah,3 Will Tebbutt4
                     1 Massachusetts Institute of Technology, 2 University of Maryland, Baltimore
                               3 Julia Computing, 4 University of Cambridge
                                               182 Memorial Dr, Cambridge, MA 02142
                                                            crackauc@mit.edu


                            Abstract

  Scientific computing is increasingly incorporating the advancements in machine learning to allow for data-driven physics-informed modeling approaches. However, re-targeting existing scientific computing workloads to machine learning frameworks is both costly and limiting, as scientific simulations tend to use the full feature set of a general purpose programming language. In this manuscript we develop an infrastructure for incorporating deep learning into existing scientific computing code through Differentiable Programming (∂P). We describe a ∂P system that is able to take gradients of full Julia programs, making Automatic Differentiation a first class language feature and compatibility with deep learning pervasive. Our system utilizes the one-language nature of Julia package development to augment the existing package ecosystem with deep learning, supporting almost all language constructs (control flow, recursion, mutation, etc.) while generating high-performance code without requiring any user intervention or refactoring to stage computations. We showcase several examples of physics-informed learning which directly utilize this extension to existing simulation code: neural surrogate models, machine learning on simulated quantum hardware, and data-driven stochastic dynamical model discovery with neural stochastic differential equations.

  Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                          Introduction

   A casual practitioner might think that scientific computing and machine learning are different scientific disciplines. Modern machine learning has made its mark through breakthroughs in neural networks. Their applicability towards solving a large class of difficult problems in computer science has led to the design of new hardware and software to process extreme amounts of labelled training data, while simultaneously deploying trained models in devices. Scientific computing, in contrast, a discipline that is perhaps as old as computing itself, tends to use a broader set of modelling techniques arising out of the underlying physical phenomena. Compared to the typical machine learning researcher, many computational scientists work with smaller volumes of data but with more computational complexity and range. However, recent results like Physics-Informed Neural Networks (PINNs) suggest that data-efficient machine learning for scientific applications can be found at the intersection of the two fields' methods (Raissi, Perdikaris, and Karniadakis 2019). While a major advance, this technique requires re-implementing partial differential equation simulation techniques, such as Runge-Kutta methods, in the context of machine learning frameworks like TensorFlow. Likewise, it would be impractical to require every scientific simulation suite to re-target their extensive stiff differential equation solver and numerical linear algebra stacks to specific machine learning libraries for specific tasks. In order to truly scale physics-informed learning to big science applications, we see a need for efficiently incorporating neural networks into existing scientific simulation suites.
   Differentiable Programming (∂P) has the potential to be the lingua franca that can further unite the worlds of scientific computing and machine learning. Here we present a ∂P system which allows for embedding deep neural networks into arbitrary existing scientific simulations, enabling these packages to automatically build surrogates for ML-acceleration and learn missing functions from data. Previous work has shown that differentiable programming systems in domain specific languages for image processing can allow for machine learning integration into the domain applications in a programmer-friendly manner (Li et al. 2018; Li 2019). Supporting multiple languages within a single ∂P system causes an explosion in complexity, vastly increasing the developer effort required, and it excludes packages whose code is developed in a second language through tools such as Cython or Rcpp. For this reason we extend the full Julia programming language (Bezanson et al. 2017) with differentiable programming capabilities in a way that allows existing packages to incorporate deep learning. By choosing the Julia language, we arrive at an abundance of pure-Julia packages for both machine learning and scientific computing with both speed and automatic compatibility, allowing us to test our ideas on fairly large real-world applications.
   Our system can be directly used on existing Julia packages, handling user-defined types, general control flow constructs, and plentiful scalar operations through source-to-source mixed-mode automatic differentiation, composing the reverse-mode capabilities of Zygote.jl and Tracker.jl with ForwardDiff.jl for the "best of both worlds" approach. In this paper we briefly describe how we achieve our goals for a ∂P system and showcase its ability to solve problems which mix machine learning and pre-existing scientific simulation packages.

A simple sin example: Differentiate Programs not Formulas

We start out with a very simple example: differentiating sin(x) written as a program through its Taylor series:

               sin(x) = x − x³/3! + x⁵/5! − ⋯

   Note that the number of terms will not be fixed, but will depend on x through a numerical convergence criterion.
   To run, install Julia v1.1 or higher, and install the Zygote.jl and ForwardDiff.jl packages with:

      using Pkg
      Pkg.add("Zygote")
      Pkg.add("ForwardDiff")
      using Zygote, ForwardDiff

      function s(x)
          t = 0.0
          sign = -1.0
          for i in 1:19
              if isodd(i)
                  newterm = x^i / factorial(i)
                  abs(newterm) < 1e-8 && return t
                  sign = -sign
                  t += sign * newterm
              end
          end
          return t
      end

   While the Taylor series for sine could have been written more compactly in Julia, for purposes of illustrating more complex programs, we purposefully used a loop, a conditional, and function calls to isodd and factorial, which are native Julia implementations. AD just works, and that is the powerful part of the Julia approach. Let's compute the gradient at x = 1.0 and check whether it matches cos(1.0):

      julia> ForwardDiff.derivative(s, 1.0)
      0.540302303791887

      julia> Zygote.gradient(s, 1.0)
      (0.5403023037918872,)

      julia> cos(1.0)
      0.5403023058681398

                      Implementation

Recent progress in tooling for automatic differentiation (AD) has been driven primarily by the machine learning community. Many state of the art reverse-mode AD tools such as Tracker.jl (Innes et al. 2018; Gandhi et al. 2019), PyTorch (PyTorch Team 2018), JAX (Johnson et al. 2018), and TensorFlow (Abadi et al. 2016) (in the recent Eager version) employ tracing methods to extract simplified program representations that are more easily amenable to AD transforms. These traces evaluate derivatives only at specific points in the program space. Unfortunately, this generally unrolls control flow (i.e. building a tape that requires O(n) memory instead of keeping the loop construct intact) and requires compilation and optimization for every new input.
   This choice has been driven largely by the fact that, as the JAX authors put it, “ML workloads often consist of large, accelerable, pure-and-statically-composed (PSC) operations” (Johnson et al. 2018). Indeed, for many ML models the per-executed-operation overhead (in both time and memory) incurred by tracing-based AD systems is immaterial, because the execution time and memory requirements of the operations dwarf any AD overhead.
   However, this assumption does not hold for many scientific inverse problems, or even the cutting edge of ML research. Instead, these problems require a ∂P system capable of:
1. Low overhead, independent of the size of the executed operation
2. Efficient support for control flow
3. Complete, efficient support for user defined data types
4. Customizability
5. Composability with existing code unaware of ∂P
6. Dynamism.
   In particular, scientific programs tend to have adaptive algorithms whose control flow depends on error estimates and thus on the current state of the simulation, perform numerous scalar operations, define large nonlinear models term by term or implement specialized numerical linear algebra routines, and make pervasive use of user-defined data structures to describe model components, which require efficient memory handling (stack allocation) in order for the problem to be computationally feasible.
   To handle these kinds of problems, Zygote does not utilize the standard methodology and instead generates a derivative function directly from the original source which is able to handle all input values. This is called a source-to-source transformation, a methodology with a long history (Baydin et al. 2018) going back at least to the ADIFOR source-to-source AD program for FORTRAN 77 (Bischof et al. 1996). Using this source-to-source formulation, Zygote can compile, heavily optimize, and re-use a single gradient definition for all input values. Significantly, this transformation keeps control flow intact, not unrolling loops, so that all possible branches are handled in a memory-efficient form. However, where prior source-to-source AD work has often focused on static languages, Zygote expands upon this idea by supporting a full, dynamic, high level language, Julia, in a way that allows its existing scientific and machine learning package ecosystem to benefit from this tool.
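   As a minimal illustration of requirement 2 above (this sketch is ours and is not taken from the paper's code listing), the trip count of the loop below depends on the input value, and Zygote differentiates it directly rather than recording a fixed-length tape:

      using Zygote

      # A loop whose number of iterations depends on the input value; the
      # generated adjoint keeps the `while` construct intact.
      function halve_until_one(x)
          n = 0
          while x > 1
              x /= 2
              n += 1
          end
          return x + n
      end

      Zygote.gradient(halve_until_one, 10.0)   # (0.0625,) since x is halved four times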
Generality, Flexibility, and Composability

One of the primary design decisions of a ∂P system is how these capabilities should be exposed to the user.
One convenient way to do so is using a differential operator J that operates on first class functions and once again returns a first class function (by returning a function we automatically obtain higher order derivatives, through repeated application of J). There are several valid choices for this differential operator, but a convenient choice is

               J(f) := x → (f(x), z → J_f(x) z),

i.e. J(f)(x) returns the value of f at x, as well as a function which evaluates the jacobian-vector product between the Jacobian J_f(x) and some vector of sensitivities z. From this primitive we can define the gradient of a scalar function g : Rⁿ → R as

               ∇g(x) := [J(g)(x)]₂(1)

([·]₂ selects the second value of the tuple, 1 = ∂z/∂z is the initial sensitivity).

      function J(f ∘ g)(x)
          a, da = J(f)(x)
          b, db = J(g)(a)
          b, z -> da(db(z))
      end

Figure 1: The differential operator J is able to implement the chain rule through a local, syntactic recursive transformation.

      julia> f(x) = x^2 + 3x + 1
      julia> gradient(f, 1/3)
      (3.6666666666666665,)

      julia> using Measurements;
      julia> gradient(f, 1/3 ± 0.01)
      (3.6666666666666665 ± 0.02,)

Figure 2: With two minimal definitions, Zygote is able to obtain derivatives of any function that only requires those definitions, even through custom data types (such as Measurement) and many layers of abstraction.
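   Concretely, this primitive is exposed by Zygote as a pullback. The following minimal sketch (assuming a recent Zygote.jl release; the quadratic function is our own example) shows the (value, back-propagator) pair and how the gradient is recovered by seeding the back-propagator with the initial sensitivity 1:

      using Zygote

      # J(g)(x): returns g(x) together with a function mapping a sensitivity z
      # to the Jacobian-vector product J_g(x) z.
      y, back = Zygote.pullback(x -> x^2 + 3x + 1, 1/3)

      # ∇g(x) = [J(g)(x)]₂(1): seed the back-propagator with 1 = ∂z/∂z.
      back(1.0)   # (3.6666666666666665,)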
   This choice of differential operator is convenient for several reasons: (1) The computation of the forward pass often computes values that can be re-used for the computation of the backwards pass. By combining the two operations, it is easy to re-use this work. (2) It can be used to recursively implement the chain rule (see figure 1).
   This second property also suggests the implementation strategy: hard code the operation of J on a set of primitive f's and let the AD system generate the rest by repeated application of the chain rule transform. This same general approach has been implemented in many systems (Pearlmutter and Siskind 2008; Wang et al. 2018), and a detailed description of how to perform this on Julia's SSA form IR is available in earlier work (Innes 2018).
   However, to achieve our extensibility and composability goals, we implement a slight twist on this scheme. We define a fully user extensible function ∂ that provides a default fallback as follows:

               ∂(f)(args...) = J(f)(args...),

where the implementation that is generated automatically by J recurses to ∂ rather than J and can thus easily be intercepted, using Julia's standard multiple dispatch system, at any level of the stack. For example, we might make the following definitions:

      ∂(::typeof(+))(a::IntOrFloat, b::IntOrFloat) = a + b, z → (z, z)
      ∂(::typeof(*))(a::IntOrFloat, b::IntOrFloat) = a * b, z → (z*b, a*z)

i.e. declaring how to compute the partial derivatives of + and * for two integer or float-valued numbers, but simultaneously leaving unconstrained the same for other functions or other types of values (which will thus fall back to applying the AD transform). With these two definitions, any program that is ultimately just a composition of + and * operations on real numbers will work. We show a simple example in figure 2. There, we used the user-defined Measurement type from the Measurements.jl package (Giordano 2016). We did not have to define how to differentiate the ^ function or how to differentiate + and * on a Measurement, nor did the Measurements.jl package have to be aware of the AD system in order to be differentiated. Thus standard user definitions of types are compatible with the differentiation system. This extra, user-extensible layer of indirection has a number of important consequences:
• The AD system does not depend on, nor require any knowledge of, primitives on new types. By default we provide implementations of the differential operator for many common scalar mathematical and linear algebra operations, written with a scalar LLVM backend and BLAS-like linear algebra operations. This means that even when Julia builds an array type to target TPUs (Fischer and Saba 2018), its XLA IR primitives are able to be used and differentiated without fundamental modifications to our system.
• Custom gradients become trivial. Since all operations indirect through ∂, there is no difference between user-defined custom gradients and those provided by the system. They are written using the same mechanism, are co-optimized by the compiler, and can be finely targeted using Julia's multiple dispatch mechanism.
   Since Julia solves the two language problem, its Base, standard library, and package ecosystem are almost entirely pure Julia. Thus, since our ∂P system does not require primitives to handle new types, almost all functions and types defined throughout the language are automatically supported by Zygote, and users can easily accelerate specific functions as they deem necessary.
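   In current Zygote releases this interception point is exposed to users through the @adjoint macro. The following minimal sketch (mylog1p is our own illustrative function, not part of any package) writes a custom gradient in the same (value, pullback) form as the ∂ definitions above:

      using Zygote
      import Zygote: @adjoint

      # An ordinary user-defined scalar function...
      mylog1p(x) = log(1 + x)

      # ...and a hand-written rule for it: the value together with a pullback,
      # overriding the automatically generated transform for this function.
      @adjoint mylog1p(x) = mylog1p(x), z -> (z / (1 + x),)

      Zygote.gradient(x -> 3 * mylog1p(x), 1.0)   # (1.5,)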
Figure 3: Using a neural network surrogate to solve inverse problems.

                        ∂P in Practice

Surrogates for Realtime ML-Acceleration of Inverse Problems

Model-based reinforcement learning has advantages over model-agnostic methods, given that an effective agent must approximate the dynamics of its environment (Atkeson and Santamaria 1997). However, model-based approaches have been hindered by the inability to incorporate realistic environmental models into deep learning models. Previous work has had success re-implementing physics engines using machine learning frameworks (Degrave et al. 2019; de Avila Belbute-Peres et al. 2018), but this effort has a large engineering cost, has limitations compared to existing engines, and has limited applicability to other domains such as biology or meteorology. One notable example showed that solving the chaotic three-body problem can be accelerated 100 million times by using a neural surrogate approach, but it required writing a simulator that was compatible with the chosen neural network framework (Breen et al. 2019).
   Zygote can be used for control problems, incorporating the model into backpropagation with one call to gradient. We pick trebuchet dynamics as a motivating example. Instead of aiming at a single target, we optimize a neural surrogate that can aim it given any target. The neural net takes two inputs, the target distance in metres and the current wind speed. The network outputs trebuchet settings (the mass of the counterweight and the angle of release) that get fed into a simulator which solves an ODE and calculates the achieved distance. We compare to our target and backpropagate through the entire chain to adjust the weights of the network. Our dataset is a randomly chosen set of targets and wind speeds. An agent that aims a trebuchet to a given target can thus be trained in a few minutes on a laptop CPU, resulting in a network which is a surrogate to the inverse problem that allows for aiming in constant time. Given that the forward pass of this network is about 100× faster than performing the full optimization on the trebuchet system (Figure 3), this surrogate technique gives the ability to decouple real-time model-based control from the simulation cost through a pre-trained neural network surrogate to the inverse. We present the code for this and other common reinforcement learning examples such as the cartpole and inverted pendulum (Innes, Joy, and Karmali 2019).
which solves an ODE and calculates the achieved distance.          Data-Driven Stochastic Dynamical Model
We compare to our target and backpropagate through the en-         Discovery with Neural Stochastic Differential
tire chain to adjust the weights of the network. Our dataset is
                                                                   Equations
a randomly chosen set of targets and wind speeds. An agent
that aims a trebuchet to a given target can thus be trained in a   Neural latent differential equations (Chen et al. 2018;
few minutes on a laptop CPU, resulting in a network which          Álvarez, Luengo, and Lawrence 2009; Hu et al. 2013;
is a surrogate to the inverse problem that allows for aiming       Rackauckas et al. 2019) incorporate a neural network into
in constant-time. Given that the forward-pass of this network      the ODE derivative function. Recent results have shown that
is about 100× faster than performing the full optimization         many deep learning architectures can be compacted and gen-
on the trebuchet system (Figure 3), this surrogate technique       eralized through neural ODE descriptions (Chen et al. 2018;
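   For readers without a quantum simulation package at hand, the objective being differentiated can be illustrated with a self-contained toy version in plain linear algebra. The sketch below is our own illustration, not the Yao.jl code used to produce Figure 4: it builds a 2-site version of the Heisenberg Hamiltonian, prepares a state with two Ry rotation layers and a CNOT entangler, and minimizes ⟨Ψ(θ)|H|Ψ(θ)⟩ with Zygote gradients.

      using Zygote, LinearAlgebra

      # Toy 2-site Heisenberg Hamiltonian H = (XX + YY + ZZ)/4.
      # (σy ⊗ σy is real, so the whole computation stays in real arithmetic.)
      σx  = [0.0 1.0; 1.0 0.0]
      σz  = [1.0 0.0; 0.0 -1.0]
      σyy = [0.0 0.0 0.0 -1.0; 0.0 0.0 1.0 0.0; 0.0 1.0 0.0 0.0; -1.0 0.0 0.0 0.0]
      H   = (kron(σx, σx) + σyy + kron(σz, σz)) / 4

      # Parameterized Ry rotation and a Zygote-friendly 2x2 Kronecker product.
      Ry(θ) = cos(θ / 2) * [1.0 0.0; 0.0 1.0] + sin(θ / 2) * [0.0 -1.0; 1.0 0.0]
      ⊗(A, B) = vcat(hcat(A[1, 1] * B, A[1, 2] * B), hcat(A[2, 1] * B, A[2, 2] * B))
      CNOT = [1.0 0 0 0; 0 1 0 0; 0 0 0 1; 0 0 1 0]

      # ⟨Ψ(θ)|H|Ψ(θ)⟩ for a rotation layer, an entangler, and another rotation layer.
      function energy(θ)
          ψ = [1.0, 0.0, 0.0, 0.0]                  # |00⟩
          ψ = (Ry(θ[1]) ⊗ Ry(θ[2])) * ψ
          ψ = CNOT * ψ
          ψ = (Ry(θ[3]) ⊗ Ry(θ[4])) * ψ
          dot(ψ, H * ψ)
      end

      # Plain gradient descent on the rotation angles.
      function descend(θ; steps = 500, η = 0.1)
          for _ in 1:steps
              θ = θ - η * Zygote.gradient(energy, θ)[1]
          end
          θ
      end

      θ = descend(0.1 * randn(4))
      energy(θ)   # decreases toward the ground-state energy -0.75 of this toy H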
Data-Driven Stochastic Dynamical Model Discovery with Neural Stochastic Differential Equations

Neural latent differential equations (Chen et al. 2018; Álvarez, Luengo, and Lawrence 2009; Hu et al. 2013; Rackauckas et al. 2019) incorporate a neural network into the ODE derivative function. Recent results have shown that many deep learning architectures can be compacted and generalized through neural ODE descriptions (Chen et al. 2018; He et al. 2016; Dupont, Doucet, and Whye Teh 2019; Grathwohl et al. 2018). Latent differential equations have also seen use in time series extrapolation (Gao et al. 2008) and model reduction (Ugalde et al. 2013; Hartman and Mestha 2017; Bar-Sinai et al. 2018; Ordaz-Hernandez, Fischer, and Bennis 2008).
   Here we demonstrate automatic construction of Langevin equations from data using neural stochastic differential equations (SDEs). Typically, Langevin equation models can be written in the form

               dXₜ = f(Xₜ)dt + g(Xₜ)dWₜ,

where f : Rⁿ → Rⁿ and g : Rⁿ → Rⁿˣᵐ, with Wₜ the m-dimensional Wiener process.
   To automatically learn such a physical model, we can replace the drift function f and the diffusion function g with neural networks and train these against time series data by solving the differential equation 10 times and using the average gradient of the cost function with respect to the parameters of the neural network. Our simulator utilizes high strong order adaptive integration provided by DifferentialEquations.jl (Rackauckas and Nie 2017a; 2017b). Figure 5 depicts a two-dimensional neural SDE trained using the l2 normed distance between the solution and the data points. Included is a forecast of the neural SDE solution beyond the data to a final time of 1.2, showcasing a potential use case for time series extrapolation from a learned dynamical model.

Figure 5: Neural SDE training, before (top panel) and after (bottom panel) training, plotted over time. For the SDE solution X(t), the blue line shows X₁(t) while the orange line shows X₂(t). The green points show the fitting data for X₁ while the purple points show the fitting data for X₂. The ribbons show the 95 percentile bounds of the stochastic solutions.

   The analytical formula for the adjoint of the strong solution of an SDE is difficult to efficiently calculate due to the lack of classical differentiability of the solution. However, Zygote still manages to calculate a useful derivative for optimization with respect to single solutions by treating the Brownian process as fixed and applying forward-mode automatic differentiation, showcasing Zygote's ability to efficiently optimize its AD through mixed-mode approaches (Rackauckas et al. 2018). Common numerical techniques require computing the gradient with respect to a difference over thousands of trajectories to receive an average cost, while our numerical experiments suggest that with Zygote it is sufficient to perform gradient descent on a neural SDE using only one or a few trajectories, reducing the overall computational cost by a factor of thousands. This methodological advance, combined with GPU-accelerated high order adaptive SDE integrators in DifferentialEquations.jl, makes a whole new field of study computationally accessible.
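   A sketch of how such a neural SDE can be set up with DifferentialEquations.jl and Flux is shown below. This is our own illustration rather than the paper's code listing (see the repository linked under Example Code): the network sizes, the diagonal-noise assumption, and the SOSRI solver choice are assumptions, and the sensitivity setup used to backpropagate through solve is omitted here.

      using DifferentialEquations, Flux

      # Small networks standing in for the unknown drift f and diffusion g of
      # dX_t = f(X_t)dt + g(X_t)dW_t (2-dimensional state, diagonal noise).
      drift     = Chain(Dense(2, 16, tanh), Dense(16, 2))
      diffusion = Chain(Dense(2, 16, tanh), Dense(16, 2))

      f(u, p, t) = drift(u)
      g(u, p, t) = diffusion(u)

      u0   = [1.0, 0.5]
      prob = SDEProblem(f, g, u0, (0.0, 1.0))

      # l2 misfit against time-series data `data` (2 × length(ts) matrix),
      # averaged over a handful of simulated trajectories.
      function loss(ts, data; ntraj = 4)
          total = 0.0
          for _ in 1:ntraj
              sol = solve(prob, SOSRI(), saveat = ts)
              total += sum(abs2, Array(sol) .- data)
          end
          total / ntraj
      end

      # In the paper's setup, gradients of this loss with respect to the network
      # weights are then taken with Zygote (treating the Brownian path as fixed)
      # and fed to an optimizer such as ADAM.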
Example Code  A more extensive code listing for these examples is available at the following URL: https://github.com/MikeInnes/zygote-paper.

Acknowledgments  This work would not have been possible without the work of many people. We are particularly indebted to our users that contributed the examples in this paper. Thanks to Roger Luo and JinGuo Liu for the quantum experiments. Thanks also to Zenna Tavares, Jesse Bettencourt and Lyndon White for being early adopters of Zygote and its underlying technology, and shaping its development. Much credit is also due to the core Julia language and compiler team for supporting Zygote's development, including Jameson Nash, Jeff Bezanson and others. Thanks also to James Bradbury for helpful discussions on this paper.

                         References
Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.
Atkeson, C. G., and Santamaria, J. C. 1997. A comparison of direct and model-based reinforcement learning. In Proceedings of International Conference on Robotics and Automation, volume 4, 3557–3564. IEEE.
Bar-Sinai, Y.; Hoyer, S.; Hickey, J.; and Brenner, M. P. 2018. Data-driven discretization: machine learning for coarse graining of partial differential equations. arXiv e-prints arXiv:1808.04930.
Baydin, A. G.; Pearlmutter, B. A.; Radul, A. A.; and Siskind, J. M. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research 18:1–43.
Benedetti, M.; Lloyd, E.; and Sack, S. 2019. Parameterized quantum circuits as machine learning models. arXiv preprint arXiv:1906.07682.
Bezanson, J.; Edelman, A.; Karpinski, S.; and Shah, V. B. 2017. Julia: A fresh approach to numerical computing. SIAM Review 59(1):65–98.
Bischof, C.; Khademi, P.; Mauer, A.; and Carle, A. 1996. ADIFOR 2.0: Automatic differentiation of Fortran 77 programs. IEEE Computational Science and Engineering 3(3):18–32.
Breen, P. G.; Foley, C. N.; Boekholt, T.; and Portegies Zwart, S. 2019. Newton vs the machine: solving the chaotic three-body problem using deep neural networks. arXiv e-prints arXiv:1910.07291.
Chen, T. Q.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. K. 2018. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 6571–6583.
de Avila Belbute-Peres, F.; Smith, K.; Allen, K.; Tenenbaum, J.; and Kolter, J. Z. 2018. End-to-end differentiable physics for learning and control. In Advances in Neural Information Processing Systems, 7178–7189.
Degrave, J.; Hermans, M.; Dambre, J.; and wyffels, F. 2019. A differentiable physics engine for deep learning in robotics. Frontiers in Neurorobotics 13.
Dupont, E.; Doucet, A.; and Whye Teh, Y. 2019. Augmented neural ODEs. arXiv e-prints arXiv:1904.01681.
Fischer, K., and Saba, E. 2018. Automatic full compilation of Julia programs and ML models to cloud TPUs. CoRR abs/1810.09868.
Gandhi, D.; Innes, M.; Saba, E.; Fischer, K.; and Shah, V. 2019. Julia e Flux: Modernizando o aprendizado de máquina. 5.
Gao, P.; Honkela, A.; Rattray, M.; and Lawrence, N. D. 2008. Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities. Bioinformatics 24(16):i70–i75.
Giordano, M. 2016. Uncertainty propagation with functionally correlated quantities. arXiv e-prints.
Grathwohl, W.; Chen, R. T. Q.; Bettencourt, J.; Sutskever, I.; and Duvenaud, D. K. 2018. FFJORD: Free-form continuous dynamics for scalable reversible generative models. CoRR abs/1810.01367.
Hartman, D., and Mestha, L. K. 2017. A deep learning framework for model reduction of dynamical systems. In 2017 IEEE Conference on Control Technology and Applications (CCTA), 1917–1922.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
Hu, Y.; Boker, S.; Neale, M.; and Klump, K. 2013. Coupled latent differential equation with moderators: Simulation and application. Psychological Methods 19.
Innes, M.; Saba, E.; Fischer, K.; Gandhi, D.; Rudilosso, M. C.; Joy, N. M.; Karmali, T.; Pal, A.; and Shah, V. 2018. Fashionable modelling with Flux. CoRR abs/1811.01457.
Innes, M.; Joy, N. M.; and Karmali, T. 2019. Reinforcement learning vs. differentiable programming. https://fluxml.ai/2019/03/05/dp-vs-rl.html.
Innes, M. 2018. Don't unroll adjoint: Differentiating SSA-form programs. CoRR abs/1810.07951.
Johnson, M.; Frostig, R.; Maclaurin, D.; and Leary, C. 2018. JAX: Autograd and XLA. https://github.com/google/jax.
Kandala, A.; Mezzacapo, A.; Temme, K.; Takita, M.; Brink, M.; Chow, J. M.; and Gambetta, J. M. 2017. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549(7671):242.
Li, T.-M.; Gharbi, M.; Adams, A.; Durand, F.; and Ragan-Kelley, J. 2018. Differentiable programming for image processing and deep learning in Halide. ACM Trans. Graph. (Proc. SIGGRAPH) 37(4):139:1–139:13.
Li, T.-M. 2019. Differentiable visual computing. arXiv e-prints arXiv:1904.12228.
Luo, X.-Z.; Liu, J.-G.; Zhang, P.; and Wang, L. 2019. Yao.jl: Extensible, efficient quantum algorithm design for humans. In preparation.
Mitarai, K.; Negoro, M.; Kitagawa, M.; and Fujii, K. 2018. Quantum circuit learning. Physical Review A 98(3):032309.
Ordaz-Hernandez, K.; Fischer, X.; and Bennis, F. 2008. Model reduction technique for mechanical behaviour modelling: Efficiency criteria and validity domain assessment. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 222(3):493–505.
Pearlmutter, B. A., and Siskind, J. M. 2008. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS) 30(2):7.
Preskill, J. 2018. Quantum computing in the NISQ era and beyond. Quantum 2:79.
PyTorch Team. 2018. The road to 1.0: production ready PyTorch. https://pytorch.org/blog/a-year-in/. Accessed: 2018-09-22.
Rackauckas, C., and Nie, Q. 2017a. DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia. Journal of Open Research Software 5(1).
Rackauckas, C. V., and Nie, Q. 2017b. Adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. Discrete and Continuous Dynamical Systems, Series B 22(7):2731–2761.
Rackauckas, C.; Ma, Y.; Dixit, V.; Guo, X.; Innes, M.; Revels, J.; Nyberg, J.; and Ivaturi, V. 2018. A comparison of automatic differentiation and continuous sensitivity analysis for derivatives of differential equation solutions. arXiv e-prints arXiv:1812.01892.
Rackauckas, C.; Innes, M.; Ma, Y.; Bettencourt, J.; White, L.; and Dixit, V. 2019. DiffEqFlux.jl – A Julia library for neural differential equations. CoRR abs/1902.02376.
Raissi, M.; Perdikaris, P.; and Karniadakis, G. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378:686–707.
Ugalde, H. M. R.; Carmona, J.-C.; Alvarado, V. M.; and Reyes-Reyes, J. 2013. Neural network design and model reduction approach for black box nonlinear system identification with reduced number of parameters. Neurocomputing 101:170–180.
Wang, F.; Wu, X.; Essertel, G.; Decker, J.; and Rompf, T. 2018. Demystifying differentiable programming: Shift/reset the penultimate backpropagator. arXiv preprint arXiv:1803.10228.
Álvarez, M.; Luengo, D.; and Lawrence, N. D. 2009. Latent force models. In van Dyk, D., and Welling, M., eds., Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, 9–16. Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA: PMLR.