Finding Multiple Solutions of ODEs with Neural Networks

Marco Di Giovanni (1), David Sondak (2), Pavlos Protopapas (2), Marco Brambilla (1)
(1) Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Milan, Italy
(2) Institute for Applied Computational Science, Harvard University, Cambridge, MA, United States

Copyright 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Applications of neural networks to numerical problems have gained increasing interest. Among these tasks, finding solutions of ordinary differential equations (ODEs) is one of the most intriguing problems, as there may be advantages over well-established and robust classical approaches. In this paper, we propose an algorithm to find all possible solutions of an ordinary differential equation that has multiple solutions, using artificial neural networks. The key idea is the introduction of a new loss term that we call the interaction loss. The interaction loss prevents different solution families from coinciding. We carried out experiments with two nonlinear differential equations, each admitting more than one solution, to show the effectiveness of our algorithm, and we performed a sensitivity analysis to investigate the impact of different hyper-parameters.

Introduction

A wide variety of phenomena are governed by mathematical laws that can be represented using ordinary differential equations, ranging from physics to chemistry to finance. In physics, examples include the Navier-Stokes equations, the Schrödinger equation, and many others. However, it is not always possible to solve these problems analytically, and numerical approaches are often required because the equations are too complicated or the number of equations and variables is too large.

Numerical approaches to solve differential equations have been studied and improved almost since the first differential equations were written down. Classical approaches for spatial discretization of partial differential equations (PDEs) include finite differences, finite elements, finite volumes, and spectral methods. Classical methods for discretizing ordinary differential equations (ODEs) include the Runge-Kutta (RK) methods and the backward differentiation formulas (BDF). Artificial neural networks (ANNs) have also been applied to this task. Lagaris, Likas, and Fotiadis (1998) listed a set of theoretical advantages of the ANN approach, suggesting that even if the results are not quite as good as those of classical approaches, there are good reasons to continue research in this direction. In particular, the solution obtained is differentiable with a closed analytic form, and the approach is easily parallelizable and general, since it can be applied to ODEs, systems of ODEs, and PDEs.

In this paper we design an algorithm that applies this technique to a different problem: finding all possible solutions of an ordinary differential equation that admits non-unique solutions. The vast majority of analytical and numerical techniques have been developed for, and applied to, differential equations admitting unique solutions. However, there are practical scientific problems that are modeled by nonlinear differential equations whose solutions are not unique. The Bratu equation, for example, has been used to model combustion phenomena as well as temperatures in the sun's core. The Bratu equation has no known analytical solution in more than one spatial dimension, and classical numerical methods are not able to find all solution families (Karkowski 2013).

After describing the background concepts and the proposed algorithm, we list a set of simple problems whose solutions are known in order to test the algorithm. The results are then presented, including a sensitivity analysis of the hyper-parameters that need to be tuned.
Related Work

Algorithms to solve differential equations using neural networks were originally developed by Lagaris, Likas, and Fotiadis (1998). The authors proposed a novel approach to the problem of numerically solving differential equations using ANNs, listing some advantages of their algorithm with respect to classical ones. Testing it on ordinary differential equations (ODEs), partial differential equations (PDEs), and systems of ODEs, they checked its generalization abilities, obtaining surprisingly good results. They later extended the method to problems with arbitrarily shaped domains in more than one dimension (Lagaris, Likas, and Papageorgiou 1998) and applied it to quantum mechanics (Lagaris, Likas, and Fotiadis 1997). An interesting survey is presented by Kumar and Yadav (2011). The authors collect a large number of papers on solving differential equations with both multi-layer perceptron and radial basis function techniques, published between 2001 and 2011. A similar approach, called the Deep Galerkin Method (DGM), was proposed by Sirignano and Spiliopoulos (2018). The authors present an algorithm similar to Galerkin methods, but with neural networks in place of basis functions. Raissi, Perdikaris, and Karniadakis (2019) formulate physics-informed neural networks, a framework developed to solve two main classes of problems: the data-driven solution of PDEs, also inspired by Lagaris, Likas, and Fotiadis (1998), and the data-driven discovery of PDEs, encoding in the neural networks the physical laws that govern a data set. In another work, Baymani, Kerayechian, and Effati (2010) compare the performance of neural networks against the errors of the Aman-Kerayechian method for obtaining solutions of the Stokes equation.

As stated before, one of the theoretical advantages of using artificial neural networks to solve ODEs, with respect to classical methods, is that the latter are affected by the curse of dimensionality, while the NN approach scales better with the number of dimensions. E, Han, and Jentzen (2017) analyze this property, reformulating the PDEs using backward stochastic differential equations and applying the approach to the nonlinear Black-Scholes equation, the Hamilton-Jacobi-Bellman equation, and the Allen-Cahn equation.

It is also important that neural networks incorporate known physical laws. Mattheakis et al. (2019) embed physical symmetries into the structure of the neural network. The resulting symplectic neural network is tested on a system of energy-conserving differential equations and outperforms an unsupervised, non-symplectic neural network.

There is also a recent trend investigating dynamical systems by introducing new families of neural networks, characterized by continuous depth, whose output is computed using black-box differential equation solvers. Their application is different from this work, since they are mainly used for classification tasks (He et al. 2016; Chen et al. 2018; Zhu, Chang, and Fu 2018).

Background

As described by Lagaris, Likas, and Fotiadis (1998), fully connected feed-forward neural networks can be used to compute solutions of differential equations, with some advantages and disadvantages with respect to classical approaches. In this section, we briefly review the background concepts.

Differential equations

We write a general ODE as

F(t, u(t), u'(t), u''(t), ...) = 0   (1)

where t is the independent variable and u(t) is the solution of the differential equation. We restrict the scope of the present paper to differential equations in a single variable and interpret t as time. Note that u' represents the first derivative and u'' the second derivative. We remark that the differential equation (1) considered herein need not be solvable explicitly for u'(t), as required by classical Runge-Kutta methods. This is one of the main advantages of using neural networks to solve differential equations with respect to classical approaches.

In order to determine u(t) from (1), one must specify either initial or boundary values. Boundary value problems (BVPs) impose the values of the function at the boundaries of the interval t ∈ [t0, tf], e.g. u0 = u(t0) and uf = u(tf). Initial value problems (IVPs), on the other hand, impose the value of the function and its derivatives at t0, e.g. u0 = u(t0), v0 = u'(t0). Both of these approaches can be implemented using neural networks with very little modification to the loss. For example, switching between an IVP and a BVP does not require any modification to the network architecture; one only needs to adjust the loss function to incorporate the appropriate initial or boundary conditions. Classical methods often require different algorithms for IVPs and BVPs and therefore may require additional code infrastructure. The situation becomes more severe for PDEs: the neural network then takes multiple inputs (e.g. space and time) and the loss function incorporates the appropriate initial and boundary conditions, whereas traditional PDE software may combine several algorithms to solve the full problem (e.g. time-integration in time and finite elements in space) (Seefeldt et al. 2017; Alnæs et al. 2015; Burns et al. 2016).
Solving ODEs with Neural networks

Following the methodology explained in Lagaris, Likas, and Fotiadis (1998), we use a fully connected feed-forward neural network to solve ODEs. The key idea of this approach is that the neural network itself represents the solution of the differential equation. Its parameters are learned so as to obtain a function that minimizes a loss forcing the differential equation to be satisfied and its boundary values (or initial conditions) to be met.

First, the architecture of the neural network has to be selected: the number of layers, the number of units per layer, and the activation functions. Fine-tuning these parameters and testing more complex architectures could further improve the approximation of the loss functions. For the purpose of this work, however, we fix the architecture of the neural network, selecting one that we believe is suitable for the differential equations considered, by manually inspecting the validation loss. Architectures that are too small may be unable to model the target function, while larger architectures may be harder to train. For ODEs, the neural network has a single input, the independent variable t evaluated at several discrete points ti. The output of the neural network represents the value of the function that we want to learn; since we only consider scalar equations in the present work, the output layer also has a single unit. Automatic differentiation (Corliss et al. 2002) is used to compute exact derivatives not only with respect to the parameters of the network, but also with respect to its input (here t).

The training process is accomplished by tuning the parameters of the neural network to minimize a loss function. In the current problem, the loss function is simply the squared residual of the differential operator acting on the predicted function, summed over the training points,

L_ODE = Σ_i F(t_i, u(t_i), u'(t_i), u''(t_i), ...)^2.   (2)

We note that this method is related to the Galerkin least-squares method, where the residual is evaluated at discrete points (similar to a collocation method). A function for which F(·) = 0 everywhere is said to satisfy the differential equation. In practice, the best case is that the neural network finds a function u*(t) that minimizes (2) by obtaining residual values close to zero.

To train the neural network, we pick a set of n_train points belonging to the domain of interest [t0, tf]. In this work we consider equally spaced points, although this is not required. We perform a forward pass of the network, which provides the function approximation, and compute the derivatives required by the differential operator through automatic differentiation. The forward pass is completed by computing the loss via (2). The weights and biases of the network are updated with the gradient descent algorithm via the backpropagation algorithm. This process is repeated until convergence or until the maximum number of epochs is reached. The output of the process is the function that obtains the lowest loss on an evaluation set of n_val points equally spaced in the same domain.

A critical consideration is how to enforce the specified initial and/or boundary conditions. To encode boundary values or initial conditions, we use the method described by Sirignano and Spiliopoulos (2018): we add a term to the loss function measuring the squared distance between the value of the function at the boundaries and the imposed values. In the case of BVPs this additional loss is

L_BV = (u(t0) − u0)^2 + (u(tf) − uf)^2.   (3)

In the case of IVPs, the additional loss is

L_IC = Σ_{n=0}^{n_ord − 1} ( u^(n)(t0) − v0^(n) )^2   (4)

where n_ord is the order of the differential operator, u^(n) is the n-th derivative of the function u, and v0^(n) is the initial value corresponding to the n-th derivative. For example, for a second-order IVP, the loss would contain v0^(0) and v0^(1). The total loss is

L = L_ODE + L_I   (5)

where L_I is given by (3) or (4) depending on the type of problem. In some applications, it may be necessary to introduce a penalty parameter in front of L_I to mitigate different scalings between L_ODE and L_I. In the present work, we fix the penalty parameter to unity and note that this did not interfere with the current results. It may be necessary to tune this penalty parameter for more complicated problems.
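To make the training loop above concrete, the following is a minimal sketch of the single-solution solver (not the authors' original code; PyTorch is assumed). The layer sizes and optimizer settings follow those reported in the results section, without the skip connection; the residual used in the demo call at the bottom (u'(t) − cos(t) = 0 with u(0) = 0) is only a placeholder. Higher-order equations would require differentiating the output a second time in the same way.

```python
# Sketch: train a small MLP u_theta(t) so that the residual F(t, u, u') is driven
# to zero at n_train collocation points (eq. 2), with a squared penalty enforcing
# the boundary/initial values (eqs. 3-5).
import torch

def solve_ode(residual, t0, tf, u0=None, uf=None, n_train=100, epochs=5000):
    net = torch.nn.Sequential(
        torch.nn.Linear(1, 8), torch.nn.Sigmoid(),
        torch.nn.Linear(8, 8), torch.nn.Sigmoid(),
        torch.nn.Linear(8, 1),
    )
    t = torch.linspace(t0, tf, n_train).reshape(-1, 1).requires_grad_(True)
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(epochs):
        u = net(t)
        # derivative of the network output with respect to its input t
        du = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
        loss = (residual(t, u, du) ** 2).sum()                                  # L_ODE, eq. (2)
        if u0 is not None:
            loss = loss + ((net(torch.tensor([[float(t0)]])) - u0) ** 2).sum()  # boundary term, eq. (3)
        if uf is not None:
            loss = loss + ((net(torch.tensor([[float(tf)]])) - uf) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net

# Placeholder problem, only to make the sketch runnable: u'(t) - cos(t) = 0, u(0) = 0.
net = solve_ode(lambda t, u, du: du - torch.cos(t), 0.0, 3.0, u0=0.0)
```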
Equation library

The proposed approach is tested on two nonlinear differential equations. Each of these equations is known to admit multiple families of solutions.

Clairaut equation. The general Clairaut equation is

u(t) = t u'(t) + f(u'(t)),   u(0) = u0   (6)

where f is a known nonlinear function (Ince 1956). There are two families of solutions to (6). The first is a line of the form

u(t) = Ct + f(C)   (7)

where C is determined by the initial conditions. The second family of solutions can be represented in parametric form as

t(z) = −f'(z),
u(z) = −z f'(z) + f(z).   (8)

In this work, we consider two forms for the function f. The first is

f(u'(t)) = 1 / u'(t)   (9)

which can be written in the form of (1) as

F(t, u(t), u'(t)) = t u'(t)^2 − u(t) u'(t) + 1 = 0.   (10)

With u(1) = 2, the two separate solutions are

u1(t) = 2 √t   (11a)
u2(t) = t + 1.   (11b)

For the second problem we select

f(u'(t)) = u'(t)^3   (12)

which can be written as

F(t, u(t), u'(t)) = t u'(t) − u(t) + u'(t)^3 = 0.   (13)

This problem has three distinct solutions. With u(−3) = 2, they are

u1(t) = 2 (−t/3)^(3/2)   (14a)
u2(t) = 2t + 8   (14b)
u3(t) = −t − 1.   (14c)
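As a quick sanity check of these closed forms (a sketch, not part of the paper), the residuals (10) and (13) can be evaluated on the listed solutions; all of them should vanish up to floating-point error. The evaluation intervals below are illustrative: [1, 5] is the range quoted later for the first problem, while the interval for the second problem is only an assumption chosen so that (14a) stays real.

```python
# Sketch: evaluate the Clairaut residuals (10) and (13) on the closed-form
# solutions (11) and (14); each printed maximum should be ~0.
import numpy as np

def F1(t, u, du):          # eq. (10): t u'^2 - u u' + 1
    return t * du**2 - u * du + 1.0

def F2(t, u, du):          # eq. (13): t u' - u + u'^3
    return t * du - u + du**3

t = np.linspace(1.0, 5.0, 50)                                 # problem 1, u(1) = 2
print(np.max(np.abs(F1(t, 2 * np.sqrt(t), 1 / np.sqrt(t)))))  # u1 = 2 sqrt(t)
print(np.max(np.abs(F1(t, t + 1, np.ones_like(t)))))          # u2 = t + 1

s = np.linspace(-3.0, -0.5, 50)                               # problem 2, u(-3) = 2 (interval assumed)
print(np.max(np.abs(F2(s, 2 * (-s / 3) ** 1.5, -np.sqrt(-s / 3)))))  # u1 = 2 (-t/3)^(3/2)
print(np.max(np.abs(F2(s, 2 * s + 8, 2 * np.ones_like(s)))))         # u2 = 2t + 8
print(np.max(np.abs(F2(s, -s - 1, -np.ones_like(s)))))               # u3 = -t - 1
```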
1D Bratu problem. The one-dimensional Bratu problem is

u''(t) + C e^{u(t)} = 0,   u(0) = u(1) = 0.   (15)

This ODE has two solutions if C < C_crit, a single solution at C = C_crit, and none for larger C, where C_crit ≈ 3.51. We select C = 1, in the regime where two functions solve the problem (Bratu 1914). The exact analytical solution is

u(t) = 2 ln [ cosh(α) / cosh(α(1 − 2t)) ]   (16)

where α satisfies

cosh(α) = (4 / √(2C)) α.   (17)

For C = 1, equation (17) has two solutions, α1 ≈ 0.379 and α2 ≈ 2.73, which ultimately results in two solutions of (16).
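As a quick numerical check (a sketch, assuming SciPy is available; not part of the paper), the two roots of (17) for C = 1 can be recovered by bracketing the two sign changes of cosh(α) − (4/√(2C)) α; the bracketing intervals below are assumptions chosen by inspection.

```python
# Sketch: find the two positive roots of cosh(alpha) = 4/sqrt(2C) * alpha (eq. 17)
# for C = 1; they come out near 0.379 and 2.73, giving the two Bratu solutions (16).
import numpy as np
from scipy.optimize import brentq

C = 1.0
g = lambda a: np.cosh(a) - 4.0 / np.sqrt(2.0 * C) * a

alpha1 = brentq(g, 1e-6, 1.0)   # g(0+) > 0, g(1) < 0: first sign change
alpha2 = brentq(g, 1.0, 5.0)    # g(1) < 0, g(5) > 0: second sign change
print(alpha1, alpha2)           # approximately 0.379 and 2.73

def u(t, alpha):                # exact solution, eq. (16)
    return 2.0 * np.log(np.cosh(alpha) / np.cosh(alpha * (1.0 - 2.0 * t)))
```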
Methodology

In this section, we describe an algorithm that generalizes the method described in the section "Solving ODEs with Neural networks" to find all families of solutions of a given differential equation.

To find the whole set of solutions of an ODE, we represent each of the N solutions with a different neural network u_l(t), l = 1, ..., N. Each function u_l(t) is modelled by a neural network with the same architecture, the same number of layers, and the same number of units, but weights are not shared. The training is done simultaneously, minimizing the loss function

L = Σ_{j=1}^{N} ℓ_j   (18)

where

ℓ_j = L_ODE,j + L_I,j   (19)

as in equation (5). Since the neural networks do not "interact", each will learn one of the possible solutions, but there is no requirement that the learned solutions be different. To overcome this shortcoming, we define an interaction function g(u1, u2) that detects whether two predicted functions are close: given two functions u1(t) and u2(t), the value of g(u1, u2) should be high if they are similar and low if they are different. We propose an interaction function of the form

g_p(u1, u2) = Σ_{i=1}^{n_train} |u1(t_i) − u2(t_i)|^(−p)   (20)

where p is a hyperparameter that controls the separation between the two functions. We then propose a new loss function that incorporates this interaction,

L_tot = L + λ Σ_{i≠j} g_p(u_i, u_j)   (21)

where L is given by (18) and λ is a hyperparameter setting the importance of the interaction.

Training the neural networks with the loss defined in (21) could lead to poor results near the boundaries or initial points: the different solution families must be close to each other in those regions, but (21) insists that they be different. To overcome this problem, we use the modified loss (21) only during a training interval [n_i, n_f], where n_i > 0 and n_f < n_tot. To further improve the performance of the algorithm, the interaction term can be scaled by a factor that depends on the distance of each point from the boundaries. For example, for an IVP the scaling factor is (t − t0) / (tf − t0), so that the interaction is low near t0 and maximal near tf. For BVPs, the procedure is similar, with low values at the borders and high values in the middle of the interval.

The basic training process is as follows. First, N neural networks are randomly initialized. The first part of the training proceeds for n_i epochs using the loss in (18); the networks may converge toward the same minimum or toward different ones. After n_i epochs, the new loss (21) replaces the loss (18) and the networks begin to interact through the interaction term. The networks gradually try to converge to solutions of the differential equation while moving away from each other. After n_f total epochs, the interaction term is removed and the original loss (18) is used again to help each network converge to a true minimum. If the second phase of the training separated the functions enough to place them near different minima, they will converge to different solutions. A variant of this approach is to fix one or more of the neural networks during the second part of the training, so that the effect of the interaction is to move only the remaining solutions away from the others. For example, if there are two distinct solutions, one network can be fixed near a true minimum while the other is pushed away. If the number of solutions of the differential equation is lower than the number of neural networks, at least two of them will eventually converge to the same solution.

To explore the space of solutions more thoroughly, we can gradually increase the strength of the interaction λ. During the second part of the training, the differential equation is then increasingly neglected and the functions tend to become more distinct.

Data: λ0, λM, lossM, Dm, k
N ← 1, λ ← λ0, NN_old ← None
while λ < λM do
    NN ← createNN(N)
    loss, D ← train(NN)
    if ∃ i : loss_i > lossM then
        return NN_old
    end
    if ∃ i, j : D_ij < Dm then
        λ ← λ · k
    else
        NN_old ← NN
        N ← N + 1
    end
end
return NN_old
Algorithm 1: Complete algorithm

Algorithm 1 depicts the method in detail. We initialize the number of neural networks to 1 and the interaction strength to λ0. We train the networks and compute the losses and the pairwise distances. The pairwise distances D_ij are calculated as

D_ij = (1 / n_train) Σ_{l=1}^{n_train} |u_i(t_l) − u_j(t_l)|.   (22)

We then check that every final loss is lower than a threshold lossM, in order to determine whether the algorithm converged. If not, the training process must be improved (for example by using a better optimization technique, increasing the number of epochs, or widening the architecture). If the final losses are all below the threshold, we check the pairwise distances to determine whether the solutions obtained are distinct. If at least one pair of solutions has a distance lower than a fixed threshold Dm, those functions converged to the same minimum, so we increase the strength of the interaction by a factor k and train again. Otherwise, we have found at least N solutions: we increase the number of networks by 1 and rerun the algorithm to check whether there are more than N possible solutions. This procedure allows us to compute all the solutions without knowing their number a priori. To speed up the training, the neural networks already learned can be used to initialize the new ones, so that only one network has to be learned from scratch each time N is increased.
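For illustration, the interaction term (20), its scheduling over the three training phases, and the pairwise distance (22) can be sketched as follows (PyTorch assumed; not the authors' code). Here `us` is a list of the N network outputs evaluated at the n_train training points, and `ode_losses` is the list of per-network losses (19); the default p = 5 matches the value fixed later in the paper.

```python
# Sketch of the interaction loss (eqs. 20-21) and pairwise distance (eq. 22)
# for a list of candidate solution networks evaluated at the training points.
import torch

def interaction(us, p=5):
    """Sum over ordered pairs of g_p(u_i, u_j) = sum_k |u_i(t_k) - u_j(t_k)|^(-p)."""
    total = 0.0
    for i in range(len(us)):
        for j in range(len(us)):
            if i != j:
                total = total + (torch.abs(us[i] - us[j]) ** (-p)).sum()
    return total

def total_loss(ode_losses, us, lam, epoch, n_i, n_f):
    """Eq. (21) during the second training phase [n_i, n_f], eq. (18) otherwise."""
    L = sum(ode_losses)                 # eq. (18): sum of per-network losses (19)
    if n_i <= epoch < n_f:
        L = L + lam * interaction(us)   # eq. (21)
    return L

def pairwise_distance(us):
    """D_ij = mean over training points of |u_i(t_l) - u_j(t_l)| (eq. 22)."""
    n = len(us)
    D = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            D[i, j] = torch.mean(torch.abs(us[i] - us[j]))
    return D
```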
Results and discussion

In this section the results are presented, followed by a sensitivity analysis of the selected hyper-parameters.

Neural network architecture

All results were obtained using the same network architecture. The network consists of two fully-connected layers with 8 units per layer and sigmoid activation functions. The architecture also uses a skip connection, which we observed helps with learning linear functions. The network is trained using the Adam optimizer with learning rate 10^-2, β1 = 0.99, and β2 = 0.999.
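A possible realization of this architecture is sketched below (PyTorch assumed; not the authors' code). The paper does not specify the exact form of the skip connection; here it is taken, purely for illustration, as a linear path from the input added to the output of the two hidden layers.

```python
# Sketch of the network used in the experiments: two fully-connected hidden layers
# with 8 sigmoid units each, plus a skip connection from the input to the output.
# The placement of the skip connection is an assumption, not taken from the paper.
import torch

class ODENet(torch.nn.Module):
    def __init__(self, hidden=8):
        super().__init__()
        self.body = torch.nn.Sequential(
            torch.nn.Linear(1, hidden), torch.nn.Sigmoid(),
            torch.nn.Linear(hidden, hidden), torch.nn.Sigmoid(),
            torch.nn.Linear(hidden, 1),
        )
        self.skip = torch.nn.Linear(1, 1)   # linear path helps represent linear solutions

    def forward(self, t):
        return self.body(t) + self.skip(t)

net = ODENet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2, betas=(0.99, 0.999))
```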
Interaction function analysis

We analyzed different shapes of interaction functions. The results suggest that the interaction function in equation (20) represents our idea of interaction fairly well and can be easily tuned. Figure 1 shows the shape of the function as p changes, while figure 2 shows the interaction function for different strengths λ. Although λ increases during the training, we fix the value of p = 5.

Figure 1: Interaction function at different p values, at λ = 1.
Figure 2: Interaction function at different λ values, at p = 1.

Quantitative results

The proposed approach successfully finds the distinct solutions corresponding to the analytical ones (equations (11), (14), and (16)), up to a threshold of 10^-5, calculated as the mean squared error over 10000 test points equally spaced in the domains. In figures 3, 4, and 5, the solutions of the three problems are plotted. The true solutions and the ones obtained with the new algorithm are in very good agreement; their difference is not visible.

Figure 3: Solutions of the Clairaut equation, first problem (equation 11).
Figure 4: Solutions of the Clairaut equation, second problem (equation 14).
Figure 5: Solutions of the Bratu problem (equation 16).

Training phases

Figures 6 and 7 show how the solutions evolve during the different training phases on the Clairaut equation, as the loss function changes from equation (18) to equation (21). Figure 6 shows the functions learned after n_i = 1000 epochs. They have not yet converged, but we note that they are converging towards the same minimum. Figure 7 shows the functions learned after a further n_f = 1000 epochs following the first training phase. The functions do not yet satisfy the differential equation, but they are distinct enough to converge to the different minima that represent the two solutions of the differential equation. The proposed algorithm does not guarantee that the solutions will be separated enough to reach different minima, if they exist, but it increases the chances of not converging to the same one. After the third training phase, the solutions have converged (see figure 3).

Figure 6: Two solutions of the Clairaut equation (problem 1) after the first training phase.
Figure 7: Two solutions of the Clairaut equation (problem 1) after the second training phase.

In figure 8, we plot an example of the loss during training for the Clairaut problem (problem 1). The blue line represents the loss in equation (18), while the green one is the total loss (equation (21)). We notice that they coincide before n_i = 1000 and after n_i + n_f = 2000, while they differ in between, where the algorithm minimizes the loss with the interaction term included.

Figure 8: Validation loss during training.

Selection of Hyperparameters

The algorithm described above (Algorithm 1) requires as inputs a set of hyper-parameters, which we describe and analyze in this section:

• λ0 is the initial strength of the interaction. This is not a sensitive parameter: if not properly tuned, it will only slow down the algorithm without a severe impact on the accuracy of the solutions. The only caution needed is to select it small enough to let the algorithm converge; otherwise the second training phase could move the solutions so far from the minima that they cannot properly converge back to the true solutions. We set this value to 10^-1.
• k is the strength increase, usually set to 2 in order to double the strength of the interaction at every iteration.
• λM is the maximum strength to try, usually set as a stopping condition. It should not be set too small, since the second training phase must separate the solutions enough to make them converge to different minima, but also not too high, in order not to spend too much time in the training phase (the higher λM, the higher the number of iterations of the loop). We set this value to 10.
• lossM is the maximum value of the loss below which we consider the algorithm converged (i.e. the functions learned satisfy the differential equation). We set it to 10^-4.
• Dm is the most important hyper-parameter to set, since it defines whether two learned functions are considered the same. If this value is too low, two functions will be considered different even when they are actually the same; if it is too high, two true solutions that are close to each other will be considered the same, leading to an error. This parameter is analyzed in more detail in the following section.

Analysis of Dm

Figure 9 shows how a wrong selection of Dm leads to wrong results. If we set Dm = 10^-2, a value that is too low, the algorithm finds more solutions than actually exist. All the functions shown in the figure have loss lower than lossM and are therefore considered solutions of the differential equation; however, they are more than Dm apart from each other, so they are not considered the same solution. If we pick Dm too high, distinct solutions could be considered the same and the algorithm will not be able to detect all of them. For example, since for the Clairaut equation (problem 1) the average distance between the two solutions is D* ≈ 0.61 in the range [1, 5], setting Dm ≥ D* will lead to wrong results.

Figure 9: Example of a Dm value set too low for the Clairaut equation (problem 1).
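This upper bound on a sensible Dm is easy to reproduce (a sketch, not part of the paper): applying the distance (22) to the two analytic solutions (11a) and (11b) on [1, 5] gives roughly 0.61.

```python
# Sketch: average distance (eq. 22) between u1 = 2*sqrt(t) and u2 = t + 1 on [1, 5];
# this is the D* ~ 0.61 quoted for Clairaut problem 1, an upper bound for a sensible Dm.
import numpy as np

t = np.linspace(1.0, 5.0, 10_000)
print(np.mean(np.abs(2.0 * np.sqrt(t) - (t + 1.0))))   # ~0.61
```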
Conclusion

In this paper, we designed a novel algorithm to find all possible solutions of a differential equation with non-unique solutions using neural networks. We tested its accuracy by applying it to three simple nonlinear problems known to admit more than one solution. The method successfully finds the solutions with high accuracy. We tested different interaction functions and hyper-parameters, and we verified the robustness of the approach. We expect this algorithm to work also for more challenging ODEs, provided the hyper-parameters are selected carefully. Future work will involve not only the application and analysis of the proposed algorithm to higher-dimensional ODEs and systems of ODEs, but also the exploitation of more appropriate architectures, such as dynamical systems like ResNets (He et al. 2016).

References

Alnæs, M.; Blechta, J.; Hake, J.; Johansson, A.; Kehlet, B.; Logg, A.; Richardson, C.; Ring, J.; Rognes, M. E.; and Wells, G. N. 2015. The FEniCS project version 1.5. Archive of Numerical Software 3(100).

Baymani, M.; Kerayechian, A.; and Effati, S. 2010. Artificial neural networks approach for solving Stokes problem. Applied Mathematics 1:288–292.

Bratu, G. 1914. Sur les équations intégrales non linéaires. Bulletin de la Société Mathématique de France 42:113–142.

Burns, K. J.; Vasil, G. M.; Oishi, J. S.; Lecoanet, D.; and Brown, B. 2016. Dedalus: Flexible framework for spectrally solving differential equations. Astrophysics Source Code Library.

Chen, T. Q.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. K. 2018. Neural ordinary differential equations. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc. 6571–6583.

Corliss, G.; Faure, C.; Griewank, A.; Hascoet, L.; and Naumann, U. 2002. Automatic differentiation of algorithms: from simulation to optimization, volume 1. Springer Science & Business Media.

E, W.; Han, J.; and Jentzen, A. 2017. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics 5(4):349–380.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ince, E. 1956. Ordinary Differential Equations. Dover Books on Mathematics. Dover Publications.

Karkowski, J. 2013. Numerical experiments with the Bratu equation in one, two and three dimensions. Computational and Applied Mathematics 32(2):231–244.

Kumar, M., and Yadav, N. 2011. Multilayer perceptrons and radial basis function neural network methods for the solution of differential equations: A survey. Computers and Mathematics with Applications 62(10):3796–3811.

Lagaris, I.; Likas, A.; and Fotiadis, D. 1997. Artificial neural network methods in quantum mechanics. Computer Physics Communications 104(1):1–14.

Lagaris, I. E.; Likas, A.; and Fotiadis, D. I. 1998. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks 9(5):987–1000.

Lagaris, I. E.; Likas, A.; and Papageorgiou, D. G. 1998. Neural network methods for boundary value problems defined in arbitrarily shaped domains. CoRR cs.NE/9812003.

Mattheakis, M.; Protopapas, P.; Sondak, D.; Di Giovanni, M.; and Kaxiras, E. 2019. Physical symmetries embedded in neural networks.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378:686–707.

Seefeldt, B.; Sondak, D.; Hensinger, D. M.; Phipps, E. T.; Foucar, J. G.; Pawlowski, R. P.; Cyr, E. C.; Shadid, J. N.; Smith, T. M.; Weber, P. D.; et al. 2017. Drekar v. 2.0. Technical report, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States).

Sirignano, J., and Spiliopoulos, K. 2018. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics 375:1339–1364.

Zhu, M.; Chang, B.; and Fu, C. 2018. Convolutional neural networks combined with Runge-Kutta methods. arXiv preprint arXiv:1802.08831.