Learning Physics-guided Neural Networks with Competing Physics Loss: A Summary of Results in Solving Eigenvalue Problems

Mohannad Elhamod1*, Jie Bu1*, Christopher Singh2, Matthew Redell2, Abantika Ghosh3, Viktor Podolskiy3, Wei-Cheng Lee2, Anuj Karpatne1
1 Department of Computer Science, Virginia Tech; 2 Department of Physics, Binghamton University; 3 Department of Physics and Applied Physics, University of Massachusetts Lowell; * Equal contribution
{elhamod, jayroxis, karpatne}@vt.edu, {csingh5, mredell1, wlee}@binghamton.edu, {abantika, viktor podolskiy}@uml.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Existing work in Physics-guided Neural Networks (PGNNs) has demonstrated the efficacy of adding single PG loss functions in the neural network objectives, using constant trade-off parameters, to ensure better generalizability. However, in the presence of multiple physics loss functions with competing gradient directions, there is a need to adaptively tune the contribution of competing PG loss functions during the course of training to arrive at generalizable solutions. We demonstrate the presence of competing PG losses in the generic neural network problem of solving for the lowest (or highest) eigenvector of a physics-based eigenvalue equation, common to many scientific problems. We present a novel approach to handle competing PG losses and demonstrate its efficacy in learning generalizable solutions in two motivating applications of quantum mechanics and electromagnetic propagation.

1 Introduction

With the increasing impact of deep learning methods in diverse scientific disciplines (Appenzeller 2017; Graham-Rowe et al. 2008), there is a growing realization in the scientific community to harness the power of artificial neural networks (ANNs) without ignoring the rich supervision available in the form of physics knowledge in several scientific problems (Karpatne et al. 2017a; Willard et al. 2020). One of the promising lines of research in this direction is to modify the objective function of neural networks by adding loss functions that measure the violations of ANN outputs with physical equations, termed physics-guided (PG) loss functions (Karpatne et al. 2017b; Stewart and Ermon 2017). By anchoring ANN models to be consistent with physics, PG loss functions have been shown to impart generalizability even in the paucity of training data across several scientific problems (Jia et al. 2019; Karpatne et al. 2017c; Raissi, Perdikaris, and Karniadakis 2019; de Bezenac, Pajot, and Gallinari 2019). We refer to the class of neural networks that are trained using PG loss functions as physics-guided neural networks (PGNNs).

While some existing work in PGNN has attempted to learn neural networks by solely minimizing PG loss (and thus being label-free) (Raissi, Perdikaris, and Karniadakis 2019; Stewart and Ermon 2017), others have used both PG loss and data label loss with appropriate trade-off hyper-parameters (Karpatne et al. 2017c; Jia et al. 2019). However, what is even more challenging is when there are multiple physics equations with competing PG loss functions that need to be minimized together, where each PG loss may show multiple local minima. In such situations, simple addition of PG losses in the objective function with constant trade-off hyper-parameters may result in the learning of non-generalizable solutions. This may seem counter-intuitive, since the addition of PG loss is generally assumed to offer generalizability in the PGNN literature (Karpatne et al. 2017c; de Bezenac, Pajot, and Gallinari 2019; Shin, Darbon, and Karniadakis 2020). This motivates us to ask the question: is it possible to adaptively balance the importance of competing PG loss functions at different stages of neural network learning to arrive at generalizable solutions?

In this work, we introduce a novel framework, CoPhy-PGNN, which is an abbreviation for Competing Physics Physics-Guided Neural Networks, to handle competing PG loss functions in neural network training.
We specifically consider the domain of scientific problems where physics knowledge is represented as eigenvalue equations and we are required to solve for the highest or lowest eigen-solution. This representation is common to many types of physics, such as the Schrödinger equation in the domain of quantum mechanics and Maxwell's equations in the domain of electromagnetic propagation. In these applications, solving eigenvalue equations using exact numerical techniques (e.g., diagonalization methods) can be computationally expensive, especially for large physical systems. On the other hand, PGNN models, once trained, can be applied on testing scenarios to predict their eigen-solutions in drastically smaller running times. We empirically demonstrate the efficacy of our CoPhy-PGNN solution on two diverse applications in quantum mechanics and electromagnetic propagation, highlighting the generalizability of our proposed approach to many physics problems.

2 Background

2.1 Overview of Physics Problems:

The physics of the problem is available in the form of an eigenvalue equation of the form Ây = by, where, for a given input matrix Â, b is an eigenvalue and y is the corresponding eigenvector. We are interested in solving for the lowest or highest eigen-solution of this equation in our target problems.
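To make this setup concrete, the following minimal sketch (an illustration of the generic problem, not part of the paper's pipeline) shows how the lowest eigen-solution of Ây = by can be obtained with an off-the-shelf dense diagonalization routine. The random symmetric matrix used here is a placeholder rather than one of the physical operators studied below.

```python
import numpy as np

# Random symmetric placeholder standing in for a physical operator A-hat.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
A = 0.5 * (A + A.T)  # symmetrize so that the eigenvalues are real

# Dense diagonalization, the kind of exact (but expensive) solver that a
# trained PGNN is meant to replace at test time.
eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues returned in ascending order

b_lowest = eigvals[0]      # lowest eigenvalue b
y_lowest = eigvecs[:, 0]   # corresponding unit-norm eigenvector y

# Sanity check: A y = b y up to numerical precision.
assert np.allclose(A @ y_lowest, b_lowest * y_lowest)
```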
Here, we provide a brief overview of the two target applications.

Quantum Mechanics: In this application, the goal is to predict the ground-state wave function of an Ising chain model with n = 4 particles. This problem can be described by the Schrödinger equation HΨ̂ = ÊΨ̂, where Ê, the energy level, is the eigenvalue; Ψ̂, the wave function, is the eigenvector; and H, the Hamiltonian, is the matrix. Since the ground-state wave function corresponds to the lowest energy level, we are interested in finding the lowest eigen-solution of this eigenvalue equation. To be able to execute a detailed analysis, we choose a small problem scale (n = 4) for this application.

Electromagnetic Propagation: To illustrate our model's scalability to large systems, we consider another application involving the propagation of electromagnetic waves in periodically stratified layer stacks. The description of this propagation can be reduced to the eigenvalue problem Â h⃗_m = k_{zm}² h⃗_m, where k_{zm}², the propagation constant of the electromagnetic modes along the layers, is the eigenvalue, and h⃗_m, the coefficients of the Fourier transform of the spatial profile of the electromagnetic field, is the eigenvector. It is important to note for this application that these quantities are complex-valued, and that we are interested in the largest eigenvalue rather than the smallest.

2.2 Related work in PGNN:

PGNN has found successful applications in several disciplines including fluid dynamics (Wang, Wu, and Xiao 2017, 2016; Wang et al. 2017), climate science (de Bezenac, Pajot, and Gallinari 2019), and lake modeling (Karpatne et al. 2017c; Jia et al. 2019; Daw et al. 2020). However, to the best of our knowledge, PGNN formulations have not been explored yet for our target applications of solving eigenvalue equations in the field of quantum mechanics and electromagnetic propagation. Existing work in PGNN can be broadly divided into two categories. The first category involves label-free learning by only minimizing PG loss without using any labeled data. For example, Physics-informed neural networks (PINNs) and their variants (Raissi, Perdikaris, and Karniadakis 2019, 2017a,b) have been recently developed to solve PDEs by solely minimizing PG loss functions, for simple canonical problems such as Burgers' equation. Since these methods are label-free, they do not explore the interplay between PG loss and label loss. We consider an analogue of PINN for our target application as a baseline in our experiments.

The second category of methods incorporates PG loss as additional terms in the objective function along with label loss, using constant trade-off hyper-parameters. This includes work in basic Physics-guided Neural Networks (PGNNs) (Karpatne et al. 2017c; Jia et al. 2019) for the target application of lake temperature modeling. We use an analogue of this basic PGNN as a baseline in our experiments.

While some recent works have investigated the effects of PG loss on generalization performance (Shin, Darbon, and Karniadakis 2020) and the importance of normalizing the scale of hyper-parameters corresponding to PG loss terms (Wang, Teng, and Perdikaris 2020), they do not study the effects of competing physics losses, which is the focus of this paper. Our work is related to the field of multi-task learning (MTL) (Caruana 1993), as the minimization of physics losses and label loss can be viewed as multiple shared tasks. For example, alternating minimization techniques in MTL (Kang, Grauman, and Sha 2011) can be used to alternate between minimizing different PG loss and label loss terms over different mini-batches. We consider this as a baseline approach in our experiments.

3 Methodology

3.1 Problem statement:

From an ML perspective, we are given a collection of training pairs, D_Tr := {Â_i, (y_i, b_i)}, i = 1, ..., N, where (y_i, b_i) is generated by diagonalization solvers. We consider the problem of learning an ANN model, (ŷ, b̂) = f_NN(Â, θ), that can predict (y, b) for any input matrix Â, where θ are the learnable parameters of the ANN. We are also given a set of unlabeled examples, D_U := {Â_i}, i = 1, ..., M, which will be used for testing. We consider a simple feed-forward architecture of f_NN in all our formulations.

3.2 Designing physics-guided loss functions:

A naïve approach for learning f_NN is to minimize the mean sum of squared errors (MSE) of predictions on the training set, referred to as the Train-MSE. However, instead of solely relying on Train-MSE, we consider the following PG loss terms to guide the learning of f_NN to generalizable solutions.

Characteristic Loss: A fundamental equation we want to satisfy in our predictions, (ŷ, b̂), for any input Â is the eigenvalue equation, Âŷ = b̂ŷ. Hence, we consider minimizing the following loss:

C-Loss(θ) := Σ_i ||Â_i ŷ_i − b̂_i ŷ_i||² / (ŷ_i^⊤ ŷ_i),   (1)

where the denominator term ensures that ŷ resides on a unit hyper-sphere with ||ŷ|| = 1, thus avoiding scaling issues. Note that, by construction, C-Loss only depends on the predictions of f_NN and does not rely on true labels, (y, b). Hence, C-Loss can be evaluated even on the unlabeled test data, D_U.
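A minimal PyTorch sketch of Eq. (1) is given below for the real-valued case, assuming a batch of matrices Â of shape (B, d, d), predicted eigenvectors ŷ of shape (B, d), and predicted eigenvalues b̂ of shape (B,). The function name and shapes are our own illustration, not the authors' released code.

```python
import torch

def c_loss(A, y_hat, b_hat):
    """Characteristic loss of Eq. (1): residual of the eigenvalue equation A y = b y.

    A:     (B, d, d) batch of input matrices
    y_hat: (B, d)    predicted eigenvectors
    b_hat: (B,)      predicted eigenvalues
    """
    Ay = torch.bmm(A, y_hat.unsqueeze(-1)).squeeze(-1)   # A_i y_i for each sample
    residual = Ay - b_hat.unsqueeze(-1) * y_hat          # A_i y_i - b_i y_i
    num = (residual ** 2).sum(dim=-1)                    # squared L2 norm per sample
    den = (y_hat ** 2).sum(dim=-1)                       # y^T y discourages trivial rescaling
    return (num / den).sum()
```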
Spectrum Loss: Note that there are many non-interesting solutions of Âŷ = b̂ŷ that can appear as "local minima" in the optimization landscape of C-Loss. For example, for every input Â_i ∈ D_U, there are d possible eigen-solutions (where d is the length of ŷ), each of which will result in a perfectly low value of C-Loss = 0, thus acting as a local minimum. However, we are only interested in a specific eigenvalue (usually the smallest or the largest) for every Â_i. Therefore, we consider minimizing another PG loss term that ensures the predicted b̂ at every sample is the desired one. In the case of the quantum mechanics application, we use the following loss to find the smallest eigen-solution:

S-Loss(θ) := Σ_i exp(b̂_i).   (2)

The use of the exp function ensures that S-Loss is always positive, even when predicted eigenvalues are negative (which is the case for all energy states, especially the ground state). As for the electromagnetic propagation application, we simply direct the optimization towards the largest eigenvalue by replacing b̂_i with −Re(b̂_i), where Re extracts the real part of the complex eigenvalue. Since in both cases the exp function is applied over negative quantities, S-Loss has smoothly varying gradients.
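The corresponding sketch of Eq. (2) is shown below; the `largest` flag mimics the substitution of b̂_i by −Re(b̂_i) described above, under the assumption that the caller passes in the (real part of the) predicted eigenvalues.

```python
import torch

def s_loss(b_hat, largest=False):
    """Spectrum loss of Eq. (2), steering predictions toward the desired eigenvalue.

    b_hat:   (B,) real-valued predicted eigenvalues (use the real part in the
             complex-valued electromagnetic case)
    largest: if True, negate the eigenvalues so that minimizing the loss favors
             the largest eigenvalue instead of the smallest
    """
    b = -b_hat if largest else b_hat
    return torch.exp(b).sum()
```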
3.3 Adaptive tuning of PG loss weights:

A simple strategy for incorporating PG loss terms in the learning objective of f_NN is to add them to Train-MSE using trade-off weight parameters, λ_C and λ_S, for C-Loss and S-Loss, respectively. Conventionally, such trade-off weights are kept constant at a certain value across all epochs of gradient descent. This inherently assumes that the importance of PG loss terms in guiding the learning of f_NN towards a generalizable solution is constant across all stages (or epochs) of gradient descent, and that they are in agreement with each other. However, in practice, we empirically find that C-Loss, S-Loss, and Train-MSE compete with each other and have varying importance at different stages (or epochs) of ANN learning. Hence, we consider the following ways of adaptively tuning the trade-off weights of C-Loss and S-Loss, λ_C and λ_S, as a function of the epoch number t.

Annealing λ_S: The first observation we make is that S-Loss plays a critical role in the initial stages of learning. Having a large value of λ_S in the beginning few epochs is thus helpful to avoid the selection of local minima and instead converge towards a generalizable solution. Hence, we consider performing a simulated annealing of λ_S that takes on a high value in the beginning epochs and slowly decays to 0 after sufficiently many epochs. Specifically, we consider the following annealing procedure for λ_S:

λ_S(t) = λ_S0 × (1 − α_S)^⌊t/T⌋,   (3)

where λ_S0 is a hyper-parameter denoting the starting value of λ_S at epoch 0, α_S < 1 is a hyper-parameter that controls the rate of annealing, and T is a scaling hyper-parameter.

Cold Starting λ_C: The second observation we make is on the effect of C-Loss on the convergence of gradient descent towards a generalizable solution. Note that C-Loss suffers from a large number of local minima and hence is susceptible to favoring the learning of non-generalizable solutions. Hence, in the beginning epochs, it is important to keep C-Loss turned off. Once we have crossed a sufficient number of epochs and have already zoomed into a region in the parameter space in close vicinity to a generalizable solution, we can safely turn on C-Loss so that it can help refine θ to converge to the generalizable solution. Essentially, we perform "cold starting" of λ_C as given by the following procedure:

λ_C(t) = λ_C0 × sigmoid(α_C × (t − T_a)),   (4)

where λ_C0 is a hyper-parameter denoting the constant value of λ_C after a sufficient number of epochs, α_C is a hyper-parameter that dictates the rate of growth of the sigmoid function, and T_a is a hyper-parameter that controls the cut-off number of epochs after which λ_C is activated from a cold start of 0.

Overall Learning Objective: Combining all of the innovations described above in designing and incorporating PG loss functions, we consider the following overall learning objective:

E(t) = Train-Loss + λ_C(t) C-Loss + λ_S(t) S-Loss.

Note that Train-Loss is only computed over D_Tr, whereas the PG loss terms, C-Loss and S-Loss, are computed over D_Tr as well as the set of unlabeled samples, D_U. We refer to our proposed model trained using the above learning objective as CoPhy-PGNN, which is an abbreviation for Competing Physics PGNN.
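The two schedules and the combined objective can be written compactly as below; the specific hyper-parameter values are placeholders for illustration only, and the floor in Eq. (3) is implemented with integer division.

```python
import math

def lambda_s(t, lambda_s0=1.0, alpha_s=0.1, T=10):
    """Annealed weight of Eq. (3): starts at lambda_s0 and decays toward 0."""
    return lambda_s0 * (1.0 - alpha_s) ** (t // T)

def lambda_c(t, lambda_c0=1.0, alpha_c=0.5, T_a=50):
    """Cold-started weight of Eq. (4): stays near 0 until roughly epoch T_a."""
    return lambda_c0 / (1.0 + math.exp(-alpha_c * (t - T_a)))

# Inside a training loop (sketch of the overall objective E(t)):
#   loss = train_mse + lambda_c(epoch) * c_loss(A, y_hat, b_hat)
#                    + lambda_s(epoch) * s_loss(b_hat)
```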
4 Evaluation setup

Data in Quantum Physics: We considered n = 4 spin systems of Ising chain models for predicting their ground-state wave-function under varying influences of two controlling parameters: B_x and B_z, which represent the strength of the external magnetic field along the X axis (parallel to the direction of the Ising chain) and the Z axis (perpendicular to the direction of the Ising chain), respectively. The Hamiltonian matrix H for these systems is then given as:

H = − Σ_{i=0}^{n−1} σ_i^z σ_{i+1}^z − B_x Σ_{i=0}^{n−1} σ_i^x − B_z Σ_{i=0}^{n−1} σ_i^z,   (5)

where σ^{x,y,z} are Pauli operators and ring boundary conditions are imposed. Note that the size of H is d = 2^n = 16. We set B_z equal to 0.01 to break the ground-state degeneracy, while B_x was sampled from a uniform distribution over the interval [0, 2].

Note that when B_x < 1, the system is said to be in a ferromagnetic phase, since all the spins prefer to either point upward or downward collectively. However, when B_x > 1, the system transitions to a paramagnetic phase, where both upward and downward spins are equally possible. Because the ground-state wave-function behaves differently in the two regions, the system actually exhibits different physical properties. Hence, in order to test for the generalizability of ANN models when training and test distributions are different, we generate training data only from the region deep inside the ferromagnetic phase with B_x < 0.5, while the test data is generated from a much wider range 0 < B_x < 2, covering both ferromagnetic and paramagnetic phases. In particular, the training set comprises N = 100,000 points with B_x uniformly sampled from 0 to 0.5, while the test set comprises M = 20,000 points with B_x uniformly sampled from 0 to 2. For validation, we used sub-sampling on the training set to obtain a validation set of 2000 samples. We performed 10 random runs of uniform sampling over N to show the mean and variance of the performance metrics of comparative ANN models, where at every run a different random initialization of the ANN models is also used. Unless otherwise stated, the results in any experiment are presented over training size N = 2000.
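As an illustration of how such labeled examples could be generated, the sketch below builds the Hamiltonian of Eq. (5) from Kronecker products of Pauli matrices (with ring boundary conditions) and extracts the ground state by dense diagonalization; the value of B_x used here is an arbitrary placeholder.

```python
import numpy as np

# Pauli matrices and the 2x2 identity.
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.array([[1.0, 0.0], [0.0, -1.0]])
I2 = np.eye(2)

def site_op(op, i, n):
    """Embed a single-site operator `op` at site i of an n-site chain."""
    mats = [I2] * n
    mats[i] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def ising_hamiltonian(n, Bx, Bz=0.01):
    """Hamiltonian of Eq. (5) with ring boundary conditions (site n wraps to site 0)."""
    d = 2 ** n
    H = np.zeros((d, d))
    for i in range(n):
        H -= site_op(sz, i, n) @ site_op(sz, (i + 1) % n, n)  # -sum_i sigma^z_i sigma^z_{i+1}
        H -= Bx * site_op(sx, i, n)                            # -B_x sum_i sigma^x_i
        H -= Bz * site_op(sz, i, n)                            # -B_z sum_i sigma^z_i
    return H

# Ground state of the 16 x 16 Hamiltonian for n = 4 particles.
H = ising_hamiltonian(n=4, Bx=0.3)
energies, states = np.linalg.eigh(H)
E0, psi0 = energies[0], states[:, 0]
```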
Data in Electromagnetic Propagation: We considered a periodically stratified layer stack of 10 layers of equal length per period. The refractive index n of each layer was randomly assigned an integer value between 1 and 4. Hence, the permittivity ε = n² can take values from {1, 4, 9, 16}. Note that the majority of eigenvalue solvers rely on iterative algorithms and are therefore not easily deployable in GPU environments. To demonstrate the scalability of our approach, we generate N = 2000 realizations of the layered structure. For each example, we also generate the associated Â of size 401 × 401 complex values, making the scale of this problem about 2500 times larger than that of the quantum mechanics problem. The combination of the challenging scale of this eigen-decomposition and the scarcity of training data makes this problem interesting from a scalability and generalizability perspective. To demonstrate extrapolation ability, we take a training set of |D_Tr| = 370 realizations that have a refractive index of only 1 in the first layer. On the other hand, we take a test set of size |D_U| = 1630 with the first layer's refractive index unconstrained (i.e., any value from the set {1, 2, 3, 4}).

Baseline Methods: Since there does not exist any related work in PGNN that has been explored for our target applications, we construct analogue versions, PINN-analogue (Raissi, Perdikaris, and Karniadakis 2019) and PGNN-analogue (Karpatne et al. 2017c), adapted to our problem using their major features. We describe these baselines along with others in the following:
1. Black-box NN (or NN): This refers to the "black-box" ANN model trained just using Train-Loss without any PG loss terms.
2. PGNN-analogue: The analogue version of PGNN (Karpatne et al. 2017c) for our problem, where the hyper-parameters corresponding to S-Loss and C-Loss are set to a constant value.
3. PINN-analogue: The analogue version of PINN (Raissi, Perdikaris, and Karniadakis 2019) for our problem that performs label-free learning only using PG loss terms with constant weights. Note that the PG loss terms are not defined as PDEs in our problem.
4. MTL-PGNN: A multi-task learning (MTL) variant of PGNN where the PG loss terms are optimized alternately (Kang, Grauman, and Sha 2011) by randomly selecting one from all the loss terms for each mini-batch in every epoch.

We also consider the following ablation models:
1. CoPhy-PGNN (only-D_Tr): This is an ablation model where the PG loss terms are only trained over the training set, D_Tr. Comparing our results with this model will help in evaluating the importance of using unlabeled samples D_U in the computation of PG loss.
2. CoPhy-PGNN (w/o S-Loss): This is another ablation model where we only consider C-Loss in the learning objective, while discarding S-Loss.
3. CoPhy-PGNN (Label-free): This ablation model drops Train-MSE from the learning objective and hence performs label-free (LF) learning only using PG loss terms.

Evaluation Metrics: We use two evaluation metrics: (a) Test-MSE, and (b) Cosine Similarity between our predicted eigenvector, ŷ, and the ground-truth, y, averaged across all test samples. We particularly chose the cosine similarity for multiple reasons. First, Euclidean distances are not very meaningful in the high-dimensional spaces of wave-functions, such as the ones we are considering in our analyses. Second, an ideal cosine similarity of 1 provides an intuitive baseline to evaluate the goodness of results. But most importantly, in the electromagnetic propagation application, it is crucial to compare not just the Fourier coefficients of the expansion (which is what the neural net produces) but rather the actual profile of the magnetic field in real space. The accuracy of this prediction can be tested by calculating the overlap integral between the exact and the predicted profiles. That integral, due to the orthogonality of the Fourier expansion, reduces to the cosine similarity. This facilitates testing whether our predicted vectors are valid eigenvectors from a physical standpoint.
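A sketch of the cosine similarity metric for the real-valued case is shown below; taking the absolute value of the inner product, which discounts the arbitrary sign of an eigenvector, is our own assumption rather than a detail spelled out in the text.

```python
import numpy as np

def mean_cosine_similarity(Y_pred, Y_true):
    """Average cosine similarity between predicted and ground-truth eigenvectors.

    Y_pred, Y_true: (N, d) arrays holding one eigenvector per test sample.
    """
    num = np.abs(np.sum(Y_pred * Y_true, axis=1))
    den = np.linalg.norm(Y_pred, axis=1) * np.linalg.norm(Y_true, axis=1)
    return float(np.mean(num / den))
```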
5 Results and analysis

5.1 Quantum Physics Application:

Models | MSE (×10²) | Cosine Similarity
CoPhy-PGNN (proposed) | 0.35 ± 0.12 | 99.50 ± 0.12%
Black-box NN | 1.06 ± 0.16 | 95.32 ± 0.58%
PINN-analogue | 6.27 ± 6.94 | 87.37 ± 12.87%
PGNN-analogue | 0.91 ± 1.90 | 97.97 ± 4.89%
MTL-PGNN | 6.33 ± 2.69 | 84.26 ± 6.33%
CoPhy-PGNN (only-D_Tr) | 1.82 ± 0.36 | 93.61 ± 0.91%
CoPhy-PGNN (w/o S-Loss) | 10.97 ± 0.71 | 76.27 ± 0.80%
CoPhy-PGNN (Label-free) | 9.97 ± 4.42 | 63.97 ± 16.20%

Table 1: Test-MSE and Cosine Similarity of comparative ANN models on training size N = 1000 on the quantum physics application.

Table 1 provides a summary of the comparison of CoPhy-PGNN with baseline methods on the quantum physics application. We can see that our proposed model shows significantly better performance in terms of both Test-MSE and Cosine Similarity. In fact, the cosine similarity of our proposed model is almost 1, indicating an almost perfect fit with test labels. (Note that even a small drop in cosine similarity can lead to cascading errors in the estimation of other physical properties derived from the ground-state wave-function.) An interesting observation from Table 1 is that CoPhy-PGNN (Label-free) actually performs even worse than Black-box NN. This shows that solely relying on PG loss without considering Train-MSE is fraught with challenges in arriving at a generalizable solution. Indeed, using a small number of labeled examples to compute Train-MSE provides a significant nudge to ANN learning to arrive at more accurate solutions. Another interesting observation is that CoPhy-PGNN (only-D_Tr) again performs even worse than Black-box NN. This demonstrates that it is important to use unlabeled samples in D_U, which are representative of the test set, to compute the PG loss. Furthermore, notice that CoPhy-PGNN (w/o S-Loss) actually performs worst across all models, possibly due to the highly non-convex nature of the C-Loss function, which can easily lead to local minima when used without S-Loss. This sheds light on another important aspect of PGNN that is often overlooked: it does not suffice to simply add a PG-Loss term in the objective function in order to achieve generalizable solutions. In fact, an improper use of PG loss can result in worse performance than a black-box model.

Evaluating generalization power: Instead of computing the average cosine similarity across all test samples, Figure 1 analyzes the trends in cosine similarity over test samples with different values of B_x, for four comparative models. Note that none of these models have observed any labeled data during training outside the interval of B_x ∈ [0, 0.5]. Hence, by testing for the cosine similarity over test samples with B_x > 0.5, we are directly testing for the ability of ANN models to generalize outside the data distributions they have been trained upon. Evidently, all label-aware models perform well on the interval of B_x ∈ [0, 0.5]. However, except for CoPhy-PGNN, all baseline models degrade significantly outside that interval, proving their lack of generalizability. Moreover, the label-free model, CoPhy-PGNN (Label-free), is highly erratic and performs poorly across the board.

Figure 1: Cosine Similarity on test samples as a function of B_x for NN, CoPhy-PGNN (only-D_Tr), CoPhy-PGNN, and CoPhy-PGNN (Label-free). The dashed line represents the boundary between the interval used for training (left) and testing (right).

Analysis of loss landscapes: We visualize the landscape of different loss functions w.r.t. the ANN model parameters. In particular, we use the code in (Bernardi 2019) to plot a 2D view of the landscape of different loss functions, namely Train-MSE, Test-MSE, and PG-Loss (the sum of C-Loss and S-Loss), in the neighborhood of a model solution, as shown in Figure 2. The model's parameters are treated with filter normalization as described in (Li et al. 2018), and hence the coordinate values of the axes are unit-less. Also, the model solutions are represented by blue dots. As can be seen, all label-aware models have found a minimum in the Train-MSE landscape. However, when the Test-MSE loss surface is plotted, it is clear that while the CoPhy-PGNN model is still at a minimum, the other baseline models are not. This is a strong indication that using the PG loss with unlabeled data can lead to better extrapolation; it allows the model to generalize beyond in-distribution data. We can see that without using labels, CoPhy-PGNN (Label-free) fails to reach a good minimum of Test-MSE, even though it arrives at a minimum of PG-Loss.

Figure 2: A comprehensive comparison between CoPhy-PGNN and different baseline models, showing the Train-MSE, Test-MSE, and PG-loss landscapes for NN, CoPhy-PGNN (only-D_Tr), CoPhy-PGNN (Label-free), and CoPhy-PGNN. The 1st and 2nd columns show that without using unlabeled data, the model does not generalize well. On the other hand, the 3rd column shows that without labeled data, the model fails to reach a good minimum. Only the last column, our proposed model, shows a good fit across both labeled and unlabeled data. The best performing model is also the model that best optimizes the PG loss.
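For readers who want to reproduce this kind of visualization without the loss-landscapes package, the sketch below evaluates a loss on a 2D slice of parameter space around the trained weights; it uses a per-tensor normalization of the random directions, which is a simplification of the per-filter normalization of Li et al. (2018) used for Figure 2.

```python
import torch

@torch.no_grad()
def loss_grid(model, loss_fn, span=1.0, steps=25):
    """Evaluate loss_fn(model) on a 2D slice of parameter space around the current weights.

    loss_fn: callable taking the model and returning a scalar loss value.
    Returns a (steps, steps) tensor of loss values.
    """
    theta = [p.detach().clone() for p in model.parameters()]

    def random_direction():
        # Random direction with each tensor rescaled to the norm of the corresponding weights.
        dirs = []
        for p in theta:
            d = torch.randn_like(p)
            dirs.append(d * p.norm() / (d.norm() + 1e-12))
        return dirs

    d1, d2 = random_direction(), random_direction()
    alphas = torch.linspace(-span, span, steps)
    grid = torch.zeros(steps, steps)
    for i, a in enumerate(alphas):
        for j, b in enumerate(alphas):
            for p, t, u, v in zip(model.parameters(), theta, d1, d2):
                p.copy_(t + a * u + b * v)   # perturb weights along the 2D slice
            grid[i, j] = float(loss_fn(model))
    for p, t in zip(model.parameters(), theta):  # restore the trained weights
        p.copy_(t)
    return grid
```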
5.2 Electromagnetic Propagation Application:

For this application, the size of Â is 401 × 401, making it a daunting task for an eigensolver in terms of computation time. As a result, a grid-search hyper-parameter tuning of ANN models is prohibitively expensive. This is due to the large number of epochs needed to optimize a model for a problem of this scale. Nonetheless, we were still able to optimize a model to do fairly well by manually adjusting the hyper-parameters and architecture of CoPhy-PGNN to yield acceptable results on the validation set. We emphasize, however, that a more exhaustive tuning could lead to better results that surpass the ones we obtained. Figure 3 shows that CoPhy-PGNN is still able to better extrapolate than a Black-box NN on testing scenarios with permittivity greater than 1. In fact, we have observed that as Black-box NN solely optimizes Train-MSE, its cosine similarity measure deteriorates on the test set. This is in contrast to CoPhy-PGNN's ability to maintain a cosine similarity close to 1 even though its validation loss is comparable to Black-box NN's.

Figure 3: Cosine similarity of CoPhy-PGNN compared to Black-box NN for the electromagnetic propagation application, as a function of the permittivity of the first layer. The dashed line represents the boundary between the interval used for training (left) and testing (right).

While training our model still takes a significant amount of time (about 12 hours), its effectiveness with respect to testing speed is demonstrated in Table 2. We can see that our approach is at least an order of magnitude faster during testing than any numerical eigensolver. This highlights the promise in using neural networks to solve physics-based eigenvalue problems, since, once trained, they can be used to produce eigen-solutions on test points much faster than numerical methods. Further, while CoPhy-PGNN shows higher error than numerical solvers, note that the cosine similarity of our model's predictions with the ground-truth is close to 0.8, thus admitting physical usability.

Solver | average time (seconds) | average |Ây − by|
CoPhy-PGNN | 0.0430 | 1.878 × 10²
numpy.linalg.eig | 93.743 | 7.714 × 10⁻⁶
Matlab | 0.196 | 8.747 × 10⁻¹²
torch.eig | 16.565 | 6.821 × 10⁻¹³
scipy.linalg.eig | 106.223 | 7.538 × 10⁻⁴
scipy.sparse.linalg.eigs | 8.893 | 4.418 × 10⁻³

Table 2: Comparison of speed and accuracy between CoPhy-PGNN and other numerical eigensolvers. Note that Matlab calculates the eigenvalue of interest (i.e., the largest), while the other eigensolvers, except for our proposed method, calculate all the eigenvalues of the given matrix. This explains why Matlab has relatively faster execution time.
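The timing numbers in Table 2 come from the authors' setup; the fragment below merely illustrates how a dense-eigensolver baseline of the quoted size could be timed, using a random complex matrix as a stand-in for an actual Â.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((401, 401)) + 1j * rng.standard_normal((401, 401))

t0 = time.perf_counter()
eigvals, eigvecs = np.linalg.eig(A)   # full dense eigendecomposition
t_eig = time.perf_counter() - t0
print(f"numpy.linalg.eig on a 401x401 complex matrix: {t_eig:.3f} s")

# A trained CoPhy-PGNN would replace this step with a single forward pass, e.g.:
#   t0 = time.perf_counter(); y_hat, b_hat = model(features); t_nn = time.perf_counter() - t0
```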
6 Conclusions and future work

This work proposed novel strategies to address the problem of competing physics loss functions in PGNN. For the general problem of solving eigenvalue equations, we designed a PGNN model, CoPhy-PGNN, and demonstrated its efficacy in two target applications in quantum mechanics and electromagnetic propagation. From our results, we found that: 1) PG loss helps to extrapolate and gives the model better generalizability; and 2) using labeled data along with PG loss results in more stable PGNN models. Moreover, we visualized the loss landscape to give a better understanding of how the combination of both labeled data loss and PG loss leads to better generalization performance. We have also demonstrated the generalizability of our CoPhy-PGNN to multiple application domains with varying types of physics loss functions, as well as its scalability to large systems. Future work can focus on reducing the training time of our model so as to perform extensive hyper-parameter tuning to reach a better global minimum. Finally, while this work empirically demonstrated the value of CoPhy-PGNN in combating competing PG loss terms, future work can focus on theoretical analyses of our approach.

References

Appenzeller, T. 2017. The scientists' apprentice. Science 357(6346): 16–17.

Bernardi, M. D. 2019. loss-landscapes. URL https://github.com/marcellodebernardi/loss-landscapes/.

Caruana, R. 1993. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In Proceedings of the Tenth International Conference on International Conference on Machine Learning, ICML'93, 41–48. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN 1558603077.

Daw, A.; Thomas, R. Q.; Carey, C. C.; Read, J. S.; Appling, A. P.; and Karpatne, A. 2020. Physics-Guided Architecture (PGA) of Neural Networks for Quantifying Uncertainty in Lake Temperature Modeling. In Proceedings of the 2020 SIAM International Conference on Data Mining, 532–540. SIAM.

de Bezenac, E.; Pajot, A.; and Gallinari, P. 2019. Deep learning for physical processes: Incorporating prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment 2019(12): 124009.

Graham-Rowe, D.; Goldston, D.; Doctorow, C.; Waldrop, M.; Lynch, C.; Frankel, F.; Reid, R.; Nelson, S.; Howe, D.; and Rhee, S. 2008. Big data: science in the petabyte era. Nature 455(7209): 8–9.

Jia, X.; Willard, J.; Karpatne, A.; Read, J.; Zwart, J.; Steinbach, M.; and Kumar, V. 2019. Physics Guided RNNs for Modeling Dynamical Systems: A Case Study in Simulating Lake Temperature Profiles. In Proceedings of the 2019 SIAM International Conference on Data Mining, 558–566. SIAM.

Kang, Z.; Grauman, K.; and Sha, F. 2011. Learning with Whom to Share in Multi-Task Feature Learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, 521–528. Madison, WI, USA: Omnipress. ISBN 9781450306195.

Karpatne, A.; Atluri, G.; Faghmous, J. H.; Steinbach, M.; Banerjee, A.; Ganguly, A.; Shekhar, S.; Samatova, N.; and Kumar, V. 2017a. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering 29(10): 2318–2331.

Karpatne, A.; Atluri, G.; Faghmous, J. H.; Steinbach, M.; Banerjee, A.; Ganguly, A.; Shekhar, S.; Samatova, N.; and Kumar, V. 2017b. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering 29(10): 2318–2331.

Karpatne, A.; Watkins, W.; Read, J.; and Kumar, V. 2017c. Physics-guided Neural Networks (PGNN): An Application in Lake Temperature Modeling. arXiv preprint arXiv:1710.11431.
Li, H.; Xu, Z.; Taylor, G.; Studer, C.; and Goldstein, T. 2018. Visualizing the Loss Landscape of Neural Nets. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 6389–6399. Curran Associates, Inc. URL http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.pdf.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. 2017a. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. arXiv preprint arXiv:1711.10561.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2017b. Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations. arXiv preprint arXiv:1711.10566.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378: 686–707.

Shin, Y.; Darbon, J.; and Karniadakis, G. E. 2020. On the convergence and generalization of physics informed neural networks. arXiv preprint arXiv:2004.01806.

Stewart, R.; and Ermon, S. 2017. Label-free supervision of neural networks with physics and domain knowledge. In AAAI.

Wang, J.-X.; Wu, J.; Ling, J.; Iaccarino, G.; and Xiao, H. 2017. A Comprehensive Physics-Informed Machine Learning Framework for Predictive Turbulence Modeling. arXiv preprint arXiv:1701.07102.

Wang, J.-X.; Wu, J.-L.; and Xiao, H. 2016. Physics-Informed Machine Learning for Predictive Turbulence Modeling: Using Data to Improve RANS Modeled Reynolds Stresses. arXiv preprint arXiv:1606.07987.

Wang, J.-X.; Wu, J.-L.; and Xiao, H. 2017. Physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data. Physical Review Fluids 2(3): 034603.

Wang, S.; Teng, Y.; and Perdikaris, P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536.

Willard, J.; Jia, X.; Xu, S.; Steinbach, M.; and Kumar, V. 2020. Integrating physics-based modeling with machine learning: A survey. arXiv preprint arXiv:2003.04919.