Graph Networks with Physics-aware Knowledge Informed in Latent Space

Sungyong Seo and Yan Liu
{sungyons, yanliu.cs} at usc.edu
Department of Computer Science, University of Southern California

Abstract

While physics conveys knowledge of nature built from an interplay between observations and theory, it has been considered less important for modeling deep neural networks. Despite the usefulness of physical rules, it is particularly challenging to leverage the knowledge for sparse data since most physics equations are well defined on continuous and dense spaces. In addition, it is even harder to inform the equations into a model if the observations are not fully governed by the given physical knowledge. In this work, we present a novel architecture to incorporate physics or domain knowledge given as a form of partial differential equations (PDEs) on sparse observations by utilizing graph structure. Moreover, we leverage the representation power of deep learning by informing the knowledge in latent space. We demonstrate that climate prediction tasks are significantly improved and validate the effectiveness and importance of the proposed model.

Figure 1: Concept of the proposed PaGN. Many sensor-based observations are only sparsely available (see circled regions), but there are continuous physical processes (e.g., diffusion) behind the sparse observations. Some of the known physics rules are injected into a model, and the remaining unknown dynamics are extracted from data.

Introduction

Modeling natural phenomena in the real world, such as climate, traffic, and molecules, is extremely challenging but important. Deep learning has achieved significant successes in prediction performance by learning latent representations from data-rich applications such as speech recognition (Hinton et al. 2012), text understanding (Wu et al. 2016), and image recognition (Krizhevsky, Sutskever, and Hinton 2012). While the accuracy and efficiency of data-driven deep learning models can be improved with ad-hoc architectural changes for specific tasks, we are confronted with many challenging learning scenarios in modeling natural phenomena, where a limited number of labeled examples are available, there is much noise in the data, and there could be constant changes in data distributions (e.g., dynamic systems). Furthermore, in many domains, data are only available on scattered collections of points (sensors or point clouds; see Figure 1), where the majority of existing methods are not applicable. These challenges are not easily addressed by purely data-driven learning models, and therefore there is a pressing need to develop a new generation of robust learning models that can address these challenging learning scenarios.

Physics is one of the fundamental pillars describing how the real world behaves, and physics-informed learning models are promising solutions for modeling natural phenomena. Incorporating domain knowledge has several benefits: first, it helps an optimized solution to be more stable and prevents overfitting; second, it provides theoretical guidance that an optimized model is supposed to follow and thus helps training with fewer data; lastly, since a model is driven by the desired inductive bias, it is more robust to unseen data, which makes accurate extrapolation easier.

In the meanwhile, there exist a series of challenges when we incorporate physics principles into machine learning models. First, a model needs to properly handle spatial and temporal constraints. Many physics equations describe how a set of physical quantities behaves over space and time. For example, the wave equation describes how a signal is propagated through a medium over time.
Second, the model should capture relations between objects, such as image patches (Santoro et al. 2017) or rigid bodies (Battaglia et al. 2016; Chang et al. 2017). Third, the learning modules should be shared over all objects because physical laws are commonly applicable to all objects. Finally, the model should be flexible enough to extract unknown patterns instead of being strictly constrained to the physics knowledge. Since it is not always possible to describe all rules governing real-world data, data-driven learning is required to fill the gap between the known physics and real observations.

In this paper, we address the problem of modeling dynamical systems based on graph neural networks by incorporating useful knowledge described as differentiable physics equations. We propose a generic architecture, physics-aware graph networks (PaGN), which can leverage explicitly required physics and learn implicit patterns from data, as illustrated in Figure 1. The proposed model properly handles spatially distributed objects and their relations as vertices and edges in a graph. Moreover, temporal dependencies are learned by recurrent computations. As Battaglia et al. (2018) suggest, the inductive bias of a graph-based model is its invariance to node/edge permutations, and thus all trainable functions for the same input types are shared.

Our contributions in this work are summarized as follows:
• We develop a novel physics-aware learning architecture, PaGN, which incorporates differentiable physics equations within a graph network framework.
• We explore the performance of PaGN on graph signal prediction tasks to demonstrate that physics knowledge provides a significant improvement in prediction tasks and makes a model more robust.
• We investigate the effectiveness and the importance of PaGN on climate prediction to show how physics knowledge can be beneficial for prediction performance.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Related Work

Discovering physical dynamics. A class of models (Grzeszczuk, Terzopoulos, and Hinton 1998; Battaglia et al. 2016; Chang et al. 2017; Watters et al. 2017; Sanchez-Gonzalez et al. 2018; Kipf et al. 2018) has been proposed based on the assumption that neural networks can learn complex physical interactions and simulate unseen dynamics based on a current state. The models along this direction are based on common relational inductive biases (Santoro et al. 2017; Battaglia et al. 2018), i.e., functions connecting entities and relations are shared and can be learned from a given sequence of simulated dynamics. Chang et al. (2017), Battaglia et al. (2016), and Sanchez-Gonzalez et al. (2018) commonly assumed that the objects' behaviors were governed by classical kinetic physics equations. Then, object- and relation-centric functions were proposed to learn the transition from the current state to the next state without explicitly injecting the equations into the model. Discovering latent physics by data-driven learning has also been actively studied (Long et al. 2018; Brunton, Proctor, and Kutz 2016). While properly constrained filters enable us to identify the governing PDEs, this is only applicable when we are aware of the form of the target PDEs. Unlike this line of works, which extracts latent patterns from data only, our proposed model can incorporate known physics and, at the same time, extract latent patterns from data which cannot be captured by existing knowledge.
Incorporating physics. Among many attempts to incorporate physical knowledge into data-driven models, Cressie and Wikle (2015) covered a number of statistical models (e.g., a hierarchical Bayesian framework) handling physical equations. Raissi, Perdikaris, and Karniadakis (2017a) introduced the concept of physics-informed neural networks, which utilize physics equations explicitly to train neural networks. By optimizing the model at initial/boundary and sampled collocation points, the data-driven solutions of nonlinear PDEs can be found. Based on this fundamental idea, a number of works for simulating and discovering PDEs have been published (Raissi and Karniadakis 2018; Raissi 2018; Raissi, Perdikaris, and Karniadakis 2017b). Although these works leveraged physical knowledge, they are limited because they require all physics behind the given data to be explicitly known.

de Bezenac, Pajot, and Gallinari (2018) considered a similar problem as ours. They proposed how transport physics (advection and diffusion) could be incorporated for forecasting sea surface temperature (SST). In other words, they proposed how the motion flow that is helpful for the temperature flow prediction could be extracted in an unsupervised manner from a sequence of SST images. This work is a major milestone since it captures not only the dominant transport physics but also unknown patterns inferred through the neural networks. Despite its novel architecture, the model is specifically designed for transport physics, and it is not straightforward to extend it to other physics equations. Furthermore, it is restricted to a regular grid in order to use conventional convolutional neural networks (CNNs) for images.
Background

In this section, we introduce how differential operators in the Euclidean domain are analogously defined on the discrete graph domain and briefly show that the graph networks module is able to efficiently express the differential operators.

Calculus on Graphs

Preliminary. Given a graph G = (V, E), where V = {1, . . . , n} is a set of vertices and E ⊆ V² a set of edges, two types of real functions can be defined: on the vertices, f : V → R, and on the edges, F : E → R. It is also possible to define multiple functions on the vertices or edges, analogous to multiple feature maps of a pixel in CNNs. Since f and F can be viewed as scalar and vector fields in differential geometry (Figure 2), the corresponding discrete operators on graphs can be defined as follows (Bronstein et al. 2017).

Figure 2: Scalar/vector fields on Euclidean space and vertex/edge functions on a graph.

Gradient on graphs. The gradient on a graph is the linear operator

∇ : L²(V) → L²(E), (∇f)_ij = f_j − f_i if {i, j} ∈ E and 0 otherwise,

where L²(V) and L²(E) denote Hilbert spaces of vertex and edge functions, respectively; thus f ∈ L²(V) and F ∈ L²(E). As the gradient in Euclidean space measures the rate and direction of change in a scalar field, the gradient on a graph computes differences of the values between two adjacent vertices, and the differences are defined along the directions of the corresponding edges.

Divergence on graphs. The divergence in Euclidean space maps vector fields to scalar fields. Similarly, the divergence on a graph is the linear operator

div : L²(E) → L²(V), (div F)_i = Σ_{j:(i,j)∈E} w_ij F_ij for all i ∈ V,

where w_ij is a weight on the edge (i, j). It denotes a weighted sum of the edge functions incident to a vertex i, which is interpreted as the net flow at the vertex i.

Laplacian on graphs. The Laplacian (∆ = ∇²) in Euclidean space measures the difference between the values of a scalar field and its average on infinitesimal balls. Similarly, the graph Laplacian is defined as

∆ : L²(V) → L²(V), (∆f)_i = Σ_{j:(i,j)∈E} w_ij (f_i − f_j) for all i ∈ V.

The graph Laplacian can be represented in matrix form as L = D − W, where D = diag(Σ_{j:j≠i} w_ij) is a degree matrix and W denotes a weighted adjacency matrix. Note that L = ∆ = −div∇, and the minus sign is required to make L positive semi-definite.

Based on the core differential operators on a graph, we can rewrite differentiable physics equations (e.g., the diffusion equation or the wave equation) on a graph.

Graph Networks

Battaglia et al. (2018) proposed the graph networks framework, which generalizes relations among vertices, edges, and a whole graph. Graph networks (GN) describe how edge, node, and global attributes are updated by propagating information among themselves. Given a set of node (v), edge (e), and global (u) attributes, the steps of computation in a graph networks block are as follows:

1. e′_ij ← φ^e(e_ij, v_i, v_j, u) for all {i, j} ∈ E.
2. v′_i ← φ^v(v_i, ē′_i, u) for all i ∈ V, where ē′_i is an aggregated edge attribute related to the node i.
3. u′ ← φ^u(u, ē′, v̄′), where ē′ and v̄′ are aggregated attributes of all edges and all nodes in a graph, respectively.

Here φ^e, φ^v, and φ^u are the edge, node, and global update functions, respectively, and they can be implemented by learnable neural networks. Note that the computation order is flexible, and the aggregators can be chosen freely as long as they are invariant to permutations of their inputs.

As φ^e is a mapping function from vertices to edges, it can be replaced by the graph gradient operator to describe a known relation explicitly. Similarly, φ^v can learn divergence-like (edge-to-node) mapping functions. For curl-involved functions, it is required to add another updating function, φ^c, which maps node/edge/global attributes to a 3-clique attribute and vice versa. In other words, graph networks provide highly flexible modules which are able to imitate the differential operators on a graph explicitly or implicitly.

Table 1: Examples of static equations in graph networks.

Mapping | Equation | Physics example
node → edge | e_ij = φ^e(v_i, v_j) = (∇v)_ij | ∇φ = −E (Electric field)
edge → node | v_i = φ^v(e_ij) = (div e)_i | ∇·E = ρ/ε₀ (Maxwell's eqn.)
node → node | v_i = φ^v(v_i, {v_j:(i,j)∈E}) = (∆v)_i | ∆φ = 0 (Laplace's eqn.)

Physics-aware Graph Networks

As deep learning models are successful at modeling complex behaviors and extracting abstract features from data, it is natural to focus on how data-driven modeling can solve practical problems in physics or engineering fields. In this section, we describe how domain knowledge from physics can be incorporated within the graph networks framework.

Static Physics

Many fields in physics dealing with static properties, such as electrostatics, magnetostatics, or hydrostatics, describe a number of physics phenomena at rest.
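The three operators above can be checked numerically. The following is a minimal NumPy sketch (the toy graph and function names are illustrative, not from the paper) that implements the graph gradient, divergence, and Laplacian and verifies the identity L = D − W = −div∇ stated in this section.

```python
import numpy as np

# Toy undirected graph on 4 vertices with unit edge weights (illustrative).
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
n = 4
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

def gradient(f, edges):
    """(grad f)_ij = f_j - f_i along each edge {i, j}."""
    return np.array([f[j] - f[i] for i, j in edges])

def divergence(F, edges, n, W):
    """(div F)_i = sum_j w_ij F_ij; the reversed edge carries -F_ij."""
    out = np.zeros(n)
    for k, (i, j) in enumerate(edges):
        out[i] += W[i, j] * F[k]
        out[j] -= W[i, j] * F[k]
    return out

def laplacian(f, W):
    """(Lap f)_i = sum_j w_ij (f_i - f_j), i.e. (D - W) @ f."""
    D = np.diag(W.sum(axis=1))
    return (D - W) @ f

f = np.array([1.0, 2.0, 4.0, 8.0])
# L = D - W agrees with -div(grad f), as noted in the text.
assert np.allclose(laplacian(f, W), -divergence(gradient(f, edges), edges, n, W))
```

Since L = D − W gives the quadratic form f⊤Lf = Σ_{(i,j)∈E} w_ij (f_i − f_j)² ≥ 0, this also makes concrete why the minus sign in L = −div∇ is what renders L positive semi-definite.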
Among the various phenomena, it is easy to express differentiable physics rules in discrete forms on a graph with the operators in the previous section. For instance, the Poisson equation (∇²φ = −ρ/ε₀) in electrostatics is realized as a simple matrix multiplication of the graph Laplacian with a vertex function. Table 1 provides some differential formulas in electrostatics and shows how the updating functions are defined in graph networks.

Dynamic Physics

More practical equations are written in dynamic forms, which describe how a given physical quantity changes in a given region over time. GN can be regarded as a module that updates a graph state including the attributes of nodes, edges, and a whole graph:

G′ = GN(G) (1)

where G′ is the updated graph state. Dynamic physics formulas are written as a function of time and spatial derivatives:

f(∂u/∂t, · · · , ∂^M u/∂t^M, ∂u/∂x, · · · , ∂^N u/∂x^N) = 0 (2)

where u is a spatiotemporally varying physical quantity and x is the direction along which u is defined. M and N denote the highest orders of the time and spatial derivatives, respectively. Under the state-updating view in Equation 1, any type of PDE written as Equation 2 can be represented in a finite-difference form. Table 2 provides examples of dynamic physics; u̇ and ü are the first and second order time derivatives, respectively.

Table 2: Examples of dynamic equations in graph networks.

Equation | Physics example
v′_i = v_i + α φ^v(v_i, {v_j:(i,j)∈E}) = v_i + α(∆v)_i | u̇ = α∆u (Diffusion eqn.)
v″_i = 2v′_i − v_i + c² φ^v(v′_i, {v′_j:(i,j)∈E}) = 2v′_i − v_i + c²(∆v′)_i | ü = c²∆u (Wave eqn.)
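As a concrete illustration of the finite-difference updates in Table 2, the sketch below steps the diffusion and wave equations on a small graph via the Laplacian L = D − W. The graph, coefficients, and function names are illustrative; following the sign note in the Background, we take ∆ = −(D − W) so that the diffusion step smooths the signal.

```python
import numpy as np

def graph_laplacian(W):
    """Graph Laplacian L = D - W (positive semi-definite)."""
    return np.diag(W.sum(axis=1)) - W

def diffusion_step(v, W, alpha):
    """One explicit step of du/dt = alpha * Delta u with Delta = -(D - W),
    so v' = v + alpha * Delta v drives values toward their neighbors."""
    return v - alpha * graph_laplacian(W) @ v

def wave_step(v_curr, v_prev, W, c):
    """One step of u'' = c^2 * Delta u via central differences:
    v'' = 2 v' - v + c^2 * Delta v'."""
    return 2.0 * v_curr - v_prev + c**2 * (-graph_laplacian(W) @ v_curr)

# Diffusion on a 3-node path graph drives the signal toward its mean.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
v = np.array([1.0, 0.0, 1.0])
for _ in range(50):
    v = diffusion_step(v, W, alpha=0.1)
# v is now numerically the constant mean vector [2/3, 2/3, 2/3]
```

The explicit step is stable only for a small enough α (here 0.1 relative to the largest Laplacian eigenvalue), which mirrors the usual CFL-type condition for explicit finite-difference schemes.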
Physics in Latent Space

The previous section shows how the differential operators can be implemented in a GN module. However, modeling complicated real-world problems with the differential operators alone is hardly practical, because it is only possible when all physics equations governing the observed phenomena are explicitly known. For example, although we are aware that a number of physics equations are involved in climate observations, it is almost infeasible to include all the equations required for modeling the observations. Thus, it is necessary to utilize the learnable parameters in GN to fill in the missing dynamics which are not described by the given equations.

There is another advantage to utilizing learnable parameters. A number of unknown parameters need to be pre-defined to specify the physics equations, and these parameters can be inferred through the learnable parameters. For example, while we may know that an input signal has a wave property, the speed of the waves (c in Table 2) must be given to fully describe the wave equation. It is even worse when multiple input signals are involved, since each signal is governed by different parameters in the same kind of equation. While both temperature and surface pressure are continuous and diffusive, they have different diffusion coefficients (α in Table 2) in the same diffusion equation. To address this issue, we can transform the input signals to a latent space and impose one equation in the latent space instead of imposing multiple equations on the input signals separately. Then, the parameters in the encoder make the different signals follow the equation differently. We formalize how this idea is implemented as follows.

Forward/Recurrent computation. Figure 3 shows how the desired physics knowledge is integrated with the graph networks. Given a graph G = {v, e, u}, it is fed into an encoder which transforms the sets of node (v), edge (e), and global (u) attributes into latent spaces:

ṽ, ẽ, ũ = Encoder(v, e, u) (3)

After the encoder, the encoded graph H = {ṽ, ẽ, ũ} is repeatedly updated within the core block as many times as the required number of time steps T. For each step, H is updated to H′, which denotes the next state of the encoded graph:

H′ = GN(H) (4)

Finally, the sequentially updated attributes are re-transformed to the original spaces by a decoder:

v′, e′, u′ = Decoder(ṽ′, ẽ′, ũ′) (5)

There are two types of objective function in this architecture: a physics-knowledge objective and a supervised objective. First, we define the physics-informed constraint, which takes the form of the equations in Tables 1 and 2 (or a mixture of them), depending on the given physics knowledge:

f^s_phy(H′_t), f^d_phy(H′_t, · · · , H′_{t+M}) (6)

L_phy = Σ_t [ f^s_phy(H′_t) + f^d_phy(H′_t, · · · , H′_{t+M}) ] (7)

where f^s_phy(H′_t) and f^d_phy(H′_t, · · · , H′_{t+M}) are the static and dynamic physics-informed quantities, respectively. For example, we can impose a gradient constraint or the diffusion equation on the node/edge latent representations as follows:

f^s_phy(H′_t) = ‖ẽ′_t − ∇ṽ′_t‖²
f^d_phy(H′_t, H′_{t+1}) = ‖ṽ′_{t+1} − ṽ′_t − α∇²ṽ′_t‖²

Second, the supervised loss function is defined between the predicted graph Ĝ′ and the target graph G′. This loss is constructed based on the task, e.g., the cross-entropy or the mean squared error (MSE). Finally, the total objective function is a sum of the two terms:

L = L_sup + λL_phy (8)

where λ controls the importance of the physics term.

Figure 3: Recurrent architecture to incorporate physics equations on GN. The blue blocks have learnable parameters and the orange blocks are objective functions. ⊙ is a concatenation operator, and the middle core block can be repeated as many times as the required number of time steps (T).

Experiment

In this section, we evaluate PaGN on a real-world climate dataset from the Southern California region.

Climate Data
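To make the recurrent computation and the two objectives concrete, here is a minimal forward-pass sketch evaluating L = L_sup + λL_phy with the latent diffusion constraint. All shapes, the linear encoder/decoder, and the one-line GN update are hypothetical stand-ins for the paper's learnable modules, not its actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_lat = 5, 10, 8                      # nodes, input features, latent size

# Toy stand-ins for the learnable modules (hypothetical shapes/parameters).
W_enc = rng.normal(size=(d_in, d_lat)) * 0.1   # node encoder
W_gn = rng.normal(size=(d_lat, d_lat)) * 0.1   # GN node-update stand-in
W_dec = rng.normal(size=(d_lat, d_in)) * 0.1   # node decoder

A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency
L_graph = np.diag(A.sum(1)) - A                # graph Laplacian D - W

def gn_step(h):
    """One core-block update H' = GN(H): aggregate neighbors, then transform."""
    return np.tanh((h + A @ h) @ W_gn)

def pagn_loss(v, target, T=3, alpha=0.1, lam=0.5):
    h = np.tanh(v @ W_enc)                     # encoder: latent node states
    phys = 0.0
    for _ in range(T):
        h_next = gn_step(h)
        # Latent diffusion constraint || h' - h - alpha * Delta h ||^2,
        # with Delta = -(D - W) following the sign note in the Background.
        phys += np.sum((h_next - h + alpha * (L_graph @ h)) ** 2)
        h = h_next
    pred = h @ W_dec                           # decoder: back to observation space
    sup = np.mean((pred - target) ** 2)        # supervised MSE term
    return sup + lam * phys                    # Eq. (8): L = L_sup + lambda * L_phy

v = rng.normal(size=(n, d_in))
y = rng.normal(size=(n, d_in))
loss = pagn_loss(v, y)
assert np.isfinite(loss) and loss >= 0.0
```

In a real implementation the loss would of course be minimized with an autodiff framework; the sketch only shows where the supervised and physics terms enter the forward computation.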
For the evaluation on real-world data, we used hourly simulated climate observations for 16 days over the Southern California region (Zhang et al. 2018). In this dataset, we sampled small regions randomly from two areas (Los Angeles and San Diego, Figure 4) encompassing urban and rural meteorological features to generate spatially discrete observations. To build a graph, we connected pairs of the sampled regions by using the k-nearest neighbors algorithm (k = 3). This data preprocessing is required to verify the proposed idea as well as to evaluate PaGN in the spatiotemporally sparse setting, which is more common for sensor-based datasets.

The vertex attributes consist of 10 climate observations: air temperature, albedo, precipitation, soil moisture, relative humidity, specific humidity, surface pressure, planetary boundary layer height, and wind vector (2 directions). While the edge attributes are not given explicitly, we could specify the type of each edge by using the types of the connected regions. There are 13 different land-usage types, and each type summarizes how the corresponding land is used. Based on the types of the connected regions, we assigned different embedding vectors to the edges.

Figure 4: Sampled regions in the Southern California area. (Left) Los Angeles (274 nodes) and (Right) San Diego (282 nodes).

PaGN Architecture

As explained above, PaGN consists of three modules: a graph encoder, a GN block, and a graph decoder (Figure 3). The encoder contains two feed-forward networks, φ^v and φ^e, applied to the node and edge features, respectively. By passing through the encoder, the features are transformed to the latent space (H) where we impose the physics equations.

In the GN block, the node/edge/graph features are updated by the GN algorithm described above. The latent graph states H and H′ indicate the hidden states of the current and next observations. For the physics constraint, we informed the diffusion and wave equations in Table 2, which describe the behavior of continuous physical quantities. As most of the climate observations vary continuously, the diffusion equation, as a part of the continuity equation, is one of the inductive biases that should be considered for modeling. In addition, the wave equation is useful to describe atmospheric phenomena, especially 1-solar-day harmonics (e.g., atmospheric tides). Note that the physics equations are not directly applied to the input observations, but rather to the latent representations. The state-updating process is repeated at least as many times as the order of the equations to provide the finite-difference equation. For multistep predictions, the recurrent module is repeated as many times as the number of predictions, and the physics equation is applied multiple times as well. Finally, the decoder takes H′ as input to return the next predictions. The following is the total loss function of PaGN with the diffusion equation:

L = Σ_{i=1}^T ‖ŷ′_i − y′_i‖² + λ Σ_{i=1}^T ‖ṽ′_i − ṽ′_{i−1} − α∇²ṽ′_{i−1}‖² (9)

where y′ is a vector of the target observations (i.e., node vectors) and α adjusts the diffusivity of the latent representations; it is found through cross-validation. Note that the equation term can be replaced by other equations as appropriate.
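The graph construction described in Climate Data can be sketched as follows. The station coordinates here are synthetic, and since the paper does not state the distance metric, Euclidean distance is assumed.

```python
import numpy as np

def knn_edges(coords, k=3):
    """Connect each sampled region to its k nearest neighbors; the union of
    the directed k-NN pairs is kept as an undirected edge set."""
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # exclude self-loops
    edges = set()
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)

rng = np.random.default_rng(7)
coords = rng.uniform(size=(20, 2))          # synthetic station locations
edges = knn_edges(coords, k=3)
# Every node initiates k = 3 links, so each node ends with degree >= 3 and
# 30 <= |E| <= 60 once mutual (duplicate) pairs are merged.
```

Edge-type embeddings would then be assigned per edge by looking up the land-usage types of its two endpoint regions, as described above.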
Experimental Settings

In our experiments, we used the air temperature as the target observation, and the other 9 observations were used as input. We first evaluated our model by performing the one-step and multistep prediction tasks on the two different areas with a mean squared error metric. For both regions, we trained the model with input observations for 10 timesteps (t − 10 : t − 1) and predicted targets from t − 9 to t. The first 65% of the total length was used as a training set, and the remaining series was split into validation (10%) and test (25%) sets. We explored several baselines: MLP, LSTM, and GN-only, which ignores the physics constraint in PaGN. We also compared GN-skip, which connects H and H′ with a skip-connection (He et al. 2016) without the physics constraint.

One step Prediction

Table 3 shows the prediction error of the baselines and PaGN on the different areas. MLP and LSTM are shared over all stations, and they are outperformed by the models leveraging the given graph structure. This implies that knowing neighboring information is significantly helpful for inferring a node's own state, which is intuitive since climate behaviors are spatiotemporally continuous. Among the graph-based models, PaGN (diff) provides the lowest MSEs. This validates that the diffusive property provides a strong inductive bias for the latent representation learning. Note that the standard deviations of PaGN (diff) are significantly smaller than those of the other baselines, which implies that the integrated physics knowledge properly stabilizes the optimization process by introducing an additional objective.

Table 3: One step prediction error (MSE).

Model | LA area | SD area
MLP | 0.8140±0.0651 | 0.7735±0.0539
LSTM | 0.7855±0.0644 | 0.8123±0.0875
GN-only | 0.5951±0.0517 | 0.6947±0.1859
GN-skip | 0.5906±0.0620 | 0.6456±0.1499
PaGN (wave) | 0.5366±0.0631 | 0.6413±0.1549
PaGN (diff) | 0.5289±0.0405 | 0.5746±0.0471

Multistep Prediction

To evaluate the effectiveness of the state-wise regularization more carefully, we conducted the multistep prediction task (10-step forecast horizon). For this task, the recurrent modules are modified to predict the input observations as well, and the predictions are re-fed into the model for future timesteps. While the models with a recurrent module are able to predict a few more steps reasonably, there are a couple of things to which we should pay attention.

Table 4: Multistep prediction error (MSE).

Model | LA area | SD area
LSTM | 1.9022±0.2078 | 1.2489±0.2295
GN-only | 1.6137±0.1128 | 1.5532±0.2023
GN-skip | 1.5429±0.0932 | 1.4423±0.1622
PaGN (diff) | 1.4656±0.0474 | 1.0999±0.0435
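The chronological split and the autoregressive multistep rollout described above can be sketched as follows. The persistence "model" and the function names are placeholders for illustration, not the trained PaGN.

```python
import numpy as np

def chronological_split(series, train=0.65, val=0.10):
    """First 65% for training, next 10% for validation, remaining 25% for test."""
    t = len(series)
    i, j = int(t * train), int(t * (train + val))
    return series[:i], series[i:j], series[j:]

def rollout(step_fn, window, horizon=10):
    """Autoregressive multistep prediction: each predicted state is re-fed
    into the model to produce the next one, as in the multistep task."""
    preds, state = [], list(window)
    for _ in range(horizon):
        nxt = step_fn(np.asarray(state[-10:]))  # model sees the last 10 steps
        preds.append(nxt)
        state.append(nxt)                       # re-feed the prediction
    return np.array(preds)

series = np.arange(384.0)                       # 16 days of hourly observations
tr, va, te = chronological_split(series)
assert (len(tr), len(va), len(te)) == (249, 39, 96)

# A persistence 'model' that repeats the last observation, for illustration.
preds = rollout(lambda w: w[-1], window=tr[:10], horizon=10)
assert np.allclose(preds, tr[9])
```

With a learned step function in place of the persistence lambda, errors compound over the horizon, which is exactly where the slowly varying latent states encouraged by the diffusion constraint help.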
First, the results imply that utilizing the neighboring information is important, because the GN-only model shows similar or better MSEs compared to LSTM for the multistep tasks, even though it has a simple recurrent module that is not as good as that of LSTM. Second, we found that the diffusion equation in PaGN gives stable state transitions, and this property provides slowly varying latent states, which are desired particularly for climate forecasting. Note that the skip-connection in GN-skip is also able to restrict rapid changes of H. However, it is necessary to optimize the parameters in GN-skip more carefully to learn the residual term in H′ = H + GN(H) properly.

Effectiveness of Physics Constraint

One of the benefits of physics-aware learning is data efficiency. We explore how much the physics constraint helps by testing whether PaGN can be well trained when the amount of data for the supervised objective is limited, on the one-step prediction task. We randomly sampled training data which were used to optimize the total loss function (Equation 9), and the remaining unsampled data were used only to minimize the physics constraint:

L = L^i_sup + λL^i_phy, if i is a sampled step
L = λL^i_phy, otherwise

We found that the diffusion equation can benefit the optimization of PaGN even if the target observations are only partially available (Figure 5a). Although the overall performance of PaGN degrades when fewer sampled data are used, the errors do not deviate far from those of GN-only. The GN-only model is even outperformed by PaGN when only 70% of the training data are used with the state-wise constraint.

Importance of Physics Constraint

To study the importance of the physics term, we trained PaGN with different values of λ controlling the weight of the physics term. While we found that the physics term is substantially helpful (Tables 3 and 4), the term is not supposed to be dominant (see Figure 5b) but should be tuned properly. This is intuitive since the term only provides partial knowledge (diffusive input signals), which changes the loss surface to help the parameters predict the next signals more stably, instead of governing the dynamics explicitly. Scaling down the physics term is similar to what Sabour, Frosst, and Hinton (2017) did so that the reconstruction error does not dominate the margin loss but helps the optimization process.

We also present MSEs from PaGN (rand), defined by randomly sampling (α, β) ∈ [−2.5, 2.5] in the constraint ‖v″ + αv′ + βv − c∆v‖², and PaGN (diff+wave), which superposes the two equations. Table 5 shows that the random equation significantly degrades the overall prediction quality. Note that the simple superposition of two equations does not always guarantee lower error, even if each equation is helpful separately. When the two equations are non-linearly connected in the unknown (fully) governing equation, the superposition cannot provide a meaningful inductive bias. The results demonstrate that the physics term is a useful inductive bias when it is properly defined.

Table 5: One step prediction MSE with different constraints.

Model | LA area | SD area
PaGN (rand) | 1.1406 | 0.7073
PaGN (diff+wave) | 0.5624 | 0.6724

Figure 5: (a) MSEs of PaGN are almost as good as those of GN-only (gray lines) despite the smaller number of training data. (b) Dependence of the prediction performance on the weight of the physics term.

Conclusion

In this work, we introduce a new architecture, PaGN, based on graph networks to incorporate prior knowledge given as a form of PDEs over time and space. While existing works focus more on discovering equations in data generated by explicit physics rules, we propose a method to leverage a weakly given inductive bias describing the data. We empirically analyze the performance of PaGN across a range of prediction experiments on climate observations.

References

Battaglia, P.; Pascanu, R.; Lai, M.; Rezende, D. J.; et al. 2016. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, 4502–4510.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; and Vandergheynst, P. 2017. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4): 18–42.

Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2016. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113(15): 3932–3937.

Chang, M. B.; Ullman, T.; Torralba, A.; and Tenenbaum, J. B. 2017. A Compositional Object-Based Approach to Learning Physical Dynamics. International Conference on Learning Representations.

Cressie, N.; and Wikle, C. K. 2015. Statistics for Spatio-Temporal Data. John Wiley & Sons.

de Bezenac, E.; Pajot, A.; and Gallinari, P. 2018. Deep Learning for Physical Processes: Incorporating Prior Scientific Knowledge. In International Conference on Learning Representations.

Grzeszczuk, R.; Terzopoulos, D.; and Hinton, G. 1998. NeuroAnimator: Fast neural network emulation and control of physics-based models. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, 9–20. ACM.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6): 82–97.

Kipf, T.; Fetaya, E.; Wang, K.-C.; Welling, M.; and Zemel, R. 2018. Neural Relational Inference for Interacting Systems. International Conference on Machine Learning.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

Long, Z.; Lu, Y.; Ma, X.; and Dong, B. 2018. PDE-Net: Learning PDEs from Data. In Proceedings of the 35th International Conference on Machine Learning. URL http://proceedings.mlr.press/v80/long18a.html.

Raissi, M. 2018. Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Differential Equations. arXiv preprint arXiv:1801.06637.

Raissi, M.; and Karniadakis, G. E. 2018. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics 357: 125–141.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2017a. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. arXiv preprint arXiv:1711.10561.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2017b. Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations. arXiv preprint arXiv:1711.10566.

Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, 3856–3866.

Sanchez-Gonzalez, A.; Heess, N.; Springenberg, J. T.; Merel, J.; Riedmiller, M.; Hadsell, R.; and Battaglia, P. 2018. Graph Networks as Learnable Physics Engines for Inference and Control. In Proceedings of the 35th International Conference on Machine Learning.

Santoro, A.; Raposo, D.; Barrett, D. G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lillicrap, T. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, 4967–4976.

Watters, N.; Tacchetti, A.; Weber, T.; Pascanu, R.; Battaglia, P.; and Zoran, D. 2017. Visual interaction networks. In Advances in Neural Information Processing Systems.

Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhang, J.; Mohegh, A.; Li, Y.; Levinson, R.; and Ban-Weiss, G. 2018. Systematic Comparison of the Influence of Cool Wall versus Cool Roof Adoption on Urban Climate in the Los Angeles Basin. Environmental Science & Technology 52(19): 11188–11197.