Graph Networks with Physics-aware Knowledge Informed in Latent Space

Sungyong Seo and Yan Liu
{sungyons, yanliu.cs} at usc.edu
Department of Computer Science, University of Southern California

Abstract

While physics conveys knowledge of nature built from an interplay between observations and theory, it has been considered less important for modeling deep neural networks. Despite the usefulness of physical rules, it is particularly challenging to leverage the knowledge for sparse data since most physics equations are well defined on continuous and dense spaces. In addition, it is even harder to inform the equations into a model if the observations are not fully governed by the given physical knowledge. In this work, we present a novel architecture to incorporate physics or domain knowledge given as a form of partial differential equations (PDEs) on sparse observations by utilizing graph structure. Moreover, we leverage the representation power of deep learning by informing the knowledge in latent space. We demonstrate that climate prediction tasks are significantly improved and validate the effectiveness and importance of the proposed model.

Figure 1: Concept of the proposed PaGN. Many sensor-based observations are only sparsely available (see circled regions), but there are continuous physical processes (e.g., diffusion) behind the sparse observations. Some of the known physics rules are injected into a model, and the remaining unknown dynamics are extracted from data.

Introduction

Modeling natural phenomena in the real world, such as climate, traffic, and molecules, is extremely challenging but important. Deep learning has achieved significant successes in prediction performance by learning latent representations from data-rich applications such as speech recognition (Hinton et al. 2012), text understanding (Wu et al. 2016), and image recognition (Krizhevsky, Sutskever, and Hinton 2012). While the accuracy and efficiency of data-driven deep learning models can be improved with ad-hoc architectural changes for specific tasks, we are confronted with many challenging learning scenarios in modeling natural phenomena, where a limited number of labeled examples are available, there is much noise in the data, and there could be constant changes in data distributions (e.g., dynamic systems). Furthermore, in many domains, data are only available on scattered collections of points (sensors or point clouds; see Figure 1), where the majority of existing methods are not applicable. These challenges are not easily addressed by purely data-driven learning models, and therefore there is a pressing need to develop a new generation of robust learning models that can address these challenging learning scenarios.

Physics is one of the fundamental pillars describing how the real world behaves, and physics-informed learning models are promising solutions for modeling natural phenomena. Incorporating domain knowledge has several benefits: first, it helps an optimized solution to be more stable and prevents overfitting; second, it provides theoretical guidance that an optimized model is supposed to follow and thus helps training with fewer data; lastly, since a model is driven by the desired inductive bias, it is more robust to unseen data, which makes accurate extrapolation easier.

In the meanwhile, there exist a series of challenges when we incorporate physics principles into machine learning models. First, a model needs to properly handle spatial and temporal constraints. Many physics equations describe how a set of physical quantities behaves over space and time. For example, the wave equation describes how a signal is propagated through a medium over time.
Second, the model should capture relations between objects, such as image patches (Santoro et al. 2017) or rigid bodies (Battaglia et al. 2016; Chang et al. 2017). Third, the learning modules should be shared over all objects because physical laws are commonly applicable to all objects. Finally, the model should be flexible enough to extract unknown patterns instead of being strictly constrained to the physics knowledge. Since it is not always possible to describe all rules governing real-world data, data-driven learning is required to fill the gap between the known physics and real observations.

In this paper, we address the problem of modeling dynamical systems based on graph neural networks by incorporating useful knowledge described as differentiable physics equations. We propose a generic architecture, physics-aware graph networks (PaGN), which can leverage explicitly required physics and learn implicit patterns from data, as illustrated in Figure 1. The proposed model properly handles spatially distributed objects and their relations as vertices and edges in a graph. Moreover, temporal dependencies are learned by recurrent computations. As Battaglia et al. (2018) suggest, the inductive bias of a graph-based model is its invariance to node/edge permutations, and thus all trainable functions for the same input types are shared.

Our contributions in this work are summarized as follows:
• We develop a novel physics-aware learning architecture, PaGN, which incorporates differentiable physics equations within a graph network framework.
• We explore the performance of PaGN on graph signal prediction tasks to demonstrate that physics knowledge provides a significant improvement in prediction tasks and makes a model more robust.
• We investigate the effectiveness and the importance of PaGN on climate prediction to show how physics knowledge can be beneficial for prediction performance.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Related Work

Discovering physical dynamics. A class of models (Grzeszczuk, Terzopoulos, and Hinton 1998; Battaglia et al. 2016; Chang et al. 2017; Watters et al. 2017; Sanchez-Gonzalez et al. 2018; Kipf et al. 2018) has been proposed based on the assumption that neural networks can learn complex physical interactions and simulate unseen dynamics based on a current state. The models along this direction are based on common relational inductive biases (Santoro et al. 2017; Battaglia et al. 2018), i.e., functions connecting entities and relations are shared and can be learned from a given sequence of simulated dynamics. Chang et al. (2017), Battaglia et al. (2016), and Sanchez-Gonzalez et al. (2018) commonly assumed that the objects' behaviors were governed by classical kinetic physics equations. Then, object- and relation-centric functions were proposed to learn the transition from the current state to the next state without explicitly injecting the equations into the model. Discovering latent physics by data-driven learning has also been actively studied (Long et al. 2018; Brunton, Proctor, and Kutz 2016). While properly constrained filters enable us to identify the governing PDEs, this is only applicable when we are aware of the form of the target PDEs. Unlike this line of works, which extracts latent patterns from data only, our proposed model can incorporate known physics and, at the same time, extract latent patterns from data which cannot be captured by existing knowledge.
Incorporating physics. Among many attempts to incorporate physical knowledge into data-driven models, Cressie and Wikle (2015) covered a number of statistical models (e.g., a hierarchical Bayesian framework) handling physical equations. Raissi, Perdikaris, and Karniadakis (2017a) introduced the concept of physics-informed neural networks, which utilize physics equations explicitly to train neural networks. By optimizing the model at initial/boundary and sampled collocation points, the data-driven solutions of nonlinear PDEs can be found. Based on this fundamental idea, a number of works for simulating and discovering PDEs have been published (Raissi and Karniadakis 2018; Raissi 2018; Raissi, Perdikaris, and Karniadakis 2017b). Although these works leveraged physical knowledge, they are limited because they require all physics behind the given data to be explicitly known.

de Bezenac, Pajot, and Gallinari (2018) considered a similar problem as ours. They proposed how transport physics (advection and diffusion) could be incorporated for forecasting sea surface temperature (SST). In other words, they proposed how the motion flow that is helpful for the temperature flow prediction could be extracted in an unsupervised manner from a sequence of SST images. This work is a major milestone since it captures not only the dominant transport physics but also unknown patterns inferred through the neural networks. Despite its novel architecture, the model is specifically designed for transport physics, and it is not straightforward to extend it to other physics equations. Furthermore, it is restricted to a regular grid in order to use conventional convolutional neural networks (CNNs) for images.
Background

In this section, we introduce how differential operators in the Euclidean domain are analogously defined on the discrete graph domain and briefly show that the graph networks module is able to efficiently express the differential operators.

Calculus on Graphs

Preliminary. Given a graph G = (V, E), where V = {1, . . . , n} is a set of vertices and E ⊆ V² a set of edges, two types of real functions can be defined: on the vertices, f : V → R, and on the edges, F : E → R. It is also possible to define multiple functions on the vertices or edges, analogous to multiple feature maps of a pixel in CNNs. Since f and F can be viewed as scalar and vector fields in differential geometry (Figure 2), the corresponding discrete operators on graphs can be defined as follows (Bronstein et al. 2017).

Figure 2: Scalar/vector fields on Euclidean space and vertex/edge functions on a graph.

Gradient on graphs. The gradient on a graph is the linear operator

∇ : L²(V) → L²(E), (∇f)_ij = f_j − f_i if {i, j} ∈ E and 0 otherwise,

where L²(V) and L²(E) denote Hilbert spaces of vertex and edge functions, respectively; thus f ∈ L²(V) and F ∈ L²(E). As the gradient in Euclidean space measures the rate and direction of change in a scalar field, the gradient on a graph computes differences of the values between two adjacent vertices, and the differences are defined along the directions of the corresponding edges.

Divergence on graphs. The divergence in Euclidean space maps vector fields to scalar fields. Similarly, the divergence on a graph is the linear operator

div : L²(E) → L²(V), (div F)_i = Σ_{j:(i,j)∈E} w_ij F_ij for all i ∈ V,

where w_ij is a weight on the edge (i, j). It denotes a weighted sum of the edge functions incident to a vertex i, which is interpreted as the net flow at the vertex i.

Laplacian on graphs. The Laplacian (∆ = ∇²) in Euclidean space measures the difference between the values of a scalar field and its average on infinitesimal balls. Similarly, the graph Laplacian is defined as

∆ : L²(V) → L²(V), (∆f)_i = Σ_{j:(i,j)∈E} w_ij (f_i − f_j) for all i ∈ V.

The graph Laplacian can be represented in matrix form as L = D − W, where D = diag(Σ_{j:j≠i} w_ij) is a degree matrix and W denotes a weighted adjacency matrix. Note that L = ∆ = −div∇, and the minus sign is required to make L positive semi-definite.

Based on the core differential operators on a graph, we can rewrite differentiable physics equations (e.g., the diffusion equation or the wave equation) on a graph.

Graph Networks

Battaglia et al. (2018) proposed the graph networks framework, which generalizes relations among vertices, edges, and a whole graph. Graph networks (GN) describe how edge, node, and global attributes are updated by propagating information among themselves. Given a set of node (v), edge (e), and global (u) attributes, the steps of computation in a graph networks block are as follows:

1. e′_ij ← φ^e(e_ij, v_i, v_j, u) for all {i, j} ∈ E.
2. v′_i ← φ^v(v_i, ē′_i, u) for all i ∈ V, where ē′_i is an aggregated edge attribute related to the node i.
3. u′ ← φ^u(u, ē′, v̄′), where ē′ and v̄′ are aggregated attributes of all edges and all nodes in a graph, respectively.

Here φ^e, φ^v, and φ^u are the edge, node, and global update functions, respectively, and they can be implemented by learnable neural networks. Note that the computation order is flexible, and the aggregators can be chosen freely as long as they are invariant to permutations of their inputs.

As φ^e is a mapping function from vertices to edges, it can be replaced by the graph gradient operator to describe a known relation explicitly. Similarly, φ^v can learn divergence-like (edge-to-node) mapping functions. For curl-involved functions, it is required to add another updating function, φ^c, which maps node/edge/global attributes to a 3-clique attribute and vice versa. In other words, graph networks provide highly flexible modules which are able to imitate the differential operators on a graph explicitly or implicitly.

Table 1: Examples of static equations in graph networks.

Mapping | Equation | Physics example
node → edge | e_ij = φ^e(v_i, v_j) = (∇v)_ij | ∇φ = −E (Electric field)
edge → node | v_i = φ^v(e_ij) = (div e)_i | ∇·E = ρ/ε₀ (Maxwell's eqn.)
node → node | v_i = φ^v(v_i, {v_j:(i,j)∈E}) = (∆v)_i | ∆φ = 0 (Laplace's eqn.)

Physics-aware Graph Networks

As deep learning models are successful at modeling complex behaviors and extracting abstract features from data, it is natural to focus on how data-driven modeling can solve practical problems in physics or engineering fields. In this section, we describe how domain knowledge from physics can be incorporated within the graph networks framework.

Static Physics

Many fields in physics dealing with static properties, such as electrostatics, magnetostatics, or hydrostatics, describe a number of physics phenomena at rest.
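The three operators above can be checked numerically. The following is a minimal NumPy sketch (the toy graph and function names are illustrative, not from the paper) that implements the graph gradient, divergence, and Laplacian and verifies the identity L = D − W = −div∇ stated in this section.

```python
import numpy as np

# Toy undirected graph on 4 vertices with unit edge weights (illustrative).
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
n = 4
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

def gradient(f, edges):
    """(grad f)_ij = f_j - f_i along each edge {i, j}."""
    return np.array([f[j] - f[i] for i, j in edges])

def divergence(F, edges, n, W):
    """(div F)_i = sum_j w_ij F_ij; the reversed edge carries -F_ij."""
    out = np.zeros(n)
    for k, (i, j) in enumerate(edges):
        out[i] += W[i, j] * F[k]
        out[j] -= W[i, j] * F[k]
    return out

def laplacian(f, W):
    """(Lap f)_i = sum_j w_ij (f_i - f_j), i.e. (D - W) @ f."""
    D = np.diag(W.sum(axis=1))
    return (D - W) @ f

f = np.array([1.0, 2.0, 4.0, 8.0])
# L = D - W agrees with -div(grad f), as noted in the text.
assert np.allclose(laplacian(f, W), -divergence(gradient(f, edges), edges, n, W))
```

Since L = D − W gives the quadratic form f⊤Lf = Σ_{(i,j)∈E} w_ij (f_i − f_j)² ≥ 0, this also makes concrete why the minus sign in L = −div∇ is what renders L positive semi-definite.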
Among the various phenomena, it is easy to express differentiable physics rules in discrete forms on a graph with the operators in the previous section. For instance, the Poisson equation (∇²φ = −ρ/ε₀) in electrostatics is realized as a simple matrix multiplication of the graph Laplacian with a vertex function. Table 1 provides some differential formulas in electrostatics and shows how the updating functions are defined in graph networks.

Dynamic Physics

More practical equations are written in dynamic forms, which describe how a given physical quantity changes in a given region over time. GN can be regarded as a module that updates a graph state including the attributes of nodes, edges, and a whole graph:

G′ = GN(G) (1)

where G′ is the updated graph state. Dynamic physics formulas are written as a function of time and spatial derivatives:

f(∂u/∂t, · · · , ∂^M u/∂t^M, ∂u/∂x, · · · , ∂^N u/∂x^N) = 0 (2)

where u is a spatiotemporally varying physical quantity and x is the direction along which u is defined. M and N denote the highest orders of the time and spatial derivatives, respectively. Under the state-updating view in Equation 1, any type of PDE written as Equation 2 can be represented in a finite-difference form. Table 2 provides examples of dynamic physics; u̇ and ü are the first and second order time derivatives, respectively.

Table 2: Examples of dynamic equations in graph networks.

Equation | Physics example
v′_i = v_i + α φ^v(v_i, {v_j:(i,j)∈E}) = v_i + α(∆v)_i | u̇ = α∆u (Diffusion eqn.)
v″_i = 2v′_i − v_i + c² φ^v(v′_i, {v′_j:(i,j)∈E}) = 2v′_i − v_i + c²(∆v′)_i | ü = c²∆u (Wave eqn.)
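As a concrete illustration of the finite-difference updates in Table 2, the sketch below steps the diffusion and wave equations on a small graph via the Laplacian L = D − W. The graph, coefficients, and function names are illustrative; following the sign note in the Background, we take ∆ = −(D − W) so that the diffusion step smooths the signal.

```python
import numpy as np

def graph_laplacian(W):
    """Graph Laplacian L = D - W (positive semi-definite)."""
    return np.diag(W.sum(axis=1)) - W

def diffusion_step(v, W, alpha):
    """One explicit step of du/dt = alpha * Delta u with Delta = -(D - W),
    so v' = v + alpha * Delta v drives values toward their neighbors."""
    return v - alpha * graph_laplacian(W) @ v

def wave_step(v_curr, v_prev, W, c):
    """One step of u'' = c^2 * Delta u via central differences:
    v'' = 2 v' - v + c^2 * Delta v'."""
    return 2.0 * v_curr - v_prev + c**2 * (-graph_laplacian(W) @ v_curr)

# Diffusion on a 3-node path graph drives the signal toward its mean.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
v = np.array([1.0, 0.0, 1.0])
for _ in range(50):
    v = diffusion_step(v, W, alpha=0.1)
# v is now numerically the constant mean vector [2/3, 2/3, 2/3]
```

The explicit step is stable only for a small enough α (here 0.1 relative to the largest Laplacian eigenvalue), which mirrors the usual CFL-type condition for explicit finite-difference schemes.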
Physics in Latent Space

The previous section shows how the differential operators can be implemented in a GN module. However, modeling complicated real-world problems with the differential operators alone is hardly practical, because it is only possible when all physics equations governing the observed phenomena are explicitly known. For example, although we are aware that a number of physics equations are involved in climate observations, it is almost infeasible to include all the equations required for modeling the observations. Thus, it is necessary to utilize the learnable parameters in GN to fill in the missing dynamics which are not described by the given equations.

There is another advantage to utilizing learnable parameters. A number of unknown parameters need to be pre-defined to specify the physics equations, and these parameters can be inferred through the learnable parameters. For example, while we may know that an input signal has a wave property, the speed of the waves (c in Table 2) must be given to fully describe the wave equation. It is even worse when multiple input signals are involved, since each signal is governed by different parameters in the same kind of equation. While both temperature and surface pressure are continuous and diffusive, they have different diffusion coefficients (α in Table 2) in the same diffusion equation. To address this issue, we can transform the input signals to a latent space and impose one equation in the latent space instead of imposing multiple equations on the input signals separately. Then, the parameters in the encoder make the different signals follow the equation differently. We formalize how this idea is implemented as follows.

Forward/Recurrent computation. Figure 3 shows how the desired physics knowledge is integrated with the graph networks. Given a graph G = {v, e, u}, it is fed into an encoder which transforms the sets of node (v), edge (e), and global (u) attributes into latent spaces:

ṽ, ẽ, ũ = Encoder(v, e, u) (3)

After the encoder, the encoded graph H = {ṽ, ẽ, ũ} is repeatedly updated within the core block as many times as the required number of time steps T. For each step, H is updated to H′, which denotes the next state of the encoded graph:

H′ = GN(H) (4)

Finally, the sequentially updated attributes are re-transformed to the original spaces by a decoder:

v′, e′, u′ = Decoder(ṽ′, ẽ′, ũ′) (5)

There are two types of objective function in this architecture: a physics-knowledge objective and a supervised objective. First, we define the physics-informed constraint, which takes the form of the equations in Tables 1 and 2 (or a mixture of them), depending on the given physics knowledge:

f^s_phy(H′_t), f^d_phy(H′_t, · · · , H′_{t+M}) (6)

L_phy = Σ_t [ f^s_phy(H′_t) + f^d_phy(H′_t, · · · , H′_{t+M}) ] (7)

where f^s_phy(H′_t) and f^d_phy(H′_t, · · · , H′_{t+M}) are the static and dynamic physics-informed quantities, respectively. For example, we can impose a gradient constraint or the diffusion equation on the node/edge latent representations as follows:

f^s_phy(H′_t) = ‖ẽ′_t − ∇ṽ′_t‖²
f^d_phy(H′_t, H′_{t+1}) = ‖ṽ′_{t+1} − ṽ′_t − α∇²ṽ′_t‖²

Second, the supervised loss function is defined between the predicted graph Ĝ′ and the target graph G′. This loss is constructed based on the task, e.g., the cross-entropy or the mean squared error (MSE). Finally, the total objective function is a sum of the two terms:

L = L_sup + λL_phy (8)

where λ controls the importance of the physics term.

Figure 3: Recurrent architecture to incorporate physics equations on GN. The blue blocks have learnable parameters and the orange blocks are objective functions. ⊙ is a concatenation operator, and the middle core block can be repeated as many times as the required number of time steps (T).

Experiment

In this section, we evaluate PaGN on a real-world climate dataset from the Southern California region.

Climate Data
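To make the recurrent computation and the two objectives concrete, here is a minimal forward-pass sketch evaluating L = L_sup + λL_phy with the latent diffusion constraint. All shapes, the linear encoder/decoder, and the one-line GN update are hypothetical stand-ins for the paper's learnable modules, not its actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_lat = 5, 10, 8                      # nodes, input features, latent size

# Toy stand-ins for the learnable modules (hypothetical shapes/parameters).
W_enc = rng.normal(size=(d_in, d_lat)) * 0.1   # node encoder
W_gn = rng.normal(size=(d_lat, d_lat)) * 0.1   # GN node-update stand-in
W_dec = rng.normal(size=(d_lat, d_in)) * 0.1   # node decoder

A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency
L_graph = np.diag(A.sum(1)) - A                # graph Laplacian D - W

def gn_step(h):
    """One core-block update H' = GN(H): aggregate neighbors, then transform."""
    return np.tanh((h + A @ h) @ W_gn)

def pagn_loss(v, target, T=3, alpha=0.1, lam=0.5):
    h = np.tanh(v @ W_enc)                     # encoder: latent node states
    phys = 0.0
    for _ in range(T):
        h_next = gn_step(h)
        # Latent diffusion constraint || h' - h - alpha * Delta h ||^2,
        # with Delta = -(D - W) following the sign note in the Background.
        phys += np.sum((h_next - h + alpha * (L_graph @ h)) ** 2)
        h = h_next
    pred = h @ W_dec                           # decoder: back to observation space
    sup = np.mean((pred - target) ** 2)        # supervised MSE term
    return sup + lam * phys                    # Eq. (8): L = L_sup + lambda * L_phy

v = rng.normal(size=(n, d_in))
y = rng.normal(size=(n, d_in))
loss = pagn_loss(v, y)
assert np.isfinite(loss) and loss >= 0.0
```

In a real implementation the loss would of course be minimized with an autodiff framework; the sketch only shows where the supervised and physics terms enter the forward computation.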
For the evaluation on real-world data, we used hourly simulated climate observations for 16 days over the Southern California region (Zhang et al. 2018). In this dataset, we sampled small regions randomly from two areas (Los Angeles and San Diego, Figure 4) encompassing urban and rural meteorological features to generate spatially discrete observations. To build a graph, we connected pairs of the sampled regions by using the k-nearest neighbors algorithm (k = 3). This data preprocessing is required to verify the proposed idea as well as to evaluate PaGN in the spatiotemporally sparse setting, which is more common for sensor-based datasets.

The vertex attributes consist of 10 climate observations: air temperature, albedo, precipitation, soil moisture, relative humidity, specific humidity, surface pressure, planetary boundary layer height, and wind vector (2 directions). While the edge attributes are not given explicitly, we could specify the type of each edge by using the types of the connected regions. There are 13 different land-usage types, and each type summarizes how the corresponding land is used. Based on the types of the connected regions, we assigned different embedding vectors to the edges.

Figure 4: Sampled regions in the Southern California area. (Left) Los Angeles (274 nodes) and (Right) San Diego (282 nodes).

PaGN Architecture

As explained above, PaGN consists of three modules: a graph encoder, a GN block, and a graph decoder (Figure 3). The encoder contains two feed-forward networks, φ^v and φ^e, applied to the node and edge features, respectively. By passing through the encoder, the features are transformed to the latent space (H) where we impose the physics equations.

In the GN block, the node/edge/graph features are updated by the GN algorithm described above. The latent graph states H and H′ indicate the hidden states of the current and next observations. For the physics constraint, we informed the diffusion and wave equations in Table 2, which describe the behavior of continuous physical quantities. As most of the climate observations vary continuously, the diffusion equation, as a part of the continuity equation, is one of the inductive biases that should be considered for modeling. In addition, the wave equation is useful to describe atmospheric phenomena, especially 1-solar-day harmonics (e.g., atmospheric tides). Note that the physics equations are not directly applied to the input observations, but rather to the latent representations. The state-updating process is repeated at least as many times as the order of the equations to provide the finite-difference equation. For multistep predictions, the recurrent module is repeated as many times as the number of predictions, and the physics equation is applied multiple times as well. Finally, the decoder takes H′ as input to return the next predictions. The following is the total loss function of PaGN with the diffusion equation:

L = Σ_{i=1}^T ‖ŷ′_i − y′_i‖² + λ Σ_{i=1}^T ‖ṽ′_i − ṽ′_{i−1} − α∇²ṽ′_{i−1}‖² (9)

where y′ is a vector of the target observations (i.e., node vectors) and α adjusts the diffusivity of the latent representations; it is found through cross-validation. Note that the equation term can be replaced by other equations as appropriate.
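The graph construction described in Climate Data can be sketched as follows. The station coordinates here are synthetic, and since the paper does not state the distance metric, Euclidean distance is assumed.

```python
import numpy as np

def knn_edges(coords, k=3):
    """Connect each sampled region to its k nearest neighbors; the union of
    the directed k-NN pairs is kept as an undirected edge set."""
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # exclude self-loops
    edges = set()
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)

rng = np.random.default_rng(7)
coords = rng.uniform(size=(20, 2))          # synthetic station locations
edges = knn_edges(coords, k=3)
# Every node initiates k = 3 links, so each node ends with degree >= 3 and
# 30 <= |E| <= 60 once mutual (duplicate) pairs are merged.
```

Edge-type embeddings would then be assigned per edge by looking up the land-usage types of its two endpoint regions, as described above.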
Experimental Settings

In our experiments, we used the air temperature as the target observation, and the other 9 observations were used as input. We first evaluated our model by performing the one-step and multistep prediction tasks on the two different areas with a mean squared error metric. For both regions, we trained the model with input observations for 10 timesteps (t − 10 : t − 1) and predicted targets from t − 9 to t. The first 65% of the total length was used as a training set, and the remaining series was split into validation (10%) and test (25%) sets. We explored several baselines: MLP, LSTM, and GN-only, which ignores the physics constraint in PaGN. We also compared GN-skip, which connects H and H′ with a skip-connection (He et al. 2016) without the physics constraint.

One step Prediction

Table 3 shows the prediction error of the baselines and PaGN on the different areas. MLP and LSTM are shared over all stations, and they are outperformed by the models leveraging the given graph structure. This implies that knowing neighboring information is significantly helpful for inferring a node's own state, which is intuitive since climate behaviors are spatiotemporally continuous. Among the graph-based models, PaGN (diff) provides the lowest MSEs. This validates that the diffusive property provides a strong inductive bias for the latent representation learning. Note that the standard deviations of PaGN (diff) are significantly smaller than those of the other baselines, which implies that the integrated physics knowledge properly stabilizes the optimization process by introducing an additional objective.

Table 3: One step prediction error (MSE).

Model | LA area | SD area
MLP | 0.8140±0.0651 | 0.7735±0.0539
LSTM | 0.7855±0.0644 | 0.8123±0.0875
GN-only | 0.5951±0.0517 | 0.6947±0.1859
GN-skip | 0.5906±0.0620 | 0.6456±0.1499
PaGN (wave) | 0.5366±0.0631 | 0.6413±0.1549
PaGN (diff) | 0.5289±0.0405 | 0.5746±0.0471

Multistep Prediction

To evaluate the effectiveness of the state-wise regularization more carefully, we conducted the multistep prediction task (10-step forecast horizon). For this task, the recurrent modules are modified to predict the input observations as well, and the predictions are re-fed into the model for future timesteps. While the models with a recurrent module are able to predict a few more steps reasonably, there are a couple of things to which we should pay attention.

Table 4: Multistep prediction error (MSE).

Model | LA area | SD area
LSTM | 1.9022±0.2078 | 1.2489±0.2295
GN-only | 1.6137±0.1128 | 1.5532±0.2023
GN-skip | 1.5429±0.0932 | 1.4423±0.1622
PaGN (diff) | 1.4656±0.0474 | 1.0999±0.0435
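The chronological split and the autoregressive multistep rollout described above can be sketched as follows. The persistence "model" and the function names are placeholders for illustration, not the trained PaGN.

```python
import numpy as np

def chronological_split(series, train=0.65, val=0.10):
    """First 65% for training, next 10% for validation, remaining 25% for test."""
    t = len(series)
    i, j = int(t * train), int(t * (train + val))
    return series[:i], series[i:j], series[j:]

def rollout(step_fn, window, horizon=10):
    """Autoregressive multistep prediction: each predicted state is re-fed
    into the model to produce the next one, as in the multistep task."""
    preds, state = [], list(window)
    for _ in range(horizon):
        nxt = step_fn(np.asarray(state[-10:]))  # model sees the last 10 steps
        preds.append(nxt)
        state.append(nxt)                       # re-feed the prediction
    return np.array(preds)

series = np.arange(384.0)                       # 16 days of hourly observations
tr, va, te = chronological_split(series)
assert (len(tr), len(va), len(te)) == (249, 39, 96)

# A persistence 'model' that repeats the last observation, for illustration.
preds = rollout(lambda w: w[-1], window=tr[:10], horizon=10)
assert np.allclose(preds, tr[9])
```

With a learned step function in place of the persistence lambda, errors compound over the horizon, which is exactly where the slowly varying latent states encouraged by the diffusion constraint help.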
First, the results imply that utilizing the neighboring information is important, because the GN-only model shows similar or better MSEs compared to LSTM for the multistep tasks, even though it has a simple recurrent module that is not as good as that of LSTM. Second, we found that the diffusion equation in PaGN gives stable state transitions, and this property provides slowly varying latent states, which are desired particularly for climate forecasting. Note that the skip-connection in GN-skip is also able to restrict rapid changes of H. However, it is necessary to optimize the parameters in GN-skip more carefully to learn the residual term in H′ = H + GN(H) properly.

Effectiveness of Physics Constraint

One of the benefits of physics-aware learning is data efficiency. We explore how much the physics constraint helps by testing whether PaGN can be well trained when the amount of data for the supervised objective is limited, on the one-step prediction task. We randomly sampled training data which were used to optimize the total loss function (Equation 9), and the remaining unsampled data were used only to minimize the physics constraint:

L = L^i_sup + λL^i_phy, if i is a sampled step
L = λL^i_phy, otherwise

We found that the diffusion equation can benefit the optimization of PaGN even if the target observations are only partially available (Figure 5a). Although the overall performance of PaGN degrades when fewer sampled data are used, the errors do not deviate far from those of GN-only. The GN-only model is even outperformed by PaGN when only 70% of the training data are used with the state-wise constraint.

Importance of Physics Constraint

To study the importance of the physics term, we trained PaGN with different values of λ controlling the weight of the physics term. While we found that the physics term is substantially helpful (Tables 3 and 4), the term is not supposed to be dominant (see Figure 5b) but should be tuned properly. This is intuitive since the term only provides partial knowledge (diffusive input signals), which changes the loss surface to help the parameters predict the next signals more stably, instead of governing the dynamics explicitly. Scaling down the physics term is similar to what Sabour, Frosst, and Hinton (2017) did so that the reconstruction error does not dominate the margin loss but helps the optimization process.

We also present MSEs from PaGN (rand), defined by randomly sampling (α, β) ∈ [−2.5, 2.5] in the constraint ‖v″ + αv′ + βv − c∆v‖², and PaGN (diff+wave), which superposes the two equations. Table 5 shows that the random equation significantly degrades the overall prediction quality. Note that the simple superposition of two equations does not always guarantee lower error, even if each equation is helpful separately. When the two equations are non-linearly connected in the unknown (fully) governing equation, the superposition cannot provide a meaningful inductive bias. The results demonstrate that the physics term is a useful inductive bias when it is properly defined.

Table 5: One step prediction MSE with different constraints.

Model | LA area | SD area
PaGN (rand) | 1.1406 | 0.7073
PaGN (diff+wave) | 0.5624 | 0.6724

Figure 5: (a) MSEs of PaGN are almost as good as those of GN-only (gray lines) despite the smaller number of training data. (b) Dependence of the prediction performance on the weight of the physics term.

Conclusion

In this work, we introduce a new architecture, PaGN, based on graph networks to incorporate prior knowledge given as a form of PDEs over time and space. While existing works focus more on discovering equations in data generated by explicit physics rules, we propose a method to leverage a weakly given inductive bias describing the data. We empirically analyze the performance of PaGN across a range of prediction experiments on climate observations.

References

Battaglia, P.; Pascanu, R.; Lai, M.; Rezende, D. J.; et al. 2016. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, 4502–4510.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; and Vandergheynst, P. 2017. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4): 18–42.

Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2016. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113(15): 3932–3937.

Chang, M. B.; Ullman, T.; Torralba, A.; and Tenenbaum, J. B. 2017. A Compositional Object-Based Approach to Learning Physical Dynamics. International Conference on Learning Representations.

Cressie, N.; and Wikle, C. K. 2015. Statistics for Spatio-Temporal Data. John Wiley & Sons.

de Bezenac, E.; Pajot, A.; and Gallinari, P. 2018. Deep Learning for Physical Processes: Incorporating Prior Scientific Knowledge. In International Conference on Learning Representations.

Grzeszczuk, R.; Terzopoulos, D.; and Hinton, G. 1998. NeuroAnimator: Fast neural network emulation and control of physics-based models. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, 9–20. ACM.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6): 82–97.

Kipf, T.; Fetaya, E.; Wang, K.-C.; Welling, M.; and Zemel, R. 2018. Neural Relational Inference for Interacting Systems. International Conference on Machine Learning.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

Long, Z.; Lu, Y.; Ma, X.; and Dong, B. 2018. PDE-Net: Learning PDEs from Data. In Proceedings of the 35th International Conference on Machine Learning. URL http://proceedings.mlr.press/v80/long18a.html.

Raissi, M. 2018. Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Differential Equations. arXiv preprint arXiv:1801.06637.

Raissi, M.; and Karniadakis, G. E. 2018. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics 357: 125–141.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2017a. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. arXiv preprint arXiv:1711.10561.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2017b. Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations. arXiv preprint arXiv:1711.10566.

Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, 3856–3866.

Sanchez-Gonzalez, A.; Heess, N.; Springenberg, J. T.; Merel, J.; Riedmiller, M.; Hadsell, R.; and Battaglia, P. 2018. Graph Networks as Learnable Physics Engines for Inference and Control. In Proceedings of the 35th International Conference on Machine Learning.

Santoro, A.; Raposo, D.; Barrett, D. G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lillicrap, T. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, 4967–4976.

Watters, N.; Tacchetti, A.; Weber, T.; Pascanu, R.; Battaglia, P.; and Zoran, D. 2017. Visual interaction networks. In Advances in Neural Information Processing Systems.

Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhang, J.; Mohegh, A.; Li, Y.; Levinson, R.; and Ban-Weiss, G. 2018. Systematic Comparison of the Influence of Cool Wall versus Cool Roof Adoption on Urban Climate in the Los Angeles Basin. Environmental Science & Technology 52(19): 11188–11197.