Inference in Relational Neural Machines1
        Giuseppe Marra2 and Michelangelo Diligenti3 and Lapo Faggi2 and Francesco Giannini2 and
                                   Marco Gori2 and Marco Maggini2


Abstract. Integrating logic reasoning and deep learning from sen-            the flexibility of the approach. Deep ProbLog [13] extends the prob-
sory data is a key challenge to develop artificial agents able to op-        abilistic logic programming language ProbLog [3] with predicates
erate in complex environments. Whereas deep learning can operate             implemented by a deep learner. This approach is powerful but lim-
at a large scale thanks to recent hardware advancements (GPUs) as            ited to cases where exact inference is possible. Deep Logic Mod-
well as other important technical advancements like Stochastic Gra-          els [14] improve over related approaches like Semantic-based Reg-
dient Descent, logic inference can not be executed over large reason-        ularization [4], and Logic Tensor Networks [5], but they fail to per-
ing tasks, as it requires to consider a combinatorial number of pos-         fectly match the discrimination abilities of a pure-supervised learner.
sible assignments. Relational Neural Machines (RNMs) have been                  Relational Neural Machines (RNM) [15] can perfectly replicate
recently introduced in order to co-train a deep learning machine and         the effectiveness of training from supervised data of deep architec-
a first-order probabilistic logic reasoner in a fully integrated way. In     tures, while integrating the full expressivity and rich reasoning pro-
this context, it is crucial to avoid the logic inference to become a bot-    cess of Markov Logic Networks [18]. RNMs like any probabilistic
tleneck, preventing the application of the methodology to large scale        reasoner based on a graphical model representing the fully grounded
learning tasks. This paper proposes and compares different inference         FOL knowledge are strongly limited in the scale at which they can
schemata for Relational Neural Machines together with some pre-              operate. Indeed, the large combinatorial number of possible assign-
liminary results to show the effectiveness of the proposed method-           ments together with the complex casual dependency structure of the
ologies.                                                                     inference requires to device appropriate approximate inference al-
                                                                             gorithmic solutions. This paper proposes and studies different new
                                                                             inference solutions that are thought to be effective for RNMs.
1   Introduction                                                                The outline of the paper is as follows. Section 2 presents the model
                                                                             and how it can be used to integrate logic and learning. Sections 3 and
Empowering machine learning with explicit reasoning capabilities
                                                                             4 study tractable approaches to perform training and inference with
is a key step toward a trustworthy and human-centric AI, where
                                                                             the model, respectively. Section 5 shows the experimental evaluation
the decisions of the learners are explainable and with human-
                                                                             of the proposed ideas on various datasets. Finally, Section 6 draws
understandable guarantees. While sub-symbolic approaches like
                                                                             some conclusions and highlights some planned future work.
deep neural networks have achieved impressive results in several
tasks [7], standard neural networks can struggle to represent rela-
tional knowledge on different input patterns or relevant output struc-       2   Model
tures. Recently, some work has been done to learn and inject re-
                                                                             A Relational Neural Machine establishes a probability distribution
lational features into the learning process [17, 16]. Symbolic ap-
                                                                             over a set of n output variables of interest y = {y1 , . . . , yn }, given
proaches [3, 18] based on probabilistic logic reasoners can per-
                                                                             a set of predictions made by one or multiple deep architectures, and
form an explicit inference process in presence of uncertainty. An-
                                                                             the model parameters. In this paper the output variables are assumed
other related line of research studies hybrid approaches leveraging
                                                                             to be binary, i.e. yi = {0, 1}.
deep learning schemas and neural networks to learn the structure of
                                                                                Unlike standard neural networks which compute the output via a
the reasoning process like done, for instance, by Neural Theorem
                                                                             simple forward pass, the output computation in an RNM can be de-
Provers [19] or TensorLog [11].
                                                                             composed into two stages: a low-level stage processing the input pat-
   However, bridging the gap between symbolic and sub-symbolic
                                                                             terns, and a subsequent semantic stage, expressing constraints over
levels is still an open problem which has been recently addressed by
                                                                             the output and performing higher level reasoning. The first stage pro-
neuro-symbolic approaches [12, 20]. Hu et al. [10] inject the prior
                                                                             cesses D input patterns x = {x1 , . . . , xD }, returning the values f
knowledge into the network weights via a distillation process but
                                                                             using the network with parameters w. The higher layer takes as input
with no guarantee that the logic will be properly generalized to the
                                                                             f and applies reasoning using a set of constraints, whose parameters
test cases. Deep Structured Models [1] and Hazan et al. [9] imposes
                                                                             are indicated as λ, then it returns the set of output variables y.
statistical structure on the output predictions. The Semantic Loss [22]
                                                                                A RNM model defines a conditional probability distribution in the
defines a loss which encodes the desired output structure. However,
                                                                             exponential family defined as:
the loss does not define a probabilistic reasoning process, limiting                                                                   !
1 Copyright
                                                                                                           1        X
            c 2020 for this paper by its authors. Use permitted under Cre-                  p(y|f , λ) = exp             λc Φc (f , y)
  ative Commons License Attribution 4.0 International (CC BY 4.0).                                         Z          c
2 Department of Computer Science, KU Leuven, Belgium
3 Department of Information Engineering and Science, University of Siena,    where Z is the partition function and the C potentials Φc express
  Italy                                                                      some properties on the input and output variables. The parameters
                                        flion (x1 )                                             probability distribution:
                                                               fzoo (x1 )                                                                            
                       fsavanna (x1 )
                                                                                                               1        X            X X
                                                                                                 p(y|f , λ) = exp φ0 (f (x), y(x))+   λc   φc (yc,g )
                                        y3 = lion(x1 )                                                         Z          x          c    y                c,g
              y1 = savanna(x1 )


                                                                            y2 = zoo(x1 )
                                                                                                This will allow to develop the data embeddings as part of training by
                                                y7=sameloc(x1 ,x2 )
                                                                                                enforcing the consistency between the reasoner and network outputs,
                                                                                                while distilling the logical knowledge into the network weights.
                                                                                                   Figure 1 shows the graphical model obtained for a simple multi-
                                                      y6=lion(x2 )
                                                                                                class image classification task. The goal of the training process is to
                                                                                                train the classifiers approximating the predicates, but also to establish
               y4 = savanna(x2 )                                            y5 = zoo(x2 )       the relevance of each rule. For example, in an image classification
                                                                                                task, the formula ∀x Antelope(x) ∧ Lion(x) is likely to be associ-
                                                                                                ated to a higher weight than ∀x P olarBear(x) ∧ Lion(x), which
                                                                                                are unlikely to correlate in the data.
                                                flion (x2 )

                       fsavanna (x2 )                          fzoo (x2 )
                                                                                                3   Training
 Figure 1. The graphical model representing an RNM, where the output                            The computation of the partition function requires a summation over
  variables y depend on the output of first stage f , processing the inputs                     all possible assignments of the output variables, which is intractable
{x1 , x2 } instantiated for the rules ∀x lion(x) ⇒ savanna(x) ∨ zoo(x),                         for all but trivial cases. A particularly interesting case is when it is
      ∀x∀x0 sameloc(x, x0 ) ∧ savanna(x) ⇒ savanna(x0 ), and                                    assumed that the partition function factorizes over the potentials like
               ∀x∀x0 sameloc(x, x0 ) ∧ zoo(x) ⇒ zoo(x0 ).
                                                                                                done in piecewise likelihood [21]:
                                                                                                                                                          
λ = {λ1 , . . . , λC } determine the strength of the potentials Φc .                                         Z≈
                                                                                                                   Y
                                                                                                                       Zc =
                                                                                                                             Y X
                                                                                                                                                      0
                                                                                                                                       exp(λc Φc (f , yc ))        (2)
    In a classical and pure supervised learning setup, the patterns are                                            c         c      0
                                                                                                                                   yc
i.i.d., it is therefore possible to split the y, f into disjoint sets group-
ing the variables of each pattern, forming separate cliques. Let us in-                         where yc is the subset of variables in y that are involved in the com-
dicate as y(x), f (x) the portion of the output and function variables                          putation of Φc . Then the piecewise-local probability for the c-th con-
referring to the processing of an input pattern x. A single potential                           straint can be expressed as:
Φ0 corresponding to the dot product between y and f is needed to
                                                                                                                                   exp (λc Φc (f , yc ))
represent supervised learning, and this potential decomposes over the                                           pP L
                                                                                                                 c (yc |f , λ) =
                                                                                                                                           Zc
patterns yielding the distribution,
                                                               !                                   Under this assumption, the factors can be distributed over the po-
                              1          X                                                      tential giving the following generalized piecewise likelihood:
               p0 (y|f , λ) = exp             φ0 (y(x), f (x))           (1)
                              Z          x∈S
                                                                                                                       Y                          Y PL
                                                                                                   p(y|f , λ) ≈           p(yc |y\yc , f , λc ) =   pc (yc |f , λ)
                                                                                                                       c                           c
where S ⊆ x is the set of supervised patterns. As shown by Marra et
al. [15], the cross entropy loss over sigmoidal or softmax outputs can                              If the variables in y are binary, the computation of Z requires sum-
be exactly recovered by maximizing the log-likelihood of a RNM for                              mation over all possible assignments which has O(2|y| ) complexity.
one-label and multi-label classification tasks, respectively.                                   Using the local decomposition this is reduced to O(|yc̄ |·2nc̄ ), where
   Neuro-symbolic integration can be obtained by employing one                                  c̄ is the index of the formula corresponding to the potential with the
potential Φ0 enforcing the consistency with the supervised data to-                             largest number nc̄ of variables to ground.
gether with potentials representing the logic knowledge. Using a sim-                               If the c-th constraint is factorized using the P L partitioning, the
ilar approach to Markov Logic Networks, a set of First–Order Logic                              derivatives of the log-likelihood with respect to the model potential
(FOL) formulas is input to the system, and a potential Φc for each                              weights are:
formula is considered. It is assumed that some (or all) the predicates
in a KB are unknown and need to be learned together with the pa-                                             ∂ log p(y|f , λ)
                                                                                                                              ≈ Φc (f , y) − EpPc L [Φc ]
rameters driving the reasoning process.                                                                            ∂λc
   In the following we refer to grounded expression (the same ap-                               and with respect to the learner parameters:
plies to atom or predicate) as a FOL rule whose variables are as-                                                           X ∂fi h
                                                                                                         ∂ log p(y|f ,λ)                                 0 i
signed to specific constants. It is assumed that the undirected graph-                                                    ≈            y i − Ey 0 ∼p P L yi
                                                                                                              ∂wk
ical model is built such that: each grounded atom corresponds to a                                                            i
                                                                                                                                 ∂wk                 0


node in the graph; all the nodes corresponding to grounded atoms
co-occurring in at least one rule are connected on the graph. As a re-                            EM When the world is not fully observed during training, an
sult, there is one clique (and then potential) for each grounding gc of                         iterative Expectation Maximization (EM) schema can be used to
the formula in y. It is assumed that all the potentials resulting from                          marginalize over the unobserved data in the expectation step using
the c-th formula share the same weight λc , therefore the potential                             the inference methodology as described in the next paragraph. Then,
Φc is the sum
            P over all groundings of φc in the world y, such that:                              the average constraint satisfaction can be recomputed, and, finally,
Φc (y) = yc,g φc (yc.g ) where φc (gc ) assumes a value equal to 1                              the λ, w parameters can be updated in the maximization step. This
and 0 if the grounded formula holds true and false. This yields the                             process is then iterated until convergence.

                                                                                            2
4   Inference.                                                                over the single factors, allowing to efficiently perform MARG infer-
                                                                              ence as:
This sections proposes some general methodologies which can be                                                    X         h
used to make RNM inference tractable.                                                p(yq |λ, f , y e ) ≈                     pP L
                                                                                                                               0 (yu , yq |f )·
   Inference tasks can be sub-categorized into different groups. In                                           yu =y\(y e ,yq )
                                                                                                                                               #
particular, MAP inference methods search the most probable assign-                                              Y PL
ment of the y given the evidence and the fixed parameters w, λ. The                                           ·  pc (yu , yq |λc , f , y e )
                                                                                                                 c
problem of finding the best assignment y ? to the unobserved query
variables given the evidence y e and current parameters can be stated         Finally the shared variable can be reconciled by selecting the assign-
as:                             X                                             ment for a shared variable that has the highest marginal probability.
               y ? = arg max 0
                                    λc Φc (f , [y 0 , y e ])      (3)
                               y
                                      c
                                                                              Piecewise Gibbs. Gibbs sampling can be used to sample from the
         0   e                                                                distribution and then select the sample with the highest probability:
where [y , y ] indicates a full assignment to the y variables, split
into the query and evidence sets.                                                                              X
   On the other hand, MARG inference methods compute the                                      y ? = arg max
                                                                                                          0
                                                                                                                  λc Φc (f , [y 0 , y e ])
                                                                                                         y ∼p
marginal probability of a set of random variables given some evi-                                                c

dence. MARG inference sums the probability of an assignment of a              In order to speed up the process, a blocked Gibbs sampler may
query variable yq over all possible worlds. MARG inference for a              be considered, by grouping many variables and then sampling by
single query variable is defined as:                                          their joint distribution. For instance, piecewise approaches suggest to
                                   X                                          group the variables belonging to potential, exploiting the constraints
       p(yq |λ, f , y e ) =                  p(y|f , λ, y e ) ∀yq ∈ y         expressed by the potential on the samples. To speed up the process,
                              y\{y e ,yq }
                                                                              a certain flip can be accepted, only if it yields a strictly greater prob-
                                                                              ability value (Monotonic Gibbs sampling).
   Both MAP and MARG inference are intractable in the most gen-
eral cases as they require to consider all possible assignments. There-
fore, approximate methods must be devised to be able to tackle the              Piecewise Gibbs with Fuzzy Map Inference. Gibbs sampling
most interesting applications. This section proposes a few inference          would generally require a high number of samples to converge to
solutions that can be naturally applied to RNM.                               the correct distribution. Hybrid inference methodologies can be used
                                                                              to reduce the burn in time by starting the Gibbs sampler from the
                                                                              MAP solution found using efficient approximate inference methods
  Piecewise MAP Inference. MAP inference in RNMs requires                     like Fuzzy Logic MAP. The sampler then modifies the solution by it-
to evaluates all possible 2|y| assignments, which is generally in-            eratively sampling from the piecewise local distributions. Combining
tractable. A possible solution is to employ the the piecewise approx-         Fuzzy Logic MAP and a Gibbs sampler allows to avoid low-quality
imation (Equation 2) to separately optimize each single factor, so            local minima where fuzzy MAP solutions can get stuck, while speed-
reducing the complexity to 2nĉ with nĉ the size of the largest fac-         ing up the Gibbs sampling convergence.
tor. The main issue with this solution is that the same variable can
be present in different factors, and the piecewise assignments can be
                                                                               Relax, Compensate and Recover. Relax, Compensate & Recover
inconsistent across the factors. The assignments to variables shared
                                                                              (RCR) [2] is an algorithm to transform a graphical model into a sim-
across factors can be performed by selecting the assignment selected
                                                                              plified model, where inference is tractable. This simplified model is
by the most factors.
                                                                              changed while running the algorithm, by computing compensations
                                                                              to recover a solution as close as possible to the correct distribution.
 Fuzzy Logic MAP Inference. The y values can be relaxed into
the [0, 1] interval and assume that each potential Φc (f , y) has a            Graph Neural Networks. A few recent works show that inference
fuzzy-logic [8] continuous surrogate Φsc (f , y) which collapses into         in probabilistic models can be approximated by using Graph Neu-
the original potential when the y assume crisp values and is contin-          ral Networks [23]. This would allow to define an end-to-end neural
uous with respect to each yi . When relaxing the potentials to accept         RNM formulation.
continuous variables, the MAP problem stated by Equation 3 can
be solved by gradient-based techniques, by computing the derivative
with respect of each output variable:                                         5   Experiments
                                                                              The experimental evaluation aims at testing some of the proposed
       ∂ log p(y 0 |y e , f , λ)        X ∂Φsc (f , [y 0 , y e ])
                                 = fk +   λc                                  inference methods. The evaluation is still partial as more methods
               ∂yk                               ∂yk
                                        c                                     are currently being implemented.
                                                                                 The evaluation is carried out on a toy task, that is designed to high-
Different t-norm fuzzy logics have been considered as continuous              light the capability of RNMs to learn and employ soft rules that are
relaxations of logic formulas [4] e.g. Gödel, Łukasiewicz and Prod-          holding only for a sub-portion of the whole dataset. The MNIST
uct logics. Furthermore, a fragment of the Łukasiewicz logic [6] has          dataset contains images of single handwritten digits. In this task it
been recently proposed for translating logic inference into a convex          is assumed that additional relational logic knowledge is available in
optimization problem.                                                         the form of a binary predicate link connecting image pairs. Given
                                                                              two images x, y, whose corresponding digits are denoted by i, j, a
 Piecewise MARG. The piecewise approximation defined by                       link between x and y is established if the second digit follows the
Equation 2 allows to marginalize the probability of an assignment             first one, i.e. i = j + 1. However, the link predicate can be noisy,

                                                                          3
such that there is a degree of probability that the link(x, y) is estab-              [2] Arthur Choi and Adnan Darwiche, ‘Relax, compensate and then re-
lished for i 6= j + 1. The knowledge about the link predicate can be                      cover’, in JSAI International Symposium on Artificial Intelligence, pp.
                                                                                          167–180. Springer, (2010).
represented by the FOL formulas:                                                      [3] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen, ‘Problog: A
                                                                                          probabilistic prolog and its application in link discovery’, in Proceed-
∀x∀y link(x, y) ∧ digit(x, i) ⇒ digit(x, i + 1)             i = 0, . . . , 8 ,            ings of the 20th International Joint Conference on Artifical Intelligence,
                                                                                          IJCAI’07, pp. 2468–2473, San Francisco, CA, USA, (2007). Morgan
where digit(x, i) is a binary predicate indicating if a number i is the                   Kaufmann Publishers Inc.
digit class of the image x. Since the link predicate holds true also for              [4] Michelangelo Diligenti, Marco Gori, and Claudio Sacca, ‘Semantic-
pairs of non-consecutive digits, the above rule is violated by a certain                  based regularization for learning and inference’, Artificial Intelligence,
percentage of digit pairs. Therefore, the manifolds established by the                    244, 143–165, (2017).
                                                                                      [5] I Donadello, L Serafini, and AS d’Avila Garcez, ‘Logic tensor networks
formulas can help in driving the predictions, but the noisy links force                   for semantic image interpretation’, in IJCAI International Joint Confer-
the reasoner to be flexible about how to employ the knowledge. The                        ence on Artificial Intelligence, pp. 1596–1602, (2017).
training set is created by randomly selecting 1000 images from the                    [6] Francesco Giannini, Michelangelo Diligenti, Marco Gori, and Marco
MNIST dataset and by adding the link relation with an incremen-                           Maggini, ‘On a convex logic fragment for learning and reasoning’,
                                                                                          IEEE Transactions on Fuzzy Systems, (2018).
tal degree of noise. For each degree of noise in the training set, we                 [7] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio,
created an equally sized test set with the same degree of noise. A                        Deep learning, volume 1, MIT press Cambridge, 2016.
neural network with 100 hidden relu neurons is used to process the                    [8] Petr Hájek, Metamathematics of fuzzy logic, volume 4, Springer Sci-
input images. Table 1 reports the results of RNM inference meth-                          ence & Business Media, 2013.
                                                                                      [9] Tamir Hazan, Alexander G Schwing, and Raquel Urtasun, ‘Blending
    %link noise   NN     Fuzzy MAP      Piecewise Gibbs    Fuzzy MAP EM                   learning and inference in conditional random fields’, The Journal of
       0.0        0.78      1.00             1.00               1.00                      Machine Learning Research, 17(1), 8305–8329, (2016).
       0.2        0.78      1.00             1.00               1.00                 [10] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard H. Hovy, and Eric P.
       0.4        0.78      0.99             0.98               0.96                      Xing, ‘Harnessing deep neural networks with logic rules’, in Proceed-
                                                                                          ings of the 54th Annual Meeting of the Association for Computational
       0.6        0.78      0.89             0.88               0.96
                                                                                          Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1:
       0.8        0.78      0.86             0.64               0.86
                                                                                          Long Papers, (2016).
       0.9        0.78      0.78             0.28               0.78                 [11] William W Cohen Fan Yang Kathryn and Rivard Mazaitis, ‘Tensorlog:
                                                                                          Deep learning meets probabilistic databases’, Journal of Artificial In-
 Table 1. Accuracy of the different inference methods with respect to the                 telligence Research, 1, 1–15, (2018).
 percentage of link that are established between digit image pairs violating         [12] Navdeep Kaur, Gautam Kunapuli, Tushar Khot, Kristian Kersting,
                               the logical rule.                                          William Cohen, and Sriraam Natarajan, ‘Relational restricted boltz-
                                                                                          mann machines: A probabilistic logic learning approach’, in Inter-
                                                                                          national Conference on Inductive Logic Programming, pp. 94–111.
                                                                                          Springer, (2017).
ods against the baseline provided by the neural network varying the                  [13] Robin Manhaeve, Sebastijan Dumančić, Angelika Kimmig, Thomas
percentage of links that are predictive of a digit to follow another                      Demeester, and Luc De Raedt, ‘Deepproblog: Neural probabilistic logic
one. Using an EM based schema tends to consistently outperform                            programming’, arXiv preprint arXiv:1805.10872, (2018).
                                                                                     [14] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and
other methods, this is because EM allows to also improve the under-                       Marco Gori, ‘Integrating learning and reasoning with deep logic mod-
lying neural network by back-propagating the reasoning predictions                        els’, in Proceedings of the European Conference on Machine Learning,
to the learner. Other inference methods will be tested within EM.                         (2019).
Fuzzy MAP tends to find good solutions constantly improving over                     [15] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, Marco
the baseline. It is important to notice how RNM is robust with re-                        Maggini, and Gori. Marco, ‘Reational neural machines’, in Proceedings
                                                                                          of the European Conference on Artificial Intelligence (ECAI), (2020).
spect to the high link noise levels even when the relational data is not             [16] Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio, ‘Holo-
carrying any useful information, the final solution still matches the                     graphic embeddings of knowledge graphs’, in Thirtieth Aaai conference
baseline.                                                                                 on artificial intelligence, (2016).
                                                                                     [17] Mathias Niepert, ‘Discriminative gaifman models’, in Advances in Neu-
                                                                                          ral Information Processing Systems, pp. 3405–3413, (2016).
6     Conclusions and Future Work                                                    [18] Matthew Richardson and Pedro Domingos, ‘Markov logic networks’,
                                                                                          Machine learning, 62(1), 107–136, (2006).
This paper shows different inference methods for Relational Neural                   [19] Tim Rocktäschel and Sebastian Riedel, ‘End-to-end differentiable prov-
Machines, a novel framework to provide a tight integration between                        ing’, in Advances in Neural Information Processing Systems, pp. 3788–
learning from supervised data and logic reasoning. As future work,                        3800, (2017).
                                                                                     [20] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, Steven Schock-
we plan to undertake a larger experimental exploration of RNM for                         aert, and Ondrej Kuzelka, ‘Lifted relational neural networks: Efficient
more structured problems using all the inference methods proposed                         learning of latent relational structures’, Journal of Artificial Intelligence
by this paper.                                                                            Research, 62, 69–100, (2018).
                                                                                     [21] Charles Sutton and Andrew McCallum, ‘Piecewise pseudolikelihood
                                                                                          for efficient training of conditional random fields’, in Proceedings of
ACKNOWLEDGEMENTS                                                                          the 24th international conference on Machine learning, pp. 863–870.
                                                                                          ACM, (2007).
This project has received funding from the European Union’s Hori-                    [22] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den
zon 2020 research and innovation program under grant agreement                            Broeck, ‘A semantic loss function for deep learning with symbolic
No 825619.                                                                                knowledge’, in Proceedings of the 35th International Conference on
                                                                                          Machine Learning (ICML), (July 2018).
                                                                                     [23] KiJung Yoon, Renjie Liao, Yuwen Xiong, Lisa Zhang, Ethan Fetaya,
REFERENCES                                                                                Raquel Urtasun, Richard Zemel, and Xaq Pitkow, ‘Inference in prob-
                                                                                          abilistic graphical models by graph neural networks’, arXiv preprint
[1] Liang-Chieh Chen, Alexander Schwing, Alan Yuille, and Raquel Urta-                    arXiv:1803.07710, (2018).
    sun, ‘Learning deep structured models’, in International Conference on
    Machine Learning, pp. 1785–1794, (2015).


                                                                                 4