Inference in Relational Neural Machines

Giuseppe Marra, Michelangelo Diligenti, Lapo Faggi, Francesco Giannini, Marco Gori, and Marco Maggini

Department of Computer Science, KU Leuven, Belgium
Department of Information Engineering and Science, University of Siena, Italy

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Integrating logic reasoning and deep learning from sensory data is a key challenge in developing artificial agents able to operate in complex environments. Whereas deep learning can operate at a large scale thanks to recent hardware advancements (GPUs) as well as other important technical advancements like Stochastic Gradient Descent, logic inference cannot be executed over large reasoning tasks, as it requires considering a combinatorial number of possible assignments. Relational Neural Machines (RNMs) have recently been introduced in order to co-train a deep learning machine and a first-order probabilistic logic reasoner in a fully integrated way. In this context, it is crucial to prevent logic inference from becoming a bottleneck that would preclude the application of the methodology to large-scale learning tasks. This paper proposes and compares different inference schemata for Relational Neural Machines, together with some preliminary results showing the effectiveness of the proposed methodologies.

1 Introduction

Empowering machine learning with explicit reasoning capabilities is a key step toward a trustworthy and human-centric AI, where the decisions of the learners are explainable and come with human-understandable guarantees. While sub-symbolic approaches like deep neural networks have achieved impressive results in several tasks [7], standard neural networks can struggle to represent relational knowledge over different input patterns or relevant output structures. Recently, some work has been done to learn and inject relational features into the learning process [17, 16]. Symbolic approaches [3, 18] based on probabilistic logic reasoners can perform an explicit inference process in the presence of uncertainty. Another related line of research studies hybrid approaches that leverage deep learning schemas and neural networks to learn the structure of the reasoning process, as done, for instance, by Neural Theorem Provers [19] or TensorLog [11].

However, bridging the gap between the symbolic and sub-symbolic levels is still an open problem, which has recently been addressed by neuro-symbolic approaches [12, 20]. Hu et al. [10] inject prior knowledge into the network weights via a distillation process, but with no guarantee that the logic will be properly generalized to the test cases. Deep Structured Models [1] and Hazan et al. [9] impose statistical structure on the output predictions. The Semantic Loss [22] defines a loss which encodes the desired output structure; however, the loss does not define a probabilistic reasoning process, limiting the flexibility of the approach. DeepProbLog [13] extends the probabilistic logic programming language ProbLog [3] with predicates implemented by a deep learner. This approach is powerful but limited to cases where exact inference is possible. Deep Logic Models [14] improve over related approaches like Semantic-Based Regularization [4] and Logic Tensor Networks [5], but they fail to perfectly match the discrimination abilities of a purely supervised learner.

Relational Neural Machines (RNMs) [15] can perfectly replicate the effectiveness of training deep architectures from supervised data, while integrating the full expressivity and rich reasoning process of Markov Logic Networks [18]. However, RNMs, like any probabilistic reasoner based on a graphical model representing the fully grounded FOL knowledge, are strongly limited in the scale at which they can operate. Indeed, the large combinatorial number of possible assignments, together with the complex causal dependency structure of the inference, requires devising appropriate approximate inference algorithms. This paper proposes and studies different new inference solutions designed to be effective for RNMs.

The outline of the paper is as follows. Section 2 presents the model and how it can be used to integrate logic and learning. Sections 3 and 4 study tractable approaches to perform training and inference with the model, respectively. Section 5 shows the experimental evaluation of the proposed ideas on various datasets. Finally, Section 6 draws some conclusions and highlights some planned future work.

2 Model

A Relational Neural Machine establishes a probability distribution over a set of n output variables of interest y = \{y_1, \ldots, y_n\}, given a set of predictions made by one or multiple deep architectures and the model parameters.
, yn }, given proaches [3, 18] based on probabilistic logic reasoners can per- a set of predictions made by one or multiple deep architectures, and form an explicit inference process in presence of uncertainty. An- the model parameters. In this paper the output variables are assumed other related line of research studies hybrid approaches leveraging to be binary, i.e. yi = {0, 1}. deep learning schemas and neural networks to learn the structure of Unlike standard neural networks which compute the output via a the reasoning process like done, for instance, by Neural Theorem simple forward pass, the output computation in an RNM can be de- Provers [19] or TensorLog [11]. composed into two stages: a low-level stage processing the input pat- However, bridging the gap between symbolic and sub-symbolic terns, and a subsequent semantic stage, expressing constraints over levels is still an open problem which has been recently addressed by the output and performing higher level reasoning. The first stage pro- neuro-symbolic approaches [12, 20]. Hu et al. [10] inject the prior cesses D input patterns x = {x1 , . . . , xD }, returning the values f knowledge into the network weights via a distillation process but using the network with parameters w. The higher layer takes as input with no guarantee that the logic will be properly generalized to the f and applies reasoning using a set of constraints, whose parameters test cases. Deep Structured Models [1] and Hazan et al. [9] imposes are indicated as λ, then it returns the set of output variables y. statistical structure on the output predictions. The Semantic Loss [22] A RNM model defines a conditional probability distribution in the defines a loss which encodes the desired output structure. However, exponential family defined as: the loss does not define a probabilistic reasoning process, limiting ! 1 Copyright 1 X c 2020 for this paper by its authors. Use permitted under Cre- p(y|f , λ) = exp λc Φc (f , y) ative Commons License Attribution 4.0 International (CC BY 4.0). Z c 2 Department of Computer Science, KU Leuven, Belgium 3 Department of Information Engineering and Science, University of Siena, where Z is the partition function and the C potentials Φc express Italy some properties on the input and output variables. The parameters flion (x1 ) probability distribution: fzoo (x1 )   fsavanna (x1 ) 1 X X X p(y|f , λ) = exp φ0 (f (x), y(x))+ λc φc (yc,g ) y3 = lion(x1 ) Z x c y c,g y1 = savanna(x1 ) y2 = zoo(x1 ) This will allow to develop the data embeddings as part of training by y7=sameloc(x1 ,x2 ) enforcing the consistency between the reasoner and network outputs, while distilling the logical knowledge into the network weights. Figure 1 shows the graphical model obtained for a simple multi- y6=lion(x2 ) class image classification task. The goal of the training process is to train the classifiers approximating the predicates, but also to establish y4 = savanna(x2 ) y5 = zoo(x2 ) the relevance of each rule. For example, in an image classification task, the formula ∀x Antelope(x) ∧ Lion(x) is likely to be associ- ated to a higher weight than ∀x P olarBear(x) ∧ Lion(x), which are unlikely to correlate in the data. flion (x2 ) fsavanna (x2 ) fzoo (x2 ) 3 Training Figure 1. 
Neuro-symbolic integration can be obtained by employing one potential \Phi_0, enforcing consistency with the supervised data, together with potentials representing the logic knowledge. Following an approach similar to Markov Logic Networks, a set of First-Order Logic (FOL) formulas is given as input to the system, and a potential \Phi_c is associated with each formula. It is assumed that some (or all) of the predicates in the knowledge base are unknown and need to be learned together with the parameters driving the reasoning process.

In the following we refer to a grounded expression (the same applies to atoms and predicates) as a FOL rule whose variables are assigned to specific constants. It is assumed that the undirected graphical model is built such that: each grounded atom corresponds to a node in the graph, and all the nodes corresponding to grounded atoms co-occurring in at least one rule are connected in the graph. As a result, there is one clique (and hence one potential) for each grounding g_c of a formula in y. It is assumed that all the potentials resulting from the c-th formula share the same weight \lambda_c; therefore, the potential \Phi_c is the sum over all groundings of \phi_c in the world y, that is \Phi_c(y) = \sum_{y_{c,g}} \phi_c(y_{c,g}), where \phi_c(y_{c,g}) assumes the value 1 or 0 if the grounded formula holds true or false, respectively. This yields the probability distribution:

p(y \mid f, \lambda) = \frac{1}{Z} \exp\Big( \sum_x \phi_0(f(x), y(x)) + \sum_c \lambda_c \sum_{y_{c,g}} \phi_c(y_{c,g}) \Big)

This allows the data embeddings to be developed as part of training, by enforcing the consistency between the reasoner and the network outputs while distilling the logical knowledge into the network weights.

Figure 1 shows the graphical model obtained for a simple multi-class image classification task. The goal of the training process is to train the classifiers approximating the predicates, but also to establish the relevance of each rule. For example, in an image classification task, the formula \forall x\, Antelope(x) \wedge Lion(x) is likely to be associated with a higher weight than \forall x\, PolarBear(x) \wedge Lion(x), as the latter classes are unlikely to correlate in the data.

Figure 1. The graphical model representing an RNM, where the output variables y depend on the output of the first stage f, processing the inputs \{x_1, x_2\}, instantiated for the rules \forall x\, lion(x) \Rightarrow savanna(x) \vee zoo(x), \forall x \forall x'\, sameloc(x, x') \wedge savanna(x) \Rightarrow savanna(x'), and \forall x \forall x'\, sameloc(x, x') \wedge zoo(x) \Rightarrow zoo(x').

3 Training

The computation of the partition function requires a summation over all possible assignments of the output variables, which is intractable in all but trivial cases. A particularly interesting case is when the partition function is assumed to factorize over the potentials, as done in piecewise likelihood [21]:

Z \approx \prod_c Z_c = \prod_c \sum_{y'_c} \exp\big(\lambda_c \Phi_c(f, y'_c)\big)    (2)

where y_c is the subset of variables in y that are involved in the computation of \Phi_c. The piecewise-local probability for the c-th constraint can then be expressed as:

p^{PL}_c(y_c \mid f, \lambda) = \frac{\exp\big(\lambda_c \Phi_c(f, y_c)\big)}{Z_c}

Under this assumption, the factors can be distributed over the potentials, giving the following generalized piecewise likelihood:

p(y \mid f, \lambda) \approx \prod_c p(y_c \mid y \setminus y_c, f, \lambda_c) = \prod_c p^{PL}_c(y_c \mid f, \lambda)

If the variables in y are binary, the computation of Z requires a summation over all possible assignments, which has O(2^{|y|}) complexity. Using the local decomposition, this is reduced to O(|y_{\bar c}| \cdot 2^{n_{\bar c}}), where \bar c is the index of the formula corresponding to the potential with the largest number n_{\bar c} of variables to ground.

If the c-th constraint is factorized using the PL partitioning, the derivative of the log-likelihood with respect to the potential weights is:

\frac{\partial \log p(y \mid f, \lambda)}{\partial \lambda_c} \approx \Phi_c(f, y) - \mathbb{E}_{p^{PL}_c}[\Phi_c]

and with respect to the learner parameters:

\frac{\partial \log p(y \mid f, \lambda)}{\partial w_k} \approx \sum_i \frac{\partial f_i}{\partial w_k} \Big[ y_i - \mathbb{E}_{y' \sim p^{PL}}[y'_i] \Big]
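As an illustration of the piecewise update, the sketch below (a simplified, hypothetical implementation, not the authors' code) computes the local partition function Z_c of a single potential by enumerating only the variables that potential touches, and returns the gradient \Phi_c(f, y) - \mathbb{E}_{p^{PL}_c}[\Phi_c] with respect to \lambda_c; in a full trainer this term would be accumulated over all groundings and combined with the gradient flowing into the network weights.

```python
import itertools
import math

def local_lambda_gradient(Phi_c, lambda_c, f, y_obs, idx_c):
    """Piecewise-likelihood gradient d log p / d lambda_c for one potential.

    Phi_c(f, y_c) is evaluated on the sub-assignment y_c only; idx_c lists the
    indices of the output variables the potential depends on, so the local
    partition function Z_c costs 2^len(idx_c) instead of 2^|y|.
    """
    y_c_obs = tuple(y_obs[i] for i in idx_c)
    exp_scores, weighted = [], []
    for y_c in itertools.product([0, 1], repeat=len(idx_c)):
        s = math.exp(lambda_c * Phi_c(f, y_c))
        exp_scores.append(s)
        weighted.append(s * Phi_c(f, y_c))
    Z_c = sum(exp_scores)                       # local partition function
    expected_phi = sum(weighted) / Z_c          # E_{p_c^PL}[Phi_c]
    return Phi_c(f, y_c_obs) - expected_phi

# Grounded rule lion => savanna OR zoo over the output variables (0, 1, 2).
rule = lambda f, y_c: 1.0 if (not y_c[0]) or y_c[1] or y_c[2] else 0.0
grad = local_lambda_gradient(rule, lambda_c=1.0, f=None, y_obs=(1, 1, 0), idx_c=(0, 1, 2))
print(grad)  # > 0: the observed world satisfies the rule more often than the model expects
```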
EM. When the world is not fully observed during training, an iterative Expectation-Maximization (EM) schema can be used: the unobserved data are marginalized in the expectation step, using the inference methodologies described in the next section; the average constraint satisfaction is then recomputed and, finally, the \lambda, w parameters are updated in the maximization step. This process is iterated until convergence.

4 Inference

This section proposes some general methodologies which can be used to make RNM inference tractable.

Inference tasks can be sub-categorized into different groups. In particular, MAP inference methods search for the most probable assignment of y given the evidence and the fixed parameters w, \lambda. The problem of finding the best assignment y^\star to the unobserved query variables, given the evidence y^e and the current parameters, can be stated as:

y^\star = \arg\max_{y'} \sum_c \lambda_c \Phi_c(f, [y', y^e])    (3)

where [y', y^e] indicates a full assignment to the y variables, split into the query and evidence sets.

On the other hand, MARG inference methods compute the marginal probability of a set of random variables given some evidence, summing the probability of an assignment of a query variable y_q over all possible worlds. MARG inference for a single query variable is defined as:

p(y_q \mid \lambda, f, y^e) = \sum_{y \setminus \{y^e, y_q\}} p(y \mid f, \lambda, y^e), \quad \forall y_q \in y

Both MAP and MARG inference are intractable in the most general cases, as they require considering all possible assignments. Therefore, approximate methods must be devised in order to tackle the most interesting applications. This section proposes a few inference solutions that can be naturally applied to RNMs.

Piecewise MAP Inference. MAP inference in RNMs requires evaluating all possible 2^{|y|} assignments, which is generally intractable. A possible solution is to employ the piecewise approximation (Equation 2) to separately optimize each single factor, reducing the complexity to 2^{n_{\hat c}}, with n_{\hat c} the size of the largest factor. The main issue with this solution is that the same variable can be present in different factors, and the piecewise assignments can be inconsistent across the factors. Variables shared across factors can be assigned by selecting the value chosen by the majority of the factors.

Fuzzy Logic MAP Inference. The y values can be relaxed into the [0, 1] interval, assuming that each potential \Phi_c(f, y) has a continuous fuzzy-logic [8] surrogate \Phi^s_c(f, y) which collapses into the original potential when the y assume crisp values and is continuous with respect to each y_i. When the potentials are relaxed to accept continuous variables, the MAP problem stated by Equation 3 can be solved by gradient-based techniques, computing the derivative with respect to each output variable:

\frac{\partial \log p(y' \mid y^e, f, \lambda)}{\partial y_k} = f_k + \sum_c \lambda_c \frac{\partial \Phi^s_c(f, [y', y^e])}{\partial y_k}

Different t-norm fuzzy logics have been considered as continuous relaxations of logic formulas [4], e.g. Gödel, Łukasiewicz and Product logics. Furthermore, a fragment of the Łukasiewicz logic [6] has recently been proposed to translate logic inference into a convex optimization problem.
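The sketch below illustrates the fuzzy relaxation idea on the rule lion ⇒ savanna ∨ zoo, using a Product-logic surrogate and plain projected gradient ascent over relaxed outputs in [0, 1]; it is a minimal, dependency-free example rather than the authors' implementation, and the surrogate, step size and number of steps are arbitrary choices made for illustration.

```python
def product_surrogate(y):
    # Product-logic surrogate of lion => savanna OR zoo:
    # 1 - lion * (1 - savanna) * (1 - zoo); equals the 0/1 potential on crisp y.
    lion, savanna, zoo = y
    return 1.0 - lion * (1.0 - savanna) * (1.0 - zoo)

def fuzzy_map(f, surrogates, lambdas, steps=200, lr=0.05, eps=1e-4):
    """Gradient-ascent MAP over relaxed outputs y in [0, 1]^n.

    Objective: sum_k f_k * y_k + sum_c lambda_c * Phi_c^s(y), i.e. a relaxed
    version of Equation (3). Finite-difference gradients keep the sketch
    dependency-free; an autodiff framework would be used in practice.
    """
    def objective(y):
        return sum(fk * yk for fk, yk in zip(f, y)) + \
               sum(l * s(y) for l, s in zip(lambdas, surrogates))

    y = [0.5] * len(f)                    # start from the maximally uncertain point
    for _ in range(steps):
        grad = []
        for k in range(len(y)):
            y_hi, y_lo = list(y), list(y)
            y_hi[k] = min(1.0, y[k] + eps)
            y_lo[k] = max(0.0, y[k] - eps)
            grad.append((objective(y_hi) - objective(y_lo)) / (y_hi[k] - y_lo[k]))
        # Projected gradient ascent step, keeping y inside [0, 1].
        y = [min(1.0, max(0.0, yk + lr * gk)) for yk, gk in zip(y, grad)]
    return [round(yk) for yk in y]        # crisp assignment after convergence

print(fuzzy_map(f=[2.0, -1.0, -0.5], surrogates=[product_surrogate], lambdas=[4.0]))
```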
Piecewise MARG. The piecewise approximation defined by Equation 2 allows the probability of an assignment to be marginalized over the single factors, so that MARG inference can be performed efficiently as:

p(y_q \mid \lambda, f, y^e) \approx \sum_{y_u = y \setminus (y^e, y_q)} \Big[ p^{PL}_0(y_u, y_q \mid f) \cdot \prod_c p^{PL}_c(y_u, y_q \mid \lambda_c, f, y^e) \Big]

Finally, the shared variables can be reconciled by selecting, for each shared variable, the assignment with the highest marginal probability.

Piecewise Gibbs. Gibbs sampling can be used to sample from the distribution and then select the sample with the highest probability:

y^\star = \arg\max_{y' \sim p} \sum_c \lambda_c \Phi_c(f, [y', y^e])

In order to speed up the process, a blocked Gibbs sampler may be considered, grouping several variables and sampling from their joint distribution. For instance, piecewise approaches suggest grouping the variables belonging to a potential, exploiting the constraints expressed by the potential on the samples. To speed up the process further, a flip can be accepted only if it yields a strictly greater probability value (Monotonic Gibbs sampling).

Piecewise Gibbs with Fuzzy MAP Inference. Gibbs sampling generally requires a high number of samples to converge to the correct distribution. Hybrid inference methodologies can be used to reduce the burn-in time by starting the Gibbs sampler from the MAP solution found using efficient approximate inference methods like Fuzzy Logic MAP. The sampler then modifies the solution by iteratively sampling from the piecewise local distributions. Combining Fuzzy Logic MAP and a Gibbs sampler avoids the low-quality local minima where fuzzy MAP solutions can get stuck, while speeding up the convergence of the Gibbs sampling.

Relax, Compensate and Recover. Relax, Compensate & Recover (RCR) [2] is an algorithm that transforms a graphical model into a simplified model in which inference is tractable. The simplified model is modified while running the algorithm, by computing compensations to recover a solution as close as possible to the correct distribution.

Graph Neural Networks. A few recent works show that inference in probabilistic models can be approximated by using Graph Neural Networks [23]. This would allow an end-to-end neural RNM formulation to be defined.
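To make the sampling-based alternatives concrete, here is a schematic (and hypothetical) routine implementing only the monotonic, blocked acceptance rule sketched above: starting from a given assignment, e.g. a rounded fuzzy MAP solution, it re-samples the variable block attached to one potential at a time and accepts a proposal only if the unnormalized score strictly increases.

```python
import itertools
import random

def score(y, f, potentials, lambdas):
    # Log of the unnormalized RNM probability: all that MAP search needs.
    return sum(fk * yk for fk, yk in zip(f, y)) + \
           sum(l * Phi(f, y) for l, Phi in zip(lambdas, potentials))

def monotonic_blocked_gibbs(y_init, f, potentials, lambdas, blocks, sweeps=20, seed=0):
    """Greedy blocked sampler in the spirit of 'Monotonic Gibbs sampling':
    re-sample one potential's variable block at a time and accept a proposal
    only if the score strictly increases.

    blocks[c] lists the indices of the variables grounded in potential c, so
    each proposal enumerates only 2^len(blocks[c]) candidate assignments.
    """
    rng = random.Random(seed)
    y, best = list(y_init), score(y_init, f, potentials, lambdas)
    for _ in range(sweeps):
        for block in rng.sample(blocks, len(blocks)):        # random sweep order
            for proposal in itertools.product([0, 1], repeat=len(block)):
                y_new = list(y)
                for idx, val in zip(block, proposal):
                    y_new[idx] = val
                s = score(y_new, f, potentials, lambdas)
                if s > best:                                 # strict improvement only
                    y, best = y_new, s
    return y, best

# Warm start from a (hypothetical) rounded fuzzy MAP solution, then refine.
rule = lambda f, y: 1.0 if (not y[0]) or y[1] or y[2] else 0.0
print(monotonic_blocked_gibbs(y_init=[1, 0, 0], f=[2.0, -1.0, -0.5],
                              potentials=[rule], lambdas=[4.0], blocks=[(0, 1, 2)]))
```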
5 Experiments

The experimental evaluation aims at testing some of the proposed inference methods. The evaluation is still partial, as more methods are currently being implemented.

The evaluation is carried out on a toy task designed to highlight the capability of RNMs to learn and employ soft rules that hold only for a sub-portion of the whole dataset. The MNIST dataset contains images of single handwritten digits. In this task it is assumed that additional relational logic knowledge is available in the form of a binary predicate link connecting image pairs. Given two images x, y, whose corresponding digits are denoted by i, j, a link between x and y is established if the second digit follows the first one, i.e. j = i + 1. However, the link predicate can be noisy, such that there is a certain probability that link(x, y) is established even when j \neq i + 1. The knowledge about the link predicate can be represented by the FOL formulas:

\forall x \forall y \;\; link(x, y) \wedge digit(x, i) \Rightarrow digit(y, i + 1), \quad i = 0, \ldots, 8,

where digit(x, i) is a binary predicate indicating whether i is the digit class of the image x. Since the link predicate holds true also for pairs of non-consecutive digits, the above rule is violated by a certain percentage of digit pairs. Therefore, the manifolds established by the formulas can help in driving the predictions, but the noisy links force the reasoner to be flexible about how to employ the knowledge.

The training set is created by randomly selecting 1000 images from the MNIST dataset and by adding the link relation with an increasing degree of noise. For each degree of noise in the training set, we created an equally sized test set with the same degree of noise. A neural network with 100 hidden ReLU neurons is used to process the input images.
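For concreteness, the following sketch shows one way the noisy link relation described above could be generated; it is an assumption about the data-construction procedure, not the authors' released code, and the helper names are hypothetical.

```python
import random

def make_noisy_links(digits, n_links, noise, seed=0):
    """Build link(x, y) pairs over image indices given their digit labels.

    With probability 1 - noise a link connects a pair with consecutive digits
    (digit[y] == digit[x] + 1, matching the FOL rule above); with probability
    noise it connects a random pair, so the rule is violated for roughly that
    fraction of the links (minus the pairs that are consecutive by chance).
    """
    rng = random.Random(seed)
    by_digit = {d: [i for i, di in enumerate(digits) if di == d] for d in range(10)}
    links = []
    while len(links) < n_links:
        if rng.random() >= noise and all(by_digit[d] for d in range(10)):
            d = rng.randint(0, 8)                            # consecutive pair d -> d + 1
            x, y = rng.choice(by_digit[d]), rng.choice(by_digit[d + 1])
        else:
            x, y = rng.randrange(len(digits)), rng.randrange(len(digits))
        links.append((x, y))
    return links

def violation_rate(links, digits):
    # Fraction of grounded links for which digit(y) = digit(x) + 1 fails.
    return sum(1 for x, y in links if digits[y] != digits[x] + 1) / len(links)

label_rng = random.Random(1)
digits = [label_rng.randint(0, 9) for _ in range(1000)]      # stand-in for MNIST labels
links = make_noisy_links(digits, n_links=2000, noise=0.4)
print(violation_rate(links, digits))                         # roughly 0.4 * 0.9 here
```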
Table 1 reports the accuracy of the RNM inference methods against the baseline provided by the neural network, as the percentage of links that are predictive of one digit following another is varied. The EM-based schema tends to consistently outperform the other methods: EM also improves the underlying neural network by back-propagating the reasoning predictions to the learner. The other inference methods will be tested within EM as well. Fuzzy MAP tends to find good solutions, constantly improving over the baseline. It is also important to notice how robust RNM is with respect to high link-noise levels: even when the relational data carries no useful information, the final solution still matches the baseline.

Table 1. Accuracy of the different inference methods with respect to the percentage of links that are established between digit image pairs violating the logical rule.

% link noise   NN     Fuzzy MAP   Piecewise Gibbs + Fuzzy MAP   EM
0.0            0.78   1.00        1.00                          1.00
0.2            0.78   1.00        1.00                          1.00
0.4            0.78   0.99        0.98                          0.96
0.6            0.78   0.89        0.88                          0.96
0.8            0.78   0.86        0.64                          0.86
0.9            0.78   0.78        0.28                          0.78

6 Conclusions and Future Work

This paper presents different inference methods for Relational Neural Machines, a novel framework providing a tight integration between learning from supervised data and logic reasoning. As future work, we plan to undertake a larger experimental exploration of RNMs on more structured problems, using all the inference methods proposed in this paper.

ACKNOWLEDGEMENTS

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 825619.

REFERENCES

[1] Liang-Chieh Chen, Alexander Schwing, Alan Yuille, and Raquel Urtasun, 'Learning deep structured models', in International Conference on Machine Learning, pp. 1785–1794, (2015).
[2] Arthur Choi and Adnan Darwiche, 'Relax, compensate and then recover', in JSAI International Symposium on Artificial Intelligence, pp. 167–180. Springer, (2010).
[3] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen, 'ProbLog: A probabilistic Prolog and its application in link discovery', in Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pp. 2468–2473, San Francisco, CA, USA, (2007). Morgan Kaufmann Publishers Inc.
[4] Michelangelo Diligenti, Marco Gori, and Claudio Saccà, 'Semantic-based regularization for learning and inference', Artificial Intelligence, 244, 143–165, (2017).
[5] Ivan Donadello, Luciano Serafini, and Artur S. d'Avila Garcez, 'Logic tensor networks for semantic image interpretation', in IJCAI International Joint Conference on Artificial Intelligence, pp. 1596–1602, (2017).
[6] Francesco Giannini, Michelangelo Diligenti, Marco Gori, and Marco Maggini, 'On a convex logic fragment for learning and reasoning', IEEE Transactions on Fuzzy Systems, (2018).
[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, volume 1, MIT Press, Cambridge, 2016.
[8] Petr Hájek, Metamathematics of Fuzzy Logic, volume 4, Springer Science & Business Media, 2013.
[9] Tamir Hazan, Alexander G. Schwing, and Raquel Urtasun, 'Blending learning and inference in conditional random fields', The Journal of Machine Learning Research, 17(1), 8305–8329, (2016).
[10] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard H. Hovy, and Eric P. Xing, 'Harnessing deep neural networks with logic rules', in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Berlin, Germany, Volume 1: Long Papers, (2016).
[11] William W. Cohen, Fan Yang, and Kathryn Rivard Mazaitis, 'TensorLog: Deep learning meets probabilistic databases', Journal of Artificial Intelligence Research, 1, 1–15, (2018).
[12] Navdeep Kaur, Gautam Kunapuli, Tushar Khot, Kristian Kersting, William Cohen, and Sriraam Natarajan, 'Relational restricted Boltzmann machines: A probabilistic logic learning approach', in International Conference on Inductive Logic Programming, pp. 94–111. Springer, (2017).
[13] Robin Manhaeve, Sebastijan Dumančić, Angelika Kimmig, Thomas Demeester, and Luc De Raedt, 'DeepProbLog: Neural probabilistic logic programming', arXiv preprint arXiv:1805.10872, (2018).
[14] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori, 'Integrating learning and reasoning with deep logic models', in Proceedings of the European Conference on Machine Learning, (2019).
[15] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, Marco Maggini, and Marco Gori, 'Relational neural machines', in Proceedings of the European Conference on Artificial Intelligence (ECAI), (2020).
[16] Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio, 'Holographic embeddings of knowledge graphs', in Thirtieth AAAI Conference on Artificial Intelligence, (2016).
[17] Mathias Niepert, 'Discriminative Gaifman models', in Advances in Neural Information Processing Systems, pp. 3405–3413, (2016).
[18] Matthew Richardson and Pedro Domingos, 'Markov logic networks', Machine Learning, 62(1), 107–136, (2006).
[19] Tim Rocktäschel and Sebastian Riedel, 'End-to-end differentiable proving', in Advances in Neural Information Processing Systems, pp. 3788–3800, (2017).
[20] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, Steven Schockaert, and Ondrej Kuzelka, 'Lifted relational neural networks: Efficient learning of latent relational structures', Journal of Artificial Intelligence Research, 62, 69–100, (2018).
[21] Charles Sutton and Andrew McCallum, 'Piecewise pseudolikelihood for efficient training of conditional random fields', in Proceedings of the 24th International Conference on Machine Learning, pp. 863–870. ACM, (2007).
[22] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck, 'A semantic loss function for deep learning with symbolic knowledge', in Proceedings of the 35th International Conference on Machine Learning (ICML), (July 2018).
[23] KiJung Yoon, Renjie Liao, Yuwen Xiong, Lisa Zhang, Ethan Fetaya, Raquel Urtasun, Richard Zemel, and Xaq Pitkow, 'Inference in probabilistic graphical models by graph neural networks', arXiv preprint arXiv:1803.07710, (2018).