On the Capabilities of Logic Tensor Networks for Deductive Reasoning

Federico Bianchi,1,2 Pascal Hitzler2
1 University of Milan-Bicocca
2 Wright State University
federico.bianchi@disco.unimib.it, pascal.hitzler@wright.edu

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019), Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

Neural-symbolic integration is a field in which classical symbolic knowledge mechanisms are combined with neural networks. This is done to obtain satisfactory computational capabilities from the network side while exploiting the descriptive power of symbolic reasoning. Logic Tensor Networks (LTNs) are a deep learning model that combines data with fuzzy logic to provide inference and reasoning mechanisms over data. While LTNs have been shown to be effective in some contexts, no detailed analysis of their capabilities for deductive logical reasoning has been conducted. In this paper we explore the capabilities and the limitations of LTNs in terms of deductive reasoning.

Introduction

Neural-symbolic learning and reasoning (Garcez, Lamb, and Gabbay 2008; Besold et al. 2017) involves integrating standard logical reasoning with neural networks, with the aim of providing fast and robust computational methods for reasoning and explanation over data. Logic Tensor Networks (LTNs) are a deep learning model from the neural-symbolic field: they integrate both logic and data in a neural network to support neural-symbolic learning and reasoning (Serafini and Garcez 2016). LTNs use first-order fuzzy logic to express knowledge about the world: fuzzy logic replaces the two truth values of classical first-order logic with continuous degrees of truth in the interval [0, 1].

Inputs to LTNs are data and axioms over (fuzzy) first-order predicate logic, e.g., parent(Ann, Susan) and ∀x, y : parent(x, y) → ancestor(x, y). Two key components of Logic Tensor Networks are the grounding of formulas and learning by best satisfiability. Formula grounding refers to the mapping of formulas to a vector space: constants are mapped to n-dimensional vectors, while function symbols are mapped to linear functions. A neural network can then compute the degree of truth of a given formula from the embedded representations of constants and symbols.

Deep learning models (Goodfellow, Bengio, and Courville 2016) usually learn by optimizing a loss function; in LTNs this task is replaced with the task of best satisfiability: the model has to optimize the representation of each atom, function, and predicate in such a way that the satisfiability of each formula is maximized. In this way the network learns the best possible parameters to represent both data and axioms.

The main advantages of LTNs are the following: i) it is possible to express knowledge using logical axioms over data, ii) it is possible to tackle and solve standard machine learning tasks (e.g., classification), and iii) it is possible to provide explanations using fuzzy logic over the trained network. Indeed, after training it is possible to make fuzzy inferences over data to obtain the degree of truth with respect to certain predicates. The model was tested with promising results on simple reasoning tasks (Serafini and Garcez 2016) and on semantic image interpretation (Donadello, Serafini, and d'Avila Garcez 2017).

An initial exploration of the reasoning capabilities of LTNs was done on the well-known smokers-friends-cancer dataset (Serafini and Garcez 2016). The dataset contains data about two groups of people for which friendship relations and smoking habits are given, while the fact of having cancer or not is given only for people in the first group. Axioms related to smoking (e.g., smoking implies cancer) are given to the network, which learns to predict whether people in the second group have cancer from the patterns present in the first group.
More recently, LTNs were used on a semantic image interpretation task in which they learned to classify bounding boxes of images with the help of background knowledge (Donadello, Serafini, and d'Avila Garcez 2017). Still, an in-depth analysis of the deductive reasoning capabilities of LTNs remains to be done.

In this work we explore LTNs in the context of reasoning tasks, showing insights and properties of the model. We introduce two simple datasets that contain relationships and we define additional axioms over these datasets. These two datasets are used to evaluate deductive reasoning capabilities. We also perform some experiments on the computation time that is required to learn model parameters. Our results show that LTNs are a good model that can fit the data well and that is able to draw simple deductive inferences. The real added value of the model is that it lends itself to explanations, since it allows us to make after-training fuzzy inferences over the data. Nevertheless, the model generates some errors, in particular when multi-hop inferences are to be drawn, and thus some refinements of the general model might be required to improve the results.

The rest of the paper is organized as follows: in Section 2 we describe LTNs, showing the basic definitions and the learning process; in Section 3 we introduce our experimental setting and we describe and evaluate the results of our experiments; Section 4 discusses related work. Finally, we end the paper in Section 5 with some conclusions and future work.

Logic Tensor Networks

LTNs use first-order fuzzy logic (Petr 1998) and embed atoms, functions, and predicates in a vector space. LTNs are inspired by Neural Tensor Networks (Socher et al. 2013), which have been shown to be effective in natural logic reasoning tasks (Bowman, Potts, and Manning 2015). In the following sections we give a short primer on Logic Tensor Networks and their learning methodology. More details on LTNs can be found in the paper in which they were first introduced (Serafini and Garcez 2016); to describe LTNs we follow the definitions given there.

Logic

LTNs are implemented over a logic called Real Logic that is described by a language L containing a set of constants C, a set of function symbols F, and a set of predicates P. In this language the rules of fuzzy logic apply, and connectives are interpreted as operations over real numbers in [0, 1]. For example, t-norms are used in place of the conjunction of classical logic. A t-norm is an operation [0, 1]^2 → [0, 1], and different versions of the operation exist (Lukasiewicz, Gödel, and product t-norms are some possible examples). Once the t-norm is chosen, the other connectives can be defined with respect to it. Thus, the use of t-norms and the other fuzzy connectives allows us to operate on real values in the interval [0, 1].
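To make these connectives concrete, the following is a minimal Python sketch (ours, not code from the LTN implementation) of the three t-norms named above, together with the standard fuzzy negation and the disjunction dual to the product t-norm:

```python
# Minimal sketch of fuzzy connectives over truth values in [0, 1].
# These are textbook definitions, not code from the LTN implementation.

def t_norm_lukasiewicz(a: float, b: float) -> float:
    """Lukasiewicz t-norm: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def t_norm_goedel(a: float, b: float) -> float:
    """Goedel (minimum) t-norm."""
    return min(a, b)

def t_norm_product(a: float, b: float) -> float:
    """Product t-norm."""
    return a * b

def neg(a: float) -> float:
    """Standard fuzzy negation."""
    return 1.0 - a

def t_conorm_product(a: float, b: float) -> float:
    """Disjunction dual to the product t-norm (probabilistic sum)."""
    return a + b - a * b

# Example: degree of truth of a conjunction under each t-norm.
print(t_norm_lukasiewicz(0.7, 0.6),  # -> 0.3
      t_norm_goedel(0.7, 0.6),       # -> 0.6
      t_norm_product(0.7, 0.6))      # -> 0.42
```

Note how the three t-norms agree on the classical values 0 and 1 but assign different degrees of truth to intermediate values; this choice propagates to all derived connectives.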
Grounding

Each element of the language L is grounded in the vector space. Constants are mapped to vectors in R^n, while function symbols are mapped to functions over the vector space: an m-ary function symbol is mapped to a function R^(m·n) → R^n. Predicates are mapped to functions with co-domain [0, 1], i.e., R^(m·n) → [0, 1]; a predicate is thus grounded as a fuzzy subset that defines the degree of truth (membership in the set) of that predicate given its arguments.

Networks

The dimensionality n of the constant vectors is a hyper-parameter of the model. While constants are mapped to vectors, functions and predicates are mapped to actual operations over the vector space. We will use G(f) and G(P) to denote the groundings of functions and predicates. Function symbols are implemented as linear functions: if f is a function symbol of arity m and v_1, ..., v_m ∈ R^n are the groundings of m terms, then the grounding of f can be expressed as

G(f)(v_1, ..., v_m) = M_f v + B_f    (1)

where v = ⟨v_1, ..., v_m⟩ is the concatenation of the argument vectors, M_f is a transformation matrix, and B_f is a bias. This operation can be encoded as a one-layer neural network.

Predicates are instead mapped to neural tensor operations (Socher et al. 2013); the output of the neural tensor network is passed through a sigmoid so that the final output of the predicate is a value in the interval [0, 1]. The tensor operation is the following:

G(P)(v) = σ(u_P^T tanh(v^T W_P^[1:k] v + V_P v + B_P))    (2)

where σ is the sigmoid function, W_P, V_P, B_P, and u_P are parameters to be learned by the network, and k corresponds to the layer size of the tensor, a hyper-parameter of the network.
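As an illustration of Equations (1) and (2), the following NumPy sketch grounds a binary function symbol and a binary predicate. The dimensions, the random initialization, and all names are illustrative assumptions on our side, not the authors' code:

```python
import numpy as np

n = 10  # embedding size of constants (hyper-parameter)
k = 5   # tensor layer size (hyper-parameter)
m = 2   # arity of the function/predicate in this example

rng = np.random.default_rng(0)

# Equation (1): a function symbol f of arity m is a linear map.
M_f = rng.normal(size=(n, m * n))   # transformation matrix
B_f = rng.normal(size=n)            # bias

def ground_function(*terms):
    """G(f)(v_1, ..., v_m) = M_f <v_1, ..., v_m> + B_f."""
    v = np.concatenate(terms)       # <v_1, ..., v_m> in R^(m*n)
    return M_f @ v + B_f

# Equation (2): a predicate P of arity m is a neural tensor operation
# followed by a sigmoid, so its output lies in [0, 1].
W_P = rng.normal(size=(k, m * n, m * n))  # tensor slices W_P^[1:k]
V_P = rng.normal(size=(k, m * n))
B_P = rng.normal(size=k)
u_P = rng.normal(size=k)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ground_predicate(*terms):
    """G(P)(v) = sigma(u_P^T tanh(v^T W_P^[1:k] v + V_P v + B_P))."""
    v = np.concatenate(terms)
    bilinear = np.einsum("i,kij,j->k", v, W_P, v)  # one value per tensor slice
    return sigmoid(u_P @ np.tanh(bilinear + V_P @ v + B_P))

susan, ann = rng.normal(size=n), rng.normal(size=n)  # groundings of constants
print(ground_function(susan, ann).shape)   # a new point in R^n
print(ground_predicate(susan, ann))        # degree of truth of parent(Susan, Ann)
```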
Quantifiers like ∀ are defined in fuzzy logic through aggregation functions (such as the min). Strictly, a universal quantifier would require an aggregation over an infinite number of instances, which is impossible to compute; quantifiers are therefore implemented as aggregation operations over a finite subset of the domain. Different implementations can be used for the universal quantifier, for example the mean, the min, and the hmean (harmonic mean).

Learning to Satisfy Formulas

LTNs reduce the learning problem to a maximum satisfiability problem: the task is to find groundings for atoms, predicates, and formulas that maximize the satisfiability of the given formulas. For example, given the formula parent(Susan, Ann), which describes the fact that Susan is one of Ann's parents, the network will try to optimize the grounding of the predicate parent (i.e., the parameters in the tensor layer) and the groundings of Susan and Ann (i.e., their respective vectors) in such a way that the degree of truth of the formula is close to 1. Thus, the groundings are both the embedded representations of the atoms and the network parameters that represent functions and predicates; the values of these components can be learned through back-propagation (Goodfellow, Bengio, and Courville 2016). The output of the learning process is a satisfiability score (in the interval [0, 1]) that can be considered similar to the value of the loss function in a standard deep learning setting.

We now show an example of how grounding and satisfiability are combined. For compactness, we denote the grounding of each element with a superscript G: given the formula P(x, y) ∧ R(w, z), the groundings of the constants x, y, w, and z are retrieved (e.g., x^G). P and R are grounded to their respective operations, yielding P^G(x^G, y^G) ∧ R^G(w^G, z^G). The output of each predicate is a real value in [0, 1], and the two values are aggregated with the t-norm. LTNs learn to optimize the groundings in such a way that the final value is close to 1 (i.e., the formula is satisfied).
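The following is a deliberately simplified TensorFlow 2 sketch of this optimization for the formula P(x, y) ∧ R(w, z), assuming a plain sigmoid layer as the predicate grounding instead of the full tensor network of Equation (2); all names and dimensions are ours:

```python
import tensorflow as tf

# Toy best-satisfiability sketch (ours, not the authors' implementation):
# learn 2-dimensional groundings for four constants so that the formula
# P(x, y) AND R(w, z) has a degree of truth close to 1.
emb = tf.Variable(tf.random.normal((4, 2)))   # groundings of x, y, w, z
w_P = tf.Variable(tf.random.normal((4, 1)))   # a linear stand-in for P
w_R = tf.Variable(tf.random.normal((4, 1)))   # a linear stand-in for R

def predicate(w, a, b):
    """Simplified predicate grounding: sigmoid of a linear layer."""
    v = tf.concat([a, b], axis=0)[None, :]    # concatenated arguments
    return tf.sigmoid(v @ w)[0, 0]

opt = tf.keras.optimizers.Adam(learning_rate=0.05)
for step in range(500):
    with tf.GradientTape() as tape:
        p = predicate(w_P, emb[0], emb[1])    # P(x, y)
        r = predicate(w_R, emb[2], emb[3])    # R(w, z)
        sat = p * r                           # conjunction via product t-norm
        loss = -sat                           # maximize satisfiability
    grads = tape.gradient(loss, [emb, w_P, w_R])
    opt.apply_gradients(zip(grads, [emb, w_P, w_R]))

print(float(sat))  # close to 1 after training
```

A universally quantified formula would replace `sat` with an aggregation (e.g., the mean or the harmonic mean) of the formula's truth value over all tuples of constants.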
Experiments

In this experimental section we aim to answer the following questions: i) what can LTNs learn, and ii) how fast is the LTN learning phase. To allow easy replication of our experiments we first describe the datasets we use and then introduce the general methodology we followed; details related to a particular experiment are given in the corresponding section. For our experiments we use the original LTN TensorFlow implementation(1) provided by the authors (Serafini and Garcez 2016). Datasets, code, and results are available online with specific instructions on how to repeat our experiments(2). We briefly summarize the four experiments we ran:

• Experiments 1 and 2 concentrate on a knowledge base completion task in which we give the network only true predicates and some axioms;
• Experiment 3 compares LTNs with a simple deep learning baseline to provide insights about the strengths and the limits of the model;
• Experiment 4 reports computational times for learning with LTNs.

(1) https://github.com/logictensornetworks
(2) https://github.com/vinid/ltns-experiments

Definitions

By KBS we denote an input (starting) knowledge base, and KB denotes the corresponding completed knowledge base (i.e., with all relevant logical consequences added). KBT denotes the set of all inferences not in KBS, i.e., KBT = KB \ KBS. In the experiments we will often show the performance over both KB and KBT, putting results related to KBT within parentheses.

Datasets

We use two main datasets for our experiments. The first one, called dataset A, represents a taxonomy that mainly contains hierarchies of classes (inspired by the DBpedia Ontology(3)). The taxonomy contains 25 nodes; each node but one (the root) has an outgoing edge to its superclass (e.g., Cat is connected to Feline). Figure 1 shows the taxonomy used in the experiment. The second dataset, P, is a parent-ancestor dataset that contains 17 nodes. Edges connect parents to one or more children, for a total of 22 parental relationships. Figure 2 shows the parental relationships.

[Figure 1: Taxonomy that represents the content of the A dataset.]

[Figure 2: Representation of the parental relationships in the P dataset.]

(3) http://dbpedia.org

We test these two datasets on tasks with heavily unbalanced classes (more negative examples than positive ones). While our datasets are small compared to the ones currently used for knowledge base completion tasks (Bordes et al. 2013), we think that the results of our experiments can point out interesting capabilities of LTNs: can LTNs perform deductive reasoning over these simple datasets? Moreover, with these small datasets, results can be manually inspected to better understand where and how the models fail to produce correct answers, a task that is more difficult with big knowledge bases.

Methodology

Given a dataset, we define a set of axioms and test a knowledge base completion task, reporting hyper-parameter details. LTNs receive as input data in the form of predicates (e.g., parent(Ann, Susan)) and axioms (e.g., ∀x, y : parent(x, y) → ancestor(x, y)); the network learns groundings for all the parameters, and in the testing phase we analyze the predictions over the data. Since different configurations of hyper-parameters are possible, we run multiple models and re-run each model multiple times (to check variations due to random initialization). After a first phase of trial and error we fixed the following parameters: optimizer RMSprop, bias -1e-5, learning rate 0.01, decay 0.9. We tested three different aggregation functions for the universal quantifier (harmonic mean, mean, and min), two tensor layer sizes (10 and 20), and two embedding sizes for constants (10 and 20 dimensions).

Evaluation Measures

To evaluate the models we use the Mean Absolute Error (MAE), the Matthews correlation coefficient (Matthews 1975), which is often regarded as stable when classes are unbalanced, the F1 score, precision, and recall. For MAE we compute the absolute distance between the fuzzy predictions and the actual truth values; this gives us a continuous error value with which to compare models. When computing the other measures we round the scores to the nearest integer so that we compare only binary scores: we map prediction values higher than 0.5 to 1 and lower values to 0. While this is a strong approximation of the degree of truth given by fuzzy logic, it is still useful for understanding the performance of the model. We also report accuracy to summarize the performance of the model when necessary. In general, we select the best model for each experiment as the one with the highest F1 score.
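For concreteness, the evaluation protocol can be sketched as follows, with illustrative values in place of our predictions; we assume scikit-learn's standard metric implementations:

```python
import numpy as np
from sklearn.metrics import (matthews_corrcoef, f1_score,
                             precision_score, recall_score)

# Sketch of our evaluation protocol (illustrative values, not our data):
# fuzzy predictions in [0, 1] are compared to 0/1 ground-truth labels.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_fuzzy = np.array([0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.3, 0.2])

mae = np.abs(y_fuzzy - y_true).mean()    # MAE on the raw fuzzy scores
y_pred = (y_fuzzy > 0.5).astype(int)     # round: values above 0.5 become 1

print("MAE", mae)
print("Matthews", matthews_corrcoef(y_true, y_pred))
print("F1", f1_score(y_true, y_pred))
print("Precision", precision_score(y_true, y_pred))
print("Recall", recall_score(y_true, y_pred))
```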
Experiment 1: Taxonomy Reasoning

For the A dataset we ask the LTNs to learn the following axioms:

• ∀a, b, c ∈ A : (sub(a, b) ∧ sub(b, c)) → sub(a, c)
• ∀a ∈ A : ¬sub(a, a)
• ∀a, b ∈ A : sub(a, b) → ¬sub(b, a)

where sub denotes the subclass relation in the dataset (e.g., sub(Cat, Feline)). The objective of this experiment is to see whether LTNs can generate the transitive closure of the dataset using the axioms. The data contained in the A dataset is our KBS, while the edges needed to generate the transitive closure are our KBT. We compare the predictions of the LTNs (computed as the prediction for sub(x, y) given x, y) with the actual transitive closure of the graph. KBS contains only true predicates (e.g., sub(Cat, Feline)), while we ask the model to perform inferences also over predicates that are false (e.g., we evaluate sub(Feline, Cat) expecting a value close to 0).
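The gold standard for this task can be computed directly. The following sketch (ours, with a small illustrative fragment in place of the full taxonomy of Figure 1) builds KB as the transitive closure of KBS and derives KBT:

```python
# Sketch of how the gold standard for Experiment 1 can be built: KB is the
# transitive closure of the subclass edges in KB_S. Edges are illustrative.
kb_s = {("Cat", "Feline"), ("Feline", "Mammal"), ("Mammal", "Animal")}

def transitive_closure(edges):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure
               if b == c and (a, d) not in closure}
        if not new:
            return closure
        closure |= new

kb = transitive_closure(kb_s)
kb_t = kb - kb_s            # the inferences not in the starting KB
print(sorted(kb_t))         # ('Cat', 'Animal'), ('Cat', 'Mammal'), ...
```

Every pair of nodes not in KB is then treated as a negative example, which is what makes the classes so unbalanced.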
Table 1 shows the results of the knowledge completion task for the top-performing model and one of the worst-performing ones: the top-performing model had a satisfiability of 0.99, while one of the worst had a satisfiability of 0.56. The top-performing model was initialized with a tensor layer size of 20 and an embedding dimension of 20; the best universal aggregator was the mean.

Table 1: Performance measures on the A dataset. Values outside the parentheses are computed over the complete KB, while those within parentheses are computed only on the part of the KB that was not in the initial set of data.

Satisfiability  MAE          Matthews     F1           Precision    Recall
0.99            0.12 (0.12)  0.58 (0.45)  0.64 (0.51)  0.60 (0.47)  0.68 (0.55)
0.56            0.51 (0.52)  0.09 (0.06)  0.27 (0.20)  0.20 (0.11)  0.95 (0.93)
Random          0.50 (0.50)  0.00 (0.00)  0.22 (0.17)  0.14 (0.10)  0.50 (0.50)

The best model fits the data well, since the F1 measure shows good performance over the entire knowledge base (F1 = 0.64). LTNs are prone to generating false positives: the model generates 36 false positives against 55 true positives, and 26 false negatives against 459 true negatives. The performance drops when we consider only the KBT elements for testing (F1 = 0.51), which means that in this case LTNs are not able to capture some of the more complicated inferences. Still, the approach is better than a binary random baseline. The accuracy of the model with the best satisfiability is 0.89, while a naive classifier that predicts only zeros would have reached an accuracy of 0.85; this is important to remark since the two classes are ill-balanced.

Qualitative Analysis. Analyzing the predictions of the LTNs, we found that in some cases the model correctly predicts multi-hop logical inferences (e.g., sub(Cat, Animal) close to 1) but fails on other simple inferences (e.g., sub(Cat, Bird) close to 1). When there is not enough information regarding the relationship between two elements (e.g., Cat and Bird), the model has difficulties predicting the correct answer.

Summary of the outcomes

• LTNs fit the data well;
• Multi-hop inferences tend to be more difficult;
• As expected, performance increases with satisfiability.

Experiment 2: Ancestors Reasoning

For the P dataset we train LTNs with the following axioms:

• ∀a, b ∈ P : parent(a, b) → ancestor(a, b)
• ∀a, b, c ∈ P : (ancestor(a, b) ∧ parent(b, c)) → ancestor(a, c)
• ∀a ∈ P : ¬parent(a, a)
• ∀a ∈ P : ¬ancestor(a, a)
• ∀a, b ∈ P : parent(a, b) → ¬parent(b, a)
• ∀a, b ∈ P : ancestor(a, b) → ¬ancestor(b, a)

Thus, we combine the knowledge in these axioms with the data of the parental relationships. We distinguish two relations in this dataset: parent (parent(x, y) means that x is a parent of y) and ancestor (ancestor(x, y) means that x is an ancestor of y). We recall that KBS contains only the parental relationships shown in Figure 2 (e.g., parent(C, I)). The task is to infer the complete knowledge base for the ancestor predicate, which we refer to as KBa; that is, we would like LTNs to learn whether an ancestor relationship is true or false for two given nodes from the axioms and the parental data alone. The representation of the ancestor predicate must be generated from the knowledge in the axioms, since no data about it is provided. We do this to understand whether LTNs are able to pass information from the parent predicate to the ancestor predicate, and whether this is enough to let the network make the more complex inferences that involve chains of ancestors.

We also test how the model performs over the set of ancestor formulas that require multi-hop inferences (i.e., those that cannot be inferred directly from ∀a, b ∈ P : parent(a, b) → ancestor(a, b)); we refer to this set as KBTa: the ancestor pairs for which the parent pair is false (e.g., ancestor(C, S)). As before, we recall that the parent data (KBp) contains only true predicates (e.g., parent(C, I)), while we ask the model to perform inferences over the ancestor set (KBa), which also contains predicates that should be inferred as false (e.g., ancestor(I, C)).

The best-performing model for this task (hmean aggregator, 10-dimensional embeddings, tensor layer size 10) reached an F1 score of 0.77 over KBa. If we do not consider the ancestor predicates that can be inferred directly from the axioms (i.e., restricting to KBTa), the model correctly infers 22 ancestors while generating 25 false positives, for an F1 of 0.62. Again, the network seems able to fit the data quite well, but it still makes errors on multi-hop inferences.

As another experiment over satisfiability, Figure 3 shows the relation between the MAE computed on KBa and the level of satisfiability. To draw this figure we ran multiple experiments with LTNs and computed the mean MAE after aggregating the satisfiability levels rounded to two decimal digits. It is clear that the error decreases as satisfiability increases, and thus that LTNs are able to learn and infer some knowledge. This shows again that the model can learn the initially unknown ancestor relationships from the combination of data and rules.

[Figure 3: Average MAE for the ancestor task at rounded levels of satisfiability. MAE decreases with the increase of satisfiability.]

Comparison With Added Axioms. To provide a better understanding of this experiment we added two axioms to the previous set. These two axioms explicitly state the relationships between parents and ancestors:

• ∀a, b, c ∈ P : (ancestor(a, b) ∧ ancestor(b, c)) → ancestor(a, c)
• ∀a, b, c ∈ P : (parent(a, b) ∧ parent(b, c)) → ancestor(a, c)

Table 2 compares the approach without the new axioms (Six Axioms) and with the new axioms (Eight Axioms) on the ancestor dataset. Performance was computed on the two models with the highest satisfiability (both around 0.99). The top-performing models for both Six Axioms and Eight Axioms were initialized with a tensor layer size of 10 and an embedding dimension of 10; the best universal aggregator was the hmean. The results show that the new axioms are beneficial: the network learns the relationships well, and in particular the precision over KBTa increases by 0.19 points (the difference between the results within parentheses). One interesting aspect of this result is that the network is able to learn a good representation for the ancestor predicate from the axioms alone.

Table 2: Ancestor completion task with different numbers of axioms. Values outside the parentheses are computed over the complete KBa, while those within parentheses are computed on KBTa.

Type          MAE          Matthews     F1           Precision    Recall
Six Axioms    0.16 (0.17)  0.73 (0.61)  0.77 (0.62)  0.64 (0.47)  0.96 (0.92)
Eight Axioms  0.14 (0.14)  0.83 (0.69)  0.85 (0.72)  0.80 (0.66)  0.89 (0.79)

Qualitative Analysis. LTNs allow us to make fuzzy inferences after training: the model can answer queries on fuzzy formulas that were not in the original training data. For example, ∀a, b : ancestor(a, b) → ¬parent(b, a) generally has a value close to 1 in our experiments.
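As a sketch of how such a query can be evaluated, the quantified formula can be computed from the truth degrees that the trained predicates assign to all ordered pairs of constants. Here we use the Lukasiewicz implication and mean aggregation as illustrative choices, with random values standing in for the trained predicate outputs:

```python
import numpy as np

# Sketch of an after-training fuzzy query (our construction): evaluate
# forall a, b: ancestor(a, b) -> not parent(b, a) from the truth degrees
# the trained predicates assign to every ordered pair of constants.
rng = np.random.default_rng(1)
num_constants = 5
ancestor = rng.uniform(size=(num_constants, num_constants))  # stand-ins for
parent = rng.uniform(size=(num_constants, num_constants))    # G(P) outputs

# Lukasiewicz implication: I(a, b) = min(1, 1 - a + b);
# here b is the negation 1 - parent(b, a), i.e., 1 - parent.T[a, b].
implication = np.minimum(1.0, 1.0 - ancestor + (1.0 - parent.T))

# Universal quantifier as the mean aggregation over all pairs.
print(implication.mean())  # degree of truth of the quantified formula
```

The actual value depends on the implication and aggregator induced by the chosen t-norm, but the mechanism is the same: no retraining is needed to answer a new quantified query.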
Summary of the outcomes

• Satisfiability is strongly related to the performance of the model: the higher the satisfiability, the lower the error;
• LTNs learn to pass information quite efficiently (information on parent(x, y) is passed to ancestor(x, y)). Still, some more complicated inferences are difficult;
• More axioms increase the performance of the model.

Experiment 3: Comparison with a Multi-Input Network

In this experiment we compare LTNs with a simple deep learning architecture on a common task. Starting from the complete knowledge base of parents and ancestors, we randomly divide the data into a training set and a test set. The training data consists of 100 parent predicates (both true and false) and 100 ancestor predicates (both true and false)(4); the test set contains 189 parent predicates and 189 ancestor predicates(5). We thus cast the problem as a classification task that can be solved with standard deep learning models.

(4) Note that the training set contains very few examples that are positive.
(5) We tested different random subsets for training and testing, but the results tend to be similar.

We built a simple multi-input architecture that takes as input three one-hot encoded representations: the two atoms of the pair and the predicate (e.g., Susan, Ann, parent). This is not the most optimized architecture for this task, but it is useful for comparing the performance of LTNs with classical deep learning approaches. We trained the network using binary cross-entropy and the RMSprop optimizer for 5,000 epochs with a 20% validation split. To reduce possible overfitting we use L2 regularization (we experimentally found that results were better with it than without). The architecture, with the dimensions of the layers, is shown in Figure 4.

[Figure 4: Baseline multi-input architecture. Three one-hot inputs (two constants and the predicate) pass through dense layers of 5, 5, and 2 dimensions, are concatenated, and feed a 3-dimensional dense layer with sigmoid activation that produces the output.]
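A Keras sketch of this baseline is given below. The layer dimensions follow Figure 4, while the intermediate activations, the regularization strength, and the final 1-unit output layer are assumptions on our side:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Sketch of the multi-input baseline of Figure 4 (dimensions as in the
# figure; activations and the L2 strength are our assumptions).
num_constants, num_predicates = 17, 2
l2 = regularizers.l2(1e-3)

in_a = layers.Input(shape=(num_constants,))    # one-hot constant (e.g., Susan)
in_b = layers.Input(shape=(num_constants,))    # one-hot constant (e.g., Ann)
in_p = layers.Input(shape=(num_predicates,))   # one-hot predicate (e.g., parent)

h_a = layers.Dense(5, activation="relu", kernel_regularizer=l2)(in_a)
h_b = layers.Dense(5, activation="relu", kernel_regularizer=l2)(in_b)
h_p = layers.Dense(2, activation="relu", kernel_regularizer=l2)(in_p)

merged = layers.Concatenate()([h_a, h_b, h_p])
h = layers.Dense(3, activation="sigmoid")(merged)   # 3-unit layer of Figure 4
out = layers.Dense(1, activation="sigmoid")(h)      # truth of pred(a, b)

model = tf.keras.Model([in_a, in_b, in_p], out)
model.compile(optimizer="rmsprop", loss="binary_crossentropy")
# model.fit([A, B, P], y, epochs=5000, validation_split=0.2)
```

Unlike the LTN, this network receives no axioms: everything it knows about parent and ancestor has to be recovered from the 200 labeled training triples.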
The network is trained to detect whether a predicate, given two constants, is true or false (a binary outcome). LTNs are implicitly trained on the same task: we train the network for best satisfiability given the input data and the six axioms used in the previous setting. The performance of both models is computed over the 189 ancestor test predicates. We ignore the parent predicates in this setting because the dataset provides little to no information for predicting whether a parental relationship in the test set is true or false.

The results show that the multi-input network achieves an accuracy of 0.84, while the accuracy of the LTNs was around 0.89. While the accuracies are comparable, an in-depth analysis with other measures reveals that the recall of the multi-input network was 1 with a precision of 0.12, while LTNs had a lower recall (0.66) but a much higher precision (0.57). A naive model that predicts only zeros (since classes are unbalanced) would have reached an accuracy of 0.84. The multi-input architecture tends to overfit on this task, in which most of the labels are 0. It is in any case important to note that it is difficult for the multi-input architecture to understand the task, while LTNs are helped by the axioms. However, the results show that while LTNs are good at learning logical rules, their accuracy is still comparable to the one obtained by standard neural networks. Moreover, the multi-input architecture would require more control over overfitting, while the logical axioms used in LTNs seem to provide a natural way to define constraints over the vector space and to reduce possible overfitting. Nevertheless, different deep learning architectures with different sets of parameters might generate better results.

Using classical neural networks we lose the ability to attach high-level semantics to the data. For example, LTNs can be used in combination with quantifiers to make inferences over data using new axioms on which the network was not trained (e.g., ∀x, y : parent(x, y) → ¬ancestor(y, x) has a high truth value). As shown in the recent work on LTNs for semantic image interpretation, one key element of success might be the use of LTNs over deep learning architectures (Donadello, Serafini, and d'Avila Garcez 2017); this would allow augmenting data with semantic information that makes it possible to explain predictions.

Summary of the outcomes

• Performance on this simple task is comparable to that of a naive network;
• Axioms in LTNs seem to provide a useful way of defining constraints over the space of solutions that might reduce the possibility of overfitting;
• The main advantage of LTNs resides in the possibility of making inferences after training.

Experiment 4: Time to Learn

In this last experiment we investigate how fast LTNs are during learning. We consider the following experimental setting: we generate a range of N constants and N predicates and evaluate different combinations of them. We divide this experiment into three parts by considering unary, binary, and ternary predicates of the form ∀x : predn(x), ∀x, y : predn(x, y), and ∀x, y, z : predn(x, y, z); we therefore test only predicates that are universally quantified. We run 5,000 training epochs to learn the parameters for 4, 8, 12, 20, and 30 constants combined with 4, 8, 12, 20, and 30 (universally quantified) predicates of arity one, two, and three. This means that in the setting with 4 constants and 8 predicates of arity 3 we introduce 4 constants (a, b, c, d) and 8 predicates (pred1, pred2, ..., pred8), each universally quantified (e.g., ∀x, y, z : pred1(x, y, z)). The size of the embedded representation in this experiment is 10. Experiments were run using a compiled version of TensorFlow on an i7 machine.
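The following sketch (ours) illustrates why arity matters so much: a universally quantified predicate must be evaluated on every tuple of constants, so the number of groundings grows as N^arity:

```python
import itertools, time
import numpy as np

# Sketch (ours) of the grounding blow-up behind Experiment 4: a universally
# quantified predicate is evaluated on every tuple of constants, so the
# number of groundings grows as N^arity.
n_constants, emb_size = 30, 10
constants = np.random.normal(size=(n_constants, emb_size))

for arity in (1, 2, 3):
    start = time.perf_counter()
    tuples = [np.concatenate(c) for c in
              itertools.product(constants, repeat=arity)]
    batch = np.stack(tuples)                       # (N^arity, arity * emb_size)
    _ = 1.0 / (1.0 + np.exp(-batch.sum(axis=1)))   # dummy predicate pass
    print(arity, batch.shape[0], f"{time.perf_counter() - start:.3f}s")
```

With 30 constants the arity-3 case already requires 27,000 groundings per predicate per epoch, which matches the steep growth we observe below.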
Analysis. Figures 5, 6, and 7 show the seconds needed to complete the learning phase for each setting. While it is clear that the constants have an influence on computational time (since they are the training data), we can also state that the predicates and their arity have a notable computational impact on the learning phase. With a low number of constants and predicates (e.g., 4) the training time does not differ much across the settings, but as soon as the number of constants increases the model requires more time to learn. The arity of the predicates seems to be the element with the highest impact on learning time: this is an expected result, since in the ternary case the universal quantifier has to cover many more tuples. Since the experiments were run on a CPU, we expect training times to be shorter on a GPU(6).

[Figure 5: Computational times in seconds for predicates of arity one and constants.]

[Figure 6: Computational times in seconds for predicates of arity two and constants.]

[Figure 7: Computational times in seconds for predicates of arity three and constants.]

(6) To show an effective comparison between different predicates we decided to report results computed on a CPU: with the GPU it was more difficult to highlight the differences between these experiments.

Summary of the outcomes

• The time needed to learn the parameters is highly influenced by the arity of the predicates.

Other Experimental Notes

In this section we briefly describe other experimental observations that may be of interest to the community. While the following assertions are derived from empirical experiments, they might still be useful for readers who want to start using LTNs.

LTNs, like all deep learning models, suffer from optimization problems: in our experiments we often found the model reaching local minima. Global optimization tools might help in a better search over the parameter space.

In our experiments LTNs often predicted the class Cat to be a subclass of the class Bird. This error might be due to missing knowledge in the KB: the network is not able to tell the two classes apart even though they come from different branches of the taxonomy. In general, it seems that LTNs predict many false positives, while they are better at detecting true negatives.
This seems to be due to the fact that the true negatives in our experiments can be inferred directly from the axioms: for example, ∀a : ¬ancestor(a, a) gives the model a good amount of information that every pair in which the same constant occurs in both arguments of ancestor should receive a negative value.

If the model fits the data too well (i.e., it overfits), the performance over the test set decreases. While this is a common phenomenon for machine learning models and there are techniques to prevent it, applying these techniques to LTNs is not straightforward: cross-validation would require us to provide completeness information in the training set, which would bias the reasoning task.

We tested different sets of hyper-parameters and release the results of the tested tasks online. While this was not the primary scope of the paper, it is still important to estimate the effects of the hyper-parameters to fully evaluate the approach. We empirically found that increasing the number of layers of the tensor network and the size of the embeddings makes the model much more difficult to optimize.

After paper acceptance, a new version of LTNs was released by the original authors: this version is easier to optimize and shows a slight increase in performance on the F1 measure.

Related Work

In this section we summarize some related approaches from the state of the art. We refer to Garcez, Lamb, and Gabbay and to Besold et al. for discussions of the different neural-symbolic approaches proposed in the literature; here we discuss only a few of these approaches and some related methods.

One of the main points of discussion in the artificial intelligence community in the last decades is the relationship between symbolic artificial intelligence and connectionist (i.e., neural network based) artificial intelligence (Minsky 1991). In recent years deep learning approaches have shown great computational capabilities (Goodfellow, Bengio, and Courville 2016), but these approaches still do not achieve the reasoning and knowledge transformation abilities that symbolic approaches show. On the other hand, symbolic artificial intelligence suffers from computational limits and from the knowledge acquisition bottleneck, i.e., the need to generate high-quality knowledge bases, which is usually done manually. A different voice in this debate comes from the neural-symbolic field, where the task is to bring together the two worlds of symbolic artificial intelligence and neural networks (Garcez, Gabbay, and Broda 2002; Hammer and Hitzler 2007; Garcez, Lamb, and Gabbay 2008; Garcez et al. 2015).

In the current work we have explored only LTNs, but several other approaches in the field have been introduced. One of the most famous approaches to neural-symbolic integration are the Knowledge Based Artificial Neural Networks (KBANNs) (Towell and Shavlik 1994). KBANNs were one of the first approaches to integrate propositional clauses with data, developed at the same time as the closely related propositional core method (Hölldobler and Kalinke 1994). Lifting these results towards first-order logic, however, has proven difficult and remains limited to toy-size knowledge bases (Hitzler, Hölldobler, and Seda 2004; Gust, Kühnberger, and Geibel 2007; Bader, Hitzler, and Hölldobler 2008).

On the other hand, there are approaches from the Statistical Relational Learning field that do not integrate neural networks with logic, but tackle the problem in a symbolic manner while also incorporating statistical information. Examples of this category are ProbLog (De Raedt and Kimmig 2015), a probabilistic logic programming language, and Markov Logic Networks (MLNs), a statistical relational learning model that has been shown to be effective on a large variety of tasks (Richardson and Domingos 2006; Meza-Ruiz and Riedel 2009). The intuition behind MLNs and LTNs is similar, since both base their approach on logical languages. MLNs define weights for formulas and interpret the world from a probabilistic point of view, while LTNs use fuzzy logic combined with a neural architecture to generate their inferences.
Conclusions and Future Work

Conclusions

LTNs obtain good results on reasoning tasks when optimal satisfiability conditions are met. Such conditions are often difficult to reach, and using the model with a low degree of satisfiability can generate bad inferences. Nevertheless, LTNs show interesting capabilities, and their ability to mix logic and data might prove to be a valuable resource. LTNs fit the data well and can be used to make simple inferences; more complex (multi-hop) inferences are more difficult for the model to capture.

The main problems encountered in our experiments are the erroneous predictions generated by the LTNs and scalability issues. We think that the former problem might be mitigated with a more careful use of logic constraints: for example, in the ancestor experiment, adding notions about the concept of "siblings" might help the network to perform better. A more efficient use of computational resources could help in reducing the latter problem.

Future Work

While our results have shown that LTNs are able to capture logical semantics in the vector space, they should also be compared with other statistical relational learning methods, like MLNs, on similar tasks. Another possible next step is to apply LTNs to the bigger knowledge bases used in the state of the art (Bordes et al. 2013); we expect the ability to make fuzzy inferences over the trained model to be of great help in link prediction tasks over knowledge bases. An interesting development of this work could be the evaluation of the generated groundings: constants in LTNs have an associated vector, and it is thus possible to compute similarities between constants in the vector space. This might be interesting in the context of knowledge graph embeddings (Bordes et al. 2013), i.e., vector representations of the entities and relationships of a knowledge graph.

Acknowledgment

We thank Luciano Serafini and Artur d'Avila Garcez for their comments and suggestions. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References

Bader, S.; Hitzler, P.; and Hölldobler, S. 2008. Connectionist model generation: A first-order approach. Neurocomputing 71(13-15):2420-2432.

Besold, T. R.; d'Avila Garcez, A. S.; Bader, S.; Bowman, H.; Domingos, P. M.; Hitzler, P.; Kühnberger, K.; Lamb, L. C.; Lowd, D.; Lima, P. M. V.; de Penning, L.; Pinkas, G.; Poon, H.; and Zaverucha, G. 2017. Neural-symbolic learning and reasoning: A survey and interpretation. CoRR abs/1711.03902.

Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, 2787-2795.

Bowman, S. R.; Potts, C.; and Manning, C. D. 2015. Learning distributed word representations for natural logic reasoning. In Proceedings of the Association for the Advancement of Artificial Intelligence Spring Symposium (AAAI), 10-13.

De Raedt, L., and Kimmig, A. 2015. Probabilistic (logic) programming concepts. Machine Learning 100(1):5-47.

Donadello, I.; Serafini, L.; and d'Avila Garcez, A. 2017. Logic tensor networks for semantic image interpretation. In IJCAI, 1596-1602.

Garcez, A.; Besold, T. R.; De Raedt, L.; Földiak, P.; Hitzler, P.; Icard, T.; Kühnberger, K.-U.; Lamb, L. C.; Miikkulainen, R.; and Silver, D. L. 2015. Neural-symbolic learning and reasoning: contributions and challenges. In Proceedings of the AAAI Spring Symposium on Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches, Stanford.

Garcez, A. S. d.; Gabbay, D. M.; and Broda, K. B. 2002. Neural-Symbolic Learning Systems: Foundations and Applications. Berlin, Heidelberg: Springer-Verlag.

Garcez, A. S.; Lamb, L. C.; and Gabbay, D. M. 2008. Neural-Symbolic Cognitive Reasoning. Springer Science & Business Media.

Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press.

Gust, H.; Kühnberger, K.; and Geibel, P. 2007. Learning models of predicate logical theories with neural networks based on topos theory. In Hammer, B., and Hitzler, P., eds., Perspectives of Neural-Symbolic Integration, volume 77 of Studies in Computational Intelligence. Springer. 233-264.

Hammer, B., and Hitzler, P., eds. 2007. Perspectives of Neural-Symbolic Integration, volume 77 of Studies in Computational Intelligence. Springer.

Hitzler, P.; Hölldobler, S.; and Seda, A. K. 2004. Logic programs and connectionist networks. Journal of Applied Logic 2(3):245-272.

Hölldobler, S., and Kalinke, Y. 1994. Ein massiv paralleles Modell für die Logikprogrammierung. In WLP, 89-92.

Matthews, B. W. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2):442-451.

Meza-Ruiz, I., and Riedel, S. 2009. Jointly identifying predicates, arguments and senses using Markov logic. In NAACL, 155-163. Association for Computational Linguistics.

Minsky, M. L. 1991. Logical versus analogical or symbolic versus connectionist or neat versus scruffy. AI Magazine 12(2):34.

Petr, H. 1998. Metamathematics of Fuzzy Logic, volume 4 of Trends in Logic - Studia Logica Library. Kluwer.

Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62(1-2):107-136.

Serafini, L., and Garcez, A. S. d. 2016. Learning and reasoning with logic tensor networks. In Conference of the Italian Association for Artificial Intelligence, 334-348. Springer.

Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, 926-934.

Towell, G. G., and Shavlik, J. W. 1994. Knowledge-based artificial neural networks. Artificial Intelligence 70(1-2):119-165.