=Paper=
{{Paper
|id=Vol-2350/paper22
|storemode=property
|title=On the Capabilities of Logic Tensor Networks for Deductive Reasoning
|pdfUrl=https://ceur-ws.org/Vol-2350/paper22.pdf
|volume=Vol-2350
|authors=Federico Bianchi,Pascal Hitzler
|dblpUrl=https://dblp.org/rec/conf/aaaiss/BianchiH19
}}
==On the Capabilities of Logic Tensor Networks for Deductive Reasoning==
<pdf width="1500px">https://ceur-ws.org/Vol-2350/paper22.pdf</pdf>
<pre>
On the Capabilities of Logic Tensor Networks for Deductive Reasoning

Federico Bianchi (1,2), Pascal Hitzler (2)
(1) University of Milan-Bicocca
(2) Wright State University
federico.bianchi@disco.unimib.it, pascal.hitzler@wright.edu

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen,
P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with
Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA,
March 25-27, 2019.

Abstract
Neural-symbolic integration is a field in which classical symbolic knowledge mechanisms are
combined with neural networks. This is done to provide satisfactory computational capabilities
from the network side and to exploit the descriptive power of symbolic reasoning. Logic Tensor
Networks (LTNs) are a deep learning model that can be used to combine data with fuzzy logic to
provide inferences and reasoning mechanisms over data. While LTNs have been shown to be effective
in some contexts, no detailed analysis of their capabilities for deductive logical reasoning has
been conducted. In this paper we explore the capabilities and the limitations of LTNs in terms of
deductive reasoning.

Introduction
Neural-symbolic learning and reasoning (Garcez, Lamb, and Gabbay 2008; Besold et al. 2017)
involves integrating standard logical reasoning with neural networks with the aim of providing
fast and robust computational methods for reasoning and explanation over data. Logic Tensor
Networks (LTNs) are a deep learning model that comes from the neural-symbolic field: it integrates
both logic and data in a neural network to provide support for neural-symbolic learning and
reasoning (Serafini and Garcez 2016). LTNs use first-order fuzzy logic to express knowledge about
the world: using fuzzy logic rather than classical first-order logic allows us to represent the
degree of truth with continuous values in the interval [0, 1].
   Inputs to LTNs are data and axioms in (fuzzy) first-order predicate logic, e.g.,
parent(Ann, Susan), ∀x, y : parent(x, y) → ancestor(x, y). Two key components of logic tensor
networks are the grounding of formulas and learning by best satisfiability. By formula grounding
we refer to the mapping of formulas to a vector space. For example, constants are mapped to
n-dimensional vectors while function symbols are mapped to linear functions. A neural network can
be used to compute the degree of truth of a given formula from the embedded representation of
constants and symbols.
   Deep learning models (Goodfellow, Bengio, and Courville 2016) usually learn by optimizing a
function; in LTNs this task is replaced with the task of best satisfiability: the model has to
optimize the representation of each atom, function and predicate in such a way that the
satisfiability of each formula is maximized. In this way the network learns the best possible
parameters to represent both data and axioms.
   The main advantages of LTNs are the following: i) it is possible to express knowledge using
logical axioms over data, ii) it is possible to tackle and solve standard machine learning tasks
(e.g., classification), and iii) it is possible to provide explanations using fuzzy logic over
the trained network. Indeed, after training it is possible to make fuzzy inferences over data to
obtain the degree of truth with respect to certain predicates. The model was tested with
promising results on simple reasoning tasks (Serafini and Garcez 2016) and on semantic image
interpretation (Donadello, Serafini, and d’Avila Garcez 2017).
   An initial exploration of the reasoning capabilities of LTNs was done on the well-known
smoker-friends-and-cancer dataset (Serafini and Garcez 2016). The dataset contains data about two
groups of people for which friend relationships and smoking habits are given, while the fact of
having cancer or not is only given for people in the first group. Axioms related to smoking
properties (e.g., smoking implies cancer) are given to the network. The network learns to predict
if people in the second group have cancer having learned the patterns present in the first group.
More recently, LTNs were used on a semantic image interpretation task in which they learned to
classify bounding boxes of images with the help of background knowledge (Donadello, Serafini, and
d’Avila Garcez 2017). Still, an in-depth analysis of the deductive reasoning capabilities of LTNs
remains to be done.
   In this work we explore LTNs in the context of reasoning tasks, showing insights and
properties of the model. We introduce two simple datasets that contain relationships and we
define additional axioms over these datasets. These two datasets are used to evaluate deductive
reasoning capabilities. We also perform some experiments on the computation time that is required
to learn model parameters. Our results show that LTNs can fit the data well and are able to
perform simple deductive inferences.
The real added value of the model is that it lends itself to explanations, since it allows us to
do after-training fuzzy inferences over the data. Nevertheless, the model generates some errors,
in particular when multi-hop inferences are to be drawn, and thus some refinements over the
general model might be required to improve the results.
   The rest of the paper is organized as follows: in Section 2 we describe LTNs showing the basic
definitions and the learning process, in Section 3 we introduce our experimental setting and we
describe and evaluate the results of our experiments. Section 4 contains other related work.
Finally, we end the paper in Section 5 with some conclusions and future work.

Logic Tensor Networks
LTNs use first-order fuzzy logic (Petr 1998) and embed atoms, functions, and predicates in a
vector space. LTNs are inspired by Neural Tensor Networks (Socher et al. 2013) that have been
shown to be effective in natural logic reasoning tasks (Bowman, Potts, and Manning 2015). In the
following sections we will give a short primer on logic tensor networks and their learning
methodology. More details on LTNs can be found in the paper in which they were first introduced
(Serafini and Garcez 2016). To describe LTNs we will follow the definitions given by Serafini and
Garcez.

Logic
LTNs are implemented over a logic called Real Logic that is described by a language L that
contains a set of constants C, a set of function symbols F and a set of predicates P. In this
language rules from fuzzy logic apply and connectives are interpreted as binary operations over
real numbers in [0, 1]. For example, t-norms are used in place of the conjunction from classical
logic. The t-norm is an operation [0, 1]^2 → [0, 1] and different versions of the operation exist
(Lukasiewicz, Gödel and product t-norms are some possible examples). Once the t-norm is chosen,
the other connectives can also be defined with respect to it. Thus, the use of t-norms and the
other fuzzy connectives allows us to operate on real values in the interval [0, 1].

Grounding
Each element of the language L is grounded in the vector space. Constants are mapped to vectors
in R^m while function symbols are mapped to functions in the vector space. An n-ary function
symbol is mapped to an n-ary function R^(m·n) → R^m. Predicates are mapped to functions with
codomain in [0, 1]: R^(m·n) → [0, 1]; the predicate is mapped to a fuzzy subset that defines the
degree of truth (membership to the set) for that predicate given its arguments.

Networks
The dimensionality of the vector of the constant is a hyper-parameter of the model. While
constants are mapped to vectors, functions and predicates are mapped to actual operations over
the vector space. We will use G(f) and G(P) to identify groundings of functions and predicates.
Function symbols are implemented as linear functions: if f is a function symbol of arity m and
v_1, ..., v_m ∈ R^n are the groundings of m terms, then the grounding of f can be expressed as:

    G(f)(v_1, \dots, v_m) = M_f v + B_f                                                    (1)

where v = ⟨v_1, ..., v_m⟩, M_f is a transformation matrix and B_f is the bias. This operation can
be encoded into a one-layer neural network.
   Predicates are instead mapped to neural tensor operations (Socher et al. 2013); the output of
the neural tensor network is given as input to a sigmoid such that the final output of the
predicate is a value in the interval [0, 1]. The tensor operation is the following:

    G(P)(v) = \sigma\left(u_P^{T} \tanh\left(v^{T} W_P^{[1:k]} v + V_P v + B_P\right)\right)    (2)

σ is the sigmoid function, W, V, B and u are parameters to be learned by the network, and k
corresponds to the layer size of the tensor and is a hyper-parameter of the network.
   Quantifiers like ∀ in fuzzy logic are defined with aggregation functions (like the min): in
principle this requires an aggregation over an infinite number of instances, which is impossible
to compute. Thus, quantifiers are implemented as aggregation operations over a subset of the
domain space R^k. Different implementations can be used for the aggregation of the universal
quantifiers, for example mean, min and hmean (harmonic mean).

Learning to Satisfy Formulas
LTNs reduce the learning problem to a maximum satisfiability problem: the task is to find
groundings for atoms, predicates and formulas that maximize the satisfiability of a given
formula. For example, given the formula parent(Susan, Ann), which describes the fact that Susan
is one of Ann’s parents, the network will try to optimize the groundings of the predicate parent
(i.e., the parameters in the tensor layer) and the groundings of Susan and Ann (i.e., their
respective two vectors) in such a way that the degree of truth of the formula is close to 1.
Thus, the groundings are both the embedded representation of the atoms and the parameters in the
networks that represent both functions and predicates; the values of these components can be
learned through the use of back-propagation (Goodfellow, Bengio, and Courville 2016). The output
of the learning process is a satisfiability score (in the interval [0, 1]) that can be considered
similar to the value of the loss function in a standard deep learning setting.
   We show an example of how grounding and satisfiability are combined. For compactness, in this
example we will identify the grounding of each element with a G as a superscript: given the
formula P(x, y) ∧ R(w, z), the groundings for the constants x, w, z, and y are retrieved (denoted
with x^G). P and R are grounded to the respective operations: P^G(x^G, y^G) ∧ R^G(w^G, z^G). The
output of both predicates is a real value in [0, 1] that can be aggregated with the use of the
t-norm. LTNs will learn to optimize the groundings in such a way that the final value is close to
1 (i.e., the formula is satisfied).
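
The grounding of P(x, y) ∧ R(w, z) described above can be sketched in plain Python. This is only an illustration under stated assumptions: the parameters are random and untrained, the embedding size (3) and tensor layer size (4) are made up, the product t-norm is picked arbitrarily among the t-norms mentioned in the text, and the helper names predicate_grounding, random_predicate and product_tnorm are hypothetical and not part of the LTN code base.

```python
import math
import random

def predicate_grounding(v, W, V, B, u):
    """Eq. (2): sigmoid(u^T tanh(v^T W^[1:k] v + V v + B)) for an input vector v."""
    n, k = len(v), len(u)
    hidden = []
    for i in range(k):  # one tensor slice W[i] (n x n) per hidden unit
        bilinear = sum(v[a] * W[i][a][b] * v[b] for a in range(n) for b in range(n))
        linear = sum(V[i][a] * v[a] for a in range(n))
        hidden.append(math.tanh(bilinear + linear + B[i]))
    score = sum(u[i] * hidden[i] for i in range(k))
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid keeps the result in (0, 1)

def random_predicate(n, k, rng):
    """Randomly initialised (untrained) parameters W, V, B, u; illustrative only."""
    W = [[[rng.gauss(0, 0.1) for _ in range(n)] for _ in range(n)] for _ in range(k)]
    V = [[rng.gauss(0, 0.1) for _ in range(n)] for _ in range(k)]
    B = [0.0] * k
    u = [rng.gauss(0, 1.0) for _ in range(k)]
    return W, V, B, u

def product_tnorm(a, b):
    """Product t-norm, one possible fuzzy conjunction (an assumption here)."""
    return a * b

rng = random.Random(0)
dim, k = 3, 4  # hypothetical embedding size and tensor layer size
constants = {c: [rng.uniform(-1, 1) for _ in range(dim)] for c in "xywz"}
P = random_predicate(2 * dim, k, rng)  # binary predicate: two concatenated constants
R = random_predicate(2 * dim, k, rng)

truth_P = predicate_grounding(constants["x"] + constants["y"], *P)
truth_R = predicate_grounding(constants["w"] + constants["z"], *R)
print(product_tnorm(truth_P, truth_R))  # degree of truth of P(x, y) AND R(w, z)
```

Training would adjust the constant vectors and the parameters W, V, B and u so that this final value moves towards 1; the sketch only evaluates the formula once.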
Experiments
In this experimental section we aim to obtain answers for the following questions: i) what can
LTNs learn and ii) how fast is the LTNs learning phase. To allow easy replication of our
experiments we will first describe the datasets we use and then we will introduce some details on
the general methodology we have followed during our experiments. Details that are related to a
particular experiment will be given in the related section. For our experiments we use the
original LTNs TensorFlow implementation [1] provided by the authors (Serafini and Garcez 2016).
Datasets, code and results are available online with specific instructions on how to repeat our
experiments [2]. We briefly summarize here the four experiments we ran:
• Experiments 1 and 2 will concentrate on a knowledge base completion task in which we will give
  to the network only true predicates and some axioms;
• Experiment 3 will compare LTNs with a simple deep learning baseline to provide insights about
  the strength and the limits of the model;
• Experiment 4 will show computational times related to experiments on learning with LTNs.

Definitions
By KBS we denote an input (starting) knowledge base, and KB will denote the corresponding
completed knowledge base (i.e., with all relevant logical consequences added). KBT denotes the
set of all inferences not in KBS, i.e., KBT = KB \ KBS. In the experiments we will often show the
performance over both KB and KBT by putting results related to KBT within parentheses.

Datasets
We use mainly two datasets for our experiments. The first one, called dataset A, represents a
taxonomy that mainly contains hierarchies of classes (inspired by the DBpedia Ontology [3]). The
taxonomy contains 25 nodes. Each node but one (the root) has an outgoing edge to its superclass
(e.g., Cat is connected to Feline). Figure 1 shows the taxonomy used in the experiment.
   The second dataset P is a parent-ancestor dataset that contains 17 nodes. Edges connect
parents to one or more children, for a total of 22 parental relationships. Figure 2 shows the
parental relationships.
   We will test these two datasets on tasks in which we will have heavily unbalanced classes
(more negative examples than positive ones). While our datasets are small compared to the ones
currently used for knowledge base completion tasks (Bordes et al. 2013), we think that the
results of our experiments can point out interesting capabilities of LTNs: can LTNs perform
deductive reasoning over these simple datasets? Moreover, using these small datasets, results can
be manually inspected to better understand where and how the models fail to produce correct
answers, a task that is more difficult with big knowledge bases.

[Figure 1 (image not reproduced). Node labels recovered from the figure: Snake, Crocodile, Lizard,
BaldEagle, Thing, Reptile, LilBird, Bank, Company, Bird, Organization, Agent, Eagle, Animal, Fish,
Shark, Human, Mammal, BlueFish, Dog, Feline, Squirrel, Dolphin, Cat.]
Figure 1: Taxonomy that represents the content of the A dataset

[Figure 2 (image not reproduced). Node labels recovered from the figure, by row: C D A B /
I H G E F / P R Q O N M L / S.]
Figure 2: Representation of the parental relationships in the P dataset

[1] https://github.com/logictensornetworks
[2] https://github.com/vinid/ltns-experiments
[3] http://dbpedia.org

Methodology
Given a dataset we define a set of axioms and we test a knowledge base completion task, showing
hyper-parameter details. LTNs will receive as input data in the form of predicates (e.g.,
parent(Ann,Susan)) and axioms (e.g., ∀x, y : parent(x, y) → ancestor(x, y)); the network will
learn groundings for all the parameters and in the testing phase we will analyze predictions over
data. Since different configurations of hyper-parameters are possible we run multiple models and
we re-run each model multiple times (to check variations due to random initialization). After a
first phase of trial and error we set as static the following parameters: optimizer RMSprop, bias
−1e−5, learning rate 0.01, decay 0.9. We tested three different aggregation functions for the
universal quantifiers (harmonic mean, mean and min), two tensor layer sizes (10 and 20) for the
tensor network and two embedding sizes for constants (10 and 20 dimensions).
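
As a rough illustration of the settings listed in the methodology, the following sketch compares the three aggregation functions on a small sample of truth values and enumerates the combinations of aggregator, tensor layer size and embedding size. The truth values are made up, and treating the tested settings as a full grid is an assumption; the text only states which values were tried.

```python
import itertools
import statistics

# Candidate aggregators for the universal quantifier, applied to a finite
# sample of truth values in (0, 1].
AGGREGATORS = {
    "hmean": statistics.harmonic_mean,
    "mean": statistics.mean,
    "min": min,
}
TENSOR_LAYER_SIZES = (10, 20)
EMBEDDING_SIZES = (10, 20)

# Hypothetical truth values of one formula evaluated on four instantiations.
truth_values = [0.95, 0.90, 0.85, 0.20]
for name, aggregate in AGGREGATORS.items():
    print(f"{name}: {aggregate(truth_values):.3f}")

# Every combination of the tested settings (assumes the full grid was run).
grid = list(itertools.product(AGGREGATORS, TENSOR_LAYER_SIZES, EMBEDDING_SIZES))
print(len(grid), "configurations")
```

On the same sample the harmonic mean lies between min and mean, so a single badly satisfied instance lowers it more than it lowers the arithmetic mean but less than it lowers min.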
Evaluation Measures To evaluate the models we use the Mean Absolute Error (MAE), the Matthews
correlation coefficient (Matthews 1975), which is often regarded as stable when classes are
unbalanced, F1 score, precision, and recall. When we compute MAE we will compute the absolute
distance between the fuzzy predictions and the actual true values; this lets us assess how good
the models are with a continuous error value. When computing the other measures we will round the
scores to the nearest integer in such a way that we compare only binary scores: we consider
prediction values higher than 0.5 as 1 and all other values as 0. While this is a strong
approximation over the degree of truth given by fuzzy logic it is still useful to understand the
performance of the model. We will also report accuracy to summarize the performance of the model
when necessary. In general, we select the best model for each experiment by considering the one
with the highest F1 score.

Experiment 1: Taxonomy Reasoning
For the A dataset we ask the LTNs to learn the following axioms:
• ∀a, b, c ∈ A : (sub(a, b) ∧ sub(b, c)) → sub(a, c)
• ∀a ∈ A : ¬sub(a, a)
• ∀a, b : sub(a, b) → ¬sub(b, a)

where sub identifies the subclass relation in the dataset (e.g., sub(Cat, Feline)). The objective
of this experiment is to see if LTNs can generate the transitive closure starting from a dataset
using the axioms. Data contained in the A dataset is our KBS while the edges needed to generate
the transitive closure will be our KBT. We compare the predictions of LTNs (computed as the
prediction over sub(x, y) given x, y) with the actual transitive closure of the graph. We recall
that KBS contains only true predicates (e.g., sub(Cat, Feline)) while we ask the model to perform
inferences also over predicates that are false (e.g., we evaluate sub(Feline, Cat) expecting a
value close to 0).
   Table 1 shows results for the knowledge completion tasks of the top performing model and one
of the worst performing ones: the top performing model had a satisfiability equal to 0.99 while
one of the worst ones had a satisfiability of 0.56. The top-performing model was initialized with
a layer size in the tensor network of 20 and a dimension of the embeddings equal to 20; the best
universal aggregator was the mean aggregator.
   The best model over KB is able to fit the data well, since the F1 measure shows good
performance over the entire knowledge base (F1 = 0.64). LTNs are prone to generate false
positives: the model generates 36 false positives with respect to 55 true positives and 26 false
negatives with respect to 459 true negatives.
   The performance drops when we consider only KBT elements for testing (F1 = 0.51); this means
that LTNs are in this case not able to capture some more complicated inferences.
   Still, the approach is better than a binary random baseline. The accuracy of the model with
the best satisfiability is 0.89, while a naive classifier that predicts only zeros would have
reached an accuracy equal to 0.85. This is important to remark since the two classes are
ill-balanced.

Qualitative Analysis Analyzing the predictions of LTNs we found that in some cases the model
correctly predicts multi-hop logical inferences (e.g., sub(Cat, Animal) close to 1), but fails on
other simple inferences (e.g., sub(Cat, Bird) close to 1). When there is not enough information
regarding the relationship between two elements (e.g., Cat and Bird) the model has difficulty
predicting the correct answer.

Summary of the outcomes
• LTNs fit the data well;
• Multi-hop inferences tend to be more difficult;
• As expected performance increases with satisfiability.

Experiment 2: Ancestors Reasoning
For the P dataset we train LTNs with the following axioms:
• ∀a, b ∈ P : parent(a, b) → ancestor(a, b)
• ∀a, b, c ∈ P : (ancestor(a, b) ∧ parent(b, c)) → ancestor(a, c)
• ∀a ∈ P : ¬parent(a, a)
• ∀a ∈ P : ¬ancestor(a, a)
• ∀a, b ∈ P : parent(a, b) → ¬parent(b, a)
• ∀a, b ∈ P : ancestor(a, b) → ¬ancestor(b, a)
Thus, we combine the knowledge of these axioms with the data of the parental relationships. We
distinguish two different relationships in this dataset: parent (i.e., parent(x, y) means that x
is a parent of y) and ancestor (i.e., ancestor(x, y) means that x is an ancestor of y).
   KBS contains only the parental relationships shown in Figure 2 (e.g., parent(C, I)). The task
we will test is to infer the complete knowledge base for the ancestor predicate, to which we will
refer as KBa; therefore, we would like LTNs to learn if an ancestor relationship is true or false
for two given nodes only from axioms and parental data. The representation for the ancestor
predicate should be generated from the knowledge in the axioms since no data about it is
provided.
   We will also test how the model performs over the set of ancestor formulas that require
multi-hop inferences to be inferred (i.e., those that cannot be directly inferred from
∀a, b ∈ P : parent(a, b) → ancestor(a, b)); we will refer to this as KBTa: those ancestor pairs
for which the parent pair is false (e.g., ancestor(C, S)). As before, we recall that KBp contains
only true predicates (e.g., parent(C, I)) while we ask the model to perform inferences over the
ancestor dataset (KBa) that also contains predicates that should be inferred as false (e.g.,
ancestor(I, C)).
   We do this to understand if LTNs are able to pass the information from the parent predicate to
the ancestor predicate and if this is enough to give to the network the possibility of making
even more complex inferences that are related to chains of ancestors.
   The best performing model for this task (with hmean, 10 dimensional embeddings, 10 neural
tensor layers) over KBa
Table 1: Performance measures on the A dataset. Values outside the parentheses are computed over the complete KB, while those
within parentheses are computed only on the part of the KB that was not in the initial set of data.

Satisfiability   MAE           Matthews      F1            Precision     Recall
0.99             0.12 (0.12)   0.58 (0.45)   0.64 (0.51)   0.60 (0.47)   0.68 (0.55)
0.56             0.51 (0.52)   0.09 (0.06)   0.27 (0.20)   0.20 (0.11)   0.95 (0.93)
Random           0.50 (0.50)   0.00 (0.00)   0.22 (0.17)   0.14 (0.10)   0.50 (0.50)
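
The first row of Table 1 can be cross-checked against the confusion-matrix counts quoted in the text for the best model over KB (55 true positives, 36 false positives, 26 false negatives, 459 true negatives). The sketch below applies the standard definitions of the measures; it is a sanity check written for this page, not code from the experiments repository.

```python
import math

# Confusion-matrix counts reported in the text for the best model over KB.
tp, fp, fn, tn = 55, 36, 26, 459

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
# Accuracy of a classifier that always answers "false" (predicts only zeros).
all_zero_accuracy = (tn + fp) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"mcc={mcc:.2f} accuracy={accuracy:.2f} all_zero={all_zero_accuracy:.3f}")
```

Rounded to two decimals these values agree with the 0.99-satisfiability row of Table 1 and with the reported accuracy of 0.89; the all-zeros baseline comes out at roughly 0.86, close to the 0.85 quoted in the text.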


                                                                                                                       equal to 10; the best universal aggregator was the hmean
                                                                                                                       aggregator. Results show that the new axioms are beneficial
                                                                                                                       for the network, which is able to learn the relationships
                                                                                                                       well. Still, the precision over KBTa has increased by 0.19
                                                                                                                       points (the difference between the results within
                                                                                                                       parentheses).
                                                                                                                          One interesting result about this is related to the fact
                                                                                                                       that the network is able to learn a good representation for
                                                                                                                       the ancestor predicate just from the axioms.
                                                                                                                       Qualitative Analysis LTNs allow us to do fuzzy inferences
                                                                                                                       after training. The model is able to answer queries on fuzzy
                                                                                                                       formulas that were not in the original training data. For
                                                                                                                       example, ∀a, b : ancestor(a, b) → ¬parent(b, a) has
                                                                                                                       generally a value close to 1 in our experiments.
[Figure 3 (plot not reproduced): y-axis Mean Absolute Error, x-axis Satisfiability.]
Figure 3: Average MAE for the ancestors tasks on rounded
level of satisfiability. MAE decreases with the increase of
satisfiability.
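
A curve like the one in Figure 3 can be produced by grouping runs on their satisfiability rounded to two decimals and averaging the MAE within each group. The sketch below shows that aggregation step only; the (satisfiability, MAE) pairs are invented for illustration and are not the values behind the figure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (satisfiability, MAE) pairs, one per training run; not the paper's data.
runs = [(0.561, 0.44), (0.563, 0.42), (0.812, 0.27), (0.808, 0.29), (0.991, 0.16)]

# Group the MAE values by satisfiability rounded to two decimal digits.
mae_by_level = defaultdict(list)
for satisfiability, mae in runs:
    mae_by_level[round(satisfiability, 2)].append(mae)

# Average within each group, ordered by increasing satisfiability.
curve = {level: mean(values) for level, values in sorted(mae_by_level.items())}
for level, avg_mae in curve.items():
    print(f"satisfiability={level:.2f}  mean MAE={avg_mae:.3f}")
```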
                                                                                                                       Summary of the outcomes
                                                                                                                       • Satisfiability is strongly related with performance of the
had an F1 score of 0.77. If we do not consider the ances-
                                                                                                                         model: the higher the satisfiability the lower the error;
tor predicates that can be directly inferred from the axioms
(KBTa ), the model correctly infers 22 ancestors while gen-                                                            • LTNs learn to pass information quite efficiently (informa-
erating 25 false positives: the F1 is equal to 0.62. Again, the                                                          tion on parent(x, y) is passed to ancestor(x, y)). Still,
network seems to be able to fit the data quite well, but it still                                                        some more complicated inferences are difficult;
generates errors on multi-hop inferences.                                                                              • More axioms increase the performance of the model.
   As another experiment over satisfiability, in Figure 3 we
show the relation between the MAE computed on KBa and                                                                  Experiment 3: Comparison with a Multi-Input
the level of satisfiability. To draw this figure we run multi-                                                         Network
ple experiments with LTNs and computed the mean MAE                                                                    In this experiment, we want to compare LTNs with a sim-
by aggregating the satisfiability levels rounded to 2 decimal                                                          ple deep learning architecture on a common task. Starting
digit. It is clear that the error decreases with the increase of                                                       from the complete knowledge base of parents and ances-
the satisfiability level and thus LTNs are able to learn and                                                           tors we randomly divide data into the training set and test
infer some knowledge. This proves again that the model is                                                              set. Training data consists of 100 parent predicates (both
able to learn the originally not known ancestor relationships                                                          true and false) and 100 ancestor predicates4 (both true and
from the combination of data and rules.                                                                                false); test set contains 189 parent predicates and 189 ances-
Comparison With Added Axioms To provide a better un-                                                                   tor predicates5 . We thus tackle this problem by considering a
derstanding of this experiment we decided to add two ax-                                                               classification setting that can be solved with the use of deep
ioms to the previous set. These two axioms explicitly state                                                            learning models.
the relationships between parents and ancestors:                                                                          We built a simple multi-input architecture that took as
                                                                                                                       input three one-hot encoded representations of the pairs of
• ∀a, b, c ∈ P : (ancestor(a, b) ∧ ancestor(b, c)) →                                                                   atoms and the predicates (e.g., Susan, Ann, parent). This is
  ancestor(a, c)                                                                                                       not the most optimized architecture to solve this task, but it
• ∀a, b, c ∈ P                                      : (parent(a, b) ∧ parent(b, c)) →                                  is useful to understand the performance of LTNs compared
  ancestor(a, c)                                                                                                       with classical deep learning approaches. We trained the net-
                                                                                                                       work using binary cross-entropy and the RMSprop gradient
   Table 2 shows the comparison between the approach with-                                                             optimization algorithm over 5,000 epochs with a 20% vali-
out the new axioms (Six Axioms) and with the new axioms                                                                dation split. To reduce possible effects of overfitting we use
(Eight Axioms) on the ancestor dataset. Performances were
computed on the two models with the highest satisfiability                                                                4
                                                                                                                             Note that the training set contains very few examples that are
(both around 0.99). The top-performing models for both Six                                                             positive
Axioms and Eight Axioms were initialized with a layer size in                                                              5
                                                                                                                             We tested different random subsets of training and testing, but
the tensor network of 10 and a dimension of the embeddings                                                             the results tend to be similar
Table 2: Ancestor completion task with different numbers of axioms. Values outside the parentheses are
computed over the complete KBa, while those within parentheses are computed on KBTa.
                    Type           MAE           Matthews      F1            Precision     Recall
                    Six Axioms     0.16 (0.17)   0.73 (0.61)   0.77 (0.62)   0.64 (0.47)   0.96 (0.92)
                    Eight Axioms   0.14 (0.14)   0.83 (0.69)   0.85 (0.72)   0.80 (0.66)   0.89 (0.79)
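
As a quick consistency check on the Six Axioms values in parentheses, the sketch below recomputes
precision, recall and F1 on KBTa from the counts reported in the text (22 correctly inferred
ancestors, 25 false positives). The number of false negatives is not stated in the paper; the value
of 2 used here is an assumption, chosen because it reproduces the reported recall of 0.92. The
function and variable names are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Six Axioms on KBTa: 22 true positives and 25 false positives are reported.
# fn=2 is an assumption (not stated in the paper) that matches recall 0.92.
precision, recall, f1 = precision_recall_f1(tp=22, fp=25, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Under that assumption the rounded values agree with the parenthesised Six Axioms entries of Table 2
(precision 0.47, recall 0.92, F1 0.62).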


L2 regularization (we experimentally found that results were better with it than without). We show
this architecture in Figure 4 where we also show the dimensions of the layers.

[Figure 4 diagram omitted. Labels, bottom to top: Input Layer with one-hot inputs Susan
[0,1,0,...,0,0], Ann [0,0,0,...,1,0] and parent [1,0]; Dense Layer (5, 5 and 2 dimensions);
Concatenation; Dense Layer (sigmoid), 3 dimensions; Output o]
            Figure 4: Baseline Multi-input architecture

   The network is trained to detect if a predicate, given two constants, is true or false (binary
outcome). LTNs are implicitly trained on the same task: we train the network over best satisfiability
given the data in input and the six axioms used in the previous setting.
   The performance of the models is computed over the 189 ancestor test predicates. We ignore the
parent predicates in this setting because there is little to no knowledge about how to predict if a
parental relationship in the test set is true or false from the dataset.
   Results show that the multi-input network achieves an accuracy equal to 0.84 while accuracy for
LTNs was around 0.89; while the accuracies are comparable, an in-depth analysis with other measures
revealed that the recall for the multi-input network was 1 and its precision was 0.12, while LTNs had
a lower recall (0.66) but a much higher precision (0.57). A naive model that predicts only zeros
(since classes are unbalanced) would have reached an accuracy equal to 0.84. The multi-input
architecture tends to overfit in this task in which most of the classes are 0. It is anyway important
to note that it is difficult for the multi-input architecture to understand the task, while LTNs are
helped by the axioms.
   However, the results show that while LTNs are good for learning logical rules, their accuracy is
still comparable to the one obtained by neural networks. Moreover, the multi-input architecture would
require more control on overfitting, while the logical axioms used in LTNs seem to provide a natural
way to define some constraints over the vector space and to reduce possible overfitting. Nevertheless,
different deep learning architectures with a different set of parameters might generate better
results.
   Using classical neural networks we lose the ability to define high-level semantics for the data.
For example, LTNs can be used in combination with quantifiers to make inferences over data using new
axioms on which the network was not trained (e.g., ∀x, y : parent(x, y) → ¬ancestor(y, x) has a high
truth value).
   As shown in the recent work on LTNs for semantic image interpretation, one key element of success
might be the use of LTNs over deep learning architectures (Donadello, Serafini, and d’Avila Garcez
2017); this would allow augmenting data with semantic information that will make it possible to
explain predictions.

Summary of the outcomes
• Results show that performance on this simple task is comparable to a naive network;
• Axioms in LTNs seem to provide a useful way of defining constraints over the space of the solutions
  that might reduce the possibility of overfitting;
• The main advantage of LTNs resides in the possibility of making inferences after training.

Experiment 4: Time to Learn
In this last experiment, we investigate how fast LTNs are in the learning context. We consider the
following experimental setting: we generate a range of N constants and N predicates and we evaluate
different combinations of them. We divide this experiment in three by considering unary, binary and
ternary predicates of the following form: ∀x : predn(x), ∀x, y : predn(x, y),
∀x, y, z : predn(x, y, z); we therefore test only predicates that are universally quantified. We run
5,000 training epochs to learn the parameters of 4, 8, 12, 20, 30 constants with 4, 8, 12, 20, 30
(universally quantified) predicates of arity one, two and three: this means that in the setting with
4 constants and 8 predicates of arity 3 we introduce 4 constants (a, b, c, d) in the model and 8
predicates (pred1, pred2, ..., pred8) and each predicate is universally quantified (e.g.,
∀x, y, z : pred1(x, y, z)). The size of the embedded representation in this experiment is 10.
Experiments were run using a compiled version of Tensorflow on an i7 machine.

Analysis Figures 5, 6, 7 show the seconds needed to complete the learning phase for each setting.
While it is clear that constants have an influence on computational time (since they are training
data) we can also state that predicates and their arity have a notable computational impact upon the
learning phase. With a low number of constants and predicates (e.g., 4) the training time is not much
different in all the settings, but as soon as the number of constants increases the model requires
more time to learn. The arity of the predicates seems to be the element with the highest impact on the
learning time: this is an expected result since the universal
quantifier has to cover multiple elements in the ternary case. Since experiments were run on a CPU we
expect training time to be shorter on GPU[6].
    [6] To show an effective comparison between different predicates we decided to show results
    computed with a CPU: with the GPU it was more difficult to highlight the differences between
    these experiments

Summary of the outcomes
• Time to learn the parameters is highly influenced by the arity of the predicates;

Other Experimental Notes
In this section, we briefly describe other experimental results that are interesting for the
community. While the following assertions are derived from empirical experiments they might still be
useful for the reader who wants to start using LTNs.
   LTNs, like all deep learning models, suffer from optimization problems: in our experiments we often
found the model reaching local minima. Global optimization tools might help in a better parameter
optimization search.
   In our experiments LTNs often predicted the class Cat to be a subclass of the class Bird. This
error might be due to missing knowledge in the KB. The network is not able to understand the
difference between the two since they come from different branches of the taxonomy. In general, it
seems that LTNs predict many false positives, while they are better at detecting true negatives. This
seems due to the fact that true negatives in our experiments can be directly inferred from the
axioms: for example, ∀a : ¬ancestor(a, a) gives a good amount of information to the model about the
fact that each constant occurring in both parameters of the predicate ancestor should generate
negative values.
   If the model fits the data too well (i.e., it overfits) the performance over the test set
decreases. While this is a common event for machine learning models and there are techniques to
prevent this, applying these to LTNs is not so straightforward: cross-validation would require us to
provide completeness information to the training set, which would bias the reasoning task.
   We tested different sets of hyper-parameters and we release results on the tested tasks online.
While this was not the primary scope of the paper it is still important to estimate the effects of
the hyper-parameters to fully evaluate the approach. Nevertheless, we empirically found that
increasing the layers of the tensor network and the size of the embeddings makes the model much more
difficult to optimize.
   After paper acceptance a new version of LTNs was released by the original authors: this last
version is easier to optimize and shows a slight increase in performance over the F1 measure.

                       Related Work
In this section we summarize some related approaches that have been introduced in the state of the
art. We refer to Garcez, Lamb, and Gabbay (2008) and Besold et al. (2017) for discussions of different
neural-symbolic approaches proposed in literature: in this section we will only discuss a few of
these approaches and we will also describe some related methods.
   One of the main points of discussion that has involved the artificial intelligence community in
the last decades is the relationship between symbolic artificial intelligence and connectionist
(i.e., related to neural networks) artificial intelligence (Minsky 1991). In recent years deep
learning approaches have shown great computational capabilities (Goodfellow, Bengio, and Courville
2016), but still these approaches do not achieve the same reasoning and knowledge transformation
abilities that symbolic approaches show. On the other hand, symbolic artificial intelligence suffers
from computational limits and the knowledge acquisition bottleneck, i.e., the need to generate
high-quality knowledge bases, which is usually done manually. A different voice in this group comes
from the neural-symbolic field, where the task is to bring together the two worlds of symbolic
artificial intelligence and neural networks (Garcez, Gabbay, and Broda 2002; Hammer and Hitzler 2007;
Garcez, Lamb, and Gabbay 2008; Garcez et al. 2015).
   In the current work we have explored only LTNs, but there are different approaches in the field
that have been introduced. One of the most famous approaches to neural-symbolic integration is
Knowledge-Based Artificial Neural Networks (KBANNs) (Towell and Shavlik 1994). KBANNs were one of the
first approaches to integrate propositional clauses with data, developed at the same time as the
closely related propositional core method (Hölldobler and Kalinke 1994). Lifting these results
towards first-order logic, however, has proven difficult and limited to toy-size knowledge bases
(Hitzler, Hölldobler, and Seda 2004; Gust, Kühnberger, and Geibel 2007; Bader, Hitzler, and
Hölldobler 2008).
   On the other hand, there are other approaches from the Statistical Relational Learning field that
do not integrate neural networks with logic, but tackle the problem in a symbolic manner by also
combining statistical information. Examples of this category are ProbLog (De Raedt and Kimmig 2015),
a probabilistic logic programming language, and Markov Logic Networks (MLNs), a statistical
relational learning model that has been shown to be effective on a large variety of tasks (Richardson
and Domingos 2006; Meza-Ruiz and Riedel 2009). The intuition behind MLNs and LTNs is similar since
they both base their approach on logical languages. MLNs define weights for formulas and interpret
the world from a probabilistic point of view, while LTNs use fuzzy logic combined with a neural
architecture to generate their inferences.

                 Conclusions and Future Work
Conclusions
LTNs can be shown to obtain good results on reasoning tasks when optimal satisfiability conditions
are met. This is often difficult to reach and using the model with a low degree of satisfiability can
generate bad inferences. Nevertheless, LTNs show interesting capabilities and their ability to mix
logic and data might prove to be a valuable resource. LTNs
Figure 5: Computational times in seconds for predicates of arity one and constants
(rows: number of predicates; columns: number of constants)
  Predicates      4      8     12     20     30
           4    2.6    2.7    2.8      3    3.4
           8    3.7    3.9    4.2    4.6    5.1
          12    5.1    5.3      6    6.1    6.5
          20      8      8    8.4    8.9    9.9
          30     11     12     12     13     14

Figure 6: Computational times in seconds for predicates of arity two and constants
(rows: number of predicates; columns: number of constants)
  Predicates      4      8     12     20     30
           4    3.3    4.5    7.5     14     28
           8    5.1    7.1     17     31     56
          12    6.5    9.6     18     33     66
          20    9.7     15     27     51    100
          30     14     21     37     74    150

Figure 7: Computational times in seconds for predicates of arity three and constants
(rows: number of predicates; columns: number of constants)
  Predicates      4      8     12     20     30
           4    6.8     24     64    260    850
           8     11     39    110    510   1700
          12     15     56    160    730   2500
          20     23     88    260   1200   4100
          30     34    130    410   2000   6500
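
One way to see why arity dominates the timings in Figures 5, 6 and 7 is to count the tuples on which
a universally quantified predicate has to be evaluated. The sketch below assumes that ∀ ranges over
every ordered tuple of the N constants, so a predicate of arity k has N^k groundings; the paper does
not state how the implementation enumerates or batches them, and the helper name is illustrative.

```python
from itertools import product


def count_groundings(n_constants, arity):
    """Count the ordered tuples of constants a predicate of this arity is grounded on."""
    constants = range(n_constants)
    return sum(1 for _ in product(constants, repeat=arity))


# Same numbers of constants as in the experiment; arities one, two and three.
for n in (4, 8, 12, 20, 30):
    counts = [count_groundings(n, k) for k in (1, 2, 3)]
    print(n, counts)
```

With 30 constants this gives 30, 900 and 27,000 tuples per predicate for arity one, two and three,
which is in line with the much steeper growth measured for ternary predicates.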


fit the data well and can be used to make some simple inferences. More complex inferences (multi-hop)
are more difficult to capture in the model.
   The main problems encountered in our experiments are erroneous predictions generated by the LTNs
and scalability issues. We think that the former problem might be solved with a more accurate use of
logic constraints: for example, in the ancestor experiment, adding notions about the concept of
“siblings” might help the network to perform better. A more efficient use of computational resources
could help in reducing the latter problem.

Future Work
While results have shown that LTNs are able to capture logic semantics in the vector space, they
should also be compared with other statistical relational learning methods like MLNs on similar
tasks.
   Another possible next step is to apply LTNs to bigger knowledge bases defined in the state of the
art (Bordes et al. 2013). We expect the ability to make fuzzy inferences over the trained model to be
of great help in link prediction tasks over knowledge bases.
   An interesting development of this work could be evaluating the generated groundings: constants in
LTNs have an associated vector and thus it is possible to compute the similarity in the vector space
between constants. This might be interesting in the context of knowledge graph embeddings (Bordes et
al. 2013): vector representations of entities and relationships of a knowledge graph.

                       Acknowledgment
We thank Luciano Serafini and Artur d’Avila Garcez for their comments and suggestions. We gratefully
acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this
research.

                       References
Bader, S.; Hitzler, P.; and Hölldobler, S. 2008. Connectionist model generation: A first-order
approach. Neurocomputing 71(13-15):2420–2432.
Besold, T. R.; d’Avila Garcez, A. S.; Bader, S.; Bowman, H.; Domingos, P. M.; Hitzler, P.;
Kühnberger, K.; Lamb, L. C.; Lowd, D.; Lima, P. M. V.; de Penning, L.; Pinkas, G.; Poon, H.; and
Zaverucha, G. 2017. Neural-symbolic learning and reasoning: A survey and interpretation. CoRR
abs/1711.03902.
Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings
for modeling multi-relational data. In Advances in neural information processing systems, 2787–2795.
Bowman, S. R.; Potts, C.; and Manning, C. D. 2015. Learning distributed word representations for
natural logic reasoning. In Proceedings of the Association for the Advancement of Artificial
Intelligence Spring Symposium (AAAI), 10–13.
De Raedt, L., and Kimmig, A. 2015. Probabilistic (logic) programming concepts. Machine Learning
100(1):5–47.
Donadello, I.; Serafini, L.; and d’Avila Garcez, A. 2017. Logic tensor networks for semantic image
interpretation. In IJCAI, 1596–1602.
Garcez, A.; Besold, T. R.; De Raedt, L.; Földiak, P.; Hitzler, P.; Icard, T.; Kühnberger, K.-U.;
Lamb, L. C.; Miikkulainen, R.; and Silver, D. L. 2015. Neural-symbolic learning and reasoning:
contributions and challenges. In Proceedings of the AAAI Spring Symposium on Knowledge Representation
and Reasoning: Integrating Symbolic and Neural Approaches, Stanford.
Garcez, A. S. d.; Gabbay, D. M.; and Broda, K. B. 2002. Neural-Symbolic Learning System: Foundations
and Applications. Berlin, Heidelberg: Springer-Verlag.
Garcez, A. S.; Lamb, L. C.; and Gabbay, D. M. 2008. Neural-symbolic cognitive reasoning. Springer
Science & Business Media.
Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press.
Gust, H.; Kühnberger, K.; and Geibel, P. 2007. Learning models of predicate logical theories with
neural networks based on topos theory. In Hammer, B., and Hitzler, P., eds., Perspectives of
Neural-Symbolic Integration, volume 77 of Studies in Computational Intelligence. Springer. 233–264.
Hammer, B., and Hitzler, P., eds. 2007. Perspectives of Neural-Symbolic Integration, volume 77 of
Studies in Computational Intelligence. Springer.
Hitzler, P.; Hölldobler, S.; and Seda, A. K. 2004. Logic programs and connectionist networks.
J. Applied Logic 2(3):245–272.
Hölldobler, S., and Kalinke, Y. 1994. Ein massiv paralleles Modell für die Logikprogrammierung. In
WLP, 89–92.
Matthews, B. W. 1975. Comparison of the predicted and observed secondary structure of T4 phage
lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405(2):442–451.
Meza-Ruiz, I., and Riedel, S. 2009. Jointly identifying predicates, arguments and senses using Markov
logic. In NAACL, 155–163. Association for Computational Linguistics.
Minsky, M. L. 1991. Logical versus analogical or symbolic versus connectionist or neat versus
scruffy. AI Magazine 12(2):34.
Petr, H. 1998. Metamathematics of Fuzzy Logic, volume 4 of Trends in Logic, Studia Logica Library.
Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62(1-2):107–136.
Serafini, L., and Garcez, A. S. d. 2016. Learning and reasoning with logic tensor networks. In
Conference of the Italian Association for Artificial Intelligence, 334–348. Springer.
Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for
knowledge base completion. In Advances in neural information processing systems, 926–934.
Towell, G. G., and Shavlik, J. W. 1994. Knowledge-based artificial neural networks. Artificial
Intelligence 70(1-2):119–165.

</pre>