<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Genova, Italy
* Corresponding author.
matteo.magnini@unibo.it (M. Magnini); giovanni.ciatto@unibo.it (G. Ciatto); andrea.omicini@unibo.it (A. Omicini)
http://matteomagnini.apice.unibo.it (M. Magnini); http://giovanniciatto.apice.unibo.it (G. Ciatto); http://andreaomicini.apice.unibo.it (A. Omicini)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A view to a KILL: Knowledge Injection via Lambda Layer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Magnini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Ciatto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Omicini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica - Scienza e Ingegneria (DISI), Alma Mater Studiorum-Università di Bologna</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>We propose KILL (Knowledge Injection via Lambda Layer) as a novel method for the injection of symbolic knowledge into neural networks (NN), allowing data scientists to control what the network should (not) learn. Unlike other similar approaches, our method does not (i) require ground input formulae, (ii) impose any constraint on the NN undergoing injection, or (iii) affect the loss function of the NN. Instead, it acts directly at the backpropagation level, by increasing the penalty whenever the NN output violates the injected knowledge. An experiment is reported to demonstrate the potential (and limits) of our approach.</p>
      </abstract>
      <kwd-group>
        <kwd>symbolic knowledge injection</kwd>
        <kwd>AI</kwd>
        <kwd>ML</kwd>
        <kwd>neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>KILL works by appending a penalising layer to the network under training, which introduces an error whenever the network’s output is violating the knowledge to be injected. In other words, KILL performs knowledge injection
by constraining networks’ training to adhere to the symbolic knowledge.</p>
      <p>To validate our method, we report a simple experiment where the designer’s common sense
– conveniently represented as logic formulae – is injected into a NN classifier, to improve its
accuracy. Notably, our experiment shows how KILL can be exploited to improve classification
performance in inconvenient scenarios where training data is relatively small, and classes are
unbalanced and overlapping. Indeed, thanks to symbolic knowledge injection, these
inconveniences can be effectively tackled without re-engineering the dataset. The experiment also
reveals a lack of robustness w.r.t. extremely rare classes. Notably, this limitation lets us elaborate
an interesting discussion on the limits of symbolic knowledge injection attained via constraining.</p>
      <p>Accordingly, the paper is organised as follows. Section 2 briefly summarises the background
on symbolic knowledge injection, eliciting a number of related works. Then, Section 3 formally
describes KILL, as well as its rationale and internal operation. Section 4 then reports our
experiments and their design, while results are discussed in Section 5. Finally, Section 6
concludes the paper, providing some insights about how the current limitations of KILL could
be overcome.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background &amp; related works</title>
      <p>In this paper, symbolic knowledge injection (SKI) is the task of letting a sub-symbolic predictor
exploit formal, symbolic information to improve its predictive performance (e.g. accuracy,
F1-measure, learning time, etc.) over data, or to use the predictor as a logic engine. Unlike numeric
data upon which predictors are commonly trained, symbolic data is generally more compact
and expressive, as intensional representations of complex concepts may be concisely written. In
particular, symbolic information may encode rules that must be satisfied by the concepts
the predictor is willing to learn. Hence, provided that some SKI procedure is available, data
scientists may craft ad-hoc collections of symbolic expressions aimed at aiding the training of a
particular predictor, for a specific learning task. In other words, injection enables provisioning
prior knowledge – namely, the designer’s common sense – to ML predictors under training.</p>
      <p>
        When it comes to neural networks, approaches for SKI are manifold, and the literature on
this topic is vast [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. Broadly speaking, there exist at least two major sorts of approaches –
not mutually exclusive – supporting the injection of symbolic knowledge into neural networks.
Approaches of the first sort perform injection during the network’s training, using the symbolic
knowledge as either a constraint or a guide for the optimisation process [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref7 ref8 ref9">7, 8, 9, 10, 11, 12</xref>
        ].
Conversely, approaches of the second sort perform injection by altering the network’s
architecture to make it mimic the symbolic knowledge [
        <xref ref-type="bibr" rid="ref11 ref13 ref14 ref15 ref16 ref17 ref18 ref19 ref7">13, 14, 7, 15, 16, 17, 18, 11, 19</xref>
        ]. In the remainder
of this paper, we focus on approaches of the former sort, as our proposed method falls in this
category.
      </p>
      <p>
        According to the state of the art, the former strategy commonly works by converting the
symbolic knowledge into an additive regularisation term to be added to the loss function used for
training. In particular, when the predictor is a NN, knowledge injection is performed during the
back-propagation step. When the loss function is evaluated, if the network violates the injected
symbolic constraints, then the actual loss value will be greater than in the unconstrained case. In
this way, data scientists can lead the network’s learning algorithm to minimise constraint
violations, while minimising its error w.r.t. the data as well. Concerning the symbolic knowledge,
virtually all techniques we are aware of require information to be represented via (some subset
of) first-order logic (FOL) formulae—e.g. propositional logic (quite limited) [
        <xref ref-type="bibr" rid="ref10 ref12 ref14 ref15 ref18 ref7 ref9">14, 7, 15, 18, 9, 10, 12</xref>
        ]
or full FOL [
        <xref ref-type="bibr" rid="ref13 ref16 ref17">13, 16, 17</xref>
        ]. Actual methods may then vary, depending on (i) which particular subset
of FOL they rely upon, (ii) how logic formulae are interpreted as constraints, and (iii) whether
formulae require to be grounded or not, before SKI can occur. To the best of our knowledge,
virtually all methods proposed so far work by converting formulae into fuzzy logic functions,
and they often require formulae to be grounded at some point in the process.
      </p>
      <p>With respect to other state-of-the-art contributions, our proposal differs in several ways.
In particular, we accept logic formulae in Datalog form as input – meaning that we support
a restricted subset of FOL – and we do not require those formulae to be ground—neither are
formulae grounded anywhere in the process.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Knowledge Injection via Lambda Layer</title>
      <p>We propose KILL – short for Knowledge Injection via Lambda Layer – as an approach to SKI
where the training process is constrained in such a way that the network is penalised whenever
its predictions are in conflict with the symbolic knowledge to be injected. In doing so, our
approach does not impose any constraint on the architecture (e.g. number of layers, number of
neurons, types of activation functions, etc.) or the initialisation status (e.g. random weights or
partially trained) of the network subject to injection. It does require, however, (i) the network
to have an input and an output layer, and (ii) to be trained via gradient descent. Furthermore, it
also requires (iii) the symbolic knowledge to be expressed via one or more Datalog formulae,
and (iv) to encode logic statements about the network’s input or output features.</p>
      <sec id="sec-3-1">
        <title>3.1. Λ-layers for SKI</title>
        <p>KILL performs injection during training. It works by appending one further layer – the Λ-layer
henceforth – at the output end of the neural network, and by training the overall network as
usual, via gradient descent or other similar strategies. The Λ-layer is in charge of introducing
an error (w.r.t. the actual prediction provided by the network’s output layer) whenever the
prediction violates the symbolic knowledge. The error is expected to affect the gradient descent
– or whatever optimisation function – in such a way that violating the symbolic knowledge is
discouraged. In other words, the NN inductively learns the penalties applied to wrong predictions
over the examples, and is consequently much more inclined to avoid such wrong predictions. To
serve its purpose, the Λ-layer requires an ad-hoc activation function altering the outcome of the
network’s original output layer. It also needs the logic formulae to be numerically interpreted
– i.e., converted into functions of real numbers – to draw actual error values. Once the NN
training is over, the injection phase is considered over as well, hence the Λ-layer can be removed
and the remaining network can be used as usual. Hence, no architectural property of the original
network is hindered by the addition of the Λ-layer.</p>
        <p>Differences notwithstanding, injecting by regularising the loss function and injecting via
the Λ-layer serve the same purpose. However, in some cases it may be easier to manipulate
the network via the Λ-layer rather than via the loss function; in other scenarios, the designer
may choose to preserve the layer after the training, possibly with a different cost function than
the one reported in Equation (1). A more detailed discussion about the advantages of the Λ-layer is
reported in Section 5.</p>
        <p>In the remainder of this section we delve into the details of KILL. First, we discuss how the
Λ-layer affects the network architecture, and how it works in detail. Then, we discuss how logic
formulae can be numerically interpreted as errors.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Additional Λ-layer</title>
        <p>We consider the case of a symbolic knowledge base, denoted by K, to be injected into a
feed-forward NN of arbitrary depth, denoted by N. We denote as X (resp., Y) the input (resp.,
output) layer of N. Without loss of generality, we consider the case where the shape of Y
is n × 1, but a similar discussion would hold for layers of any shape (e.g. bi-, tri-, or
multi-dimensional). We draw no assumptions on the activation function of Y, nor on the amount,
topology, or nature of the hidden layers connecting X and Y, nor on the shape of X. Hence,
we denote by y = [y1, . . . , yn] the output of Y—i.e., the prediction of the network for some
input x. We also assume K to be composed by as many rules as the neurons in Y, thus we
write K = {φ1, . . . , φn}, where φi is a symbolic formula describing/constraining the relation
among x and yi.</p>
        <p>To perform injection, we alter the structure of N by adding one additional layer – namely,
the Λ-layer – as depicted in Figure 1, and we then train it as usual, via gradient descent. The
Λ-layer is densely connected with both X and Y, and its activation function aims at introducing
a penalty on yi every time the formula φi is violated by some input–output pair (x, y). In
particular, we denote as o the output of the Λ-layer, which is defined as follows:
o = y × (1 + c(x, y))
(1)
where c(x, y) is a positive penalty vector representing the cost of modifying the actual output
of the network y.</p>
        <p>
          In turn, the cost vector is defined as follows:
c(x, y) = [c1(x, y1), . . . , ci(x, yi), . . . , cn(x, yn)]
(2)
where ci : X × Y → [0, 1] is a function interpreting φi as a cost in the [0, 1] range, for each possible
actual value of x and yi.
        </p>
        <p>An in-depth discussion about how logic formulae can be interpreted as continuous (a.k.a. fuzzy)
penalties is provided in Section 3.4. For now, it is enough to understand that a penalty is
added on the output of the ith neuron of Y whenever the corresponding prediction violates the
symbolic knowledge in K. Such penalty is closer to 1 when the formula φi is violated the most,
while it is closer to 0 either when the formula is violated the least, or there is no formula for that
neuron.</p>
        <p>The rationale behind the Λ-layer, and the penalty it introduces, is that of altering the output
of the network in such a way that its error is very high when it violates the knowledge base to be
injected. In this way, the network error is the result of a function of two different components:
the actual prediction error and the penalty. The overall error is thus minimised during back
propagation, as well as both its components.</p>
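The Λ-layer’s forward pass of Equation (1) is easy to sketch. The following is an illustrative reading of it (the toy penalty function and the rule it encodes are invented for the example, not taken from the paper):

```python
# Illustrative sketch of the Lambda-layer forward pass of Equation (1):
# o = y * (1 + c(x, y)), with c(x, y) a per-neuron penalty vector in [0, 1].
def lambda_layer(y, x, penalty):
    """y: output vector of the network; x: input vector;
    penalty: callable returning the penalty vector c(x, y)."""
    c = penalty(x, y)
    # Outputs violating the knowledge are inflated, so the loss grows
    # and back-propagation discourages the violation.
    return [yi * (1.0 + ci) for yi, ci in zip(y, c)]

# Toy (invented) rule: neuron 0 is penalised whenever x[0] < 0.5.
toy_penalty = lambda x, y: [1.0 if x[0] < 0.5 else 0.0, 0.0]

print(lambda_layer([0.7, 0.3], [0.2], toy_penalty))  # [1.4, 0.3]
```

When no rule is violated the penalty vector is all zeros and the Λ-layer behaves as the identity, which is why it can be removed after training without altering the learned weights.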
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Input knowledge</title>
        <p>
          KILL supports the injection of knowledge bases composed by one or more logic formulae in
“stratified Datalog with negation” form—that is, a variant of Datalog [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] with no recursion
(neither direct nor indirect), yet supporting negated atoms.
        </p>
        <p>We choose Datalog because of its expressiveness (strictly higher than propositional logic) and
its acceptable limitations. The lack of recursion, in particular, prevents issues when it comes to
converting formulae into neural structures (which are DAGs).</p>
        <p>
          More precisely, Datalog is a restricted subset of FOL, representing knowledge via function-free
Horn clauses [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Horn clauses, in turn, are formulae of the form φ ← ψ1 ∧ ψ2 ∧ . . . denoting
a logic implication (←) stating that φ (the head of the clause) is implied by the conjunction
among a number of atoms ψ1, ψ2, . . . (the body of the clause). Since we rely on Datalog with
negation, we allow atoms in the bodies of clauses to be negated. In case the ith atom in the body
of some clause is negated, we write ¬ψi. There, each atom φ, ψ1, ψ2, . . . may be a predicate of
arbitrary arity.
        </p>
        <p>An n-ary predicate p denotes a relation among n entities: p(t1, . . . , tn), where each ti is a
term, i.e., either a constant (denoted in monospace) representing a particular entity, or a logic
variable (denoted by Capitalised Italics) representing some unknown entity or value.
Well-known binary predicates – e.g., &gt;, &lt;, = – are admissible, too, and retain their usual semantics
from arithmetic. For the sake of readability, we may write these predicates in infix form—hence
&gt;(X, 1) ≡ X &gt; 1.</p>
        <p>Consider for instance the case of a rule aimed at defining when a Poker hand can be classified
as a pair—the example may be useful in the remainder of this paper. Assuming that a Poker
hand consists of 5 cards, each one denoted by a couple of variables Ri, Si – where Ri (resp. Si)
is the rank (resp. seed) of the ith card in the hand –, hands of type pair may be described via a
set of clauses such as the following ones:
pair(R1, S1, . . . , R5, S5) ← R1 = R2
pair(R1, S1, . . . , R5, S5) ← R2 = R3
.
.
.
pair(R1, S1, . . . , R5, S5) ← R4 = R5
(3)</p>
        <p>To support injection into a particular NN, we further assume the input knowledge base
defines one (and only one) outer relation – say output or class – involving as many variables as
the input and output features the NN has been trained upon. That relation must be defined via
one clause per output neuron. Yet, each clause may contain other predicates in its body, in
turn defined by one or more clauses. In that case, since we rely on stratified Datalog, we require
the input knowledge to not include any (directly or indirectly) recursive clause definition.</p>
        <p>For example, for a 3-class classification task, any provided knowledge base should include
clauses such as the following ones:
class(X̄, y1) ← p1(X̄) ∧ p2(X̄)
p1(X̄) ← . . .
p2(X̄) ← . . .
class(X̄, y2) ← p1′(X̄) ∧ p2′(X̄)
p1′(X̄) ← . . .
p2′(X̄) ← . . .
class(X̄, y3) ← p1′′(X̄) ∧ p2′′(X̄)
p1′′(X̄) ← . . .
p2′′(X̄) ← . . .
where X̄ is a tuple having as many variables as the neurons in the input layer, yi is a constant
denoting the ith class, and p1, p2, p1′, p2′, p1′′, p2′′ are ancillary predicates defined via Horn clauses
as well.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Logic formulae as penalties</title>
        <p>
          Before undergoing injection, each formula corresponding to some output neuron must be
converted into a real-valued function aimed at computing the cost of violating that formula. To
this end, we rely on a multi-valued interpretation of logic inspired by Łukasiewicz’s logic [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
        <p>
          Accordingly, we encode each formula via the J·K function, mapping logic formulae into
real-valued functions accepting real vectors of size m + n as input – where m (resp. n) is the
number of input (resp. output) features – and returning scalars in R as output. These scalars
are then clipped into the [0, 1] range, via the function η : R → [0, 1], defined as follows:
η(x) = 0 if x ≤ 0;  η(x) = x if 0 &lt; x &lt; 1;  η(x) = 1 if x ≥ 1
(4)
The resulting values are the penalties discussed in Section 3.2. Hence, the penalty associated
with the ith neuron violating rule φi can be written as ci(x, yi) = η(JφiK(x, yi)).
        </p>
        <p>Table 1 reports the continuous (C.) interpretation of each sort of formula, where 0 means
“satisfied” and 1 means “maximally violated”:
J¬φK = η(1 − JφK)
Jφ ∧ ψK = η(max(JφK, JψK))
Jφ ∨ ψK = η(min(JφK, JψK))
Jφ = ψK = η(|JφK − JψK|)
Jφ ≠ ψK = J¬(φ = ψ)K
Jφ &gt; ψK = η(0.5 − (JφK − JψK))
Jφ &lt; ψK = η(0.5 + (JφK − JψK))
Jφ ≥ ψK = η(JψK − JφK)
Jφ ≤ ψK = η(JφK − JψK)
Jclass(X̄, yi) ← ψK = JψK (encodes the penalty for the ith neuron)
Jexpr(X̄)K = expr(JX̄K)
JtrueK = 0
JfalseK = 1
JkK = k
Jp(X̄)K = Jψ1 ∨ . . . ∨ ψmK (assuming predicate p is defined by m clauses of the form p(X̄) ← ψ1, . . . , p(X̄) ← ψm)</p>
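The clipping function η of Equation (4), together with a few of the operator interpretations of Table 1, can be sketched in a handful of lines (this is our reading of the encoding; the function names are ours):

```python
# Our reading of Equation (4) and of a few Table 1 operators (names are ours):
# penalties live in [0, 1], where 0 means "satisfied" and 1 "maximally violated".
def eta(x):
    """Clipping function of Equation (4)."""
    return 0.0 if x <= 0 else (1.0 if x >= 1 else x)

neg  = lambda p: eta(1 - p)             # interprets negation
and_ = lambda p, q: eta(max(p, q))      # conjunction: violated if either side is
or_  = lambda p, q: eta(min(p, q))      # disjunction: violated only if both are
eq   = lambda a, b: eta(abs(a - b))     # equality between terms
geq  = lambda a, b: eta(b - a)          # "greater or equal" between terms

# Penalty of the body (X1 >= 3) and (X2 >= 2), for some concrete inputs:
body = lambda x1, x2: and_(geq(x1, 3), geq(x2, 2))
print(body(5, 4))  # 0.0 -- both conditions hold
print(body(2, 4))  # 1.0 -- the first condition is fully violated
```

Note how the max/min duality mirrors the penalty semantics: a conjunction inherits the worst violation among its conjuncts, while a disjunction is only as violated as its best disjunct.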
        <p>The J·K encoding function is recursively defined in Table 1. Put simply, when
computing the penalty ci(x, yi) of the ith neuron, KILL looks for the only Datalog rule of the form
class(X̄, yi) ← ψ. It then focuses on the body of this rule – namely, ψ – ignoring its head—since
the head simply reports which expected output the rule is focussing upon. If the body ψ contains
some predicates p1, p2, . . . defined by one or more clauses in the provided knowledge base, then
these predicates are replaced by the disjunction of the bodies of all clauses defining them. This
process is repeated until no ancillary predicates remain in ψ, except for binary expressions
involving input variables, constants, arithmetic operators, and logical connectives. Finally,
operators and connectives are replaced by continuous functions, as indicated by Table 1. The
whole process produces a real-valued interpretation of the original formula which shall be used
by KILL to compute ci(x, yi).</p>
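The body-unfolding step just described can be sketched as follows (a hypothetical representation of clauses as nested tuples; this is not PSyKI’s actual API):

```python
# Hypothetical sketch of the body-unfolding step: ancillary predicates are
# replaced by the disjunction of the bodies of their defining clauses,
# recursively, until only primitive expressions remain.
def unfold(body, kb):
    """body: ('pred', name), ('and'/'or'/'not', ...), or a primitive tuple;
    kb: dict mapping predicate names to the list of their clause bodies."""
    if isinstance(body, tuple) and body[0] == 'pred':
        bodies = [unfold(b, kb) for b in kb[body[1]]]
        out = bodies[0]
        for b in bodies[1:]:  # disjunction of all defining clauses
            out = ('or', out, b)
        return out
    if isinstance(body, tuple) and body[0] in ('and', 'or', 'not'):
        return (body[0],) + tuple(unfold(t, kb) for t in body[1:])
    return body  # primitive expression: left as-is

kb = {'p1': [('ge', 'X1', 3)],
      'p2': [('ge', 'X2', 2), ('eq', 'X1', 'X2')]}
print(unfold(('and', ('pred', 'p1'), ('pred', 'p2')), kb))
```

Since the knowledge is stratified (no recursion), this substitution always terminates.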
        <p>Figure 2 depicts an example of the encoding process where a logic formula is firstly simplified
– i.e., converted in a form where it only contains a minimal subset of operators –, and then
encoded into an actual real-valued function. The example formula is:
class(X1, X2, z) ← (X1 ≥ k) ∧ (X2 ≥ h)
where k, h, z are numeric constants, while X1 and X2 are input variables and z identifies the
expected output class. In particular, Figure 2a shows the abstract syntax tree (AST) of this formula, Figure 2b
shows the same AST where the ≥ operator is replaced by a negated &lt; operator, and Figure 2c
shows the AST of the encoded function.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>Here we report a number of experiments aimed at assessing KILL for SKI w.r.t. its capability
to improve neural networks’ predictive performance. A public implementation of the KILL
algorithm is available in PSyKI [22].</p>
      <p>The design of the experiments is straightforward. We consider a simple learning task –
namely, classification – on a finite domain where (i) it is easy to formulate correct constraints
in Datalog, and (ii) NN training is difficult because of, e.g., poor separability among classes,
as well as unevenly distributed training data. Along this line, we write a set of logic formulae,
one for each class, logically denoting how classification should be performed. We then train a
neural network to solve such a classification task, with and without injecting those formulae.
We repeat the experiment by injecting different subsets of formulae each time. Finally, we assess
if and under which circumstances SKI succeeds/fails in improving the network’s predictive
accuracy. (For the sake of reproducibility, the code of our experiments is available at
https://github.com/MatteoMagnini/kill-experiments-woa-2022.)</p>
      <p>The rationale behind the experiment design is to assess the effectiveness of SKI in a toy
scenario where the correctness of the symbolic knowledge is undoubted, and where an ordinary
NN may easily struggle in reaching good predictive performance in a reasonable time. A
secondary goal of this design is to identify potential corner cases where SKI falls short.</p>
      <p>As we empirically demonstrate in the remainder of this section, KILL is capable of improving
the predictive performance of a neural network classifier trained on such a dataset, despite being
sensitive to severe class unbalancing. A discussion about the possible motivations behind such
sensitivity is then provided in Section 5.</p>
      <sec id="sec-4-1">
        <title>4.1. Poker hand data set and logic rules</title>
        <p>We rely on the poker hand data set [23], which subtends a multi-classification task on a finite
– yet very large – discrete domain, where classes are overlapped and heavily unbalanced,
while exact classification rules can be written as logic formulae. It consists of a tabular dataset,
containing 1,025,010 records—each one composed of 11 features. Each record encodes a poker
hand of 5 cards. Hence, each record involves 5 couples of features – denoting the cards in
the hand –, plus a single categorical feature denoting the class of the hand. Two features are
necessary to identify each card: suit and rank. Suit is a categorical feature (heart, spade, diamond,
and club), while rank is an ordinal feature—suitably represented by an integer between 1 and 13
(ace to king). The multi-classification task consists in predicting the poker hand’s class. Each
hand may be classified as one of 10 different classes denoting the nature of the hand according
to the rules of Poker (e.g. nothing, pair, double pair, flush).</p>
        <p>This data set satisfies all the aforementioned requirements: (i) the input space is discrete and
finite in size – namely, the amount of 5-permutations of 52 cards, i.e., 52!/(52 − 5)! – and the
available dataset is just a small sample of it; (ii) classes are extremely
unbalanced, as shown in Table 2: a few classes (e.g. nothing and pair) cover nearly half of the
dataset, while most classes cover less than 1% of the dataset; (iii) there is a hierarchy between
classes (e.g. if there are three cards with the same rank, then the class is three of a kind even if the
condition for one pair is satisfied). We use 25,010 records for training and the remaining million
for testing, as shown in Table 2.</p>
        <p>We define a class rule for each class, encoding the preferred way of classifying a Poker hand.
For example, let {R1, S1, . . . , R5, S5} be the logic variables representing a Poker hand (Si for
suit and Ri for rank), then for class flush we define the following rule:
class(R1, S1, . . . , R5, S5, flush) ← flush(R1, S1, . . . , R5, S5)
flush(R1, S1, . . . , R5, S5) ← S1 = S2 ∧ S1 = S3 ∧ S1 = S4 ∧ S1 = S5
(5)
All other rules have the same structure as Equation (5): the left-hand side declares the expected
class, while the right-hand side describes the necessary conditions for that class—possibly, via
some ancillary predicates such as flush. Table 3 provides an overview of all the rules we rely
upon in our experiments.</p>
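As a sanity check of what such rules express, the crisp (non-fuzzy) semantics of the pair and flush predicates can be stated in a few lines (our illustrative code, not part of KILL):

```python
from collections import Counter

# Illustrative crisp semantics of two of the class rules above:
# a hand is a list of (rank, suit) couples.
def is_flush(hand):
    suits = [s for _, s in hand]
    return all(s == suits[0] for s in suits)  # S1 = S2, ..., S1 = S5

def is_pair(hand):
    ranks = Counter(r for r, _ in hand)
    return any(n >= 2 for n in ranks.values())  # some Ri = Rj

hand = [(2, 'heart'), (7, 'heart'), (9, 'heart'), (11, 'heart'), (13, 'heart')]
print(is_flush(hand), is_pair(hand))  # True False
```

KILL, of course, never evaluates the rules crisply: it uses their continuous interpretation (Section 3.4) so that violations contribute a differentiable penalty.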
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Methodology</title>
        <p>It is worth repeating that we choose to use the same data partitioning proposed by the authors
of the dataset, meaning that we rely on 25,010 samples for the training set and 1,000,000
for the test set. Such a small training set w.r.t. the test set is quite unusual: it makes the
task more challenging, yet at the same time it makes results more reliable.</p>
        <p>We use the same starting model for all the experiments, consisting of a fully connected NN
with 3 layers, where each layer has a rectified linear unit (ReLU) activation function, except for
the last one, which has softmax. We use categorical cross-entropy as the loss function for training.
After an empirical exploration, the best NN has 128 neurons in both the first and second layer—and
10 in the output layer (the number of classes). We use a batch size of 32 and 100 epochs for the network’s
training. Networks’ performance is evaluated using accuracy, macro-F1, and weighted-F1 score
functions. To evaluate the network’s performance with knowledge injection, we remove the
Λ-layer and use the resulting NN as is.</p>
        <p>
          Since we are not relying on any validation set, to avoid overfitting we use 3 stopping criteria
during the training of the network: (i) for 99% of training examples, the activation of every
output unit is within 0.25 of the correct one; (ii) at most 100 epochs; (iii) the predictor has at least 90%
accuracy on training examples but has not improved its ability to classify training examples for
5 epochs. Similar criteria were used in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
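The three stopping criteria can be sketched as follows (a hedged reading: the thresholds come from the text, while helper names and the exact no-improvement test are our assumptions):

```python
# Hedged sketch of the three stopping criteria: training halts as soon as
# any one of them fires.
def within_margin(outputs, targets, margin=0.25, fraction=0.99):
    """Criterion (i): every output unit within `margin` of its target,
    for at least `fraction` of the training examples."""
    ok = sum(all(abs(o - t) <= margin for o, t in zip(out, tgt))
             for out, tgt in zip(outputs, targets))
    return ok >= fraction * len(outputs)

def should_stop(epoch, acc_history, outputs, targets,
                max_epochs=100, min_acc=0.90, patience=5):
    if within_margin(outputs, targets):      # criterion (i)
        return True
    if epoch >= max_epochs:                  # criterion (ii)
        return True
    recent = acc_history[-patience:]         # criterion (iii): >= 90% accuracy
    return (len(acc_history) > patience      # but no improvement for 5 epochs
            and recent[-1] >= min_acc
            and max(recent) <= max(acc_history[:-patience]))
```

In a real training loop this check would run once per epoch, on training-set predictions, since no validation set is held out.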
        <p>Finally, we run 30 experiments for each configuration to have a statistical population for
comparisons.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>We define two different configurations for the experiments: (i) “classic”, where we use the NN
described in Section 4.2; (ii) “knowledge”, where we apply the KILL algorithm on the same network
architecture. For both configurations we use the same hyper-parameters and run 30 experiments.</p>
        <p>Results are reported in Table 4 and in Figure 3. For completeness, we list below the whole
rule set of Table 3, where X̄ abbreviates the tuple R1, S1, . . . , R5, S5:
class(X̄, pair) ← pair(X̄)
pair(X̄) ← R1 = R2
pair(X̄) ← R1 = R3
pair(X̄) ← R1 = R4
pair(X̄) ← R1 = R5
pair(X̄) ← R2 = R3
pair(X̄) ← R2 = R4
pair(X̄) ← R2 = R5
pair(X̄) ← R3 = R4
pair(X̄) ← R3 = R5
pair(X̄) ← R4 = R5
class(X̄, two) ← two(X̄)
two(X̄) ← R1 = R2 ∧ R3 = R4
two(X̄) ← R1 = R3 ∧ R2 = R4
two(X̄) ← R1 = R4 ∧ R2 = R3
two(X̄) ← R1 = R2 ∧ R3 = R5
two(X̄) ← R1 = R3 ∧ R2 = R5
two(X̄) ← R1 = R5 ∧ R2 = R3
two(X̄) ← R1 = R2 ∧ R4 = R5
two(X̄) ← R1 = R4 ∧ R2 = R5
two(X̄) ← R1 = R5 ∧ R2 = R4
two(X̄) ← R1 = R3 ∧ R4 = R5
two(X̄) ← R1 = R4 ∧ R3 = R5
two(X̄) ← R1 = R5 ∧ R3 = R4
two(X̄) ← R2 = R3 ∧ R4 = R5
two(X̄) ← R2 = R4 ∧ R3 = R5
two(X̄) ← R2 = R5 ∧ R3 = R4
class(X̄, three) ← three(X̄)
three(X̄) ← R1 = R2 ∧ R1 = R3
three(X̄) ← R1 = R2 ∧ R1 = R4
three(X̄) ← R1 = R2 ∧ R1 = R5
three(X̄) ← R1 = R3 ∧ R1 = R4
three(X̄) ← R1 = R3 ∧ R1 = R5
three(X̄) ← R1 = R4 ∧ R1 = R5
three(X̄) ← R2 = R3 ∧ R2 = R4
three(X̄) ← R2 = R3 ∧ R2 = R5
three(X̄) ← R2 = R4 ∧ R2 = R5
three(X̄) ← R3 = R4 ∧ R3 = R5
class(X̄, straight) ← royal(X̄)
class(X̄, straight) ← straight(X̄)
straight(X̄) ← (R1 + R2 + R3 + R4 + R5) = (5 * min(R1, . . . , R5) + 10) ∧ ¬pair(X̄)
royal(X̄) ← min(R1, . . . , R5) = 1 ∧ (R1 + R2 + R3 + R4 + R5 = 47) ∧ ¬pair(X̄)
class(X̄, flush) ← flush(X̄)
flush(X̄) ← S1 = S2 ∧ S1 = S3 ∧ S1 = S4 ∧ S1 = S5
class(X̄, four) ← four(X̄)
four(X̄) ← R1 = R2 ∧ R1 = R3 ∧ R1 = R4
four(X̄) ← R1 = R2 ∧ R1 = R3 ∧ R1 = R5
four(X̄) ← R1 = R2 ∧ R1 = R4 ∧ R1 = R5
four(X̄) ← R1 = R3 ∧ R1 = R4 ∧ R1 = R5
four(X̄) ← R2 = R3 ∧ R2 = R4 ∧ R2 = R5
class(X̄, full) ← three(X̄) ∧ two(X̄) ∧ ¬four(X̄)
class(X̄, straight_flush) ← straight(X̄) ∧ flush(X̄)
class(X̄, straight_flush) ← royal(X̄) ∧ flush(X̄)
class(X̄, royal) ← royal(X̄) ∧ flush(X̄)
class(X̄, nothing) ← ¬pair(X̄) ∧ ¬flush(X̄) ∧ ¬straight(X̄) ∧ ¬royal(X̄)</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In general, experiments show that both predictors are very good in classifying common classes
such as “nothing” and “pair”. Also for less frequent classes, “two pairs” and “three of a kind”,
accuracy is still high. (Figure 3 reports the per-class accuracy distributions of the “classic” and
“knowledge” configurations.) The remaining six classes represent 0.8% of the training set and
are therefore much more difficult to correctly predict. For “full house” and “straight”, accuracy has
middle values, while for the remaining classes accuracy is pretty close to 0. All of this is quite
expected, due to the strong unbalance in the distribution of classes. The only remarkable fact is
that accuracy is extremely low for class “flush”, even if less frequent classes such as “full house”
and “four of a kind” have higher accuracy. We hypothesise that this happens because the vast
majority of data consists of classes which depend only on the values of rank, and therefore the
network tends to weigh rank much more. Indeed, only “flush”, “straight flush”, and “royal flush”
(about 0.26% of the training set) depend on the values of suit.</p>
      <p>Concerning the comparison between the “classic” predictor – with no additional knowledge –
and the “knowledge” predictor – obtained by applying the KILL algorithm – results show that the
latter performs better with statistical significance (Student’s t-test with p-value &lt; 0.01
when comparing the overall accuracy of the two populations of experiments). In particular, we
can observe that the predictor exploiting knowledge during training has, on average, higher
accuracy for all frequent classes. There is also a great improvement for less common classes such
as “straight” and “full house”. However, accuracy does not improve for “flush” and the very sporadic
classes. We speculate that the reason for the lack of improvement on rare
classes lies in the kind of injection we are performing. Injecting knowledge by constraining
the network during training is effective as long as classes are represented by a sufficient
number of examples. For instance, if a class is not represented at all, having a logic rule for that
specific class is the same as not having it. So, if a class is represented by only a few units over a
million examples, the effect is too small to make a difference. We believe this
issue may be common to all SKI algorithms based on constraining; however, this should
be verified in future work.</p>
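      <p>The significance check described above can be sketched as follows: a two-sample Student’s t-test with pooled variance over per-run overall accuracies. The accuracy values below are illustrative placeholders, not the actual experimental measurements.</p>

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample Student's t statistic with pooled variance
    (equal-variance assumption, df = len(a) + len(b) - 2)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

# Overall accuracies over repeated runs (illustrative placeholders).
classic   = [0.55, 0.56, 0.54, 0.57, 0.55]
knowledge = [0.60, 0.61, 0.59, 0.62, 0.60]

t = t_statistic(knowledge, classic)
# With df = 8, any t above roughly 3.36 corresponds to p < 0.01 (two-tailed).
print(round(t, 2))
```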
      <p>To overcome this limit, we may exploit the advantage of having a Λ-layer enforce the
constraints instead of operating directly on the loss function of the network. More precisely,
we can keep the Λ-layer after the training of the network and just change the function in
Equation (1) in such a way that the cost of violating the knowledge is used to reduce the
network’s error rather than to increase it. This, too, should be explored in future work.</p>
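      <p>As a minimal sketch of this idea (all names here are hypothetical, and Equation (1) itself lies outside this excerpt): a Λ-layer may add a per-class fuzzy violation cost to the class scores during training, and the same layer could flip the sign after training so that the encoded knowledge corrects the output instead of penalising it.</p>

```python
def lambda_layer(scores, violations, sign=1.0):
    """Illustrative Λ-layer (hypothetical names): per-class fuzzy
    violation costs in [0, 1] are combined with the class scores.
    sign=+1.0 : training mode, a violated rule inflates the cost of the
                prediction so backpropagation steers the network away from it;
    sign=-1.0 : post-training mode, the same cost is used to reduce the
                error w.r.t. the prior knowledge instead of increasing it."""
    return [s + sign * v for s, v in zip(scores, violations)]

# Class 1's output violates its rule with fuzzy degree 0.5.
scores = [0.2, 0.7, 0.1]
violations = [0.0, 0.5, 0.0]

penalised = lambda_layer(scores, violations, sign=1.0)   # during training
corrected = lambda_layer(scores, violations, sign=-1.0)  # kept after training
```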
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work we define KILL, a general technique for prior symbolic knowledge injection into
deep neural networks. Designers may use Datalog rules to express common-sense knowledge, which is
injected into the network through a Λ-layer. Rules are encoded into class-specific fuzzy logic
functions that add a cost to the class prediction value when the rule is violated.</p>
      <p>We report a number of experiments where we compare networks without knowledge
injection against networks that receive additional information in a multi-class classification task. The
selected task exhibits some common criticalities of ML classification tasks, in particular data set
size limitation, unbalanced classes, class overlapping, and intra-class constraints. Results show
that our approach can improve the network’s accuracy for classes that are not too sporadic, as well as the
overall accuracy. However, results also reveal a limitation: KILL is quite sensitive w.r.t.
situations that have been rarely met during training. We speculate this is a general limitation
characterising SKI methods that act by constraining the predictor during training. Along this line,
further experiments over different methods are required to confirm such a general statement.</p>
      <p>Accordingly, in our future work we shall consider different Λ-layer functions and test the
technique on different domains. To mitigate the sensitivity w.r.t. rare situations, we intend to
investigate scenarios where we keep the Λ-layer after training but, instead of increasing the
network’s error, it reduces the error w.r.t. the prior knowledge (i.e., a different function is used
in the Λ-layer). In this way we combine techniques based on both constraining during
training and structuring (i.e., altering the predictor’s architecture). This is expected to mitigate
the issue, as information about rare situations is encoded into the network structure after
training.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This paper was partially supported by the CHIST-ERA IV project “Expectation” –
CHIST-ERA19-XAI-005 – co-funded by the EU and the Italian MUR (Ministry for University and Research).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] <string-name>Z. C. Lipton</string-name>,
          <article-title>The mythos of model interpretability</article-title>,
          <source>Communications of the ACM</source> <volume>61</volume> (<year>2018</year>)
          <fpage>36</fpage>-<lpage>43</lpage>. doi:10.1145/3233231.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] <string-name>R. Guidotti</string-name>, <string-name>A. Monreale</string-name>,
          <string-name>S. Ruggieri</string-name>, <string-name>F. Turini</string-name>,
          <string-name>F. Giannotti</string-name>, <string-name>D. Pedreschi</string-name>,
          <article-title>A survey of methods for explaining black box models</article-title>,
          <source>ACM Computing Surveys</source> <volume>51</volume> (<year>2019</year>)
          <fpage>93:1</fpage>-<lpage>93:42</lpage>. doi:10.1145/3236009.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name>M. Ajtai</string-name>, <string-name>Y. Gurevich</string-name>,
          <article-title>Datalog vs first-order logic</article-title>,
          <source>Journal of Computer and System Sciences</source> <volume>49</volume> (<year>1994</year>)
          <fpage>562</fpage>-<lpage>588</lpage>. doi:10.1016/S0022-0000(05)80071-6.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name>T. R. Besold</string-name>, <string-name>A. S. d'Avila Garcez</string-name>,
          <string-name>S. Bader</string-name>, <string-name>H. Bowman</string-name>,
          <string-name>P. M. Domingos</string-name>, <string-name>P. Hitzler</string-name>,
          <string-name>K. Kühnberger</string-name>, <string-name>L. C. Lamb</string-name>,
          <string-name>D. Lowd</string-name>, <string-name>P. M. V. Lima</string-name>,
          <string-name>L. de Penning</string-name>, <string-name>G. Pinkas</string-name>,
          <string-name>H. Poon</string-name>, <string-name>G. Zaverucha</string-name>,
          <article-title>Neural-symbolic learning and reasoning: A survey and interpretation</article-title>,
          in: P. Hitzler, M. K. Sarker (Eds.),
          <source>Neuro-Symbolic Artificial Intelligence: The State of the Art</source>,
          volume <volume>342</volume> of Frontiers in Artificial Intelligence and Applications, IOS Press,
          <year>2021</year>, pp. <fpage>1</fpage>-<lpage>51</lpage>. doi:10.3233/FAIA210348.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] <string-name>Y. Xie</string-name>, <string-name>Z. Xu</string-name>,
          <string-name>K. S. Meel</string-name>, <string-name>M. S. Kankanhalli</string-name>,
          <string-name>H. Soh</string-name>,
          <article-title>Embedding symbolic knowledge into deep networks</article-title>,
          in: H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</source>,
          December 8-14, <year>2019</year>, Vancouver, BC, Canada, <year>2019</year>,
          pp. <fpage>4235</fpage>-<lpage>4245</lpage>.
          URL: https://proceedings.neurips.cc/paper/2019/hash/7b66b4fd401a271a1c7224027ce111bc-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] <string-name>R. Calegari</string-name>, <string-name>G. Ciatto</string-name>,
          <string-name>A. Omicini</string-name>,
          <article-title>On the integration of symbolic and sub-symbolic techniques for XAI: A survey</article-title>,
          <source>Intelligenza Artificiale</source> <volume>14</volume> (<year>2020</year>)
          <fpage>7</fpage>-<lpage>32</lpage>. doi:10.3233/IA-190036.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] <string-name>V. Tresp</string-name>, <string-name>J. Hollatz</string-name>,
          <string-name>S. Ahmad</string-name>,
          <article-title>Network structuring and training using rule-based knowledge</article-title>,
          in: S. J. Hanson, J. D. Cowan, C. L. Giles (Eds.),
          <source>Advances in Neural Information Processing Systems</source> <volume>5</volume>,
          Morgan Kaufmann, <year>1992</year>, pp. <fpage>871</fpage>-<lpage>878</lpage>.
          URL: http://papers.nips.cc/paper/638-network-structuring-and-training-using-rule-based-knowledge.
          NIPS Conference, Denver, Colorado, USA, November 30 - December 3, <year>1992</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] <string-name>S. Bader</string-name>, <string-name>S. Hölldobler</string-name>,
          <string-name>N. C. Marques</string-name>,
          <article-title>Guiding backprop by inserting rules</article-title>,
          in: A. S. d. Garcez, P. Hitzler (Eds.),
          <source>Proceedings of the 4th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy)</source>,
          Patras, Greece, July 21, <year>2008</year>, volume <volume>366</volume> of CEUR Workshop Proceedings,
          CEUR-WS.org, <year>2008</year>, pp. <fpage>19</fpage>-<lpage>22</lpage>.
          URL: http://ceur-ws.org/Vol-366/paper-5.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] <string-name>M. Diligenti</string-name>, <string-name>S. Roychowdhury</string-name>,
          <string-name>M. Gori</string-name>,
          <article-title>Integrating prior knowledge into deep learning</article-title>,
          in: X. Chen, B. Luo, F. Luo, V. Palade, M. A. Wani (Eds.),
          <source>16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017</source>,
          Cancun, Mexico, December 18-21, <year>2017</year>, IEEE, <year>2017</year>,
          pp. <fpage>920</fpage>-<lpage>923</lpage>. doi:10.1109/ICMLA.2017.00-37.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] <string-name>J. Xu</string-name>, <string-name>Z. Zhang</string-name>,
          <string-name>T. Friedman</string-name>, <string-name>Y. Liang</string-name>,
          <string-name>G. Van den Broeck</string-name>,
          <article-title>A semantic loss function for deep learning with symbolic knowledge</article-title>,
          in: J. G. Dy, A. Krause (Eds.),
          <source>Proceedings of the 35th International Conference on Machine Learning (ICML)</source>,
          Stockholmsmässan, Stockholm, Sweden, July 10-15, <year>2018</year>,
          volume <volume>80</volume> of Proceedings of Machine Learning Research, PMLR, <year>2018</year>,
          pp. <fpage>5498</fpage>-<lpage>5507</lpage>. URL: http://proceedings.mlr.press/v80/xu18h.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] <string-name>R. Evans</string-name>, <string-name>E. Grefenstette</string-name>,
          <article-title>Learning explanatory rules from noisy data</article-title>,
          <source>Journal of Artificial Intelligence Research</source> <volume>61</volume> (<year>2018</year>)
          <fpage>1</fpage>-<lpage>64</lpage>. doi:10.1613/jair.5714.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] <string-name>G. Marra</string-name>, <string-name>F. Giannini</string-name>,
          <string-name>M. Diligenti</string-name>, <string-name>M. Gori</string-name>,
          <article-title>LYRICS: A general interface layer to integrate logic inference and deep learning</article-title>,
          in: U. Brefeld, É. Fromont, A. Hotho, A. J. Knobbe, M. H. Maathuis, C. Robardet (Eds.),
          <source>Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019</source>,
          Würzburg, Germany, September 16-20, <year>2019</year>, Proceedings, Part II,
          volume <volume>11907</volume> of Lecture Notes in Computer Science, Springer, <year>2019</year>,
          pp. <fpage>283</fpage>-<lpage>298</lpage>. doi:10.1007/978-3-030-46147-8_17.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] <string-name>D. H. Ballard</string-name>,
          <article-title>Parallel logical inference and energy minimization</article-title>,
          in: T. Kehler (Ed.),
          <source>Proceedings of the 5th National Conference on Artificial Intelligence</source>,
          Philadelphia, PA, USA, August 11-15, <year>1986</year>, Volume 1: Science, Morgan Kaufmann,
          <year>1986</year>, pp. <fpage>203</fpage>-<lpage>209</lpage>.
          URL: http://www.aaai.org/Library/AAAI/1986/aaai86-033.php.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] <string-name>G. G. Towell</string-name>, <string-name>J. W. Shavlik</string-name>,
          <string-name>M. O. Noordewier</string-name>,
          <article-title>Refinement of approximate domain theories by knowledge-based neural networks</article-title>,
          in: H. E. Shrobe, T. G. Dietterich, W. R. Swartout (Eds.),
          <source>Proceedings of the 8th National Conference on Artificial Intelligence</source>,
          Boston, Massachusetts, USA, July 29 - August 3, <year>1990</year>, 2 Volumes,
          AAAI Press / The MIT Press, <year>1990</year>, pp. <fpage>861</fpage>-<lpage>866</lpage>.
          URL: http://www.aaai.org/Library/AAAI/1990/aaai90-129.php.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] <string-name>A. S. d. Garcez</string-name>, <string-name>G. Zaverucha</string-name>,
          <article-title>The connectionist inductive learning and logic programming system</article-title>,
          <source>Applied Intelligence</source> <volume>11</volume> (<year>1999</year>)
          <fpage>59</fpage>-<lpage>77</lpage>. doi:10.1023/A:1008328630915.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] <string-name>A. S. d. Garcez</string-name>, <string-name>D. M. Gabbay</string-name>,
          <article-title>Fibring neural networks</article-title>,
          in: D. L. McGuinness, G. Ferguson (Eds.),
          <source>Proceedings of the Nineteenth National Conference on Artificial Intelligence, Sixteenth Conference on Innovative Applications of Artificial Intelligence</source>,
          July 25-29, <year>2004</year>, San Jose, California, USA, AAAI Press / The MIT Press,
          <year>2004</year>, pp. <fpage>342</fpage>-<lpage>347</lpage>.
          URL: http://www.aaai.org/Library/AAAI/2004/aaai04-055.php.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] <string-name>S. Bader</string-name>, <string-name>A. S. d. Garcez</string-name>,
          <string-name>P. Hitzler</string-name>,
          <article-title>Computing first-order logic programs by fibring artificial neural networks</article-title>,
          in: I. Russell, Z. Markov (Eds.),
          <source>Proceedings of the 18th International Florida Artificial Intelligence Research Society Conference (FLAIRS)</source>,
          Clearwater Beach, Florida, USA, May 15-17, <year>2005</year>, AAAI Press, <year>2005</year>,
          pp. <fpage>314</fpage>-<lpage>319</lpage>.
          URL: http://www.aaai.org/Library/FLAIRS/2005/flairs05-052.php.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] <string-name>M. V. M. França</string-name>, <string-name>G. Zaverucha</string-name>,
          <string-name>A. S. d. Garcez</string-name>,
          <article-title>Fast relational learning using bottom clause propositionalization with artificial neural networks</article-title>,
          <source>Machine Learning</source> <volume>94</volume> (<year>2014</year>)
          <fpage>81</fpage>-<lpage>104</lpage>. doi:10.1007/s10994-013-5392-1.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] <string-name>R. Manhaeve</string-name>, <string-name>S. Dumancic</string-name>,
          <string-name>A. Kimmig</string-name>, <string-name>T. Demeester</string-name>,
          <string-name>L. De Raedt</string-name>,
          <article-title>Neural probabilistic logic programming in DeepProbLog</article-title>,
          <source>Artificial Intelligence</source> <volume>298</volume> (<year>2021</year>)
          <fpage>103504</fpage>. doi:10.1016/j.artint.2021.103504.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] <string-name>D. Maier</string-name>, <string-name>K. T. Tekle</string-name>,
          <string-name>M. Kifer</string-name>, <string-name>D. S. Warren</string-name>,
          <article-title>Datalog: concepts, history, and outlook</article-title>,
          in: M. Kifer, Y. A. Liu (Eds.),
          <source>Declarative Logic Programming: Theory, Systems, and Applications</source>,
          ACM / Morgan &amp; Claypool, <year>2018</year>, pp. <fpage>3</fpage>-<lpage>100</lpage>.
          doi:10.1145/3191315.3191317.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] <string-name>L. S. Hay</string-name>,
          <article-title>Axiomatization of the infinite-valued predicate calculus</article-title>,
          <source>The Journal of Symbolic Logic</source> <volume>28</volume> (<year>1963</year>)
          <fpage>77</fpage>-<lpage>86</lpage>. URL: http://www.jstor.org/stable/2271339.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] <string-name>M. Magnini</string-name>, <string-name>G. Ciatto</string-name>,
          <string-name>A. Omicini</string-name>,
          <article-title>On the design of PSyKI: a platform for symbolic knowledge injection into sub-symbolic predictors</article-title>,
          in: D. Calvaresi, A. Najjar, M. Winikoff, K. Främling (Eds.),
          <source>Proceedings of the 4th International Workshop on EXplainable and TRAnsparent AI and Multi-Agent Systems</source>,
          volume <volume>13283</volume> of Lecture Notes in Computer Science, Springer,
          <year>2022</year>, pp. <fpage>90</fpage>-<lpage>108</lpage>. doi:10.1007/978-3-031-15565-9_6.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] <string-name>R. Cattral</string-name>, <string-name>F. Oppacher</string-name>,
          <article-title>Poker hand data set</article-title>,
          UCI machine learning repository, <year>2007</year>.
          URL: https://archive.ics.uci.edu/ml/datasets/Poker+Hand.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>