<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Inductive Learning of Complex Knowledge from Raw Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Cunnington</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Law</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Lobo</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Russo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research Europe</institution>
          ,
          <addr-line>Winchester</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ILASP Limited</institution>
          ,
          <addr-line>Grantham</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Imperial College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>One of the ultimate goals of Artificial Intelligence is to learn generalised and human-interpretable knowledge from raw data. Neuro-symbolic reasoning approaches partly tackle this problem by improving the training of a neural network using a manually engineered symbolic knowledge base. In the case where symbolic knowledge is learned from raw data, this knowledge lacks the expressivity required to solve complex problems. In this paper, we introduce the Neuro-Symbolic Inductive Learner (NSIL), an approach that trains a neural network to extract latent concepts from raw data, whilst learning symbolic knowledge that solves complex problems defined in terms of these latent concepts. The novelty of our approach is a method for biasing a symbolic learner to learn improved knowledge, based on the in-training performance of both neural and symbolic components. We evaluate NSIL on two problem domains that require learning knowledge with different levels of complexity, and demonstrate that NSIL learns knowledge that is not possible to learn with other neuro-symbolic systems, whilst outperforming baseline models in terms of accuracy and data efficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>neuro-symbolic learning</kwd>
        <kwd>inductive logic programming</kwd>
        <kwd>raw data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Within Artificial Intelligence, one of the ultimate goals is to learn generalised and human-interpretable knowledge from raw data. Neuro-symbolic approaches tackle this problem by combining the best features of both neural networks and symbolic learning and reasoning techniques [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. For example, many existing approaches improve the training of a neural network using a symbolic knowledge base that is manually engineered [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ]. In contrast, systems that learn symbolic knowledge are generally only applied to structured data, and use pre-trained neural networks when applied to raw data. To address this limitation, the approach of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] jointly trains a neural network whilst learning symbolic knowledge, using a meta-interpretive learner that only learns first-order definite logic programs [9]. Consequently, the learned symbolic knowledge lacks the expressivity required for many common-sense learning and reasoning tasks [10].
      </p>
      <p>
        In this paper, we present the Neuro-Symbolic Inductive Learner (NSIL), which jointly trains a neural network whilst learning complex, first-order symbolic knowledge, called an inductive hypothesis. Building upon state-of-the-art symbolic learners [11, 12], NSIL generalises the approach of [8] by learning more expressive knowledge from raw data in the language of Answer Set Programming (ASP) [13], solving computationally harder problems that [8] cannot solve. In NSIL, the learned hypothesis maps latent concepts to downstream labels, and the neural network is trained to classify these latent concepts from raw data, by reasoning over the learned hypothesis. This is a form of neuro-symbolic reasoning, where the neural network is trained in a weakly supervised fashion, as no labels are given for the latent concepts. In NSIL, we use the NeurASP system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to perform such computation, and seamlessly integrate symbolic learning by creating corrective examples for the symbolic learner. These examples use the current predictions of the neural network and the space of possible latent concept values relevant to the downstream label. A specific mechanism for weighting these examples is proposed, which biases the symbolic learner towards exploration or exploitation of the inductive hypothesis search space. NSIL is the first system that jointly trains a general neural network from scratch (unlike [14], which is limited to end-to-end training of a binary neural network in conjunction with rule learning), whilst inducing a complex first-order hypothesis in the form of an ASP program.
      </p>
      <p>Our experiments apply NSIL to two tasks: MNIST Arithmetic and MNIST Hitting Sets, where each training sample contains a collection of raw MNIST inputs [15, 16], together with a downstream label. We include specific relations as domain knowledge for each task, and compare NSIL's performance with purely differentiable baseline models. The results show that NSIL: (1) outperforms the baselines in terms of its overall accuracy and data efficiency, (2) trains the neural network to predict latent concepts with comparable accuracy to fully supervised differentiable models, and (3) learns interpretable knowledge. In comparison with the approach of [8], the MNIST Hitting Sets task [17] demonstrates that NSIL is capable of learning inductive hypotheses of increased expressivity, including constraints and choice rules. The learned hypotheses solve such computationally harder problems and go a step further by generating all the hitting sets of a given collection. This can only be achieved when the learned hypothesis is expressed in ASP, which is not possible with the approach of [8].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Many neuro-symbolic learning and reasoning techniques have been proposed in the literature [18]. Their primary objective is to leverage properties of symbolic methods to enhance the data efficiency, transferability and interpretability of neural models. Some techniques inject symbolic knowledge directly into the neural architectures [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ], whereas others preserve a clear distinction between a neural and a symbolic reasoning component [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ]. The key drawback of these neuro-symbolic reasoning approaches is that they require complete and manually engineered symbolic background knowledge. In contrast, our approach learns symbolic knowledge, expressed as a first-order rule-based program, while training a neural network.
      </p>
      <p>
        Pure symbolic learning systems are capable of learning interpretable knowledge in a data efficient manner [19, 20, 21]. However, as these systems are logic-based, they can only learn from structured data. Even recent differentiable methods [22, 23, 24] are only applied to structured data, and use pre-trained neural networks when applied to raw data [25, 26]. To address this limitation, neuro-symbolic learning systems have been proposed [
        <xref ref-type="bibr" rid="ref8">8, 27, 14</xref>
        ].
      </p>
      <p>
        The approach of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] extends [27] by using abduction and induction to jointly train a neural network and induce definite logic programs from raw data. However, the Metagol symbolic learner [9] it uses can only learn symbolic knowledge expressed as definite logic programs without function symbols, which can compute only polynomial functions [10]. Hence, it cannot learn more expressive knowledge involving defaults, exceptions, constraints and choice, which are essential aspects of common-sense learning and reasoning. Similarly to [8], we also train a neural network whilst learning symbolic knowledge, through weak supervision from downstream labels. However, we learn complex first-order knowledge expressed in the language of ASP, which is more general than symbolic learning of definite clauses [28, 11, 29], and can solve computationally harder problems, such as the hitting sets NP decision problem [17]. In this task, the training set consists of binary True/False labels that indicate the existence of a hitting set over a collection of up to four sets, but our learned programs are general as they can decide if there is a hitting set over collections of any number of sets. This is not possible with the approach of [8], or indeed any neural network architecture, since at inference time their input/output format must be exactly the same as that observed during training. Furthermore, our learned programs not only solve the hitting set decision problem, but can also generate all the hitting sets of a given collection. Finally, the approach of [8] may be vulnerable to becoming stuck in a local optimum, which justifies why the authors present in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] an evaluation using a CNN pre-trained in a one-shot fashion. In contrast, we demonstrate convergence to the optimal hypothesis over very large hypothesis spaces without using ground-truth concept labels. This is also the case when a reduced training set is used, for which the neural network remains under-trained.
      </p>
      <p>Finally, the approach in [14] uses Binary Neural Networks (BNN) encoded into an ASP
program, together with an apperception theory as background knowledge. The task of jointly
learning the weights of the BNN and the additional knowledge required to solve a given problem,
is formalised as solving a single ASP problem, and as such is fully delegated to an ASP solver.
BNNs are an active area of research but they are still far from solving the kinds of problems that regular neural networks can solve. Hence, since ASP does not support floating-point arithmetic, the approach in [14] is limited by the BNN and is unlikely to handle tasks that require complex
neural network architectures.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Neuro-Symbolic Inductive Learner</title>
      <sec id="sec-3-1">
        <title>3.1. Problem setting</title>
        <p>We consider a dataset of samples D = {⟨x, y⟩, . . .} and a domain knowledge B, where x is a collection of size n containing multiple raw data x_i ∈ X, y ∈ Y is a target label, and B is a first-order logic program. For a given task, we assume a single latent concept C = ⟨c_name, c_values⟩, where c_name is the name of the latent concept, and c_values is its set of possible values, which we assume to be finite. Each x_i ∈ x has an associated latent concept value z_i ∈ c_values, and z is a collection of size n denoting the combination of latent concept values associated with x. Let ℒ be the language defined using the relations in B, Y, and C. Informally, we wish to learn complex symbolic knowledge H, called an inductive hypothesis, expressed as a first-order logic program in the language ℒ, that defines the label y for a sample x in terms of its latent concept values z and the domain knowledge B. Formally, the objective of NSIL is to learn a composite function of the form h* ∘ f* : X^n → Y, where f* : X^n → Z^n and h* : Z^n ∪ B → Y. We learn approximations f of f* and h of h*, and refer to them as the neural and symbolic learning components of NSIL respectively. Under our single latent concept assumption, f is effectively a classifier for the latent concept c_name in each input. Therefore, to simplify notation, we refer to the set of possible latent concept values c_values as Z.</p>
        <p>During training, our approach learns from a dataset D and a domain knowledge B. It is crucial to note that D does not contain the correct latent labels z for any of the ⟨x, y⟩ samples. To solve the task, NSIL has to: (1) learn how to correctly map each raw input x_i to its latent concept value z_i, and (2) learn how to symbolically relate the combination of latent concept values z predicted from x to the label y. These mappings correspond to our f and h functions respectively. As h is purely symbolic, available domain knowledge B can be leveraged during learning. However, h is not differentiable, which makes the joint optimisation of h ∘ f through standard gradient-based learning algorithms infeasible. The challenge is to train f and h via a weakly supervised learning task using D and B only, such that for every ⟨x, y⟩ ∈ D, h(f(x) ∪ B) = y = h*(f*(x) ∪ B).</p>
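        <p>To make the composition h ∘ f concrete, the following is a minimal Python sketch of the weakly supervised setting. The classifier f, the symbolic map h, and the addition rule used inside h are illustrative stand-ins rather than NSIL's actual components.</p>
        <p>from typing import List, Tuple

def f(x: List[List[float]]) -> List[int]:
    # Neural component (stand-in): map each raw input to a latent concept value.
    return [int(round(image[0])) for image in x]

def h(z: List[int]) -> int:
    # Symbolic component (stand-in): map latent values to a downstream label,
    # here using the MNIST Addition rule y = z_1 + z_2.
    return sum(z)

def count_label_errors(dataset: List[Tuple[List[List[float]], int]]) -> int:
    # Weak supervision: only pairs (x, y) are available; z is never observed.
    return sum(1 for x, y in dataset if h(f(x)) != y)</p>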
        <p>NSIL addresses this challenge iteratively, using two steps: f is trained using the current inductive hypothesis, denoted H′, and the symbolic learner is biased to learn a new H′ using the current f and h. Let us now describe the NSIL architecture, defining the neural and symbolic learning components.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. NSIL architecture</title>
        <p>
          In this section, we instantiate the learning problem tackled by NSIL and present a novel neuro-symbolic learning architecture. The neural component f uses a neural network g : X → [0, 1]^|Z|, parameterised by θ, which computes a probability distribution over the latent concept values for each x_i ∈ x. f then uses the standard arg max function to aggregate the output of g for each input into a collection of latent concept values z. The symbolic learning component h learns a complex inductive hypothesis H expressed in the first-order logic programming language ℒ such that for every ⟨x, y⟩ ∈ D, z ∪ B ∪ H satisfies y.
        </p>
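        <p>As a small illustration of the aggregation performed by f, assume the per-image network g returns a softmax vector over Z (the stand-in g below always predicts the digit 3); f simply applies arg max per input and collects the results.</p>
        <p>def g(image):
    # Stand-in neural network: probability distribution over Z = {0..9}.
    probs = [0.0] * 10
    probs[3] = 1.0
    return probs

def argmax(probs):
    return max(range(len(probs)), key=lambda v: probs[v])

def f(x):
    # Aggregate g's per-image outputs into a collection z of latent concept values.
    return [argmax(g(x_i)) for x_i in x]</p>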
        <p>
          The approach of [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] addresses a similar problem of training a neural network whilst learning an inductive hypothesis. Informally, it induces an inductive hypothesis H, and uses H to “abduce” relevant facts (i.e., pseudo-labels) to prune the search space for H, while the neural network is trained with respect to these computed pseudo-labels. The approach relies upon a definition of z ∪ B ∪ H covering y in terms of logical entailment (i.e., cov(y | z, B, H) = 1 if z ∪ B ∪ H |= y and 0 otherwise), and the strong (implicit) assumption that z ∪ B ∪ H defines a unique logical model. Such an assumption is related to the confined expressivity of its symbolic learner, which can only learn inductive hypotheses expressed as first-order definite clauses. Consequently, hypotheses learned by the approach of [8] can only solve tasks requiring polynomial functions [10]. In contrast, we are able to learn more complex functions, such as in the following example.
        </p>
        <p>Example 1. Consider a set X of MNIST images from classes 1-5. The task is to decide whether a collection of sets S, containing images from X, has a hitting set of size ≤ 2. For example, if S = {{ , }, { }, { }}, then the answer is yes, because {1, 3} is a hitting set. If S = {{ , }, { }, { }} then the answer is no, because there are no hitting sets. The training data D contains pairs ⟨S, y⟩, where y = 1 if there is a hitting set of S, otherwise y = 0. Also, the domain knowledge B defines the notion of an element in a set.</p>
        <p>Our system will not only learn an inductive hypothesis H that solves the hitting sets NP decision problem, but the same hypothesis will also be able to generate all the possible hitting sets of a given collection S. These kinds of problems cannot be solved by the approach of [8]. NSIL provides a general neuro-symbolic learning architecture, where the learned inductive hypothesis is a type of first-order logic program called an answer set program, based on the ASP paradigm [13]. ASP is a rule-based declarative formalism for representing and solving combinatorial problems. Its highly expressive language allows for rules representing non-determinism, choice and constraints, among other constructs. Knowledge about a real-world problem can be represented as an ASP program, whose answer sets, or logical models, correspond to the solutions of the original problem. In the context of the hitting set decision problem in ASP, stating that z ∪ B ∪ H covers y means that there is at least one answer set of z ∪ B ∪ H if y = 1, or no answer sets if y = 0 [13].</p>
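        <p>For intuition, the decision version of Example 1 can be checked by brute force. The sketch below (plain Python, with digit values standing in for the elided MNIST images) returns the label y used in training: True when some set of at most two elements intersects every subset in the collection, and False otherwise.</p>
        <p>from itertools import combinations

def has_hitting_set(collection, max_size=2):
    # True iff some candidate of size at most max_size hits every subset.
    universe = set().union(*collection)
    for size in range(1, max_size + 1):
        for candidate in combinations(universe, size):
            if all(set(candidate).intersection(subset) for subset in collection):
                return True
    return False

print(has_hitting_set([{3, 1}, {3}, {1}]))   # True: {1, 3} is a hitting set (y = 1)
print(has_hitting_set([{1, 2}, {3}, {4}]))   # False: three disjoint subsets need 3 elements (y = 0)</p>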
        <p>
          ASP programs can be learned from noisy, labelled, symbolic examples [30], and can also be used to train neural networks by applying symbolic reasoning over semantic constraints [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. As illustrated in Figure 1, the NSIL architecture leverages these two aspects of ASP, by seamlessly integrating two components: (1) a Learning from Answer Sets (LAS) symbolic learner [30], for learning an inductive hypothesis H that maps latent concept values z to labels y; (2) NeurASP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to train the neural network to predict z from x, by applying neuro-symbolic reasoning over the learned H. Let us now define the optimisation criteria for these components.
        </p>
        <p>Informally, let E_{f,D} be a set of examples depending on f and D (see Subsection 3.3 for details on how E_{f,D} is dynamically generated during training). Each example e ∈ E_{f,D} has an associated penalty e_pen that indicates the cost that a candidate inductive hypothesis H′ pays for not covering it. Let B be an ASP domain knowledge, defining also the set ℋ of possible learnable ASP programs. An inductive hypothesis H ∈ ℋ is an ASP program with size(H) = |H| given by the number of relations in H, and a cost given by the sum of the penalties of the uncovered examples. We denote the set of examples uncovered by H with UNCOV(H, E_{f,D}, B), and define the cost of H as cost(H, UNCOV(H, E_{f,D}, B)) = Σ_{e ∈ UNCOV(H, E_{f,D}, B)} e_pen. The score of H, denoted S(H, E_{f,D}, B), is the sum |H| + cost(H, UNCOV(H, E_{f,D}, B)). A LAS symbolic learner computes the function h : Z^n ∪ B → Y by solving the following optimisation problem:</p>
        <p>H* = arg min_{H ∈ ℋ} S(H, E_{f,D}, B)    (1)</p>
        <p>
          The above minimisation problem can be equivalently interpreted as jointly maximising the generality of H (i.e., the most compressed ASP program) and its coverage of the examples in E_{f,D}. Given a candidate inductive hypothesis H′, NSIL uses NeurASP to train the neural network. Therefore, the following is a reformulation of the NeurASP learning task [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Let Π be the ASP program given by Π = H′ ∪ B ∪ R, where R includes a choice rule over the possible latent concept values in Z that the neural network can predict from a collection x of raw data in a given sample ⟨x, y⟩. Choice in ASP means that Π may have multiple answer sets, which include relations of the form f(x_i) = z_i. We indicate with AS_Π(f(x) = z) the set of answer sets of Π that include the facts f(x) = z. For any I ∈ AS_Π(f(x) = z), we define the probability of an answer set I as P_Π(I) = [ Π_{f(x_i) = z_i ∈ I} P_θ(f(x_i) = z_i) ] / |AS_Π(f(x) = z)|, where P_θ(f(x_i) = z_i) is given by the neural network. Let Π(y) = Π ∪ {← not y}, where the constraint ← not y indicates that the label y must be satisfied in the answer sets of Π(y). Also, let AS_{Π(y)}(f(x) = z) denote the answer sets of Π(y) that include the facts f(x) = z. The probability of satisfying a particular label y can be computed as P(Π(y)) = Σ_{I ∈ AS_{Π(y)}(f(x) = z)} P_Π(I). This second step of NSIL computes the function f : X^n → Z^n by solving the following optimisation, which maximises the log-likelihood of the labels y:
        </p>
        <p>f* = arg max_θ Σ_{⟨x, y⟩ ∈ D} log( P(Π(y)) )    (2)</p>
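        <p>Under simplifying assumptions (a finite Z, and a current hypothesis that deterministically maps a combination z to a label), the label probability used in Equation 2 can be approximated by enumerating candidate combinations, as in the sketch below; apply_hypothesis is a placeholder for evaluating z ∪ B ∪ H′, and the answer-set normalisation term is omitted, so this is an illustration rather than NeurASP's actual implementation.</p>
        <p>from itertools import product
from math import log

def label_log_likelihood(x, y, g, apply_hypothesis, Z):
    # Sum, over combinations z that yield label y under the current hypothesis,
    # of the product of the network's probabilities for z (cf. Equation 2).
    distributions = [g(x_i) for x_i in x]
    p_label = 0.0
    for z in product(Z, repeat=len(x)):
        if apply_hypothesis(z) == y:
            p_z = 1.0
            for dist, z_i in zip(distributions, z):
                p_z *= dist[z_i]
            p_label += p_z
    return log(max(p_label, 1e-12))   # guard against log(0)

# For MNIST Addition: apply_hypothesis = lambda z: z[0] + z[1], and Z = range(10).</p>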
        <p>Hence, NSIL integrates neural and symbolic components by iteratively solving Equations 1 and 2 (a minimal sketch of this loop is given below):
1. NSIL is initialised by a bootstrap stage that learns an initial inductive hypothesis H′ that satisfies the domain knowledge B and covers each possible label in Y, independently of specific neural predictions. f is initialised randomly.
2. The parameters θ of the neural network f are updated for 1 epoch on dataset D, using the ASP program Π(y) = B ∪ H′ ∪ {← not y}, for each ⟨x, y⟩ ∈ D.
3. Corrective examples are constructed using neural network predictions (described in detail in Section 3.3), and a new H′ is learned with a LAS symbolic learner.
4. Steps 2-3 are repeated for a fixed number of iterations. NSIL outputs the trained neural network f and the learned inductive hypothesis H.</p>
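        <p>A minimal sketch of this loop, with every function name (bootstrap_examples, learn_hypothesis, train_one_epoch, corrective_examples) a placeholder for the corresponding NSIL step rather than a call into any particular library:</p>
        <p>def nsil(dataset, B, iterations, bootstrap_examples, learn_hypothesis,
         train_one_epoch, corrective_examples):
    # Skeleton of the NSIL training loop (steps 1-4 above).
    examples = bootstrap_examples(B)                 # step 1: bootstrap stage
    H = learn_hypothesis(B, examples)                # initial hypothesis H'
    f = None                                         # neural weights initialised randomly elsewhere
    for _ in range(iterations):                      # step 4: repeat
        f = train_one_epoch(dataset, B, H, f)        # step 2: NeurASP-style update of f
        examples = corrective_examples(dataset, B, H, f, examples)   # step 3
        H = learn_hypothesis(B, examples)            # step 3: new H' from the LAS learner
    return f, H</p>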
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Symbolic learning optimisation</title>
        <p>At each iteration of NSIL, and as shown in Equation 1, the LAS symbolic learner searches for a candidate inductive hypothesis H′ ∈ ℋ that minimises the score in terms of its size and the total penalty of the examples in E_{f,D} that are not covered by H′. The examples, and their associated weight penalties, are therefore crucial to the optimisation, as the symbolic learner is encouraged to learn a H′ that covers examples with large weight penalties. The examples are symbolic and define logical relations that are expected to be true or false w.r.t. some contextual information, such that these relations are included in or excluded from the answer sets of B ∪ H′ that cover the example. Formally, an example e ∈ E_{f,D} is a tuple e = ⟨pen, inc, exc, ctx⟩ where pen is the weight penalty, and inc and exc are, respectively, sets of relations to be included and excluded w.r.t. the contextual information ctx. Examples can be either positive or negative, depending on the task; positive examples are used when the search space ℋ contains clauses without constraints, whereas positive and negative examples are used when ℋ includes more expressive first-order rules, including constraints. In this paper, the key intuition is that each example associates a possible combination of latent concepts z with a label y, expressed using ctx, and inc, exc respectively. For positive examples, the weight penalty therefore influences whether y should be in the answer sets of B ∪ H′ ∪ z, and the higher the penalty, the higher the cost a candidate hypothesis pays if y is not in the answer sets of B ∪ H′ ∪ z. The opposite holds for negative examples. E_{f,D} is constructed initially during a bootstrap stage, and dynamically modified throughout training to optimise towards the final H. The example structure for each of the learning tasks we have used to evaluate NSIL is given in Section 4, and the reader is referred to [30] for further details regarding examples for the LAS systems.</p>
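        <p>A minimal sketch of this example structure (field names follow the tuple ⟨pen, inc, exc, ctx⟩ described above; the concrete atoms are illustrative, not the exact encoding used by FastLAS or ILASP):</p>
        <p>from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass
class LASExample:
    pen: int                   # penalty paid by a hypothesis that does not cover the example
    inc: FrozenSet[str]        # relations to include in a covering answer set
    exc: FrozenSet[str]        # relations to exclude from a covering answer set
    ctx: Tuple[str, ...]       # contextual rules/facts
    positive: bool = True      # positive vs. negative example

# Illustrative exploit example: latent digits (3, 4) should yield label 7.
exploit = LASExample(
    pen=1,
    inc=frozenset({"label(7)"}),
    exc=frozenset(),
    ctx=("digit(1,3).", "digit(2,4)."),
)</p>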
        <p>Bootstrap examples. Initially, the neural network is not trained and therefore cannot be used to optimise H′. We bootstrap the learning of H′ using a set of examples E_{f,D} defined to cover all the target labels y ∈ Y, with respect to possible combinations of latent concept values z. The symbolic learner ensures the initial H′ is the shortest inductive hypothesis that satisfies the label coverage. Let us now define how E_{f,D} is dynamically modified throughout training using corrective examples.</p>
        <p>Computing corrective examples. On each iteration of NSIL training, we use two sources of information to optimise H. Once the neural network has been trained with a candidate H′, we can analyse the overall performance of both neural and symbolic components in predicting training set labels y from x. Also, the neural network returns confidence scores when predicting z. The challenge is how to create corrective examples from these information sources that appropriately weight possible combinations of z and y in the form of examples for the symbolic learner, such that a new H′ is learned. To achieve this, we create two types of corrective examples, called explore and exploit, that respectively encourage exploration and exploitation of the hypothesis search space. The explore examples relate possible z to a different label than what H′ currently predicts, whilst the exploit examples reinforce the current prediction of z for a particular y. When a combination of z and y is obtained, we create a pair of linked explore and exploit examples, adjusting their weights simultaneously, in order to give a consistent optimisation signal to the symbolic learner. The goal is to maximise (resp. minimise) the weights of the exploitation (resp. exploration) examples that contain the correct z for each label y, such that the correct H is learned. At this stage, two questions remain: (1) How are z and y obtained? (2) How are the weights calculated and updated?</p>
        <p>To obtain z and y, for exploration we compute the answer sets of the ASP program B ∪ H′ ∪ {← not y}, for each y ∈ Y, where B contains a choice rule for the possible values of z. The answer sets therefore contain the z that lead to label y, given the current H′. For exploitation, for each ⟨x, y⟩ ∈ D, we run a forward pass of the neural network on each x_i ∈ x to obtain z, and y is obtained directly from the training set. When a combination of z and y is first obtained, the weights of the corresponding explore and exploit examples are initialised as pen = 1 (i.e., equal penalty). Let us now define how these example weights are updated throughout NSIL training.</p>
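        <p>A small sketch of how a linked explore/exploit pair is created the first time a combination of z and y is obtained (the dictionary representation is illustrative; only the equal initial penalties follow the text):</p>
        <p>def get_or_create_pair(pairs, z, y):
    # pairs maps a (z, y) combination to its linked explore/exploit examples.
    key = (tuple(z), y)
    if key not in pairs:
        # Both examples start with equal penalty pen = 1.
        pairs[key] = {"explore": {"pen": 1}, "exploit": {"pen": 1}}
    return pairs[key]</p>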
        <p>Updating the weights of corrective examples. Weight updates are computed at the end of each NSIL iteration. Firstly, the weight of the explore (resp. exploit) example is increased (resp. decreased) by the False Negative Rate (FNR) that NSIL has for y, w.r.t. the current H′ and f:</p>
        <p>FNR_{H′,f}(y) = 100 × ( 1 − TP_{H′,f}(y) / ( TP_{H′,f}(y) + FN_{H′,f}(y) ) )    (3)</p>
        <p>where TP_{H′,f}(y) and FN_{H′,f}(y) are respectively the number of true positive and false negative training samples with label y that NSIL predicts on the training set. The FNR is multiplied by 100 to ensure the weight adjustment is informative when converted to an integer, as ASP does not support real numbers. For each explore example related to y, the weight is updated as pen + α FNR_{H′,f}(y), and the weight of the corresponding exploit example is updated as pen − α FNR_{H′,f}(y), where in both cases, pen is the current example weight and α ∈ [0, 1] is a learning rate that controls the effect of the proposed update (in this paper, we set α = 1 as this yields the best performance; please see Appendix A.4.1 for more details). The weight is then clipped to a minimum of 1 and a maximum of 101 to ensure the example weights are &gt; 0, as required by the LAS symbolic learner. Secondly, the weight of the exploit (resp. explore) example is increased (resp. decreased) by the aggregated confidence score of the neural network when predicting z from x. Let conf_θ(x) = Π_{x_i ∈ x} max(g(x_i)), and let D_{z,y} be the set of ⟨x, y⟩ ∈ D with label y where the neural network predicts z. The aggregated confidence score NConf is defined as:</p>
        <p>NConf(z, y) = ( 100 / |D_{z,y}| ) × Σ_{⟨x, y⟩ ∈ D_{z,y}} conf_θ(x)    (4)</p>
        <p>The example weight updates are then calculated as pen + α NConf(z, y) and pen − α NConf(z, y) for the exploit and explore examples respectively. Again, the weights are clipped to the range {1..101}. To summarise, Equations 3 and 4 generate exploration and exploitation signals that encourage the symbolic learner to explore the hypothesis space when NSIL performs poorly, and to retain the current H′ when NSIL performs well. Let us now present the results of our experiments.</p>
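        <p>A sketch of the weight updates in Equations 3 and 4, with the per-label true positive and false negative counts, the per-sample confidences, and α supplied by the caller; the clipping range {1..101} follows the text, while everything else is a simplified illustration.</p>
        <p>def clip(w):
    return max(1, min(101, int(w)))

def fnr(tp, fn):
    # Equation 3: false negative rate for a label, scaled by 100.
    return 100.0 * (1.0 - tp / (tp + fn)) if (tp + fn) else 0.0

def nconf(confidences):
    # Equation 4: mean neural confidence for samples with this (z, y), scaled by 100.
    return 100.0 * sum(confidences) / len(confidences) if confidences else 0.0

def update_pair(pair, tp, fn, confidences, alpha=1.0):
    # Explore weight rises with the FNR and falls with confidence; exploit does the opposite.
    pair["explore"]["pen"] = clip(pair["explore"]["pen"] + alpha * fnr(tp, fn))
    pair["exploit"]["pen"] = clip(pair["exploit"]["pen"] - alpha * fnr(tp, fn))
    pair["exploit"]["pen"] = clip(pair["exploit"]["pen"] + alpha * nconf(confidences))
    pair["explore"]["pen"] = clip(pair["explore"]["pen"] - alpha * nconf(confidences))
    return pair</p>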
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
        The experiments answer the following questions: (1) Can NSIL solve tasks involving the joint training of both neural and symbolic components? (2) Given the presence of symbolic learning, is NSIL more data efficient than pure differentiable models, whilst achieving the same if not better overall accuracy? (3) Given the weak supervision, does NSIL train the neural network effectively, and learn accurate, general and interpretable knowledge? (4) Can NSIL learn symbolic knowledge from raw data that solves computationally harder problems compared to computing polynomial functions? To answer these questions, we evaluate NSIL using two task domains: MNIST Arithmetic and MNIST Hitting Sets. Questions 1 and 3 are addressed in both domains, where we evaluate the final accuracy achieved by the neural network, and also present the learned inductive hypotheses. Question 2 is answered in the MNIST Arithmetic tasks by using reducing percentages of training data. Finally, question 4 is answered in the MNIST Hitting Sets tasks, which require more complex and expressive hypotheses to be learned.
Baselines. To the best of our knowledge, the approach of [8] is the only method that jointly trains both a conventional neural network and a symbolic learner. However, it is limited to learning definite rules. We demonstrate the advantage of NSIL w.r.t. [8] in the MNIST Hitting Sets tasks, by showing that NSIL learns more general and expressive programs that can be applied to any collection of sets, and can generate all the hitting sets of a given collection. We compare NSIL to the following differentiable models. In both tasks, we use two variants of a Concept Bottleneck Model (CBM) [31]. The perception component is the same as the one used in NSIL, but the reasoning component is differentiable, and implemented as a neural network with 3 linear layers. We use the joint CBM and ensure no concept loss is used during training, therefore achieving the same weak supervision as NSIL. The second CBM variant, denoted CBM-S, adds a softmax layer to the CNN, replicating more closely the neural network used in NSIL. As a third baseline, in the MNIST Arithmetic tasks, we use the CNN from [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ], and in the MNIST Hitting Sets tasks, we use a CNN-LSTM. Finally, in the MNIST Arithmetic tasks, we attach a Neural Arithmetic Logic Unit (NALU) and a Neural Accumulator (NAC) to a CNN-LSTM [32]. NALU and NAC are not used in the Hitting Sets tasks as they are designed for regression and not classification. Full architectural details are presented in Appendix A.4.3.
      </p>
      <p>
        Experiment setup. The neural network in NSIL is the MNIST CNN model from [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ], and we use the FastLAS [11] symbolic learner in the MNIST Arithmetic tasks, and ILASP [28] in the MNIST Hitting Sets tasks. To evaluate NSIL, we use the inference architecture shown in Figure 1b. The neural component maps each raw input x_i to its latent concept value z_i, and the symbolic component maps z to a final prediction, using the learned hypothesis and B. Each experiment is repeated 20 times using 20 randomly generated seeds. The performance is measured by mean classification accuracy per epoch, where 1 epoch = 1 NSIL iteration, for comparison to the baseline models. We perform hyper-parameter tuning for all models using a held-out validation set and a separate random seed (please refer to https://github.com/dancunnington/NSIL for details of all the hyper-parameters used in our experiments). All experiments are performed on the following infrastructure: RHEL 8.5 x86_64 with Intel Xeon E5-2667 CPUs (20 cores total), an NVIDIA A100 GPU, and 150GB RAM. The ASP encodings of the domain knowledge are given in Appendix A.4.2.
      </p>
      <sec id="sec-4-1">
        <title>4.1. MNIST Arithmetic</title>
        <p>
          We consider the MNIST Addition task for single digit numbers used in [
          <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
          ] and a variation called MNIST Even9Plus. They both use a set X composed of raw MNIST images of single digits (0..9), and the dataset D contains ⟨x, y⟩ samples where x is a pair of digit images. The space of target labels is Y = {0..18}, and the latent concept ⟨c_name, c_values⟩ has c_name = digit and c_values = {0..9}. In the MNIST Addition task, the label y indicates the sum of the digit values in x. The goal is to train the neural network to classify the digit images in x, and jointly learn an inductive hypothesis that defines the sum of the two digits. In the MNIST Even9Plus task, x is the same pair of images, but the label y is equal to the digit value of the second image if the digit value of the first image is even, or 9 plus the digit value of the second image otherwise. Both tasks use training, validation and test datasets of size 24,000, 6,000, and 5,000 samples respectively. The latent concept of a raw image x_i ∈ x is represented as digit(i, v), where i ∈ {1, 2} and v is the associated latent concept value. B uses the relation result to express the final label, the relations even, not even, plus_nine, and ‘=’, as well as the function ‘+’, to specify the search space of inductive hypotheses. The positive explore examples have inc = {label}, exc = {}, and ctx given by the set of rules {z. label ← label(Y1), Y1 ̸= y. ← label(Y1), label(Y2), Y2 ̸= Y1.}, which states that label can be true if a label different from y is chosen for z. Positive exploit examples have instead inc = {y}, exc = {Y ∖ {y}}, and ctx = z.
        </p>
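        <p>The two downstream label functions can be stated directly; a sketch of the task definitions, with digit values standing in for the MNIST images:</p>
        <p>def addition_label(d1, d2):
    # MNIST Addition: y is the sum of the two digit values.
    return d1 + d2

def even9plus_label(d1, d2):
    # MNIST Even9Plus: y = d2 if d1 is even, otherwise y = 9 + d2.
    return d2 if d1 % 2 == 0 else 9 + d2</p>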
        <p>Results and analysis. The overall accuracy results are shown in Figure 2 for different percentages of the training set D (100% - 1%). NSIL significantly outperforms the differentiable models on both tasks for all dataset percentages, except 1% on the Addition task. NSIL learns the expected hypotheses, shown in Figure 4a in Appendix A.1. The hypotheses define the result as variable V2, and capture the correct solutions. For the Addition task, the neural network is trained to classify MNIST images with a mean accuracy of 0.988, 0.968, 0.931, and 0.124 for each dataset percentage respectively. The state-of-the-art for MNIST image classification is 0.991, when the CNN is trained in a fully supervised fashion [33]. We achieve a similar accuracy of 0.988 with only weak supervision during training. Also, when training with only 1200 samples (5%), NSIL trains the neural network to 0.931 accuracy. For the E9P task, NSIL learns a more expressive inductive hypothesis with negation as failure. The mean accuracy of the neural network is 0.987, 0.972, 0.962, and 0.887 for each dataset percentage respectively. Again, this is a negligible drop in performance compared to the state-of-the-art in fully supervised training [33]. Interestingly, the performance of NSIL trained with just 1% of the data is very poor in the Addition task, although this is not the case in the E9P task, as the neural network achieves 0.887 accuracy with just 240 training samples (1%). Despite the same domain knowledge and hypothesis search space, the E9P label is more informative: there is a reduced number of possible latent concept values for certain labels, which gives a stronger back-propagation signal for training the neural network. E.g., for all samples with label y = 10, the first digit must be odd and the second digit must be 1, as it is only possible to obtain 10 if the first digit is odd and 9 + the second digit sums to 10. In the Addition task, it is possible to obtain 10 with multiple combinations of the first and second digits. Finally, we increased the hypothesis search space in the more challenging E9P task with the additional functions ‘-’, ‘×’, ‘÷’, as well as the plus_eight, ..., plus_one relations, and NSIL converges to the expected hypothesis in all of these extended search spaces (see Appendix A.2).</p>
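        <p>The informativeness argument can be checked by counting, for each label, how many digit pairs are consistent with it under each task definition; the quick enumeration below (digits standing in for images) shows that label 10 admits nine pairs under Addition but only five under Even9Plus.</p>
        <p>from collections import Counter

addition = lambda d1, d2: d1 + d2
even9plus = lambda d1, d2: d2 if d1 % 2 == 0 else 9 + d2

def pairs_per_label(label_fn):
    return Counter(label_fn(d1, d2) for d1 in range(10) for d2 in range(10))

print(pairs_per_label(addition)[10])     # 9: many (d1, d2) pairs sum to 10
print(pairs_per_label(even9plus)[10])    # 5: d1 must be odd and d2 must be 1</p>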
      </sec>
      <sec id="sec-4-2">
        <title>4.2. MNIST Hitting Sets</title>
        <p>We consider two variations of the hitting sets problem [17]. The first, denoted HS, is defined in Example 1; the second, denoted CHS, adds the constraint that no hitting sets occur if |S| ≥ 3 and digit 1 exists in any subset in S. Each ⟨x, y⟩ sample contains 4 MNIST or FashionMNIST images from classes 1-5, arranged into subsets in S. The label y is 1 or 0, indicating the existence of a hitting set, and the goal is to train the neural network to classify MNIST images, whilst learning the HS or CHS rules to decide the existence of, and generate, hitting sets. Training, validation and test datasets contain 1502, 376, and 316 examples respectively. The latent concept has c_name = MNIST_class and c_values = {1..5}, and for a raw image x_i, it is represented as ss_element(s, v), where s ∈ {1..4} is a subset identifier and v is the associated latent concept value. We consider well-formed S, with no duplicate elements in a set, and elements within the same set arranged in ascending order. The domain knowledge B contains the maximum hitting set size of 2, the ‘!=’ relation, the relation hs stating that an element is in a hitting set, and the ss_element(3, V) and ss_element(V, 1) relations that define respectively subset 3 (and any element V) and digit 1 in any subset V. Both explore and exploit corrective examples share a common inc = {}, exc = {}, and ctx = z. Exploration and exploitation are achieved by representing the example positively or negatively w.r.t. the given label. In these tasks, we use a CNN-LSTM baseline, where a CNN first encodes a feature vector for each image in a collection x, and an LSTM followed by a linear and a sigmoid layer returns a binary classification as to the existence of a hitting set. To encode the structure of a collection, we create images of ‘{’ and ‘}’ characters and pass the entire collection in as input (a sequence of bracket and digit images).
Results and analysis. Figure 3 shows the overall accuracy for both hitting sets tasks using only FashionMNIST images, as this dataset is more challenging than MNIST. NSIL outperforms the baselines on both tasks and learns the correct hypotheses (see Figure 4b in Appendix A.1). The MNIST training curves are shown in Appendix A.3. In Figure 4b, the first rule in the HS and CHS inductive hypotheses is a choice rule that generates hitting sets containing element V2 at index V1. The second rule defines whether a subset in the collection is hit (i.e., contains an element in the hitting set). The constraints ensure each subset in the collection is hit, and elements in the hitting set are different. In the CHS task, the final constraint ensures collections with 3 subsets and element 1 contain no hitting sets. When classifying images, the neural network achieves a mean accuracy of 0.992 (MNIST) and 0.906 (FashionMNIST) on the HS task, and 0.992 (MNIST) and 0.901 (FashionMNIST) on the CHS task. The neural network achieves near state-of-the-art performance on MNIST images. For FashionMNIST, the state-of-the-art is 0.969 when a CNN is trained in a fully supervised fashion [34], and NSIL achieves 0.906. Despite the hitting sets tasks having more complex hypotheses, the neural networks are trained to a similar accuracy as in the MNIST Arithmetic tasks. All models perform better on the CHS task. This is due to a shortcut: a negative test example with 3 subsets can be correctly predicted if the neural network predicts class 1 for any image within the subset, as opposed to having to rely on accurate predictions for more images when the constraint is not present in the HS task. Also, as a result of the constraint, 40 test examples that contain 3 subsets become negative in the CHS task, and the models can take advantage of the shortcut.</p>
        <p>The inductive hypotheses learned by NSIL have multiple answer sets due to the choice rule
0{hs(V1, V2)}1. Therefore, despite training labels indicating the existence of a hitting set, the
programs can generate all the hitting sets. We verified this by computing the Hamming loss against the ground-truth hitting sets for all test samples. The Hamming losses are 0.003 and 0.031 on the HS task, and 0.003 and 0.025 on the CHS task, for the MNIST and FashionMNIST images respectively. This indicates almost perfect hitting set generation, with the minor errors due to misclassifications by the neural network, as opposed to errors in the learned hypotheses. The larger Hamming loss for FashionMNIST images reflects the weaker neural network performance
with this dataset. Finally, in the more challenging HS FashionMNIST task we sampled 10 sets
of additional predicates containing a combination of elements 1..5 and subset IDs 1..4. NSIL
converged to the expected hypothesis in all of these extended search spaces (see Appendix A.2).</p>
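        <p>Generation, as opposed to decision, can be illustrated by brute-force enumeration of all hitting sets of size at most 2; predicted hitting sets can then be compared element-wise against this ground truth. This is a sketch of the evaluation idea, not of the answer sets of the learned ASP program.</p>
        <p>from itertools import combinations

def all_hitting_sets(collection, max_size=2):
    # Enumerate every hitting set of size at most max_size.
    universe = sorted(set().union(*collection))
    hits = []
    for size in range(1, max_size + 1):
        for candidate in combinations(universe, size):
            if all(set(candidate).intersection(subset) for subset in collection):
                hits.append(set(candidate))
    return hits

print(all_hitting_sets([{3, 1}, {3}, {1}]))   # [{1, 3}] for this collection</p>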
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper introduces a neuro-symbolic learner (NSIL) that learns complex knowledge whilst training a neural network to classify latent concepts from raw data. The framework uses a symbolic learner to learn an inductive hypothesis capable of solving computationally hard problems. The novel aspect of the architecture is the use of corrective examples to bias the symbolic learner. The results show that NSIL outperforms differentiable models in terms of accuracy and data efficiency, and solves the hitting sets problem that other approaches cannot solve. In terms of limitations, the differentiable baselines learn faster than NSIL, although given more time, NSIL achieves a greater accuracy (see the Appendix for learning time vs. accuracy plots). Future work includes generalising NSIL to support multiple latent concepts by integrating multiple neural networks, and extending NSIL to take advantage of unsupervised learning techniques.</p>
      <p>[8] W.-Z. Dai, S. Muggleton, Abductive knowledge induction from raw data, in: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, 2021, pp. 1845–1851. URL: https://doi.org/10.24963/ijcai.2021/254. doi:10.24963/ijcai.2021/254, Main Track.
[9] A. Cropper, S. Muggleton, Metagol system, https://github.com/metagol/metagol, 2016.</p>
      <p>URL: https://github.com/metagol/metagol.
[10] E. Dantsin, T. Eiter, G. Gottlob, A. Voronkov, Complexity and expressive power of logic
programming, ACM Computing Surveys (CSUR) 33 (2001) 374–425.
[11] M. Law, A. Russo, E. Bertino, K. Broda, J. Lobo, Fastlas: scalable inductive logic
programming incorporating domain-specific optimisation criteria, in: Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, 2020, pp. 2877–2885.
[12] M. Law, A. Russo, K. Broda, Inductive learning of answer set programs from noisy examples,</p>
      <p>Advances in Cognitive Systems (2018) 57–76.
[13] M. Gelfond, Y. Kahl, Knowledge representation, reasoning, and the design of intelligent
agents: The answer-set programming approach, Cambridge University Press, Cambridge,
UK, 2014.
[14] R. Evans, M. Bošnjak, L. Buesing, K. Ellis, D. Pfau, P. Kohli, M. Sergot,
Making sense of raw input, Artificial Intelligence 299 (2021) 103521. URL: https://
www.sciencedirect.com/science/article/pii/S0004370221000722. doi:https://doi.org/
10.1016/j.artint.2021.103521.
[15] Y. LeCun, C. Cortes, MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/,
2010. URL: http://yann.lecun.com/exdb/mnist/.
[16] H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms, CoRR abs/1708.07747 (2017). URL: http://arxiv.org/abs/1708.
07747. arXiv:1708.07747.
[17] R. M. Karp, Reducibility among combinatorial problems, in: Complexity of computer
computations, Springer, 1972, pp. 85–103.
[18] T. R. Besold, A. d. Garcez, S. Bader, H. Bowman, L. C. Lamb, L. de Penning, B. Illuminoo,
H. Poon, C. Gerson Zaverucha, Neural-symbolic learning and reasoning: A survey and
interpretation, Neuro-Symbolic Artificial Intelligence: The State of the Art 342 (2022) 1.
[19] D. Corapi, A. Russo, E. Lupu, Inductive logic programming as abductive search, in:
Technical communications of the 26th international conference on logic programming,
Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2010, pp. 54–63.
[20] S. Muggleton, L. De Raedt, Inductive logic programming: Theory and methods, The</p>
      <p>Journal of Logic Programming 19 (1994) 629–679.
[21] S. H. Muggleton, D. Lin, A. Tamaddoni-Nezhad, Meta-interpretive learning of higher-order
dyadic datalog: Predicate invention revisited, Machine Learning 100 (2015) 49–73.
[22] A. Payani, F. Fekri, Inductive logic programming via differentiable deep neural
logic networks, CoRR abs/1906.03523 (2019). URL: http://arxiv.org/abs/1906.03523.
arXiv:1906.03523.
[23] P. Sen, B. W. S. R. de Carvalho, R. Riegel, A. Gray, Neuro-symbolic inductive logic
programming with logical neural networks, 2021. arXiv:2112.03324.
[24] H. Shindo, M. Nishino, A. Yamamoto, Differentiable inductive logic programming for
structured examples, Proceedings of the AAAI Conference on Artificial Intelligence 35
(2021) 5034–5041. URL: https://ojs.aaai.org/index.php/AAAI/article/view/16637.
[25] D. Cunnington, M. Law, A. Russo, J. Lobo, FF-NSL: feed-forward neural-symbolic learner,</p>
      <p>CoRR abs/2106.13103 (2021). URL: https://arxiv.org/abs/2106.13103. arXiv:2106.13103.
[26] R. Evans, E. Grefenstette, Learning explanatory rules from noisy data, Journal of Artificial</p>
      <p>Intelligence Research 61 (2018) 1–64.
[27] W.-Z. Dai, Q. Xu, Y. Yu, Z.-H. Zhou, Bridging machine learning and logical reasoning by
abductive learning, Advances in Neural Information Processing Systems 32 (2019).
[28] M. Law, Inductive learning of answer set programs, Ph.D. thesis, Imperial College London,
2018.
[29] M. Law, A. Russo, K. Broda, The complexity and generality of learning answer set programs,</p>
      <p>Artif. Intell. 259 (2018) 110–146.
[30] M. Law, A. Russo, K. Broda, Logic-based learning of answer set programs, in: Reasoning
Web. Explainable Artificial Intelligence - 15th International Summer School 2019, Bolzano,
Italy, September 20-24, 2019, Tutorial Lectures, 2019, pp. 196–231.
[31] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, P. Liang, Concept
bottleneck models, in: International Conference on Machine Learning, 2020, pp. 5338–5348.
[32] A. Trask, F. Hill, S. E. Reed, J. Rae, C. Dyer, P. Blunsom, Neural arithmetic logic units,</p>
      <p>Advances in neural information processing systems 31 (2018).
[33] S. An, M. Lee, S. Park, H. Yang, J. So, An ensemble of simple convolutional neural network
models for mnist digit recognition, arXiv preprint arXiv:2008.10400 (2020).
[34] M. S. Tanveer, M. U. K. Khan, C.-M. Kyung, Fine-tuning darts for image classification,
in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp.
4789–4796.
[35] S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint
arXiv:1609.04747 (2016).</p>
    </sec>
    <sec id="sec-6">
      <title>A. Appendix</title>
      <sec id="sec-6-1">
        <title>A.1. NSIL learned hypotheses</title>
        <p>MNIST Addition
result(V0,V1,V2) :- V2 = V0 + V1.
MNIST E9P
result(V0,V1,V2) :- even(V0),V2 = V1.
result(V0,V1,V2) :- not even(V0),
plus_nine(V1,V2).</p>
        <p>(a) MNIST Arithmetic</p>
      </sec>
      <sec id="sec-6-2">
        <title>A.2. Increasing the hypothesis space</title>
        <p>To demonstrate NSIL can converge to the correct hypothesis in a variety of hypothesis spaces,
we select the more challenging E9P and HS FashionMNIST tasks, and increase the hypothesis
space using additional predicates, functions and relations in . We record the time and number
of iterations NSIL requires until convergence, and the results demonstrate NSIL converges to
the correct hypothesis in all of these cases, but may require more iterations and/or time to do
so.</p>
        <p>E9P We add arithmetic functions ‘-’, ‘× ’, ‘÷ ’, as well as plus_eight, ..., plus_one relations,
and increase the grounding of possible labels, such that the results of multiplication and
subtraction support all combinations of digits 0..9. Given the strong performance by NSIL
with reducing dataset percentages, we use 40% of the data to reduce the neural network training
time (this does not affect the symbolic learning time, nor simplify the symbolic learning task).
The results are shown in Table 1a.</p>
        <p>HS FashionMNIST Firstly, we define a set of domain knowledge called Standard, that
contains only the predicates required to learn the expected rules of the HS task. The standard
domain knowledge includes  = 2, the ‘!=’ relations, and the relation hs stating an element
is in a hitting set. We extend these by sampling 10 sets of additional predicates containing a
combination of elements 1..5 (denoted ‘el’) and subset IDs 1..4 (denoted ‘ssID’). The results are
shown in Table 1b.</p>
      </sec>
      <sec id="sec-6-3">
        <title>A.3. Hitting Sets test accuracy: MNIST images</title>
        <p>The results are presented in Figure 5.
</p>
      </sec>
      <sec id="sec-6-4">
        <title>A.4. Experiment setup</title>
        <sec id="sec-6-4-1">
          <title>A.4.1. Relying on neural network confidence for corrective example weights</title>
          <p>In this section, we investigate the α hyper-parameter for NSIL introduced in Section 3.3. When NSIL calculates the weights of the corrective examples for the symbolic learner, two updates occur: exploration, based on the FNR, and exploitation, based on the neural network confidence (see Section 3.3). α can be adjusted to vary the effect of each update. During exploration, we always set α = 1, to give priority to the FNR, as the FNR is reliable and takes into account both neural and symbolic components. However, during exploitation, we may not want to fully rely on the neural network confidence when the neural network is under-trained. Therefore, we run experiments with varying α ∈ {1, 0.8, 0.6, 0.4, 0.2, 0} during exploitation only. We again select the more challenging E9P and HS tasks, and use 40% of the training data in the E9P task. We run NSIL for 10 iterations, with 5 repeats, using different randomly generated seeds. Figure 6 presents the results, where the error bars indicate standard error.</p>
          <p>In each task, α = 1 and α = 0 yield the best and worst performance respectively, which indicates the neural network confidence can be relied upon for these tasks. This is our justification for using α = 1 in Section 3.3. Analysing the α = 0 cases, the accuracy on the E9P task improves throughout training and the symbolic learner converges to more accurate hypotheses. However, in the HS tasks, after a few iterations, the symbolic learner learns the same hypothesis and performance plateaus. When corrective examples are generated, there are many examples with the same weight, because there are many choices of digits for each combination of collection structure and label, given the labels are binary. Therefore, the optimisation signal to the symbolic learner is weaker in the HS tasks compared to the E9P task when α = 0, because in the E9P task, labels 0..18 generate corrective examples with many different weights. When α = 1 in the HS tasks, the neural network confidence scores help the symbolic learner to converge to the correct hypothesis by providing a strong exploitation signal for specific examples.</p>
        </sec>
        <sec id="sec-6-4-2">
          <title>A.4.2. Domain knowledge</title>
          <p>Domain knowledge used for all tasks is presented in Figure 7, reproduced below.</p>
          <p>(a) MNIST Arithmetic tasks:
r(0..18).
d(0..9).
even(X) :- d(X), X \ 2 = 0.
plus_nine(X1,X2) :- d(X1), X2=9+X1.
res(X1,X2,Y) :- dig(1,X1), dig(2,X2).
:- dig(1,X1), dig(2,X2), res(X1,X2,Y1), res(X1,X2,Y2), Y1 != Y2.
#modeh(res(var(d),var(d),var(r))).
#modeb(var(n) = var(d)).
#modeb(var(n) = var(d) + var(d)).
#modeb(plus_nine(var(d),var(r))).
#modeb(even(var(d))).
#modeb(not even(var(d))).</p>
          <p>(b) MNIST Hitting Sets tasks:
s(1..4).
h(1..2).
e(1..5).
#modeha(hs(var(h), var(e))).
#modeh(hit(var(s))).
#modeb(hs(var(h), var(e)),(positive)).
#modeb(var(e) != var(e)).
#modeb(ss_element(var(s),var(e)),(positive)).
#modeb(ss_element(3,var(e)),(positive)).
#modeb(ss_element(var(s),1),(positive)).
#modeb(hit(var(s))).</p>
        </sec>
        <sec id="sec-6-4-3">
          <title>A.4.3. Model details</title>
          <p>
            NSIL CNN The neural component in NSIL is the CNN architecture from [
            <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
            ]. It consists of an
encoder with 2 convolutional layers with kernel size 5, and output sizes 6 and 16 respectively.
Each convolutional layer is followed by a max pooling layer of size 2, stride 2. The encoder
is followed by 3 linear layers of size 120, 84, and 10 (in the hitting sets tasks, the last layer is
of size 5), and all layers are followed by a ReLU activation function. Finally, a softmax layer
returns the output probability distribution.
          </p>
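          <p>A sketch of this architecture in PyTorch, assuming 28×28 single-channel inputs; layer sizes follow the description above (num_classes is 10 for the arithmetic tasks and 5 for the hitting sets tasks), while the exact placement of the ReLU activations is approximated.</p>
          <p>import torch.nn as nn

class MNISTCNN(nn.Module):
    # LeNet-style encoder and classifier as described in Appendix A.4.3.
    def __init__(self, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.MaxPool2d(2, 2), nn.ReLU(),
            nn.Conv2d(6, 16, kernel_size=5), nn.MaxPool2d(2, 2), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 4 * 4, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        x = self.encoder(x)
        return self.classifier(x.flatten(1))</p>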
          <p>
            Baselines. The baseline CNN in the MNIST Arithmetic tasks follows the baseline CNN architecture from [
            <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
            ], which is largely the same as the NSIL CNN, except that the size of the last linear layer is 19, the network accepts a pair of concatenated images as input, and a log softmax layer is used to provide a classification. The CNN-LSTM in the MNIST Hitting Sets tasks uses the same CNN encoder as NSIL, applied to each image in the input sequence, followed by an LSTM layer of tunable size, a linear layer of size 1, and a sigmoid layer. The CBM and CBM-S baselines have similar architectures, consisting of the same CNN as NSIL, applied to each image separately. The only difference is that the CBM variant does not contain a softmax layer for the CNN. In both variants, the CNN is followed by 3 linear layers where the sizes of the first two layers are tunable and the last layer is of size 19 in the MNIST Arithmetic tasks, and 1 in the MNIST Hitting Sets tasks. In the MNIST Addition task, ReLU activation is used after the first two linear layers for the CBM and CBM-S baselines, as this resulted in improved performance. In the other tasks, no non-linearity was used. The CNN-LSTM-NALU and CNN-LSTM-NAC baselines contain the same CNN as NSIL and the other baselines, followed by an LSTM layer of tunable size, and then either a NALU or NAC layer. Finally, we use the stochastic gradient descent optimiser for all neural networks [35], and tune the learning rate and momentum (see https://github.com/dancunnington/NSIL for the hyper-parameter details).
          </p>
        </sec>
      </sec>
      <sec id="sec-6-5">
        <title>A.5. Learning time comparison</title>
        <p>Figures 8 and 9 show the learning time vs. accuracy comparison between NSIL and the differentiable models for the MNIST Arithmetic and MNIST Hitting Sets tasks respectively. Each point shows the accuracy after an epoch of training (1 epoch = 1 NSIL iteration), and error bars indicate standard error. Also, the x-axis is adjusted to show a meaningful comparison instead of displaying the total running time for 20 epochs. As observed in both task domains, NSIL requires more time but achieves a greater accuracy than the differentiable models.</p>
      </sec>
      <sec id="sec-6-6">
        <title>A.6. Asset licenses</title>
        <p>The ILASP system is free to use for research (https://ilasp.com/terms), FastLAS (https://github.com/spike-imperial/FastLAS/blob/master/LICENSE) and the FashionMNIST dataset (https://github.com/zalandoresearch/fashion-mnist/blob/master/LICENSE) are both open-source with an MIT license, the MNIST dataset is licensed with Creative Commons Attribution-Share Alike 3.0 (see the bottom paragraph of https://keras.io/api/datasets/mnist/), and the CNN models used from DeepProbLog are open-source and licensed with Apache 2.0 (https://github.com/ML-KULeuven/deepproblog/blob/master/LICENSE).</p>
      </sec>
      <sec id="sec-6-7">
        <title>A.7. Broader impact</title>
        <p>There are no direct negative impacts that we can envisage for our work, given we introduce a general machine learning approach. However, NSIL inherits general concerns regarding the deployment of machine learning systems, and appropriate precautions should be taken, such as ensuring training data is unbiased, and that model inputs/outputs are monitored for adversarial attacks. As NSIL learns human-interpretable hypotheses using symbolic learning, this may help to mitigate these issues in some applications, by revealing potential bias, and providing a level of assurance regarding possible downstream predictions based on the learned hypothesis. The usual performance monitoring will still be required if NSIL is deployed into production, to prevent adversarial attacks, and to detect when re-training may be required if the input data is subject to distributional shifts.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] A.
          <string-name>
            <surname>d'Avila Garcez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Gori</surname>
            ,
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Lamb</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Serafini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Spranger</surname>
            ,
            <given-names>S. N.</given-names>
          </string-name>
          <string-name>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Neural-symbolic computing: An efective methodology for principled integration of machine learning and reasoning</article-title>
          ,
          <source>FLAP</source>
          <volume>6</volume>
          (
          <year>2019</year>
          )
          <fpage>611</fpage>
          -
          <lpage>632</lpage>
          . URL: https://collegepublications.co.uk/ifcolog/?00033.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] A.
          <string-name>
            <surname>d'Avila Garcez</surname>
            ,
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <surname>Neurosymbolic</surname>
            <given-names>AI</given-names>
          </string-name>
          :
          <article-title>the 3rd wave</article-title>
          , CoRR abs/
          <year>2012</year>
          .05876 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2012</year>
          .05876. arXiv:
          <year>2012</year>
          .05876.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumančić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manhaeve</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Marra, From statistical relational to neurosymbolic artificial intelligence</article-title>
          , in: C.
          <string-name>
            <surname>Bessiere</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4943</fpage>
          -
          <lpage>4950</lpage>
          . URL: https: //doi.org/10.24963/ijcai.
          <year>2020</year>
          /688. doi:
          <volume>10</volume>
          .24963/ijcai.
          <year>2020</year>
          /688, survey track.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Badreddine</surname>
          </string-name>
          , A.
          <string-name>
            <surname>d'Avila Garcez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Serafini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Spranger</surname>
          </string-name>
          ,
          <article-title>Logic tensor networks</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>303</volume>
          (
          <year>2022</year>
          )
          <article-title>103649</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/ S0004370221002009. doi:https://doi.org/10.1016/j.artint.
          <year>2021</year>
          .
          <volume>103649</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Manhaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumancic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kimmig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          , L. De Raedt,
          <article-title>Deepproblog: Neural probabilistic logic programming</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3749</fpage>
          -
          <lpage>3759</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Riegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. P. S.</given-names>
            <surname>Luus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Makondo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Y.</given-names>
            <surname>Akhalwaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Barahona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ikbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Karanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neelam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Likhyani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Logical neural networks</article-title>
          , CoRR abs/
          <year>2006</year>
          .13155 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2006</year>
          . 13155. arXiv:
          <year>2006</year>
          .13155.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ishay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Neurasp:
          <article-title>Embracing neural networks into answer set programming</article-title>
          , in: C.
          <string-name>
            <surname>Bessiere</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1755</fpage>
          -
          <lpage>1762</lpage>
          . URL: https://doi.org/10.24963/ijcai.
          <year>2020</year>
          /243. doi:
          <volume>10</volume>
          .24963/ijcai.
          <year>2020</year>
          /243.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.-Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muggleton</surname>
          </string-name>
          ,
          <article-title>Abductive knowledge induction from raw data</article-title>
          , in: Z.
          <string-name>
            <surname>-H. Zhou</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence,</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>