Inductive Learning of Complex Knowledge from Raw Data

Daniel Cunnington1,2, Mark Law2,3, Jorge Lobo4 and Alessandra Russo2
1 IBM Research Europe, Winchester, UK
2 Imperial College London, London, UK
3 ILASP Limited, Grantham, UK
4 Universitat Pompeu Fabra, Barcelona, Spain

Abstract
One of the ultimate goals of Artificial Intelligence is to learn generalised and human-interpretable knowledge from raw data. Neuro-symbolic reasoning approaches partly tackle this problem by improving the training of a neural network using a manually engineered symbolic knowledge base. In the case where symbolic knowledge is instead learned from raw data, this knowledge lacks the expressivity required to solve complex problems. In this paper, we introduce the Neuro-Symbolic Inductive Learner (NSIL), an approach that trains a neural network to extract latent concepts from raw data, whilst learning symbolic knowledge that solves complex problems defined in terms of these latent concepts. The novelty of our approach is a method for biasing a symbolic learner to learn improved knowledge, based on the in-training performance of both the neural and symbolic components. We evaluate NSIL on two problem domains that require learning knowledge with different levels of complexity, and demonstrate that NSIL learns knowledge that is not possible to learn with other neuro-symbolic systems, whilst outperforming baseline models in terms of accuracy and data efficiency.

Keywords: neuro-symbolic learning, inductive logic programming, raw data

1. Introduction

Within Artificial Intelligence, one of the ultimate goals is to learn generalised and human-interpretable knowledge from raw data. Neuro-symbolic approaches tackle this problem by combining the best features of both neural networks and symbolic learning and reasoning techniques [1, 2, 3]. For example, many existing approaches improve the training of a neural network using a symbolic knowledge base that is manually engineered [4, 5, 6, 7]. In contrast, systems that learn symbolic knowledge are generally only applied to structured data, and use pre-trained neural networks when applied to raw data. To address this limitation, MetaAbd jointly trains a neural network whilst learning symbolic knowledge [8], using a meta-interpretive learner that only learns first-order definite logic programs [9]. Consequently, the learned symbolic knowledge lacks the expressivity required for many common-sense learning and reasoning tasks [10].

AAAI 2022 FALL SYMPOSIUM SERIES, Thinking Fast and Slow and Other Cognitive Theories in AI, November 17-19, 2022, Westin Arlington Gateway in Arlington, Virginia, USA
dancunnington@uk.ibm.com (D. Cunnington)
Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

In this paper, we present the Neuro-Symbolic Inductive Learner (NSIL), which jointly trains a neural network whilst learning complex, first-order symbolic knowledge, called an inductive hypothesis. Building upon state-of-the-art symbolic learners [11, 12], NSIL generalises MetaAbd by learning more expressive knowledge from raw data in the language of Answer Set Programming (ASP) [13], solving computationally harder problems that MetaAbd cannot solve.
In NSIL, the learned hypothesis maps latent concepts to downstream labels, and the neural network is trained to classify these latent concepts from raw data by reasoning over the learned hypothesis. This is a form of neuro-symbolic reasoning, where the neural network is trained in a weakly supervised fashion, as no labels are given for the latent concepts. In NSIL, we use the NeurASP system [7] to perform this computation, and seamlessly integrate symbolic learning by creating corrective examples for the symbolic learner. These examples use the current predictions of the neural network and the space of possible latent concept values relevant to the downstream label. We propose a specific mechanism for weighting these examples, which biases the symbolic learner towards exploration or exploitation of the inductive hypothesis search space. NSIL is the first system that jointly trains a generalΒΉ neural network from scratch, whilst inducing a complex first-order hypothesis in the form of an ASP program. Our experiments apply NSIL to two tasks, MNIST Arithmetic and MNIST Hitting Sets, where each training sample contains a collection of raw MNIST inputs [15, 16], together with a downstream label. We include specific relations as domain knowledge for each task, and compare NSIL's performance with purely differentiable baseline models. The results show that NSIL: (1) outperforms the baselines in terms of its overall accuracy and data efficiency, (2) trains the neural network to predict latent concepts with comparable accuracy to fully supervised differentiable models, and (3) learns interpretable knowledge. In comparison with MetaAbd, the MNIST Hitting Sets task [17] demonstrates that NSIL is capable of learning inductive hypotheses of increased expressivity, including constraints and choice rules. The learned hypotheses solve these computationally harder problems and go a step further by generating all the hitting sets of a given collection. This can only be achieved when the learned hypothesis is expressed in ASP, which is not possible with MetaAbd.

2. Related work

Many neuro-symbolic learning and reasoning techniques have been proposed in the literature [18]. Their primary objective is to leverage properties of symbolic methods to enhance the data efficiency, transferability and interpretability of neural models. Some techniques inject symbolic knowledge directly into the neural architecture [4, 6], whereas others preserve a clear distinction between a neural and a symbolic reasoning component [5, 7]. The key drawback of these neuro-symbolic reasoning approaches is that they require complete and manually engineered symbolic background knowledge. In contrast, our approach learns symbolic knowledge, expressed as a first-order rule-based program, while training a neural network. Pure symbolic learning systems are capable of learning interpretable knowledge in a data-efficient manner [19, 20, 21]. However, as these systems are logic-based, they can only learn from structured data. Even recent differentiable methods [22, 23, 24] are only applied to structured data, and use pre-trained neural networks when applied to raw data [25, 26]. To address this limitation, neuro-symbolic learning systems have been proposed [8, 27, 14].

ΒΉ [14] is limited to end-to-end training of a binary neural network only, in conjunction with rule learning.
𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ [8] extends [27] by using abduction and induction to jointly train a neural network and induce definite logic programs from raw data. However, the Metagol symbolic learner [9], used by 𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ , can only learn symbolic knowledge expressed as definite logic programs without function symbols, which can compute only polynomial functions [10]. Hence, 𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ cannot learn more expressive knowledge involving defaults, exceptions, constraints and choice, which are essential aspects of common-sense learning and reasoning. Similarly to 𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ , we also train a neural network whilst learning symbolic knowledge, through weak supervision from downstream labels. However, we learn first-order complex knowledge expressed in the language of ASP, which is more general than symbolic learning of definite clauses [28, 11, 29], and can solve computationally harder problems, such as the hitting sets NP decision problem [17]. In this task, the training set consists of binary True/False labels that indicate the existence of a hitting set over a collection of up to four sets, but our learned programs are general as they can decide if there is a hitting set over collections of any number of sets. This is not possible with 𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ , or indeed any neural network architecture, since at inference time their input/output format must be exactly the same as that observed during training. Furthermore, our learned programs not only solve the hitting set decision problem, but can also generate all the hitting sets of a given collection. Finally, 𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ may be vulnerable to becoming stuck in a local optima, which justifies why the authors present in [8] an evaluation using a CNN pre-trained in a one-shot fashion. In contrast, we demonstrate convergence to the optimal hypothesis over very large hypothesis spaces without using ground-truth concept labels. This is also the case when a reduced training set is used, for which the neural network remains under-trained. Finally, the approach in [14] uses Binary Neural Networks (BNN) encoded into an ASP program, together with an apperception theory as background knowledge. The task of jointly learning the weights of the BNN and the additional knowledge required to solve a given problem, is formalised as solving a single ASP problem, and as such is fully delegated to an ASP solver. BNNs are an active area of research but they are still far from solving the kind of problems that regular neural networks can solve. Hence, since ASP doesn’t support floating point arithmetic, the approach in [14] is limited by the BNN and is unlikely to handle tasks that require complex neural network architectures. 3. Neuro-Symbolic Inductive Learner 3.1. Problem setting We consider a dataset of samples 𝐷 = {βŸ¨π‘‹, π‘¦βŸ©, . . .} and a domain knowledge 𝐡, where 𝑋 is a collection of size π‘š containing multiple raw data π‘₯𝑖 ∈ 𝒳 , 𝑦 ∈ 𝒴 is a target label, and 𝐡 is a first-order logic program. For a given task, we assume a single latent concept 𝐢 = βŸ¨π‘›πΆ , 𝒡𝐢 ⟩, where 𝑛𝐢 is the name of the latent concept, and 𝒡𝐢 is its set of possible values, which we assume to be finite. Each π‘₯𝑖 ∈ 𝑋 has an associated latent concept value 𝑧𝑖 ∈ 𝒡𝐢 , and 𝑍 is a collection of size π‘š denoting the combination of latent concept values associated with 𝑋. Let β„’ be the language defined using the relations in 𝐡, 𝐢, and 𝒴. 
Informally, we wish to learn complex symbolic knowledge 𝐻, called an inductive hypothesis, expressed as a first-order logic program in the language β„’, that defines the label 𝑦 for a sample 𝑋 in terms of its latent concept values 𝑍 and the domain knowledge 𝐡. Formally, the objective of NSIL is to learn a composite function of the form β„Ž* ∘ 𝑓* : 𝒳^π‘š β†’ 𝒴, where 𝑓* : 𝒳^π‘š β†’ 𝒡𝐢^π‘š and β„Ž* : 𝒡𝐢^π‘š βˆͺ 𝐡 β†’ 𝒴. We learn approximations 𝑓 of 𝑓* and β„Ž of β„Ž*, and refer to them as the neural and symbolic learning components of NSIL respectively. Under our single latent concept assumption, 𝑓 is effectively a classifier for the latent concept 𝐢 in each input. Therefore, to simplify notation, we refer to the set of possible latent concept values 𝒡𝐢 as 𝒡. During training, our approach learns from a dataset 𝐷 and domain knowledge 𝐡. It is crucial to note that 𝐷 does not contain the correct latent labels 𝑍 for any of the βŸ¨π‘‹, π‘¦βŸ© samples. To solve the task, NSIL has to: (1) learn how to correctly map each raw input π‘₯𝑖 to its latent concept value 𝑧𝑖, and (2) learn how to symbolically relate the combination of latent concept values 𝑍 predicted from 𝑋 to the label 𝑦. These mappings correspond to our 𝑓 and β„Ž functions respectively. As β„Ž is purely symbolic, available domain knowledge 𝐡 can be leveraged during learning. However, β„Ž is not differentiable, which makes the joint optimisation of β„Ž ∘ 𝑓 through standard gradient-based learning algorithms infeasible. The challenge is to train 𝑓 and β„Ž via a weakly supervised learning task using 𝐷 and 𝐡 only, such that for every βŸ¨π‘‹, π‘¦βŸ© ∈ 𝐷, β„Ž(𝑓(𝑋) βˆͺ 𝐡) = 𝑦 = β„Ž*(𝑓*(𝑋) βˆͺ 𝐡). NSIL addresses this challenge iteratively, using two steps: 𝑓 is trained using the current inductive hypothesis, denoted 𝐻′, and the symbolic learner is biased to learn a new 𝐻′ using the current 𝑓 and β„Ž. Let us now describe the NSIL architecture, defining the neural and symbolic learning components.

3.2. NSIL architecture

In this section, we instantiate the learning problem tackled by NSIL and present a novel neuro-symbolic learning architecture. The neural component 𝑓 uses a neural network 𝑔 : 𝒳 β†’ [0, 1]^|𝒡|, parameterised by πœƒ, which computes a probability distribution over the latent concept values for each π‘₯𝑖 ∈ 𝑋. 𝑓 then uses the standard arg max function to aggregate the output of 𝑔 for each input into a collection of latent concept values 𝑍. The symbolic learning component β„Ž learns a complex inductive hypothesis 𝐻, expressed in the first-order logic programming language β„’, such that for every βŸ¨π‘‹, π‘¦βŸ© ∈ 𝐷, 𝐡 βˆͺ 𝐻 βˆͺ 𝑍 satisfies 𝑦. The MetaAbd approach [8] addresses a similar problem of training a neural network whilst learning an inductive hypothesis. Informally, MetaAbd induces an inductive hypothesis 𝐻, and uses 𝐻 to β€œabduce” relevant facts (i.e., pseudo-labels) to prune the search space for 𝑍, while 𝑔 is trained with respect to these computed pseudo-labels. The MetaAbd approach relies upon a definition of 𝐡 βˆͺ 𝐻 βˆͺ 𝑍 covering 𝑦 in terms of logical entailment (i.e., 𝑃(𝑦|𝐡, 𝐻, 𝑍) = 1 if 𝐡 βˆͺ 𝐻 βˆͺ 𝑍 |= 𝑦 and 0 otherwise), and the strong (implicit) assumption that 𝐡 βˆͺ 𝐻 βˆͺ 𝑍 defines a unique logical model. Such an assumption is related to the confined expressivity of the symbolic learner of MetaAbd, which can only learn inductive hypotheses expressed as first-order definite clauses.
Consequently, hypotheses learned by MetaAbd can only solve tasks requiring polynomial functions [10]. In contrast, we are able to learn more complex functions, such as in the following example:

Example 1. Consider a set 𝒳 of MNIST images from classes 1-5. The task is to decide whether a collection of sets 𝑋, containing images from 𝒳, has a hitting set of size ≀ 2. For example, if 𝑋 = {{ , }, { }, { }}, then the answer is yes, because {1, 3} is a hitting set. If 𝑋 = {{ , }, { }, { }}, then the answer is no, because there are no hitting sets. The training data 𝐷 contains pairs βŸ¨π‘‹, π‘¦βŸ©, where 𝑦 = 1 if there is a hitting set of 𝑋, and otherwise 𝑦 = 0. Also, the domain knowledge 𝐡 defines the notion of an element in a set. Our system will not only learn an inductive hypothesis 𝐻 that solves the hitting sets NP decision problem, but the same hypothesis will also be able to generate all the possible hitting sets of a given collection 𝑋. These kinds of problems cannot be solved by MetaAbd.

[Figure 1: NSIL learning with a single data point βŸ¨π‘‹, π‘¦βŸ© (panel a), and inference with a single input 𝑋 (panel b).]

NSIL provides a general neuro-symbolic learning architecture, where the learned inductive hypothesis is a type of first-order logic program called an answer set program, based on the ASP paradigm [13]. ASP is a rule-based declarative formalism for representing and solving combinatorial problems. Its highly expressive language allows for rules representing non-determinism, choice and constraints, among other constructs. Knowledge about a real-world problem can be represented as an ASP program, whose answer sets, or logical models, correspond to the solutions of the original problem. In the context of the hitting set decision problem in ASP, stating that 𝐡 βˆͺ 𝐻 βˆͺ 𝑍 covers 𝑦 means that there is at least one answer set of 𝐡 βˆͺ 𝐻 βˆͺ 𝑍 if 𝑦 = 1, or no answer sets if 𝑦 = 0 [13]. ASP programs can be learned from noisy, labelled, symbolic examples [30], and can also be used to train neural networks by applying symbolic reasoning over semantic constraints [7]. As illustrated in Figure 1, the NSIL architecture leverages these two aspects of ASP by seamlessly integrating two components: (1) a Learning from Answer Sets (LAS) symbolic learner [30], for learning an inductive hypothesis 𝐻 that maps latent concept values 𝑍 to outputs 𝑦; (2) NeurASP [7], to train the neural network to predict 𝑍 from 𝑋 by applying neuro-symbolic reasoning over the learned 𝐻. Let us now define the optimisation criteria for these components. Informally, let 𝐸𝒡,𝒴 be a set of examples depending on 𝒡 and 𝒴 (see Subsection 3.3 for details on how 𝐸𝒡,𝒴 is dynamically generated during training). Each example 𝑒 ∈ 𝐸𝒡,𝒴 has an associated penalty 𝑒pen that indicates the cost that a candidate inductive hypothesis 𝐻′ pays for not covering it. Let 𝐡 be an ASP domain knowledge, defining also the set β„‹ of possible learnable ASP programs. An inductive hypothesis 𝐻 ∈ β„‹ is an ASP program with size(𝐻) = |𝐻| given by the number of relations in 𝐻, and a cost given by the sum of the penalties of uncovered examples. We denote the set of examples uncovered by 𝐻 with UNCOV(𝐻, 𝐸𝒡,𝒴), and define the cost of 𝐻 as pen(𝐻, (𝐡, 𝐸𝒡,𝒴)) = Ξ£_{𝑒 ∈ UNCOV(𝐻, 𝐸𝒡,𝒴)} 𝑒pen. The score of 𝐻, denoted score(𝐻, (𝐡, 𝐸𝒡,𝒴)), is the sum |𝐻| + pen(𝐻, (𝐡, 𝐸𝒡,𝒴)).
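To make this scoring concrete, the following is a minimal Python sketch of the score just defined; the covers() helper is hypothetical and stands in for the ASP-based check of whether 𝐡 βˆͺ 𝐻 covers a given example:

from dataclasses import dataclass
from typing import List

@dataclass
class LasExample:
    penalty: int  # e_pen: cost paid by a hypothesis that does not cover this example
    # inclusions, exclusions and context are omitted in this sketch

def covers(B: List[str], H: List[str], example: LasExample) -> bool:
    """Hypothetical helper: true iff B and H together cover the example (delegated to an ASP solver)."""
    raise NotImplementedError

def score(H: List[str], B: List[str], examples: List[LasExample]) -> int:
    """score(H, (B, E)) = |H| + sum of penalties of the examples H fails to cover."""
    uncovered_penalty = sum(e.penalty for e in examples if not covers(B, H, e))
    return len(H) + uncovered_penalty

The LAS symbolic learner returns a hypothesis minimising this score (Equation 1 below), i.e. the shortest hypothesis that covers the most heavily weighted examples.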
A LAS symbolic learner computes the function β„Ž : 𝒡 βˆͺ 𝐡 β†’ 𝒴 by solving the following optimisation problem:

𝐻* = arg min_{𝐻 ∈ β„‹} score(𝐻, (𝐡, 𝐸𝒡,𝒴))     (1)

The above minimisation problem can be equivalently interpreted as jointly maximising the generality of 𝐻 (i.e., the most compressed ASP program) and its coverage of the examples in 𝐸𝒡,𝒴. Given a candidate inductive hypothesis 𝐻′, NSIL uses NeurASP to train the neural network. Therefore, the following is a reformulation of the NeurASP learning task [7]. Let Ξ πœƒ be the ASP program given by Ξ πœƒ = 𝐡 βˆͺ 𝐻, where 𝐡 includes a choice rule over the possible latent concept values in 𝒡 that the neural network can predict from a collection 𝑋 of raw data in a given sample βŸ¨π‘‹, π‘¦βŸ©. Choice in ASP means that Ξ πœƒ may have multiple answer sets, which include relations of the form 𝑔(π‘₯𝑖) = 𝑧𝑖. We indicate with π΄π‘†Ξ πœƒ(𝑔(π‘₯𝑖) = 𝑧𝑖) the set of answer sets of Ξ πœƒ that include the facts 𝑔(π‘₯𝑖) = 𝑧𝑖. For any 𝐴 ∈ π΄π‘†Ξ πœƒ(𝑔(π‘₯𝑖) = 𝑧𝑖), we define the probability of an answer set 𝐴 as:

𝑃(𝐴) = [ ∏_{𝑔(π‘₯𝑖)=𝑧𝑖 ∈ 𝐴} 𝑃(𝑔(π‘₯𝑖) = 𝑧𝑖) ] / |π΄π‘†Ξ πœƒ(𝑔(π‘₯𝑖) = 𝑧𝑖)|

where 𝑃(𝑔(π‘₯𝑖) = 𝑧𝑖) is given by the neural network. Let Ξ πœƒ(𝑦) = Ξ πœƒ βˆͺ {← not 𝑦}, where the constraint ← not 𝑦 indicates that the label 𝑦 must be satisfied in the answer sets of Ξ πœƒ(𝑦). Also, let π΄π‘†Ξ πœƒ(𝑦)(𝑔(π‘₯𝑖) = 𝑧𝑖) denote the answer sets of Ξ πœƒ(𝑦) that include the facts 𝑔(π‘₯𝑖) = 𝑧𝑖. The probability of satisfying a particular label 𝑦 can then be computed as:

𝑃(Ξ πœƒ(𝑦)) = Ξ£_{𝐴 ∈ π΄π‘†Ξ πœƒ(𝑦)(𝑔(π‘₯𝑖)=𝑧𝑖)} 𝑃(𝐴)

This second step of NSIL computes the function 𝑔 : 𝒳 β†’ 𝒡 by solving the following optimisation, which maximises the log-likelihood of the labels 𝑦 under the ASP semantics of Ξ πœƒ(𝑦):

πœƒ* = arg max_πœƒ Ξ£_{βŸ¨π‘‹,π‘¦βŸ© ∈ 𝐷} log(𝑃(Ξ πœƒ(𝑦)))     (2)

Hence, NSIL integrates the neural and symbolic components by iteratively solving Equations 1 and 2:
1. NSIL is initialised by a bootstrap stage that learns an initial inductive hypothesis 𝐻′ that satisfies the domain knowledge 𝐡 and covers each possible label in 𝒴, independently of specific neural predictions. πœƒ is initialised randomly.
2. The parameters πœƒ of the neural network are updated for 1 epoch on dataset 𝐷, using the ASP program Ξ πœƒ(𝑦) = 𝐡 βˆͺ 𝐻′ βˆͺ {← not 𝑦}, for each βŸ¨π‘‹, π‘¦βŸ© ∈ 𝐷.
3. Corrective examples are constructed using the neural network predictions (described in detail in Section 3.3), and a new 𝐻′ is learned with a LAS symbolic learner.
4. Steps 2-3 are repeated for a fixed number of iterations.
NSIL outputs the trained neural network 𝑔 and the learned inductive hypothesis 𝐻.

3.3. Symbolic learning optimisation

At each iteration of NSIL, and as shown in Equation 1, the LAS symbolic learner searches for a candidate inductive hypothesis 𝐻′ ∈ β„‹ that minimises the score in terms of its size and the total penalty of the examples in 𝐸𝒡,𝒴 that are not covered by 𝐻′. The examples, and their associated weight penalties, are therefore crucial to the optimisation, as the symbolic learner is encouraged to learn a 𝐻′ that covers examples with large weight penalties. The examples are symbolic and define logical relations that are expected to be true or false w.r.t. some contextual information, such that these relations are included or excluded from the answer sets of 𝐡 βˆͺ 𝐻′ that cover the example.
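Before defining these examples formally, the alternation between Equations 1 and 2 (Steps 1-4 above) can be summarised in the following minimal Python-style sketch; all of the callback names are hypothetical and stand in for the LAS learner, one epoch of NeurASP training, and the corrective-example construction described in the rest of this subsection:

def nsil(D, B, bootstrap_examples, learn_hypothesis, train_neurasp_epoch,
         build_corrective_examples, init_theta, num_iterations=20):
    # Step 1: bootstrap examples cover every label y in Y; theta starts random.
    examples = bootstrap_examples()
    H = learn_hypothesis(B, examples)                 # Equation (1)
    theta = init_theta()
    for _ in range(num_iterations):
        # Step 2: one epoch of neural training under B, H' and the constraint "<- not y".
        theta = train_neurasp_epoch(D, B, H, theta)   # Equation (2)
        # Step 3: corrective examples from the current predictions, then re-learn H'.
        examples = build_corrective_examples(D, B, H, theta, examples)
        H = learn_hypothesis(B, examples)             # Equation (1)
        # Step 4: repeat.
    return H, theta

This is only a sketch of the control flow under the stated assumptions; in NSIL the calls are realised by FastLAS/ILASP and NeurASP respectively.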
Formally, an example 𝑒 ∈ 𝐸𝒡,𝒴 is a tuple 𝑒 = βŸ¨π‘’pen , 𝑒inc , 𝑒exc , 𝑒ctx ⟩ where 𝑒pen is the weight penalty, and 𝑒inc and 𝑒exc are respectively, sets of relations to be included and excluded w.r.t. the contextual information 𝑒ctx . Examples can be either positive or negative, depending on the task; positive examples are used when the search space β„‹ contains clauses without constraints, whereas positive and negative examples are used when β„‹ includes more expressive first-order rules, including constraints. In this paper, the key intuition is that each example associates a possible combination of latent concepts 𝑍 with a label 𝑦, expressed using 𝑒ctx , and 𝑒inc , 𝑒exc respectively. For positive examples, the weight penalty therefore influences whether 𝑦 should be in the answer sets of 𝐡 βˆͺ 𝐻 β€² βˆͺ 𝑍, and the higher the penalty, the higher the cost a candidate hypothesis pays if 𝑦 is not in the answer sets of 𝐡 βˆͺ 𝐻 β€² βˆͺ 𝑍. The opposite holds for negative examples. 𝐸𝒡,𝒴 is constructed initially during a bootstrap stage, and dynamically modified throughout training to optimise towards the final 𝐻. The example structure for each of the learning tasks we have used to evaluate NSIL is given in Section 4, and the reader is referred to [30] for further details regarding examples for the LAS systems. Bootstrap examples Initially, the neural network is not trained and therefore cannot be used to optimise 𝐻 β€² . We bootstrap the learning of 𝐻 β€² using a set of examples 𝐸𝒡,𝒴 defined to cover all the target labels 𝑦 ∈ 𝒴, with respect to possible combinations of latent concept values 𝑍. The symbolic learner ensures the initial 𝐻 β€² is the shortest inductive hypothesis that satisfies the label coverage. Let us now define how 𝐸𝒡,𝒴 is dynamically modified throughout training using corrective examples. Computing corrective examples On each iteration of NSIL training, we use two sources of information to optimise 𝐻. Once the neural network has been trained with a candidate 𝐻 β€² , we can analyse the overall performance of both neural and symbolic components in predicting training set labels 𝑦 from 𝑋. Also, the neural network returns confidence scores when predicting 𝑍. The challenge is how to create corrective examples from these information sources, that appropriately weight possible combinations of 𝑍 and 𝑦 in the form of examples for the symbolic learner, such that a new 𝐻 β€² is learned. To achieve this, we create two types of corrective examples called explore and exploit, that respectively encourage exploration and exploitation of the hypothesis search space. The explore examples relate possible 𝑍 to a different label than what 𝐻 β€² currently predicts, whilst the exploit examples reinforce the current prediction of 𝑦 for a particular 𝑍. When a combination of 𝑍 and 𝑦 is obtained, we create a pair of linked explore and exploit examples, adjusting their weights simultaneously, in order to give a consistent optimisation signal to the symbolic learner. The goal is to maximise (c.f. minimise) the weights of the exploitation (c.f. exploration) examples that contain the correct 𝑍 for each label 𝑦, such that the correct 𝐻 is learned. At this stage, two questions remain: (1) How are 𝑍 and 𝑦 obtained? (2) How are the weights calculated and updated? To obtain 𝑍 and 𝑦, for exploration we compute the answer sets of the ASP program 𝐡 βˆͺ 𝐻 β€² βˆͺ {← not 𝑦}, for each 𝑦 ∈ 𝒴, where 𝐡 contains a choice rule for possible values of 𝑍. 
The answer sets therefore contain the 𝑍 that lead to label 𝑦, given the current 𝐻′. For exploitation, for each βŸ¨π‘‹, π‘¦βŸ© ∈ 𝐷, we run a forward pass of the neural network on each π‘₯𝑖 ∈ 𝑋 to obtain 𝑍, and 𝑦 is obtained directly from the training set. When a combination of 𝑍 and 𝑦 is first obtained, the weights of the corresponding explore and exploit examples are initialised as 𝑒pen = 1 (i.e., equal penalty). Let us now define how these example weights are updated throughout NSIL training.

Updating the weights of corrective examples. Weight updates are computed at the end of each NSIL iteration. Firstly, the weight of the explore (resp. exploit) example is increased (resp. decreased) by the False Negative Rate (FNR) that NSIL has for 𝑦, w.r.t. the current 𝐻′ and πœƒ:

FNR_{𝐻′,πœƒ}(𝑦) = 100 Γ— ( 1 βˆ’ TP_{𝐻′,πœƒ}(𝑦) / (TP_{𝐻′,πœƒ}(𝑦) + FN_{𝐻′,πœƒ}(𝑦)) )     (3)

where TP_{𝐻′,πœƒ}(𝑦) and FN_{𝐻′,πœƒ}(𝑦) are respectively the number of true positive and false negative training samples with label 𝑦 that NSIL predicts on the training set. The FNR is multiplied by 100 to ensure the weight adjustment is informative when converted to an integer, as ASP does not support real numbers. For each explore example related to 𝑦, the weight is updated as 𝑒pen + πœ† FNR_{𝐻′,πœƒ}(𝑦), and the weight of the corresponding exploit example is updated as 𝑒pen βˆ’ πœ† FNR_{𝐻′,πœƒ}(𝑦), where in both cases 𝑒pen is the current example weight and πœ† ∈ [0, 1] is a learning rate that controls the effect of the proposed update.Β² The weight is then clipped to a minimum of 1 and a maximum of 101, to ensure the example weights are > 0 as required by the LAS symbolic learner.

Β² In this paper, we set πœ† = 1 as this yields the best performance. Please see Appendix A.4.1 for more details.

Secondly, the weight of the exploit (resp. explore) example is increased (resp. decreased) by the aggregated confidence score of the neural network when predicting 𝑍 from 𝑋. Let π‘ƒπœƒ(𝑋) = ∏_{π‘₯𝑖 ∈ 𝑋} max(𝑔(π‘₯𝑖)), and let 𝐷𝑍,𝑦 be the set of βŸ¨π‘‹, π‘¦βŸ© ∈ 𝐷 with label 𝑦 where the neural network predicts 𝑍. The aggregated confidence score CONFπœƒ is defined as:

CONFπœƒ(𝑍, 𝑦) = (100 / |𝐷𝑍,𝑦|) Γ— Ξ£_{βŸ¨π‘‹,π‘¦βŸ© ∈ 𝐷𝑍,𝑦} π‘ƒπœƒ(𝑋)     (4)

The example weight updates are then calculated as 𝑒pen + πœ† CONFπœƒ(𝑍, 𝑦) and 𝑒pen βˆ’ πœ† CONFπœƒ(𝑍, 𝑦) for the exploit and explore examples respectively. Again, the weights are clipped to the range {1..101}. To summarise, Equations 3 and 4 generate exploration and exploitation signals that encourage the symbolic learner to explore the hypothesis space when NSIL performs poorly, and to retain the current 𝐻′ when NSIL performs well. Let us now present the results of our experiments.

4. Experiments

The experiments answer the following questions: (1) Can NSIL solve tasks involving the joint training of both neural and symbolic components? (2) Given the presence of symbolic learning, is NSIL more data efficient than pure differentiable models, whilst achieving the same if not better overall accuracy? (3) Given the weak supervision, does NSIL train the neural network effectively, and learn accurate, general and interpretable knowledge? (4) Can NSIL learn symbolic knowledge from raw data that solves computationally harder problems, compared to computing polynomial functions? To answer these questions, we evaluate NSIL using two task domains: MNIST Arithmetic and MNIST Hitting Sets.
Questions 1 and 3 are addressed in both domains, where we evaluate the final accuracy achieved by the neural network, and also present the learned inductive hypotheses. Question 2 is answered in the MNIST Arithmetic tasks by using reducing percentages of training data. Finally, question 4 is answered in the MNIST Hitting Sets tasks that require more complex and expressive hypotheses to be learned. Baselines To the best of our knowledge, 𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ is the only method that jointly trains both a conventional neural network and a symbolic learner. However, 𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ is limited to learning definite rules. We demonstrate the advantage of NSIL w.r.t. 𝑀 π‘’π‘‘π‘Žπ΄π‘π‘‘ in the MNIST Hitting Sets tasks, by showing that NSIL learns more general and expressive programs that can be applied to any collection of sets, and can generate all the hitting sets of a given collection. We compare NSIL to the following differentiable models. In both tasks, we use two variants of a Concept Bottleneck Model (CBM) [31]. The perception component is the same as the one used in NSIL, but the reasoning component is differentiable, and implemented as a neural network with 3 linear layers. We use the joint CBM and ensure no concept loss is used during training, therefore achieving the same weak supervision as NSIL. The second CBM variant, denoted CBM-S, adds a softmax layer to the CNN, replicating more closely the neural network used in NSIL. As a third baseline, in the MNIST Arithmetic tasks, we use the CNN from [5, 7], and in the MNIST Hitting Sets tasks, we use a CNN-LSTM. Finally, in the MNIST Arithmetic tasks, we attach a Neural Arithmetic Logic Unit (NALU) and a Neural Accumulator (NAC) to a CNN-LSTM [32]. NALU and NAC are not used in the Hitting Sets tasks as they are designed for regression and not classification. Full architectural details are presented in Appendix A.4.3. Experiment setup The neural network in NSIL is the MNIST CNN model from [5, 7], and we use the FastLAS [11] symbolic learner in the MNIST Arithmetic tasks, and ILASP [28] in the MNIST Hitting Sets tasks. To evaluate NSIL, we use the inference architecture shown in Figure 1b. The neural component maps each raw input π‘₯𝑖 to its latent concept value 𝑧𝑖 , and the symbolic component maps 𝑍 to a final prediction, using the learned hypothesis and 𝐡. Each experiment is repeated 20 times using 20 randomly generated seeds. The performance is measured by mean classification accuracy per epoch, where 1 epoch = 1 NSIL iteration, for comparison to the baseline models. We perform hyper-parameter tuning for all models using a held-out validation set and a separate random seed.3 All experiments are performed on the following infrastructure: RHEL 8.5 x86_64 with Intel Xeon E5-2667 CPUs (20 cores total), and an NVIDIA A100 GPU, 150GB RAM. The ASP encodings of the domain knowledge are given in Appendix A.4.2. 
Β³ Please refer to https://github.com/dancunnington/NSIL for details of all the hyper-parameters used in our experiments.

[Figure 2: MNIST Arithmetic accuracy with reducing training sets. Panels: (a) Addition 100%, (b) Addition 10%, (c) Addition 5%, (d) Addition 1%, (e) E9P 100%, (f) E9P 10%, (g) E9P 5%, (h) E9P 1%; accuracy vs. epoch for NSIL, CNN, CBM, CBM-S, CNN-LSTM-NAC and CNN-LSTM-NALU. Error bars indicate standard error.]

4.1. MNIST Arithmetic

We consider the MNIST Addition task for single-digit numbers used in [5, 7], and a variation called MNIST Even9Plus (E9P). Both use a set 𝒳 composed of raw MNIST images of single digits (0..9), and the dataset 𝐷 contains βŸ¨π‘‹, π‘¦βŸ© samples where 𝑋 is a pair of digit images. The space of target labels is 𝒴 = {0..18}, and the latent concept βŸ¨π‘›πΆ, π’΅πΆβŸ© has 𝑛𝐢 = digit and 𝒡𝐢 = {0..9}. In the MNIST Addition task, the label 𝑦 indicates the sum of the digit values in 𝑋. The goal is to train the neural network to classify the digit images in 𝑋, and to jointly learn an inductive hypothesis that defines the sum of the two digits. In the MNIST Even9Plus task, 𝑋 is the same pair of images, but the label 𝑦 is equal to the digit value of the second image if the digit value of the first image is even, or 9 plus the digit value of the second image otherwise. Both tasks use training, validation and test datasets of size 24,000, 6,000, and 5,000 samples respectively. The latent concept of a raw image π‘₯𝑖 ∈ 𝑋 is represented as digit(𝑖, 𝑧𝑖), where 𝑖 ∈ {1, 2} and 𝑧𝑖 is the associated latent concept value. 𝐡 uses the relation result to express the final label, the relations even, not even, plus_nine, and β€˜=’, as well as the function β€˜+’, to specify the search space of inductive hypotheses. The positive explore examples have 𝑒inc = {label}, 𝑒exc = {}, and 𝑒ctx given by the set of rules {𝑍. label ← label(Y1), Y1 β‰  𝑦. ← label(Y1), label(Y2), Y2 β‰  Y1.}, which states that label can be true if a different 𝑦 is chosen for 𝑍. Positive exploit examples have instead 𝑒inc = {𝑦}, 𝑒exc = {𝒴 βˆ– {𝑦}}, and 𝑒ctx = 𝑍.

Results and analysis. The overall accuracy results are shown in Figure 2 for different percentages of the training set 𝐷 (100% to 1%). NSIL significantly outperforms the differentiable models on both tasks for all dataset percentages, except 1% on the Addition task. NSIL learns the expected hypotheses, shown in Figure 4a in Appendix A.1. The hypotheses define the result as variable V2, and capture the correct solutions. For the Addition task, the neural network is trained to classify MNIST images with a mean accuracy of 0.988, 0.968, 0.931, and 0.124 for each dataset percentage respectively. The state-of-the-art for MNIST image classification is 0.991, when the CNN is trained in a fully supervised fashion [33].
We achieve a similar accuracy of 0.988 with only weak supervision during training. Also, when training with only 1,200 samples (5%), NSIL trains the neural network to 0.931 accuracy. For the E9P task, NSIL learns a more expressive inductive hypothesis with negation as failure. The mean accuracy of the neural network is 0.987, 0.972, 0.962, and 0.887 for each dataset percentage respectively. Again, this is a negligible drop in performance compared to the state-of-the-art in fully supervised training [33]. Interestingly, the performance of NSIL trained with just 1% of the data is very poor in the Addition task, although this is not the case in the E9P task, where the neural network achieves 0.887 accuracy with just 240 training samples (1%). Despite the same domain knowledge and hypothesis search space, the E9P label is more informative: there is a reduced number of possible latent concept values for certain labels, which gives a stronger back-propagation signal for training the neural network. E.g., for all samples with label 𝑦 = 10, the first digit must be odd and the second digit must be 1, as it is only possible to obtain 10 if the first digit is odd and 9 + the second digit sums to 10. In the Addition task, it is possible to obtain 10 with multiple combinations of the first and second digits. Finally, we increased the hypothesis search space in the more challenging E9P task with the additional functions β€˜-’, β€˜Γ—β€™ and β€˜Γ·β€™, as well as the plus_eight, ..., plus_one relations, and NSIL converges to the expected hypothesis in all of these extended search spaces (see Appendix A.2).

4.2. MNIST Hitting Sets

We consider two variations of the hitting sets problem [17]. The first, denoted HS, is defined in Example 1; the second, denoted CHS, adds the constraint that no hitting sets occur if |𝑋| β‰₯ 3 and digit 1 exists in any subset in 𝑋. Each βŸ¨π‘‹, π‘¦βŸ© sample contains 4 MNIST or FashionMNIST images from classes 1-5, arranged into subsets in 𝑋. The label 𝑦 is 1 or 0, indicating the existence of a hitting set, and the goal is to train the neural network to classify MNIST images, whilst learning the HS or CHS rules to decide the existence of, and generate, hitting sets. Training, validation and test datasets contain 1,502, 376, and 316 examples respectively. The latent concept has 𝑛𝐢 = MNIST_class and 𝒡𝐢 = {1..5}, and for a raw image π‘₯𝑖, it is represented as ss_element(𝑠𝑖, 𝑧𝑖), where 𝑖 ∈ {1..4}, 𝑠𝑖 is a subset identifier, and 𝑧𝑖 is the associated latent concept value. We consider well-formed 𝑍, with no duplicate elements in a set, and with elements within the same set arranged in ascending order. The domain knowledge 𝐡 contains π‘˜ = 2, the β€˜!=’ relations, the relation hs stating that an element is in a hitting set, and the ss_element(3, V) and ss_element(V, 1) relations, which define respectively subset 3 (with any element V) and digit 1 in any subset V. Both explore and exploit corrective examples share a common 𝑒inc = {}, 𝑒exc = {}, and 𝑒ctx = 𝑍. Exploration and exploitation are achieved by representing the example positively or negatively w.r.t. the given label. In these tasks, we use a CNN-LSTM baseline, where a CNN first encodes a feature vector for each image in a collection 𝑋, and an LSTM followed by a linear and a sigmoid layer returns a binary classification as to the existence of a hitting set. To encode the structure of a collection, we create an image of β€˜{’ and β€˜}’ characters and pass in the entire collection as input, e.g., 𝑋 = [ , , , , , , , , , ].
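For reference, the downstream label in both hitting sets tasks can be checked symbolically once the latent concept values are known. The sketch below is a brute-force Python check, included only for illustration (it is not the ASP hypothesis NSIL learns, and the example collection is hypothetical); it decides whether a collection of sets of classes has a hitting set of size at most k = 2:

from itertools import combinations
from typing import FrozenSet, List

def has_hitting_set(collection: List[FrozenSet[int]], k: int = 2) -> bool:
    """True iff some set of at most k elements intersects every set in the collection."""
    universe = set().union(*collection) if collection else set()
    for size in range(1, k + 1):
        for candidate in combinations(universe, size):
            if all(s & set(candidate) for s in collection):
                return True
    return not collection  # an empty collection is trivially hit

# Hypothetical collection of latent concept values where {1, 3} is a hitting set,
# mirroring the positive case of Example 1: the label y would be 1.
print(has_hitting_set([frozenset({1, 2}), frozenset({3}), frozenset({1})]))  # True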
Results and analysis. Figure 3 shows the overall accuracy for both hitting sets tasks using only FashionMNIST images, as this dataset is more challenging than MNIST.

[Figure 3: Hitting Sets accuracy on (a) HS FashionMNIST and (b) CHS FashionMNIST; accuracy vs. epoch for NSIL, CNN-LSTM, CBM and CBM-S. Error bars indicate standard error.]

NSIL outperforms the baselines on both tasks and learns the correct hypotheses (see Figure 4b in Appendix A.1). The MNIST training curves are shown in Appendix A.3. In Figure 4b, the first rule in the HS and CHS inductive hypotheses is a choice rule that generates hitting sets containing element V2 at index V1. The second rule defines whether a subset in the collection is hit (i.e., contains an element in the hitting set). The constraints ensure that each subset in the collection is hit, and that elements in the hitting set are different. In the CHS task, the final constraint ensures that collections with 3 subsets and element 1 contain no hitting sets. When classifying images, the neural network achieves a mean accuracy of 0.992 (MNIST) and 0.906 (FashionMNIST) on the HS task, and 0.992 (MNIST) and 0.901 (FashionMNIST) on the CHS task. The neural network achieves near state-of-the-art performance on MNIST images. For FashionMNIST, the state-of-the-art is 0.969 when a CNN is trained in a fully supervised fashion [34], and NSIL achieves 0.906. Despite the hitting sets tasks having more complex hypotheses, the neural networks are trained to a similar accuracy as in the MNIST Arithmetic tasks. All models perform better on the CHS task. This is due to a shortcut: a negative test example with 3 subsets can be correctly predicted if the neural network predicts class 1 for any image within the subset, as opposed to having to rely on accurate predictions for more images, as when the constraint is not present in the HS task. Also, as a result of the constraint, 40 test examples that contain 3 subsets become negative in the CHS task, and the models can take advantage of the shortcut. The inductive hypotheses learned by NSIL have multiple answer sets due to the choice rule 0{hs(V1, V2)}1. Therefore, despite the training labels only indicating the existence of a hitting set, the programs can generate all the hitting sets. We verified this by computing the Hamming loss against the ground-truth hitting sets for all test samples. The Hamming losses are 0.003 and 0.031 on the HS task, and 0.003 and 0.025 on the CHS task, for the MNIST and FashionMNIST images respectively. This indicates almost perfect hitting set generation, with the minor errors due to mis-classifications by the neural network, as opposed to errors in the learned hypotheses. The larger Hamming loss for FashionMNIST images reflects the weaker neural network performance with this dataset. Finally, in the more challenging HS FashionMNIST task, we sampled 10 sets of additional predicates containing a combination of elements 1..5 and subset IDs 1..4. NSIL converged to the expected hypothesis in all of these extended search spaces (see Appendix A.2).

5. Conclusion

This paper introduces a neuro-symbolic learner (NSIL) that learns complex knowledge whilst training a neural network to classify latent concepts from raw data. The framework uses a symbolic learner to learn an inductive hypothesis capable of solving computationally hard problems.
The novel aspect of the architecture is the use of corrective examples to bias the symbolic learner. The results show that NSIL outperforms differentiable models in terms of accuracy and data efficiency, and solves the hitting sets problem that other approaches can’t solve. In terms of limitations, the differentiable baselines learn faster than NSIL, although given more time, NSIL achieves a greater accuracy (see the Appendix for learning time vs. accuracy plots). Future work includes generalising NSIL to support multiple latent concepts by integrating multiple neural networks, and extending NSIL to take advantage of unsupervised learning techniques. References [1] A. d’Avila Garcez, M. Gori, L. C. Lamb, L. Serafini, M. Spranger, S. N. Tran, Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning, FLAP 6 (2019) 611–632. URL: https://collegepublications.co.uk/ifcolog/?00033. [2] A. d’Avila Garcez, L. C. Lamb, Neurosymbolic AI: the 3rd wave, CoRR abs/2012.05876 (2020). URL: https://arxiv.org/abs/2012.05876. arXiv:2012.05876. [3] L. De Raedt, S. DumančiΔ‡, R. Manhaeve, G. Marra, From statistical relational to neuro- symbolic artificial intelligence, in: C. Bessiere (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelligence Organization, 2020, pp. 4943–4950. URL: https: //doi.org/10.24963/ijcai.2020/688. doi:10.24963/ijcai.2020/688, survey track. [4] S. Badreddine, A. d’Avila Garcez, L. Serafini, M. Spranger, Logic tensor networks, Artificial Intelligence 303 (2022) 103649. URL: https://www.sciencedirect.com/science/article/pii/ S0004370221002009. doi:https://doi.org/10.1016/j.artint.2021.103649. [5] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, L. De Raedt, Deepproblog: Neural probabilistic logic programming, in: Advances in Neural Information Processing Systems, 2018, pp. 3749–3759. [6] R. Riegel, A. G. Gray, F. P. S. Luus, N. Khan, N. Makondo, I. Y. Akhalwaya, H. Qian, R. Fagin, F. Barahona, U. Sharma, S. Ikbal, H. Karanam, S. Neelam, A. Likhyani, S. K. Srivastava, Logical neural networks, CoRR abs/2006.13155 (2020). URL: https://arxiv.org/abs/2006. 13155. arXiv:2006.13155. [7] Z. Yang, A. Ishay, J. Lee, Neurasp: Embracing neural networks into answer set program- ming, in: C. Bessiere (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelli- gence Organization, 2020, pp. 1755–1762. URL: https://doi.org/10.24963/ijcai.2020/243. doi:10.24963/ijcai.2020/243. [8] W.-Z. Dai, S. Muggleton, Abductive knowledge induction from raw data, in: Z.-H. Zhou (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, 2021, pp. 1845–1851. URL: https://doi.org/10.24963/ijcai.2021/254. doi:10.24963/ijcai.2021/ 254, main Track. [9] A. Cropper, S. Muggleton, Metagol system, https://github.com/metagol/metagol, 2016. URL: https://github.com/metagol/metagol. [10] E. Dantsin, T. Eiter, G. Gottlob, A. Voronkov, Complexity and expressive power of logic programming, ACM Computing Surveys (CSUR) 33 (2001) 374–425. [11] M. Law, A. Russo, E. Bertino, K. Broda, J. 
Lobo, Fastlas: scalable inductive logic program- ming incorporating domain-specific optimisation criteria, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 2877–2885. [12] M. Law, A. Russo, K. Broda, Inductive learning of answer set programs from noisy examples, Advances in Cognitive Systems (2018) 57–76. [13] M. Gelfond, Y. Kahl, Knowledge representation, reasoning, and the design of intelligent agents: The answer-set programming approach, Cambridge University Press, Cambridge, UK, 2014. [14] R. Evans, M. BoΕ‘njak, L. Buesing, K. Ellis, D. Pfau, P. Kohli, M. Sergot, Mak- ing sense of raw input, Artificial Intelligence 299 (2021) 103521. URL: https:// www.sciencedirect.com/science/article/pii/S0004370221000722. doi:https://doi.org/ 10.1016/j.artint.2021.103521. [15] Y. LeCun, C. Cortes, MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/, 2010. URL: http://yann.lecun.com/exdb/mnist/. [16] H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, CoRR abs/1708.07747 (2017). URL: http://arxiv.org/abs/1708. 07747. arXiv:1708.07747. [17] R. M. Karp, Reducibility among combinatorial problems, in: Complexity of computer computations, Springer, 1972, pp. 85–103. [18] T. R. Besold, A. d. Garcez, S. Bader, H. Bowman, L. C. Lamb, L. de Penning, B. Illuminoo, H. Poon, C. Gerson Zaverucha, Neural-symbolic learning and reasoning: A survey and interpretation, Neuro-Symbolic Artificial Intelligence: The State of the Art 342 (2022) 1. [19] D. Corapi, A. Russo, E. Lupu, Inductive logic programming as abductive search, in: Technical communications of the 26th international conference on logic programming, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2010, pp. 54–63. [20] S. Muggleton, L. De Raedt, Inductive logic programming: Theory and methods, The Journal of Logic Programming 19 (1994) 629–679. [21] S. H. Muggleton, D. Lin, A. Tamaddoni-Nezhad, Meta-interpretive learning of higher-order dyadic datalog: Predicate invention revisited, Machine Learning 100 (2015) 49–73. [22] A. Payani, F. Fekri, Inductive logic programming via differentiable deep neural logic networks, CoRR abs/1906.03523 (2019). URL: http://arxiv.org/abs/1906.03523. arXiv:1906.03523. [23] P. Sen, B. W. S. R. de Carvalho, R. Riegel, A. Gray, Neuro-symbolic inductive logic pro- gramming with logical neural networks, 2021. arXiv:2112.03324. [24] H. Shindo, M. Nishino, A. Yamamoto, Differentiable inductive logic programming for structured examples, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 5034–5041. URL: https://ojs.aaai.org/index.php/AAAI/article/view/16637. [25] D. Cunnington, M. Law, A. Russo, J. Lobo, FF-NSL: feed-forward neural-symbolic learner, CoRR abs/2106.13103 (2021). URL: https://arxiv.org/abs/2106.13103. arXiv:2106.13103. [26] R. Evans, E. Grefenstette, Learning explanatory rules from noisy data, Journal of Artificial Intelligence Research 61 (2018) 1–64. [27] W.-Z. Dai, Q. Xu, Y. Yu, Z.-H. Zhou, Bridging machine learning and logical reasoning by abductive learning, Advances in Neural Information Processing Systems 32 (2019). [28] M. Law, Inductive learning of answer set programs, Ph.D. thesis, Imperial College London, 2018. [29] M. Law, A. Russo, K. Broda, The complexity and generality of learning answer set programs, Artif. Intell. 259 (2018) 110–146. [30] M. Law, A. Russo, K. Broda, Logic-based learning of answer set programs, in: Reasoning Web. 
Explainable Artificial Intelligence - 15th International Summer School 2019, Bolzano, Italy, September 20-24, 2019, Tutorial Lectures, 2019, pp. 196–231.
[31] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, P. Liang, Concept bottleneck models, in: International Conference on Machine Learning, 2020, pp. 5338–5348.
[32] A. Trask, F. Hill, S. E. Reed, J. Rae, C. Dyer, P. Blunsom, Neural arithmetic logic units, Advances in Neural Information Processing Systems 31 (2018).
[33] S. An, M. Lee, S. Park, H. Yang, J. So, An ensemble of simple convolutional neural network models for MNIST digit recognition, arXiv preprint arXiv:2008.10400 (2020).
[34] M. S. Tanveer, M. U. K. Khan, C.-M. Kyung, Fine-tuning DARTS for image classification, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 4789–4796.
[35] S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747 (2016).

A. Appendix

A.1. NSIL learned hypotheses

(a) MNIST Arithmetic

MNIST Addition:
result(V0,V1,V2) :- V2 = V0 + V1.

MNIST E9P:
result(V0,V1,V2) :- even(V0), V2 = V1.
result(V0,V1,V2) :- not even(V0), plus_nine(V1,V2).

(b) MNIST Hitting Sets

HS:
0 {hs(V1,V2)} 1.
hit(V1) :- hs(V3,V2), ss_element(V1,V2).
:- ss_element(V1,V2), not hit(V1).
:- hs(V3,V1), hs(V3,V2), V1 != V2.

CHS:
0 {hs(V1,V2)} 1.
hit(V1) :- hs(V3,V2), ss_element(V1,V2).
:- ss_element(V1,V2), not hit(V1).
:- hs(V3,V1), hs(V3,V2), V1 != V2.
:- ss_element(3,V2), ss_element(V1,1).

Figure 4: NSIL learned hypotheses.

A.2. Increasing the hypothesis space

To demonstrate that NSIL can converge to the correct hypothesis in a variety of hypothesis spaces, we select the more challenging E9P and HS FashionMNIST tasks, and increase the hypothesis space using additional predicates, functions and relations in 𝐡. We record the time and the number of iterations NSIL requires until convergence, and the results demonstrate that NSIL converges to the correct hypothesis in all of these cases, but may require more iterations and/or time to do so.

E9P. We add the arithmetic functions β€˜-’, β€˜Γ—β€™ and β€˜Γ·β€™, as well as the plus_eight, ..., plus_one relations, and increase the grounding of possible labels, such that the results of multiplication and subtraction support all combinations of digits 0..9. Given the strong performance of NSIL with reducing dataset percentages, we use 40% of the data to reduce the neural network training time (this does not affect the symbolic learning time, nor simplify the symbolic learning task). The results are shown in Table 1a.

HS FashionMNIST. Firstly, we define a set of domain knowledge called Standard, which contains only the predicates required to learn the expected rules of the HS task. The Standard domain knowledge includes π‘˜ = 2, the β€˜!=’ relations, and the relation hs stating that an element is in a hitting set. We extend these by sampling 10 sets of additional predicates containing a combination of elements 1..5 (denoted β€˜el’) and subset IDs 1..4 (denoted β€˜ssID’). The results are shown in Table 1b.

A.3. Hitting Sets test accuracy: MNIST images

The results are presented in Figure 5.

Table 1: Increasing the hypothesis space for NSIL. The asterisks indicate the predicates and relations used in the results presented in Section 4. NSIL converges to the correct hypothesis in all of these cases, but sometimes requires more iterations and/or time to do so.
(a) E9P
Domain knowledge | Label range | Convergence iteration | Convergence time (s)
* +, =, even, not even, plus_nine | 0..18 | 2 | 237.88
+, -, Γ—, Γ·, =, even, not even, plus_nine | 0..18 | 2 | 258.31
+, -, Γ—, Γ·, =, even, not even, plus_nine | βˆ’9..81 | 2 | 763.06
=, even, not even, plus_nine, ..., plus_one | 0..18 | 4 | 2034.62
=, even, not even, plus_nine, ..., plus_one | βˆ’9..81 | 3 | 23614.5

(b) HS FashionMNIST
Domain knowledge | Convergence iteration | Convergence time (s)
Standard | 1 | 73.12
Standard, ssID 4, el 3 | 1 | 87.88
Standard, ssID 4, el 4 | 1 | 87.90
Standard, ssID 2, ssID 4 | 1 | 89.88
Standard, el 3, el 4 | 1 | 100.31
Standard, ssID 2, el 2 | 1 | 101.97
Standard, ssID 1, el 4 | 1 | 106.83
Standard, ssID 1, ssID 3 | 2 | 867.00
Standard, ssID 3, el 4 | 3 | 1014.50
Standard, ssID 3, el 3 | 3 | 1242.44
* Standard, ssID 3, el 1 | 3 | 1736.42

[Figure 5: Hitting Sets accuracy when MNIST images are used to represent elements, on (a) HS and (b) CHS; accuracy vs. epoch for NSIL, CNN-LSTM, CBM and CBM-S. Error bars indicate standard error.]

NSIL significantly outperforms the baseline differentiable models on the HS task, and all models perform better on the CHS task, due to the shortcut explained in Section 4.2.

A.4. Experiment setup

A.4.1. Relying on neural network confidence for corrective example weights

In this section, we investigate the πœ† hyper-parameter for NSIL introduced in Section 3.3. When NSIL calculates the weights of the corrective examples for the symbolic learner, two updates occur: exploration, based on the FNR, and exploitation, based on the neural network confidence (see Section 3.3). πœ† can be adjusted to vary the effect of each update. During exploration, we always set πœ† = 1, to give priority to the FNR, as the FNR is reliable and takes into account both neural and symbolic components. However, during exploitation, we may not want to fully rely on the neural network confidence when the neural network is under-trained. Therefore, we run experiments with varying πœ† ∈ {1, 0.8, 0.6, 0.4, 0.2, 0} during exploitation only. We again select the more challenging E9P and HS tasks, and use 40% of the training data in the E9P task. We run NSIL for 10 iterations, with 5 repeats, using different randomly generated seeds. Figure 6 presents the results, where the error bars indicate standard error.

[Figure 6: Accuracy with varying πœ†, E9P and HS tasks. Panels: (a) E9P, (b) HS MNIST, (c) HS FashionMNIST; accuracy vs. NSIL iteration for πœ† ∈ {1, 0.8, 0.6, 0.4, 0.2, 0}.]

In each task, πœ† = 1 and πœ† = 0 yield the best and worst performance respectively, which indicates that the neural network confidence can be relied upon for these tasks. This is our justification for using πœ† = 1 in Section 3.3. Analysing the πœ† = 0 cases, the accuracy on the E9P task improves throughout training and the symbolic learner converges to more accurate hypotheses. However, in the HS tasks, after a few iterations the symbolic learner learns the same hypothesis and performance plateaus. When corrective examples are generated, there are many examples with the same weight, because there are many choices of digits for each combination of collection structure and label, given the labels are binary.
Therefore, the optimisation signal to the symbolic learner is weaker in the HS tasks compared to the E9P task when πœ† = 0, because in the E9P task the labels 0..18 generate corrective examples with many different weights. When πœ† = 1 in the HS tasks, the neural network confidence scores help the symbolic learner to converge to the correct hypothesis by providing a strong exploitation signal for specific examples.

A.4.2. Domain knowledge

The domain knowledge used for all tasks is presented in Figure 7.

(a) MNIST Arithmetic tasks

r(0..18).
d(0..9).
even(X) :- d(X), X \ 2 = 0.
plus_nine(X1,X2) :- d(X1), X2=9+X1.
res(X1,X2,Y) :- dig(1,X1), dig(2,X2).
:- dig(1,X1), dig(2,X2), res(X1,X2,Y1), res(X1,X2,Y2), Y1 != Y2.
#modeh(res(var(d),var(d),var(r))).
#modeb(var(n) = var(d)).
#modeb(var(n) = var(d) + var(d)).
#modeb(plus_nine(var(d),var(r))).
#modeb(even(var(d))).
#modeb(not even(var(d))).

(b) MNIST Hitting Sets tasks

s(1..4).
h(1..2).
e(1..5).
#modeha(hs(var(h), var(e))).
#modeh(hit(var(s))).
#modeb(hs(var(h), var(e)),(positive)).
#modeb(var(e) != var(e)).
#modeb(ss_element(var(s),var(e)),(positive)).
#modeb(ss_element(3,var(e)),(positive)).
#modeb(ss_element(var(s),1),(positive)).
#modeb(hit(var(s))).

Figure 7: ASP encodings of the domain knowledge used within NSIL.

A.4.3. Model details

NSIL CNN. The neural component in NSIL is the CNN architecture from [5, 7]. It consists of an encoder with 2 convolutional layers with kernel size 5, and output sizes 6 and 16 respectively. Each convolutional layer is followed by a max pooling layer of size 2, stride 2. The encoder is followed by 3 linear layers of size 120, 84, and 10 (in the hitting sets tasks, the last layer is of size 5), and all layers are followed by a ReLU activation function. Finally, a softmax layer returns the output probability distribution.

Baselines. The baseline CNN in the MNIST Arithmetic tasks follows the baseline CNN architecture from [5, 7], which is largely the same as the NSIL CNN, except that the size of the last linear layer is 19, the network accepts a pair of concatenated images as input, and a log softmax layer is used to provide a classification. The CNN-LSTM in the MNIST Hitting Sets tasks uses the same CNN encoder as NSIL, applied to each image in the input sequence, followed by an LSTM layer of tunable size, a linear layer of size 1, and a sigmoid layer. The CBM and CBM-S baselines have similar architectures, consisting of the same CNN as NSIL, applied to each image separately. The only difference is that the CBM variant does not contain a softmax layer for the CNN. In both variants, the CNN is followed by 3 linear layers, where the sizes of the first two layers are tunable and the last layer is of size 19 in the MNIST Arithmetic tasks, and 1 in the MNIST Hitting Sets tasks. In the MNIST Addition task, a ReLU activation is used after the first two linear layers for the CBM and CBM-S baselines, as this resulted in improved performance. In the other tasks, no non-linearity was used. The CNN-LSTM-NALU and CNN-LSTM-NAC baselines contain the same CNN as NSIL and the other baselines, followed by an LSTM layer of tunable size, and then either a NALU or a NAC layer. Finally, we use the stochastic gradient descent optimiser for all neural networks [35], and tune the learning rate and momentum.Β³

A.5. Learning time comparison

Figures 8 and 9 show the learning time vs. accuracy comparison between NSIL and the differentiable models for the MNIST Arithmetic and MNIST Hitting Sets tasks respectively.
Each point shows the accuracy after an epoch of training (1 epoch = 1 NSIL iteration), and error bars indicate standard error. Also, the π‘₯-axis is adjusted to show a meaningful comparison, instead of displaying the total running time for 20 epochs. As observed in both task domains, NSIL requires more time but achieves a greater accuracy than the differentiable models.

[Figure 8: MNIST Arithmetic learning time vs. accuracy with reducing training set percentages. Panels: (a) Addition 100%, (b) Addition 10%, (c) Addition 5%, (d) Addition 1%, (e) E9P 100%, (f) E9P 10%, (g) E9P 5%, (h) E9P 1%; accuracy vs. time (s) for NSIL, CNN, CBM, CBM-S, CNN-LSTM-NAC and CNN-LSTM-NALU.]

[Figure 9: Hitting Sets learning time vs. accuracy. Panels: (a) HS MNIST, (b) HS FashionMNIST, (c) CHS MNIST, (d) CHS FashionMNIST; accuracy vs. time (s) for NSIL, CNN-LSTM, CBM and CBM-S.]

A.6. Asset licenses

The ILASP system is free to use for research,⁴ FastLAS⁡ and the FashionMNIST dataset⁢ are both open-source with an MIT license, the MNIST dataset is licensed with Creative Commons Attribution-Share Alike 3.0,⁷ and the CNN models used from DeepProbLog are open-source and licensed with Apache 2.0.⁸

⁴ https://ilasp.com/terms
⁡ https://github.com/spike-imperial/FastLAS/blob/master/LICENSE
⁢ https://github.com/zalandoresearch/fashion-mnist/blob/master/LICENSE
⁷ See bottom paragraph: https://keras.io/api/datasets/mnist/
⁸ https://github.com/ML-KULeuven/deepproblog/blob/master/LICENSE

A.7. Broader impact

There are no direct negative impacts that we can envisage for our work, given that we introduce a general machine learning approach. However, NSIL inherits general concerns regarding the deployment of machine learning systems, and appropriate precautions should be taken, such as ensuring that training data is unbiased, and that model inputs/outputs are monitored for adversarial attacks. As NSIL learns human-interpretable hypotheses using symbolic learning, this may help to mitigate these issues in some applications, by revealing potential bias, and by providing a level of assurance regarding possible downstream predictions based on the learned hypothesis. The usual performance monitoring will still be required if NSIL is deployed into production, to prevent adversarial attacks, and to detect when re-training may be required if the input data is subject to distributional shifts.