    Heuristic Based Induction of Answer Set Programs:
    From Default Theories to Combinatorial Problems

                            Farhad Shakerin and Gopal Gupta

               The University of Texas at Dallas, Richardson TX 75080, USA
                          {fxs130430,gupta}@utdallas.edu



       Abstract. Significant research has been conducted in recent years to extend In-
       ductive Logic Programming (ILP) methods to induce Answer Set Programs (ASP).
       These methods perform an exhaustive search for the correct hypothesis by en-
       coding an ILP problem instance as an ASP program. Exhaustive search, however,
       results in loss of scalability. In addition, the language bias employed in these
       methods is overly restrictive. In this paper we extend our previous work on
       learning stratified answer set programs that have a single stable model to learn-
       ing arbitrary (i.e., non-stratified) ones with multiple stable models. Our extended
       algorithm is a greedy FOIL-like algorithm, capable of inducing non-monotonic
       logic programs, examples of which include programs for combinatorial prob-
       lems such as graph coloring and N-queens. To the best of our knowledge, this is
       the first heuristic-based ILP algorithm to induce answer set programs with multi-
       ple stable models.

       Keywords: Inductive Logic Programming, Machine Learning, Negation as
       Failure, Answer Set Programming


1   Introduction
Statistical machine learning methods produce models that are not comprehensible to
humans because they are algebraic solutions to optimization problems such as risk min-
imization or data likelihood maximization. These methods do not produce any intuitive
description of the learned model. Lack of intuitive descriptions makes it hard for users
to understand and verify the underlying rules that govern the model. Also, these meth-
ods cannot produce a justification for a prediction they compute for a new data sample.
Additionally, if prior knowledge (background knowledge) is extended in these methods,
then the entire model needs to be re-learned. Finally, no distinction is made between ex-
ceptions and noisy data in these methods.
    Inductive Logic Programming [11], however, is one technique where the learned
model is in the form of logic programming rules (Horn clauses) that are comprehen-
sible to humans. It allows the background knowledge to be incrementally extended
without requiring the entire model to be re-learned. Meanwhile, the comprehensibility
of symbolic rules makes it easier for users to understand and verify induced models and
even edit them.
    ILP learns theories in the form of Horn clause logic programs. Extending Horn
clauses with negation as failure (NAF) results in more powerful applications becom-
ing possible, as inferences can be made even in the absence of information. This exten-
sion of Horn clauses with NAF where the meaning is computed using the stable model
semantics [6]—called Answer Set Programming1 —has many powerful applications.
Generalizing ILP to learning answer set programs also makes ILP more powerful. For
a complete discussion on the necessity of NAF in ILP, we refer the reader to [17].
     Once NAF semantics is allowed into ILP systems, they should be able to deal with
multiple stable models, which arise due to the presence of mutually recursive rules
involving negation (called even cycles) [6] such as:
     p :- not q.
     q :- not p.
This program has two stable models, {p} and {q}.
     XHAIL [16], ASPAL [2], ILASP [8] are among the recently emerged systems capa-
ble of learning non-monotonic logic programs. However, they all resort to an exhaustive
search for the hypothesis. The exhaustive search is not scalable to practical datasets.
For instance, (all versions of) the ILASP training procedure times out after a couple of
hours on “Moral Reasoner,” a dataset from the UCI repository. This is a small dataset
containing roughly 200 examples and 50 candidate predicates in the language bias.
     In contrast, traditional ILP systems (that only learn Horn clauses), use heuristics to
guide their search. Use of heuristics allows them to avoid an exhaustive search. These
systems usually start with the most general clauses and then specialize them. They are
better suited for large-scale data-sets with noise, since the search can be easily guided
by heuristics. FOIL [15] is a representative of such algorithms. However, handling nega-
tion in FOIL is somewhat problematic as we discuss in [20]. Also, FOIL cannot handle
background knowledge with multiple stable models, nor can it induce answer set pro-
grams.
     Recently we developed an algorithm called FOLD [20] to automate inductive learn-
ing of default theories represented as stratified answer set programs. FOLD (First Order
Learner of Default rules) extends the FOIL algorithm and is able to learn answer set
programs that represent the underlying knowledge very succinctly. However, FOLD is
limited to stratified answer set programs, i.e., mutually recursive rules through negation
are not allowed in the background knowledge or the hypothesis. Thus,
FOLD is incapable of handling cases where the background knowledge or the hypothe-
sis admits multiple stable models. In this paper, we extend the FOLD algorithm to allow
both the background knowledge and the hypothesis to have multiple stable models. The
extended FOLD algorithm—called the XFOLD algorithm—is much more general than
previously proposed methods.
     This paper makes the following novel contributions: First, it extends FOLD with
non-observation learning capability (Section 3). Then it presents the XFOLD algorithm,
an extension of our previous FOLD algorithm, that can handle background knowledge
with multiple stable models as well as allow inducing of hypotheses that have multiple
stable models (Section 4). To the best of our knowledge, XFOLD is the first heuristic
based algorithm to induce such hypotheses. The XFOLD algorithm can learn ASP pro-
grams to solve combinatorial problems such as graph coloring and N-queens. Because
the XFOLD algorithm is based on heuristic search, it is also scalable; lack of scalability
is a major problem in previous approaches. We assume that the reader is familiar with
answer set programming and stable model semantics. The book by Gelfond and Kahl [5]
is a good source of background material.

1 We use the term answer set programming in a generic sense to refer to normal logic programs,
  i.e., logic programs extended with NAF, whose semantics is given in terms of stable models [5].


2     Background
The FOLD algorithm [20], which is an extension of FOIL [15], learns a concept in
terms of a default and possibly multiple exceptions (and exceptions to exceptions, and
so on). FOLD first tries to learn the default by specializing a general rule of the form
{target(V1 , ...,Vn ) :- true.} with positive literals. As in FOIL, each specialization
must rule out some already covered negative examples without decreasing the number
of positive examples covered significantly. Unlike FOIL, no negative literal is used at
this stage. Once the heuristic score (i.e., information gain) becomes zero, this process
stops. At this point, if any negative examples are still covered, they must be either noisy
data or exceptions to the current hypothesis. Exceptions are learned by swapping the
current positive and negative examples and then calling the same algorithm recursively.
As a result of this recursive process, we can learn exceptions to exceptions, and so on.
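As an illustrative sketch (not one of the experiments in this paper), a default theory
learned in this style has the shape of the classic flying-birds program:
     fly(X) :- bird(X), not ab(X).
     ab(X) :- penguin(X).
The first clause is the learned default; the second, obtained by swapping the examples
and calling the algorithm recursively, captures the exception.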
The FOLD ILP problem of learning a target predicate is formally defined as follows:
Given
 1. a background theory B, in the form of an extended logic program, i.e., clauses of
    the form h ← l1 , ..., lm , not lm+1 , ..., not ln , where l1 , ..., ln are positive literals and
    not denotes negation-as-failure (NAF) and B has no even cycle
 2. two disjoint sets of ground target predicates E + , E − known as positive and negative
    examples respectively
 3. a hypothesis language of function free predicates L, and a refinement operator ρ
    under θ-subsumption [14] that would disallow even cycles.
Find a set of clauses H such that:
    – ∀e ∈ E + , B ∪ H |= e
    – ∀e ∈ E − , B ∪ H ⊭ e
    – B and H are consistent.
In this paper, we extend the FOLD algorithm to relax all three preconditions stated
above so that: (i) the background knowledge B can potentially have more than one
stable model (Section 4); (ii) the examples and the target predicate can differ (Section
3); (iii) the induced hypotheses can have cycles through negation (Section 4).


3     Non-Observation Predicate Learning in FOLD
In the usual machine learning setting of “Observation Predicate Learning” (OPL), examples
and hypotheses define the same predicate. In contrast, the non-OPL setting allows examples
other than ground atoms of the target predicate. The non-OPL setting is natural for many
problems [13]. Therefore, a natural extension of FOLD would be to support the non-OPL
setting. Intuitively, non-OPL requires determining how each non-target example impacts
the correct hypothesis in terms of target ground atoms. The following example shows
how a non-target ground predicate could be expressed in terms of positive and negative
examples of the target predicate.
Example 1 Consider the following background knowledge. Given the positive example
set E + = {p(a), r(c)} and the negative example set E − = {p(d)}, we want to learn the target r(X).
 (1) p(X) :- s(X), not r(X).                   (3) q(a,b).
 (2) s(X) :- q(X,Y), r(Y).                     (4) s(d).
Since B ∪ H must imply p(a), from rule (1) we get s(a) must hold and r(a) should not.
For s(a) to hold, from rule (2) we get q(a,Y ), r(Y ) must hold. Such Y indeed exists from
fact (3). Therefore, r(b) must hold too. Thus, p(a) requires r(b) and not r(a), and p(a)
can be replaced by new examples, i.e., r(b) as a new positive example and r(a) as a new
negative example. The impact of p(d) as a negative example is to force r(d) not to hold:
from (4) we get that s(d) holds, and therefore r(d) must not. Hence, r(d) is a new
negative example and replaces p(d).
The computation performed in Example 1 to replace non-target examples with target ex-
amples is realized using abduction in a goal-directed answer set programming system
called s(ASP) [10, 7]. The s(ASP) system takes an answer set program P and a query
goal Q as inputs and enumerates all answer sets that contain the propositions/predicates
in Q. This enumeration employs co-inductive SLD resolution to systematically compute
elements of the greatest fixed point (GFP) of a program via backtracking. The advantage
of s(ASP) over other answer set solvers is that it lifts the restriction that answer set
programs must be finitely groundable. In order to process a query Q, s(ASP) produces
a set called the “partial answer set” containing the elements that are necessary to
establish Q. The s(ASP) system also allows a query to be run abductively, by first defin-
ing a set of predicates as abducible. If the success of a query Q then depends on
assuming a fact that belongs to the set of abducibles, Q abductively succeeds and the
abduced facts are added to the partial answer set associated with Q.
    Algorithm 1 shows the steps required to solve a non-OPL ILP problem using FOLD.
In the case of Example 1, p(a) is a non-target example. By running s(ASP) and declaring
#abducible r(X), the following partial answer set is produced by s(ASP) for the query
?- p(a).:
{p(a), q(a,b), r(b), s(a), not r(a)}
r(b) and r(a) are added to the sets of positive and negative examples, respectively. It
should be noted that the above set of atoms contains only what is relevant to establish
the query ?- p(a). In practice, this is a small subset of the original stable model. The
fact that s(ASP) does not ground the answer set program makes this approach scalable
compared to SAT-based answer set solvers.
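As a concrete sketch, the s(ASP) input for Example 1 would look roughly as follows
(the #abducible directive follows the usage above; its exact syntax may vary across
s(ASP) versions):
     p(X) :- s(X), not r(X).
     s(X) :- q(X,Y), r(Y).
     q(a,b).
     s(d).
     #abducible r(X).
Running the query ?- p(a). abductively against this program yields the partial answer
set shown above, from which the new target examples are extracted.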

4   Induction of Answer Set Programs with Multiple Stable Models
In this section we extend our FOLD algorithm to learn normal logic programs that
potentially have multiple stable models.
Algorithm 1 Non-OPL Version of FOLD Algorithm
Input: target, B, E + , E −
Output: Hypothesis H
 1: abduced + , abduced − = { }
 2: Let Q be the query: ?- E + , not E −
 3: Run Q on s(ASP)⟨B, #abducibles = {target}⟩                     ⊲ Run s(ASP) with B as input
 4: Let P = partial answer set associated with Q
 5: for each p ∈ P s.t. pred(p) == target do
 6:     if sign(p) == + then
 7:          abduced + ← abduced + ∪ {p}
 8:     else
 9:          abduced − ← abduced − ∪ {p}
10:     end if
11: end for
12: E + ← E + ∪ abduced +
13: E − ← E − ∪ abduced −
14: Run FOLD⟨B, E + , E − , target⟩



The significance of the Answer Set Programming paradigm is that it provides a declarative
semantics under which each stable model is associated with one (alternative) solution to
the problem described by the program. Typical problems of this kind are combinatorial
problems, e.g., graph coloring and N-queens. In graph coloring, one should find different
ways of coloring the nodes of a graph so that no two nodes connected by an edge receive
the same color. N-queens is the problem of placing N queens on a chessboard of size
N × N so that no two queens attack each other.
    In order to learn such programs inductively, the ILP problem definition needs to
be revisited. In the new scenario, positive examples e ∈ E + may not hold in every
model. Therefore, the ILP problem described in the background section would only
allow learning of predicates that hold in all answer sets. This is too restrictive. Brave
induction [18], in contrast, allows examples to hold only in some stable models of B ∪ H.
However, as stated in [8], and as we will show using examples, this is not enough when
it comes to learning global constraints (i.e., rules with empty heads)2 . Learning global
constraints is essential because certain combinations may have to be excluded from all
answer sets.
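For instance, in graph coloring, the requirement that no two adjacent nodes share a color
must hold in every answer set; it is expressed by the headless rule (learned in Section 5):
     :- red(X), edge(X,Y), red(Y).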
    When B ∪ H has multiple stable models, there will be some instances of the target
predicate that hold in all, none, or some of the stable models. Brave induction is not
able to express situations in which a predicate should hold in all or none of the stable
models. An example is a graph in which node 1 is colored red. In such a case, none
of node 1’s neighbors should be colored red. If node 1 happens to have node 2 as a
neighbor, brave induction is not able to express the fact that if the atom red(1) appears
in any stable model of B ∪ H, red(2) should not. In [8], the authors propose a new
paradigm called learning from partial answer sets that overcomes these limitations. We
also adopt this paradigm in this work. Next, we present our XFOLD algorithm.

2 Recall that in answer set programming, a constraint is expressed as a headless rule of the form
     :- B.
  which states that B must be false. A headless rule is really a short-form of a rule of the form
  (called an odd loop over negation [5]):
     p :- B, not p.
Definition 1. A partial interpretation E is a pair E = ⟨E inc , E exc ⟩ of sets of ground
atoms called inclusions and exclusions, respectively. Let A ∈ AS(B ∪ H) denote a stable
model of B ∪ H. A extends ⟨E inc , E exc ⟩ if and only if (E inc ⊆ A) ∧ (E exc ∩ A = ∅).

Example 2 Consider the following background knowledge about a group of friends
some of whom are in conflict with others. The individuals in conflict will not attend a
party together. Also, they cannot attend a party if they work at the time the party is held.
We want our ILP algorithm to discover the rule(s) that will determine who will go to
the party based on the set of partial interpretations provided.

       B : conflict(X,Y) :- person(X), person(Y), conflict(Y,X).
           works(X) :- person(X), not off(X).
           off(X) :- person(X), not works(X).
           person(p1). person(p2). conflict(p1,p4).
           person(p3). person(p4). person(p5). conflict(p2,p3).
Some of the partial interpretations are as follows
(the predicates g, w, o abbreviate goesToParty, works, off respectively):
E1 = {⟨g(p1), g(p2), o(p1), o(p2), w(p3), o(p4), w(p5)⟩, ⟨g(p3), g(p4), g(p5)⟩}
E2 = {⟨g(p3), g(p4), g(p5), o(p1), o(p2), o(p3), o(p4), o(p5)⟩, ⟨g(p1), g(p2)⟩}
E3 = {⟨g(p1), g(p3), g(p5), o(p1), o(p2), o(p3), w(p4), o(p5)⟩, ⟨g(p2), g(p4)⟩}
E4 = {⟨g(p2), g(p5), w(p1), o(p2), w(p3), w(p4), o(p5)⟩, ⟨g(p1), g(p3), g(p4)⟩}
     In the above example, each Ei for i = 1, 2, 3, 4 is a partial interpretation and should be
extended by at least one stable model of B ∪ H for a learned hypothesis H. For instance,
consider the hypothesis H1 = {goesToParty(X) :- off(X)} for learning the
target predicate goesToParty(X). By plugging the background knowledge, the non-
target predicates in E1 , and the hypothesis H1 into an ASP solver (CLASP [4] in our
case), the stable model returned by the solver contains the following:
     {goesToParty(p1), goesToParty(p2), goesToParty(p4)}.
It does not extend E1 : although E1inc ⊆ AS(B ∪ H1 ), we have AS(B ∪ H1 ) ∩ E1exc ≠ ∅. It should
be noted that non-target predicates are treated as background knowledge when calling the
ASP solver to compute the stable model of B ∪ H.
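As a sketch (writing predicate names in full), the program handed to CLASP for this check
consists of B, the non-target atoms of E1inc asserted as facts, and H1 :
     conflict(X,Y) :- person(X), person(Y), conflict(Y,X).
     works(X) :- person(X), not off(X).
     off(X) :- person(X), not works(X).
     person(p1). person(p2). person(p3). person(p4). person(p5).
     conflict(p1,p4). conflict(p2,p3).
     off(p1). off(p2). works(p3). off(p4). works(p5).
     goesToParty(X) :- off(X).
The resulting stable model contains goesToParty(p4), which belongs to E1exc , so it does
not extend E1 .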
Definition 2. An XFOLD problem is defined as a tuple P = ⟨B, L, E + , E − , T ⟩. B is an an-
swer set program, with potentially multiple stable models, called the background knowl-
edge. L is the language bias such that L = ⟨Mh , Mb ⟩, where Mh (resp. Mb ) are called
the head (resp. body) mode declarations [12].
Each mode declaration mh ∈ Mh (resp. mb ∈ Mb ) is a literal whose abstracted argu-
ments are either variable v or constant c. The type of a variable is a predicate defined in
B. The domain of each constant should be defined separately. A hypothesis h is said to
be compatible with a mode declaration m if each instance of a variable in m is replaced
by a variable, and every constant takes a value from the associated domain. The set of
candidate predicates in the greedy search algorithm is selected from Mb ∪ Mh .
    XFOLD is extended with mode declarations to make sure that every clause generated
is safe for the ASP solver CLASP, as it needs to ground the program. To obtain a finite
grounded program, CLASP must ensure that every variable is safe. A variable in the head
is safe if it occurs in a positive literal of the body. XFOLD adds predicates required to ensure
safety, but to keep our examples simple, we omit safety predicates in the paper. E + and
E − are sets of partial interpretations called positive and negative examples, respectively.
T ∈ Mh is the target predicate's name. Each XFOLD run learns a single target predicate.
A hypothesis h ∈ L is an inductive solution of T if and only if:

 1. ∀e+ ∈ E + ∃A ∈ AS(B ∪ H) such that A extends e+
 2. ∀e− ∈ E − ∄A ∈ AS(B ∪ H) such that A extends e−

     The above definition, adopted from [8], subsumes the brave and cautious induction se-
mantics [18]. Positive examples should be extended by at least one stable model of
B ∪ H (brave induction). In contrast, no stable model of B ∪ H extends negative exam-
ples (cautious induction). Generate-and-test problems such as N-queens and graph
coloring can be induced using our XFOLD algorithm. It suffices to use positive ex-
amples for learning the generate part and negative examples for learning the test part.
     Figure 1 shows the input to the XFOLD algorithm for learning an answer set
program for graph coloring. Every positive example states that if a node is colored red, then
that node cannot be painted blue or green, and likewise for blue and green. However, this
is not enough to learn the constraint that two nodes connected by an edge cannot have
the same color. To learn this constraint, negative examples are needed. For instance,
E1− states that if any stable model of B ∪ H contains {red(1)}, then in order not to extend
E1− , it should contain {not red(2)}, or equivalently, it should not contain {red(2)}.
              Fig. 1: Partial interpretations as examples in the graph coloring problem

    Intuitively, XFOLD is similar to FOLD and FOIL: to specialize a clause cl, for every
positive example e ∈ E + , the background knowledge B, all non-target predicates in einc ,
and cl are passed to the ASP solver as inputs. The resulting answer set is compared with
the target predicates in einc and eexc to compute a partial score. Next, by summing up all
partial scores, the total score of the clause is computed. Among all candidate clauses, the
one with the highest total score is selected. Once, for every e ∈ E + , no target predicate in
eexc is covered, the inner loop finishes and the discovered rule(s) are added to the learned
theory. Just as in FOLD, if no literal with a positive score exists, swapping occurs on
each remaining partial interpretation and the XFOLD algorithm is called recursively.
Algorithm 2 The XFOLD Algorithm
Input: target, B, {e = (einc , eexc ) | e ∈ E + }
Output: Hypothesis H
  function SPECIALIZE(cl, B, E + )               ⊲ Other functions remain unchanged as in FOLD
     while ∃e ∈ E + such that eexc ≠ ∅ do
         for each c ∈ ρ (cl) do                                   ⊲ FOIL inner loop (refinement)
             for each ei ∈ E + do
                 compute partial_score[i][c]                        ⊲ partial score for each clause
             end for
             total_score[c] = ∑ei ∈ E + partial_score[i][c]
         end for
         Let c_best be the clause with the highest total score and max_score its score
         if max_score > 0 then
             cl ← c_best
             H ← H ∪ {cl}
         else
             E_swapped+ = Swap(E + )
             XFold(B, E_swapped+ , −target)
         end if
         update E +
     end while
  end function



In this case, instead of introducing abnormality predicates, the negation symbol "-"
is prefixed to the current target predicate to indicate that the algorithm is now trying
to learn the negation of the concept being learned. It should also be noted that swapping
examples is performed slightly differently due to the existence of partial interpretations.
For each e ∈ E + , the following operations are performed upon swapping (a sketch of their
effect on E1 follows the list):
 1. ∀t ∈ einc , where t is an old target atom already covered and removed, t is restored
 2. ∀t ∈ einc , where t is an old target atom, −t is added to eexc
 3. ∀t ∈ eexc , where t is an old target atom, −t is added to einc
 4. T ← −T . (Target predicate T now becomes its negation, -T)
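For instance, applied to E1 of Example 2, a sketch of the swapped partial interpretation
(assuming g(p1) and g(p2) had already been covered and removed) is roughly:
E1swapped = {⟨g(p1), g(p2), o(p1), o(p2), w(p3), o(p4), w(p5), -g(p3), -g(p4), -g(p5)⟩, ⟨-g(p1), -g(p2)⟩}
with the target now being -goesToParty(X).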
Figure 2 shows the execution of XFOLD on Example 2. At the end of the first iteration, the
predicate off(X) gets the highest score. E4 is removed as it is already covered
by the current hypothesis. In the second iteration, all candidate literals fail to get a
positive score. Therefore, swapping of positive and negative examples occurs and the al-
gorithm tries to learn the predicate -goesToParty(X). Since the new target predicate
is -goesToParty(X), all ground atoms of goesToParty in E inc are restored.
The old target atoms in E exc are transformed to their negated versions and become members
of E inc . In Figure 2, after one iteration E4 is removed because all target atoms in E inc
are already covered and the target atoms in E exc are already excluded. After swapping,
XFOLD is recursively called to learn -goesToParty. After two iterations, all examples
are covered and the algorithm terminates.
    In Example 2, we have not introduced any explicit negative examples. Nevertheless,
the algorithm was able to successfully find the cases in which the original target pred-
icate does not hold (via learning the -goesToParty(X) predicate). In general, it is not
always feasible for the algorithm to figure out prohibited patterns without seeing a very
large number of positive examples.

                  Fig. 2: Trace of XFOLD execution on the Party Example


5   Application: Combinatorial Problems

A well-known methodology for declarative problem solving is the generate and test
methodology, whereby possible solutions to a problem are generated first and then
non-solutions are eliminated by testing. In Answer Set Programming, the generate part
is encoded by enumerating the possibilities through even cycles. The test part is realized
by constraints that eliminate answer sets that violate the test conditions. ASP syntax
allows rules of the form l{h1 , ..., hk }u such that 0 ≤ l ≤ u ≤ k and ∀i ∈ [1, k], hi ∈ L,
where L is the language bias. This syntactic sugar for a combination of even cycles and
constraints is called a choice rule in the literature [5].
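For instance, in graph coloring the generate part is commonly written with a choice rule
such as (an illustrative sketch in clingo-style syntax; node/1 is assumed to be defined in
the background knowledge):
     1 { red(X) ; green(X) ; blue(X) } 1 :- node(X).
which states that each node receives exactly one of the three colors; the same effect can be
obtained with even cycles and constraints, which is the form XFOLD learns.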
    ILASP [8] directly searches for choice rules by including them in the search space.
XFOLD, on the other hand, performs the search based on θ -subsumption [14] and hence
disallows search for choice rule hypotheses. Instead, it directly learns even cycles as
well as constraints. This is advantageous as it allows for more sophisticated and flexible
language bias.
    It turns out that inducing the generate part in a combinatorial problem such as graph
coloring requires an extra step compared to the FOLD algorithm. For instance, the red(X)
predicate has the following clause:

                     red(X):- not blue(X), not green(X).
To enable XFOLD to induce such a rule, we adopted the “Matthews Correlation Coeffi-
cient” (MCC) [22] measure to perform the task of feature selection. MCC is calculated
as:
     MCC = (TP × TN − FP × FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    This measure takes into account all the four terms TP (true positive), TN (true neg-
ative), FP (false positive) and FN (false negative) in the confusion matrix and is able
to fairly assess the quality of classification even when the ratio of positive tuples to the
negative tuples is not close to 1. The MCC values range from -1 to +1. A coefficient
of +1 represents a perfect classification, 0 represents a classification that is no better
than a random classifier, and -1 indicates total disagreement between the predicted and
the actual labels. MCC cannot replace the XFOLD heuristic score, i.e., information gain,
because the latter tries to maximize the coverage of positive examples, while the for-
mer only maximally discriminates between positives and negatives. Nevertheless, for
the purpose of feature extraction among negated literals, which are disallowed in the
XFOLD algorithm, MCC can be applied quite effectively. To that end, before running
the XFOLD algorithm, the MCC scores of all candidate literals are computed. If a
predicate scores “close” to +1, the predicate itself is added to the language bias. If it
scores “close” to -1, its negation is added to the language bias. For example, in the case
of learning red(X), after running the feature extraction on the graph given in Figure
1, XFOLD computes the scores -0.7 and -0.5 for green(X) and blue(X), respectively.
Therefore, {not green(X), not blue(X)} are appended to the list of candidate pred-
icates. Now, after two iterations of the inner loop, the XFOLD algorithm produces the
following rule:

                     red(X) :- not green(X), not blue(X).

Corresponding rules for green(X) and blue(X) are learned in a similar manner. This
essentially takes care of the generate part of the combinatorial problem. In order to
learn the test part for graph coloring, we need the negative examples shown in Figure
1. It should be noted that in order to learn a constraint, we first learn a new target predi-
cate which is the negation of the original one. Then we shift the negated predicate from
the head to the body, inverting its sign in the process. That is, we first learn a clause of
the form {-T :- b1 , b2 . . . bn .} which is then transformed into the following con-
straint: {:- b1 , b2 . . . bn , T.} Thus, the following steps are taken to learn
constraints from negative examples:

 1. Add rule(s) induced for generate part to B.
 2. ∀e+ ∈ E + , e− ∈ E − , if e−inc ⊆ e+inc :
      – if e−exc is of the form (not p(V1 , ...,Vm )) then e+inc ← e+inc ∪ {−p(V1 , ...,Vm )}
      – else e+exc ← e+exc ∪ {−p(V1 , ...,Vm )}
 3. compute the contrapositive form of the rule(s) learned in generate part and remove
    the body predicates from the list of candidate predicates
 4. run XFOLD to learn p
 5. shift -p from the head to the body for each rule returned by XFOLD

The contrapositive form of a clause is computed by negating the head and applying
De Morgan's law to the body. The resulting disjunctions are resolved by separating them
into new clauses. For instance, the contrapositive of {red(X) :- not green(X),
not blue(X)} is obtained as follows:
{-red(X) :- green(X)}, {-red(X) :- blue(X)}. Without step 3, XFOLD would
learn these trivial clauses. However, as soon as those trivial choices are removed from the
search space, the XFOLD algorithm comes up with the next best hypothesis, which is as
follows: {-red(X) :- edge(X,Y), red(Y).} Shifting -red(X) to the body yields
the following constraint: :- red(X), edge(X,Y), red(Y). In the graph coloring prob-
lem, Mh = {red(X), green(X), blue(X)}. Once similar examples for green(X)
and blue(X) are provided, XFOLD is able to learn the complete solution as shown
below:
   red(X) :- not green(X), not blue(X).
   green(X) :- not blue(X), not red(X).
   blue(X) :- not green(X), not red(X).
   :- red(X), edge(X,Y), red(Y).
   :- blue(X), edge(X,Y), blue(Y).
   :- green(X), edge(X,Y), green(Y).
Algorithm 3 shows how XFOLD induces a generate and test hypothesis.
Example 3 Learning an answer set program for the 4-queen problem. The following
items are assumed: Background knowledge B including predicates describing a 4 × 4
board, rules describing different ways through which two queens attack each other and
examples of the following form:
B: attack_r(R1 ,C1 ,R2 ,C2 ) :- q(R1 ,C1 ), q(R2 ,C2 ), C1 != C2 , R1 = R2 .
   attack_c(R1 ,C1 ,R2 ,C2 ) :- q(R1 ,C1 ), q(R2 ,C2 ), R1 != R2 , C1 = C2 .
   attack_d(R1 ,C1 ,R2 ,C2 ) :- q(R1 ,C1 ), q(R2 ,C2 ), R1 != R2 , R1 −C1 = R2 −C2 .
   attack_d(R1 ,C1 ,R2 ,C2 ) :- q(R1 ,C1 ), q(R2 ,C2 ), R1 != R2 , R1 +C1 = R2 +C2 .
E: E1+ = {⟨q(2, 1), q(4, 2), q(1, 3), q(3, 4)⟩, ⟨q(1, 1), q(1, 2), ..., q(4, 4)⟩}
   ...
   E1− = {⟨q(2, 1)⟩, ⟨not q(2, 2)⟩}
   E2− = {⟨q(2, 1)⟩, ⟨not q(2, 3)⟩}
   E3− = {⟨q(4, 2)⟩, ⟨not q(1, 2)⟩}
   E4− = {⟨q(4, 2)⟩, ⟨not q(2, 3)⟩}
As far as the generate part is concerned, the XFOLD algorithm learns the following pro-
gram:
                             q(X,Y) :- not -q(X,Y).
                             -q(X,Y) :- not q(X,Y).
The predicate -q(X,Y) is introduced by the XFOLD algorithm as a result of swapping
the examples and calling itself recursively. After computing the contrapositive form,
q(X,Y) and -q(X,Y) are removed from the list of candidate predicates. Then, based on
the examples provided in Example 3, XFOLD learns the following rules:
                      -q(V1 ,V2 ) :- attack_r(V1 ,V2 ,V3 ,V4 ).
                      -q(V1 ,V2 ) :- attack_c(V1 ,V2 ,V3 ,V4 ).
                      -q(V1 ,V2 ) :- attack_d(V1 ,V2 ,V3 ,V4 ).
After shifting the predicate -q(V1 ,V2 ) to the body, we get the following constraints:
                      :- q(V1 ,V2 ), attack_r(V1 ,V2 ,V3 ,V4 ).
                      :- q(V1 ,V2 ), attack_c(V1 ,V2 ,V3 ,V4 ).
                      :- q(V1 ,V2 ), attack_d(V1 ,V2 ,V3 ,V4 ).
It should be noted that, since XFOLD is a sequential covering algorithm like FOIL,
it takes three iterations to cover all examples, which in turn yield the three constraints
shown above.


6   Experiments and Results

Table 1 reports the classification accuracy using 10-fold cross-validation and runtime
measurements of Aleph, FOLD, and XFOLD on a number of UCI datasets [9] and the
combinatorial problems discussed in this paper. In [20] we compared our FOLD algorithm
with Aleph, which is a state-of-the-art ILP system. However, Aleph [21] does not support
ILP with multiple stable models. Therefore, we can only compare our results with those of
ILASP. In the case of the UCI datasets, the “Size” column denotes the number of data
samples, whereas in graph coloring (N-queens) it denotes the number of nodes (board size),
respectively. We have also examined the effect of statistical feature selection on the
performance of our XFOLD algorithm. We report a significant improvement due to the
application of a scalable feature selection method, i.e., xgboost, prior to invoking the
learning algorithm. Exclusion of low-ranked features and the use of negation-as-failure
result in a significant improvement in the accuracy of the learned hypotheses.
     “Extreme Gradient Boosting” (xgboost) [1] is a scalable and powerful ensemble
classifier based on decision trees that provides a feature importance score. Since in
ILP we deal with propositions, it makes sense to first discretize numeric features using
the MDL method [3]. In this method, for each numeric feature, categories are defined such
that the overall information gain is maximized.
     Next, the dataset, which now contains only categorical features, is propositionalized.
That is, every value belonging to the domain of a categorical feature turns into a new binary
feature. This is called one-hot encoding. One-hot encoding makes the feature selection
more fine-grained: instead of measuring the contribution of a feature as a whole, the
importance of every value from the domain of that feature is measured. The data set is
then fed into xgboost, which ranks each binary feature based on its importance for the
classification. From xgboost's output, the M lowest-ranked features are filtered out of the
XFOLD language bias. The optimal M is determined via cross-validation.
     In small problems such as graph coloring, ILASP slightly outperforms our XFOLD
algorithm because the learning algorithm is embedded in the ASP solver engine. On a
larger data set such as Moral Reasoner, with 202 examples and 50 predicates, there are
potentially 3⁵⁰ different hypotheses to choose from, because each predicate can either be
included positively, included negatively, or excluded. In this case, ILASP times out after
a couple of hours.
              Dataset          Size     Accuracy (%)                    Running Time (s)
                                        Aleph  FOLD  XFOLD  ILASP       XFOLD  ILASP
              breast-cancer    286      70     82    88     —           4.1    timed-out
              moral            202      96     96    100    —           4.8    timed-out
              diabetes         768      73     86    89     —           27.2   timed-out
              graph-coloring   4        —      —     100    100         8      4.5
              graph-coloring   8        —      —     100    100         8.9    3.5
              N-queen          4 × 4    —      —     100    100         9.5    5
              N-queen          8 × 8    —      —     100    100         9.9    6.2
          Table 1: XFOLD evaluation on UCI benchmarks and combinatorial problems

7   Related Work

A survey of extending Horn clause based ILP to non-monotonic logics can be found
in [17]. “Stable ILP” [19] was the first effort to explore the expressiveness of back-
ground knowledge with multiple stable models. In [17], Sakama introduces algorithms
to induce a categorical logic program (i.e., an answer set program with at most one
stable model) given the answer set of the background knowledge and either positive
or negative examples. Essentially, given a single answer set,
Sakama tries to induce a program that has that answer set as a stable model. In [18],
Sakama and Inoue extend their work to learn from multiple answer sets. They introduce
brave induction, where the learned hypothesis H is such that some of the answer sets
of B ∪ H cover the positive examples. The limitation of this work is that it accepts only
one positive example as a conjunction of atoms. It does not take into account negative
examples at all. Cautious induction, the counterpart of brave induction, is also too re-
strictive as it can only induce atoms in the intersection of all stable models. Thus, neither
brave induction nor cautious induction is able to express situations where something
should hold in all or none of the stable models. An example of this limitation arises in
the graph coloring problem where the following should hold in all answer sets: no two
neighboring nodes in a graph should be painted the same color.
     ASPAL [2] is the first ILP system to learn answer set programs by encoding ILP
problems as ASP programs and having an ASP solver find the hypothesis. Its successor
ILASP [8] is a pioneering ILP system capable of inducing hypotheses expressed as
answer set programs too. ILASP defines a framework that subsumes brave/cautious in-
duction and allows a much broader class of problems relating to learning answer set pro-
grams to be handled by ILP. However, the algorithm exhaustively searches the space of
possible clauses to find one that is consistent with all examples and background knowl-
edge. This exhaustive search is a weakness that limits the applicability of ILASP in
many useful situations. Our research presented in this paper does not suffer from this
issue.
     XHAIL [16] is another ILP system capable of learning non-monotonic logic pro-
grams. It relies heavily on abductive logic programming to search for hypotheses.
It uses a language bias similar to that of ILASP, and thus suffers from limitations
similar to ILASP's. It also does not support the notion of inducing answer set programs
from partial answer sets.

8   Conclusion and Future Work
In this paper we presented the first heuristic-based algorithm to inductively learn normal
logic programs with multiple stable models. The advantage of this work over similar
ILP systems such as ILASP [8] is that unlike these systems, XFOLD does not per-
form an exhaustive search to discover the “best” hypothesis. XFOLD adopts a greedy
approach, guided by heuristics, that is scalable and noise resilient. We also showed
how our algorithm could be applied to induce declarative logic programs that follow
the generate and test paradigm for finding solutions to combinatorial problems such as
graph-coloring and N-queens.
    There are two main avenues for future work: (i) handling large datasets using meth-
ods similar to QuickFoil [22]. In QuickFoil, all the operations of FOIL are performed
in a database engine. Such an implementation, along with pruning and query optimiza-
tion tricks, can make XFOLD training much faster; (ii) XFOLD learns function-free
answer set programs. We plan to investigate extending the language bias to accommodate
functions.

Acknowledgements
The authors are partially supported by NSF Grant IIS 1718945.
                                 Bibliography


 [1] Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In: SIGKDD
     22. pp. 785–794. KDD ’16, ACM, New York, NY, USA (2016)
 [2] Corapi, D., Russo, A., Lupu, E.: Ilp in answer set programming. In: ILP 21 - 2011,
     UK. pp. 91–97 (2011)
 [3] Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued at-
     tributes for classification learning. In: IJCAI. pp. 1022–1029 (1993)
 [4] Gebser, M., Kaufmann, B., Schaub, T.: Conflict-driven answer set solving: From
     theory to practice. Artificial Intelligence 187-188, 52–89 (2012)
 [5] Gelfond, M., Kahl, Y.: Knowledge Representation, Reasoning, and the Design of
     Intelligent Agents: The ASP Approach. Cambridge University Press (2014)
 [6] Gelfond, M., Lifschitz, V.: The stable model semantics for logic programming. In:
     Logic Programming,Proceedings ICLP,1988 (2 Volumes). pp. 1070–1080 (1988)
 [7] Gupta, G.: A case for query-driven predicate asp. In: ARCADE’17,1st Int. Work-
     shop on Automated Reasoning,Sweden,2017. pp. 64–68 (2017)
 [8] Law, M., Russo, A., Broda, K.: Inductive learning of answer set programs. In:
     Logics in Artificial Intelligence - 14th European Conference, JELIA (2014)
 [9] Lichman, M.: UCI Machine Learning Repository, http://archive.ics.uci.edu/ml (2013)
[10] Marple, K., Salazar, E., Gupta, G.: Computing stable models of normal logic pro-
     grams without grounding (2017), http://arxiv.org/abs/1709.00501
[11] Muggleton, S.: Inductive logic programming. New Generation Comput. 8(4),
     295–318 (1991)
[12] Muggleton, S.: Inverse entailment and progol. New Generation Computing 13(3),
     245–286 (Dec 1995)
[13] Muggleton, S.H., Bryant, C.H.: Theory completion using inverse entailment. In:
     Cussens, J., Frisch, A. (eds.) ILP. pp. 130–146. Springer Berlin Heidelberg (2000)
[14] Plotkin, G.D.: A further note on inductive generalization, in machine intelligence,
     volume 6, pages 101-124 (1971)
[15] Quinlan, J.R.: Learning logical definitions from relations. Machine Learning 5,
     239–266 (1990)
[16] Ray, O.: Nonmonotonic abductive inductive learning. Journal of Applied Logic
     7(3), 329 – 340 (2009), special Issue: Abduction and Induction in AI
[17] Sakama, C.: Induction from answer sets in nonmonotonic logic programs. ACM
     Trans. Comput. Log. 6(2), 203–231 (2005)
[18] Sakama, C., Inoue, K.: Brave induction: a logical framework for learning from
     incomplete information. Machine Learning 76(1), 3–35 (2009)
[19] Seitzer, J.: Stable ilp : Exploring the added expressivity of negation in the back-
     ground knowledge. In: IJCAI-97 Workshop on Frontiers of ILP (1997)
[20] Shakerin, F., Salazar, E., Gupta, G.: A new algorithm to automate inductive learn-
     ing of default theories. TPLP 17(5-6), 1010–1026 (2017)
[21] Srinivasan, A.: The Aleph Manual (2001),
     http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/
[22] Zeng, Q., Patel, J.M., Page, D.: Quickfoil: Scalable inductive logic programming.
     Proc. VLDB Endow. 8(3), 197–208 (Nov 2014)