=Paper=
{{Paper
|id=Vol-2970/aspocppaper1
|storemode=property
|title=Gradient-Based Supported Model Computation in Vector Spaces
|pdfUrl=https://ceur-ws.org/Vol-2970/aspocppaper1.pdf
|volume=Vol-2970
|authors=Akihiro Takemura,Katsumi Inoue
|dblpUrl=https://dblp.org/rec/conf/iclp/TakemuraI21
}}
==Gradient-Based Supported Model Computation in Vector Spaces==
Gradient-Based Supported Model Computation in Vector Spaces

Akihiro Takemura (1,2), Katsumi Inoue (2,1)
(1) The Graduate University for Advanced Studies, SOKENDAI, Japan
(2) National Institute of Informatics, Japan
Email: atakemura@nii.ac.jp (A. Takemura); inoue@nii.ac.jp (K. Inoue). ORCID: 0000-0003-4130-8311 (A. Takemura); 0000-0002-2717-9122 (K. Inoue).
14th Workshop on Answer Set Programming and Other Computing Paradigms

Abstract. We propose an alternative method for computing supported models of normal logic programs in vector spaces using gradient information. To this end, we first translate the program into a definite program and embed the program into a matrix. We then define a loss function based on the implementation of the immediate consequence operator T_P by matrix-vector multiplication with a suitable thresholding function. We propose an almost everywhere differentiable alternative to the non-linear thresholding operation, and incorporate penalty terms into the loss function to avoid undesirable results. We report the results of several experiments where our method shows improvements over an existing method in terms of success rates of finding correct supported models of normal logic programs.

Keywords: Logic Programming, Supported Model Computation, Differentiable Logic Programming

1. Introduction

Performing logical inference with linear algebraic methods has been studied as an attractive alternative to traditional symbolic inference methods [1, 2, 3]. Linear algebraic approaches to logical inference offer promising characteristics compared to the traditional methods, for example efficiency and scalability: since linear algebra is at the heart of many scientific applications, highly efficient, optimized algorithms already exist for the basic operations, and these new methods can benefit from the high computational power offered by modern systems in the form of multi-core CPUs and GPUs. In this work, we present a gradient-based method for computing supported models in vector spaces.

Sakama et al. introduced matrix representations of definite, disjunctive and normal logic programs, and a linear algebraic method for computing least models and stable models [2]. More recently, Nguyen et al. proposed an improved version of the algorithm that takes advantage of sparsity [4]. Sato et al. proposed a method for computing supported models in vector spaces via 3-valued completion models of a normal logic program [5]. The matrix encoding used in this work is influenced by the aforementioned works, and we extend them by proposing an encoding that is better suited to our goal of computing supported models as fixed points in vector spaces. Another difference is that while our method uses gradient information to find supported models, the previous works use non-differentiable operations. Gradient-based methods have been considered, for example, for abduction [6], and for SAT and probabilistic inference [7]. Our design of the loss function follows the direction of the previous works by Sato [8, 7], and we introduce additional terms for computing supported models.
Although one could compute supported models by obtaining the completion of the program and solving it with a SAT solver, e.g. using the matrix-based SAT solver in [7], our method requires neither completion nor a SAT solver. Blair et al. introduced an extension of 2-valued semantics in vector spaces using a polynomial representation of normal programs, and used gradient descent to solve the dynamical system whose solutions correspond to the supported models [9]. In contrast, we employ a matrix representation of the program, a thresholding function, and gradient descent for minimizing a loss function. The closest work to ours is the method of Aspis et al. [10], as both their algorithm and ours use a gradient-based method to find supported models of normal logic programs in vector spaces. Their method uses a matrix representation of the program reduct and Newton's method for root finding to find fixed points. Some differences are: (i) while our method uses a normal-to-definite program translation and a matrix representation of the program, their method incorporates the notion of reduct into the program matrix; (ii) our loss function includes a thresholding term and additional regularization terms for dealing with fractional assignments and facts, whereas in their method the convergence depends solely on the fixed point criterion. The use of regularization to penalize fractional-valued interpretations is critical to avoid converging to unwanted middle points. Our goal in this paper is to improve upon Aspis et al.'s method by extending Sato et al.'s earlier work on using loss functions for logical inference and incorporating new terms for supported model computation.

The structure of this paper is as follows: Section 2 covers the necessary background and definitions. Section 3 presents a method for representing logic programs with matrices. Section 4 introduces the thresholding function and the loss function, as well as the gradient-based search algorithm for finding supported models. Section 5 presents the results of experiments designed to test the ability of the algorithm and to compare against known results in the literature. Finally, Section 6 presents the conclusion.

2. Background

We consider a language L that contains a finite set of propositional variables defined over a finite alphabet and the logical connectives ¬, ∧, ∨ and ←. The Herbrand base B_P is the set of all propositional variables that appear in a logic program P. A definite program is a set of rules of the form (1) or (2), where h and the b_i are propositional variables (atoms) in L. We refer to (2) as an OR-rule, which is a shorthand for m rules: h ← b_1, h ← b_2, ..., h ← b_m. For each rule r we define head(r) = h and body(r) = {b_1, ..., b_m}. A rule r is referred to as a fact if body(r) = ∅.

    h ← b_1 ∧ b_2 ∧ ... ∧ b_m   (m ≥ 0)   (1)
    h ← b_1 ∨ b_2 ∨ ... ∨ b_m   (m ≥ 0)   (2)

A normal program is a set of rules of the form (3), where h and the b_i are propositional variables in L.

    h ← b_1 ∧ b_2 ∧ ... ∧ b_m ∧ ¬b_{m+1} ∧ ¬b_{m+2} ∧ ... ∧ ¬b_n   (n ≥ m ≥ 0)   (3)

We refer to the positive and negative occurrences of atoms in the body as body^+(r) = {b_1, ..., b_m} and body^-(r) = {b_{m+1}, ..., b_n}, respectively. A normal program is a definite program if body^-(r) = ∅ for every rule r ∈ P. An Herbrand interpretation I of a normal program P is a subset of B_P.
A model M of P is an interpretation of P such that for every rule r ∈ P of the form (3), body^+(r) ⊆ M and body^-(r) ∩ M = ∅ imply h ∈ M. A program is called consistent if it has a model. A supported model M is a model of P such that for every p ∈ M there exists a rule r ∈ P with p = head(r), body^+(r) ⊆ M and body^-(r) ∩ M = ∅ [11, 12].

As we shall show later, in this paper we transform normal logic programs into definite programs in order to search for supported models. Thus we use the following definition of the immediate consequence operator T_P. T_P: 2^{B_P} → 2^{B_P} is a function on Herbrand interpretations. For a definite program P, we have:

    T_P(I) = {h | h ← b_1 ∧ ... ∧ b_m ∈ P and {b_1, ..., b_m} ⊆ I} ∪ {h | h ← b_1 ∨ ... ∨ b_m ∈ P and {b_1, ..., b_m} ∩ I ≠ ∅}.

It is known that a supported model M of a program P is a fixed point of T_P, i.e., T_P(M) = M [11].

Definition 1 (Singly-Defined (SD) program). A normal program P is called an SD program if head(r_1) ≠ head(r_2) for any two rules r_1 and r_2 in P with r_1 ≠ r_2.

Any normal program P can be converted into an SD program P' in the following manner. If there is more than one rule with the same head (h ← body(r_1), ..., h ← body(r_k), where k > 1), then replace them with a set of new rules {h ← d_1 ∨ ... ∨ d_k, d_1 ← body(r_1), ..., d_k ← body(r_k)} containing new atoms {d_1, ..., d_k}. This is a stricter condition than the multiple definitions condition (MD condition) [2]: for any two rules r_1 and r_2 in the program, head(r_1) = head(r_2) implies |body(r_1)| ≤ 1 and |body(r_2)| ≤ 1. Consequently, all SD programs satisfy the MD condition. Unless noted otherwise, we assume all programs in this paper are SD programs.

Given a normal program P, it is transformed into a definite program by replacing the negated literals in rules of the form (3) and rewriting:

    h ← b_1 ∧ b_2 ∧ ... ∧ b_m ∧ b̄_{m+1} ∧ b̄_{m+2} ∧ ... ∧ b̄_n   (n ≥ m ≥ 0)   (4)

where the b̄_j are new atoms associated with the negated b_j. A collection of rules of the form (4) is referred to as the positive form P^+, where B_{P^+} = B_P ∪ {ā | a ∈ B_P}. For transformed rules of the form (4), we refer to {b_1, ..., b_m} as the positive part and {b̄_{m+1}, ..., b̄_n} as the negative part. After transformation, the program contains only rules of the forms (1), (2) or (4). By an interpretation I^+ of P^+, we mean any set of atoms I^+ ⊆ B_{P^+} that satisfies the following condition: for any atom a ∈ B_P, precisely one of a and ā belongs to I^+.
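As an illustration of this translation, the following is a minimal Python sketch (our own addition, not the paper's implementation) that turns a normal program, given as (head, positive body, negative body) triples, into its positive form by introducing a fresh atom for every negated literal; the naming convention "neg_" for the new atoms is an assumption of this sketch.

    def positive_form(program):
        """Translate a normal (SD) program into its positive form P+.

        `program` is a list of rules (head, pos_body, neg_body), where
        pos_body and neg_body are lists of atom names.  Every negated
        literal `not b` is replaced by a fresh atom 'neg_b' (rule form (4)).
        """
        translated = []
        for head, pos_body, neg_body in program:
            new_body = list(pos_body) + ['neg_' + b for b in neg_body]
            translated.append((head, new_body))
        return translated

    # The program used later in Example 1: p <- q,  q <- p & r,  r <- not p
    program = [('p', ['q'], []), ('q', ['p', 'r'], []), ('r', [], ['p'])]
    print(positive_form(program))
    # [('p', ['q']), ('q', ['p', 'r']), ('r', ['neg_p'])]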
3. Representing Logic Programs with Matrices

In this section, we first describe the relationship between the positive forms of normal logic programs and their supported models. Then we describe the matrix encoding of the positive forms, which in turn represents the original normal logic programs in vector spaces.

3.1. Relationship between Positive Forms and Supported Models

Consider the (inconsistent) program p ← ¬p and its positive form p ← p̄. P^+ is a definite program, but in this case it has no models, due to the restriction we place on interpretations: if p ∈ I^+ then p̄ ∉ I^+ and vice versa. The implication is then that there are no fixed points of T_{P^+} for P^+ that satisfy the condition p ∈ I^+ iff p̄ ∉ I^+. On the other hand, when a model M of P exists, we can show that the corresponding M^+ is a model of P^+.

Proposition 1. Let P be a normal program, and let P^+ be its positive form. If M is a model of P, then M' = M ∪ {ā | a ∈ B_P − M} is a model of P^+. Conversely, if M^+ is a model of P^+, then M^+ ∩ B_P is a model of P.

Proof. Follows from the definitions of M' and M^+. Consider M'. Since ā ∉ M' if a ∈ M' and vice versa, for each rule r ∈ P^+, body(r) ⊆ M' implies head(r) = p ∈ M'. Thus, M' is a model of P^+. Now consider M^+. Let K = M^+ ∩ B_P. Given that M^+ is a model of P^+ and a ∈ K iff a ∈ M^+ for a ∈ B_P, for each rule r ∈ P, body^+(r) ⊆ K and body^-(r) ∩ K = ∅ imply head(r) = p ∈ K. Thus, K is a model of P.

Proposition 2. Let M be a supported model of P, and put M' = M ∪ {ā | a ∈ B_P − M}. Then T_{P^+}(M') = M.

Proof. Suppose p ∈ M. Since M is a supported model, there exists a rule r ∈ P such that head(r) = p, body^+(r) ⊆ M and body^-(r) ∩ M = ∅. Correspondingly, there exists a rule r' ∈ P^+ such that head(r') = p, body^+(r') ⊆ M' and body^-(r') ⊆ M'. That is, p ∈ T_{P^+}(M'). Hence, M ⊆ T_{P^+}(M'). Conversely, suppose p ∈ T_{P^+}(M'). Then there exists a rule r' ∈ P^+ such that head(r') = p and body(r') ⊆ M'. Since M' is a model of P^+ by Proposition 1, body(r') ⊆ M' implies head(r') = p ∈ M'. Because p is a positive literal, p ∈ M holds. Hence, T_{P^+}(M') ⊆ M. Therefore, M = T_{P^+}(M').

Proposition 3. Let M^+ be an interpretation of P^+. If T_{P^+}(M^+) = M^+ ∩ B_P holds, then M = M^+ ∩ B_P is a supported model of P.

Proof. Suppose T_{P^+}(M^+) = M^+ ∩ B_P. Because M^+ ∩ B_P recovers the positive literals from M^+, for each p ∈ (M^+ ∩ B_P) there exists a rule r ∈ P such that head(r) = p, body^+(r) ⊆ (M^+ ∩ B_P) and body^-(r) ∩ (M^+ ∩ B_P) = ∅. Thus, M = M^+ ∩ B_P is a supported model of P.

3.2. Matrix Encoding of Logic Programs

We first introduce the matrix and vector notation used in this paper, before showing our method for encoding normal programs into matrices. Matrices and vectors are represented as bold upper-case letters, e.g. M, and lower-case letters, e.g. v, respectively. A 1-vector of length N is represented by 1_N. The indices of the entries of matrices and vectors appear in the subscript; for example, M_ij refers to the element in the i-th row and j-th column of a matrix M, and v_i refers to the i-th element of a column vector v. Let M_{i:} and M_{:j} denote the i-th row slice and the j-th column slice of M, respectively.
We denote the horizontal concatenation of matrices M_1 and M_2 as [M_1 M_2], and the vertical concatenation of column vectors v_1 and v_2 as [v_1; v_2]. Some matrices and vectors in this paper carry a superscript to distinguish them from others, e.g. M^p; this should not be confused with the indices appearing in the subscript: for example, M^p_ij refers to the (i, j)-th entry of M^p.

Let P be a normal program with size |B_P| = N, P^+ its positive form and B_{P^+} the Herbrand base of P^+. Then |B_{P^+}| = 2N, since for every a ∈ B_P we add its negated version ā. To encode P^+ into a program matrix Q, we encode the atoms appearing in the bodies of the rules with one-hot encoding. After performing this encoding for all rules in P^+, we obtain an (N × 2N) binary program matrix Q, which is equivalent to a vertical concatenation of row vectors, each encoding a rule r ∈ P^+ with one-hot encoding.

Definition 2 (Program Matrix). Let P be a normal program with |B_P| = N and P^+ its positive form with |B_{P^+}| = 2N. Then P^+ is represented by a matrix Q ∈ Z^{N×2N} such that for each element Q_ij (1 ≤ i ≤ N, 1 ≤ j ≤ 2N) in Q,
- Q_ij = 1 if atom a_j ∈ B_{P^+} (1 ≤ j ≤ 2N) appears in the body of the rule r_i (1 ≤ i ≤ N);
- Q_ij = 0, otherwise.

The i-th row of Q corresponds to the atom a_i appearing in the head of the rule r_i, and the j-th column corresponds to the atom a_j (1 ≤ j ≤ 2N) appearing in the bodies of the rules r_i (1 ≤ i ≤ N). Atoms that do not appear in the head of any rule in P^+ are encoded as zero-only row vectors in Q. This definition differs from [2] and [10] in that we do not explicitly include ⊤ and ⊥ in the program matrix, and we do not use fractional values to encode long rules. In fact, our encoding is similar to that of [5], except that we do not use a (2N × 2N) program matrix, since we do not encode rules with ā ∈ B_{P^+} in the head. (In [5], the program matrix encodes the completion of P^+.)

Definition 3 (Interpretation Vector). Let P be a definite program and B_P = {a_1, ..., a_N}. Then an interpretation I ⊆ B_P is represented by a vector v = (v_1, ..., v_N)^⊤ ∈ Z^N, where each element v_i (1 ≤ i ≤ N) represents the truth value of the proposition a_i such that v_i = 1 if a_i ∈ I, and v_i = 0 otherwise. We assume propositional variables share a common index, so that v_i corresponds to a_i, and we write var(v_i) = a_i.

Recall that the positive form P^+ of a normal program is a definite program, in which all negated literals in the body are replaced by new atoms, e.g. in (4) ¬b_{m+1} is replaced by b̄_{m+1}. We now extend the definition of interpretation vectors to include the relation between the positive and negative occurrences of atoms, in order to maintain that whenever a ∈ I, ā ∉ I and vice versa.

Definition 4 (Companion Vector). Let B^p_{P^+} ⊆ B_{P^+} denote the positive part of P and B^n_{P^+} ⊆ B_{P^+} the negative part of P, with sizes |B^p_{P^+}| = |B^n_{P^+}| = N. Let v^p ∈ Z^N be a vector representing the truth assignments for a_i ∈ B^p_{P^+}, such that v^p_i = 1 if a_i ∈ I and v^p_i = 0 otherwise. Define the companion vector w ∈ Z^{2N}, representing an interpretation I^+ ⊆ B_{P^+}, as follows: w = [v^p; 1_N − v^p].

4. Gradient Descent for Computing Supported Models

In this section, we first show how the evaluation of the T_P operator can be carried out in vector spaces. Then we introduce our loss function and its gradient, and finally we present an algorithm for finding supported models in vector spaces.

4.1. Computing the T_P Operator in Vector Spaces

Sakama et al. [2] showed that the T_P operator can be computed in vector spaces using a thresholding function, which they called θ-thresholding. Here we introduce a variant of θ-thresholding that accommodates the new program encoding method as well as the differentiability requirement. First we initialize the parameter vector t.

Definition 5 (Parameter Vector t). The set of parameters to the θ-thresholding is represented by a column vector t ∈ Z^N such that for each element t_i (1 ≤ i ≤ N) in t,
- t_i = |body(r_i)| if the rule r_i ∈ P^+ is a conjunctive rule, e.g. (1), (4);
- t_i = 1 if the rule r_i ∈ P^+ is a disjunctive rule, e.g. (2);
- t_i = 0, otherwise.
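To make Definitions 2-5 concrete, here is a small NumPy sketch (an illustration we add here, not the paper's own code) that encodes the three-rule program p ← q, q ← p ∧ r, r ← ¬p used later in Example 1; the column order p, q, r, p̄, q̄, r̄ is an assumption of this sketch.

    import numpy as np

    # Atom order: columns are [p, q, r, p_bar, q_bar, r_bar]; rows are heads p, q, r.
    atoms = ['p', 'q', 'r', 'p_bar', 'q_bar', 'r_bar']

    # Program matrix Q (Definition 2): one row per rule head, one-hot body encoding.
    Q = np.array([
        [0, 1, 0, 0, 0, 0],   # p <- q
        [1, 0, 1, 0, 0, 0],   # q <- p & r
        [0, 0, 0, 1, 0, 0],   # r <- p_bar   (i.e. r <- not p)
    ])

    # Parameter vector t (Definition 5): |body| for conjunctive rules, 1 for OR-rules.
    t = np.array([1, 2, 1])

    # Interpretation vector for {r} (Definition 3) and its companion vector (Definition 4).
    v = np.array([0, 0, 1])
    w = np.concatenate([v, 1 - v])   # [0, 0, 1, 1, 1, 0]
    print(Q @ w)                     # satisfied-body counts per rule: [0, 1, 1]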
In some previous works [2, 13], information about the nature of the rules was also stored in the program matrix Q alongside the atom occurrences: conjunctive rules with |body(r_i)| > 1 had fractional values Q_ij = 1/|body(r_i)|, and disjunctive bodies had integer values Q_ij = 1. Here we follow the approach taken by Sato et al. [5] and only store the occurrences of atoms in P^+ in Q, keeping the supplementary information needed to recover the original program in the parameter vector t.

Definition 6 (Parameterized θ-thresholding). Let w ∈ Z^{2N} be a companion vector representing I^+ ⊆ B_{P^+}. Given a parameter vector t ∈ Z^N, a program matrix Q ∈ Z^{N×2N}, and the vector y = Qw with y ∈ Z^N, we apply the thresholding function element-wise as follows:

    θ_t(y_i) = min(max(0, y_i − (t_i − 1)), 1)   if t_i ≥ 1,
    θ_t(y_i) = 0                                 if t_i < 1.     (5)

This thresholding function resembles "hardtanh", an activation function developed for use in natural language processing [14]. In the original hardtanh function, the range of the linear region is [−1, 1] (Figure 1(a)), whereas here we define the linear region between [t_i − 1, t_i] (Figure 1(b)). This function is almost everywhere differentiable, except at y_i = t_i − 1 and y_i = t_i. The special case t_i < 1 in Equation (5) corresponds to the case t_i = 0, where the head does not appear in the program P^+ and is assumed to be false.

Intuitively, for the head of a conjunctive rule to be true it is sufficient to check whether all literals in the body hold; otherwise the rule evaluates to false. Similarly, for a disjunctive rule it is sufficient to check whether at least one of the literals in the body holds for the head to hold. This is formalized in Proposition 4.

Figure 1: (a) θ(y) where θ = hardtanh. (b) θ(y) where θ = parameterized θ-thresholding with t = 1.
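A direct NumPy rendering of Equation (5) might look as follows; this is our own minimal sketch, vectorized over the rule index i, not the authors' released code.

    import numpy as np

    def theta(y, t):
        """Parameterized theta-thresholding (Equation (5)), applied element-wise.

        y : result of Q @ w, one entry per rule head.
        t : parameter vector from Definition 5 (|body| for AND-rules, 1 for OR-rules, 0 if unused).
        """
        out = np.clip(y - (t - 1), 0.0, 1.0)   # linear region [t-1, t], clipped to [0, 1]
        return np.where(t >= 1, out, 0.0)      # heads that never occur are fixed to false

    # Continuing the encoding sketch above, one thresholded T_P step would be:
    # u = theta(Q @ w, t)   ->  array([0., 0., 1.])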
Proposition 4 (Thresholded T_P Operator). Let P^+ be the positive form of a normal program P and Q ∈ Z^{N×2N} its matrix representation. Let v ∈ Z^N be a vector representing the positive subset I^p ⊆ B^p_{P^+} of an interpretation, i.e. a_i ∈ I^p iff v_i = 1 for the atoms {a_1, ..., a_N}. Let w ∈ Z^{2N} be the companion vector to v. Then z = [u; 1_N − u] ∈ Z^{2N} is a vector representing J = T_P(I) satisfying the condition (a ∈ J iff ā ∉ J), iff u = θ_t(Qw).

Proof. Consider u = θ_t(Qw). For u = (u_1, ..., u_N)^⊤, by the definition of the thresholding function, u_i = 1 (1 ≤ i ≤ N) iff u'_i ≥ t_i in u' = Qw. Take the row slice Q_{i:}; then u'_i = Q_{i:}w = Q_{i1}w_1 + ... + Q_{i,2N}w_{2N}, and u_i = 1 iff u'_i ≥ t_i. Both Q_{i:} and w are 0-1 vectors, so it follows that there are at least t_i indices j with Q_{ij} = w_j = 1 (1 ≤ j ≤ 2N). The first N elements of w represent a_j ∈ I^p ⊆ B^p_{P^+} if w_j = 1, and if a_j ∈ I^p then ā_j ∉ I^n ⊆ B^n_{P^+}, which is maintained through the definition of the companion vector w. 1) For a conjunctive rule a_i ← b_1 ∧ ... ∧ b_l, {b_1, ..., b_l} ⊆ I implies a_i ∈ T_P(I). 2) For an OR-rule a_i ← b_1 ∨ ... ∨ b_l, {b_1, ..., b_l} ∩ I ≠ ∅ implies a_i ∈ T_P(I). a_j ∈ T_P(I) is represented by z_j = 1 (1 ≤ j ≤ 2N). Then by putting J = {var(z_j) | z_j = 1}, J = T_P(I) holds.

Consider J = T_P(I). For v = (v_1, ..., v_N)^⊤ representing I^p ⊆ B^p_{P^+}, w = (v_1, ..., v_N, 1 − v_1, ..., 1 − v_N)^⊤ is a vector representing I ⊆ B_{P^+} if we set I = {var(w_j) | w_j = 1}. u' = Qw is a vector such that u'_i ≥ t_i (1 ≤ i ≤ N) iff var(u'_i) ∈ T_P(I). Define u = (u_1, ..., u_N)^⊤ such that u_i = 1 (1 ≤ i ≤ N) iff u'_i ≥ t_i in Qw, and u_i = 0 otherwise. Define an interpretation J ⊆ B_{P^+} such that it can be partitioned into subsets of positive and negative occurrences of atoms, (J^p ∪ J^n) = J ⊆ B_{P^+}. Since only positive atoms occur in the heads, u represents the positive subset of the interpretation, J^p ⊆ J ⊆ B_{P^+}, by setting J^p = {var(u_i) | u_i = 1} (1 ≤ i ≤ N). If a_i ∈ T_P(I) then u_i = 1, and ā_i ∉ J is represented by 1 − u_i = 0. Conversely, if a_i ∉ T_P(I) then u_i = 0, and 1 − u_i = 1 represents ā_i ∈ J. Thus 1 − u represents J^n ⊆ J ⊆ B_{P^+}. z = [u; 1 − u] is then a vector representing J^p ∪ J^n = J if we set J = {var(z_j) | z_j = 1} (1 ≤ j ≤ 2N). Thus z = [u; 1 − u] represents J = T_P(I) if u = θ_t(Qw).

Example 1. Consider the following program:

    p ← q
    q ← p ∧ r
    r ← ¬p        (6)

This program has one supported (stable) model: {r}. We have B_P = {p, q, r} and B_{P^+} = {p, q, r, p̄, q̄, r̄}, and the matrix representation Q and parameter vector t are:

          p  q  r  p̄  q̄  r̄
    p   ( 0  1  0  0  0  0 )
    Q = q ( 1  0  1  0  0  0 )        t = (1, 2, 1)^⊤     (7)
    r   ( 0  0  0  1  0  0 )

Suppose the assignment v_{{r}} = (0 0 1)^⊤ is given. Then the companion vector w is:

    w = [v_{{r}}; 1_3 − v_{{r}}] = (0 0 1 1 1 0)^⊤     (8)

Compute the matrix-vector product Qw and apply the parameterized thresholding:

    u = θ_t(Qw) = θ_t((0 1 1)^⊤) = (0 0 1)^⊤ = v_{{r}}     (9)

Let z be the companion vector to u, i.e. z = [u; 1 − u]; then we have

    z = (0 0 1 1 1 0)^⊤     (10)

Define J = {var(z_j) | z_j = 1}; then we have J = {r, p̄, q̄}, and J ∩ B_P = {r}.

Proposition 5 (Supported Model Computation with Thresholded T_P). Let v ∈ Z^N be a 0-1 vector representing a subset I^p ⊆ I ⊆ B_{P^+} of an interpretation, and let z = [v; 1_N − v] be its companion vector representing I ⊆ B_{P^+} satisfying (a ∈ I iff ā ∉ I). Given a program matrix Q representing a program P^+ and a thresholding function θ_t parameterized by a vector t, the fixed points of P^+ are represented by 0-1 vectors z^{FP} = [v^{FP}; 1_N − v^{FP}] ∈ Z^{2N} where v^{FP} = θ_t(Q z^{FP}). Then the z^{FP} are vectors representing models M^+ of P^+ satisfying (a ∈ M^+ iff ā ∉ M^+) iff θ_t(Q z^{FP}) = v^{FP}. When such a 0-1 vector z^{FP} exists, M^+ ∩ B_P = M is a supported model of P.

Proof. Let I ⊆ B_{P^+} be a model of P^+, represented by z^{FP}. Consider the two cases (i) T_P(I) = I and (ii) v^{FP} = θ_t(Q z^{FP}). In both cases, by Propositions 2, 3 and 4, if a supported model of P exists, the result holds.

Example 2. Consider Program (6) in Example 1 and the interpretation vector z (10) representing an interpretation J. We show that J = {r, p̄, q̄} is indeed a model of the positive form and corresponds to the supported model of P. Construct the vector v_{{r}} and companion vector w as in (8). By comparing (8) and (10), we see that w = z, and this is consistent with the fixed-point definition J = T_P(J). J ∩ B_P = {r}, which is the supported model of P.
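The fixed-point check of Examples 1 and 2 can be reproduced with a few lines of NumPy; this is an illustrative sketch we add, re-using the Q, t and theta of the earlier sketches, not code from the paper.

    import numpy as np

    # Q and t as in Example 1 (columns ordered p, q, r, p_bar, q_bar, r_bar).
    Q = np.array([[0, 1, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [0, 0, 0, 1, 0, 0]])
    t = np.array([1, 2, 1])

    def theta(y, t):
        return np.where(t >= 1, np.clip(y - (t - 1), 0.0, 1.0), 0.0)

    v = np.array([0, 0, 1])            # candidate assignment {r}
    w = np.concatenate([v, 1 - v])     # companion vector, Equation (8)
    u = theta(Q @ w, t)                # thresholded T_P step, Equation (9)
    print(u)                           # [0. 0. 1.]
    print(np.array_equal(u, v))        # True: {r} is a fixed point, hence a supported model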
4.2. Loss Function for Computing Supported Models

By the fixed-point definition of supported models, a supported model M satisfies v^M = θ_t(Q[v^M; 1_N − v^M]). We now use this definition to design a loss function which can be minimized using gradient descent. Gradient descent is a method for minimizing an objective function (loss function) by updating the parameters in the opposite direction of the gradient of the objective function with respect to the parameters. The size of the update is determined by the gradient and the step size α. Intuitively, this algorithm follows the direction of the slope of the loss surface until it reaches the bottom of a valley (a local minimum).

Let u be an (N × 1) vector representing an assignment to the heads of the rules in P^+. Define w = [u; 1_N − u] as the (2N × 1) interpretation vector that stores assignments to each atom a, ā ∈ B_{P^+}. We define an (N × 1) vector f which stores information about the occurrences of facts in the program P^+. This vector will be used later during the minimization step to ensure that facts are not forgotten.

Definition 7 (Fact Vector f). The set of facts in the program P^+ is represented by a column vector f ∈ Z^N such that for each element f_i (1 ≤ i ≤ N) in f,
- f_i = 1 if the rule r_i is a fact: a_i ←;
- f_i = 0, otherwise.

Definition 8 (Loss Function). Given a program matrix Q, a candidate vector x, a thresholding function θ_t, and constants λ_1 and λ_2, define the loss function as follows:

    L(x) = 1/2 ( ‖θ_t(Q[x; 1_N − x]) − x‖²_F + λ_1 ‖x ⊙ (x − 1_N)‖²_F + λ_2 ‖f − (x ⊙ f)‖²_F )     (11)

where ‖·‖_F denotes the Frobenius norm and ⊙ the element-wise product. The first term is derived directly from the fixed-point definition of supported models, and is 0 if x is a supported model of P^+. The second term, which resembles a regularization term often used in the machine learning literature, is added to penalize candidate vectors x that contain fractional values, and is 0 if and only if x is a 0-1 vector. The third term is 0 if and only if the facts are preserved, and is strictly positive if any part of that assignment is lost, i.e. by assigning 0 (false) to a fact where f_i = 1.

We introduce submatrices Q^p and Q^n of Q that correspond to the positive and negative bodies, respectively, such that Q = [Q^p Q^n] (horizontal concatenation of submatrices). Both Q^p and Q^n have shape (N × N).

Definition 9 (Gradient of the Loss Function). The gradient of the loss function with respect to x is given by:

    ∂L(x)/∂x = (Q^p − Q^n)^⊤ · ( ∂θ_t(Q z_x)/∂x ⊙ (θ_t(Q z_x) − x) ) − (θ_t(Q z_x) − x)
               + λ_1 (x ⊙ (1_N − x) ⊙ (1_N − 2x)) + λ_2 (x ⊙ f − f)     (12)

where z_x = [x; 1_N − x] ∈ R^{2N} and

    ∂θ_t(w_i)/∂x_i = 1 if (t_i ≥ 1) and (t_i − 1) ≤ w_i ≤ t_i, and 0 otherwise.     (13)

We can update x iteratively using, for example, gradient descent or a quasi-Newton method, to reduce the loss to 0. Here we show an example update rule for gradient descent. Let α be the step size; then the gradient descent update rule is given by:

    x_new ← x − α ∂L(x)/∂x     (14)

Using this update rule we can design an algorithm to find supported models, as shown in Algorithm 1. The convergence characteristics of the gradient descent algorithm are well known [15]. Assuming at least one 0-1 vector representing a supported model exists for Q, all we require for Algorithm 1 to converge to the supported model is that the initial vector x_init lies in the region surrounding the supported model where the slope points towards the model. When there are multiple supported models (correspondingly, multiple local minima), we expect the algorithm to converge to different models depending on the choice of the initial vector x_init. However, it is often not known a priori which particular values or regions of x_init lead to which models. Therefore we implement a uniform sampling strategy, where we sample uniformly from [0, 1], and make no assumptions regarding the existence of supported models for Q. Depending on the program, an optimal 0-1 vector interpretation may not exist, so we limit the number of iterations to max_iter before assuming non-convergence.
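The following NumPy sketch shows one way Definitions 8 and 9 and update (14) might be implemented; it is our own illustration under the assumptions of the earlier sketches (theta as above, lam1/lam2 standing for λ_1/λ_2), not the authors' released code.

    import numpy as np

    def theta(y, t):
        return np.where(t >= 1, np.clip(y - (t - 1), 0.0, 1.0), 0.0)

    def theta_grad(y, t):
        # Equation (13): 1 inside the linear region [t-1, t] (when t >= 1), else 0.
        return np.where((t >= 1) & (y >= t - 1) & (y <= t), 1.0, 0.0)

    def loss(x, Q, t, f, lam1=1.0, lam2=1.0):
        # Equation (11): fixed-point term + 0-1 penalty + fact-preservation penalty.
        z = np.concatenate([x, 1 - x])
        fp = theta(Q @ z, t) - x
        return 0.5 * (fp @ fp
                      + lam1 * np.sum((x * (x - 1)) ** 2)
                      + lam2 * np.sum((f - x * f) ** 2))

    def grad(x, Q, t, f, lam1=1.0, lam2=1.0):
        # Equation (12), with Q = [Qp Qn] split into positive/negative body parts.
        n = x.shape[0]
        Qp, Qn = Q[:, :n], Q[:, n:]
        z = np.concatenate([x, 1 - x])
        y = Q @ z
        fp = theta(y, t) - x
        return ((Qp - Qn).T @ (theta_grad(y, t) * fp) - fp
                + lam1 * x * (1 - x) * (1 - 2 * x)
                + lam2 * (x * f - f))

    # One gradient-descent step (Equation (14)) with step size alpha:
    # x = x - alpha * grad(x, Q, t, f)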
With gradient descent it is often time consuming to reduce the loss function completely to zero. We therefore implement a "peeking at a solution" heuristic, similar to the one presented in [7]: while updating x, we round x to a 0-1 vector and check whether the resulting x_r is a solution (Lines 7-11 of Algorithm 1). The output is sensitive to the choice of the initial vector x_init, and a poor choice of x_init may result in non-convergence to optimal solutions. We alleviate this dependency on the initial vector by introducing the max_retry parameter and changing the initial vector on each try. The algorithm declares failure to find any supported models (returns FALSE) when both max_retry and max_iter are exhausted.

Algorithm 1: Gradient descent search of supported models
Input: program matrix Q, thresholding parameter t, initial vector x_init, max_retry ≥ 1, max_iter ≥ 1, ε > 0, step size α > 0, λ_1 > 0, λ_2 > 0
Output: supported model x, or FALSE
 1: x ← x_init
 2: for n_try ← 1 to max_retry do
 3:   if n_try > 1 then
 4:     x ← random vector
 5:   end if
 6:   for n_iter ← 1 to max_iter do
 7:     x_r ← round(x)             (rounding heuristic)
 8:     loss ← L(x_r)              (loss function, see Def. 8)
 9:     if loss ≤ ε then
10:       x ← x_r
11:       return x
12:     else
13:       gradient ← ∂L(x)/∂x      (gradient, see Def. 9)
14:       x ← x − α · gradient     (gradient update)
15:     end if
16:   end for
17: end for
18: return FALSE

An initial implementation of the proposed method was made in Python 3.7. Our experimental environment is a Linux container limited to 8 virtual CPU cores and 16GB RAM, hosted on a desktop with Ubuntu 18.04, an Intel Core i9-9900K 3.6GHz CPU and 64GB RAM.

5. Experiments

In this section, we first compare the performance of our method with the one presented by Aspis et al. [10] in terms of the success rate of finding correct supported models. Then we qualitatively show the effect of the rounding heuristic by plotting the trajectories of the candidate interpretation vector.

5.1. Comparing Success Rates with Aspis et al.

Aspis et al. [10] encode the program reduct into a matrix and employ Newton's method for root finding to find fixed points of the program. The gradient is calculated via a Jacobian matrix, and the thresholding operation is carried out with a parameterized sigmoid. They present two types of sampling methods for setting the initial vector: uniform sampling, similar to our method, where the values are sampled uniformly from [0, 1], and semantic sampling, where the values are sampled uniformly from [0, γ⊥] ∪ [γ⊤, 1] (γ⊥ is an upper bound on false values that variables can take, and γ⊤ is a lower bound on true values). Semantic sampling results in an initial vector that is semantically equivalent to an interpretation.

Firstly we consider the "N-negative loops" task, which involves programs of the following form: for 1 ≤ i ≤ N,

    p_i ← not q_i
    q_i ← not p_i     (15)

For our algorithm, we disabled automatic retry with max_retry = 1, so that the result depends solely on a single application of the algorithm. The other parameters were set as follows: max_iter = 10^3, ε = 10^-4, λ_1 = λ_2 = 1, α = 10^-1. For Aspis et al.'s algorithm we used the following settings: max_iter = 10^3, ε = 10^-4, γ = 0.5, and their sigmoid parameter set to 0.087, which was their best performing setting, resulting in 89% and near 100% success for uniform and semantic sampling, respectively, for N = 1 loop. We generated programs of the form (15) with N up to 50, applied the algorithms exactly once, and repeated this 1000 times. We measured the rate of success of converging to supported models. Figure 2(a) shows that our method with uniform initialization outperforms their uniform sampling method, although the improvement over their semantic sampling method is marginal.
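For reference, a small helper like the following (our own sketch, using the column convention p_1..p_N, q_1..q_N followed by their negated copies in the same order) can generate the program matrix and parameter vector for an N-negative-loops instance of the form (15).

    import numpy as np

    def n_negative_loops(n):
        """Encode the N-negative-loops program (15) as (Q, t) under Definition 2.

        Atom order: p_1..p_n, q_1..q_n, then their negations, so column
        n_atoms + j is the negated copy of atom j.
        """
        n_atoms = 2 * n                       # p_i and q_i
        Q = np.zeros((n_atoms, 2 * n_atoms), dtype=int)
        for i in range(n):
            p, q = i, n + i                   # row/column indices of p_i and q_i
            Q[p, n_atoms + q] = 1             # p_i <- not q_i
            Q[q, n_atoms + p] = 1             # q_i <- not p_i
        t = np.ones(n_atoms, dtype=int)       # every rule has a single body literal
        return Q, t

    Q, t = n_negative_loops(2)
    print(Q.shape)   # (4, 8)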
Secondly we consider the "choose 1 out of N" task, where the task is to choose exactly 1 out of N options. The programs in this class have the following form:

    p_1 ← not p_2, not p_3, ..., not p_N
    p_2 ← not p_1, not p_3, ..., not p_N
    ...                                         (16)
    p_N ← not p_1, not p_2, ..., not p_{N-1}

Figure 2: (a) Success rate of converging to a model on the N-negative loops task. (b) Success rate of converging to a model on the N-choices task.

We generated programs for N between 2 and 9 and applied the algorithms using the same parameters as in the N-negative loops task, repeating the process 1000 times for each N. (For each program, whose maximum rule length is N − 1, we estimate the upper bound of Aspis et al.'s sigmoid parameter according to their method and subtract 10^-3 from this value; this ensures the parameter stays below the upper bound.) For our algorithm, we consider the following scenarios: the algorithm is allowed (i) only 1 try (max_retry = 1, "1-try" in Fig. 2(b)), and (ii) up to 1000 retries (max_retry = 1000, "n-try" in Fig. 2(b)). From Figure 2(b) we observe that the success rate of our algorithm (1-try) starts to drop rapidly past N = 5, and approaches the same level of success (the 10-20% range) as Aspis et al. by N = 7. We can allow multiple retries to help the algorithm find the supported models, as indicated by the high success rate of the n-try case; however, this means that the run time can be significantly longer than with max_retry = 1. In this example there is a global constraint that exactly one of {p_1, ..., p_N} is chosen, but neither our method nor Aspis et al.'s method can capitalize on this information. Handling of global constraints is left for future work.

5.2. Effect of the Rounding Heuristic on the Trajectories

Here we qualitatively demonstrate the effect of the rounding heuristic (Lines 7-11 in Algorithm 1) by plotting the trajectories of the interpretation vector. We consider a simple negative loop program, i.e., program (15) with N = 1, where we have exactly two supported models {p} and {q}. The contours in the following figures represent the value of the loss function at each (p, q)-coordinate. The dots and arrows represent the trajectories of the algorithm starting from various initial vectors. Note that the lowest points on this landscape are at (p, q) = (1, 0), i.e. {p}, and (p, q) = (0, 1), i.e. {q}.

Figure 3: (a) Trajectories of various attempts without rounding. (b) Trajectories of various attempts with the rounding heuristic.

From Figure 3(a) we observe that once the algorithm reaches the bottom of the valley, the updates become smaller and convergence is slow.
From Figure 3(b) we see that the rounding heuristic (peeking at a solution) avoids the bottom of the valley entirely, and after several iterations the algorithm jumps straight to the 0-1 interpretation, which makes it converge faster. In both cases we see the algorithm converging to the middle point (p, q) = (0.5, 0.5) when starting from initial vectors with p = q. Currently we do not break this symmetry in the algorithm, for example by adding a very small amount of noise to the vector during the update; instead we rely on the retry function, which resets the initial vector to a random value on restart.

6. Conclusion

In this work we presented a new method for computing supported models of a normal logic program, using a matrix representation of the program and a gradient-based search algorithm. We defined a loss function based on the implementation of the T_P operator by matrix-vector multiplication with a suitable thresholding function. We incorporated penalty terms into the loss function to avoid undesirable results. We showed the results of several experiments where our method showed improvements over an existing method in terms of the success rate of finding correct supported models of normal logic programs.

Raw computational performance is not the focus of this paper, and we leave thorough benchmarking for future work. Our method does not compute stable models, except in cases where supported and stable models coincide. For computing stable models, we may consider using our method as a preprocessor to the method of [4]: our method can be used to find supported models as promising candidates, which can then be refined for better performance.

References

[1] T. Sato, Embedding Tarskian semantics in vector spaces, in: The Workshops of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, volume WS-17 of AAAI Workshops, AAAI Press, 2017.
[2] C. Sakama, K. Inoue, T. Sato, Linear algebraic characterization of logic programs, in: G. Li, Y. Ge, Z. Zhang, Z. Jin, M. Blumenstein (Eds.), Knowledge Science, Engineering and Management, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 520-533. doi:10.1007/978-3-319-63558-3_44.
[3] Y. Aspis, K. Broda, A. Russo, Tensor-based abduction in Horn propositional programs, in: ILP 2018 - 28th International Conference on Inductive Logic Programming, CEUR Workshop Proceedings, 2018, pp. 68-75.
[4] T. Q. Nguyen, K. Inoue, C. Sakama, Enhancing linear algebraic computation of logic programs using sparse representation, in: F. Ricca, A. Russo, S. Greco, N. Leone, A. Artikis, G. Friedrich, P. Fodor, A. Kimmig, F. A. Lisi, M. Maratea, A. Mileo, F. Riguzzi (Eds.), Proceedings 36th International Conference on Logic Programming (Technical Communications), ICLP 2020, Rende (CS), Italy, 18-24th September 2020, volume 325 of EPTCS, 2020, pp. 192-205. doi:10.4204/EPTCS.325.24.
[5] T. Sato, C. Sakama, K. Inoue, From 3-valued semantics to supported model computation for logic programs in vector spaces, in: 12th International Conference on Agents and Artificial Intelligence, 2020, pp. 758-765.
[6] T. Sato, K. Inoue, C. Sakama, Abducing relations in continuous spaces, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Stockholm, Sweden, 2018, pp. 1956-1962. doi:10.24963/ijcai.2018/270.
[7] T. Sato, R. Kojima, Logical inference as cost minimization in vector spaces, in: A. El Fallah Seghrouchni, D. Sarne (Eds.), Artificial Intelligence. IJCAI 2019 International Workshops, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 239-255. doi:10.1007/978-3-030-56150-5_12.
[8] T. Sato, A linear algebraic approach to Datalog evaluation, Theory and Practice of Logic Programming 17 (2017) 244-265. doi:10.1017/S1471068417000023.
[9] H. A. Blair, F. Dushin, D. W. Jakel, A. J. Rivera, M. Sezgin, Continuous models of computation for logic programs: Importing continuous mathematics into logic programming's algorithmic foundations, in: K. R. Apt, V. W. Marek, M. Truszczynski, D. S. Warren (Eds.), The Logic Programming Paradigm: A 25-Year Perspective, Artificial Intelligence, Springer, Berlin, Heidelberg, 1999, pp. 231-255. doi:10.1007/978-3-642-60085-2_10.
[10] Y. Aspis, K. Broda, A. Russo, J. Lobo, Stable and supported semantics in continuous vector spaces, in: Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning, 2020, pp. 59-68. doi:10.24963/kr.2020/7.
[11] W. Marek, V. Subrahmanian, The relationship between stable, supported, default and autoepistemic semantics for general logic programs, Theoretical Computer Science 103 (1992) 365-386. doi:10.1016/0304-3975(92)90019-C.
[12] K. R. Apt, H. A. Blair, A. Walker, Towards a theory of declarative knowledge, in: Foundations of Deductive Databases and Logic Programming, Elsevier, 1988, pp. 89-148.
[13] H. D. Nguyen, C. Sakama, T. Sato, K. Inoue, An efficient reasoning method on logic programming using partial evaluation in vector spaces, Journal of Logic and Computation 31 (2021) 1298-1316. doi:10.1093/logcom/exab010.
[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493-2537.
[15] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, USA, 2004.