<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Gradient-Based Supported Model Computation in Vector Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akihiro Takemura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katsumi Inoue</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Informatics</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Graduate University for Advanced Studies</institution>
          ,
          <addr-line>SOKENDAI</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose an alternative method for computing supported models of normal logic programs in vector spaces using gradient information. To this end, we first translate the program into a definite program and embed the program into a matrix. We then define a loss function based on the implementation of the immediate consequence operator T_P by matrix-vector multiplication with a suitable thresholding function. We propose an almost everywhere differentiable alternative to the non-linear thresholding operation, and incorporate penalty terms into the loss function to avoid undesirable results. We report the results of several experiments where our method shows improvements over an existing method in terms of success rates of finding correct supported models of normal logic programs.</p>
      </abstract>
      <kwd-group>
        <kwd>Logic Programming</kwd>
        <kwd>Supported Model Computation</kwd>
        <kwd>Differentiable Logic Programming</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Performing logical inference with linear algebraic methods has been studied as an attractive
alternative to traditional symbolic inference methods [1, 2, 3]. Linear algebraic approaches to
logical inference offer promising characteristics compared to the traditional methods, for
example, efficiency and scalability: since linear algebra is at the heart of many scientific applications,
there are existing highly-efficient optimized algorithms for performing basic operations in linear
algebra, and these new methods can benefit from the high computational power offered by modern
systems in the form of multi-core CPUs and GPUs. In this work, we present a gradient-based
method for computing supported models in vector spaces.</p>
      <p>Sakama et al. introduced the matrix representations of definite, disjunctive and normal logic
programs, and a linear algebraic method for computing the least models and stable models [2].
More recently Nguyen et al. proposed an improved version of the algorithm by taking advantage
of sparsity [4]. Sato et al. proposed a method for computing supported models in vector spaces
via 3-valued completion models of a normal logic program [5]. The matrix encoding method
used in this work is influenced by the aforementioned works, and we extend the previous works
by proposing an encoding that is more suitable towards our goal of computing supported models
as fixed points in vector spaces. Another difference is that while our method uses gradient
information to find supported models, previous works use non-differentiable operations to find
models.</p>
      <p>Gradient-based methods have been considered, for example, for abduction [6], SAT and
probabilistic inference [7]. Our design of the loss function follows the direction of previous
works by Sato [8, 7], and we introduce additional terms for computing supported models.
Although one should be able to compute supported models by obtaining a completion of the
program and solving it with a SAT solver, e.g. using the matrix-based SAT solver in [7], our
method requires neither completion nor a SAT solver. Blair et al. introduced an extension of
2-valued semantics in vector spaces using a polynomial representation of normal programs,
and used gradient descent to solve the dynamical system where the solutions correspond to
the supported models [9]. In contrast, we employ a matrix representation of the program, a
thresholding function, and gradient descent for minimizing the loss function.</p>
      <p>The closest to our work is Aspis et al.’s method [10], as both our algorithm and theirs use a
gradient-based method to find supported models of normal logic programs in vector spaces.
Their method uses a matrix representation of the program reduct and Newton’s method
for root finding to find fixed points. Some differences are: (i) While our method uses a normal
to definite program translation and a matrix representation of the program, their method
incorporates the notion of reduct into the program matrix. (ii) Our loss function includes a
thresholding term and additionally regularization terms for dealing with fractional assignments
and facts, while in their method the convergence depends solely on the fixed point criterion. The
use of regularization to penalize fractional-valued interpretations is critical to avoid converging
to unwanted middle points. Our goal in this paper is to improve upon Aspis et al.’s method, by
extending Sato et al.’s earlier work on using loss functions for logical inference, and incorporating
new terms for supported model computation.</p>
      <p>The structure of this paper is as follows: Section 2 covers the necessary background and
definitions. Section 3 presents a method for representing logic programs with matrices. Section
4 introduces the thresholding function and loss function, as well as the gradient-based search
algorithm for finding supported models. Section 5 presents the results of experiments designed
to test the ability of the algorithm and compare against known results in the literature. Finally,
Section 6 presents the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>We consider a language ℒ that contains a finite set of propositional variables defined over a
finite alphabet and the logical connectives ¬, ∧, ∨ and ←. The Herbrand base, B_P, is the set of
all propositional variables that appear in a logic program P.</p>
      <p>
        A definite program is a set of rules of the form (1) or (2), where h and b_i are propositional
variables (atoms) in ℒ:
h ← b_1 ∧ b_2 ∧ · · · ∧ b_m (m ≥ 0)   (1)
h ← b_1 ∨ b_2 ∨ · · · ∨ b_m (m ≥ 0)   (2)
We refer to (2) as an OR-rule, which is a shorthand for m rules: h ← b_1, h ← b_2, . . . , h ← b_m.
For each rule r we define head(r) = h and body(r) = {b_1, . . . , b_m}. A rule r is referred to as a
fact if body(r) = ∅.
      </p>
      <p>
        A normal program is a set of rules of the form (3), where h and b_i are propositional variables
in ℒ:
h ← b_1 ∧ · · · ∧ b_m ∧ ¬b_{m+1} ∧ ¬b_{m+2} ∧ · · · ∧ ¬b_n (n ≥ m ≥ 0)   (3)
We refer to the positive and negative occurrences of atoms in the body as body⁺(r) = {b_1, . . . , b_m}
and body⁻(r) = {b_{m+1}, . . . , b_n}, respectively. A normal program is a definite program if
body⁻(r) = ∅ for every rule r ∈ P.
      </p>
      <p>
        An Herbrand interpretation I of a normal program P is a subset of B_P. A model M of P
is an interpretation of P where for every rule r ∈ P of the form (3), body⁺(r) ⊆ M and
body⁻(r) ∩ M = ∅ imply h ∈ M. A program is called consistent if it has a model. A supported
model M is a model of P where for every p ∈ M there exists a rule r ∈ P such that p = head(r),
body⁺(r) ⊆ M and body⁻(r) ∩ M = ∅ [11, 12].
      </p>
      <p>As we shall show later, in this paper we transform normal logic programs into definite
programs for searching for supported models. Thus we use the following definition of the immediate
consequence operator T_P. T_P : 2^{B_P} → 2^{B_P} is a function on Herbrand interpretations. For
a definite program P, we have: T_P(I) = {h | h ← b_1 ∧ · · · ∧ b_m ∈ P and {b_1, . . . , b_m} ⊆
I} ∪ {h | h ← b_1 ∨ · · · ∨ b_m ∈ P and {b_1, . . . , b_m} ∩ I ≠ ∅}. It is known that a supported model
M of a program P is a fixed point of T_P, i.e., T_P(M) = M [11].</p>
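      <p>As a concrete illustration of the T_P operator above, the following is a minimal Python sketch (ours, not part of the original formalism); the triple-based rule format, with kind marking conjunctive and OR-rules, is an assumption of this sketch.</p>
      <preformat>
# One application of T_P to an interpretation (a set of atoms).
def t_p(program, interpretation):
    out = set()
    for head, body, kind in program:
        if kind == "and" and set(body).issubset(interpretation):
            out.add(head)   # conjunctive rule: all body atoms hold
        elif kind == "or" and set(body).intersection(interpretation):
            out.add(head)   # OR-rule: at least one body atom holds
    return out

# {} is a fixed point of T_P for P = {p ← q, q ← p}, hence a supported model.
P = [("p", ["q"], "and"), ("q", ["p"], "and")]
assert t_p(P, set()) == set()
      </preformat>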
      <p>Definition 1 (Singly-Defined (SD) program). A normal program P is called an SD program if
head(r_1) ≠ head(r_2) for any two rules r_1 and r_2 in P where r_1 ≠ r_2.</p>
      <p>Any normal program P can be converted into an SD program P′ in the following manner.
If there is more than one rule with the same head (h ← body(r_1), . . . , h ← body(r_k), where
k &gt; 1), then replace them with a set of new rules {h ← q_1 ∨ ... ∨ q_k, q_1 ← body(r_1), ..., q_k ←
body(r_k)} containing new atoms {q_1, ..., q_k}. This is a stricter condition than the multiple
definitions condition (MD-condition) [2]: for any two rules r_1 and r_2 in the program, head(r_1) =
head(r_2) implies |body(r_1)| ≤ 1 and |body(r_2)| ≤ 1. Consequently, all SD programs satisfy the
MD condition. Unless noted otherwise, we assume all programs in this paper are SD programs.</p>
      <p>
        Given a normal program P, it is transformed into a definite program by replacing the negated
literals in rules of the form (3) and rewriting:
h ← b_1 ∧ · · · ∧ b_m ∧ b̄_{m+1} ∧ b̄_{m+2} ∧ · · · ∧ b̄_n (n ≥ m ≥ 0)   (4)
where b̄_j are new atoms associated with the negated b_j. A collection of rules of the form (4) is
referred to as the positive form P⁺, where B_{P⁺} = B_P ∪ {b̄ | b ∈ B_P}. For transformed rules
of the form (4), we refer to {b_1, . . . , b_m} as the positive part and {b̄_{m+1}, . . . , b̄_n} as the negative
part. After transformation, the program should contain rules of the forms (1), (2), or (4).
      </p>
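      <p>The transformation to the positive form is mechanical. The following sketch uses our own rule representation and the naming convention neg_x for the new atom x̄ (both assumptions of this sketch, not the paper's notation):</p>
      <preformat>
# A normal rule is (head, positive_body, negative_body); the positive form
# turns p ← q ∧ ¬r into the definite rule p ← q ∧ neg_r, as in form (4).
def to_positive_form(normal_program):
    positive = []
    for head, pos, neg in normal_program:
        body = list(pos) + ["neg_" + b for b in neg]
        positive.append((head, body, "and"))
    return positive

print(to_positive_form([("p", ["q"], ["r"])]))
# [('p', ['q', 'neg_r'], 'and')]
      </preformat>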
      <p>By an interpretation I⁺ of P⁺, we mean any set of atoms I⁺ ⊆ B_{P⁺} that satisfies the
condition: for any atom p ∈ B_P, precisely one of either p or p̄ belongs to I⁺.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Representing Logic Programs with Matrices</title>
      <p>In this section, we first describe the relationship between the positive forms of normal logic
programs and their supported models. Then we describe the matrix encoding of the positive forms,
which in turn represents the original normal logic programs in vector spaces.</p>
      <sec id="sec-3-1">
        <title>3.1. Relationship between Positive Forms and Supported Models</title>
        <p>Consider an (inconsistent) program p ← ¬p, and its positive form p ← p̄. P⁺ is a definite
program, but it has no models in this case due to the restriction we place on the interpretation:
if p ∈ I⁺ then p̄ ∉ I⁺ and vice versa. In this case, the implication is that there are no
fixed points of T_{P⁺} for I⁺ that satisfy the condition (p̄ ∈ I⁺ if p ∉ I⁺). On the other hand,
when a model M of P exists, we can show that the corresponding M⁺ is a model of P⁺.
Proposition 1. Let P be a normal program, and let P⁺ be its positive form. If M is a model of
P, then M′ = M ∪ {p̄ | p ∈ B_P ∖ M} is a model of P⁺. Conversely, if M⁺ is a model of P⁺,
then M⁺ ∩ B_P is a model of P.</p>
        <p>Proof. Follows from the definitions of M′ and P⁺. Consider M′. Since p̄ ∉ M′ if p ∈ M′
and vice versa, for each rule r ∈ P⁺, body(r) ⊆ M′ implies head(r) = p ∈ M′. Thus, M′
is a model of P⁺. Now consider M⁺. Let M = M⁺ ∩ B_P. Given that M⁺ is a model of P⁺
and p ∈ M if p ∈ M⁺, for each rule r ∈ P, body⁺(r) ⊆ M and body⁻(r) ∩ M = ∅ imply
head(r) = p ∈ M. Thus, M is a model of P.</p>
        <p>Proposition 2. Let M be a supported model of P, and put M′ = M ∪ {p̄ | p ∈ B_P ∖ M}. Then,
T_{P⁺}(M′) = M.</p>
        <p>Proof. Suppose p ∈ M. Since M is a supported model, there exists a rule r ∈ P such that
head(r) = p, body⁺(r) ⊆ M and body⁻(r) ∩ M = ∅. Correspondingly, there exists a rule
r′ ∈ P⁺ such that head(r′) = p, body⁺(r′) ⊆ M′ and body⁻(r′) ⊆ M′. That is, p ∈ T_{P⁺}(M′).
Hence, M ⊆ T_{P⁺}(M′).</p>
        <p>Conversely, suppose p ∈ T_{P⁺}(M′). Then, there exists a rule r′ ∈ P⁺ such that head(r′) = p
and body(r′) ⊆ M′. Since M′ is a model of P⁺ by Proposition 1, body(r′) ⊆ M′ implies
head(r′) = p ∈ M′. Because p is a positive literal, p ∈ M holds. Hence, T_{P⁺}(M′) ⊆ M.
Therefore, M = T_{P⁺}(M′).</p>
        <p>Proposition 3. Let I⁺ be an interpretation of P⁺. If T_{P⁺}(I⁺) = I⁺ ∩ B_P holds, then
M = I⁺ ∩ B_P is a supported model of P.</p>
        <p>Proof. Suppose T_{P⁺}(I⁺) = I⁺ ∩ B_P. Because I⁺ ∩ B_P recovers the positive literals
from I⁺, for each p ∈ (I⁺ ∩ B_P), there exists a rule r ∈ P such that head(r) = p,
body⁺(r) ⊆ (I⁺ ∩ B_P) and body⁻(r) ∩ (I⁺ ∩ B_P) = ∅. Thus, M = I⁺ ∩ B_P is a
supported model of P.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Matrix Encoding of Logic Programs</title>
        <p>We first introduce the matrix and vector notations used in this paper, before showing our
method for encoding normal programs into matrices. Matrices and vectors are represented as
bold upper-case letters, e.g. M, and lower-case letters, e.g. v, respectively. A 1-vector with length N is
represented by 1_N. The indices of the entries of matrices and vectors appear in the subscript;
for example, M_{ij} refers to the element at the i-th row and j-th column of a matrix M, and v_i refers
to the i-th element of a column vector v. Let M_{i:} and M_{:j} denote the i-th row slice and j-th
column slice of M, respectively. We denote the horizontal concatenation of matrices M1 and
M2 as [M1 M2], and denote the vertical concatenation of column vectors v1 and v2 as [v1; v2].
Some matrices and vectors in this paper have a superscript to distinguish them from others, e.g.
M^s, and it should not be confused with indices appearing in the subscript; for example, M^s_{ij}
refers to the (i, j)-th entry of M^s.</p>
        <p>Let P be a normal program with size |P| = N, P⁺ its positive form and B_{P⁺} the Herbrand
base of P⁺. Then we have |B_{P⁺}| = 2N, since for every p ∈ B_P we add its negated version
p̄. To encode P⁺ into a program matrix Q, we encode the atoms appearing in the bodies of the
rules with one-hot encoding. After performing this encoding for all rules in P⁺, we obtain an
(N × 2N) binary program matrix Q, which is equivalent to a vertical concatenation of row
vectors, each encoding a rule r ∈ P⁺ with one-hot encoding.</p>
        <p>Definition 2 (Program Matrix). Let P be a normal program with |P| = N and P⁺ its positive
form with |B_{P⁺}| = 2N. Then P⁺ is represented by a matrix Q ∈ Z^{N×2N} such that for each
element Q_{ij} (1 ≤ i ≤ N, 1 ≤ j ≤ 2N) in Q,
• Q_{ij} = 1 if atom p_j ∈ B_{P⁺} (1 ≤ j ≤ 2N) appears in the body of the rule r_i (1 ≤ i ≤ N);
• Q_{ij} = 0, otherwise.</p>
        <p>The i-th row of Q corresponds to the atom p_i appearing in the head of the rule r_i, and the
j-th column corresponds to the atom p_j (1 ≤ j ≤ 2N) appearing in the body of the rules r_i
(1 ≤ i ≤ N). Atoms that do not appear in the head of any of the rules in P⁺ are encoded as
zero-only row vectors in Q.</p>
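        <p>A small NumPy sketch of Definition 2 follows; the column ordering (p_1, . . . , p_N, p̄_1, . . . , p̄_N) and the dict-based rule format (one rule per head, as in an SD program) are assumptions of this sketch:</p>
        <preformat>
import numpy as np

def program_matrix(rules, atoms):
    """rules: dict head -> body over P+ ('neg_x' encodes x̄);
    atoms: the N positive atoms, fixing the column order."""
    n = len(atoms)
    col = {a: j for j, a in enumerate(atoms)}
    col.update({"neg_" + a: n + j for j, a in enumerate(atoms)})
    q = np.zeros((n, 2 * n), dtype=int)
    for i, a in enumerate(atoms):        # row i: rule with head p_i
        for b in rules.get(a, []):       # headless atoms keep zero rows
            q[i, col[b]] = 1
    return q
        </preformat>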
        <p>This definition is different from [2] or [10], in that we do not explicitly include ⊤ and ⊥
in the program matrix, and we do not use fractional values to encode long rules. In fact, our
encoding method is similar to that of [5], except that we do not use (2N × 2N) space for the
program matrix, since we do not encode rules with p̄ ∈ B_{P⁺} in the head.¹
Definition 3 (Interpretation Vector). Let P be a definite program and B_P = {p_1, . . . , p_N}.
Then an interpretation I ⊆ B_P is represented by a vector v = (v_1, . . . , v_N)⊺ ∈ Z^N where each
element v_i (1 ≤ i ≤ N) represents the truth value of the proposition p_i, such that v_i = 1 if
p_i ∈ I, otherwise v_i = 0. We assume propositional variables share the common index, such that
v_i corresponds to p_i, and we write var(v_i) = p_i.</p>
        <p>
          Recall that the positive form P⁺ of a normal program is a definite program, and all negated
literals in the body are replaced by new atoms, e.g. in (4) ¬b_{m+1} is replaced by b̄_{m+1}. We now
extend the definition of interpretation vectors to include relations between the positive and
negative occurrences of atoms, in order to maintain that whenever we have b_j ∈ I, b̄_j ∉ I, and
vice versa.
        </p>
        <p>Definition 4 (Companion Vector). Let B⁺_{P⁺} ⊆ B_{P⁺} denote the positive part of B_{P⁺},
and B⁻_{P⁺} ⊆ B_{P⁺} the negative part of B_{P⁺}, with size |B⁺_{P⁺}| = |B⁻_{P⁺}| = N. Let
v ∈ Z^N be a vector representing truth assignments for p ∈ B⁺_{P⁺}, such that v_i = 1 if p_i ∈ I
and v_i = 0 otherwise. Define a companion vector w ∈ Z^{2N}, representing an interpretation
I⁺ ⊆ B_{P⁺}, as follows: w = [v; 1_N − v].</p>
        <p>¹In [5], the program matrix encodes the completion of P⁺.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Gradient Descent For Computing Supported Models</title>
      <p>In this section, we first show how the evaluation of the T_P operator can be carried out in
vector spaces. Then we introduce our loss function and its gradient, and finally we introduce
an algorithm for finding supported models in vector spaces.</p>
      <sec id="sec-4-0">
        <title>4.1. Computing the T_P Operator in Vector Spaces</title>
        <p>Sakama et al. [2] showed that the T_P operator can be computed in vector spaces using a
thresholding function, which they called θ-thresholding. Here we introduce a variant of θ-
thresholding to accommodate the new program encoding method as well as the differentiability
requirement. First we initialize the parameter vector t.</p>
        <p>
          Definition 5 (Parameter Vector t). A set of parameters to the θ-thresholding is represented by a
column vector t ∈ Z^N such that for each element t_i (1 ≤ i ≤ N) in t,
• t_i = |body(r_i)| if the rule r_i ∈ P⁺ is a conjunctive rule, e.g. (1), (4);
• t_i = 1 if the rule r_i ∈ P⁺ is a disjunctive rule, e.g. (2);
• t_i = 0, otherwise.
        </p>
        <p>In some previous works [2, 13], the information about the nature of the rules was also stored
in the program matrix Q alongside the atom occurrences; conjunctive rules with |body(r_i)| &gt; 1
had fractional values Q_{ij} = 1/|body(r_i)|, and disjunctive bodies had integer values Q_{ij} = 1.
Here we follow the approach taken by Sato et al. [5], and only store information about the
occurrences of atoms in P⁺, keeping supplementary information in the parameter vector t to
recover the original program.</p>
        <p>
          Definition 6 (Parameterized θ-thresholding). Let w ∈ Z^{2N} be a companion vector representing
I⁺ ⊆ B_{P⁺}. Given a parameter vector t ∈ Z^N, a program matrix Q ∈ Z^{N×2N}, and a vector
y = Qw where y ∈ Z^N, we apply the thresholding function element-wise as follows:
θ_t(y) = min(max(0, y − (t − 1)), 1) if t ≥ 1, and θ_t(y) = 0 if t &lt; 1.   (5)
        </p>
        <p>
          This thresholding function resembles hardtanh, which is an activation function developed
for use in natural language processing [14]. In the original hardtanh function, the range of the
linear region is [−1, 1] (Figure 1(a)), but here we define the linear region between [t − 1, t]
(Figure 1(b)). This function is almost everywhere differentiable, except at y = t − 1 and
y = t. The special case t &lt; 1 in Equation (5) corresponds to the case t = 0, where the head
does not appear in the program P⁺ and is assumed to be false.
        </p>
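        <p>A NumPy sketch of Equation (5) (ours, for illustration): the linear region is shifted to [t − 1, t] per rule, and rows whose t_i is below 1 are forced to 0.</p>
        <preformat>
import numpy as np

def theta(y, t):
    """Element-wise parameterized thresholding of Equation (5)."""
    out = np.minimum(np.maximum(0.0, y - (t - 1.0)), 1.0)
    return np.where(t >= 1, out, 0.0)
        </preformat>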
      <p>Intuitively, for the head of a conjunctive rule to be true it is sufficient to check whether all
literals in the body hold, otherwise the rule evaluates to false. Similarly, for a disjunctive rule
it is sufficient to check whether at least one of the literals in the body holds for the head to hold.
This is formalized in Proposition 4.</p>
      <p>Figure 1: (a) hardtanh; (b) parameterized thresholding with t = 1.</p>
      <p>Proposition 4 (Thresholded T_P Operator). Let P⁺ be the positive form of a normal program
P and Q ∈ Z^{N×2N} its matrix representation. Let v ∈ Z^N be a subset of an interpretation vector,
representing p_i ∈ I ⊆ B⁺_{P⁺} if v_i = 1 for i ∈ {1, . . . , N}. Let w ∈ Z^{2N} be a companion vector
to v. Then z = [u; 1_N − u] ∈ Z^{2N} is a vector representing J = T_P(I) satisfying the condition
(p̄ ∈ J if p ∉ J), if u = θ_t(Qw).</p>
      <p>Proof. Consider u = θ_t(Qw). For u = (u_1, . . . , u_N)⊺, by the definition of the thresholding
function, u_i = 1 (1 ≤ i ≤ N) if u′_i ≥ t_i in u′ = Qw. Take a row slice Q_{i:}, then u′_i =
Q_{i:}w = Q_{i1}w_1 + · · · + Q_{i,2N}w_{2N}, and u_i = 1 if u′_i ≥ t_i. Both Q_{i:} and w are 0-1 vectors,
so it follows that there are at least t_i elements where Q_{ij} = w_j = 1 (1 ≤ j ≤ 2N). The first
N elements of w represent p_j ∈ I ⊆ B⁺_{P⁺} if w_j = 1, and if p_j ∈ I then p̄_j ∉ I,
which is maintained through the definition of the companion vector w. 1) For a conjunctive
rule p_i ← p_1 ∧ · · · ∧ p_k (1 ≤ k ≤ 2N), {p_1, . . . , p_k} ⊆ I implies p_i ∈ T_P(I). 2) For an
OR-rule p_i ← p_1 ∨ · · · ∨ p_k (1 ≤ k ≤ 2N), {p_1, . . . , p_k} ∩ I ≠ ∅ implies p_i ∈ T_P(I). p_i ∈ J
is represented by z_i = 1 (1 ≤ i ≤ 2N). Then by putting J = {var(z_i) | z_i = 1}, J = T_P(I)
holds.</p>
      <p>Consider J = T_P(I). For v = (v_1, . . . , v_N)⊺ representing I ⊆ B⁺_{P⁺}, w = (v_1, . . . , v_N, 1 −
v_1, . . . , 1 − v_N)⊺ is a vector representing I ⊆ B_{P⁺} if we set I = {var(w_j) | w_j = 1}. u′ = Qw
is a vector such that u′_i ≥ t_i (1 ≤ i ≤ N) if var(u_i) ∈ T_P(I). Define u = (u_1, . . . , u_N)⊺
such that u_i = 1 (1 ≤ i ≤ N) if u′_i ≥ t_i in Qw, and u_i = 0 otherwise. Define an
interpretation J ⊆ B_{P⁺} such that it can be partitioned into subsets of positive and negative
occurrences of atoms (J⁺ ∪ J⁻) = J ⊆ B_{P⁺}. Since only positive atoms occur in the head, u
represents a positive subset of interpretation J⁺ ⊆ J ⊆ B_{P⁺} by setting J⁺ = {var(u_i) | u_i =
1} (1 ≤ i ≤ N). If p_i ∈ T_P(I) then u_i = 1, and p̄_i ∉ J is represented by 1 − u_i = 0.
Conversely, if p_i ∉ T_P(I) then u_i = 0, and 1 − u_i = 1 represents p̄_i ∈ J. Thus 1_N − u
represents J⁻ ⊆ J ⊆ B_{P⁺}. z = [u; 1_N − u] is then a vector representing J⁺ ∪ J⁻ = J if we
set J = {var(z_i) | z_i = 1} (1 ≤ i ≤ 2N). Thus z = [u; 1_N − u] represents J = T_P(I) if
u = θ_t(Qw).</p>
      <p>
        Example 1. Consider the following program:
P = {p ← q,  q ← p ∧ r,  r ← ¬p}   (6)
This program has one supported (stable) model: {r}. We have B_P = {p, q, r} and B_{P⁺} =
{p, q, r, p̄, q̄, r̄}, and the matrix representation Q and parameter vector t are (with columns
ordered p, q, r, p̄, q̄, r̄):
Q = [0 1 0 0 0 0; 1 0 1 0 0 0; 0 0 0 1 0 0], t = (1 2 1)⊺   (7)
Suppose an assignment v_{r} = (0 0 1)⊺ is given. Then the companion vector w is:
w = [v_{r}; 1_3 − v_{r}] = (0 0 1 1 1 0)⊺   (8)
Compute the matrix multiplication product Qw and apply the parameterized thresholding:
u = θ_t(Qw) = θ_t((0 1 1)⊺) = (0 0 1)⊺ = v_{r}   (9)
Let z be a companion vector to u, i.e. z = [u; 1_3 − u], then we have
z = (0 0 1 1 1 0)⊺   (10)
Define J = {var(z_i) | z_i = 1}; then we have J = {r, p̄, q̄}, and J ∩ B_P = {r}.
      </p>
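      <p>Using the theta sketch above, the computation in Example 1 can be replayed numerically (with the rule bodies as written in (6)):</p>
      <preformat>
import numpy as np
# Columns ordered (p, q, r, p̄, q̄, r̄); one row per rule head p, q, r.
Q = np.array([[0, 1, 0, 0, 0, 0],     # p ← q
              [1, 0, 1, 0, 0, 0],     # q ← p ∧ r
              [0, 0, 0, 1, 0, 0]])    # r ← ¬p
t = np.array([1, 2, 1])
v = np.array([0, 0, 1])               # candidate assignment {r}
w = np.concatenate([v, 1 - v])        # companion vector, as in (8)
u = theta(Q @ w, t)                   # θ_t((0 1 1)⊺) = (0 0 1)⊺ = v
print(u)                              # [0. 0. 1.], a fixed point
      </preformat>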
      <p>
        Proposition 5 (Supported Model Computation with Thresholded T_P). Let v ∈ Z^N be a
0-1 vector representing a subset of an interpretation I ⊆ B_{P⁺}, and z = [v; 1_N − v] be its
companion vector representing I ⊆ B_{P⁺} satisfying (p̄ ∈ I if p ∉ I). Given a program matrix
Q representing a program P⁺ and a thresholding function θ_t parameterized by a vector t, the
fixed points of T_{P⁺} are represented by 0-1 binary vectors z* = [v*; 1_N − v*] ∈ Z^{2N} where
v* = θ_t(Qz*). Then z* are vectors representing models M⁺ of P⁺ satisfying (p̄ ∈ M⁺ if
p ∉ M⁺) if θ_t(Qz*) = v*. When such a 0-1 binary vector z* exists, M⁺ ∩ B_P = M is a
supported model of P.
      </p>
      <p>Proof. Let M ⊆ B_{P⁺} be a model of P⁺, represented by z*. Consider two cases: (i) T_P(M) = M
and (ii) v* = θ_t(Qz*). In both cases, by Propositions 2, 3 and 4, if a supported model of P
exists, the results hold.</p>
      <p>
        Example 2. Consider Program (6) in Example 1, and the interpretation vector z in (10), representing
an interpretation I. We show that I = {r, p̄, q̄} is indeed a model of the positive form, and
corresponds to the supported model of P. Construct a vector v_{r} and companion vector w as
in (8). By comparing (8) and (10), we see that w = z, and this is consistent with the fixed point
definition M = T_P(M). I ∩ B_P results in {r}, which is the supported model of P.
      </p>
      </sec>
      <sec id="sec-4-1">
        <title>4.2. Loss Function for Computing Supported Models</title>
        <p>By the fixed point definition of supported models, a supported model v satisfies
v = θ_t(Q[v; 1_N − v]). We now use this definition to design a loss function which can be
minimized using gradient descent. Gradient descent is a method for minimizing an objective
function (loss function) by updating the parameters in the opposite direction of the gradient of
the objective function with respect to the parameters. The size of the update is determined by
the gradient and the step size α. Intuitively, this algorithm follows the direction of the slope of
the loss surface until it reaches the bottom of a valley (local minimum).</p>
        <p>Let u be an (N × 1) vector representing an assignment to the heads of the rules in P⁺. Define
w = [u; 1_N − u] as the (2N × 1) interpretation vector that stores assignments to each atom
p ∈ I ⊆ B_{P⁺}. We define an (N × 1) vector f which stores information about occurrences of
facts in the program P⁺. This vector will be used later during the minimization step to ensure
that facts are not forgotten.</p>
        <p>Definition 7 (Fact Vector f). The set of facts in the program P⁺ is represented by a column
vector f ∈ Z^N such that for each element f_i (1 ≤ i ≤ N) in f,
• f_i = 1 if the rule r_i is a fact: p ← ;
• f_i = 0, otherwise.</p>
        <p>
          Definition 8 (Loss Function). Given a program matrix Q, a candidate vector x, thresholding
function θ_t, and constants λ1 and λ2, define the loss function as follows:
L(x) = 1/2 (‖θ_t(Q[x; 1_N − x]) − x‖²_F + λ1‖x ⊙ (x − 1_N)‖²_F + λ2‖f − (x ⊙ f)‖²_F)   (11)
where ‖·‖_F denotes the Frobenius norm and ⊙ the element-wise product.
        </p>
        <p>The first term is derived directly from the fixed point definition of supported models, and
should be 0 if x is a supported model of P⁺. The second term, which resembles a regularization
term often used in the machine learning literature, is added to penalize candidate vectors x that
contain fractional values, and is 0 if and only if x is a 0-1 vector. The third term will be 0 if and
only if the facts are preserved, and will be positive if any part of the assignment is
lost, i.e. by assigning 0 (false) to a fact where f_i = 1.</p>
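        <p>A direct NumPy sketch of Equation (11) (lam1 and lam2 stand for λ1 and λ2; for vectors the Frobenius norm reduces to the Euclidean norm):</p>
        <preformat>
import numpy as np

def loss(x, Q, t, f, lam1=1.0, lam2=1.0):
    z = np.concatenate([x, 1.0 - x])
    fix = theta(Q @ z, t) - x          # fixed-point residual
    frac = x * (x - 1.0)               # fractional-value penalty
    fact = f - x * f                   # forgotten-facts penalty
    return 0.5 * (fix @ fix + lam1 * (frac @ frac) + lam2 * (fact @ fact))
        </preformat>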
        <p>We introduce submatrices of Q, Q^pos and Q^neg, that correspond to the positive-body and
negative-body columns of the matrix, respectively, such that Q = [Q^pos Q^neg] (horizontal
concatenation of submatrices). Both Q^pos and Q^neg have the shape (N × N).</p>
        <p>Definition 9 (Gradient of the Loss Function). The gradient of the loss function with respect to x
is given by:
∂L(x)/∂x = (Q^pos − Q^neg)⊺ ((θ_t(Qz_x) − x) ⊙ θ′_t(Qz_x)) − (θ_t(Qz_x) − x)
+ λ1 (x ⊙ (1_N − x) ⊙ (1_N − 2x)) + λ2 (x ⊙ f − f)   (12)
where z_x = [x; 1_N − x] ∈ R^{2N} and θ′_t denotes the element-wise derivative of the thresholding
function:
θ′_t(y) = 1 if (t ≥ 1) and (t − 1) ≤ y ≤ t, and θ′_t(y) = 0 otherwise.   (13)
        </p>
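        <p>Equations (12) and (13) translate to NumPy as follows (a sketch under our reconstruction of the gradient above; the derivative at the two kink points is taken to be 1):</p>
        <preformat>
import numpy as np

def grad(x, Q, t, f, lam1=1.0, lam2=1.0):
    n = x.shape[0]
    Qpos, Qneg = Q[:, :n], Q[:, n:]            # Q = [Qpos Qneg]
    z = np.concatenate([x, 1.0 - x])
    y = Q @ z
    res = theta(y, t) - x                      # θ_t(Qz_x) − x
    inside = (t >= 1) * (y >= t - 1.0) * np.less_equal(y, t)
    dtheta = inside.astype(float)              # Equation (13)
    return ((Qpos - Qneg).T @ (res * dtheta) - res
            + lam1 * x * (1.0 - x) * (1.0 - 2.0 * x)
            + lam2 * (x * f - f))
        </preformat>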
        <p>
          We can update x iteratively using, for example, gradient descent or a quasi-Newton method,
to reduce the loss to 0. Here we show an example of an update rule for gradient descent. Let α be
the step size; then the gradient descent update rule is given by:
x ← x − α ∂L(x)/∂x   (14)
Using this update rule we can design an algorithm to find supported models, as shown in
Algorithm 1.
        </p>
        <p>The convergence characteristics of the gradient descent algorithm are well-known [15].
Assuming at least one 0-1 vector representing a supported model exists for Q, all we require for
Algorithm 1 to converge to the supported model is that the initial vector xinit be in the region
surrounding the supported model where the slope points towards the model. When there are
multiple supported models (correspondingly multiple local minima), we expect the algorithm to
converge to different models depending on the choice of initial vector xinit. However, it is often
not known a priori which particular values or regions of xinit lead to which models. Therefore
we implement a uniform sampling strategy, where we sample uniformly from [0, 1], and make
no assumptions with regard to the existence of supported models for Q.</p>
        <p>Depending on the program, an optimal 0-1 vector interpretation may not exist, so we limit the
number of iterations to max_iter before assuming non-convergence. With gradient descent, it
is often time-consuming to reduce the loss function completely to zero. We therefore implement
a "peeking at a solution" heuristic, similar to the one presented in [7], where, while updating
x, we round x to a 0-1 vector to see whether the resulting x is a solution (Lines 7-11). The
output is sensitive to the choice of initial vector xinit, and a poor choice of xinit may result
in non-convergence to optimal solutions. We alleviate this dependency on the initial vector
by introducing the max_retry parameter and changing the initial vector on each try. This
algorithm declares failure to find any supported models (returns FALSE) when both max_retry
and max_iter are exhausted.</p>
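        <p>Putting the sketches together, the following is our reading of the overall search (the pseudocode of Algorithm 1 is garbled in this copy; clipping the iterate to [0, 1] and the parameter defaults are our assumptions):</p>
        <preformat>
import numpy as np

def find_supported_model(Q, t, f, alpha=0.1, eps=1e-4,
                         max_iter=1000, max_retry=100, seed=0):
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    for _ in range(max_retry):                 # restart with a fresh x_init
        x = rng.uniform(0.0, 1.0, size=n)      # uniform sampling of x_init
        for _ in range(max_iter):
            xr = np.round(x)                   # peek at the rounded vector
            if eps >= loss(xr, Q, t, f):       # rounded vector is a solution
                return xr                      # 0-1 supported-model vector
            x = np.clip(x - alpha * grad(x, Q, t, f), 0.0, 1.0)
    return None                                # exhausted: report failure
        </preformat>
        <p>On the matrices of Example 1 (with f set to the zero vector, since Program (6) has no facts), this sketch should return the 0-1 vector for the supported model {r}.</p>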
        <p>An initial implementation of the proposed method was written in Python 3.7. Our experimental
environment is a Linux container limited to 8 virtual CPU cores and 16GB RAM hosted on a
desktop with Ubuntu 18.04, Intel Core i9-9900K 3.6GHz and 64GB RAM.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>In this section, we first compare the performance of our method with the one presented
by Aspis et al. [10] in terms of success rate of finding correct supported models. Then we
qualitatively show the effect of the rounding heuristic by plotting the trajectories of the candidate
interpretation vector.</p>
      <sec id="sec-5-1">
        <title>5.1. Comparing Success Rates with Aspis et al.</title>
        <p>Aspis et al. [10] encode the program reduct into a matrix and employ Newton’s method for
root finding to find fixed points of the program. The gradient is calculated by a Jacobian matrix,
and the thresholding operation is carried out with a parameterized sigmoid. They present two
types of sampling methods for setting the initial vector: uniform sampling, similarly to our
method, where the values are sampled uniformly from [0, 1], and semantic sampling², where the
values are sampled uniformly from [0, ε⊥] ∪ [ε⊤, 1]. Semantic sampling results in an initial
vector that is semantically equivalent to an interpretation.</p>
        <p>
          Firstly, we consider the “N-negative loops” task, which involves programs of the following
form: for 1 ≤ i ≤ N,
p_i ← not q_i
q_i ← not p_i   (15)
For our algorithm, we disabled automatic retry with max_retry = 1, so that the result depends
solely on a single application of the algorithm. Other parameters were set as follows: max_iter =
10³, ε = 10⁻⁴, λ1 = λ2 = 1, α = 10⁻¹.
        </p>
        <p>
          For Aspis et al.’s algorithm, we used the following settings: max_iter = 10³, ε = 10⁻⁴,
λ = 0.5, γ = 0.087.³ We generated programs of the form of Program (15) with N up to 50, and
then applied the algorithms exactly once in each of 1000 trials. We measured the rate of success
of converging to supported models. Figure 2(a) shows our method with uniform initialization
outperforms their uniform sampling method, although the improvement over their semantic
sampling method is marginal.
        </p>
        <p>Secondly, we consider the “choose 1 out of N” task, where the task is to choose exactly 1 out
of N options. The programs in this class have the following form:
p_1 ← not p_2, not p_3, . . . , not p_N
p_2 ← not p_1, not p_3, . . . , not p_N
. . .
p_N ← not p_1, not p_2, . . . , not p_{N−1}   (16)</p>
        <p>²ε⊥ is an upper bound on false values that variables can take, and ε⊤ is a lower bound on true values.</p>
        <p>³We used γ = 0.087, which was their best performing setting, resulting in 89% success and near 100% success
for uniform and semantic sampling, respectively, for N = 1 loop.</p>
        <p>Figure 2: Success rates for (a) the “N-negative loops” task and (b) the “choose 1 out of N” task;
the compared settings are Ours (1-try), Ours (n-try), Aspis+ semantic, and Aspis+ uniform.</p>
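        <p>For reference, the two benchmark families can be generated in the normal-rule format used in the earlier sketches (the format itself is our assumption):</p>
        <preformat>
def negative_loops(n):
    """N negative loops: p_i ← not q_i, q_i ← not p_i (Program (15))."""
    prog = []
    for i in range(1, n + 1):
        prog += [(f"p{i}", [], [f"q{i}"]), (f"q{i}", [], [f"p{i}"])]
    return prog

def choose_one_of(n):
    """choose 1 out of N: p_i ← not p_j for all j ≠ i (Program (16))."""
    return [(f"p{i}", [], [f"p{j}" for j in range(1, n + 1) if j != i])
            for i in range(1, n + 1)]
        </preformat>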
        <p>We generated programs for N between 2 and 9, applied the algorithms using the same
parameters as the “N-negative loops” task, and repeated the process 1000 times for each N.⁴
For our algorithm, we consider the following scenarios: when the algorithm is allowed (i) only
1 try (max_retry = 1, 1-try in Fig. 2(b)), and (ii) retries up to 1000 times (max_retry = 1000,
n-try in Fig. 2(b)).</p>
        <p>From Figure 2(b), we observe that the success rate of our algorithm (1-try) starts to drop
rapidly past N = 5, and approaches the same level of success rate (10-20% range) as Aspis et al.
by N = 7. We can allow multiple retries to help the algorithm find the supported models, as
indicated by the high success rate of the n-try case; however, this means that the run time will
potentially be significantly longer compared to max_retry = 1. In this example, there is a global
constraint that exactly one of {p_1, ..., p_N} is chosen, but neither our method nor Aspis et al.’s
method can capitalize on this information. Handling of global constraints is left for future work.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Effect of the Rounding Heuristic on the Trajectories</title>
        <p>
          Here we qualitatively demonstrate the effect of the rounding heuristic (Lines 7-11 in Algorithm
1) by plotting the trajectories of the interpretation vector. We consider a simple negative loop
program, i.e., Program (15) with N = 1, where we have exactly two supported models {p} and
{q}. The contours in the following figures represent the value of the loss function at each
(p, q)-coordinate. The dots and arrows represent the trajectories of the algorithm starting from
various initial vectors. Note that the lowest points on this landscape are at (p, q) = (1, 0) ({p})
and (p, q) = (0, 1) ({q}).
        </p>
        <p>⁴For each program whose maximum rule length is N − 1, we estimate the upper bound of γ according to their
method, and subtract 10⁻³ from this value and use it as γ. This ensures that γ is smaller than the upper bound.</p>
        <p>From Figure 3(a) we observe that once the algorithm reaches the bottom of the valley, the
updates become smaller and the convergence is slow. From Figure 3(b) we see that the rounding
heuristic (peeking at a solution) avoids the bottom of the valley entirely, and after several iterations
the algorithm jumps straight to the 0-1 interpretation, which makes it converge faster.</p>
        <p>In both cases we see the algorithm converging to the middle point {p = 0.5, q = 0.5} when
starting from initial vectors with (p = q). Currently we do not break this symmetry in the
algorithm, for example by adding a very small noise to the vector during the update, and instead
use a retry function which resets the initial vector to a random value on restart.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work we presented a new method for computing supported models of a normal logic
program, using a matrix representation of the program and a gradient-based search algorithm.
We defined a loss function based on the implementation of the T_P operator by matrix-vector
multiplication with a suitable thresholding function. We incorporated penalty terms into the
loss function to avoid undesirable results. We showed the results of several experiments where
our method showed improvements over an existing method in terms of success rates of finding
correct supported models of normal logic programs.</p>
      <p>Raw computational performance is not the focus of this paper, and we leave thorough
benchmarking for future work. Our method does not compute stable models, except in cases where
supported and stable models coincide. For computing stable models, we may consider using our
method as a preprocessor to the method of [4], where our method can be used to find supported
models as promising candidates, which can then be refined for better performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Sato, Embedding Tarskian semantics in vector spaces, in: The Workshops of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, volume WS-17 of AAAI Workshops, AAAI Press, 2017.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] C. Sakama, K. Inoue, T. Sato, Linear algebraic characterization of logic programs, in: G. Li, Y. Ge, Z. Zhang, Z. Jin, M. Blumenstein (Eds.), Knowledge Science, Engineering and Management, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 520-533. doi:10.1007/978-3-319-63558-3_44.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Y. Aspis, K. Broda, A. Russo, Tensor-based abduction in Horn propositional programs, in: ILP 2018 - 28th International Conference on Inductive Logic Programming, CEUR Workshop Proceedings, 2018, pp. 68-75.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] T. Q. Nguyen, K. Inoue, C. Sakama, Enhancing linear algebraic computation of logic programs using sparse representation, in: F. Ricca, A. Russo, S. Greco, N. Leone, A. Artikis, G. Friedrich, P. Fodor, A. Kimmig, F. A. Lisi, M. Maratea, A. Mileo, F. Riguzzi (Eds.), Proceedings 36th International Conference on Logic Programming (Technical Communications), UNICAL, Rende (CS), Italy, 18-24th September 2020, volume 325 of EPTCS, 2020, pp. 192-205. doi:10.4204/EPTCS.325.24. arXiv:2009.10247.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] T. Sato, C. Sakama, K. Inoue, From 3-valued semantics to supported model computation for logic programs in vector spaces, in: 12th International Conference on Agents and Artificial Intelligence, 2020, pp. 758-765.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. Sato, K. Inoue, C. Sakama, Abducing relations in continuous spaces, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Stockholm, Sweden, 2018, pp. 1956-1962. doi:10.24963/ijcai.2018/270.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] T. Sato, R. Kojima, Logical inference as cost minimization in vector spaces, in: A. El Fallah Seghrouchni, D. Sarne (Eds.), Artificial Intelligence. IJCAI 2019 International Workshops, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 239-255. doi:10.1007/978-3-030-56150-5_12.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Sato, A linear algebraic approach to Datalog evaluation, Theory and Practice of Logic Programming 17 (2017) 244-265. doi:10.1017/S1471068417000023.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. A. Blair, F. Dushin, D. W. Jakel, A. J. Rivera, M. Sezgin, Continuous models of computation for logic programs: importing continuous mathematics into logic programming's algorithmic foundations, in: K. R. Apt, V. W. Marek, M. Truszczynski, D. S. Warren (Eds.), The Logic Programming Paradigm: A 25-Year Perspective, Artificial Intelligence, Springer, Berlin, Heidelberg, 1999, pp. 231-255. doi:10.1007/978-3-642-60085-2_10.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Y. Aspis, K. Broda, A. Russo, J. Lobo, Stable and supported semantics in continuous vector spaces, in: Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning, 2020, pp. 59-68. doi:10.24963/kr.2020/7.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] W. Marek, V. Subrahmanian, The relationship between stable, supported, default and autoepistemic semantics for general logic programs, Theoretical Computer Science 103 (1992) 365-386. doi:10.1016/0304-3975(92)90019-C.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] K. R. Apt, H. A. Blair, A. Walker, Towards a theory of declarative knowledge, in: Foundations of Deductive Databases and Logic Programming, Elsevier, 1988, pp. 89-148.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] H. D. Nguyen, C. Sakama, T. Sato, K. Inoue, An efficient reasoning method on logic programming using partial evaluation in vector spaces, Journal of Logic and Computation 31 (2021) 1298-1316. doi:10.1093/logcom/exab010.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493-2537.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, USA, 2004.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>