=Paper=
{{Paper
|id=Vol-2970/aspocppaper1
|storemode=property
|title=Gradient-Based Supported Model Computation in Vector Spaces
|pdfUrl=https://ceur-ws.org/Vol-2970/aspocppaper1.pdf
|volume=Vol-2970
|authors=Akihiro Takemura,Katsumi Inoue
|dblpUrl=https://dblp.org/rec/conf/iclp/TakemuraI21
}}
==Gradient-Based Supported Model Computation in Vector Spaces==
Gradient-Based Supported Model Computation in Vector Spaces

Akihiro Takemura (1,2), Katsumi Inoue (2,1)
(1) The Graduate University for Advanced Studies, SOKENDAI, Japan
(2) National Institute of Informatics, Japan
Email: atakemura@nii.ac.jp (A. Takemura); inoue@nii.ac.jp (K. Inoue). ORCID: 0000-0003-4130-8311 (A. Takemura); 0000-0002-2717-9122 (K. Inoue).
14th Workshop on Answer Set Programming and Other Computing Paradigms

Abstract. We propose an alternative method for computing supported models of normal logic programs in vector spaces using gradient information. To this end, we first translate the program into a definite program and embed the program into a matrix. We then define a loss function based on the implementation of the immediate consequence operator T_P by matrix-vector multiplication with a suitable thresholding function. We propose an almost everywhere differentiable alternative to the non-linear thresholding operation, and incorporate penalty terms into the loss function to avoid undesirable results. We report the results of several experiments where our method shows improvements over an existing method in terms of success rates of finding correct supported models of normal logic programs.

Keywords: Logic Programming, Supported Model Computation, Differentiable Logic Programming

1. Introduction

Performing logical inference with linear algebraic methods has been studied as an attractive alternative to traditional symbolic inference methods [1, 2, 3]. Linear algebraic approaches to logical inference offer promising characteristics compared to the traditional methods, for example efficiency and scalability: since linear algebra is at the heart of many scientific applications, highly efficient, optimized algorithms already exist for the basic operations, and these new methods can benefit from the high computational power offered by modern systems in the form of multi-core CPUs and GPUs. In this work, we present a gradient-based method for computing supported models in vector spaces.

Sakama et al. introduced matrix representations of definite, disjunctive and normal logic programs, and a linear algebraic method for computing least models and stable models [2]. More recently, Nguyen et al. proposed an improved version of the algorithm that takes advantage of sparsity [4]. Sato et al. proposed a method for computing supported models in vector spaces via 3-valued completion models of a normal logic program [5]. The matrix encoding used in this work is influenced by the aforementioned works, and we extend them by proposing an encoding that is better suited to our goal of computing supported models as fixed points in vector spaces. Another difference is that while our method uses gradient information to find supported models, the previous works use non-differentiable operations. Gradient-based methods have been considered, for example, for abduction [6], and for SAT and probabilistic inference [7]. Our design of the loss function follows the direction of the previous works by Sato [8, 7], and we introduce additional terms for computing supported models.
Although one could compute supported models by obtaining the completion of the program and solving it with a SAT solver, e.g. using the matrix-based SAT solver in [7], our method requires neither completion nor a SAT solver. Blair et al. introduced an extension of 2-valued semantics in vector spaces using a polynomial representation of normal programs, and used gradient descent to solve the dynamical system whose solutions correspond to the supported models [9]. In contrast, we employ a matrix representation of the program, a thresholding function, and gradient descent for minimizing a loss function. The closest work to ours is the method of Aspis et al. [10], as both their algorithm and ours use a gradient-based method to find supported models of normal logic programs in vector spaces. Their method uses a matrix representation of the program reduct and Newton's method for root finding to find fixed points. Some differences are: (i) while our method uses a normal-to-definite program translation and a matrix representation of the program, their method incorporates the notion of reduct into the program matrix; (ii) our loss function includes a thresholding term and additional regularization terms for dealing with fractional assignments and facts, whereas in their method the convergence depends solely on the fixed point criterion. The use of regularization to penalize fractional-valued interpretations is critical to avoid converging to unwanted middle points. Our goal in this paper is to improve upon Aspis et al.'s method by extending Sato et al.'s earlier work on using loss functions for logical inference and incorporating new terms for supported model computation.

The structure of this paper is as follows: Section 2 covers the necessary background and definitions. Section 3 presents a method for representing logic programs with matrices. Section 4 introduces the thresholding function and the loss function, as well as the gradient-based search algorithm for finding supported models. Section 5 presents the results of experiments designed to test the ability of the algorithm and to compare against known results in the literature. Finally, Section 6 presents the conclusion.

2. Background

We consider a language L that contains a finite set of propositional variables defined over a finite alphabet and the logical connectives ¬, ∧, ∨ and ←. The Herbrand base B_P is the set of all propositional variables that appear in a logic program P. A definite program is a set of rules of the form (1) or (2), where h and the b_i are propositional variables (atoms) in L. We refer to (2) as an OR-rule, which is a shorthand for m rules: h ← b_1, h ← b_2, ..., h ← b_m. For each rule r we define head(r) = h and body(r) = {b_1, ..., b_m}. A rule r is referred to as a fact if body(r) = ∅.

    h ← b_1 ∧ b_2 ∧ ... ∧ b_m   (m ≥ 0)   (1)
    h ← b_1 ∨ b_2 ∨ ... ∨ b_m   (m ≥ 0)   (2)

A normal program is a set of rules of the form (3), where h and the b_i are propositional variables in L.

    h ← b_1 ∧ b_2 ∧ ... ∧ b_m ∧ ¬b_{m+1} ∧ ¬b_{m+2} ∧ ... ∧ ¬b_n   (n ≥ m ≥ 0)   (3)

We refer to the positive and negative occurrences of atoms in the body as body^+(r) = {b_1, ..., b_m} and body^-(r) = {b_{m+1}, ..., b_n}, respectively. A normal program is a definite program if body^-(r) = ∅ for every rule r ∈ P. An Herbrand interpretation I of a normal program P is a subset of B_P.
A model M of P is an interpretation of P such that for every rule r ∈ P of the form (3), body^+(r) ⊆ M and body^-(r) ∩ M = ∅ imply h ∈ M. A program is called consistent if it has a model. A supported model M is a model of P such that for every p ∈ M there exists a rule r ∈ P with p = head(r), body^+(r) ⊆ M and body^-(r) ∩ M = ∅ [11, 12].

As we shall show later, in this paper we transform normal logic programs into definite programs in order to search for supported models. Thus we use the following definition of the immediate consequence operator T_P. T_P: 2^{B_P} → 2^{B_P} is a function on Herbrand interpretations. For a definite program P, we have:

    T_P(I) = {h | h ← b_1 ∧ ... ∧ b_m ∈ P and {b_1, ..., b_m} ⊆ I} ∪ {h | h ← b_1 ∨ ... ∨ b_m ∈ P and {b_1, ..., b_m} ∩ I ≠ ∅}.

It is known that a supported model M of a program P is a fixed point of T_P, i.e., T_P(M) = M [11].

Definition 1 (Singly-Defined (SD) program). A normal program P is called an SD program if head(r_1) ≠ head(r_2) for any two rules r_1 and r_2 in P with r_1 ≠ r_2.

Any normal program P can be converted into an SD program P' in the following manner. If there is more than one rule with the same head (h ← body(r_1), ..., h ← body(r_k), where k > 1), then replace them with a set of new rules {h ← d_1 ∨ ... ∨ d_k, d_1 ← body(r_1), ..., d_k ← body(r_k)} containing new atoms {d_1, ..., d_k}. This is a stricter condition than the multiple definitions condition (MD condition) [2]: for any two rules r_1 and r_2 in the program, head(r_1) = head(r_2) implies |body(r_1)| ≤ 1 and |body(r_2)| ≤ 1. Consequently, all SD programs satisfy the MD condition. Unless noted otherwise, we assume all programs in this paper are SD programs.

Given a normal program P, it is transformed into a definite program by replacing the negated literals in rules of the form (3) and rewriting:

    h ← b_1 ∧ b_2 ∧ ... ∧ b_m ∧ b̄_{m+1} ∧ b̄_{m+2} ∧ ... ∧ b̄_n   (n ≥ m ≥ 0)   (4)

where the b̄_j are new atoms associated with the negated b_j. A collection of rules of the form (4) is referred to as the positive form P^+, where B_{P^+} = B_P ∪ {ā | a ∈ B_P}. For transformed rules of the form (4), we refer to {b_1, ..., b_m} as the positive part and {b̄_{m+1}, ..., b̄_n} as the negative part. After transformation, the program contains only rules of the forms (1), (2) or (4). By an interpretation I^+ of P^+, we mean any set of atoms I^+ ⊆ B_{P^+} that satisfies the following condition: for any atom a ∈ B_P, precisely one of a and ā belongs to I^+.
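As an illustration of this translation, the following is a minimal Python sketch (our own addition, not the paper's implementation) that turns a normal program, given as (head, positive body, negative body) triples, into its positive form by introducing a fresh atom for every negated literal; the naming convention "neg_" for the new atoms is an assumption of this sketch.

    def positive_form(program):
        """Translate a normal (SD) program into its positive form P+.

        `program` is a list of rules (head, pos_body, neg_body), where
        pos_body and neg_body are lists of atom names.  Every negated
        literal `not b` is replaced by a fresh atom 'neg_b' (rule form (4)).
        """
        translated = []
        for head, pos_body, neg_body in program:
            new_body = list(pos_body) + ['neg_' + b for b in neg_body]
            translated.append((head, new_body))
        return translated

    # The program used later in Example 1: p <- q,  q <- p & r,  r <- not p
    program = [('p', ['q'], []), ('q', ['p', 'r'], []), ('r', [], ['p'])]
    print(positive_form(program))
    # [('p', ['q']), ('q', ['p', 'r']), ('r', ['neg_p'])]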
3. Representing Logic Programs with Matrices

In this section, we first describe the relationship between the positive forms of normal logic programs and their supported models. Then we describe the matrix encoding of the positive forms, which in turn represents the original normal logic programs in vector spaces.

3.1. Relationship between Positive Forms and Supported Models

Consider the (inconsistent) program p ← ¬p and its positive form p ← p̄. P^+ is a definite program, but in this case it has no models, due to the restriction we place on interpretations: if p ∈ I^+ then p̄ ∉ I^+ and vice versa. The implication is then that there are no fixed points of T_{P^+} for P^+ that satisfy the condition p ∈ I^+ iff p̄ ∉ I^+. On the other hand, when a model M of P exists, we can show that the corresponding M^+ is a model of P^+.

Proposition 1. Let P be a normal program, and let P^+ be its positive form. If M is a model of P, then M' = M ∪ {ā | a ∈ B_P − M} is a model of P^+. Conversely, if M^+ is a model of P^+, then M^+ ∩ B_P is a model of P.

Proof. Follows from the definitions of M' and M^+. Consider M'. Since ā ∉ M' if a ∈ M' and vice versa, for each rule r ∈ P^+, body(r) ⊆ M' implies head(r) = p ∈ M'. Thus, M' is a model of P^+. Now consider M^+. Let K = M^+ ∩ B_P. Given that M^+ is a model of P^+ and a ∈ K iff a ∈ M^+ for a ∈ B_P, for each rule r ∈ P, body^+(r) ⊆ K and body^-(r) ∩ K = ∅ imply head(r) = p ∈ K. Thus, K is a model of P.

Proposition 2. Let M be a supported model of P, and put M' = M ∪ {ā | a ∈ B_P − M}. Then T_{P^+}(M') = M.

Proof. Suppose p ∈ M. Since M is a supported model, there exists a rule r ∈ P such that head(r) = p, body^+(r) ⊆ M and body^-(r) ∩ M = ∅. Correspondingly, there exists a rule r' ∈ P^+ such that head(r') = p, body^+(r') ⊆ M' and body^-(r') ⊆ M'. That is, p ∈ T_{P^+}(M'). Hence, M ⊆ T_{P^+}(M'). Conversely, suppose p ∈ T_{P^+}(M'). Then there exists a rule r' ∈ P^+ such that head(r') = p and body(r') ⊆ M'. Since M' is a model of P^+ by Proposition 1, body(r') ⊆ M' implies head(r') = p ∈ M'. Because p is a positive literal, p ∈ M holds. Hence, T_{P^+}(M') ⊆ M. Therefore, M = T_{P^+}(M').

Proposition 3. Let M^+ be an interpretation of P^+. If T_{P^+}(M^+) = M^+ ∩ B_P holds, then M = M^+ ∩ B_P is a supported model of P.

Proof. Suppose T_{P^+}(M^+) = M^+ ∩ B_P. Because M^+ ∩ B_P recovers the positive literals from M^+, for each p ∈ (M^+ ∩ B_P) there exists a rule r ∈ P such that head(r) = p, body^+(r) ⊆ (M^+ ∩ B_P) and body^-(r) ∩ (M^+ ∩ B_P) = ∅. Thus, M = M^+ ∩ B_P is a supported model of P.

3.2. Matrix Encoding of Logic Programs

We first introduce the matrix and vector notation used in this paper, before showing our method for encoding normal programs into matrices. Matrices and vectors are represented as bold upper-case letters, e.g. M, and lower-case letters, e.g. v, respectively. A 1-vector of length N is represented by 1_N. The indices of the entries of matrices and vectors appear in the subscript; for example, M_ij refers to the element in the i-th row and j-th column of a matrix M, and v_i refers to the i-th element of a column vector v. Let M_{i:} and M_{:j} denote the i-th row slice and the j-th column slice of M, respectively.
We denote the horizontal concatenation of matrices M_1 and M_2 as [M_1 M_2], and the vertical concatenation of column vectors v_1 and v_2 as [v_1; v_2]. Some matrices and vectors in this paper carry a superscript to distinguish them from others, e.g. M^p; this should not be confused with the indices appearing in the subscript: for example, M^p_ij refers to the (i, j)-th entry of M^p.

Let P be a normal program with size |B_P| = N, P^+ its positive form and B_{P^+} the Herbrand base of P^+. Then |B_{P^+}| = 2N, since for every a ∈ B_P we add its negated version ā. To encode P^+ into a program matrix Q, we encode the atoms appearing in the bodies of the rules with one-hot encoding. After performing this encoding for all rules in P^+, we obtain an (N × 2N) binary program matrix Q, which is equivalent to a vertical concatenation of row vectors, each encoding a rule r ∈ P^+ with one-hot encoding.

Definition 2 (Program Matrix). Let P be a normal program with |B_P| = N and P^+ its positive form with |B_{P^+}| = 2N. Then P^+ is represented by a matrix Q ∈ Z^{N×2N} such that for each element Q_ij (1 ≤ i ≤ N, 1 ≤ j ≤ 2N) in Q,
- Q_ij = 1 if atom a_j ∈ B_{P^+} (1 ≤ j ≤ 2N) appears in the body of the rule r_i (1 ≤ i ≤ N);
- Q_ij = 0, otherwise.

The i-th row of Q corresponds to the atom a_i appearing in the head of the rule r_i, and the j-th column corresponds to the atom a_j (1 ≤ j ≤ 2N) appearing in the bodies of the rules r_i (1 ≤ i ≤ N). Atoms that do not appear in the head of any rule in P^+ are encoded as zero-only row vectors in Q. This definition differs from [2] and [10] in that we do not explicitly include ⊤ and ⊥ in the program matrix, and we do not use fractional values to encode long rules. In fact, our encoding is similar to that of [5], except that we do not use a (2N × 2N) program matrix, since we do not encode rules with ā ∈ B_{P^+} in the head. (In [5], the program matrix encodes the completion of P^+.)

Definition 3 (Interpretation Vector). Let P be a definite program and B_P = {a_1, ..., a_N}. Then an interpretation I ⊆ B_P is represented by a vector v = (v_1, ..., v_N)^⊤ ∈ Z^N, where each element v_i (1 ≤ i ≤ N) represents the truth value of the proposition a_i such that v_i = 1 if a_i ∈ I, and v_i = 0 otherwise. We assume propositional variables share a common index, so that v_i corresponds to a_i, and we write var(v_i) = a_i.

Recall that the positive form P^+ of a normal program is a definite program, in which all negated literals in the body are replaced by new atoms, e.g. in (4) ¬b_{m+1} is replaced by b̄_{m+1}. We now extend the definition of interpretation vectors to include the relation between the positive and negative occurrences of atoms, in order to maintain that whenever a ∈ I, ā ∉ I and vice versa.

Definition 4 (Companion Vector). Let B^p_{P^+} ⊆ B_{P^+} denote the positive part of P and B^n_{P^+} ⊆ B_{P^+} the negative part of P, with sizes |B^p_{P^+}| = |B^n_{P^+}| = N. Let v^p ∈ Z^N be a vector representing the truth assignments for a_i ∈ B^p_{P^+}, such that v^p_i = 1 if a_i ∈ I and v^p_i = 0 otherwise. Define the companion vector w ∈ Z^{2N}, representing an interpretation I^+ ⊆ B_{P^+}, as follows: w = [v^p; 1_N − v^p].

4. Gradient Descent for Computing Supported Models

In this section, we first show how the evaluation of the T_P operator can be carried out in vector spaces. Then we introduce our loss function and its gradient, and finally we present an algorithm for finding supported models in vector spaces.

4.1. Computing the T_P Operator in Vector Spaces

Sakama et al. [2] showed that the T_P operator can be computed in vector spaces using a thresholding function, which they called θ-thresholding. Here we introduce a variant of θ-thresholding that accommodates the new program encoding method as well as the differentiability requirement. First we initialize the parameter vector t.

Definition 5 (Parameter Vector t). The set of parameters to the θ-thresholding is represented by a column vector t ∈ Z^N such that for each element t_i (1 ≤ i ≤ N) in t,
- t_i = |body(r_i)| if the rule r_i ∈ P^+ is a conjunctive rule, e.g. (1), (4);
- t_i = 1 if the rule r_i ∈ P^+ is a disjunctive rule, e.g. (2);
- t_i = 0, otherwise.
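To make Definitions 2-5 concrete, here is a small NumPy sketch (an illustration we add here, not the paper's own code) that encodes the three-rule program p ← q, q ← p ∧ r, r ← ¬p used later in Example 1; the column order p, q, r, p̄, q̄, r̄ is an assumption of this sketch.

    import numpy as np

    # Atom order: columns are [p, q, r, p_bar, q_bar, r_bar]; rows are heads p, q, r.
    atoms = ['p', 'q', 'r', 'p_bar', 'q_bar', 'r_bar']

    # Program matrix Q (Definition 2): one row per rule head, one-hot body encoding.
    Q = np.array([
        [0, 1, 0, 0, 0, 0],   # p <- q
        [1, 0, 1, 0, 0, 0],   # q <- p & r
        [0, 0, 0, 1, 0, 0],   # r <- p_bar   (i.e. r <- not p)
    ])

    # Parameter vector t (Definition 5): |body| for conjunctive rules, 1 for OR-rules.
    t = np.array([1, 2, 1])

    # Interpretation vector for {r} (Definition 3) and its companion vector (Definition 4).
    v = np.array([0, 0, 1])
    w = np.concatenate([v, 1 - v])   # [0, 0, 1, 1, 1, 0]
    print(Q @ w)                     # satisfied-body counts per rule: [0, 1, 1]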
In some previous works [2, 13], information about the nature of the rules was also stored in the program matrix Q alongside the atom occurrences: conjunctive rules with |body(r_i)| > 1 had fractional values Q_ij = 1/|body(r_i)|, and disjunctive bodies had integer values Q_ij = 1. Here we follow the approach taken by Sato et al. [5] and only store the occurrences of atoms in P^+ in Q, keeping the supplementary information needed to recover the original program in the parameter vector t.

Definition 6 (Parameterized θ-thresholding). Let w ∈ Z^{2N} be a companion vector representing I^+ ⊆ B_{P^+}. Given a parameter vector t ∈ Z^N, a program matrix Q ∈ Z^{N×2N}, and the vector y = Qw with y ∈ Z^N, we apply the thresholding function element-wise as follows:

    θ_t(y_i) = min(max(0, y_i − (t_i − 1)), 1)   if t_i ≥ 1,
    θ_t(y_i) = 0                                 if t_i < 1.     (5)

This thresholding function resembles "hardtanh", an activation function developed for use in natural language processing [14]. In the original hardtanh function, the range of the linear region is [−1, 1] (Figure 1(a)), whereas here we define the linear region between [t_i − 1, t_i] (Figure 1(b)). This function is almost everywhere differentiable, except at y_i = t_i − 1 and y_i = t_i. The special case t_i < 1 in Equation (5) corresponds to the case t_i = 0, where the head does not appear in the program P^+ and is assumed to be false.

Intuitively, for the head of a conjunctive rule to be true it is sufficient to check whether all literals in the body hold; otherwise the rule evaluates to false. Similarly, for a disjunctive rule it is sufficient to check whether at least one of the literals in the body holds for the head to hold. This is formalized in Proposition 4.

Figure 1: (a) θ(y) where θ = hardtanh. (b) θ(y) where θ = parameterized θ-thresholding with t = 1.
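A direct NumPy rendering of Equation (5) might look as follows; this is our own minimal sketch, vectorized over the rule index i, not the authors' released code.

    import numpy as np

    def theta(y, t):
        """Parameterized theta-thresholding (Equation (5)), applied element-wise.

        y : result of Q @ w, one entry per rule head.
        t : parameter vector from Definition 5 (|body| for AND-rules, 1 for OR-rules, 0 if unused).
        """
        out = np.clip(y - (t - 1), 0.0, 1.0)   # linear region [t-1, t], clipped to [0, 1]
        return np.where(t >= 1, out, 0.0)      # heads that never occur are fixed to false

    # Continuing the encoding sketch above, one thresholded T_P step would be:
    # u = theta(Q @ w, t)   ->  array([0., 0., 1.])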
Proposition 4 (Thresholded T_P Operator). Let P^+ be the positive form of a normal program P and Q ∈ Z^{N×2N} its matrix representation. Let v ∈ Z^N be a vector representing the positive subset I^p ⊆ B^p_{P^+} of an interpretation, i.e. a_i ∈ I^p iff v_i = 1 for the atoms {a_1, ..., a_N}. Let w ∈ Z^{2N} be the companion vector to v. Then z = [u; 1_N − u] ∈ Z^{2N} is a vector representing J = T_P(I) satisfying the condition (a ∈ J iff ā ∉ J), iff u = θ_t(Qw).

Proof. Consider u = θ_t(Qw). For u = (u_1, ..., u_N)^⊤, by the definition of the thresholding function, u_i = 1 (1 ≤ i ≤ N) iff u'_i ≥ t_i in u' = Qw. Take the row slice Q_{i:}; then u'_i = Q_{i:}w = Q_{i1}w_1 + ... + Q_{i,2N}w_{2N}, and u_i = 1 iff u'_i ≥ t_i. Both Q_{i:} and w are 0-1 vectors, so it follows that there are at least t_i indices j with Q_{ij} = w_j = 1 (1 ≤ j ≤ 2N). The first N elements of w represent a_j ∈ I^p ⊆ B^p_{P^+} if w_j = 1, and if a_j ∈ I^p then ā_j ∉ I^n ⊆ B^n_{P^+}, which is maintained through the definition of the companion vector w. 1) For a conjunctive rule a_i ← b_1 ∧ ... ∧ b_l, {b_1, ..., b_l} ⊆ I implies a_i ∈ T_P(I). 2) For an OR-rule a_i ← b_1 ∨ ... ∨ b_l, {b_1, ..., b_l} ∩ I ≠ ∅ implies a_i ∈ T_P(I). a_j ∈ T_P(I) is represented by z_j = 1 (1 ≤ j ≤ 2N). Then by putting J = {var(z_j) | z_j = 1}, J = T_P(I) holds.

Consider J = T_P(I). For v = (v_1, ..., v_N)^⊤ representing I^p ⊆ B^p_{P^+}, w = (v_1, ..., v_N, 1 − v_1, ..., 1 − v_N)^⊤ is a vector representing I ⊆ B_{P^+} if we set I = {var(w_j) | w_j = 1}. u' = Qw is a vector such that u'_i ≥ t_i (1 ≤ i ≤ N) iff var(u'_i) ∈ T_P(I). Define u = (u_1, ..., u_N)^⊤ such that u_i = 1 (1 ≤ i ≤ N) iff u'_i ≥ t_i in Qw, and u_i = 0 otherwise. Define an interpretation J ⊆ B_{P^+} such that it can be partitioned into subsets of positive and negative occurrences of atoms, (J^p ∪ J^n) = J ⊆ B_{P^+}. Since only positive atoms occur in the heads, u represents the positive subset of the interpretation, J^p ⊆ J ⊆ B_{P^+}, by setting J^p = {var(u_i) | u_i = 1} (1 ≤ i ≤ N). If a_i ∈ T_P(I) then u_i = 1, and ā_i ∉ J is represented by 1 − u_i = 0. Conversely, if a_i ∉ T_P(I) then u_i = 0, and 1 − u_i = 1 represents ā_i ∈ J. Thus 1 − u represents J^n ⊆ J ⊆ B_{P^+}. z = [u; 1 − u] is then a vector representing J^p ∪ J^n = J if we set J = {var(z_j) | z_j = 1} (1 ≤ j ≤ 2N). Thus z = [u; 1 − u] represents J = T_P(I) if u = θ_t(Qw).

Example 1. Consider the following program:

    p ← q
    q ← p ∧ r
    r ← ¬p        (6)

This program has one supported (stable) model: {r}. We have B_P = {p, q, r} and B_{P^+} = {p, q, r, p̄, q̄, r̄}, and the matrix representation Q and parameter vector t are:

          p  q  r  p̄  q̄  r̄
    p   ( 0  1  0  0  0  0 )
    Q = q ( 1  0  1  0  0  0 )        t = (1, 2, 1)^⊤     (7)
    r   ( 0  0  0  1  0  0 )

Suppose the assignment v_{{r}} = (0 0 1)^⊤ is given. Then the companion vector w is:

    w = [v_{{r}}; 1_3 − v_{{r}}] = (0 0 1 1 1 0)^⊤     (8)

Compute the matrix-vector product Qw and apply the parameterized thresholding:

    u = θ_t(Qw) = θ_t((0 1 1)^⊤) = (0 0 1)^⊤ = v_{{r}}     (9)

Let z be the companion vector to u, i.e. z = [u; 1 − u]; then we have

    z = (0 0 1 1 1 0)^⊤     (10)

Define J = {var(z_j) | z_j = 1}; then we have J = {r, p̄, q̄}, and J ∩ B_P = {r}.

Proposition 5 (Supported Model Computation with Thresholded T_P). Let v ∈ Z^N be a 0-1 vector representing a subset I^p ⊆ I ⊆ B_{P^+} of an interpretation, and let z = [v; 1_N − v] be its companion vector representing I ⊆ B_{P^+} satisfying (a ∈ I iff ā ∉ I). Given a program matrix Q representing a program P^+ and a thresholding function θ_t parameterized by a vector t, the fixed points of P^+ are represented by 0-1 vectors z^{FP} = [v^{FP}; 1_N − v^{FP}] ∈ Z^{2N} where v^{FP} = θ_t(Q z^{FP}). Then the z^{FP} are vectors representing models M^+ of P^+ satisfying (a ∈ M^+ iff ā ∉ M^+) iff θ_t(Q z^{FP}) = v^{FP}. When such a 0-1 vector z^{FP} exists, M^+ ∩ B_P = M is a supported model of P.

Proof. Let I ⊆ B_{P^+} be a model of P^+, represented by z^{FP}. Consider the two cases (i) T_P(I) = I and (ii) v^{FP} = θ_t(Q z^{FP}). In both cases, by Propositions 2, 3 and 4, if a supported model of P exists, the result holds.

Example 2. Consider Program (6) in Example 1 and the interpretation vector z (10) representing an interpretation J. We show that J = {r, p̄, q̄} is indeed a model of the positive form and corresponds to the supported model of P. Construct the vector v_{{r}} and companion vector w as in (8). By comparing (8) and (10), we see that w = z, and this is consistent with the fixed-point definition J = T_P(J). J ∩ B_P = {r}, which is the supported model of P.
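The fixed-point check of Examples 1 and 2 can be reproduced with a few lines of NumPy; this is an illustrative sketch we add, re-using the Q, t and theta of the earlier sketches, not code from the paper.

    import numpy as np

    # Q and t as in Example 1 (columns ordered p, q, r, p_bar, q_bar, r_bar).
    Q = np.array([[0, 1, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [0, 0, 0, 1, 0, 0]])
    t = np.array([1, 2, 1])

    def theta(y, t):
        return np.where(t >= 1, np.clip(y - (t - 1), 0.0, 1.0), 0.0)

    v = np.array([0, 0, 1])            # candidate assignment {r}
    w = np.concatenate([v, 1 - v])     # companion vector, Equation (8)
    u = theta(Q @ w, t)                # thresholded T_P step, Equation (9)
    print(u)                           # [0. 0. 1.]
    print(np.array_equal(u, v))        # True: {r} is a fixed point, hence a supported model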
4.2. Loss Function for Computing Supported Models

By the fixed-point definition of supported models, a supported model M satisfies v^M = θ_t(Q[v^M; 1_N − v^M]). We now use this definition to design a loss function which can be minimized using gradient descent. Gradient descent is a method for minimizing an objective function (loss function) by updating the parameters in the opposite direction of the gradient of the objective function with respect to the parameters. The size of the update is determined by the gradient and the step size α. Intuitively, this algorithm follows the direction of the slope of the loss surface until it reaches the bottom of a valley (a local minimum).

Let u be an (N × 1) vector representing an assignment to the heads of the rules in P^+. Define w = [u; 1_N − u] as the (2N × 1) interpretation vector that stores assignments to each atom a, ā ∈ B_{P^+}. We define an (N × 1) vector f which stores information about the occurrences of facts in the program P^+. This vector will be used later during the minimization step to ensure that facts are not forgotten.

Definition 7 (Fact Vector f). The set of facts in the program P^+ is represented by a column vector f ∈ Z^N such that for each element f_i (1 ≤ i ≤ N) in f,
- f_i = 1 if the rule r_i is a fact: a_i ←;
- f_i = 0, otherwise.

Definition 8 (Loss Function). Given a program matrix Q, a candidate vector x, a thresholding function θ_t, and constants λ_1 and λ_2, define the loss function as follows:

    L(x) = 1/2 ( ‖θ_t(Q[x; 1_N − x]) − x‖²_F + λ_1 ‖x ⊙ (x − 1_N)‖²_F + λ_2 ‖f − (x ⊙ f)‖²_F )     (11)

where ‖·‖_F denotes the Frobenius norm and ⊙ the element-wise product. The first term is derived directly from the fixed-point definition of supported models, and is 0 if x is a supported model of P^+. The second term, which resembles a regularization term often used in the machine learning literature, is added to penalize candidate vectors x that contain fractional values, and is 0 if and only if x is a 0-1 vector. The third term is 0 if and only if the facts are preserved, and is strictly positive if any part of that assignment is lost, i.e. by assigning 0 (false) to a fact where f_i = 1.

We introduce submatrices Q^p and Q^n of Q that correspond to the positive and negative bodies, respectively, such that Q = [Q^p Q^n] (horizontal concatenation of submatrices). Both Q^p and Q^n have shape (N × N).

Definition 9 (Gradient of the Loss Function). The gradient of the loss function with respect to x is given by:

    ∂L(x)/∂x = (Q^p − Q^n)^⊤ · ( ∂θ_t(Q z_x)/∂x ⊙ (θ_t(Q z_x) − x) ) − (θ_t(Q z_x) − x)
               + λ_1 (x ⊙ (1_N − x) ⊙ (1_N − 2x)) + λ_2 (x ⊙ f − f)     (12)

where z_x = [x; 1_N − x] ∈ R^{2N} and

    ∂θ_t(w_i)/∂x_i = 1 if (t_i ≥ 1) and (t_i − 1) ≤ w_i ≤ t_i, and 0 otherwise.     (13)

We can update x iteratively using, for example, gradient descent or a quasi-Newton method, to reduce the loss to 0. Here we show an example update rule for gradient descent. Let α be the step size; then the gradient descent update rule is given by:

    x_new ← x − α ∂L(x)/∂x     (14)

Using this update rule we can design an algorithm to find supported models, as shown in Algorithm 1. The convergence characteristics of the gradient descent algorithm are well known [15]. Assuming at least one 0-1 vector representing a supported model exists for Q, all we require for Algorithm 1 to converge to the supported model is that the initial vector x_init lies in the region surrounding the supported model where the slope points towards the model. When there are multiple supported models (correspondingly, multiple local minima), we expect the algorithm to converge to different models depending on the choice of the initial vector x_init. However, it is often not known a priori which particular values or regions of x_init lead to which models. Therefore we implement a uniform sampling strategy, where we sample uniformly from [0, 1], and make no assumptions regarding the existence of supported models for Q. Depending on the program, an optimal 0-1 vector interpretation may not exist, so we limit the number of iterations to max_iter before assuming non-convergence.
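The following NumPy sketch shows one way Definitions 8 and 9 and update (14) might be implemented; it is our own illustration under the assumptions of the earlier sketches (theta as above, lam1/lam2 standing for λ_1/λ_2), not the authors' released code.

    import numpy as np

    def theta(y, t):
        return np.where(t >= 1, np.clip(y - (t - 1), 0.0, 1.0), 0.0)

    def theta_grad(y, t):
        # Equation (13): 1 inside the linear region [t-1, t] (when t >= 1), else 0.
        return np.where((t >= 1) & (y >= t - 1) & (y <= t), 1.0, 0.0)

    def loss(x, Q, t, f, lam1=1.0, lam2=1.0):
        # Equation (11): fixed-point term + 0-1 penalty + fact-preservation penalty.
        z = np.concatenate([x, 1 - x])
        fp = theta(Q @ z, t) - x
        return 0.5 * (fp @ fp
                      + lam1 * np.sum((x * (x - 1)) ** 2)
                      + lam2 * np.sum((f - x * f) ** 2))

    def grad(x, Q, t, f, lam1=1.0, lam2=1.0):
        # Equation (12), with Q = [Qp Qn] split into positive/negative body parts.
        n = x.shape[0]
        Qp, Qn = Q[:, :n], Q[:, n:]
        z = np.concatenate([x, 1 - x])
        y = Q @ z
        fp = theta(y, t) - x
        return ((Qp - Qn).T @ (theta_grad(y, t) * fp) - fp
                + lam1 * x * (1 - x) * (1 - 2 * x)
                + lam2 * (x * f - f))

    # One gradient-descent step (Equation (14)) with step size alpha:
    # x = x - alpha * grad(x, Q, t, f)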
With gradient descent it is often time consuming to reduce the loss function completely to zero. We therefore implement a "peeking at a solution" heuristic, similar to the one presented in [7]: while updating x, we round x to a 0-1 vector and check whether the resulting x_r is a solution (Lines 7-11 of Algorithm 1). The output is sensitive to the choice of the initial vector x_init, and a poor choice of x_init may result in non-convergence to optimal solutions. We alleviate this dependency on the initial vector by introducing the max_retry parameter and changing the initial vector on each try. The algorithm declares failure to find any supported models (returns FALSE) when both max_retry and max_iter are exhausted.

Algorithm 1: Gradient descent search of supported models
Input: program matrix Q, thresholding parameter t, initial vector x_init, max_retry ≥ 1, max_iter ≥ 1, ε > 0, step size α > 0, λ_1 > 0, λ_2 > 0
Output: supported model x, or FALSE
 1: x ← x_init
 2: for n_try ← 1 to max_retry do
 3:   if n_try > 1 then
 4:     x ← random vector
 5:   end if
 6:   for n_iter ← 1 to max_iter do
 7:     x_r ← round(x)             (rounding heuristic)
 8:     loss ← L(x_r)              (loss function, see Def. 8)
 9:     if loss ≤ ε then
10:       x ← x_r
11:       return x
12:     else
13:       gradient ← ∂L(x)/∂x      (gradient, see Def. 9)
14:       x ← x − α · gradient     (gradient update)
15:     end if
16:   end for
17: end for
18: return FALSE

An initial implementation of the proposed method was made in Python 3.7. Our experimental environment is a Linux container limited to 8 virtual CPU cores and 16GB RAM, hosted on a desktop with Ubuntu 18.04, an Intel Core i9-9900K 3.6GHz CPU and 64GB RAM.

5. Experiments

In this section, we first compare the performance of our method with the one presented by Aspis et al. [10] in terms of the success rate of finding correct supported models. Then we qualitatively show the effect of the rounding heuristic by plotting the trajectories of the candidate interpretation vector.

5.1. Comparing Success Rates with Aspis et al.

Aspis et al. [10] encode the program reduct into a matrix and employ Newton's method for root finding to find fixed points of the program. The gradient is calculated via a Jacobian matrix, and the thresholding operation is carried out with a parameterized sigmoid. They present two types of sampling methods for setting the initial vector: uniform sampling, similar to our method, where the values are sampled uniformly from [0, 1], and semantic sampling, where the values are sampled uniformly from [0, γ⊥] ∪ [γ⊤, 1] (γ⊥ is an upper bound on false values that variables can take, and γ⊤ is a lower bound on true values). Semantic sampling results in an initial vector that is semantically equivalent to an interpretation.

Firstly we consider the "N-negative loops" task, which involves programs of the following form: for 1 ≤ i ≤ N,

    p_i ← not q_i
    q_i ← not p_i     (15)

For our algorithm, we disabled automatic retry with max_retry = 1, so that the result depends solely on a single application of the algorithm. The other parameters were set as follows: max_iter = 10^3, ε = 10^-4, λ_1 = λ_2 = 1, α = 10^-1. For Aspis et al.'s algorithm we used the following settings: max_iter = 10^3, ε = 10^-4, γ = 0.5, and their sigmoid parameter set to 0.087, which was their best performing setting, resulting in 89% and near 100% success for uniform and semantic sampling, respectively, for N = 1 loop. We generated programs of the form (15) with N up to 50, applied the algorithms exactly once, and repeated this 1000 times. We measured the rate of success of converging to supported models. Figure 2(a) shows that our method with uniform initialization outperforms their uniform sampling method, although the improvement over their semantic sampling method is marginal.
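For reference, a small helper like the following (our own sketch, using the column convention p_1..p_N, q_1..q_N followed by their negated copies in the same order) can generate the program matrix and parameter vector for an N-negative-loops instance of the form (15).

    import numpy as np

    def n_negative_loops(n):
        """Encode the N-negative-loops program (15) as (Q, t) under Definition 2.

        Atom order: p_1..p_n, q_1..q_n, then their negations, so column
        n_atoms + j is the negated copy of atom j.
        """
        n_atoms = 2 * n                       # p_i and q_i
        Q = np.zeros((n_atoms, 2 * n_atoms), dtype=int)
        for i in range(n):
            p, q = i, n + i                   # row/column indices of p_i and q_i
            Q[p, n_atoms + q] = 1             # p_i <- not q_i
            Q[q, n_atoms + p] = 1             # q_i <- not p_i
        t = np.ones(n_atoms, dtype=int)       # every rule has a single body literal
        return Q, t

    Q, t = n_negative_loops(2)
    print(Q.shape)   # (4, 8)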
Secondly we consider the "choose 1 out of N" task, where the task is to choose exactly 1 out of N options. The programs in this class have the following form:

    p_1 ← not p_2, not p_3, ..., not p_N
    p_2 ← not p_1, not p_3, ..., not p_N
    ...                                         (16)
    p_N ← not p_1, not p_2, ..., not p_{N-1}

Figure 2: (a) Success rate of converging to a model on the N-negative loops task. (b) Success rate of converging to a model on the N-choices task.

We generated programs for N between 2 and 9 and applied the algorithms using the same parameters as in the N-negative loops task, repeating the process 1000 times for each N. (For each program, whose maximum rule length is N − 1, we estimate the upper bound of Aspis et al.'s sigmoid parameter according to their method and subtract 10^-3 from this value; this ensures the parameter stays below the upper bound.) For our algorithm, we consider the following scenarios: the algorithm is allowed (i) only 1 try (max_retry = 1, "1-try" in Fig. 2(b)), and (ii) up to 1000 retries (max_retry = 1000, "n-try" in Fig. 2(b)). From Figure 2(b) we observe that the success rate of our algorithm (1-try) starts to drop rapidly past N = 5, and approaches the same level of success (the 10-20% range) as Aspis et al. by N = 7. We can allow multiple retries to help the algorithm find the supported models, as indicated by the high success rate of the n-try case; however, this means that the run time can be significantly longer than with max_retry = 1. In this example there is a global constraint that exactly one of {p_1, ..., p_N} is chosen, but neither our method nor Aspis et al.'s method can capitalize on this information. Handling of global constraints is left for future work.

5.2. Effect of the Rounding Heuristic on the Trajectories

Here we qualitatively demonstrate the effect of the rounding heuristic (Lines 7-11 in Algorithm 1) by plotting the trajectories of the interpretation vector. We consider a simple negative loop program, i.e., program (15) with N = 1, where we have exactly two supported models {p} and {q}. The contours in the following figures represent the value of the loss function at each (p, q)-coordinate. The dots and arrows represent the trajectories of the algorithm starting from various initial vectors. Note that the lowest points on this landscape are at (p, q) = (1, 0), i.e. {p}, and (p, q) = (0, 1), i.e. {q}.

Figure 3: (a) Trajectories of various attempts without rounding. (b) Trajectories of various attempts with the rounding heuristic.

From Figure 3(a) we observe that once the algorithm reaches the bottom of the valley, the updates become smaller and convergence is slow.
From Figure 3(b) we see that the rounding heuristic (peeking at a solution) avoids the bottom of the valley entirely, and after several iterations the algorithm jumps straight to the 0-1 interpretation, which makes it converge faster. In both cases we see the algorithm converging to the middle point (p, q) = (0.5, 0.5) when starting from initial vectors with p = q. Currently we do not break this symmetry in the algorithm, for example by adding a very small amount of noise to the vector during the update; instead we rely on the retry function, which resets the initial vector to a random value on restart.

6. Conclusion

In this work we presented a new method for computing supported models of a normal logic program, using a matrix representation of the program and a gradient-based search algorithm. We defined a loss function based on the implementation of the T_P operator by matrix-vector multiplication with a suitable thresholding function. We incorporated penalty terms into the loss function to avoid undesirable results. We showed the results of several experiments where our method showed improvements over an existing method in terms of the success rate of finding correct supported models of normal logic programs.

Raw computational performance is not the focus of this paper, and we leave thorough benchmarking for future work. Our method does not compute stable models, except in cases where supported and stable models coincide. For computing stable models, we may consider using our method as a preprocessor to the method of [4]: our method can be used to find supported models as promising candidates, which can then be refined for better performance.

References

[1] T. Sato, Embedding Tarskian semantics in vector spaces, in: The Workshops of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, volume WS-17 of AAAI Workshops, AAAI Press, 2017.
[2] C. Sakama, K. Inoue, T. Sato, Linear algebraic characterization of logic programs, in: G. Li, Y. Ge, Z. Zhang, Z. Jin, M. Blumenstein (Eds.), Knowledge Science, Engineering and Management, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 520-533. doi:10.1007/978-3-319-63558-3_44.
[3] Y. Aspis, K. Broda, A. Russo, Tensor-based abduction in Horn propositional programs, in: ILP 2018 - 28th International Conference on Inductive Logic Programming, CEUR Workshop Proceedings, 2018, pp. 68-75.
[4] T. Q. Nguyen, K. Inoue, C. Sakama, Enhancing linear algebraic computation of logic programs using sparse representation, in: F. Ricca, A. Russo, S. Greco, N. Leone, A. Artikis, G. Friedrich, P. Fodor, A. Kimmig, F. A. Lisi, M. Maratea, A. Mileo, F. Riguzzi (Eds.), Proceedings 36th International Conference on Logic Programming (Technical Communications), ICLP 2020, Rende (CS), Italy, 18-24th September 2020, volume 325 of EPTCS, 2020, pp. 192-205. doi:10.4204/EPTCS.325.24.
[5] T. Sato, C. Sakama, K. Inoue, From 3-valued semantics to supported model computation for logic programs in vector spaces, in: 12th International Conference on Agents and Artificial Intelligence, 2020, pp. 758-765.
[6] T. Sato, K. Inoue, C. Sakama, Abducing relations in continuous spaces, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Stockholm, Sweden, 2018, pp. 1956-1962. doi:10.24963/ijcai.2018/270.
[7] T. Sato, R. Kojima, Logical inference as cost minimization in vector spaces, in: A. El Fallah Seghrouchni, D. Sarne (Eds.), Artificial Intelligence. IJCAI 2019 International Workshops, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 239-255. doi:10.1007/978-3-030-56150-5_12.
[8] T. Sato, A linear algebraic approach to Datalog evaluation, Theory and Practice of Logic Programming 17 (2017) 244-265. doi:10.1017/S1471068417000023.
[9] H. A. Blair, F. Dushin, D. W. Jakel, A. J. Rivera, M. Sezgin, Continuous models of computation for logic programs: Importing continuous mathematics into logic programming's algorithmic foundations, in: K. R. Apt, V. W. Marek, M. Truszczynski, D. S. Warren (Eds.), The Logic Programming Paradigm: A 25-Year Perspective, Artificial Intelligence, Springer, Berlin, Heidelberg, 1999, pp. 231-255. doi:10.1007/978-3-642-60085-2_10.
[10] Y. Aspis, K. Broda, A. Russo, J. Lobo, Stable and supported semantics in continuous vector spaces, in: Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning, 2020, pp. 59-68. doi:10.24963/kr.2020/7.
[11] W. Marek, V. Subrahmanian, The relationship between stable, supported, default and autoepistemic semantics for general logic programs, Theoretical Computer Science 103 (1992) 365-386. doi:10.1016/0304-3975(92)90019-C.
[12] K. R. Apt, H. A. Blair, A. Walker, Towards a theory of declarative knowledge, in: Foundations of Deductive Databases and Logic Programming, Elsevier, 1988, pp. 89-148.
[13] H. D. Nguyen, C. Sakama, T. Sato, K. Inoue, An efficient reasoning method on logic programming using partial evaluation in vector spaces, Journal of Logic and Computation 31 (2021) 1298-1316. doi:10.1093/logcom/exab010.
[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493-2537.
[15] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, USA, 2004.