CUD@ASP: Experimenting with GPGPUs in ASP solving (CEUR-WS Vol-1068, https://ceur-ws.org/Vol-1068/paper-l11.pdf)
CUD@ASP: Experimenting with GPGPUs in ASP solving

                  Flavio Vella¹, Alessandro Dal Palù², Agostino Dovier³,
                         Andrea Formisano¹, and Enrico Pontelli⁴

                  ¹ Dip. di Matematica e Informatica, Univ. di Perugia
                  ² Dip. di Matematica, Univ. di Parma
                  ³ Dip. di Matematica e Informatica, Univ. di Udine
                  ⁴ Dept. of Computer Science, NMSU



        Abstract. This paper illustrates the design and implementation of a prototype
        ASP solver that is capable of exploiting the parallelism offered by general pur-
        pose graphical processing units (GPGPUs). The solver is based on a basic conflict-
        driven search algorithm. The core of the solving process runs on the CPU,
        while most of the activities, such as literal selection, unit propagation, and
        conflict analysis, are delegated to the GPU. Moreover, a deep non-deterministic search,
        involving a very large number of threads, is also delegated to the GPU. The initial
        results confirm the feasibility of the approach and the potential offered by GPUs
        in the context of ASP computations.


1     Introduction

Answer Set Programming (ASP) [22, 20] has gained momentum in the logic program-
ming and artificial intelligence communities as a paradigm of choice for a variety of ap-
plications. In comparison to other non-monotonic logics and knowledge representation
frameworks, ASP is syntactically simpler and, at the same time, very expressive. The
mathematical foundations of ASP have been extensively studied; in addition, there exist
a large number of building block results about specifying and programming using ASP.
ASP has offered novel and highly declarative solutions in several application areas, in-
cluding intelligent agents, planning, software verification, complex systems diagnosis,
semantic web services composition and monitoring, and phylogenetic inference.
    An important push towards the popularity of ASP has come from the development
of very efficient ASP solvers, such as CLASP and DLV. In particular, systems like CLASP
and its variants have been shown to be competitive with the state of the art in several
domains, including SAT solving competitions. In spite of
the efforts in developing fast execution models for ASP, execution of large programs re-
mains a challenging task, limiting the scope of applicability of ASP in certain domains
(e.g., planning). In this work, we offer parallelism as a viable approach to enhance
performance of ASP inference engines. In particular, we are interested in devising tech-
niques that can take advantage of recent architectural developments in the field of Gen-
eral Purpose Graphical Processing Units (GPGPUs). Modern GPUs are multi-core
platforms, offering massive levels of parallelism; vendors like NVIDIA have started
    Research partially supported by the GNCS-13 project.
 supporting the use of GPUs for applications different from graphical operations, provid-
 ing dedicated APIs and development environments. Languages and language extensions
 like OpenCL [16] and CUDA [29] support the development of general purpose applica-
 tions on GPUs, beyond the limitations of graphical APIs. To the best of our knowledge,
 the use of GPUs for ASP computations has not been explored and, as demonstrated in
 this paper, it opens an interesting set of possibilities and issues to be resolved.
     The work proposed in this paper builds on two existing lines of research. The ex-
 ploitation of parallelism from ASP computations has been explored in several research
 works, starting with seminal papers by Pontelli et al. and Finkel et al. [25, 9], and
 later continued in several other projects (e.g., [26, 12, 24]). Most of the existing pro-
 posals have primarily focused on parallelization of the search process underlying the
 construction of answer sets, by distributing parts of the search tree among different
 processors/cores; furthermore, the literature focused on parallelization on traditional
 multi-core or Beowulf architectures. These approaches are not applicable in the con-
 text of GPGPUs—the models of parallelization used on GPGPUs are deeply different
 (e.g., GPGPUs are designed to operate with large number of threads, operating in a syn-
 chronous way; GPGPUs have significantly more complex memory organizations, that
 have great impact on parallel performance) and existing parallel ASP models are not
 scalable to GPGPUs. Furthermore, our focus in this work is not primarily on search
 parallelism, but on the parallelization of the various operations associated with unit
 propagation and the management of nogoods.
     The second line of research that supports the effort proposed in this paper is the
 recent developments in the area of GPGPUs for SAT solving and constraint program-
 ming. The work in [6] illustrates how to parallelize the search process employed by
 the DPLL procedure in solving a SAT problem on GPGPUs; the outcomes demonstrate
 the potential benefit of delegating to GPGPUs the tails of the branches of the search
 tree—an idea that we have also applied in the work presented in this paper. Several
 other proposals have appeared in the literature suggesting the use of GPGPUs to par-
 allelize parts of the SAT solving process—e.g., the computation of variable heuristics
 [18]. The work presented in [4] provides a preliminary investigation of parallelization
 of constraint solving (applied to the specific domain of protein structure prediction) on
 GPGPUs. The work we performed in [4] provided inspiration for the ideas used in this
 paper to parallelize unit propagation and other procedures.
     The main contribution of the research presented in this paper is the analysis of a
 state of the art algorithm for answer set computation (i.e., the algorithm underlying
 CLASP ) to identify potential sources of parallelism that are suitable to the peculiar par-
 allel architecture provided by CUDA.

 2     Background
 2.1   Answer Set Programming
 Syntax. In this section we will briefly review the foundations of ASP, starting with its
 syntax. Let us consider a language composed of a set of propositional symbols (atoms)
 P. An ASP rule has the form
                         p0 ← p1 , . . . , pm , not pm+1 , . . . , not pn               (1)
 where pi ∈ P.5 Given a rule r of type (1), p0 is referred to as the head of the rule
 (head(r)), while the set of atoms {p1 , . . . , pm , not pm+1 , . . . , not pn } is referred to as
 the body of the rule (body(r)). In particular, body + (r) = {p1 , . . . , pm } and body − (r) =
 {pm+1 , . . . , pn }. We identify particular types of rules: a constraint is a rule of the form

                               ← p1 , . . . , pm , not pm+1 , . . . , not pn                       (2)

 while a fact is a rule of the form p0 ←. A program Π is a collection of ASP rules.
 We will use the following notation: atom(Π) denotes the set of all atoms present in Π,
 while bodyΠ (p) denotes the set {body(r) | r ∈ Π, head(r) = p}.
     Let Π be a program; its positive dependence graph D⁺Π = (V, E) is a directed graph
 satisfying the following properties:
   - the set of nodes is V = atom(Π);
   - E = {(p, q) | r ∈ Π, head(r) = p, q ∈ body⁺(r)}.
 In particular, we are interested in recognizing cycles in D⁺Π; the number of non-self
 loops in D⁺Π is denoted by loop(Π). A program Π is tight (non-tight) if loop(Π) = 0
 (loop(Π) > 0). A strongly connected component (scc) of D⁺Π is a maximal subgraph
 X of D⁺Π such that there exists a path between each pair of nodes in X.

 Semantics. The semantics of ASP programs is provided in terms of answer sets. Intu-
 itively, an answer set is a minimal model of the program which supports each atom in
 the model—i.e., for each atom there is a rule in the program that has such atom in the
 head and whose body is satisfied by the model. Formally, a set of atoms M is an answer
 set of a program Π if M is the minimal model of the reduct program Π M , where the
 reduct is obtained from Π as follows:
    - remove from Π all rules r such that M ∩ body⁻(r) ≠ ∅;
    - remove all negated atoms from the remaining rules.
 Π M is a definite program, i.e., a set of rules that does not contain any occurrence of
 not. Definite programs are characterized by the fact that they admit a unique minimal
 model. Each answer set of a program Π is, in particular, a minimal model of Π.
 Example 1. The following program Π has two answer sets: {a, c} and {a, d}.

                          a ←            c ← a, not d          e ← b
                  Π =
                          b ← not a      d ← not c, not e      e ← e

 Answer Set Computation. In the rest of this section, we provide a brief overview
 of techniques used in the computation of the answer sets of a program; the mate-
 rial presented is predominantly drawn from the implementation techniques used in
 CLASP [11, 10].
     Several ASP solvers rely directly or indirectly on techniques drawn from the domain
 of SAT solving, properly extended to include procedures to determine minimality and
 stability of the models (these two tests can be performed in time linear
 in the number of occurrences of atoms in the program, namely |Π|). Several ASP
  5
      A rule that includes first-order atoms with variables is simply seen as a syntactic sugar for all
      its ground instances.
 solvers (e.g., CMODELS [13]) rely on a translation of Π into a SAT problem and on
 the use of SAT solvers to determine putative answer sets. Other systems (e.g., CLASP)
 implement native ASP solvers, that combine search techniques with backjumping along
 with techniques drawn from the field of constraint programming [27].
     The CLASP system relies on a search in the space of all truth value assignments to
 the atoms in Π, organized as a binary tree. The successful construction of a branch in
 the tree corresponds to the identification of an answer set of the program. If a, possibly
 partial, assignment fails to satisfy the rules in the program, then backjumping proce-
 dures are used to backtrack to the node in the tree that caused the failure. The design
 of the tree construction and the backjumping procedure in CLASP is implemented in
 such a way to guarantee that if a branch is successfully constructed, then the outcome
 is indeed an answer set of the program. CLASP’s search is also guided by special as-
 signments of truth values to subsets of atoms that are known not to be extendable into
 an answer set—these are referred to as nogoods [7, 27]. Assignments and nogoods are
 sets of assigned atoms—i.e., entities of the form T p (F p) denoting that p has been
 assigned true (false). For an assignment it is also required that, for each atom p,
 it contains at most one of T p and F p. Given an assignment A, we denote
 AT = {p | T p ∈ A} and AF = {p | F p ∈ A}. A is total if it assigns a truth value to
 every atom, otherwise it is partial. Given a (possibly partial) assignment A and a nogood
 δ, we say that δ is violated if δ ⊆ A. In turn, a partial assignment A is a solution for a
 set of nogoods ∆ if no δ ∈ ∆ is violated by A.
     The concept of nogood can be also used during deterministic propagation phases
 (a.k.a. unit propagation) to determine additional assignments. Given a nogood δ and a
 partial assignment A such that δ \A = {F p} (δ \A = {T p}), then we can infer the need
 to add T p (F p) to A in order to avoid violation of δ. In the context of ASP computation,
 we distinguish two types of nogoods: completion nogoods [8], which are derived from
 Clark’s completion of a logic program (we will denote with ∆Πcc the set of completion
 nogoods for the program Π), and loop nogoods [17], which are derived from the loop
 formula of Π (denoted by ΛΠ ). Before proceeding with the formal definitions of these
 two classes of nogoods, let us review the two fundamental results associated to them
 (see [10]). Let Π be a program and A an assignment:

  – If Π is a tight program then: atom(Π) ∩ AT is an answer set of Π iff A satisfies
    all the nogoods in ∆Πcc .
  – If Π is a non-tight program, then: atom(Π) ∩ AT is an answer set of Π iff A
    satisfies all the nogoods in ∆Πcc ∪ ΛΠ .

 Let us now proceed with the formal definitions of the nogoods. We start by recalling
 the notion of Clark completion of Π (denoted Πcc):

       Πcc = { βr ↔ ⋀a∈body⁺(r) a ∧ ⋀b∈body⁻(r) ¬b | r ∈ Π } ∪
             { p ↔ ⋁r∈bodyΠ(p) βr | p ∈ atom(Π) }                                  (3)

 where βr is a new variable, introduced for each rule r ∈ Π, logically equivalent to the
 body of r. Assignments need to deal with the βr variables as well. The completion nogoods
 reflect the structure of the implications present in the definition of Πcc. In particular:
   • the implication present in the original rule p ← body(r) implies the nogood {F βr }∪
       {T a | a ∈ body + (r)} ∪ {F b | b ∈ body − (r)}.
    • the implication in each rule also implies that the body should be false if any of its
        elements is falsified, leading to the set of nogoods of the form {T βr , F a} for each
        a ∈ body⁺(r) and {T βr , T b} for each b ∈ body⁻(r).
    • the closure of an atom definition (as the disjunction of the rule bodies supporting
        it) leads to nogoods expressing that the atom must be true if any of its bodies is
        true: {F p, T βr } for each r ∈ bodyΠ (p).
   • similarly, the atom cannot be true if all its rules have a false body. This yields the
       nogood {T p} ∪ {F βr | r ∈ bodyΠ (p)}.
 ∆Πcc is the set of all the nogoods defined as above.
      The loop nogoods derive instead from the need to capture loop formulae, thus
 avoiding cyclic support of truth. Let us provide some preliminary definitions. Given
 a set of atoms U , we define the external bodies of U (denoted by EBΠ (U )) as the set
 {βr | r ∈ Π, head(r) ∈ U, body⁺(r) ∩ U = ∅}. Furthermore, let us define U to be an
 unfounded set with respect to an assignment A if, for each rule r ∈ Π, we have (i)
 head(r) ∉ U , or (ii) body(r) is falsified by A, or (iii) body⁺(r) ∩ U ≠ ∅. The loop
 nogoods capture the
 fact that, for each unfounded set U , its elements have to be false. This is encoded by the
 following nogoods: for each set of atoms U and for each p ∈ U , we create the nogood
 {T p} ∪ {F βr | βr ∈ EBΠ (U )}. We denote with ΛΠ the set of all loop nogoods, and
 with ∆Π the whole set of nogoods: ∆Π = ∆Πcc ∪ ΛΠ .


 2.2   CUDA

     Our proposal focuses on exploring the use of GPGPU parallelism in ASP solving.
 GPGPU is a general term indicating the use of the multicores available within modern
 graphical processing units (GPUs) for general purpose parallel computing. NVIDIA is
 one of the pioneering manufacturers in promoting GPGPU computing, especially thanks
 to its Compute Unified Device Architecture (CUDA) [29]. The underlying conceptual
 model of parallelism supported by CUDA is Single-Instruction Multiple-Thread (SIMT),
 a variant of the SIMD model where, in general, the same instruction is executed by
 different threads that run on identical cores, while data and operands may differ from
 thread to thread. CUDA's architectural model is represented in Figure 1.

 [Fig. 1: CUDA Logical Architecture. The device is organized as a grid of blocks; each
 block contains threads with private registers and a block-level shared memory, while
 global and constant memory are accessible to all threads and to the host.]
 Different NVIDIA GPUs are distinguished by the number of cores, their organization,
 and the amount of memory available. The GPU is composed of a series of Streaming
 MultiProcessors (SMs); the number of SMs depends on the specific characteristics of
 each family of GPU—e.g., the Fermi architecture provides 16 SMs. In turn, each SM
 contains a collection of computing cores; the number of cores per SM may range from
 8 (in the older G80 platforms) to 32 (e.g., in the Fermi platforms). Each GPU provides
 access to both on-chip memory (used for thread registers and shared memory—defined
 later) and off-chip memory (used for L2 cache, global memory, and constant memory).
 Notice that the architecture of the GPU also determines both the GPU clock and the
 memory clock rates. CUDA introduces a logical view of computations, in order to
 describe parallel work abstractly and to schedule it across different hardware configura-
 tions (see Figure 1). A typical CUDA program is a C/C++ program that includes parts
 meant for execution on the CPU (referred to as the host) and parts meant for parallel
 execution on the GPU (referred as the device). A parallel computation is described by a
 collection of kernels—each kernel is a function to be executed by several threads.

 The host program contains all the instructions needed to initialize the data on the GPU,
 to define the number of threads, and to launch the kernels. A kernel, instead, is a set of
 instructions performed on the GPU across a set of concurrent threads. The programmer
 or compiler organizes these threads into thread blocks and grids of thread blocks. A
 grid is an array of thread blocks that execute the same kernel, read their input from
 global memory, and write their results to global memory. Each thread within a thread
 block executes an instance of the kernel and has a thread ID within its block. When a
 CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are
 enumerated and distributed to multiprocessors with available execution capacity; the
 kernel is executed in N blocks, each consisting of M threads. The threads in the same
 block can share data using high-throughput on-chip shared memory; on the other hand,
 threads belonging to different blocks can only share data through global memory. Thus,
 the block size allows the programmer to define the granularity of thread cooperation.
 Figure 1 shows the CUDA thread hierarchy [23].

 [Fig. 2: Generic workflow in CUDA]
 CUDA provides an API to interact with the GPU, and C for CUDA, an extension of the
 C language used to define kernels. Referring to Figure 2, a typical CUDA application
 can be summarized as follows:
 Memory data allocation and transfer: Before being processed by kernels, the data must
 be allocated in and transferred to global memory. The CUDA API supports these operations
 through the functions cudaMalloc() and cudaMemcpy(). The call cudaMalloc()
 allows the programmer to allocate the space needed to store the data, while the call
 cudaMemcpy() transfers the data from the memory of the host to the space previously
 allocated in global memory, or vice versa. The transfer rate depends on the bandwidth
 of the bus to which the graphics card is physically connected.
 Kernels definition: Kernels are defined as standard C functions; the annotation used to
 communicate to the CUDA compiler that a function should be treated as a kernel has
 the form __global__ void kernelName(Formal Arguments), where __global__
 is the qualifier that tells the compiler that the function body is kernel code.
 Kernels execution: A kernel can be launched from the host program using the
     kernelName <<< GridDim, ThreadsPerBlock >>> (Actual Arguments)
 execution configuration syntax, where kernelName is the name specified in the kernel
 function prototype, GridDim is the number of blocks in the grid, and ThreadsPerBlock
 specifies the number of threads in each block. Finally, the Actual Arguments are typ-
 ically pointer variables referring to the previously allocated data in global memory.
 Data retrieval: After the execution of the kernel, the host needs to retrieve the data—
 representing results of the kernel. This is performed with another transfer operation
 from Global Memory to Host Memory, using the function cudaMemcpy().


 3     Design of a conflict-based CUDA ASP Solver

 In this section, we will present the CUD@ASP procedure. This procedure is based on the
 CDNL-ASP procedure adopted in the CLASP system [11, 10]. The procedure assumes
 that the input is a ground ASP program. The novelty of CUD@ASP is the off-loading
 of several time consuming operations to the GPU—with particular focus on conflict
 analysis, exploration of the set of possible assignments and execution of the phases of
 unit-propagation. The rest of this section is organized as follows: we will start with
 an overview of the serial structure of the CUD@ASP procedure (Subsection 3.1). In
 the successive subsections, we will illustrate the parallel versions of the key procedures
 used in CUD@ASP: literal selection (Subsection 3.2), nogoods analysis (Subsection 3.3),
 unit propagation (Subsection 3.4), conflict analysis (Subsection 3.5), and analysis of
 stability (Subsection 3.7). In addition, we illustrate a method to use the GPU to handle
 the search process in the tail part of the search tree (Subsection 3.6).


 3.1    The General CUD@ASP Procedure

 The overall CUD@ASP procedure is summarized in Algorithm 3.2. The procedures that
 appear underlined in the algorithm are those that are delegated to the GPU for paral-
 lel execution. The algorithm makes use of the following notation. The input (ground)
 program is denoted by Π; Πcc denotes the completion of Π (eq. 3). The overall set
 of nogoods is denoted by ∆Π , composed of the completion nogoods and the loop no-
 goods. For each program atom p, the notation p represents the atom with a truth value
 assigned; ¬p denotes, instead, the complement truth value with respect to p.
     Lines 1–5 of Algorithm 3.2 represent the initialization phase of the ASP compu-
 tation. In particular, the Parsing procedure (Line 5) is in charge of computing the
 completion of Π and extracting the nogoods. The set A will keep track of the atoms
 that have already been assigned a truth value. It is initialized to the empty set in Line 1
 and updated by the Selection procedure at Line 22. Two variables (current dl
 and k) are introduced to support the rest of the computation. In particular, the vari-
 able current dl represents the decision level; this variable acts as a counter that keeps
 track of the number of “choices” that have been made in the computation of an an-
 swer set. Line 6 invokes the procedure StronglyConnectedComponent, which
 Algorithm 3.2 CUD@ASP
 Input Π ground ASP program
 Output An answer set, or null
  1: A := ∅                                                                    . Atoms assignment
  2: ∆Π := ∅                                                                             . Nogoods
  3: current dl := 0                                                      . Current Decision Level
  4: k := 32                                                  . Threshold for Exhaustive Procedure
  5: (∆Π , k, Πcc ) := Parsing(Π)                                          . Initialize ∆Π as ∆Πcc
  6: scc := StronglyConnectedComponent(Π)
  7: loop
  8:    Violation := NoGoodCheck(A, ∆Π )
  9:    if (Violation is true) ∧ (current dl = 0) then return no answer set
 10:     end if
 11:     if Violation is true then
 12:         (current dl, δ) = ConflictAnalysis(∆Π , A)
 13:         ∆Π = ∆Π ∪ {δ}
 14:         A := A \ {p ∈ A | current dl < dl(p)}
 15:     else
 16:         if ∃ δ ∈ ∆Π such that δ \ A = {p} and p ∈     / A then
 17:             A := UnitPropagation(A, ∆Π )
 18:         end if
 19:     end if
 20:     if There are atoms not assigned then
 21:         if Number of atoms to assign > k then
 22:             p := Selection(Πcc , A)
 23:             current dl := current dl + 1
 24:             dl(p) := current dl
 25:             A := A ∪ {p}
 26:         else              . At most k unassigned atoms: Non-deterministic GPU computation
 27:             if There is a successful thread for Exhaustive(A) then
 28:                 for each successful thread returning A := Exhaustive(A) do
 29:                      if StableTest(A, Πcc ) is true then return AT ∩ atom(Π)
 30:                      end if
 31:                 end for
 32:             end if
 33:         end if
 34:     else return AT ∩ atom(Π)
 35:     end if
 36: end loop




 determines the positive dependence graph and its strongly connected components; in
 absence of loops, the program Π is tight, thus not requiring the use of loop nogoods
 (ΛΠ ). We have implemented the classical Tarjan’s algorithm, running in O(n+e), on
 CPU (where n and e are the numbers of nodes and edges, respectively). The loop in
 Lines 7–36 represents the core of the computation. It alternates the process of testing
 consistency and propagating assignments (through the nogoods), and of guessing a pos-
 sible assignment to atoms that are still undefined. Each cycle starts with a call to the
 procedure NoGoodCheck (Line 8)—which, given a partial assignment A, validates
 whether all the nogoods in ∆Π are still satisfied. If a violation is detected, then the
 procedure ConflictAnalysis is used to determine the decision level causing the
 nogood violation, backtrack to such point in the search tree, and generate an additional
 nogood to prune that branch of the search space (Lines 11–14). If p is the assignment at
 the decision level determined by ConflictAnalysis, then the nogood will prompt
 the unit propagation process to explore the branch starting with the truth assignment ¬p
 (thus ensuring completeness of the computation [10]).
      If the NoGoodCheck procedure does not detect any nogood violation, then the
 computation might be in one of the following situations:
    - If there is a nogood that is completely covered by A except for one element p,
       then the UnitPropagation procedure is called to determine assignments that
       are implied by the nogoods (starting with the assignment ¬p) (Lines 16–17). Note
       that this procedure does not modify the decision level. In the case of non-tight
       programs, the UnitPropagation procedure will also execute a subroutine in
       charge of validating the loop nogoods.
    - If there are atoms left to assign (Line 20), then additional selections will need to
       be performed. We distinguish two possibilities. If the number of unassigned atoms
       is larger than a threshold k, then one of them, say p, is selected and the current
       decision level is recorded (by setting the value of the variable dl(p)—see Line 24).
        The Selection procedure is in charge of selecting a literal. The assignment
        is extended accordingly and the current decision level is increased (Lines 23–25).
       If the number of unassigned atoms is small, then a specialized parallel procedure
       (Exhaustive) systematically explores all the possible missing assignments. For
       each possible assignment of the remaining atoms, the procedure StableTest
       validates that all nogoods are satisfied and that the overall assignment A is stable
       (necessary test in the case of non-tight programs). This is described in Lines 27–32.

 3.2   Selection Procedure
 The purpose of this procedure is to determine an unassigned atom in the program and a
 truth value for it. A number of heuristic strategies have been studied to determine atom
 and assignment, often derived from analogous strategies developed in the context of
 SAT solving or constraint solving [27, 2]. As soon as an atom has been selected, it is
 necessary to assign a truth value to it. A traditional strategy [10] consists of assigning
 at the beginning the value true to bodies of rules, while atoms are initially assigned
 false—aiming at maximizing the number of resulting implications.
      There is, of course, no strategy that is optimal for all problems. In the current imple-
 mentation, we provide three selection strategies: the most frequently occurring literal
 strategy which selects the atom that appears in the largest number of nogoods (that aims
 at determining violations as soon as possible or to lead to early propagations through
 the nogoods), the leftmost-first strategy (which selects the first unassigned atom found),
 and the Jeroslow-Wang strategy (also based on the frequency of occurrence of an atom,
 but placing a greater value on smaller nogoods). All the three strategies are implemented
 by allowing kernels on the GPU to concurrently compute the rank of each atom; these
 rankings are re-evaluated at each backjump.
 Algorithm 3.3 NoGoodCheck                                    . Kernel executed by thread i
 Input A, ∆Π = {δ1 , . . . , δm }                    . An assignment A and a set of nogoods ∆Π
 Output True or False
  1: if i ≤ m then
  2:      state := 0
  3:      covered := 0
  4:      Atom to propagate := N U LL
  5:      for all p ∈ δi do
  6:          if ¬p ∈ A then state := 1
  7:          else if p ∈ A then covered := covered + 1
  8:          else Atom to propagate := p
  9:          end if
 10:      end for
 11:      if covered = |δi | then return V iolation := T rue
 12:      else if covered = |δi | − 1 and state = 0 then
 13:          Make Atom to propagate global
 14:      end if
 15:      return V iolation := F alse
 16: end if



 3.3   NoGoodCheck Procedure

 The NoGoodCheck procedure (see Algorithm 3.3) is primarily used to verify whether
 the current partial assignment A violates any of the nogoods in a given set ∆Π . The pro-
cedure also plays the additional rôle of identifying opportunities for unit propagation—
i.e., recognizing nogoods δ such that δ \ A = {p} and ¬p ∉ A. In this case, the element
p will be the target of a successive unit propagation phase.
      The pseudocode in Algorithm 3.3 describes a CUDA kernel (i.e., running on GPU)
 implementing the NoGoodCheck. Each thread handles one of the nogoods in ∆Π and
 performs a linear scan of its assigned atoms (Lines 5–10). The local flag state keeps
 track of whether the nogood is satisfied by the assignment (state equal to 1). The
 counter covered keeps track of how many elements of δi have already been found
 in A. The condition of state equal to zero and the covered counter equal to the
 size of the nogood implies that the nogood is violated by A. The first thread to detect a
 violation will communicate it to the host by setting a variable (Violation—Line 11)
 in global memory (used in Lines 9 and 11 of the general CUD@ASP procedure).
     Lines 12–13 implement the second functionality of the NoGoodCheck procedure—
identifying, and making globally visible, the single element of the nogood that is not
covered by the assignment A. Note that the identification of the element Atom to propagate
can be conveniently performed in NoGoodCheck, since the procedure is already scan-
ning the nogood while checking for violations.


 3.4   UnitPropagation Procedure

 The UnitPropagation procedure is performed only if the NoGoodCheck has de-
 tected no violations and has exposed at least one atom for propagation (as in Lines
 12–13 of Algorithm 3.3). UnitPropagation is implemented as a CUDA kernel—
 which allows us to distribute the different nogoods among threads, each in charge of
 extending the partial assignment A with one additional assignment. The procedure is
 iterated until a fixpoint is reached. The extension of A is an immediate consequence
 of the work done in NoGoodCheck: if the check of a nogood δi identifies p as the
 only element in δi not covered by A (i.e., {p, ¬p} ∩ A = ∅), then A is extended as
 A := A ∪ {¬p}.
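The interplay between NoGoodCheck and UnitPropagation can be sketched serially as follows (a minimal sketch: literals are signed integers, the assignment A is a set of literals, and the function name is hypothetical; on the GPU each nogood in the inner loop would be examined by a distinct thread):

```python
def check_and_propagate(nogoods, assignment):
    """Serial sketch of the NoGoodCheck/UnitPropagation loop.

    A nogood delta is violated when delta is a subset of the assignment;
    if exactly one literal p of delta is unassigned and no literal of
    delta is contradicted, then ¬p is propagated.  The process is
    iterated until a fixpoint (or a violation) is reached.
    Returns (violated, assignment).
    """
    changed = True
    while changed:
        changed = False
        for delta in nogoods:            # on the GPU: one thread per nogood
            unassigned = [p for p in delta
                          if p not in assignment and -p not in assignment]
            covered = sum(1 for p in delta if p in assignment)
            if covered == len(delta):
                return True, assignment   # violation detected
            if covered == len(delta) - 1 and len(unassigned) == 1:
                assignment = assignment | {-unassigned[0]}
                changed = True
    return False, assignment
```

For instance, starting from A = {1} with nogoods {1, 2} and {¬2, 3}, the first nogood forces ¬2, which in turn makes the second nogood force ¬3.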
     If the program Π is non-tight, then the UnitPropagation procedure includes
 an additional phase aimed at performing the computation of the unfounded sets deter-
 mined by the partial assignment A and the corresponding loop nogoods ΛΠ . This pro-
 cess is implemented by the procedure UnfoundedSetCheck and follows the gen-
 eral structure of the analogous procedure used in the implementation of CLASP [10].
This procedure performs an analysis of the strongly connected components of the posi-
tive dependence graph DΠ+ (already computed at the beginning of the computation of
 CUD@ASP—Line 6). For each p ∈ atoms(Π), scc(p) denotes the set of atoms that
 belong to the same strongly connected component as p. An atom p is said to be cyclic if
there exists a rule r ∈ Π such that head(r) ∈ scc(p) and body+ (r) ∩ scc(p) ≠ ∅; other-
wise p is acyclic. Cyclic atoms are the core of the search for unfounded sets—since
 they are the only ones that can appear in the unfounded loops. Cyclic atoms along with
 the knowledge of elements assigned by A allow the computation of unfounded sets, as
discussed in [17, 10]. In the current implementation UnfoundedSetCheck runs on
the host. Some parts are inherently parallelizable (e.g., the computation of external
supports, or splitting the analysis of the different SCCs among threads)—their execu-
tion on the device is work in progress.
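The cyclic/acyclic classification defined above can be sketched as follows (a minimal sketch under stated assumptions: rules are (head, positive body) pairs, `scc_of` maps each atom to the id of its SCC in the positive dependence graph, and all names are hypothetical):

```python
def cyclic_atoms(rules, scc_of):
    """Sketch of the cyclic-atom classification used by UnfoundedSetCheck.

    An SCC is cyclic iff some rule has its head in the component and at
    least one positive body atom in the same component; an atom is cyclic
    iff it belongs to a cyclic SCC.  Only cyclic atoms can occur in
    unfounded loops, so the search can be restricted to them.
    """
    cyclic_sccs = {scc_of[head]
                   for head, pos_body in rules
                   if any(scc_of[b] == scc_of[head] for b in pos_body)}
    return {p for p in scc_of if scc_of[p] in cyclic_sccs}
```

For example, with rules p ← q, q ← p, and r ← p, the atoms p and q form a cyclic component, while r is acyclic.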


 3.5   ConflictAnalysis Procedure

The ConflictAnalysis procedure is used to resolve a conflict detected by the
NoGoodCheck, by identifying a decision level dl and an assignment p to which the
computation should backtrack in order to remove the nogood violation. This process
enables classical backjumping in the search tree generated by Algorithm 3.2 [28, 27].
In addition, the procedure produces a new nogood to be added to the set of nogoods,
in order to prevent the same assignments in the future. The procedure is implemented
by a sequence of kernels, and it is executed after some nogood violations have been
detected by NoGoodCheck. It works as follows:
   • Each thread is assigned to a unique nogood (δ).
  • The thread determines the last two assigned literals in δ, say ℓM (δ) and ℓm (δ).
     The two (not necessarily distinct) decision levels of these assignments are stored in
     dlM (δ) = dl(ℓM (δ)) and dlm (δ) = dl(ℓm (δ)), respectively.
  • The thread verifies whether δ is violated.
  • Then, the violated nogood δ with the lowest value of dlM is determined.
At this point, a nogood learning procedure is activated. A kernel function (again, one
thread for each existing nogood) determines each nogood ε such that: (a) ¬ℓM (δ) ∈ ε
and (b) ε \ {¬ℓM (δ)} ⊆ A. Heuristic functions (see, e.g., [1]) can be applied to select
one of these ε. Currently, the smallest one is selected, in order to generate small new
nogoods—as future work, we will consider the whole set of these nogoods. The pro-
cedure then repeatedly sets δ := (ε \ {¬ℓM (δ)}) ∪ (δ \ {ℓM (δ)}), coherently updating
the values of dlM (δ) and dlm (δ), until dlM (δ) ≠ dlm (δ). This process ends with the
identification of a unique implication point (UIP [21]), which determines the lowest
decision level/literal among those causing the detected conflict. We use this value for
backjumping (Line 14 of Algorithm 3.2). The last nogood obtained in this manner is
also added to the set of nogoods.
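The iterated resolution step can be sketched serially as follows (a minimal sketch, not the actual kernel: literals are signed integers, `epsilons` maps a literal ℓ to the reason nogood selected for it—a nogood containing ¬ℓ—`level` maps literals to decision levels, and all names are hypothetical):

```python
def analyze_conflict(delta, epsilons, level):
    """Sketch of the resolution loop in ConflictAnalysis.

    Starting from a violated nogood delta, resolution against the reason
    of its most recently assigned literal is iterated until the two most
    recently assigned literals of delta lie at distinct decision levels
    (the UIP condition).  Returns the learned nogood and the backjump
    level.
    """
    def top_two(d):
        # the two literals of d assigned at the highest decision levels
        # (ties broken on the literal value, for determinism)
        ordered = sorted(d, key=lambda l: (level[l], l), reverse=True)
        return ordered[0], ordered[1]

    l_max, l_min = top_two(delta)
    while level[l_max] == level[l_min]:
        eps = epsilons[l_max]                      # reason nogood with ¬l_max
        delta = (eps - {-l_max}) | (delta - {l_max})
        l_max, l_min = top_two(delta)
    return delta, level[l_min]                     # learned nogood, backjump level
```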


 3.6   Exhaustive Procedure

GPUs are typically employed for data parallelism. However, as shown in [6], when the
size of the problem is manageable, it is possible to use them for massive search paral-
lelism. We have developed the Exhaustive procedure for this task. It is called when
at most k atoms remain undecided—where k is a parameter that can be set by the user
(by default, k = 32). The nogood set is simplified using the current assignment (this is
done in parallel by a kernel that assigns each nogood to a thread). The simplified set
is then processed by a second kernel with 2^k threads, which exhaustively explores all
of the possible assignments. Each thread verifies that its assignment does not violate
the nogood set. If no violation occurs, in the case of a tight program we have found
an answer set; otherwise the StableTest procedure (Sect. 3.7) is launched (Lines
27–28 of Algorithm 3.2). The efficiency of this procedure is obtained by a careful use of
 low-level data-structures. For example, the Boolean assignment of 32 atoms is stored in
 a single integer variable. Similarly, the nogood representation is stored using bit-strings,
 and violation control is managed by low-level bit operations.
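The bit-level encoding described above can be illustrated by a serial sketch (assignments and nogoods as integers used as bit-strings, mirroring the packing of 32 Boolean values into one word; on the GPU each candidate assignment t would be handled by a distinct thread; the function name is hypothetical):

```python
def exhaustive(k, nogoods):
    """Serial sketch of the Exhaustive kernel.

    Each of the 2**k candidate assignments over the k still-undecided
    atoms is encoded in a single integer (bit i = truth value of atom i).
    Each simplified nogood is a pair of bit-masks (pos, neg): it is
    violated by assignment t iff all its positive literals are true and
    all its negative literals are false in t.  Returns the assignments
    that violate no nogood.
    """
    models = []
    for t in range(2 ** k):              # on the GPU: one thread per t
        if all((t & pos) != pos or (t & neg) != 0
               for pos, neg in nogoods):
            models.append(t)
    return models
```

Violation control reduces to two bitwise operations and two comparisons per nogood, which is the kind of low-level test the kernel performs.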


 3.7   StableTest Procedure

 In order to verify whether an assignment found by the Exhaustive procedure is a
 stable model, we have implemented a GPU kernel that behaves as follows:
  • It computes the reduct of the program: each thread takes care of an individual rule;
    as a result, threads whose rule is eliminated by the reduct become inactive, threads
    dealing with rules whose negative literals are all absent from the model simply drop
    those literals, while all other threads remain idle (their rules need no modification).
  • The least fixpoint of the reduct is then computed. Each thread handles one rule
    (as modified by the first step above) and, if the body is satisfied, updates the
    set of derived atoms. Once a rule is triggered, it becomes inactive, speeding up
    subsequent iterations.
   • When a fixpoint is reached, the computed and the guessed models are compared.
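The three steps above can be sketched serially as follows (a minimal sketch under stated assumptions: rules are (head, positive body, negative body) triples, the candidate model is a set of atoms, and all names are hypothetical):

```python
def is_stable(rules, candidate):
    """Serial sketch of the StableTest kernel.

    The reduct deletes every rule with a negative body atom in the
    candidate model and drops the remaining negative literals; the least
    fixpoint of the reduct is then computed and compared with the
    candidate.
    """
    # Step 1: the reduct (on the GPU: one thread per rule)
    reduct = [(head, set(pos)) for head, pos, neg in rules
              if not (set(neg) & candidate)]
    # Step 2: least fixpoint, deactivating rules once they fire
    derived, changed = set(), True
    while changed:
        changed = False
        for rule in list(reduct):
            head, pos = rule
            if pos <= derived:            # body satisfied
                reduct.remove(rule)       # the rule becomes inactive
                derived.add(head)
                changed = True
    # Step 3: compare the computed and the guessed models
    return derived == candidate
```

For instance, with rules a., b ← a, not c., and c ← not b., the candidate {a, b} passes the test, while {a, b, c} does not.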


 4     Concluding discussion

We have reported on our ongoing project of developing an ASP solver running (par-
tially) on GPGPUs. We have implemented a working prototype solver. The first results
obtained by experimenting with different GPU architectures are encouraging. Table 1 shows an
excerpt of the results obtained on some instances (taken from the Second ASP Compe-
tition). The differences between the performance obtained by exploiting different GPUs
are evident, and indicate the strong potential for enhanced performance and scalability
of the approach.
    Table 2 reports on the performance of different serial ASP solvers on the same col-
lection of instances. While far from constituting a thorough and fair comparison of these
solvers against the GPU-based prototype, these results show that, even at this stage of
its development, the parallel prototype can compete, in some cases, with existing and
highly optimized serial solvers. Notice that the GPU-based prototype does not benefit
from the refined heuristics and search/decision strategies exploited, for instance, by the
state-of-the-art solver CLASP.
    It should be noted that, in order to fully exploit the computational power of the
GPUs, one has to carefully tune the parallel application w.r.t. the characteristics of the
specific device at hand. The architectural features of the specific GPU family have to
be carefully taken into account. Moreover, even considering a given GPU, different
options can be adopted, both in partitioning tasks among threads/warps and in allo-
cating/transferring data on the device's memory. Clearly, such choices significantly
affect the performance of the whole application. This can be better explained by con-
sidering Table 3. It shows the performance obtained by three versions of
the GPU-based solver, differing in the way the device's global memory is used. Apart
from the default allocation mentioned in Sect. 2.2, CUDA provides two other basic
kinds of memory allocation. A first possibility uses page-locking (pinned memory),
which speeds up transfers between host and device. Mapped allocation allows one to
map a portion of host memory into the device's address space. In this way, the data
transfer between host and device is implicitly ensured by the system and explicit mem-
ory transfers (by means of the function cudaMemcpy()) can be avoided. The first
column of Table 3 shows the performance of
a version of the prototype that allocates all data using mapped memory. The behavior
of a faster version of the solver, which exploits page-locking for the main data structures
(essentially, those representing the set of nogoods), is shown in the second column.
Clearly, this approach requires additional programming effort (in optimizing and keep-
ing track of memory transfers). Even better performance has been achieved by a third
version of the solver that adopts page-locking to allocate all data structures directly on
the device. This solution may appear, in some sense, unappealing, because it requires
implementing on the device also some intrinsically serial functionalities. Even if these
functions cannot fully exploit the parallelism of the cores, a considerable advantage is
achieved by avoiding most of the memory transfers between host and device.
    In this work we have made initial steps towards the creation of a GPU-based ASP
solver; however, further effort is needed to improve it. In particular, some procedures
need to be optimized in order to take greater advantage of the high data-/task-parallelism
offered by GPGPUs and of the different types of available memories. Moreover, some
parts of the solver currently running on the host should be replaced by suitable parallel
counterparts (examples are the computation of the strongly connected components of
the dependence graph and the computation of the unfounded sets). We also plan to
develop a stability test that avoids analyzing the whole program, and an implementation
of NoGoodCheck that makes use of watched literals.
                       Instance            GT520       GT640     GTX580
                       channelRoute 3           5.44      1.73         0.37
                       knights 11 11            0.70      0.23         0.06
                       knights 13 13            1.70      0.51         0.12
                       knights 15 15            1.71      0.51         0.12
                       knights 17 17            2.40      0.69         0.16
                       knights 20 20            8.57      2.34         0.46
                       labyrinth.0.5            0.08      0.08         0.05
                       schur 4 41               0.24      0.16         0.07
                       schur 4 42               0.31      0.20         0.07
 Table 1. Results obtained with three different Nvidia GeForce GPUs: GT520 (48 cores, capability
 2.0, GPU clock 1.62 GHz, memory clock rate 0.50 GHz, global memory 1GB), GT640 (384
 cores, capability 3.0, GPU clock 0.90 GHz, memory clock rate 0.89 GHz, global memory 2GB),
 GTX580 (512 cores, capability 2.0, GPU clock 1.50 GHz, memory clock rate 2.00 GHz, global
 memory 1.5GB). The timing is in seconds.



 References
  [1] C. Anger, M. Gebser, and T. Schaub. Approaching the Core of Unfounded Sets. Proceedings
      of the International Workshop on Nonmonotonic Reasoning. 2006.
  [2] A. Biere. Handbook of Satisfiability, IOS Press, 2009.
  [3] H. Blair and A. Walker. Towards a theory of declarative knowledge. IBM Watson Research
      Center, 1986.
  [4] F. Campeotto, A. Dovier, and E. Pontelli. Protein Structure Prediction on GPU: an experi-
      mental report. Proc. of RCRA, Rome, June 2013.
  [5] K. Clark. Negation as Failure. Logic and Databases, Morgan Kaufmann, 1978.
  [6] A. Dal Palù, A. Dovier, A. Formisano, E. Pontelli. Exploiting Unexploited Computing
      Resources for Computational Logics. Proc. of CILC, CEUR, vol 857, 2012.
  [7] R. Dechter. Constraint Processing. Morgan Kaufmann, 2003.
  [8] F. Fages. Consistency of Clark’s Completion and Existence of Stable Models. Journal of
      Methods of Logic in Computer Science, 1(1):51–60, 1994.
  [9] R. Finkel et al. Computing Stable Models in Parallel. Answer Set Programming, AAAI
      Spring Symposium, 2001.
 [10] M. Gebser, B. Kaufmann, and T. Schaub. Conflict-driven Answer Set Solving: From Theory
      to Practice. Artificial Intelligence 187: 52–89, 2012.
 [11] M. Gebser et al. Answer Set Solving in Practice. Morgan & Claypool, 2012.
 [12] M. Gebser et al. Multi-Threaded ASP Solving with CLASP. TPLP, 12(4-5), 2012.
 [13] E. Giunchiglia et al. Answer Set Programming Based on Propositional Satisfiability. Jour-
      nal of Automated Reasoning, 36(4):345–377, 2006.
 [14] J. Herbrand. Recherches sur la théorie de la démonstration. Doctoral Dissertation, Univ. of
      Paris, 1930.
 [15] R. Jeroslow and J. Wang. Solving Propositional Satisfiability Problems. Annals of Math
      and AI, 1:167-187, 1990.
 [16] Khronos Group Inc. OpenCL Reference Pages. http://www.khronos.org, 2011.
 [17] F. Lin and Y. Zhao. ASSAT: Computing Answer Sets of a Logic Program by SAT Solvers.
      Artificial Intelligence, 157(1):115–137, 2004.
 [18] P. Manolios and Y. Zhang. Implementing Survey Propagation on Graphics Processing Units.
      SAT, Springer Verlag, 2006.
           Instance         SMODELS CMODELS CLASP-None CLASP                     GTX580
           channelRoute 3         2.08          1.42         69.27        0.24         0.37
           knights 11 11          0.34          0.11           0.03       0.03         0.06
           knights 13 13          1.12          0.21           0.06       0.06         0.12
           knights 15 15          1.12          0.24           0.05       0.07         0.12
           knights 17 17          0.91          1.99           0.05       0.06         0.16
           knights 20 20          9.61          3.85           0.22       0.20         0.46
           labyrinth.0.5          0.02          0.01           0.01       0.01         0.05
           schur 4 41             0.05          0.70           0.02       0.02         0.07
           schur 4 42             0.07          0.60           0.02       0.05         0.07
 Table 2. Results obtained with different solvers. All experiments were run on the same machine
 (host: QuadCore Intel i7 CPU, 2.93GHz, 4GB RAM; device GTX580). Serial solvers: SMODELS
 v. 2.34; CMODELS v. 3.85 exploiting minisat; CLASP v. 2.1.0. The column ‘CLASP-None’ shows
 results obtained with CLASP by inhibiting its decision heuristics. This makes its selection strategy
 analogous to the one used in our implementation. The timing is in seconds.

       Instance          All data mapped    ∆Π page-locked         All data page-locked
       knights 11 11                   0.87                0.16                      0.06
       knights 13 13                   2.50                0.34                      0.12
       knights 15 15                   2.49                0.35                      0.12
       knights 17 17                   3.60                0.50                      0.16
       knights 20 20                  14.14                1.60                      0.46
       labyrinth.0.5                   0.82                0.03                      0.05
       schur 4 41                     19.71                0.09                      0.07
       schur 4 42                     24.75                0.12                      0.07
         Table 3. Results obtained with GTX580 with different use of memory resources.



 [19] W. Marek and M. Truszczyński. Autoepistemic logic. Journal of the ACM (JACM),
      38(3):587–618, 1991.
 [20] W. Marek and M. Truszczyński. Stable Models as an Alternative Programming Paradigm.
      The Logic Programming Paradigm, Springer Verlag, 1999.
 [21] J. Marques-Silva and K. Sakallah. GRASP: A Search Algorithm for Propositional Satisfia-
      bility. IEEE Transactions on Computers, 48:506-521, 1999.
 [22] I. Niemelä. Logic Programming with Stable Model Semantics as a Constraint Programming
      Paradigm. Annals of Math and AI, 25:241-273, 1999.
 [23] J. Nickolls and W.J. Dally. The GPU Computing Era. In IEEE Micro, 30(2):56-59, 2010.
 [24] S. Perri, F. Ricca, and M. Sirianni. Parallel Instantiation of ASP Programs: Techniques and
      Experiments. TPLP, 13(2), 2013.
 [25] E. Pontelli and O. El-Khatib. Exploiting Vertical Parallelism from Answer Set Programs.
      Answer Set Programming, AAAI Spring Symposium, 2001.
 [26] E. Pontelli, H. Le and T. Son. An Investigation in Parallel Execution of ASP on Distributed
      Memory Platforms. Computer Languages, Systems & Structures, 36(2):158-202, 2010.
 [27] F. Rossi, P. Van Beek, and T. Walsh. Handbook of Constraint Programming, Elsevier, 2006.
 [28] S. Russell et al. Artificial Intelligence: A Modern Approach. Prentice Hall, 2010.
 [29] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU
      Programming. Addison Wesley, 2011.
 [30] A. Van Gelder et al. The Well-founded Semantics for General Logic Programs. Journal of
      the ACM, 38(3):619–649, 1991.