Technical Communications of ICLP 2015. Copyright with the Authors.

Parallel Bottom-Up Evaluation of Logic Programs: DeALS on Shared-Memory Multicore Machines

MOHAN YANG, ALEXANDER SHKAPSKY, CARLO ZANIOLO
University of California, Los Angeles

submitted 29 April 2015; accepted 5 June 2015

Abstract

Delivering superior expressive power over RDBMSs, while maintaining competitive performance, has been the main goal and technical challenge for deductive database research since its inception forty years ago. Significant progress toward this ambitious goal is being achieved by the DeALS system through the parallel bottom-up evaluation of logic programs, including recursive programs with monotonic aggregates, on shared-memory multicore machines. In DeALS, a program is represented as an AND/OR tree, and the parallel evaluation instantiates multiple copies of the same AND/OR tree that access the tables in the database concurrently. Synchronization methods such as locks are used to ensure the correctness of the evaluation. We describe a technique that finds an efficient hash-partitioning strategy for the tables, one that minimizes the use of locks during the evaluation. Experimental results demonstrate the effectiveness of the proposed technique: DeALS achieves competitive performance on non-recursive programs compared with commercial RDBMSs, and superior performance on recursive programs compared with other existing systems.

KEYWORDS: parallel, bottom-up evaluation, Datalog, multicore, AND/OR tree

1 Introduction

There has been much research on improving the performance of Datalog systems through parallel bottom-up evaluation. Previous studies focused on the message passing model, where processors communicate with each other by exchanging messages.
This includes both strategies for programs that can be evaluated without any communication (Wolfson and Silberschatz 1988; Wolfson 1988; Cohen and Wolfson 1989; Seib and Lausen 1991) and strategies that minimize the amount of communication required (Ganguly et al. 1992; Zhang et al. 1995; Ganguly et al. 1995). In this paper we instead assume the shared-memory model, where the data is stored in shared memory that can be directly accessed by all processors, as is supported by most modern multicore machines, rather than the explicit exchange of messages through a shared segment of memory studied in (Wolfson and Silberschatz 1988; Ganguly et al. 1992). The shared data may be modified by multiple processors concurrently, and synchronization methods such as locks are used to ensure the correctness of the evaluation.

The key to achieving good performance in the shared-memory model is to use as little synchronization as possible during program evaluation. In this paper, we present the technique used by the Deductive Application Language System (DeALS)1, under development at UCLA, which extends the LDL++ technology (Arni et al. 2003) to support the parallel bottom-up evaluation of Datalog programs on shared-memory multicore machines. The proposed technique produces efficient evaluation plans for both non-recursive and recursive programs. DeALS delivers competitive performance on the non-recursive queries of the TPC-H benchmark2, compared with state-of-the-art RDBMSs such as Vectorwise3 and SQL Server4, and superior performance on recursive programs compared with other existing systems.

The rest of this paper is organized as follows. We introduce the concept of lock-free programs in Section 2. We describe how DeALS finds a lock-free evaluation plan in Section 3. An overview of DeALS is presented in Section 4. We report experimental results in Section 5. Related work is discussed in Section 6.
The paper concludes in Section 7.

2 Lock-free Programs

Let arc be a base relation that represents the edges of a directed graph. The transitive closure tc of arc is a derived relation that contains all the pairs (X, Y) such that there is a path from X to Y in the graph. The following program computes tc.

    r1.1: tc(X, Y) <- arc(X, Y).                                 (1)
    r1.2: tc(X, Y) <- tc(X, Z), arc(Z, Y).

The bottom-up evaluation of this program works as follows. The exit rule r1.1 is evaluated first: a tuple (X, Y) is added to tc for each tuple (X, Y) in arc. Then the left-linear recursive rule r1.2 is evaluated: for each tuple (X, Z) in tc, a tuple (X, Y) is added to tc for each tuple of the form (Z, Y) in arc.5 Rule r1.2 is evaluated repeatedly until tc does not change between two successive evaluations.

Now we describe how to parallelize the bottom-up evaluation on a shared-memory machine with n processors. The parallel evaluation strategy presented in this paper uses the same workflow as the sequential bottom-up evaluation to ensure the correctness of the evaluation, while parallelism is achieved through the parallel evaluation of each single rule (including the parallel pipelined evaluation of all the rules that support its goal, as described in Section 3). As shown in our experiments, our strategy achieves a reasonable speedup for data-intensive applications, to the point that we do not need to explore rule-level and component-level parallelism (Perri et al. 2013).

1 Deductive Application Language System, http://wis.cs.ucla.edu/deals/.
2 TPC-H, http://www.tpc.org/tpch/.
3 Vectorwise, http://www.actian.com/.
4 SQL Server 2014, http://www.microsoft.com/en-us/server-cloud/products/sql-server/.
5 Naive evaluation is used here for simplicity. The technique described in this paper also applies to semi-naive evaluation.
We divide each relation into n partitions and use the relation name with a superscript i to denote the i-th partition of the relation. For each partition, we use a lock to ensure the atomicity of each write operation if multiple write operations can occur concurrently. Let h be a hash function that maps a vertex to an integer between 1 and n. Both arc and tc are partitioned by the first column, i.e., h(X) = i for each (X, Y) in arc^i and h(X) = i for each (X, Y) in tc^i. Assuming that there are one coordinator and n workers, the parallel evaluation proceeds as follows.

(1) The i-th worker evaluates r1.1 by adding a tuple (X, Y) to tc for each tuple (X, Y) in arc^i.
(2) Once all workers finish Step (1), the coordinator notifies each worker to start Step (3).
(3) For each tuple (X, Z) in tc^i, the i-th worker looks for tuples of the form (Z, Y) in arc and adds a tuple (X, Y) to tc.
(4) Once all workers finish Step (3), the coordinator checks whether the evaluation of tc is complete. If so, the computation terminates; otherwise the computation resumes from Step (3).

In Steps (1) and (3), each worker performs its task on one processor while the coordinator waits. Steps (2) and (4) serve as synchronization barriers. In Step (1), the i-th worker only writes to tc^i, since every tuple (X, Y) it adds to tc is taken from arc^i, where h(X) = i. In Step (3), the i-th worker reads from tc^i and only writes to tc^i, since every tuple (X, Y) it adds to tc is derived from a tuple (X, Z) in tc^i and a tuple (Z, Y) in arc, where h(X) = i. There is no need for locks, since tc^i is only accessed by the i-th worker during the evaluation. We call an evaluation plan in which no lock is needed a lock-free plan, and a program that has a lock-free plan a lock-free program. A key factor that enables a lock-free plan is the partitioning strategy.
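The four steps above can be sketched as a minimal Python simulation (illustrative only, not the DeALS implementation). Each pass of the inner loop below is the work of one worker: worker i reads tc^i and all of arc but writes only tc^i, so the n per-round loops are independent and could run on n processors without locks, with a barrier between rounds.

```python
# Naive bottom-up evaluation of the transitive-closure program in (1) with
# the hash partitioning of Section 2. Python's built-in hash() stands in
# for the hash function h of the paper.

def parallel_tc(arc, n=4):
    h = lambda v: hash(v) % n
    # Index arc on its first column for the lookup arc(Z, Y) with Z bound.
    arc_by_src = {}
    for (x, y) in arc:
        arc_by_src.setdefault(x, set()).add(y)

    # Step (1): worker i copies arc^i into tc^i -- it writes only tc^i.
    tc = [set() for _ in range(n)]
    for (x, y) in arc:
        tc[h(x)].add((x, y))

    changed = True
    while changed:              # Step (4): the coordinator's fixpoint check
        changed = False
        for i in range(n):      # Step (3): one iteration per worker
            new = set()
            for (x, z) in tc[i]:
                for y in arc_by_src.get(z, ()):
                    new.add((x, y))   # h(x) = i, so (x, y) belongs to tc^i
            if not new <= tc[i]:
                tc[i] |= new
                changed = True
    return set().union(*tc)

edges = {(1, 2), (2, 3), (3, 4)}
assert parallel_tc(edges) == {(1, 2), (1, 3), (1, 4),
                              (2, 3), (2, 4), (3, 4)}
```

Since every tuple that worker i derives has h(X) = i, writes never cross partitions; replacing the sequential loop over i with n threads separated by barriers yields the lock-free plan described above.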
If we keep the current partitioning for arc but instead partition tc by its second column, then every worker could write to tc^i in Step (3), and an exclusive lock (x-lock) is needed to ensure that every write operation to tc^i is atomic. For some programs, a worker may read from a partition of a relation while the same partition is being modified by another worker. In this case, a readers-writer lock (rw-lock) is needed to ensure that every write operation is atomic, that read operations never see a partially updated tuple, and that concurrent read operations are allowed when the partition is not being modified by other workers.

Another key factor is the selection of hash functions. It is possible to obtain a lock-free plan even if we use different hash functions for arc and tc. However, in this paper we focus on the case where every relation is partitioned using the same hash function h, defined as

    h(x_1, ..., x_t) = (g(x_1) + ... + g(x_t)) mod n,            (2)

where the input to h is a tuple of any arity t, g is a hash function with a range no less than n, and the sum can be replaced with any commutative function. Using a commutative function to combine the hash values allows for more parallelism than an arbitrary function, such as concatenation (||), where h(x_1, ..., x_t) becomes g(x_1) || ... || g(x_t) (Yang et al. 2015). Next, we present a systematic approach that finds a partitioning strategy generating a lock-free plan for an arbitrary program whenever such a plan exists.

Discriminating Sets and Parallel Program Evaluation. Consider the evaluation of a stratifiable program P that consists of l rules r_1, ..., r_l, where each rule r_i has the form p_{i,0} <- p_{i,1}, ..., p_{i,m_i}. Here p_{i,0} is the predicate that serves as the head of r_i, m_i is the number of predicates in the body of r_i, and each p_{i,j} is a predicate in the body for j = 1, ..., m_i.
We assume that each predicate p is associated with a relation of the same name and arity, and we use R(p) to denote the relation that stores all tuples corresponding to facts about p in memory. A discriminating set of a (non-nullary) relation R is a non-empty subset of the columns of R. Given a discriminating set of a relation, we divide the relation into n partitions by the hash values of the columns that belong to the discriminating set.

Let P_j be the program to be executed by the j-th worker. For each rule r_i, we add the following rule to P_j:

    r_i^j: p_{i,0} <- p_{i,1}, ..., p_{i,m_i}, h(p_{i,t_i}[X_i]) = j,        (3)

where p_{i,t_i} is a predicate selected from the m_i predicates in the body, for some t_i between 1 and m_i; X_i is a selected discriminating set of R(p_{i,t_i}); p_{i,t_i}[X_i] denotes the tuple of arity |X_i| obtained by retrieving the arguments of p_{i,t_i} whose positions belong to X_i; and the last predicate h(p_{i,t_i}[X_i]) = j means that a tuple from the j-th partition of R(p_{i,t_i}) is accessed in every successful derivation of r_i^j. We select the same p_{i,t_i} and X_i for j = 1, ..., n. We also select a discriminating set, denoted by X(R), for each derived relation R. These discriminating sets can be arbitrarily selected, as long as there is a unique discriminating set for each derived relation. We might select several different discriminating sets for the same base relation, corresponding to different ways of partitioning the relation; such a relation is preprocessed before the evaluation so that it can be efficiently accessed under every partitioning.

In general, the parallel evaluation proceeds as follows.

(1) The coordinator determines the first rule to be evaluated, say r_i, and instructs the workers to start Step (2).
(2) All n workers become active, with the j-th worker evaluating r_i^j.
(3) After all workers finish, the coordinator checks whether the evaluation of P is complete. If not, it determines the next rule to be evaluated and Step (2) repeats.
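As an illustrative sketch (not DeALS code), partitioning a relation by a discriminating set with the hash function h of Equation (2) can be written as follows; Python's built-in hash() stands in for g, and the relation and its values are made up for the example:

```python
# Partition a relation into n parts by a discriminating set X, given as a set
# of 1-based column positions. h combines the per-column hashes with addition,
# a commutative function, so the partition assigned to a tuple does not depend
# on the order in which the selected columns are combined.

def h(args, n):
    return sum(hash(a) for a in args) % n      # Equation (2)

def project(t, X):
    # p[X]: the arguments of tuple t whose positions belong to X
    return tuple(t[i - 1] for i in sorted(X))

def partition(relation, X, n):
    parts = [set() for _ in range(n)]
    for t in relation:
        parts[h(project(t, X), n)].add(t)
    return parts

# A made-up binary relation partitioned by its second column:
friend = {("ann", "bob"), ("carl", "bob"), ("ann", "dana")}
parts = partition(friend, X={2}, n=4)
# Tuples that agree on the discriminating set land in the same partition.
```

Because addition is commutative, h((a, b), n) == h((b, a), n); a concatenation-based combiner would distinguish the two orderings, which is the extra freedom that Equation (2) exploits.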
The parallel evaluation scheme described above is a special case of the substitution partitioned scheme (Ganguly et al. 1992). Our scheme improves on it by removing the sending, receiving and final pooling steps, since each worker has full access to all the relations in the shared-memory model.

Example 1
The corresponding P_j in the lock-free plan for the program in (1) is shown below.

    r4.1: tc(X, Y) <- arc(X, Y), h(X) = j.                       (4)
    r4.2: tc(X, Y) <- tc(X, Z), arc(Z, Y), h(X) = j.

For each worker, there are still many different ways to evaluate r4.2, since the three predicates in its body can be evaluated in different orders. For a specific order, a bound/free adornment (Ullman 1985) of a predicate p is a string of b's and f's whose length equals the arity of p, where a b (an f) in the i-th position means the i-th argument of p is bound (free) when p is evaluated. We force the same bottom-up evaluation plan upon every worker by using the same bound/free adornments for every P_j.

Example 2
The program in (5) shows an adorned version of the program in Example 1. The predicates in r5.2 are reordered to reflect the order in which they are evaluated. In the evaluation of r5.2, the j-th worker evaluates tc^ff(X, Z), h(X) = j by reading from the j-th partition of tc. Then it evaluates arc^bf(Z, Y), where Z is bound: it looks for a tuple (Z, Y) in arc with the given Z, and adds (X, Y) to tc if it succeeds.

    r5.1: tc^ff(X, Y) <- arc^ff(X, Y), h(X) = j.                 (5)
    r5.2: tc^ff(X, Y) <- tc^ff(X, Z), h(X) = j, arc^bf(Z, Y).

Example 3
The attend program below finds all the people who will attend the party. A person will attend the party if he/she is an organizer, or if he/she has at least three friends who will attend the party. Here, mcount is the monotonic and continuous count aggregate (Shkapsky et al. 2015).

    cntfriends(Y, mcount) <- friend(X, Y), attend(X).
    attend(X) <- organizer(X).
    attend(Y) <- cntfriends(Y, N), N ≥ 3.                        (6)

The program in (7) shows a lock-free evaluation plan. Each relation is partitioned by its first column, except friend, which is partitioned by its second column. In each iteration, the j-th worker scans through the j-th partition of friend and checks whether a specific X is in attend, which can be supported efficiently by maintaining an index on attend during program evaluation. Thus, the j-th worker only writes to the j-th partition of cntfriends and attend, and no lock is needed.

    cntfriends^ff(Y, mcount) <- friend^ff(X, Y), h(Y) = j, attend^b(X).
    attend^f(X) <- organizer^f(X), h(X) = j.                     (7)
    attend^f(Y) <- cntfriends^ff(Y, N), h(Y) = j, N ≥ 3.

3 Parallel Evaluation of AND/OR Trees

In DeALS, a program is represented as an AND/OR tree (Arni et al. 2003). An OR node represents a predicate and an AND node represents the head of a rule. The root is an OR node. The children of an OR node are AND nodes, and the children of an AND node are OR nodes. There are specialized OR nodes, such as the R-node, which reads from a base relation or a derived relation, and the W-node, which writes to a derived relation. We use a pipelined evaluation that only materializes a relation when it is necessary: 1) it is the root; 2) it is an aggregation; or 3) the optimizer determines that the cost of computing the tuples when they are needed is much higher than the cost of materializing the relation in memory.6 A program is transformed into an AND/OR tree such that the root represents the query and the children of each AND node follow the same order as the corresponding predicates in the corresponding rule. The AND/OR tree is then adorned for bottom-up evaluation.
An OR node is an entry node if 1) it is a leaf; 2) it is the first R-node among its siblings; and 3) none of its ancestor OR nodes has a left-sibling (i.e., a sibling that appears before it) with an R-node or W-node descendant.

Example 4
A non-linear formulation of the transitive closure program is shown below.

    tc(X, Y) <- arc(X, Y).                                       (8)
    tc(X, Y) <- tc(X, Z), tc(Z, Y).

The corresponding AND/OR tree is shown in Fig. 1. The text inside a node indicates the type and ID of the node, e.g., "OR-1" indicates that the root is an OR node with ID 1. OR-4, OR-5 and OR-6 are R-nodes, and OR-1 is a W-node. OR-4 and OR-5 are entry nodes in this program.

Fig. 1. AND/OR tree of tc as expressed by the program in (8): the root OR-1 tc(X, Y) has children AND-2 tc(X, Y) and AND-3 tc(X, Y); AND-2 has child OR-4 arc(X, Y); AND-3 has children OR-5 tc(X, Z) and OR-6 tc(Z, Y).

In DeALS, the j-th worker evaluates an entry node by instantiating variables with constants from the j-th partition of the corresponding relation. For the remaining R-nodes, each worker has full access to all partitions of the corresponding relations. This strategy ensures that the evaluation is divided into n disjoint parts; if, instead, each worker evaluated an entry node by instantiating variables with constants from all partitions of the corresponding relation, redundant work would be performed.

Determining the Discriminating Sets. Now we describe the read/write analysis on an adorned AND/OR tree, which determines the type of lock needed for each derived relation. The analysis performs a depth-first traversal of the AND/OR tree that simulates the actual evaluation to check each read or write operation performed by the j-th worker. For each node encountered during the traversal:

6 Currently this is done manually: a user can force the materialization of a relation by an annotation in the program.
(1) if it is an entry node, set it as the current entry node; if it reads from a derived relation, then for each W-node that is an ancestor of the current node,7 determine whether the j-th worker only writes to the j-th partition of R(p_w) by checking whether p_e[X_j] = p_w[X_k],8 where p_e and p_w are the predicates associated with the entry node and the W-node, respectively, and X_j and X_k are the corresponding discriminating sets;
(2) if it is an R-node that reads from a derived relation, determine whether the j-th worker only reads from the j-th partition of R(p_r) by checking whether X_k ⊆ B and p_e[X_j] = p_r[X_k], where p_e and p_r are the predicates associated with the current entry node and the R-node, respectively, X_j and X_k are the corresponding discriminating sets, and B is the set of positions of the bound arguments in the R-node.

We use the following discriminating set equations (DSE), obtained through the read/write analysis, to find the best possible discriminating sets for a given program. A DSE consists of three types of equations:

(1) p_e[X_j] = p_w[X_k] for each entry node in Case (1) of the read/write analysis;
(2) X_k ⊆ B, p_e[X_j] = p_r[X_k] for each R-node in Case (2) of the read/write analysis;
(3) ∅ ⊊ X ⊆ {1, ..., arity(R)} for each X appearing in the above two types of equations, where arity(R) is the arity of the relation R associated with X.

The target is an optimal solution to the DSE: an assignment to the variables that satisfies all the equations of Type (3) and maximizes the number of satisfied equations of Types (1) and (2). DeALS finds an optimal solution by enumerating all possible assignments, which is feasible for most recursive programs studied in the literature, since the number of variables is very small. The DSE is very similar to the idea of generalized pivoting, where a system of equations is obtained from the rules and an exact solution is required (Seib and Lausen 1991).
But it differs from generalized pivoting in two respects: 1) the DSE is obtained through the read/write analysis on the AND/OR tree, since the pipelined evaluation on the AND/OR tree might evaluate multiple rules at the same time, so the equations obtained from each single rule cannot capture all the constraints; 2) an exact solution is not required, since we want to obtain the best possible evaluation plan even when the program cannot be evaluated without any communication under the message passing model. Finally, the discriminating sets are determined as follows.

(1) Obtain the DSE by performing a read/write analysis on the AND/OR tree.
(2) Find an optimal solution to the DSE.
(3) Determine the type of lock for each derived relation by a read/write analysis on the AND/OR tree with the selected discriminating sets.

Example 5
The DSE for the AND/OR tree in Fig. 1 is shown below.

    arc(X, Y)[X1] = tc(X, Y)[X2]
    tc(X, Z)[X2] = tc(X, Y)[X2]
    X2 ⊆ {1}                                                     (9)
    tc(X, Z)[X2] = tc(Z, Y)[X2]
    ∅ ⊊ X1 ⊆ {1, 2}
    ∅ ⊊ X2 ⊆ {1, 2}

The optimal solution is X1 = X2 = {1}, which violates only the fourth equation. The result of the read/write analysis determines that an rw-lock is needed for tc in the evaluation of the non-linear recursive rule.

The above procedure assumes that the program contains no aggregates or arithmetic expressions. If the program does contain these constructs, the same procedure is applicable, provided that we ignore all the arguments that are either aggregates or arithmetic expressions. The only exception is that the evaluation of a W-node always requires locks if it contains an aggregate with no group-by arguments.

7 The set of ancestor W-nodes of an entry node can be efficiently obtained using a stack during the traversal (Yang et al. 2015). The details are omitted due to space constraints.
8 The tuple denoted by p[X] is treated as a multiset of arguments when involved in equality checking.
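The enumeration in Step (2) can be sketched as follows (an illustrative Python encoding of Example 5, not DeALS code). Each p[X] is modeled as the multiset of argument variables at the positions in X, per footnote 8; the variable names X1 and X2 and the equation order follow Equation (9), and everything else is made up for the sketch.

```python
from itertools import product

def picks(args, X):
    # p[X] as a multiset of the arguments at the 1-based positions in X
    return sorted(args[i - 1] for i in X)

def solve_dse(variables, domains, equations):
    """Enumerate all assignments; return one maximizing satisfied equations."""
    best, best_sat = None, -1
    for values in product(*(domains[v] for v in variables)):
        asg = dict(zip(variables, values))
        sat = sum(eq(asg) for eq in equations)
        if sat > best_sat:
            best, best_sat = asg, sat
    return best, best_sat

# The Type (3) equations become the domains: the non-empty subsets of {1, 2}.
subsets = [frozenset(s) for s in ({1}, {2}, {1, 2})]
domains = {"X1": subsets, "X2": subsets}

# The Type (1) and Type (2) equations of (9), in order.
eqs = [
    lambda a: picks(("X", "Y"), a["X1"]) == picks(("X", "Y"), a["X2"]),
    lambda a: picks(("X", "Z"), a["X2"]) == picks(("X", "Y"), a["X2"]),
    lambda a: a["X2"] <= {1},
    lambda a: picks(("X", "Z"), a["X2"]) == picks(("Z", "Y"), a["X2"]),
]

best, sat = solve_dse(["X1", "X2"], domains, eqs)
# Optimal: X1 = X2 = {1}, satisfying 3 of the 4 equations
# (only the fourth is violated, as in Example 5).
assert best == {"X1": frozenset({1}), "X2": frozenset({1})} and sat == 3
```

The brute force is acceptable here for the reason given above: the number of discriminating-set variables in typical recursive programs is very small.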
4 DeALS

DeALS is a Datalog system under development at UCLA. It has a Java-based compiler and a sequential Java-based interpreter that allow users to develop and debug their applications. The new parallel evaluation module targeted at shared-memory multicore machines is implemented as a separate module in the system: the compiler compiles a program into an AND/OR tree, and then the parallel module determines the parallel evaluation plan using the technique presented in this paper and generates a corresponding C++ program. We implemented the database objects (index and storage), base classes for each kind of node in the AND/OR tree, and common functions. The generated program contains the definition of tuples and relations, and the actual implementation of the AND/OR tree based on the base classes. It is compiled into the final executable by invoking the Visual C++ compiler that comes with Visual Studio 2013 (v120) on a Windows machine, or GCC 4.9.2 on a Linux machine. The thread implementation provided by the Microsoft Windows runtime library is used on a Windows machine to evaluate the query in parallel, while Pthreads is used on a Linux machine.

5 Experimental Results

In this section, we report experimental results on evaluating both non-recursive and recursive programs. The test machine has four AMD Opteron 6376 CPUs (16 cores per CPU) and 256GB memory (configured into eight NUMA regions). The operating system is Ubuntu Linux 12.04 LTS. All execution times are calculated by taking the average of five runs of the same experiment.

Fig. 2. Total query evaluation time for the 22 queries in the TPC-H benchmark, for DeALS, VectorWise 2.0.1 and SQL Server 2014, plotted against database size (1GB to 1TB, log-log scale). A predicted query evaluation time on our test machine is marked with a * in the figure.

Exp-I: Non-recursive programs — TPC-H benchmark.
The benchmark contains 22 (non-recursive) SQL queries over a database of eight tables. The data types involved in the queries are integer, decimal, string and date. We implemented all 22 queries following the query plans described in (Dees and Sanders 2013).9 DeALS correctly evaluates all 22 queries on databases of size 1GB, 10GB and 100GB on the test machine, with speedups of 12.43, 19.48 and 23.63, respectively (the evaluation time for all 22 queries using one processor divided by the time using 60 processors). Fig. 2 shows the total query evaluation time for all 22 queries. The last point, 831.711s, shows the predicted time on a database of size 1TB, assuming the evaluation time scales linearly w.r.t. the size of the database.

We compare DeALS with the current single-machine world records for the benchmark on databases of size 100GB and 1TB. VectorWise 2.0.1 evaluates all the queries in 22.8s on a database of size 100GB,10 while Microsoft SQL Server 2014 Enterprise Edition takes 138s on a database of size 1TB.11 Note that the SPECint_rate2006 of our test machine is 1050 (the larger, the more powerful), while the values are 695 and 2400 for the other two machines. Fig. 2 shows the predicted time of both commercial RDBMSs on our test machine, assuming the evaluation time is inversely proportional to the SPECint_rate2006 of the machine. DeALS is less than six times slower than these two highly optimized commercial systems (whose system settings are also optimized for the benchmark).

Exp-II: Recursive programs. We test the performance of DeALS on three classical recursive queries: tc (the program in (1)), attend (the program in Example 3) and sg (same generation), shown below.

    sg(X, Y) <- anc(P, X), anc(P, Y), X ≠ Y.                     (10)
    sg(X, Y) <- anc(A, X), sg(A, B), anc(B, Y).

The test datasets contain synthetic graphs and real-world graphs.

9 COUNT(DISTINCT) is replaced with COUNT in query16. ORDER BY and LIMIT are ignored in our programs.
The evaluation time would not change significantly if we added these constructs, since most queries return very few results, except query3 and query10.
10 TPC-H Result on Dell PowerEdge R720, http://www.tpc.org/3282.
11 TPC-H Result on Cisco UCS C460 M4 Server, http://www.tpc.org/3311.

Fig. 3. Query evaluation time of LogicBlox, DLV, DeALS-1 and DeALS-64 for transitive closure and same generation (on tree-11, grid-150 and gnp-10K) and attend (on Patents and Wiki), log scale. The numbers above the bars for LogicBlox (DLV, DeALS-1) show the speedup of DeALS-64 over LogicBlox (DLV, DeALS-1).

Synthetic Graphs 1) tree-11 is a randomly generated tree of depth 11 where the out-degree of a non-leaf vertex is a random number between 2 and 6. 2) grid-150 is a 151 × 151 square grid. 3) gnp-10K is a random graph (Erdős-Rényi model) of 10,000 vertices, generated by connecting vertices randomly such that the average out-degree of a vertex is 10.

Real-World Graphs 1) patent is the US patent citation graph.12 Each vertex represents a patent, and each edge represents a citation between two patents. It has 3,774,769 vertices and 16,518,948 edges. 2) wiki is the Wikipedia knowledge graph. Each vertex represents an entity in Wikipedia, and each edge represents an appearance of an entity in another entity's infobox. It has 3,165,181 vertices and 23,190,820 edges.

The experiments on tc (sg) evaluation use each of the synthetic graphs as arc (anc). The experiments on attend evaluation use each of the real-world graphs as friend, while organizer contains all the vertices in the graph whose in-degrees are zero. We compare DeALS with LogicBlox 4.1.9 (Aref et al. 2015) and DLV (Leone et al.
2006), two systems that represent the state of the art in evaluating logic programs in the areas of deductive databases and disjunctive logic programming, respectively. Both DeALS and LogicBlox support all three queries and evaluate them correctly on the test datasets. DLV runs out of memory on our test machine on the evaluation of sg on tree-11. The version of DLV13 that supports aggregates in recursion is a 32-bit executable, which fails on the evaluation of attend on both patent and wiki, as it cannot address the more than 4GB of memory required by the evaluation. Fig. 3 compares the evaluation time of the three systems on these recursive queries. The bars for LogicBlox show the evaluation time of LogicBlox using 64 processors.14 The bars for DLV show the evaluation time of DLV using one processor.15 The bars for DeALS-1 and DeALS-64 show the evaluation time of DeALS using one processor and 64 processors, respectively. When DeALS uses only one processor, it always outperforms DLV on these queries and datasets, and it matches or outperforms LogicBlox. DeALS always outperforms LogicBlox when it uses 64 processors.

12 Patent citation network, https://snap.stanford.edu/data/cit-Patents.html.
13 DLV with Recursive Aggregates, downloaded from http://www.dbai.tuwien.ac.at/proj/dlv/dlvRecAggr/dl-recagg-snapshot-2007-04-14.zip.
14 In our experiments, LogicBlox 4.1.9 does not utilize all the processors all the time.
15 The single-processor version of DLV is downloaded from http://www.dlvsystem.com/files/dlv.x86-64-linux-elf-static.bin. Although a parallel version (http://www.mat.unical.it/ricca/downloads/parallelground10.zip) is available, it is either much slower than the single-processor version or it fails, since it is a 32-bit executable that cannot address the more than 4GB of memory required by the evaluation.
It achieves a greater speedup (the speedup of DeALS-64 over DeALS-1) for tc and attend than for sg, since no lock is used in tc and attend, while sg suffers from lock contention. It achieves limited speedup for tc on tree-11, since the evaluation time is dominated by the barrier synchronization overhead (Yang et al. 2015).

6 Related Work

The parallel evaluation strategy proposed in this paper uses a simple hash-based data partitioning strategy. Various data partitioning strategies for parallel bottom-up evaluation have been studied in (Wolfson and Silberschatz 1988; Wolfson 1988; Cohen and Wolfson 1989; Seib and Lausen 1991; Ganguly et al. 1992; Zhang et al. 1995; Ganguly et al. 1995). These studies assume a message passing model and focus on minimizing the amount of message exchange, whereas our study considers a shared-memory model where no message exchange is needed during the evaluation; moreover, we demonstrate the effectiveness of our technique with a real Datalog system implementation, while previous studies focused on the theoretical aspects. The settings of (Raschid and Su 1986; Hulin 1989; Bell et al. 1991), which propose strategies for top-down evaluation, are more similar to ours; these strategies are complementary to ours, since we focus on bottom-up evaluation. Yet another related work is our ongoing research aimed at providing a distributed evaluation engine for DeALS; the study presented in this paper, however, focuses on the case where all the base relations and derived relations fit into the memory of a single machine.

7 Conclusion

In this paper, we presented the technique used in DeALS for parallel bottom-up evaluation on shared-memory multicore machines. The technique is simple and applicable to a wide range of non-recursive and recursive programs.
By adding a parallel evaluation module based on this technique, DeALS achieves competitive performance on non-recursive programs compared with RDBMSs, and superior performance on recursive programs compared with other existing systems. However, there is still a clear performance gap between DeALS and hand-written optimal programs, such as the SSC12 algorithm for transitive closure (Yang and Zaniolo 2014). We are working on reducing this gap by further optimizing the performance of the generated program. Another ongoing work is a parallel evaluation module targeted at distributed environments, able to solve problems where the dataset does not fit into the memory of a single machine. We are also working on optimizing the support for, and the performance of, new applications requiring data mining and graph analysis.

Acknowledgment

This work was supported by NSF grants IIS 1218471 and IIS 1118107. We would like to thank the reviewers and Matteo Interlandi for their comments. We thank LogicBlox, especially Martin Bravenboer, Dung Nguyen, and Yannis Smaragdakis, for their assistance with the LogicBlox comparison.

References

Aref, M., ten Cate, B., Green, T. J., Kimelfeld, B., et al. 2015. Design and implementation of the LogicBlox system. In SIGMOD 2015. ACM, 1371–1382.
Arni, F., Ong, K., Tsur, S., Wang, H., and Zaniolo, C. 2003. The deductive database system LDL++. TPLP 3, 1, 61–94.
Bell, D. A., Shao, J., and Hull, M. E. C. 1991. A pipelined strategy for processing recursive queries in parallel. Data & Knowledge Engineering 6, 5, 367–391.
Cohen, S. and Wolfson, O. 1989. Why a single parallelization strategy is not enough in knowledge bases. In PODS 1989. ACM, 200–216.
Dees, J. and Sanders, P. 2013. Efficient many-core query execution in main memory column-stores. In ICDE 2013. IEEE, 350–361.
Ganguly, S., Silberschatz, A., and Tsur, S. 1992. Parallel bottom-up processing of datalog queries.
The Journal of Logic Programming 14, 1, 101–126.
Ganguly, S., Silberschatz, A., and Tsur, S. 1995. Mapping datalog program execution to networks of processors. TKDE 7, 3, 351–361.
Hulin, G. 1989. Parallel processing of recursive queries in distributed architectures. In VLDB 1989. Morgan Kaufmann Publishers Inc., 87–96.
Leone, N., Pfeifer, G., Faber, W., Eiter, T., Gottlob, G., et al. 2006. The DLV system for knowledge representation and reasoning. TOCL 7, 3, 499–562.
Perri, S., Ricca, F., and Sirianni, M. 2013. Parallel instantiation of ASP programs: techniques and experiments. TPLP 13, 2, 253–278.
Raschid, L. and Su, S. Y. W. 1986. A parallel processing strategy for evaluating recursive queries. In VLDB 1986. Morgan Kaufmann Publishers Inc., 412–419.
Seib, J. and Lausen, G. 1991. Parallelizing datalog programs by generalized pivoting. In PODS 1991. ACM, 241–251.
Shkapsky, A., Yang, M., and Zaniolo, C. 2015. Optimizing recursive queries with monotonic aggregates in DeALS. In ICDE 2015. IEEE, 867–878.
Ullman, J. D. 1985. Implementation of logical query languages for databases. TODS 10, 3, 289–321.
Wolfson, O. 1988. Sharing the load of logic-program evaluation. In DPDS 1988. IEEE Computer Society Press, 46–55.
Wolfson, O. and Silberschatz, A. 1988. Distributed processing of logic programs. In SIGMOD 1988. ACM, 329–336.
Yang, M., Shkapsky, A., and Zaniolo, C. 2015. Parallel bottom-up evaluation of logic programs: DeALS on shared-memory multicore machines. Tech. Rep. 150003, UCLA CSD.
Yang, M. and Zaniolo, C. 2014. Main memory evaluation of recursive queries on multicore machines. In IEEE BigData 2014. IEEE, 251–260.
Zhang, W., Wang, K., and Chau, S.-C. 1995. Data partition and parallel evaluation of datalog programs. TKDE 7, 1, 163–176.