Technical Communications of ICLP 2015. Copyright with the Authors.




Thread-Aware Logic Programming for Data-Driven Parallel Programs

Flavio Cruz†‡, Ricardo Rocha‡, Seth Copen Goldstein†

† Carnegie Mellon University, Pittsburgh, PA 15213
(e-mail: {fmfernan,seth}@cs.cmu.edu)

‡ CRACS & INESC TEC and Faculty of Sciences, University of Porto
Rua do Campo Alegre, 1021/1055, 4169-007 Porto, Portugal
(e-mail: {flavioc,ricroc}@dcc.fc.up.pt)

submitted 29 April 2015; accepted 5 June 2015



                                              Abstract
Declarative programming in the style of functional and logic programming has been hailed
as an alternative parallel programming style where computer programs are automatically
parallelized without programmer control. Although this approach removes many pitfalls of
explicit parallel programming, it hides important information about the underlying parallel
architecture that could be used to improve the scalability and efficiency of programs. In this
paper, we present a novel programming model that allows the programmer to reason about
thread state in data-driven declarative programs. This abstraction has been implemented on
top of Linear Meld, a linear logic programming language that is designed for writing graph-
based programs. We present several programs that show the flavor of our new programming
model, including graph algorithms and a machine learning algorithm. Our goal is to show
that it is possible to take advantage of architectural details without losing the key advantages
of logic programming.
KEYWORDS: Parallel Programming, Declarative Programming, Coordination




                                          1 Introduction
Parallelism in functional (Blelloch 1996; Chakravarty et al. 2007) and logic (Gupta
et al. 2001) programming languages has been exploited through the use of implicit
parallelism, meaning that parallelism is not explicitly controlled by the program-
mer but by the underlying runtime system. Although these approaches remove
many pitfalls of explicit parallel programming, they hide important information
about the underlying parallel architecture that could be used to improve the scal-
ability and efficiency of programs, thus leaving very little opportunity for the
programmer to optimize the parallel aspects of the program. This is unfortunate
since some programs are computed more efficiently if information about the par-
allel architecture (e.g., the number of cores) is exposed to the code being executed.
It would be far better if the programmer could reason declaratively not only about
the problem at hand but also about the underlying parallel architecture.
   Linear Meld (LM) is a linear logic programming language especially suited for
the parallel implementation of graph-based algorithms (Cruz et al. 2014). LM of-
fers a concise and expressive framework that has been applied to a wide range of
graph-based problems and machine learning algorithms, including: belief propa-
gation with and without residual splash (Gonzalez et al. 2009), PageRank, graph
coloring, N-Queens, shortest path, diameter estimation, map reduce, quick-sort,
neural network training, minimax, and many others.
  In this paper, we present an extension to LM that allows the programmer to
reason about the underlying parallel architecture. The extension introduces the
graph of threads, whose vertices map to the concrete threads executing on the system, as
computable entities of the language. It is then possible to derive logical facts about
the threads and to write inference rules that reason about the state of the threads
along with the state of the program. This novel programming model allows the
writing of declarative code that is both data-driven and architecture-driven. As
we will see throughout the paper, this allows us to easily optimize programs that
could not have been improved otherwise.


                                      2 Language
Linear Meld (LM) is a logic programming language in the style of Datalog (Ramakrishnan and Ullman 1993) that offers a declarative and structured way to manage mutable state. LM is based on linear logic (Girard 1995), a logic system where truth is ephemeral and can be consumed when used to prove propositions. Like Datalog, LM is a forward-chaining logic programming language since computation is driven by a set of inference rules that are used to update a database of logical facts that represent the state of the program. Unlike Datalog, logical facts can be asserted and retracted freely; therefore, inference rules can also retract facts. Moreover, each LM rule has a pre-defined priority that is inferred from its position in the source code, which forces higher priority rules to be applied first. The program stops when quiescence is achieved, i.e., when inference rules no longer apply.
   An inference rule of the form a(M, N), b(M) -o c(N) has the following meaning: if the database has facts a(M, N) and b(M) then an instance of both is retracted and a copy of c(N) is asserted. a(M, N) and b(M) form the body of the rule and c(N) the head, while -o expresses a linear implication.

 1   type left(node, node).
 2   type right(node, node).
 3   type found(node, int, string).
 4   type not-found(node, int).
 5   type linear value(node, int, string).
 6   type linear lookup(node, int).
 7
 8   // We found the key we want
 9   lookup(A, K),
10   value(A, K, Value)
11      -o value(A, K, Value),
12         !found(A, K, Value).
13
14   // The key must be to the left
15   lookup(A, K),
16   value(A, NKey, Value),
17   K < NKey,
18   !left(A, B)
19      -o value(A, NKey, Value),
20         lookup(B, K).
21
22   // The key must be to the right
23   lookup(A, K),
24   value(A, NKey, Value),
25   K > NKey,
26   !right(A, B)
27      -o value(A, NKey, Value),
28         lookup(B, K).
29
30   // The key cannot be found
31   lookup(A, K)
32      -o !not-found(root, K).
33
34   // Initial axioms (...)
35   lookup(root, 6).

Fig. 1. LM program for performing lookups in a BST dictionary.

   In addition to linear facts,
rules in LM can also use persistent facts which are never retracted. Such facts are
preceded by a !, e.g., !left in line 18 of Fig. 1. Facts are instantiations of predicates
and they map predicate arguments (which are typed) to concrete values. The type
system of LM includes scalar types such as node, int, float, string, bool. Recursive
types are also included, e.g., list X, pair X; Y and array X.
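As a small illustrative sketch (these declarations are hypothetical and not part of the program in Fig. 1), predicates over such types are declared in the same style as in Fig. 1:

   type linear key-list(node, list int).   // a linear fact holding a list of integer keys
   type entry(node, pair int; string).     // a persistent fact holding a key/value pair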
   In Fig. 1, we present an LM program that implements the lookup operation
in a binary search tree (BST) dictionary represented as key/value pairs. Lines 1-6
declare the predicates that are going to be used in the program, which include
four persistent predicates (left, right, found and not-found) and two linear
predicates (value and lookup). The predicate value assigns a key/value pair to a
tree node and lookup represents a lookup operation for a given key. Each node of
the tree contains a key and a value. The nodes are connected using left and right
branches, which take the search property of the tree into account.
The algorithm uses four rules to handle the cases that arise when looking up a given key K
at a given node A. The first rule finds the key, meaning that the current node has the
key we want (lines 8-12). The second rule detects that the key must be located in
the left branch of tree (lines 14-20), while the third rule detects that the key must be
in the right branch of the tree (lines 22-28). Finally, the fourth rule, which is inferred
if nothing else matches, represents the case where the node has no branches or one
of the branches available cannot possibly contain the required key (lines 30-32). At
the end, we have the initial axioms, which include the left and right axioms (not
shown) and a lookup(root, 6) axiom for looking up the value with key 6.
   Figure 2 illustrates the initial and final states of the example program. Note that
we have partitioned the database by the first argument of each fact. In Fig. 2(a),
we present the initial database filled with the program’s axioms. Execution follows
through the right branch twice using rule 3 since 6 > 3 and 6 > 5. We finally reach
node 7 and apply rule 1, which derives fact !found(7, 6, g). The final state of
the program is shown in Fig. 2(b).



[Figure: two tree diagrams of the seven-node BST database, partitioned by node. (a) Initial state: each node holds its axioms, e.g., !left(1, 2), !right(1, 3), value(1, 3, a) and lookup(1, 6) at the root. (b) Final state: node 7 additionally holds !found(7, 6, g).]

Fig. 2. Initial and final states of the BST dictionary program.

                                     3 Semantics
An LM program consists of rules which manipulate facts. The facts can be viewed as a graph where the first argument, of type node, specifies the node of the graph to which the fact belongs. The remaining arguments of a fact, if any, describe the properties of the node including, if the argument is of type node, the edges in the graph. Rule inference is done concurrently by independently deriving rules on different nodes of the graph. Further, rules may only manipulate facts belonging to the same node. However, head expressions may refer to other nodes, as long as those nodes are instantiated in the body.
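For instance, the following minimal sketch (visit and !edge are hypothetical predicates, in the style of the later examples) shows a rule that matches facts of node A only, yet derives a fact at a neighbor B, which is allowed because B is instantiated in the body:

   // Consume a visit fact at node A and derive one at a neighbor B.
   visit(A),
   !edge(A, B)
      -o visit(B).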
In LM's implementation, nodes are processed in parallel by threads that map to computing cores in a shared-memory setting. The graph of nodes is initially partitioned among threads, but nodes can be stolen from other threads for load balancing. Nodes are either active or inactive. Active nodes contain new logical
facts that have not been processed yet (i.e., may have applicable inference rules)
and inactive nodes have been processed (i.e., no inference rules are applicable at
the moment). In the case where all the nodes assigned to a given thread are inactive
then the thread is allowed to steal active nodes from another thread. When every
node becomes inactive, all the threads will go idle and the program terminates.
   LM already provides special predicates, called coordination predicates, that allow
the programmer to manage scheduling and data partitioning in order to improve
performance. Coordination predicates fall into two groups: scheduling predicates and
partitioning predicates. Scheduling and partitioning predicates are also classified into
action predicates and sensing predicates. The sensing predicates are used to obtain
information from the runtime system in the body of the rule. Action facts in the
head of the rule manipulate the runtime system.
Scheduling predicates change how nodes are selected from the work queue. Every node can have a priority value which is used by the threads to choose nodes from the work queue. Action predicates such as set-priority and remove-priority change the priority of a node, while schedule-next puts a node at the head of the work queue; sensing predicates such as priority(node A, float P) indicate that node A has priority P. Note that sensing facts do not need to follow the body constraints mentioned earlier.
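As an illustrative sketch (new-result is a hypothetical predicate, and we assume set-priority takes a node and a float priority, per the description above), a rule can raise the priority of all neighbors of a node so that their threads select them earlier:

   // When node A produces a new result, bump the priority of every
   // neighbor B so that the threads process those nodes sooner.
   new-result(A)
      -o {B | !edge(A, B) | set-priority(B, 1.0)}.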
Partitioning predicates change how nodes are placed on threads. In terms of action predicates, we have a set-thread predicate, which changes the current thread of a given node, and set-static, which disallows work stealing. The sensing predicates allow the programmer to reason about the placement of the node, including knowing the thread where the node is currently placed. We also have the predicate just-moved, which is derived after the use of set-thread.
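For instance, a sketch in the style of the partitioning rules of Fig. 9 (pin and !coord are hypothetical predicates, and grid is assumed to map coordinates to a thread) that statically assigns a node to a thread chosen from its coordinates:

   // Pin node A to the thread computed from its grid coordinates and
   // disallow work stealing for it.
   !coord(A, X, Y), pin(A)
      -o set-static(A), set-thread(A, grid(X, Y)).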


                                4 Thread Local Facts
As we have seen, the first argument of every fact must be typed as node. This
enforces locality of facts to a particular node. While this restriction is the foundation
for implicit parallelism in LM, it limits how much information can be shared between
nodes. The existence of coordination predicates brings some awareness about the underlying parallel system, including node scheduling and placement; however, it is limited in the sense that the programmer is not able to reason directly about the state of the thread but only about the state of the node.
   In order to remove this limitation, we introduce the concept of thread facts, which are logical facts stored at the thread level, meaning that each thread is now an entity with its own logical facts. We extend the type system to include the type thread, which is the type of the first argument of thread predicates, indicating that the predicate relates to, and is stored in, a specific thread. We can view the thread facts as forming a separate graph from the data graph: a graph of the processing elements which are operating on the data graph.
   The introduction of thread facts increases the expressiveness of the system in the sense that it is now possible to write inference rules that reason about the state of the threads. This creates optimization opportunities since we can now write algorithms with global information stored in the thread, while keeping the LM language fully declarative. Moreover, threads are now allowed to explicitly communicate with each other which, in conjunction with coordination predicates, enables the writing of complex scheduling policies.
   We discriminate between two new types of inference rules. The first type is the thread rule, which has the form a(T), b(T) -o c(T) and can be read as: if thread T has facts a(T) and b(T) then derive fact c(T). The second type is the mixed rule, which has the form a(T), d(N) -o e(N) and can be read as: if thread T is executing node N and has the fact a(T) and node N has the fact d(N), then derive e(N) at node N. Thread rules reason solely at the thread level, while mixed rules allow reasoning about both thread and node facts. Logically, the mixed rule uses an extra fact running(T, N), which indicates that thread T is currently executing node N. The running fact is implicitly retracted and asserted every time the thread selects a different node for execution. This makes our implementation efficient since a thread does not need to look for nodes that match mixed rules; it is the scheduling of the program that drives the matching of such rules. Figure 3 represents a schematic view of the graph data structure of a program with two threads: thread T1 is executing node 1 and T2 is executing node 4. Both threads have access to all the facts in the thread itself and to the corresponding node facts.

[Figure: a data graph of six nodes holding node facts such as a(1), e(1, 2) and b(3), plus two thread entities holding thread facts, e.g., f(T1), g(T1), running(T1, 1) for Thread 1 and f(T2), running(T2, 4) for Thread 2.]

Fig. 3. A program being executed with two threads. Note that each thread has a running fact that stores the node currently being executed.
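As a concrete sketch with hypothetical predicates, the first rule below is a thread rule that folds a sample into a thread-level total, while the second is a mixed rule in which thread T, while executing node A, records itself as the owner of A:

   // Thread rule: both facts live on thread T.
   sample(T, X), total(T, Y)
      -o total(T, X + Y).

   // Mixed rule: fires only while T is executing node A (via the
   // implicit running(T, A) fact); derives owner(A, T) at node A.
   tag(T), visit(A)
      -o tag(T), owner(A, T).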
   To show how the new extension works, we are going to take the BST example
shown in Fig. 1 and improve it by using thread facts. We assume that there is a BST
and a sequence of n lookup operations for different keys in the BST (which may or
may not be repeated). A single lookup has worst-case time complexity O(h) where
h is the height of the BST, therefore n lookups can take O(h × n) time. In order to
improve the execution time of the program, we can cache the search operations
so that repeated lookup operations become faster. Instead of traversing the entire
height of the BST, we look in the cache and send the lookup operation immediately
to the node where the key is located. Without thread facts, we might have cached
the results at the root node; however, this is not a scalable approach as it would
introduce a bottleneck.
   Figure 4 shows the updated BST code with a thread cache. We added two new predicates, cache and cache-size, that are facts placed in the thread and represent cached keys and the total size of the cache, respectively. We also added three new rules that handle the following cases: (i) a key is found and is also in the cache; (ii) a key is found but is not in the cache; and (iii) a key is in the cache, therefore a lookup fact is derived in the target node. Note that it is easy to extend the cache mechanism to use an LRU-style approach in order to limit the size of the cache.

 1   type linear cache(thread, node, int).
 2   type linear cache-size(thread, int).
 3
 4   // Key exists and is also in cache
 5   lookup(A, K),
 6   value(A, K, Value),
 7   cache(T, A, K)
 8      -o value(A, K, Value),
 9         !found(A, K, Value),
10         cache(T, A, K).
11
12   // Key exists and is not in cache
13   lookup(A, K),
14   value(A, K, Value),
15   cache-size(T, Total)
16      -o value(A, K, Value),
17         !found(A, K, Value),
18         cache-size(T, Total + 1),
19         cache(T, A, K).
20
21   // Cached by the thread
22   lookup(A, K),
23   cache(T, TargetNode, K)
24      -o lookup(TargetNode, K),
25         cache(T, TargetNode, K).
26
27   // Remaining rules (...)

Fig. 4. LM program for performing lookups in a BST with a thread cache.

   In order to understand how well the new program performs, we experimented with a binary tree with 17 levels and 100000 lookup operations. In our experimental setup, we used a machine with four AMD Six-Core Opteron 8425 HE (2100 MHz) chips (24 cores) and 64 GB of DDR-2 667 MHz (16x4GB) RAM, running GNU/Linux (kernel 3.15.10-201, 64 bits). We compiled our code using GCC 4.8.3 (g++) with the flags -O3 -std=c++11 -fno-rtti -march=x86-64.¹
   The scalability results shown in Fig. 5 are presented by comparing the run time
of different versions against the run time of the regular version (without thread
facts) using 1 thread. The results show that caching brings improved scalability
and reduced run time due to pruned paths that would need to be searched without
a cache. For example, when using a single thread, the cached version achieves a
2-fold speedup over the regular version and when using 16 threads, it achieves a
16-fold speedup over the regular version using 1 thread.
   This program shows that it is possible to use architectural details in a declarative
style to improve the scalability of programs. With a few extra facts stored at the
thread level and a few extra logical rules, we were able to significantly improve
the linear logic program, while remaining fully declarative.


¹ Implementation and programs available at http://github.com/flavioc/meld

                               5 Further Applications

                               5.1 Graph Reachability

Consider the problem of checking whether a set of nodes S in a graph G is reachable from an arbitrary node N. An obvious solution to this problem is to start at N, gather all the neighbor nodes into a list and then recursively visit all those reachable nodes, until S is covered. This reduces to performing a breadth-first or depth-first search on graph G. However, this solution is sequential and offers little concurrency. An alternative solution to the problem is to recursively propagate the search to all neighbors and aggregate the results in the node where the search started.

[Figure: scalability plot.]

Fig. 5. Scalability results comparing the regular and cached BST programs. The dataset is a tree with 262000 nodes and 100000 lookup operations distributed among 12000 nodes.

   Unfortunately, there are still some issues with this solution. First, as the search process goes on, there is no sharing of the nodes that were already found, since only the start node actually stores which nodes have been found. It would be prohibitively expensive to share such information between nodes.
Second, once the search has reached all the required nodes, the search process will
not stop, exploring unneeded nodes. Fortunately, we can use thread facts to solve
both these issues. The search process is still done concurrently as before, but the
search state is now stored in each thread, allowing the thread to store partial results
and coordinate with other threads. The code for this coordinated version is shown in
Fig. 6. The axioms (not shown) represent the search facts, each containing the Id
of the search and the list of nodes to reach.
   Lines 1-4 start the search process by assigning a thread Owner to search Id using
the persistent fact !thread-list which contains the list of all available threads
in the system (the total number of threads is given by @threads). In line 3, a
fact thread-search is created for all threads using a comprehension, which is a
construct made of three parts that, for the variables in the first part (T2), iterates over
the facts in the second part (!thread(T, T2)) and derives the facts in the third
part (thread-search(T2, Id, ToReach, Owner)). We use predicate do-search to
propagate the search through the graph and a predicate visited to mark nodes
already processed for a specific search. The two rules in lines 14-27 propagate the
search process to the neighbor nodes and check if the current node is part of the
list of nodes we want to reach.
   An interesting property of this version is that each owner thread responsible for
a search keeps track of the remaining nodes that need to be reached. In line 18,
we derive remove-thread-search in order to inform owner threads about new
reachable nodes. Once an owner thread detects that all nodes have been reached
(lines 32-34), all the other threads will know that and update their search state
1       search(A, Id, ToReach),
2       !thread-list(T, L), Owner = nth(L, Id % @threads)               // Allocate search to a thread
3          -o {T2 | !thread(T, T2) | thread-search(T2, Id, ToReach, Owner)},
4             do-search(A, Id).
5
6       thread-search(T, Id, [], Owner),                                       // Nothing left to find
7       do-search(A, Id)
8          -o thread-search(T, Id, [], Owner).
9
10      do-search(A, Id),
11      visited(A, Id)                                                              // Already visited
12         -o visited(A, Id).
13
14      do-search(A, Id),
15      thread-search(T, Id, ToReach, Owner),
16      !value(A, Val), Val in ToReach                                               // New node found
17         -o thread-search(T, Id, remove(ToReach, Val), Owner),
18            remove-thread-search(Owner, Id, Val),                     // Tell owner thread about it
19            {B | !edge(A, B) | do-search(B, Id)},
20            visited(A, Id).
21
22      do-search(A, Id),
23      thread-search(T, Id, ToReach, Owner),
24      !value(A, Val), ~ Val in ToReach                                    // Node is not on the list
25         -o thread-search(T, Id, ToReach, Owner),
26            visited(A, Id),
27            {B | !edge(A, B) | do-search(B, Id)}.
28
29      remove-thread-search(T, Id, Val), thread-search(T, Id, ToReach, Owner)
30         -o thread-search(T, Id, remove(ToReach, Val), Owner),
31            check-results(T, Id).
32      check-results(T, Id), thread-search(T, Id, [], Owner)
33         -o thread-search(T, Id, [], Owner),
34            {B | !other-thread(T, B) | signal-thread(B, Id)}.
35       check-results(T, Id), thread-search(T, Id, ToReach, Owner), ToReach <> []
36         -o thread-search(T, Id, ToReach, Owner).
37      signal-thread(T, Id), thread-search(T, Id, ToReach, Owner)     // Thread knows search is done
38         -o thread-search(T, Id, [], Owner).


                 Fig. 6. Coordinated version of the reachability checking program.


 accordingly (lines 37-38). When every thread knows that all nodes were reached,
 they will consume do-search facts (lines 6-8), effectively pruning the search space.
    Experimental results for the graph reachability program are shown in Fig. 7.
 The Random dataset is a dense graph, which makes it one of the best performing
 datasets. In this dataset, the coordinated version reaches a 16-fold speedup with
 20 threads versus a 12-fold speedup for the regular version. Twitter and Pokec
are datasets generated from real data² and the coordinated version is able to scale
 well up to 20 threads, while the regular version shows some issues once it uses
 more than 16 threads. This is because the number of searches is relatively small
 and the nodes where the searches started have some contention since those nodes
 accumulate the search results. This behavior is not seen in the coordinated version
 because searches are equally distributed among the threads.
    The Pokec dataset is noteworthy because the coordinated version is almost twice
 as fast as the regular version when using only 1 thread. We think that such run
 time improvement happens because the number of searches is small, which makes
 it easier to perform the joins in lines 14-15.


² Twitter and Pokec datasets retrieved from http://snap.stanford.edu/data/




[Figure: three scalability plots. (a) Random: 50,000 nodes, 1,000,000 edges and 20 searches. (b) Twitter: 81,306 nodes, 1,768,149 edges and 50 searches. (c) Pokec: 1,632,803 nodes, 30,622,564 edges and 5 searches.]

Fig. 7. Scalability results for the graph reachability program using different datasets and numbers of searches. All searches in the datasets target around 2-5% of all nodes in the graph.


  This graph reachability program shows how to introduce complex coordination
policies between threads by reasoning about the state of each thread. In addition,
the use of linear logic programming makes it easier to prove properties of the
program since computation is done by applying controlled changes to the state.


                                  5.2 PowerGrid Problem
Consider a powergrid with C consumers and G generators. We are interested in
connecting each consumer to a single generator, but each generator has a limited
capacity and the consumer draws a certain amount of power from the generator. A
valid powergrid is built in such a way that all consumers are served by a generator
and that no generator is being overdrawn by too many consumers. Although
consumers and generators may be connected through a complex network, in this
section we analyze the case where any consumer can be served by any generator.
   A straightforward distributed implementation for the powergrid problem re-
quires that each consumer is able to connect to any generator. Once a generator
receives a connection request, it may or may not accept it. If the generator has no
power available for the new consumer, it will disconnect from it and the consumer
must select another generator. If a generator initiates too many disconnections, then
it disconnects all its consumers in order to restart the process. This randomized
algorithm works but can take a long time to converge, depending on the amount
of power available in the generators.
   The issue with the straightforward distributed implementation is that it lacks
a global view of the problem or requires a more complicated synchronization
algorithm between consumers and generators. As we have seen before, thread
local facts are an excellent mechanism to introduce a global view of the problem
without complicating the original algorithm written in a declarative style. In our
solution, we partition the set of generators G among the threads in the system. With
this partitioning, each thread assumes the ownership of its generators and is able
to process consumers with a global view over its set of generators. The thread can
then immediately assign the consumers to its generators when possible, otherwise
it uses the regular distributed algorithm.
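A minimal sketch of this thread-level fast path could look as follows (the predicates connect, generator-capacity and connected are hypothetical names used only for illustration, and we assume integer capacities and power draws):

   // Fast path: the thread T owning generator G serves consumer C
   // directly when enough capacity remains; otherwise the regular
   // distributed protocol (not shown) takes over.
   connect(C, Power),
   generator-capacity(T, G, Capacity),
   Capacity >= Power
      -o generator-capacity(T, G, Capacity - Power),
         connected(C, G).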

   Figure 8 shows the scalability results for the powergrid program when using 2000 generators and 50000 consumers. We varied the total capacity of the generators in relation to the power required by all consumers. In Fig. 8(a), the power of the generators is 101.6% of that required by the consumers, and in Fig. 8(b) we reduced the gap to only 100.8% in order to see how the two programs would behave in a reduced power situation. As shown in the figures, the optimized version performs best when the power is reduced and is almost twice as fast as the regular version when using just 1 thread.

[Figure: two scalability plots. (a) Normal power (101.6%). (b) Reduced power (100.8%).]

Fig. 8. Scalability results for the powergrid program.



                            5.3 Splash Belief Propagation

Approximation algorithms can obtain significant benefits from using optimal eval-
uation strategies due to their inherent non-determinism. A good example is the
Loopy Belief Propagation (LBP) program (Murphy et al. 1999). LBP is an approx-
imate inference algorithm used in graphical models with cycles. LBP is a sum-
product message passing algorithm where nodes exchange messages with their
immediate neighbors and apply computations to the messages received.
   The most basic evaluation strategy for LBP is to update the belief values of
nodes in synchronous iterations. First, the beliefs of all nodes are computed and
then sent to the neighbor nodes in rounds, requiring synchronization after each
round. Another computation strategy is to compute the beliefs asynchronously,
by using partial information at the node level. The asynchronous approach is an
improvement over the synchronous version because it leads to faster convergence
time. An improved evaluation strategy is the Splash Belief Propagation (SBP)
program (Gonzalez et al. 2009), where belief values are computed by first building
a tree and then updating the beliefs of each node twice, first from the leaves to the
root and then the reverse. These splash trees are built by starting at a node whose
belief changed the most in the last update. The trees must be built iteratively until
convergence is achieved.
   In an environment with T threads, it is then possible to build T splash trees con-
currently. First, we partition the nodes into T regions and then assign each region
to a thread. Each thread is then responsible for iteratively building splash trees on
that region until convergence is reached. Figure 9 shows the LM implementation
for the SBP program.
   The program starts in lines 3-7 by partitioning the nodes into regions using
set-thread and by creating the first splash tree (line 7) using start-tree(T). The
 1   partitioning(T, @world / @threads).                            // Move @world/@threads nodes
 2
 3   !coord(A, X, Y), start(A)                                                // Moving this node
 4      -o set-static(A), set-thread(A, grid(X, Y)).
 5   just-moved(A), partitioning(T, Left)                         // Thread received another node
 6      -o partitioning(T, Left - 1).
 7   partitioning(T, 0) -o start-tree(T).
 8
 9   start-tree(T), priority(A, P), P > 0.0                                       // Tree building
10      -o priority(A, P), expand-tree(T, [A], []).
11   expand-tree(T, [A | All], Next)
12      -o thread-id(A, Id),
13         [collect => L | !edge(A, L), ~ L in All, ~ L in Next, priority(L, P), P > 0.0,
14            thread-id(L, Id2), Id = Id2 | priority(L, P), thread-id(L, Id2) |
15            new-tree(T, [A | All],
16               if len(All) + 1 >= maxnodes then [] else Next ++ L end)].
17
18   new-tree(T, [A | All], [])
19      -o schedule-next(A), first-phase(T, reverse([A | All]), [A | All]).
20   new-tree(T, All, [B | Next])
21      -o schedule-next(B), expand-tree(T, [B | All], Next).
22
23   first-phase(T, [A], [A]), running(T, A)                                        // First phase
24      -o running(T, A), update(A), remove-priority(A), start-tree(T).
25   first-phase(T, [A, B | Next], [A]), running(T, A)
26      -o running(T, A), update(A), schedule-next(B), second-phase(T, [B | Next]).
27   first-phase(T, All, [A, B | Next]), running(T, A)
28      -o running(T, A), update(A), schedule-next(B), first-phase(T, All, [B | Next]).
29
30   second-phase(T, [A]), running(T, A)                                           // Second phase
31      -o running(T, A), update(A), remove-priority(A), start-tree(T).
32   second-phase(T, [A, B | Next]), running(T, A)
33      -o running(T, A), update(A), schedule-next(B), second-phase(T, [B | Next]).


          Fig. 9. LM implementation for the Splash Belief Propagation program.


algorithm is then divided into three main phases, named tree building, first phase and
second phase, as described next:

Tree building Tree building starts after the rule in lines 9-10 is fired. Since the
  thread always picks the highest-priority node, we start by adding that node to the
  list that represents the tree. In lines 13-16, we use an aggregate (Cruz et al. 2014)
  to gather all the neighbor nodes that have a positive priority (due to a new belief
  update) and are in the same thread. Nodes are collected into list L and appended
  to list Next (line 16).
First phase When the number of nodes in the tree reaches a certain limit, a
  first-phase is generated to update the beliefs of all nodes in the tree (line 19).
  As the nodes are updated, starting from the leaves and ending at the root, an
  update fact is derived to update the belief values (lines 26 and 28).
Second phase The computation of beliefs is performed from the root to the leaves
  and the belief values are updated a second time (lines 31 and 33).

  GraphLab (Low et al. 2010), a C++ framework for writing machine learning
algorithms, provides the splash scheduler as part of its framework. To put LM’s
implementation in perspective, we thus measured the behavior of LBP and SBP
for both LM and GraphLab. Figure 10 shows the results. We can observe that both
systems have very similar behavior as the number of threads varies, but
for higher numbers of threads and, in particular, for more than 15 threads, LM shows
better speedups than GraphLab. In terms of running times, LM is, on average, 1.4
times slower than GraphLab, although the LM program code is more concise.



                              6 Concluding Remarks
Logic programming, as a declarative style of programming, has been a fertile research field in terms of parallel models (Gupta et al. 2001; Rocha et al. 2005). These approaches have been relatively successful in parallelizing regular Prolog programs; however, the programmer has little control over the scheduling strategy and policies. An exception is the proposal by Casas et al., which exposes execution primitives for and-parallelism (Casas et al. 2007), allowing for different scheduling policies. Compared to LM, this approach offers more fine-grained control over parallelism but has limited support for reasoning about thread state.

[Figure: two scalability plots, (a) LBP and (b) SBP.]

Fig. 10. Experimental results for the Loopy Belief Propagation (LBP) and Splash Belief Propagation (SBP) programs using the LM and GraphLab systems.
   The new LM extension shares some similarities with the Linda coordination lan-
guage (Ahuja et al. 1986). Linda implements a data-driven coordination model and
features a tuple space that can be manipulated by a set of processes. Like LM, those
processes communicate through the tuple space, by deriving tuples, which are akin
to logical facts. The programmer is able to create arbitrary communication patterns
by reading and writing to the tuple space. Unfortunately, Linda is implemented
on top of other languages and, by itself, it is not a declarative language.
   Galois (Pingali et al. 2011; Nguyen and Pingali 2011) and Elixir (Prountzos et al.
2012) are parallel language models that allow the specification of different schedul-
ing strategies in order to optimize programs. The programmer first writes the basic
program and then a scheduler specification changes how threads prioritize compu-
tation. However, these specifications only reason about the data being computed
and not about the parallel architecture.
   In this work, we have extended the LM language with a declarative mechanism
for reasoning about the underlying parallel architecture. LM programs can be first
written in a data-driven fashion and then optimized by reasoning about the state of
threads, thus enabling the implementation of more efficient evaluation strategies.
This novel mechanism enables the programmer to take advantage of both implicit
and explicit parallelism by using the principles of logic programming without
introducing extra language constructs.

                                    Acknowledgments
This work is partially funded by the North Portugal Regional Operational Pro-
gramme (ON.2 O Novo Norte) and by the National Strategic Reference Frame-
work (NSRF), through the European Regional Development Fund (ERDF) and the
Portuguese Foundation for Science and Technology (FCT), within projects NORTE-
07-0124-FEDER-000059 and UID/EEA/50014/2013.


                                         References
Ahuja, S., Carriero, N., and Gelernter, D. 1986. Linda and friends. Computer 19, 8, 26–34.
Blelloch, G. E. 1996. Programming parallel algorithms. Communications of the ACM 39,
  85–97.
Casas, A., Carro, M., and Hermenegildo, M. V. 2007. Towards high-level execution primi-
  tives for and-parallelism: Preliminary results.
Chakravarty, M. M. T., Leshchinskiy, R., Jones, S. P., Keller, G., and Marlow, S. 2007. Data
  Parallel Haskell: A status report. In Workshop on Declarative Aspects of Multicore Programming.
  New York, NY, USA, 10–18.
Cruz, F., Rocha, R., Goldstein, S., and Pfenning, F. 2014. A Linear Logic Programming
  Language for Concurrent Programming over Graph Structures. Journal of Theory and
  Practice of Logic Programming, International Conference on Logic Programming, Special Is-
  sue abs/1405.3556, 493–507.
Girard, J.-Y. 1995. Linear logic: Its syntax and semantics. In Advances in Linear Logic. New
  York, NY, USA, 1–42.
Gonzalez, J., Low, Y., and Guestrin, C. 2009. Residual splash for optimally parallelizing
  belief propagation. In Artificial Intelligence and Statistics. Clearwater Beach, Florida.
Gupta, G., Pontelli, E., Ali, K. A. M., Carlsson, M., and Hermenegildo, M. V. 2001. Parallel
  execution of Prolog programs: A survey. ACM Transactions on Programming Languages and
  Systems 23, 4, 472–602.
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. M. 2010.
  GraphLab: A new framework for parallel machine learning. In Conference on Uncertainty
  in Artificial Intelligence. Catalina Island, California, 340–349.
Murphy, K. P., Weiss, Y., and Jordan, M. I. 1999. Loopy belief propagation for approximate
  inference: An empirical study. In Conference on Uncertainty in Artificial Intelligence. San
  Francisco, CA, USA, 467–475.
Nguyen, D. and Pingali, K. 2011. Synthesizing concurrent schedulers for irregular algo-
  rithms. In International Conference on Architectural Support for Programming Languages and
  Operating Systems. New York, NY, USA, 333–344.
Pingali, K., Nguyen, D., Kulkarni, M., Burtscher, M., Hassaan, M. A., Kaleem, R., Lee,
  T.-H., Lenharth, A., Manevich, R., Méndez-Lojo, M., Prountzos, D., and Sui, X. 2011.
  The tao of parallelism in algorithms. SIGPLAN Not. 46, 6 (June), 12–25.
Prountzos, D., Manevich, R., and Pingali, K. 2012. Elixir: A system for synthesizing con-
  current graph programs. In ACM International Conference on Object Oriented Programming
  Systems Languages and Applications. Tucson, Arizona, USA, 375–394.
Ramakrishnan, R. and Ullman, J. D. 1993. A survey of research on deductive database
  systems. Journal of Logic Programming 23, 125–149.
Rocha, R., Silva, F., and Costa, V. S. 2005. On applying or-parallelism and tabling to logic
  programs. Journal of Theory and Practice of Logic Programming 5, 1 & 2, 161–205.