<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Magic of Pushing Extrema into Recursion: Simple and Powerful Datalog Programs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlo Zaniolo</string-name>
          <email>zaniolo@cs.ucla.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohan Yang</string-name>
          <email>yang@cs.ucla.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ariyam Das</string-name>
          <email>ariyam@cs.ucla.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Interlandi</string-name>
          <email>minterlandi@cs.ucla.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California</institution>
          ,
          <addr-line>Los Angeles</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We introduce a novel query optimization method based on pushing extrema and other integrity constraints into recursion. This optimization produces an efficient operational semantics for the goals declaratively specified by the original program that is stratified with respect to negation and aggregates. Complex algorithms can now be specified via simpler programs, and then implemented via optimized rules that use min and max aggregates in the seminaive fixpoint computation to achieve safe and efficient execution. This optimization dovetails with recent advances that entail the use of monotonic aggregates in recursion and leads to efficient implementation, as demonstrated by the scalable Datalog systems we have recently developed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A growing body of research on scalable data analytics has brought a renaissance
of interest in Datalog due to its ability to declaratively specify advanced
applications that execute efficiently over different architectures, including massively
parallel ones [
        <xref ref-type="bibr" rid="ref10 ref11 ref6 ref7 ref8 ref9">6–11</xref>
        ]. As part of this renaissance, a solution [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] was recently
proposed for the difficult problem of using aggregates in recursion which had
remained open for more than 20 years; the new solution enables the concise
expression and efficient support of graph algorithms and data mining methods. In
this short paper, we discuss an even more recent discovery that makes possible
a further simplification of these recursive programs, which is described next.
      </p>
      <p>Rather than writing programs with monotonic min and max in recursion, it
is often easier to write stratified programs where the recursive rules use no min
or max, which are instead used in the next stratum to define the correct logical
result. Indeed, in this paper we show that it is often simple for the compiler to
optimize the computation of those programs by moving the extrema into the
actual rules used in the fixpoint iteration. This optimization by constraint-pushing
approach produces a simple declarative semantics, combined with a very
efficient operational semantics—a combination that makes it easy to express concisely
and implement efficiently complex algorithms, as illustrated by Example 1 below.</p>
      <p>The syntax minhDyi in our example illustrates the head notation for
aggregates that is used in many Datalog systems, and follows SQL-2 approach of
allowing zero, one or more group-by variables for each aggregate.
(r1) path(Y; Dy)
(r2) path(Y; Dy)
(r3) spath(Y; minhDyi)
Thus in our example, Dy is the aggregate variable and Y is the group-by
variable. We will often refer to the min and max variables as cost variables. Now,
the semantics of extrema (i.e., min and max) can be easily reduced to that of
negation, whereby the last rule in Example 1 can be replaced by:</p>
      <sec id="sec-1-1">
        <title>Example 2. Min via negation in stratified programs.</title>
        <p>
          spath(Y; Dy)
betterpath(Y; Dy)
path(Y; Dy); :betterpath(Y; Dy):
path(Y; Dyy); Dyy &lt; Dy:
Re-expressing our min via negation also makes manifest the non-monotonic
nature of extrema aggregates, whereby their usage in recursive predicates is
incompatible with the declarative least-fixpoint semantics of the programs—a topic
that dovetails with the solution proposed in this paper and is discussed in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Stratification can be used to avoid the semantic problems caused by using
aggregates in recursion. For instance, in Example 1, spath belongs to a stratum
that is above that of path, whereby our program is assured to have a
perfectmodel semantics [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The perfect model of a stratified program is unique and can
be computed using an iterated fixpoint computation, whereby the least fixpoint is
computed starting at the bottom stratum and moving up to higher strata. In our
example, therefore, all the possible paths will be computed using rules r1 and r2,
before selecting values that are minimal using r3. This is the approach used by
current Datalog compilers, and it can be very inefficient or even non-terminating
when the original graph contains cycles. In this paper, we will show that, for
large classes of programs, the computation can be significantly optimized by
simply pushing constraints into the fixpoint computation. In Example 1 above,
the constraint is imposed by the last rule that, for each node produced, selects
the minimal value of its distance from a. This non-monotonic constraint can be
pushed into the recursive rules whereby our program becomes:
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Example 3. Optimized shortest path from node a.</title>
        <p>
          path(Y; minhDyi) arc(a; Y; Dy):
path(Y; minhDyi) path(X; Dx); arc(X; Y; Dxy); Dy = Dx + Dxy:
spath(Y; Dy) path(Y; Dy):
Because of the use of the non-monotonic constructs in its recursive rules, the
program so obtained does not have a declarative semantics. However, the
Immediate Consequence Operator (ICO) can still be defined for these rules, and so is
fixpoint iteration TP"!(;), that defines the operational semantics of these rules.
It can be shown that this operational semantics produces the same spath values
as those in the perfect model of the original program [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Thus this optimized
computation will be used for spath instead of the iterated-fixpoint computations
of perfect models normally used for stratified programs.
        </p>
        <p>
          Moreover this optimization can be applied to large classes of programs where
the rules have the simple properties described in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Two simple examples of
such programs are described next. In Example 4, we compute the connected
components of an undirected graph via label propagation. The final label of a
node Z is the minimum node id among all the nodes that can reach Z. All nodes
with the same label belong to the same component.
        </p>
        <p>Example 4. Connected components of an undirected graph.</p>
        <p>reach(X; X)
reach(X; Z)
cc(Z; minhXi)
arc(X; _):
reach(X; Y); arc(Y; Z):
reach(X; Z):
The pushing of the min constraint produces the following optimized program.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Example 5. HCC algorithm [2] in Datalog.</title>
        <p>reach(minhXi; X) arc(X; _):
reach(minhXi; Z) reach(X; Y); arc(Y; Z):
cc(Z; X) reach(X; Z):
Conjunction of Inequalities and Extrema Constraints. Let us now
consider a constraint involving a conjunction of a min with an inequality constraint
requiring the minimized value to not exceed a given K value.</p>
        <p>Example 6. Shortest path with min and upperbound constraints.
apath(Y; Dy) arc(a; Y; Dy):
apath(Y; Dy) apath(X; Dx); arc(X; Y; Dxy); Dy = Dx + Dxy; Dy &gt; Dx:
minboundpath(Y; minhDyi) apath(Y; Dy); aconstant(K); Dy &lt; K:
Then we have the following sound and complete rewriting w.r.t. minboundpath:
Example 7. Example 6 optimized by both inequality and min.</p>
        <p>apath(Y; minhDyi)
apath(Y; minhDyi)
minboundpath(Y; Dy)
arc(a; Y; Dy); aconstant(K); Dy &lt; K:
apath(X; Dx); arc(X; Y; Dxy); Dy = Dx + Dxy; Dy &gt; Dx;
aconstant(K); Dy &lt; K:
apath(Y; Dy):</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Correctness of Constraint Pushing into Recursion (CPR)</title>
      <p>A CPR-structured segment of a stratified program P , consists of (i) rules defining
one or more recursive predicates, and (ii) a constraint rule specifying one or more
constraints on the recursive predicates defined by (i). If we let denote the
constraints defined in (ii), then the CPR optimization consists in removing the
constraints from (ii) to add them to the rules in (i) defining the corresponding
recursive predicates. Then is applied to the rules or facts defining the recursive
predicates, and the old constraint rule becomes a copy rule that just renames
the atoms produced by the fixpoint iteration.</p>
      <p>If T is the ICO for the rules and facts defining the recursive predicates, then
the ICO for those rules and facts after they undergo the CPR optimization will
be denoted by T : T (I) = (T (I)).</p>
      <p>Observe that the results produced by the fixpoint iteration on T are sound
since they are a subset of those produced by the iterated fixpoint on the original
program. However, completeness, is also needed to assure the correctness of the
CPR optimization. Our strict requirement for completeness is that the following
property must hold for every positive integer n: T "n(;) = T "n(;): When this
property holds for each n, then T "!(;) = T "!(;) also holds.</p>
      <p>
        Simple sufficient correctness conditions that assure correctness for most
programs of interest are given in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For instance, the recursive rule in Example
6 is such that when both apath(X; Dx) and apath(X; Dx0) are true and Dx0 &lt; Dx,
then if the rule holds for the former, it must also hold for the latter making the
former expendable in the computation of min. However if we change Dy &lt; K into
Dy &gt; K the completeness of CPR might no longer hold. By using such criteria the
DeALS compiler [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is able to determine that the CPR optimization is correct
for all examples in this paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>
        The ability to specify complex algorithms by adding postconditions to simpler
recursive predicates is of interest in the specification of powerful algorithms and
their efficient implementation. In fact, as discussed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the CPR approach
tested on the sequential version of DeALS produces executions that are
incomparably faster than the iterated fixpoint and two or three times faster than their
more complex specification as XY-stratified programs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We are now working
on applying these ideas to achieve efficient and scalable support of graph and
data mining applications over diverse parallel platforms, as needed to
demonstrate the major benefits that our Big Datalog project is achieving through a
close integration of theoretical and system advances [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>We thank all the reviewers for their comments on improving the paper. This
work was supported in part by NSF grants IIS-1218471 and IIS-1302698.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Deductive</given-names>
            <surname>Application Language System</surname>
          </string-name>
          . http://wis.cs.ucla.edu/deals/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>U.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Tsourakakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          .
          <article-title>Pegasus: A peta-scale graph mining system implementation and observations</article-title>
          . In ICDM, pages
          <fpage>229</fpage>
          -
          <lpage>238</lpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazuran</surname>
          </string-name>
          , E. Serra, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          .
          <article-title>A declarative extension of horn clauses, and its significance for datalog and its applications</article-title>
          .
          <source>TPLP</source>
          ,
          <volume>13</volume>
          (
          <issue>4-5</issue>
          ):
          <fpage>609</fpage>
          -
          <lpage>623</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazuran</surname>
          </string-name>
          , E. Serra, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          .
          <article-title>Extending the power of datalog recursion</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>22</volume>
          (
          <issue>4</issue>
          ):
          <fpage>471</fpage>
          -
          <lpage>493</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Przymusinski</surname>
          </string-name>
          .
          <article-title>Perfect model semantics</article-title>
          .
          <source>In ICLP/SLP</source>
          , pages
          <fpage>1081</fpage>
          -
          <lpage>1096</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Lam</surname>
          </string-name>
          .
          <article-title>Distributed socialite: a datalog-based language for large-scale graph analysis</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>6</volume>
          (
          <issue>14</issue>
          ):
          <fpage>1906</fpage>
          -
          <lpage>1917</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Shkapsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Interlandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Condie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          .
          <article-title>Big Data Analytics with Datalog Queries on Spark</article-title>
          .
          <source>In SIGMOD. ACM</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Shkapsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zeng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          .
          <article-title>Graph queries in a next-generation datalog system</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>6</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1258</fpage>
          -
          <lpage>1261</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Halperin</surname>
          </string-name>
          .
          <article-title>Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>8</volume>
          (
          <issue>12</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>M. Yang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Shkapsky</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zaniolo</surname>
          </string-name>
          .
          <article-title>Parallel bottom-up evaluation of logic programs: DeALS on shared-memory multicore machines</article-title>
          .
          <source>In Technical Communications of ICLP</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          .
          <article-title>Main memory evaluation of recursive queries on multicore machines</article-title>
          .
          <source>In IEEE Big Data</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>260</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>C. Zaniolo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ceri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Faloutsos</surname>
            ,
            <given-names>R. T.</given-names>
          </string-name>
          <string-name>
            <surname>Snodgrass</surname>
            ,
            <given-names>V. S.</given-names>
          </string-name>
          <string-name>
            <surname>Subrahmanian</surname>
            , and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Zicari</surname>
          </string-name>
          .
          <article-title>Advanced database systems</article-title>
          . Morgan Kaufmann Publishers Inc.,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>C. Zaniolo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Interlandi</surname>
          </string-name>
          .
          <article-title>Improving the power, performance and usability of datalog by pushing constraints into recursion</article-title>
          .
          <source>Submitted for publication</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>