<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Genova - Nervi, Italy
$ bembenek@g.harvard.edu (A. Bembenek); michael@greenberg.science (M. Greenberg);
chong@seas.harvard.edu (S. Chong)
 https://people.seas.harvard.edu/~bembenek/ (A. Bembenek); https://mgree.github.io/ (M. Greenberg);
https://people.seas.harvard.edu/~chong/ (S. Chong)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Formulog: Datalog + SMT + FP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aaron Bembenek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Greenberg</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen Chong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Harvard University</institution>
          ,
          <addr-line>Cambridge, MA 02138</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Stevens Institute of Technology</institution>
          ,
          <addr-line>Hoboken, NJ 07030</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Formulog extends Datalog with mechanisms for constructing logical terms and reasoning about them with satisfiability modulo theories (SMT) solving; a first-order functional language aids inspecting and manipulating complex terms. This combination of features makes it possible to write complex SMT-based program analyses in a way close to their formal specification, while being satisfactorily performant thanks to powerful Datalog optimizations and eficient evaluation techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Datalog</kwd>
        <kwd>satisfiability modulo theories (SMT)</kwd>
        <kwd>program analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>satisfiability of a term that is not well sorted under SMT, or passing a bit-vector-valued SMT
term to a function expecting a concrete bit vector. At the same time, it needs to be possible to
construct expressive SMT formulas. We strike this balance through a bimodal type system that
treats terms occurring in SMT formulas more liberally than terms outside of formulas.</p>
      <p>We first present Formulog by example (Section 2); we then discuss our prototype and several
case studies (Section 3).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formulog in a Nutshell</title>
      <p>We present Formulog by example. The canonical “hello world” program for Datalog computes
graph transitive closure. An analogous problem for Formulog is computing non-empty paths
in a proposition graph, i.e., a directed graph where edges are labeled with SMT propositions,
where we consider a path to exist only if the conjunction of the propositions along it (the path
condition) is satisfiable. Consider the example graph in Figure 1, which has edges labeled with
signed comparisons between symbolic 32-bit vectors (like the proposition  &lt; , where  and
 are SMT variables of the 32-bit vector sort). There is no path from node 2 to node 1, as no
sequence of edges leads from node 2 to node 1. There is also no satisfiable path from node 1
to node 4. While a path exists structurally, the conjunction of the edge conditions—the path
condition  &lt;  ∧  &lt;  ∧  &lt; —is not satisafible. However, there is a (non-empty) path
from node 2 to itself, since the path condition  &lt;  ∧  &lt;  ∧  &gt;  +  is satisfiable under
the theory of bit vectors (the addition of  and  can wrap, unlike the theory of mathematical
integers). We break down the Formulog program computing non-empty paths on the example
proposition graph piece-by-piece (Figure 2):</p>
      <p>Lines 1-2. We declare edge as a ternary input (EDB) relation. Formulog is strongly typed:
edge relates two terms of type node (an alias for i32, a 32-bit vector) and a term of type
bool smt, i.e., an SMT proposition.</p>
      <p>Lines 4-7. We enumerate the edge relation for our example graph. The third attribute is
the SMT proposition. SMT terms are complex terms. To build such a term, users can take
advantage of a library of constructors like bv_slt, which represents signed bit vector less-than;
1 type node = i32
2 input edge(node, node, bool smt)
3
4 edge(1, 2, `bv_slt(#x[i32], #y[i32])`).
5 edge(2, 3, `bv_slt(#y[i32], #z[i32])`).
6 edge(3, 4, `bv_slt(#z[i32], #x[i32])`).
7 edge(4, 2, `bv_sgt(#y[i32], bv_add(#z[i32], #x[i32]))`).
8
9 fun mem(x: ’a, xs: ’a list) : bool =
10 match xs with [] =&gt; false | h :: t =&gt; x = h || mem(x, t) end
11
12 output path(node, node, node list, bool smt)
13 path(X, Y, [Y], Phi)
:14 edge(X, Y, Phi),
15 is_sat(Phi) = true.
16 path(X, Z, Y :: Path, `Phi /\ Constraint`)
:17 edge(X, Y, Phi),
18 path(Y, Z, Path, Constraint),
19 mem(Y, Path) = false,
20 is_sat(`Phi /\ Constraint`) = true.
21
22 output path_conditions(node, node, bool smt list)
23 path_conditions(X, Y, Conditions)
:24 path(X, Y, _, _),
25 Conditions = path(X, Y, _, ??).
it constructs a term of type bool smt out of two terms of type bv[] smt (where  is a
positive integer; i32 is shorthand for bv[32]). Quasiquoting (with backticks) enables a type
checking mode that flexibly mixes concrete and SMT terms. Additionally, thanks to the SMT
theory of algebraic data types, arbitrary user-defined data types can be used within formulas.
The notation #[ ] denotes an SMT variable of sort  named .</p>
      <p>
        Lines 9-10. Formulog supports first-order, polymorphic, recursive ML-style functions. Here
the mem function checks list membership. Such functions are syntactic sugar, and could be
translated into (likely less eficient) Datalog, following a translation similar in spirit to that
of Pacak and Erdweg [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; functions are convenient for inspecting and manipulating complex
terms. We use ML shorthand for list constructors ( ::  and [, . . ., ] for terms ).
      </p>
      <p>Line 12. We declare the path output (IDB) relation, relating two nodes, a path (a list of
nodes), and a path condition (an SMT proposition).</p>
      <sec id="sec-2-1">
        <title>Lines 13-15. The first rule defining</title>
        <p>path is the non-recursive case: there is a path between
two nodes if there is an edge between them labeled with a satisfiable formula. Datalog variables
have initial capitals. The function is_sat takes a term of type bool smt and returns a bool
indicating the satisfiability of its argument.</p>
        <p>Lines 16-21. The second rule defining path is the recursive case; it makes use of both the
user-defined function mem and the built-in function is_sat.</p>
        <p>
          Lines 22-26. We declare a relation path_conditions relating two nodes and a list of
the path conditions for paths between them. This relation’s sole rule takes advantage of
Formulog’s flexible approach to stratified aggregation: the notation path(X, Y, _, ??)
treats the relation path as a function that takes two arguments (grouping variables X and Y)
and returns a list of all the elements in the fourth column—i.e., the path conditions (the term ??
is a wildcard). In addition to stratified aggregation, Formulog supports stratified negation.
This combination of features—and Formulog’s take on them—distinguishes Formulog from other
systems combining logic programming and constraint solving, like Calypso [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and constraint
logic programming [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We have formalized Formulog and proven that its type system is sound
with respect to a small-step operational semantics: evaluation never “goes wrong,” whether in
Datalog, ML, or SMT.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Prototype and Case Studies</title>
      <p>
        We implemented a prototype of Formulog in ∼ 24K lines of Java.1 The prototype is designed as
a relatively standard Datalog engine augmented with an ML-style interpreter that evaluates the
functional code and discharges SMT queries to external SMT solvers; it uses caching to take
advantage of incremental SMT solving [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It supports optimizations like the magic set
transformation (which can be applied on a relation-by-relation basis) and automatic parallelization.
We developed three substantial case studies with Formulog to demonstrate (a) that complex
static analysis specifications encode naturally into Formulog, and (b) that Datalog optimizations
can help the resulting programs achieve satisfactory performance (at times besting previously
published reference implementations).2
      </p>
      <p>
        Refinement Type Checker (1.2K LOC). This type checker uses SMT solving to prove
subtyping between refinement types, which requires checking logical validity. Our implementation
translates the original formalism [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] almost line-by-line; our encoding exposed an important
bug in the formalism. Thanks to automatic parallelization, it scales better than the reference
implementation published with the original formalism.
      </p>
      <p>
        Java Pointer Analysis (1.5K LOC). SMT solving helps statically resolve the memory
locations Java variables might point to; this is a direct translation of the formal specification of Feng
et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (it uncovered bugs in the specification). It is slower than their implementation in Java,
but much smaller (10% of the size), and thus might be appropriate as a prototype, as well as a
playground for testing new features (like parallelizing the analysis or making it goal-directed).
      </p>
      <sec id="sec-3-1">
        <title>1Available at https://github.com/HarvardPL/formulog. 2Available at https://zenodo.org/record/4039122.</title>
        <p>
          LLVM Symbolic Evaluator (1K LOC). This tool evaluates a fragment of LLVM bitcode
on symbolic inputs, up to a bounded depth. When it reaches a branch conditioned on a symbolic
value, it explores each branch (assuming both are possible given the path so far). By writing
the symbolic executor in Formulog, we can automatically parallelize exploration; by using the
magic set transformation, we can guide the symbolic executor to avoid uninteresting paths.
These optimizations help it compete with industrial-strength CBMC [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and KLEE [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] on
sample problems (with speedups of up to 12× over KLEE).
        </p>
        <p>
          The case studies make heavy use of the ML fragment of Formulog: the ratio of the number of
functions to the number of Horn clauses is 1:4 for the points-to analysis and 3:4 for the other
case studies. The ability to easily integrate functional code with Horn clauses is a key feature
of the Formulog programming experience, and the presence of large amounts of functional
code diferentiates Formulog-based analyses from analyses written in a more standard dialect
of Datalog, like Souflé [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Outlook</title>
      <p>
        We have shown that Formulog’s combination of Datalog, SMT solving, and first-order functional
programming makes it an appropriate language for coding up complex SMT-based program
analyses. Even though our prototype Formulog runtime is not heavily optimized, it achieves
satisfactory performance on our case studies, even in comparison to heavily optimized systems.
In the future, we hope to improve its performance via compilation, à la Souflé [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>We also intend to expand Formulog’s applications beyond being a language for writing
program analyses. One direction is to use Formulog as an intermediate verification language
(i.e., an analysis produces Formulog code, instead of being written in Formulog). In particular,
we are currently exploring the possibility of symbolically executing Datalog programs by
translating them to Formulog programs. Furthermore, while we have focused on program
analysis applications because we are most familiar with them, we are hopeful that Formulog
will prove useful in other domains, such as knowledge representation and reasoning.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This material is based upon work supported by the Defense Advanced Research Projects Agency
(DARPA) under Contract No. FA8750-19-C-0004. Any opinions, findings and conclusions or
recommendations expressed in this material are those of the author(s) and do not necessarily
reflect the views of the Defense Advanced Research Projects Agency (DARPA).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bembenek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chong</surname>
          </string-name>
          ,
          <article-title>Formulog: Datalog for SMT-based static analysis</article-title>
          ,
          <source>Proc. ACM Program. Languages</source>
          <volume>4</volume>
          (
          <year>2020</year>
          )
          <volume>141</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>141</lpage>
          :
          <fpage>31</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Whaley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <article-title>Cloning-based context-sensitive pointer alias analysis using binary decision diagrams</article-title>
          ,
          <source>in: Proc. ACM SIGPLAN 2004 Conf. Program. Lang. Des. and Implementation</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Grech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jurisevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Brent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Scholz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Smaragdakis</surname>
          </string-name>
          , Madmax:
          <article-title>Surviving out-of-gas conditions in Ethereum smart contracts</article-title>
          ,
          <source>Proc. ACM Program. Languages</source>
          <volume>2</volume>
          (
          <year>2018</year>
          )
          <volume>116</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>116</lpage>
          :
          <fpage>27</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Flores-Montoya</surname>
          </string-name>
          , E. Schulte,
          <article-title>Datalog disassembly</article-title>
          ,
          <source>in: 29th USENIX Secur. Symp.</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1075</fpage>
          -
          <lpage>1092</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pacak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Erdweg</surname>
          </string-name>
          ,
          <article-title>Functional programming with Datalog</article-title>
          ,
          <source>in: Proc. 36th Eur. Conf. Object-Oriented Program.</source>
          ,
          <year>2022</year>
          , pp.
          <volume>7</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          :
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aiken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bugrara</surname>
          </string-name>
          , I. Dillig,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dillig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hackett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hawkins</surname>
          </string-name>
          ,
          <article-title>An overview of the Saturn project</article-title>
          ,
          <source>in: Proc. 7th ACM SIGPLAN-SIGSOFT Workshop Program Anal. Softw. Tools and Eng</source>
          .,
          <year>2007</year>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jafar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Lassez</surname>
          </string-name>
          ,
          <article-title>Constraint logic programming</article-title>
          ,
          <source>in: Proc. 14th ACM SIGACT-SIGPLAN Symp. Princ. Program. Languages</source>
          ,
          <year>1987</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bembenek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ballantyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Amin</surname>
          </string-name>
          ,
          <article-title>Datalog-based systems can use incremental SMT solving</article-title>
          ,
          <source>in: Proc. 36th Int. Conf. Log. Program.</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Bierman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hriţcu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Langworthy</surname>
          </string-name>
          ,
          <article-title>Semantic subtyping with an SMT solver</article-title>
          ,
          <source>J. Functional Program</source>
          .
          <volume>22</volume>
          (
          <year>2012</year>
          )
          <fpage>31</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , I. Dillig, T. Dillig,
          <article-title>Bottom-up context-sensitive pointer analysis for java</article-title>
          ,
          <source>in: Proc. 13th Asian Symp. Program. Languages and Syst.</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>465</fpage>
          -
          <lpage>484</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kroening</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lerda</surname>
          </string-name>
          ,
          <article-title>A tool for checking ANSI-C programs</article-title>
          ,
          <source>in: Proc. 10th Int. Conf. Tools and Algorithms Construction and Anal. Syst</source>
          .,
          <year>2004</year>
          , pp.
          <fpage>168</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cadar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dunbar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Engler</surname>
          </string-name>
          ,
          <article-title>KLEE: Unassisted and automatic generation of highcoverage tests for complex systems programs</article-title>
          ,
          <source>in: Proc. 8th USENIX Conf. Operating Syst. Des. and Implementation</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Scholz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Subotić</surname>
          </string-name>
          , Souflé:
          <article-title>On synthesis of program analyzers</article-title>
          ,
          <source>in: Proc. 28th Int. Conf. Comput. Aided Verification</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>422</fpage>
          -
          <lpage>430</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>