=Paper=
{{Paper
|id=Vol-3203/short2
|storemode=property
|title=Formulog: Datalog + SMT + FP
|pdfUrl=https://ceur-ws.org/Vol-3203/short2.pdf
|volume=Vol-3203
|authors=Aaron Bembenek,Michael Greenberg,Stephen Chong
|dblpUrl=https://dblp.org/rec/conf/datalog/Bembenek0C22
}}
==Formulog: Datalog + SMT + FP==
Formulog: Datalog + SMT + FP Aaron Bembenek1 , Michael Greenberg2 and Stephen Chong1 1 Harvard University, Cambridge, MA 02138, USA 2 Stevens Institute of Technology, Hoboken, NJ 07030, USA Abstract Formulog extends Datalog with mechanisms for constructing logical terms and reasoning about them with satisfiability modulo theories (SMT) solving; a first-order functional language aids inspecting and manipulating complex terms. This combination of features makes it possible to write complex SMT-based program analyses in a way close to their formal specification, while being satisfactorily performant thanks to powerful Datalog optimizations and efficient evaluation techniques. Keywords Datalog, satisfiability modulo theories (SMT), program analysis 1. Introduction Formulog [1] is a domain-specific language for writing program analyses that use satisfiability modulo theories (SMT) solving. Formulog extends Datalog with a first-order functional language, ergonomic syntax for writing SMT terms, and a robust interface to an SMT solver. Datalog has already been shown to be a useful language for writing certain static analyses [2, 3, 4]; however, many static analyses cannot be easily written in existing Datalog variants, as they require SMT solvers to reason about logical formulas containing constructs like arrays and bit vectors. Formulog fills this gap and meets several key design objectives. First, Formulog enables executable specifications. Its combination of features enables program- ming SMT-based static analyses close to their mathematical specifications, which typically consist of inference rules and pure functions, either of which might need to check the satisfia- bility or validity of logical terms. Second, Formulog lets Datalog be Datalog. Despite the additional language features, Formulog programs benefit from Datalog optimizations and evaluation techniques (i.e., semi-naive evalu- ation). This makes it possible to write analyses at the level of a mathematical specification but still get solid performance. Third, Formulog’s parts interact in a safe and expressive way. The interaction between Datalog, functions, and SMT solving needs to be safe, never raising type errors by, e.g., querying the Datalog 2.0 2022: 4th International Workshop on the Resurgence of Datalog in Academia and Industry, September 05, 2022, Genova - Nervi, Italy $ bembenek@g.harvard.edu (A. Bembenek); michael@greenberg.science (M. Greenberg); chong@seas.harvard.edu (S. Chong) https://people.seas.harvard.edu/~bembenek/ (A. Bembenek); https://mgree.github.io/ (M. Greenberg); https://people.seas.harvard.edu/~chong/ (S. Chong) 0000-0002-3677-701X (A. Bembenek); 0000-0003-0014-7670 (M. Greenberg); 0000-0002-6734-5383 (S. Chong) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 48 Aaron Bembenek et al. CEUR Workshop Proceedings 48–53 1 4 y>z+x xare signed comparisons between 32-bit vectors. satisfiability of a term that is not well sorted under SMT, or passing a bit-vector-valued SMT term to a function expecting a concrete bit vector. At the same time, it needs to be possible to construct expressive SMT formulas. We strike this balance through a bimodal type system that treats terms occurring in SMT formulas more liberally than terms outside of formulas. We first present Formulog by example (Section 2); we then discuss our prototype and several case studies (Section 3). 2. Formulog in a Nutshell We present Formulog by example. The canonical “hello world” program for Datalog computes graph transitive closure. An analogous problem for Formulog is computing non-empty paths in a proposition graph, i.e., a directed graph where edges are labeled with SMT propositions, where we consider a path to exist only if the conjunction of the propositions along it (the path condition) is satisfiable. Consider the example graph in Figure 1, which has edges labeled with signed comparisons between symbolic 32-bit vectors (like the proposition 𝑥 < 𝑦, where 𝑥 and 𝑦 are SMT variables of the 32-bit vector sort). There is no path from node 2 to node 1, as no sequence of edges leads from node 2 to node 1. There is also no satisfiable path from node 1 to node 4. While a path exists structurally, the conjunction of the edge conditions—the path condition 𝑥 < 𝑦 ∧ 𝑦 < 𝑧 ∧ 𝑧 < 𝑥—is not satisfiable. However, there is a (non-empty) path from node 2 to itself, since the path condition 𝑦 < 𝑧 ∧ 𝑧 < 𝑥 ∧ 𝑦 > 𝑧 + 𝑥 is satisfiable under the theory of bit vectors (the addition of 𝑥 and 𝑦 can wrap, unlike the theory of mathematical integers). We break down the Formulog program computing non-empty paths on the example proposition graph piece-by-piece (Figure 2): Lines 1-2. We declare edge as a ternary input (EDB) relation. Formulog is strongly typed: edge relates two terms of type node (an alias for i32, a 32-bit vector) and a term of type bool smt, i.e., an SMT proposition. Lines 4-7. We enumerate the edge relation for our example graph. The third attribute is the SMT proposition. SMT terms are complex terms. To build such a term, users can take advantage of a library of constructors like bv_slt, which represents signed bit vector less-than; 49 Aaron Bembenek et al. CEUR Workshop Proceedings 48–53 1 type node = i32 2 input edge(node, node, bool smt) 3 4 edge(1, 2, `bv_slt(#x[i32], #y[i32])`). 5 edge(2, 3, `bv_slt(#y[i32], #z[i32])`). 6 edge(3, 4, `bv_slt(#z[i32], #x[i32])`). 7 edge(4, 2, `bv_sgt(#y[i32], bv_add(#z[i32], #x[i32]))`). 8 9 fun mem(x: ’a, xs: ’a list) : bool = 10 match xs with [] => false | h :: t => x = h || mem(x, t) end 11 12 output path(node, node, node list, bool smt) 13 path(X, Y, [Y], Phi) :- 14 edge(X, Y, Phi), 15 is_sat(Phi) = true. 16 path(X, Z, Y :: Path, `Phi /\ Constraint`) :- 17 edge(X, Y, Phi), 18 path(Y, Z, Path, Constraint), 19 mem(Y, Path) = false, 20 is_sat(`Phi /\ Constraint`) = true. 21 22 output path_conditions(node, node, bool smt list) 23 path_conditions(X, Y, Conditions) :- 24 path(X, Y, _, _), 25 Conditions = path(X, Y, _, ??). Figure 2: This Formulog program computes non-empty paths (the path relation) through our example proposition graph (Figure 1), ignoring paths where the accumulated condition along the path is not satisfiable. It also aggregates all the path conditions for each pair of nodes that have a path between them (the path_conditions relation). it constructs a term of type bool smt out of two terms of type bv[𝑘 ] smt (where 𝑘 is a positive integer; i32 is shorthand for bv[32]). Quasiquoting (with backticks) enables a type checking mode that flexibly mixes concrete and SMT terms. Additionally, thanks to the SMT theory of algebraic data types, arbitrary user-defined data types can be used within formulas. The notation #𝑥[𝜏 ] denotes an SMT variable of sort 𝜏 named 𝑥. Lines 9-10. Formulog supports first-order, polymorphic, recursive ML-style functions. Here the mem function checks list membership. Such functions are syntactic sugar, and could be translated into (likely less efficient) Datalog, following a translation similar in spirit to that of Pacak and Erdweg [5]; functions are convenient for inspecting and manipulating complex terms. We use ML shorthand for list constructors (𝑡 :: 𝑡 and [𝑡, . . ., 𝑡] for terms 𝑡). Line 12. We declare the path output (IDB) relation, relating two nodes, a path (a list of nodes), and a path condition (an SMT proposition). Lines 13-15. The first rule defining path is the non-recursive case: there is a path between 50 Aaron Bembenek et al. CEUR Workshop Proceedings 48–53 two nodes if there is an edge between them labeled with a satisfiable formula. Datalog variables have initial capitals. The function is_sat takes a term of type bool smt and returns a bool indicating the satisfiability of its argument. Lines 16-21. The second rule defining path is the recursive case; it makes use of both the user-defined function mem and the built-in function is_sat. Lines 22-26. We declare a relation path_conditions relating two nodes and a list of the path conditions for paths between them. This relation’s sole rule takes advantage of Formulog’s flexible approach to stratified aggregation: the notation path(X, Y, _, ??) treats the relation path as a function that takes two arguments (grouping variables X and Y) and returns a list of all the elements in the fourth column—i.e., the path conditions (the term ?? is a wildcard). In addition to stratified aggregation, Formulog supports stratified negation. This combination of features—and Formulog’s take on them—distinguishes Formulog from other systems combining logic programming and constraint solving, like Calypso [6] and constraint logic programming [7]. We have formalized Formulog and proven that its type system is sound with respect to a small-step operational semantics: evaluation never “goes wrong,” whether in Datalog, ML, or SMT. 3. Prototype and Case Studies We implemented a prototype of Formulog in ∼24K lines of Java.1 The prototype is designed as a relatively standard Datalog engine augmented with an ML-style interpreter that evaluates the functional code and discharges SMT queries to external SMT solvers; it uses caching to take advantage of incremental SMT solving [8]. It supports optimizations like the magic set trans- formation (which can be applied on a relation-by-relation basis) and automatic parallelization. We developed three substantial case studies with Formulog to demonstrate (a) that complex static analysis specifications encode naturally into Formulog, and (b) that Datalog optimizations can help the resulting programs achieve satisfactory performance (at times besting previously published reference implementations).2 Refinement Type Checker (1.2K LOC). This type checker uses SMT solving to prove sub- typing between refinement types, which requires checking logical validity. Our implementation translates the original formalism [9] almost line-by-line; our encoding exposed an important bug in the formalism. Thanks to automatic parallelization, it scales better than the reference implementation published with the original formalism. Java Pointer Analysis (1.5K LOC). SMT solving helps statically resolve the memory loca- tions Java variables might point to; this is a direct translation of the formal specification of Feng et al. [10] (it uncovered bugs in the specification). It is slower than their implementation in Java, but much smaller (10% of the size), and thus might be appropriate as a prototype, as well as a playground for testing new features (like parallelizing the analysis or making it goal-directed). 1 Available at https://github.com/HarvardPL/formulog. 2 Available at https://zenodo.org/record/4039122. 51 Aaron Bembenek et al. CEUR Workshop Proceedings 48–53 LLVM Symbolic Evaluator (1K LOC). This tool evaluates a fragment of LLVM bitcode on symbolic inputs, up to a bounded depth. When it reaches a branch conditioned on a symbolic value, it explores each branch (assuming both are possible given the path so far). By writing the symbolic executor in Formulog, we can automatically parallelize exploration; by using the magic set transformation, we can guide the symbolic executor to avoid uninteresting paths. These optimizations help it compete with industrial-strength CBMC [11] and KLEE [12] on sample problems (with speedups of up to 12× over KLEE). The case studies make heavy use of the ML fragment of Formulog: the ratio of the number of functions to the number of Horn clauses is 1:4 for the points-to analysis and 3:4 for the other case studies. The ability to easily integrate functional code with Horn clauses is a key feature of the Formulog programming experience, and the presence of large amounts of functional code differentiates Formulog-based analyses from analyses written in a more standard dialect of Datalog, like Soufflé [13]. 4. Outlook We have shown that Formulog’s combination of Datalog, SMT solving, and first-order functional programming makes it an appropriate language for coding up complex SMT-based program analyses. Even though our prototype Formulog runtime is not heavily optimized, it achieves satisfactory performance on our case studies, even in comparison to heavily optimized systems. In the future, we hope to improve its performance via compilation, à la Soufflé [13]. We also intend to expand Formulog’s applications beyond being a language for writing program analyses. One direction is to use Formulog as an intermediate verification language (i.e., an analysis produces Formulog code, instead of being written in Formulog). In particular, we are currently exploring the possibility of symbolically executing Datalog programs by translating them to Formulog programs. Furthermore, while we have focused on program analysis applications because we are most familiar with them, we are hopeful that Formulog will prove useful in other domains, such as knowledge representation and reasoning. Acknowledgments This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-19-C-0004. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). References [1] A. Bembenek, M. Greenberg, S. Chong, Formulog: Datalog for SMT-based static analysis, Proc. ACM Program. Languages 4 (2020) 141:1–141:31. 52 Aaron Bembenek et al. CEUR Workshop Proceedings 48–53 [2] J. Whaley, M. S. Lam, Cloning-based context-sensitive pointer alias analysis using bi- nary decision diagrams, in: Proc. ACM SIGPLAN 2004 Conf. Program. Lang. Des. and Implementation, 2004, pp. 131–144. [3] N. Grech, M. Kong, A. Jurisevic, L. Brent, B. Scholz, Y. Smaragdakis, Madmax: Surviving out-of-gas conditions in Ethereum smart contracts, Proc. ACM Program. Languages 2 (2018) 116:1–116:27. [4] A. Flores-Montoya, E. Schulte, Datalog disassembly, in: 29th USENIX Secur. Symp., 2020, pp. 1075–1092. [5] A. Pacak, S. Erdweg, Functional programming with Datalog, in: Proc. 36th Eur. Conf. Object-Oriented Program., 2022, pp. 7:1–7:28. [6] A. Aiken, S. Bugrara, I. Dillig, T. Dillig, B. Hackett, P. Hawkins, An overview of the Saturn project, in: Proc. 7th ACM SIGPLAN-SIGSOFT Workshop Program Anal. Softw. Tools and Eng., 2007, pp. 43–48. [7] J. Jaffar, J.-L. Lassez, Constraint logic programming, in: Proc. 14th ACM SIGACT-SIGPLAN Symp. Princ. Program. Languages, 1987, pp. 111–119. [8] A. Bembenek, M. Ballantyne, M. Greenberg, N. Amin, Datalog-based systems can use incremental SMT solving, in: Proc. 36th Int. Conf. Log. Program., 2020. [9] G. M. Bierman, A. D. Gordon, C. Hriţcu, D. Langworthy, Semantic subtyping with an SMT solver, J. Functional Program. 22 (2012) 31–105. [10] Y. Feng, X. Wang, I. Dillig, T. Dillig, Bottom-up context-sensitive pointer analysis for java, in: Proc. 13th Asian Symp. Program. Languages and Syst., 2015, pp. 465–484. [11] E. Clarke, D. Kroening, F. Lerda, A tool for checking ANSI-C programs, in: Proc. 10th Int. Conf. Tools and Algorithms Construction and Anal. Syst., 2004, pp. 168–176. [12] C. Cadar, D. Dunbar, D. Engler, KLEE: Unassisted and automatic generation of high- coverage tests for complex systems programs, in: Proc. 8th USENIX Conf. Operating Syst. Des. and Implementation, 2008, pp. 209–224. [13] H. Jordan, B. Scholz, P. Subotić, Soufflé: On synthesis of program analyzers, in: Proc. 28th Int. Conf. Comput. Aided Verification, 2016, pp. 422–430. 53