A compiler for stratified Datalog programs: preliminary results*

Bernardo Cuteri and Francesco Ricca
DeMaCS, Università della Calabria, Rende (CS), Italy
{cuteri,ricca}@mat.unical.it

* Partially supported by the Italian Ministry for Economic Development (MISE) under project "PIUCultura - Paradigmi Innovativi per l'Utilizzo della Cultura" (n. F/020016/01-02/X27).

Abstract. Deductive databases originated from the confluence of logic programming and databases. The core language of deductive databases is Datalog, which has recently found renewed interest and new applications in several real-world problems. The evaluation of Datalog is traditionally implemented in monolithic systems that are general-purpose in the sense that they are able to process an entire class of programs. In this paper, we follow a different approach: we present a tool that compiles a given (non-ground) Datalog program, possibly with stratified negation, into a problem-specific executable implementation. Preliminary results show the performance benefits that can be obtained by a compilation-based approach.

Keywords: Compilation, Stratified programs, Deductive databases

1 Introduction

Deductive databases allow data to be manipulated declaratively by means of logic programming. The target language of this work is Datalog [4, 12]: a simple, yet flexible logic programming language which is at the core of most deductive database systems. Datalog is sufficiently expressive to model many practical problems, and it has recently found new applications in a variety of emerging domains such as data integration, information extraction, networking, program analysis, security, and cloud computing [6]. The evaluation of Datalog programs is traditionally performed by systems that are general-purpose in the sense that they are able to process an entire class of programs. In this paper, we follow a different approach: we consider a compilation strategy for the evaluation of Datalog programs. Moreover, we consider an extension of Datalog that includes stratified negation and allows the representation of a strictly larger class of problems. This language is also called stratified Datalog.

A Datalog program with stratified negation [4] P is a set of rules of the form Head :- Body, where Body is a conjunction of possibly negated literals, and Head is an atom. A literal is either a positive or a negative atom, while an atom is either a propositional variable or an n-ary predicate p with a list of terms t1, ..., tn. A term is either a constant or a variable. If a variable appears in the head of a rule or in a negative literal, then it must also appear in some positive literal of the same rule (safety), and recursion through negation is not allowed (stratification). Programs are virtually split into two distinct parts: the intensional part (i.e., the set of possibly non-ground rules) that describes the problem, and the extensional part (i.e., the set of ground facts) that represents an instance of the problem. As an example, the following program models the well-known Reachability problem:

   reaches(X,Y) :- edge(X,Y).
   reaches(X,Y) :- edge(X,Z), reaches(Z,Y).

Facts of the form edge(i,j), one for each arc (i,j), model the input graph; for instance, from the facts edge(1,2) and edge(2,3) the program derives reaches(1,2), reaches(2,3), and reaches(1,3). It is customary in the logic programming community to refer to the intensional part as the encoding, and to the extensional part as the instance. In applications, it is common to use the same uniform encoding several times with different instances.
A general-purpose system (often needlessly) processes the same encoding every time a new instance is processed. By compiling the encoding into a specialized procedure, one can wire it inside the evaluation procedure, so that it does not have to be processed again for every instance. Moreover, specialized evaluation strategies can be adopted on a per-rule basis, possibly increasing evaluation performance.

Datalog represents the kernel sub-language of Answer Set Programming (ASP) [2]. Notably, the grounder modules of ASP systems are based on algorithms for evaluating stratified Datalog programs [7], and, basically, any monolithic ASP system is also an efficient engine for this class of programs.

The idea of compiling Datalog programs is not new [1, 4, 8]; however, a new complexity-wise optimal compilation-based approach has recently been proposed in [11]. There, Liu and Stoller describe a method for transforming Datalog programs (with rules having at most two literals in the body) into efficient specialized implementations. We extend the approach of [11] to handle rules with bodies of any size, stratified negation, and inequalities. Moreover, we implemented a concrete system entirely developed in C++: in our prototype, both the compiler and its output (the compiled logic programs) are written in C++. To assess the potential of our tool, we performed an experimental analysis in which we compare our implementation against existing general-purpose systems capable of handling stratified logic programs.

2 Compilation of stratified logic programs

In this section, we first overview the evaluation strategy adopted by our system, and then we present the compilation strategy by means of an example.

Evaluation strategy. Stratified logic programs are evaluated in our approach by following the classical bottom-up schema [12]. Basically, rules are applied starting from the known facts to deduce new facts until no new information can be derived. The notion of dependency graph of the input program is used both to improve the efficiency of the evaluation and to determine a correct order of evaluation of rules in the presence of negation. In more detail, given a program P, the dependency graph DG = <V, E> of P has a vertex p ∈ V for each intensional predicate p, and a directed edge (b, h) ∈ E whenever b occurs in the body and h in the head of a rule of P. An edge is labeled as negative if the corresponding body literal is negated. The dependency graph is then partitioned into strongly connected components (SCCs); an SCC is a maximal subset of the vertices such that every vertex is reachable from every other vertex. We say that a rule r ∈ P defines a predicate p if p appears in the head of r. For each SCC C of DG, the set of rules defining the predicates in C is called the module of C. The dependency graph yields a topological order [C1], ..., [Cn] over the SCCs: for each pair (Ci, Cj) with i < j, there is no path in the dependency graph from Cj to Ci. Program modules corresponding to components are evaluated by following such a topological order. If two predicates do not belong to the same SCC, they do not mutually depend on each other, and thus their defining rules can be evaluated separately (possibly increasing evaluation performance). Since, by definition, stratified programs admit no negative edge inside any SCC (i.e., no loop can contain a negative edge), an evaluation performed according to a topological order ensures a sound computation of the semantics of programs with negation. Since the rules of a program module may need to be processed several times, the evaluation of modules is optimized by employing the semi-naïve evaluation technique [12]: at each iteration n, only the information newly derived during iteration n − 1 is used.
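To make the dependency-graph analysis concrete, the following is a minimal, self-contained C++ sketch of one possible implementation; it is illustrative only and is not taken from our compiler, and all identifiers (DependencyGraph, Tarjan, isStratified, ...) are hypothetical. Assuming predicates are numbered consecutively, the sketch builds the labeled dependency graph, computes its SCCs with Tarjan's algorithm, orders the components so that no component depends on a later one, and rejects programs in which a negative edge stays inside a single SCC.

// Illustrative sketch only (hypothetical identifiers, not the compiler's code).
// Vertices are intensional predicates, numbered 0..n-1; an edge (b, h) is added
// for every rule with predicate b in the body and h in the head, and it is
// flagged negative when the body literal is negated.
#include <algorithm>
#include <utility>
#include <vector>

struct Edge { int to; bool negative; };

struct DependencyGraph {
    std::vector<std::vector<Edge>> adj;     // one adjacency list per predicate
    explicit DependencyGraph(int n) : adj(n) {}
    void addEdge(int body, int head, bool neg) { adj[body].push_back({head, neg}); }
};

struct SccResult {
    std::vector<int> sccOf;                 // predicate -> component index
    std::vector<std::vector<int>> comps;    // components in evaluation order
};

// Tarjan's SCC algorithm. Tarjan emits components in reverse topological order
// of the condensation, so reversing the emission order yields an evaluation
// order in which a component never depends on a later one.
class Tarjan {
public:
    explicit Tarjan(const DependencyGraph& g)
        : g(g), idx(g.adj.size(), -1), low(g.adj.size(), 0), onStack(g.adj.size(), false) {
        for (int v = 0; v < static_cast<int>(g.adj.size()); ++v)
            if (idx[v] == -1) dfs(v);
        std::reverse(res.comps.begin(), res.comps.end());
        res.sccOf.assign(g.adj.size(), -1);
        for (int c = 0; c < static_cast<int>(res.comps.size()); ++c)
            for (int v : res.comps[c]) res.sccOf[v] = c;
    }
    SccResult result() const { return res; }

private:
    void dfs(int v) {
        idx[v] = low[v] = counter++;
        stack.push_back(v); onStack[v] = true;
        for (const Edge& e : g.adj[v]) {
            if (idx[e.to] == -1) { dfs(e.to); low[v] = std::min(low[v], low[e.to]); }
            else if (onStack[e.to]) low[v] = std::min(low[v], idx[e.to]);
        }
        if (low[v] == idx[v]) {             // v is the root of a component
            std::vector<int> comp;
            int w;
            do { w = stack.back(); stack.pop_back(); onStack[w] = false; comp.push_back(w); } while (w != v);
            res.comps.push_back(std::move(comp));
        }
    }
    const DependencyGraph& g;
    std::vector<int> idx, low;
    std::vector<bool> onStack;
    std::vector<int> stack;
    int counter = 0;
    SccResult res;
};

// Stratification check: no negative edge may connect two predicates of the same SCC.
bool isStratified(const DependencyGraph& g, const std::vector<int>& sccOf) {
    for (int b = 0; b < static_cast<int>(g.adj.size()); ++b)
        for (const Edge& e : g.adj[b])
            if (e.negative && sccOf[b] == sccOf[e.to]) return false;
    return true;
}

For the Reachability program extended with negation used in the next example, the two intensional predicates could be numbered reaches = 0 and noReach = 1; the analysis would then return the singleton components {reaches} and {noReach}, with {reaches} first, which is exactly the evaluation order assumed below.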
Compilation by example. The strategy described above is quite standard, and it is employed also by general-purpose implementations. In our approach, the same strategy is used to produce a specialized implementation. In the following, we exemplify how the compilation of a program module is done in our system. Consider once more the program P_GR that models the Reachability problem, with an extra rule deriving all vertices that are not reachable from vertex 2:

   (1) reaches(X,Y) :- edge(X,Y).
   (2) reaches(X,Y) :- edge(X,Z), reaches(Z,Y).
   (3) noReach(Y) :- vertex(Y), not reaches(2,Y).

The dependency graph analysis ensures that rule (3) is evaluated after rules (1) and (2). The core part of the compilation process applied to P_GR is presented, in pseudo-code, as Algorithm 1.

Algorithm 1 Reachability compiled program
 1: ...                                  {data structure initialization and reading of facts}
 2: while W_edge ≠ ∅ do                  {evaluation of rule (1)}
 3:    edge = W_edge.pop()
 4:    R_edge.insert(edge)
 5:    EdgeZMap.insertKeyValue({edge[1]}, {edge[0]})
 6:    W_reaches.insert({edge[0], edge[1]})
 7: end while
 8: while W_reaches ≠ ∅ do               {evaluation of rule (2)}
 9:    reaches = W_reaches.pop()
10:    R_reaches.insert(reaches)
11:    for X : EdgeZMap.at(reaches[0]) do
12:       W_reaches.insert({X, reaches[1]})
13:    end for
14: end while
15: while W_vertex ≠ ∅ do                {evaluation of rule (3)}
16:    vertex = W_vertex.pop()
17:    R_vertex.insert(vertex)
18:    if not R_reaches.contains({2, vertex[0]}) then
19:       R_noReach.insert({vertex[0]})
20:    end if
21: end while
22: Model = R_edge ∪ R_reaches ∪ R_vertex ∪ R_noReach

There, W_edge, W_reaches, and W_vertex denote working sets, while R_edge, R_reaches, R_vertex, and R_noReach denote result sets (cf. [11]). Working sets and result sets are implemented by efficient indexed data structures that provide associative access and efficient insertion and deletion of ground atoms. In the example, we first loop on the working set of predicate edge, which was previously loaded with the input facts. At every iteration, we retrieve and remove an element from the working set (pop) and insert it into the result set. At line 5, we insert the edge atom into an auxiliary map (note that the attributes of an atom variable, such as edge, are accessed by position, the first attribute having index 0). Auxiliary maps are used to index the ground atoms of predicates that appear in the body of some rule and might be involved in join operations with other predicates of the same body. In our example, rule (2) induces an indexing of predicate edge on attribute Z. At line 6, we derive an instance of reaches because of rule (1) and insert it into its working set. After the edge loop completes, we loop on W_reaches: once again, we retrieve and remove an atom from the working set and insert it into the result set. At line 11, we loop over the joining edge atoms retrieved from the auxiliary map EdgeZMap and, inside the loop, we generate the head ground atom and insert it into W_reaches. In the last loop (line 15), we iterate over the vertex ground atoms and add to R_noReach all vertices that are not reached from vertex 2, as prescribed by rule (3).
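To give a more concrete flavour of what such generated code can look like, the following is a minimal, self-contained C++ sketch of a compiled program for P_GR. It is an illustration of the scheme of Algorithm 1 only and does not coincide with the code actually emitted by our compiler: plain STL containers stand in for the specialized working sets, result sets, and auxiliary maps, the input facts are hard-coded, all identifiers are hypothetical, and the duplicate check that makes the fixpoint loop terminate on cyclic graphs is made explicit.

// Simplified, illustrative sketch of a compiled program for P_GR
// (hypothetical identifiers, not the compiler's real output).
// Tuples are fixed-arity arrays; STL containers stand in for the
// specialized working/result sets and auxiliary maps described above.
#include <array>
#include <deque>
#include <iostream>
#include <map>
#include <set>
#include <vector>

using Tuple1 = std::array<int, 1>;
using Tuple2 = std::array<int, 2>;

int main() {
    // Working sets (atoms still to be processed) loaded with example facts,
    // and result sets (atoms derived so far).
    std::deque<Tuple2> W_edge   = {Tuple2{1, 2}, Tuple2{2, 3}, Tuple2{3, 2}};  // edge(1,2), edge(2,3), edge(3,2)
    std::deque<Tuple1> W_vertex = {Tuple1{1}, Tuple1{2}, Tuple1{3}};           // vertex(1..3)
    std::deque<Tuple2> W_reaches;
    std::set<Tuple2> R_edge, R_reaches;
    std::set<Tuple1> R_vertex, R_noReach;

    // Auxiliary map induced by rule (2): index edge on its second argument Z,
    // so that the edge(X,Z) atoms joining a given reaches(Z,Y) can be enumerated.
    std::map<int, std::vector<int>> edgeZMap;

    // Rule (1): reaches(X,Y) :- edge(X,Y).
    while (!W_edge.empty()) {
        Tuple2 edge = W_edge.front(); W_edge.pop_front();
        if (!R_edge.insert(edge).second) continue;           // already processed
        edgeZMap[edge[1]].push_back(edge[0]);
        W_reaches.push_back({edge[0], edge[1]});
    }

    // Rule (2): reaches(X,Y) :- edge(X,Z), reaches(Z,Y).
    while (!W_reaches.empty()) {
        Tuple2 reaches = W_reaches.front(); W_reaches.pop_front();
        if (!R_reaches.insert(reaches).second) continue;     // skip duplicates: ensures termination
        auto it = edgeZMap.find(reaches[0]);
        if (it == edgeZMap.end()) continue;
        for (int X : it->second)
            W_reaches.push_back({X, reaches[1]});
    }

    // Rule (3): noReach(Y) :- vertex(Y), not reaches(2,Y).
    // The result set R_reaches is complete at this point and is used directly.
    while (!W_vertex.empty()) {
        Tuple1 vertex = W_vertex.front(); W_vertex.pop_front();
        if (!R_vertex.insert(vertex).second) continue;
        if (!R_reaches.count({2, vertex[0]}))
            R_noReach.insert({vertex[0]});
    }

    // The model is the union of all result sets; here we only print noReach.
    for (const Tuple1& t : R_noReach)
        std::cout << "noReach(" << t[0] << ")\n";
}

On the hard-coded example instance (arcs (1,2), (2,3), (3,2) and vertices 1..3), the sketch prints noReach(1), since vertex 1 is the only vertex not reachable from vertex 2.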
Note that we do not need an extra auxiliary map for reaches, because we can use R_reaches directly. Finally, the model of the program is given by the union of all result sets.

We remark that there are tasks involved in the evaluation of programs that are carried out by the compiler and do not appear in the generated code, for instance the computation of the SCCs and the identification of the data structures needed during the evaluation. The product of the compilation does not have to perform such tasks because they are done once and for all at compilation time, whereas a general-purpose solver has to deal with them once for every instance. At a lower level, another advantage comes from the possibility of declaring data structures as program variables and exploiting direct access (reducing pointer dereferencing), which results in faster C++ code.

Datalog Compiler. The compiler is written in C++, and its output is C++ code that is then compiled into an executable program using standard C++ compilers. Figure 1 shows a high-level architecture of Datalog compilation and evaluation as designed in our system. The resulting executable program can be run on any instance of the compiled logic program.

Fig. 1. Datalog compiler architecture: the encoding (problem description) is compiled once into a compiled program, which is then run on problem instances (Facts 1, ..., Facts n) to produce the corresponding problem solutions (Model 1, ..., Model n).

3 Experimental analysis

To assess the potential of our tool, we performed an experimental analysis in which we compared our system against existing general-purpose systems that can evaluate stratified logic programs by using bottom-up strategies, namely Clingo [5], DLV [9], and I-DLV [3]. Clingo and DLV are two well-known ASP solvers, while I-DLV is a recently-introduced grounder. Even though the target language of such systems is ASP, they can also be considered rather efficient implementations for stratified logic programs. The experimental analysis has been carried out on benchmarks from OpenRuleBench [10], a well-known suite for rule engines. The execution times reported for our system include compilation times, to provide a fair comparison with the general-purpose tools.

Table 1. Average execution times in seconds and number of solved instances, grouped by solver and benchmark domain.

                              compiled       Clingo         DLV          I-DLV
  Benchmark            Inst.  Time  Sol.   Time  Sol.   Time  Sol.   Time  Sol.
  Large Joins            23    272    22    418    21    263    21    258    21
  Recursion              21    139    21    331    20    444    21    332    21
  Stratified negation     5    109     5    154     5    255     5     88     5
  Totals                 49    197    48    352    46    343    47    273    47

Fig. 2. Overall performance: cactus plot of the execution times (solved instances vs. time in seconds).

Results are summarized in Table 1 and in Figure 2. Looking at the table, we observe that our tool solves more instances than any alternative overall, and is the fastest on average on the Large Joins and Recursion sets. On Stratified negation, our tool is on par with the others in terms of solved instances, but is slower on average than I-DLV. This might be explained by the fact that there are still some optimizations, such as join reordering, that general-purpose systems adopt but that we have not yet included in our prototype implementation. An aggregate view of the results is given in Figure 2, which shows that our tool performs well overall. For completeness, we report that compilation required 2.6 s on average (over all benchmarks).
This performance is acceptable given that compilation is intended as a one-time process in our approach.

4 Conclusions and future works

In this paper, we presented a new compiler for stratified Datalog programs; the tool can be downloaded from http://goo.gl/XhZXWh. The prototype takes as input a stratified Datalog program and generates a specialized implementation for that program. Experimental results are very encouraging. Ongoing work concerns the improvement of the data structures generated by our compiler and the inclusion of other known optimization techniques used by ASP grounders [7]. Since the grounder modules of ASP systems are based on algorithms for evaluating stratified Datalog programs, and we are obtaining promising results for Datalog, we expect to obtain similar improvements by applying compilation to the grounding of ASP programs. As future work, we aim at applying compilation-based techniques to the instantiation of ASP programs.

References

1. Arni, F., Ong, K., Tsur, S., Wang, H., Zaniolo, C.: The deductive database system LDL++. TPLP 3(1), 61–94 (2003)
2. Brewka, G., Eiter, T., Truszczynski, M.: Answer set programming at a glance. Commun. ACM 54(12), 92–103 (2011)
3. Calimeri, F., Fuscà, D., Perri, S., Zangari, J.: I-DLV: the new intelligent grounder of DLV. In: AI*IA. LNCS, vol. 10037, pp. 192–207. Springer (2016)
4. Ceri, S., Gottlob, G., Tanca, L.: Logic Programming and Databases. Springer (1990)
5. Gebser, M., Kaminski, R., Kaufmann, B., Ostrowski, M., Schaub, T., Wanko, P.: Theory solving made easy with clingo 5. In: ICLP 2016 TCs. pp. 2:1–2:15 (2016)
6. Huang, S.S., Green, T.J., Loo, B.T.: Datalog and emerging applications: an interactive tutorial. In: Proceedings of SIGMOD 2011. pp. 1213–1216. ACM (2011)
7. Kaufmann, B., Leone, N., Perri, S., Schaub, T.: Grounding and solving in answer set programming. AI Magazine 37(3), 25–32 (2016)
8. Kifer, M., Lozinskii, E.L.: On compile-time query optimization in deductive databases by means of static filtering. ACM TODS 15(3), 385–426 (1990)
9. Leone, N., Pfeifer, G., Faber, W., Eiter, T., Gottlob, G., Perri, S., Scarcello, F.: The DLV system for knowledge representation and reasoning. ACM TOCL 7(3), 499–562 (2006)
10. Liang, S., Fodor, P., Wan, H., Kifer, M.: OpenRuleBench: an analysis of the performance of rule engines. In: Proceedings of WWW 2009. pp. 601–610. ACM (2009)
11. Liu, Y.A., Stoller, S.D.: From Datalog rules to efficient programs with time and space guarantees. ACM Trans. Program. Lang. Syst. 31(6), 21:1–21:38 (2009)
12. Ullman, J.D.: Principles of Database and Knowledge-Base Systems. Computer Science Press (1988)