Effective Datalog-like representation of procedural programs⋆ David Bednárek Department of Software Engineering Faculty of Mathematics and Physics, Charles University Prague bednarek@ksi.mff.cuni.cz Abstract. Database systems are constantly extending embedded-SQL their application area towards more general computing. procedural However, applications that combine database access and program general computing still suffer from the conceptual and tech- nical gap between relational algebra and procedural pro- parser gramming. In this paper, we show that procedural programs may be effectively represented in a Datalog-like language with functions and aggregates. Such a language may then relational procedural IR be used as a common representation for both relational and algebra procedural part of an application. code database query rewriting optimization statistics 1 Introduction bytecode physical operator plan generator As database systems were extending their application generator metadata area, their mainstay language, SQL, became no longer library sufficient to support all required applications of data- bytecode compile time physical plan bases. The database community reacted in two ways: run time Implementing domain specific languages like XQuery, physical operator SPARQL etc. and improving the interaction of data- JIT compiler code library base engines with general procedural programming languages. The second approach is certainly more gen- procedural plan call interface eral but it is also more difficult due to deep differences machine interpreter between the procedural and relational paradigms. Many database systems offer a procedural lan- data guage with embedded SQL statements. The most com- base mon processing scenario is depicted at Fig. 1. The procedural and relational parts are separated already in the parser stage. The procedural part is converted to a procedural intermediate representation while the Fig. 1. Typical processing path of embedded-SQL pro- relational part is expressed using extended relational grams algebra. The two parts are processed almost indepen- dently in the following stages. When entering run time, the procedural part is usually expressed by a procedu- operators selected from a library and the selection is ral bytecode which is designed for easy interpretation; guided by database statistics and physical operator the relational part is represented by a physical plan, metadata (cost etc.). i.e. an expression over a physical algebra. The procedural part meets with its relational ele- Database-related optimizations are performed only ments only at run time – the procedural machine ex- on the relational side; individual SQL statements ecutes the procedural code containing calls to a data- extracted from the source code are usually optimized base interface which in turn invokes the plan inter- independently. The most important optimization step preter. The interpreter schedules and dispatches calls is the strategy of physical plan generation, tradition- to procedures implementing individual physical oper- ally called cost-based optimization. Here, the logical ators of a plan. Although the individual SQL state- relational algebra operators are converted to physical ments extracted from the source program often access ⋆ The work was supported by the GACR project the same data, their plans usually interact only at the 202/10/0761 and by the GAUK grant SVV-2011- lowest levels of physical data access; thus, they often 263312. repeat the same operations on the same data. 10 David Bednárek The most important reason for the separation of 1.2 Architecture procedural and relational parts is probably the his- tory – the relational (SQL) query engines were usu- The architecture of the proposed system is shown at ally implemented well before the procedural add-on Fig. 2. Instead of separating the procedural and re- was considered and the software-engineering cost of lational part, the parser converts the source code into reimplementing the query engine was considered too a Datalog-like intermediate representation. This inter- high. mediate language and the conversion of procedural Nevertheless, there is another reason for the sep- code into it form the main subject of this paper. On aration – the nature of procedural code is so distant the other hand, conversion of relational queries to from the algebraic nature of SQL that it is very dif- Datalog is a well-known subject; thus, it is not neces- ficult to create a meaningful common representation sary to describe the handling of embedded SQL state- for the two parts. ments. The processing continues with rewriting step which 1.1 Goals optimizes the Datalog-like representation – this phase corresponds to query rewriting and it may also be in- In this paper, we suggest using a Datalog-like language fluenced by database statistics. Moreover, several op- as the common intermediate representation for proce- timization algorithms known from compiler construc- dural and relational code. This idea is natural, consid- tion like loop unrolling or variable renaming [17] have ering the close relationship between Datalog and rela- their equivalents in the transformation of logic pro- tional algebra on one side and the computing power grams, so this phase covers also the optimization of of logic programming and recursion on the other side. procedural code. However, there are still many obstacles in the inte- The most important feature of this architecture is gration effort. In particular, there is a risk of loss of the ability to apply rewriting optimization step across efficiency since the procedural code is not evaluated the boundary between procedural and relational code. directly but transformed to logic-programming repre- For instance, repeated invocations of a SQL statement sentation and then evaluated by a procedural hard- may be glued together, offering, for instance, the pos- ware. sibility to cache their partial results. In our approach, we strive to improve the efficiency The plan generator phase tries to cover the Data- of logic-programming representation by minimizing log-like representation using a predefined set of pat- the size of the models – we show that a procedural code terns. Each pattern corresponds to a component which can be encoded in a logic program in such a manner has several inputs and several outputs, each corre- that the size of the (minimal) model is proportional sponding to a predicate. A simplest component cover to the execution time of the original program on the one Datalog rule – in this case, the head of the rule same data. Thanks to this proportionality, the logic corresponds to the output and each atom of the body program may be evaluated completely in bottom-up corresponds to an input of the component. More so- manner. phisticated components correspond to a pattern cov- Due to the focus on bottom-up evaluation, we pre- ering more than one rule, including recursively depen- fer calling our approach Datalog-like over the term dent rules. logic programming, although we certainly must use Some of the components correspond to physical a language stronger than Datalog to achieve gener- operators of a relational engine; for instance, a Data- ality. log rule with two body atoms may be implemented Besides function symbols which are necessary to by a hash-join operator. Other components are imple- simulate general procedural programming, our langua- mented with simple procedural code snippets – these ge needs negation and aggregation for the emulation components ensure that the procedural parts of the of both procedural constructs and the embedded re- source code are reverted back to procedural code. Of lational language. Both negation and aggregation re- course, parts of the source code may be changed during quire special handling to ensure well-defined semantics the rewriting phase; consequently, procedural source and there are several approaches to the semantics of code may be eventually covered by relational operators negation in Datalog and logic programming in general. and, conversely, portions of the embedded relational Thus, defining a particular approach to semantics statements may be converted to procedural snippets. is a part of our effort – we reuse the concepts of lo- The implementation of physical relational opera- cal stratification [13] and rule progressivity [10] that tors is boxed in procedural packages which are con- together well reflect the nature of the original proce- nected together similarly to classical query plans. On dural code. the other hand, the procedural snippets are so small that individual packaging would be ineffective. There- Effective Datalog-like representation . . . 11 fore, the snippets are combined together to larger pro- 2 Background and related work cedural code fragments by the bytecode generator. This paper deals with the design of the intermedi- Interaction of procedural and relational code was re- ate representation and the translation from procedu- cognized as an important topic a long time ago; never- ral code to it. The subsequent steps – Datalog-based theless, practical applications of the results are very rewriting and plan generator – are subject of our cur- scarce. In [7], a successful optimization of the interac- rent research. The run-time portion of the system at tion cost was shown in the case of calling procedurally- Fig. 2 was already implemented for a SPARQL com- implemented functions from relational queries. Our piler [4] whose compile time used a relational interme- approach also applies to this case; nevertheless, we fo- diate code and an algebra-based plan generator. cused on the opposite problem – repeated calling of relational queries from procedural code. Nested relational algebra [14] may well reflect the embedded-SQL use of structured and relation-valued variables in procedural a procedural program; however, the overall comput- program ing strength of nested relational algebra is insufficient to express while loops. While loops may be added as parser an additional second-order construct atop of relational algebra, or represented by transitive closure in power- Datalog-like set algebra [9]. In Sec. 3.3, we will show a logical- intermediate representation programming equivalent of nested relational algebra together with the drawbacks of such an approach. Flattening nested relations is an important step Datalog rewriting database towards effective evaluation of nested algebra and it is physical statistics also present in the core of our approach. The original component snippet flattening principle described in [16] was designed to library physical flatten an isolated nested-relational algebra expression plan generator component metadata and it was based on the finite height of the expression library bytecode generator tree. Since a while loop may generate a calculation of unlimited length, this flattening technique may not compile time be applied for procedural programs. Instead, we had component bytecode run time tree to use a numbering technique (see Sec. 3.4) and to solve some unwanted consequences of the numbering procedural component approach. JIT compiler machine scheduler Datalog and its extensions, besides their natural physical applications in many areas of database theory, was component access data already successfully used in areas related to procedural code interface base library programming. A language derived from logical programming was designed for programming in distributed environ- Fig. 2. Proposed processing path of embedded-SQL pro- ment [5]. The opposite problem, generating effective grams. procedural implementations from Datalog programs, was studied in [11]. These recent publications suggest that the potential of Datalog was not exhausted in The rest of the paper is organized as follows: In the the first decades of its life and it may experience a re- Section 2, we briefly review the related work on the vival fueled by the renewed interest in non-traditional interaction between procedural and relational code as database architectures. well as Datalog-related definitions important for our There were attempts to improve the expressive po- approach. The following section uses a sample pro- wer of Datalog towards procedural programming by cedural code to illustrate several approaches to the non-traditional extensions of its semantics [8]. Extend- modelling of procedural code in Datalog style. After ing Datalog towards complex data structures known reviewing the required extensions to Datalog in Sec- from procedural programming was described in [6]. tion 3.5, a particular strategy to the minimization of These powerful extensions relied on significant intru- model size is presented in Section 3.6. sions to the traditional Datalog semantics; consequent- ly, their use in an intermediate language for a rela- tional platform would be doubtful. 12 David Bednárek Datalog with temporal features was used to model relational algebra can be simulated with plain rela- sequences of data-manipulation statements [10] or in tional algebra [16], this is not the case with transitive the analysis of procedural programs [15]. Our number- closure. Thus, representing the example code in the ing technique shown in Sec. 3.4 is similar to temporal algebraic world requires transitive closure over nested techniques although used for a different purpose. relational algebra, which includes expensive atomic Among the sheer number of approaches to Datalog operations like equality test over sets and it is diffi- extensions, semantics and evaluation strategies, local cult to reduce it to implementable physical operators. and, in particular, temporal stratification [12] has mo- A natural response to the problems of algebraic tivated our approach. In addition, we also make use of representation is switching to Datalog where the while the progressivity [10] of rules in the scheduling strategy loop may be handled easily using recursion. However, used in plan execution. the Algorithm 1 demonstrates several obstacles in the Datalog approach. 3 Modeling procedural code in Datalog style 3.1 Naı̈ve approach The following rule forms a naı̈ve Datalog implementa- tion of the loop body: Algorithm 1 Example: A procedural algorithm with state(M ′ , S ′ ) ← state(M, S), stmt3(X, M ), embedded SQL statements stmt4(S ′ , S, X), stmt5(M ′ , M, X). Require: Relation M (A, B) 1: S := z The predicates stmt3, stmt4, and stmt5 imple- 2: while exists(M ) do ment the behavior of the statements 3, 4, and 5 of 3: X := select min(A) from M Algorithm 1. This approach creates extremely ineffi- where A not in (select B from M ) cient representation because any model of this pro- 4: S := f (S, X) gram contains ground atoms for stmt3, stmt4, and 5: delete from M where A = X stmt5 representing all satisfiable variable assignments 6: end while for the statements regardless of its reachability dur- 7: return S ing an execution. In addition, statement clauses (not shown here) violate safety rules as some of the vari- ables are bound only by functional symbols. Conse- Algorithm 1 is an example of code written in a pro- quently, this approach is not suitable for bottom-up cedural language with SQL embedded. It consumes evaluation in Datalog style. a relation M with schema (A, B) representing edges of a directed graph and traverses it in a topological ordering. In the loop, a node X is selected that has 3.2 Using function symbols for relational no incoming edge in M . The min aggregate is neces- algebra sary to select from multiple candidates. Later in the loop, all outgoing edges of X are removed from M . The following, improved representation may be de- The output of the loop is the variable S which aggre- rived from the previous one using the Magic-sets trans- gates the selected values of X using the constant z formation [1]: and the function f (S, X) (e.g. concatenation). If the state1(M ) ← m0(M ). graph is acyclic, the algorithm terminates after at state2(M, z) ← state1(M ). most |M | iterations; otherwise, it eventually fails to state3(M, S) ← state2(M, S), M 6= ∅. find any A in the statement 3 and, depending on the state3(M, S) ← state6(M, S), M 6= ∅. subtle details of the statement semantics, either causes state4(M, S, X) ← state3(M, S), an exception or loops indefinitely. X = πmin(A) (πA M \ πB M ). Modeling of a while-loop requires an extension to state5(M, S ′ , X) ← state4(M, S, X), S ′ = f (S, X). relational algebra and transitive closure is the most state6(M ′ , S) ← state5(M, S, X), obvious candidate. Each iteration of the loop changes M ′ = M \ σA=X M. the state of the program – the variables M and S state7(S) ← state2(M, S), M = ∅. – thus, there is a relation B which models the loop- state7(S) ← state6(M, S), M = ∅. body behavior using tuples hM, S, M ′ , S ′ i. Together The predicate statei indicate the reachability of with the while-head condition H, the transitive clo- a particular variable assignment in the beginning of sure L = σM ′ =∅ (σM6=∅ B)∗ models the loop. Unfortu- statement i of Algorithm 1. The relational statements nately, it requires nested relational algebra to repre- are represented as equality statements containing sent the relation-valued attribute M ; although nested function symbols from relational algebra. The rules Effective Datalog-like representation . . . 13 implement a state machine simulating the execution of s2(T + 1, f (S, X)) ← branch23(T ), the original program; the model expansion and safety s2(T, S), x4(T, X). problems mentioned above disappeared. m2(T + 1, A, B) ← branch23(T ), The relational statements are implemented using m2(T, A, B), x4(T, X), A 6= X. a non-Datalog mechanism; thus, this approach is far branch27(T ) ← state2(T ), ¬cond2(T ). from being a suitable intermediate representation for return(S) ← branch27(T ), s2(T, S). mixed code. Variable values are represented independently: xi (T, V ) determines the value V of the variable X be- 3.3 Nesting and unnesting relational fore entering statement i at time T . In the case of variables relational-valued variable M , the relation is unnested and its tuples are represented by individual instances In this section, the relational algebra was hoisted to of the atom mi (T, A, B). Datalog level using the unnest predicate and the nest Relational (and in general, complex-valued) Data- aggregate. This approach is equivalent to nested re- log variables and terms are no longer needed – every lational algebra. The predicate unnest(M, A, B) per- term is an atomic value. forms the membership test hA, Bi ∈ M ; the general- ized aggregate nest(A, B) collects all hA, Bi pairs and Note that the dissolution of statei predicates combine them into a set (see [3] for exact definiton of created an opportunity for optimization: Every atomic semantics of aggregates in Datalog): statement modifies only one variable; therefore, only one clause is required for each statement, specifying state1(M ) ← m0(M ). the new variable value. For a variable, it is sufficient state2(M, z) ← state1(M ). to have the value specified only at reference points cond2(M ) ← state2(M, S), unnest(M, A, B). (for S and M , it is the beginning of the statement 2), state3(M, S) ← state2(M, S), cond2(M ). provided that at least one reference point lies on any state3(M, S) ← state6(M, S), cond2(M ). path from any definition to any usage of this variable. cond3(M, B) ← state3(M, S), unnest(M, A, B). In our case, the value for the reference point 2 is spec- state4(M, S, min(A)) ← state3(M, S), ified twice because there are two control paths leading unnest(M, A, B), ¬cond3(M, A). to this point. state5(M, S ′ , X) ← state4(M, S, X), S ′ = f (S, X). state6(nest(A, B), S) ← state5(M, S, X), The execution of rules implementing statements is unnest(M, A, B), A 6= X. guarded by trigger predicates branchi,j which signal- state7(S) ← state2(M, S), ¬cond2(M ). ize passing from the statement i to the statement j. state7(S) ← state6(M, S), ¬cond2(M ). These predicates are controlled by conditions and their negations; in our case, the predicate cond2. This approach unifies the means used for proce- dural and relational fragments. Nevertheless, it suffers from the stratification required by the nest aggregate and the need to incorporate all live variables into single 3.5 Requirements on the logic language statei atom. In our example Algorithm 1, there is a loop-carried 3.4 Numbering iterations dependence from the variable M through X to the next M value which involves negation and aggrega- The following code illustrates the approach we finally tion. This is reflected in the presence of negation and used. The argument T representing time (more exact- aggregation in the mutual recursion between the pred- ly, the number of iteration) was introduced to almost icates x4 and s2. Our representation is therefore un- all rules. It allowed dissolution of the original statei stratifiable; the unstratifiability is inherent to Algo- predicates: statei (T ) indicates reachability of the sta- rithm 1 since the length of the chain of negations tement i at time T . generated by the loop is unlimited. Consequently, no m2(1, A, B) ← m0(A, B). stratification may exist for any Datalog-like represen- s2(1, z). tation of this example. state2(1). This forces us to use the concept of local stratifica- state2(T + 1) ← branch23(T ). tion. Note that in pure Datalog without function sym- cond2(T ) ← state2(T ), m2(T, A, B). bols, local stratification is almost equivalent to strat- branch23(T ) ← state2(T ), cond2(T ). ification [2]. However, our system does use function cond3(T, B) ← branch23(T ), m2(T, A, B). symbols to generate the time values T and to imple- x4(T, min(A)) ← branch23(T ), ment built-in operators and functions of the procedu- m2(T, A, B), ¬cond3(T, A). ral language and of SQL. 14 David Bednárek 3.6 Long-range variable passing In our example Algorithm 1, each iteration of the loop modifies each variable. However, a loop body may con- tain conditional statements; thus, a variable may be- come unmodified in some iterations. In this case, there must be a mechanism to pass the unmmodified vari- able value through the loop body. A simple implemen- tation of such mechanism is the following clause: valueV (T + 1, X) ← valueV (T, X), cond(T ). For every scalar variable V and for every itera- tion T , a ground atom valueV (T, X) is present in the model indicating the value X of the variable. Con- Fig. 3. Long-range variable passing sequently, the model is of the size Ω(τ υ) where τ is the execution time of the procedural program and υ is the number of variables in the program. For non- evaluation of the intermediate language – our results scalar variables the cost is even larger, because the show only O(log υ) degradation with respect to the argument X (or more arguments) encodes an individ- native procedural evaluation of a code with υ vari- ual element of a relation or an array, therefore there ables. are as many ground atoms as the size of the variable. In this paper, we presented the principles of the Evaluating the complete model is thus unacceptably proposed language and its use for the representation costly. of procedural code. For the sake of clarity, we omitted Fortunately, the cost of unmodified variable pass- many technical details and tricks that were necessary ing may be lowered to O(τ log(υ)). The principle is de- to achieve the reported effectiveness of the represen- picted in Fig. 3 – instead of copying the variable value tation. on every iteration, there are several layers of preferred Whether our approach is viable, it can be shown points in time and copying is performed at these pre- only by successful implementation of the whole pro- ferred points. Preferred points of layer k occur at the cessing chain from Fig. 2. The design of the interme- distance of 2k iterations and copying is allowed either diate representation was only a necessary prerequisite to a point of higher preference or to a point of access before attempting the implementation. to the variable. The thick lines in Fig. 3 show how Our approach was motivated by the lessons learned a variable is passed when it is accessed in the three from the implementation of a parallel SPARQL en- points marked at the time axis. gine [4] for the Semantic Web. When using a relational The number of copies done between two accesses repository, many Semantic Web algorithms are most is O(log∆) where ∆ is the time distance between the easily expressed as simple procedural algorithms over accesses, i.e. the length of the passing range. Since an relational queries. atomic statement may access only a limited number Thus, using a combined relational/procedural in- of variables, the number n of passing ranges is propor- termediate language may save the tedious and error- tional to the execution time, i.e. n = c ∗ τ . Since the prone work associated with reformulating such algo- ranges may not intersect for the rithms in either purely relational (if ever possible) or Pnsame variable, their purely procedural way. In addition, the mixed repre- sum across of all variables is i=1 ∆i ≤ τ υ. Conse- quently, the total cost of copying is proportional to sentation offers new opportunities for optimization. If successful, the new architecture may become im- Xn Pn ∆i τυ υ portant for areas where database access is tightly cou- log(∆i ) ≤ n log i=1 ≤ n log( ) = τ c log pled with non-trivial computing, including the Seman- i=1 n n c tic Web, computational linguistics or some areas of e-science. 4 Conclusion References We have proposed a promising approach to mixed pro- 1. C. Beeri, R. Ramakrishnan: On the power of magic. cedural and relational code whose key element is The Journal of Logic Programming, 10(3–4), 1991, a novel intermediate representation based on logic pro- 255–299. gramming. With respect to the assumed bottom-up 2. H. A. Blair, V. W. Marek, J. S. Schlipf: The expressive- evaluation strategy, the intermediate language falls to ness of locally stratified programs. Annals of Mathema- the Datalog family. Special care was taken for effective tics and Artificial Intelligence, 15(2), 1995, 209–229. Effective Datalog-like representation . . . 15 3. M. P. Consens, A. O. Mendelzon: Low-complexity ag- gregation in graphlog and datalog. Theoretical Com- puter Science 116(1), 1993, 95–116. 4. Z. Falt, D. Bednarek, M. Cermak, F. Zavoral: On par- allel evaluation of sparql queries. In DBKDA 2012, The Fourth International Conference on Advances in Databases, Knowledge, and Data Applications, 2012, 97–102. 5. S. C. Goldstein, F. Cruz: Meld: A logical approach to distributed and parallel programming. Technical re- port, DTIC Document, 2012. 6. S. Greco, L. Palopoli, E. Spadafora: Extending datalog with arrays. Data & Knowledge Engineering 17(1), 1995, 31–57. 7. R. Guravannavar, S. Sudarshan: Rewriting procedures for batched bindings. Proceedings of the VLDB En- dowment 1(1), 2008, 1107–1123. 8. A. Guzzo, D. Sacca: Semi-inflationary datalog: A declarative database language with procedural fea- tures. Artificial Intelligence Communications 18(2), 2005, 79–92. 9. M. Gyssens, D. Van Gucht: The powerset algebra as a result of adding programming constructs to the nested relational algebra. ACM 17, 1988. 10. G. Lausen, B. Ludäscher, W. May: On active deductive databases: The Statelog approach. Transactions and Change in Logic Databases, 1998, 69–106. 11. Y. A. Liu, S. D. Stoller: From datalog rules to effi- cient programs with time and space guarantees. In Proceedings of the 5th ACM SIGPLAN International Conference on Principles and Practice of Declaritive Programming. ACM, 2003, 172–183. 12. C. Nomikos, P. Rondogiannis, M. Gergatsoulis: Tem- poral stratification tests for linear and branching-time deductive databases. Theoretical Computer Science 342(2), 2005, 382–415. 13. L. Palopoli: Testing logic programs for local stratifi- cation. Theoretical Computer Science 103(2), 1992, 205–234. 14. H. J. Schek, M. H. Scholl: The relational model with relation-valued attributes. Information Systems 11(2), 1986, 137–147. 15. Y. Smaragdakis, M. Bravenboer: Using datalog for fast and easy program analysis. In Datalog Reloaded: First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers, vol. 6702, pp. 245. Springer-Verlag New York Inc, 2011. 16. J. Van den Bussche: Simulation of the nested relational algebra by the flat relational algebra, with an applica- tion to the complexity of evaluating powerset algebra expressions. Theoretical Computer Science 254(1-2), 2001, 363–377. 17. E. Visser: Stratego: A language for program transfor- mation based on rewriting strategies system description of stratego 0.5. In Rewriting Techniques and Applica- tions, Springer, 2001, 357–361.