Have Your Cake and Eat it, Too: Data Provenance for Turing-Complete SQL Queries

Tobias Müller (Advisor: Torsten Grust)
University of Tübingen
Tübingen, Germany
to.mueller@uni-tuebingen.de

ABSTRACT
We report on our work on the computation of data provenance for feature-rich SQL. Among further constructs, our prototype supports correlated subqueries, aggregations, recursive queries, and window functions. Our analysis approach completely sidesteps relational algebra and instead requires a translation of the input query into an imperative-style program. Provided that the target language is Turing-complete, any SQL query can be covered. We employ a new variant of program analysis which consists of a dynamic and a static part. This two-step approach enables us to dodge limitations that a Turing-complete computation model otherwise entails for program analyses. The derived data provenance directly reflects the data provenance of the original SQL query.

Proceedings of the VLDB 2016 PhD Workshop, September 9, 2016, New Delhi, India.
Copyright (c) 2016 for this paper by its authors. Copying permitted for private and academic purposes.

1. INTRODUCTION
Data provenance [3, 4] is metadata — primarily about the origin of a certain piece of data. Everyday examples of desirable provenance information are the From: header field in an email or citations in academic papers. In these two cases, the provenance is trivial and does not need any clever algorithms for its computation (at least: it should not).

However, in the context of real-world relational database systems there is a deficiency regarding the provenance computation for contemporary implementations of SQL. SQL, being the standard relational query language, has support for advanced language constructs like recursive queries or window functions. Further, nesting of queries is possible, for example, through (correlated) subqueries. These features make writing queries convenient but also make the data provenance of query results non-trivial in the general case. Concrete scenarios in which data provenance for SQL has proved relevant are the view update/maintenance problem [4], data warehouses [4], and debugging [5]. The analysis approach we are going to describe is capable of computing the data provenance for any non-updating SQL query.

1.1 Provenance Model
We adopt the basic distinction of Where- and Why-provenance as originally introduced by Buneman and Tan [1]:

• Where-provenance: where did a certain piece of data originate? Exactly which table cells were copied or transformed to yield an output cell?

• Why-provenance: why is a certain piece of data in the result? Which input table cells were inspected to decide about the existence or contents of an output cell?

1.2 Basic Example
Figure 1 shows an intentionally simple SQL query and corresponding example tables.

    compounds                         SELECT formula
       compound    formula            FROM   compounds
    t1 citrate     C6H5O7             WHERE  compound = 'glucose'
    t2 glucose     C6H12O6
    t3 hydronium   H3O+               output
                                         formula
                                      t4 C6H12O6

    (a) Database instance.            (b) Query and result.

    Figure 1: Basic query example and provenance markers.

Mouse pointer 1 represents an inquiry for the data provenance of table cell t4: C6H12O6. According to the SQL query, two input columns are accessed: compound is used to decide whether a tuple gets filtered out or not. If a tuple qualifies, its value sitting in formula is copied over into the result table. Our provenance analysis accordingly finds the result to be why-dependent on tuple t2: glucose and where-dependent on t2: C6H12O6. In the following sections we revisit this example and illustrate how this outcome is actually computed using our program analysis.

1.3 Advanced Example
The provenance analysis of the query found in Figure 2(b) is a unique feature of our approach: to the best knowledge of the author, only our analysis approach can deal with recursive SQL queries. The query syntax-checks molecular formulae. Technically, the finite state machine depicted in Figure 3 is executed. The encoded FSM and the input formulae can be found in Figure 2(a). We inspect result cell O7 (mouse pointer 2).
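The FSM execution just described can be sketched in a few lines of Python: each step consumes one character of the formula and follows an edge whose label set contains it, mirroring the mechanics of the recursive query in Figure 2(b). The edge list EDGES below is a simplified, hypothetical stand-in for the paper's fsm table (it only accepts letter/digit formulae), and the names parse/accepts are ours.

```python
# A minimal FSM driver: consume one character per step, following an edge
# whose label set contains it. EDGES is a simplified stand-in for the fsm
# table of Figure 2(a), not its exact contents.
from string import ascii_letters, digits

# (source state, label set, target state) -- hypothetical simplification
EDGES = [
    (0, ascii_letters, 1),           # a formula must start with a letter
    (1, ascii_letters + digits, 1),  # element symbols and counts may follow
]
FINAL_STATES = {1}

def parse(formula, edges=EDGES, state=0):
    """Return the parsing trace [(step, state, rest), ...], mirroring the
    rows of the run table in Figure 2(c); None marks a stuck FSM."""
    trace = [(0, state, formula)]
    for step, ch in enumerate(formula, start=1):
        nxt = next((tgt for src, labels, tgt in edges
                    if src == state and ch in labels), None)
        if nxt is None:
            return None              # no matching edge: syntax error
        state = nxt
        trace.append((step, state, formula[step:]))
    return trace

def accepts(formula):
    trace = parse(formula)
    return trace is not None and trace[-1][1] in FINAL_STATES
```

Under these assumed transitions, parse("C6H5O7") produces one trace row per consumed character, in the spirit of the parsing trace of Figure 2(c), while a formula starting with a digit gets stuck in state 0 and is rejected.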
The according highlights within the compounds table (citrate and C6H5O7) show the same pattern as in the basic example of Section 1.2. More interesting markers can be found within table fsm. The highlighted cells inside t8 and t9 indicate which state changes were triggered while parsing the first letters of the formula.

    compounds                            fsm
       compound    formula                   source  labels        target  final
    t5 citrate     C6H5O7               t8   0       A..Za..z      1       false
    t6 glucose     C6H12O6              t9   1       A..Za..z0..9  1       true
    t7 hydronium   H3O+                 t10  1       0..9          2       true
                                        t11  1       +-            3       true
                                        t12  2       0..9          2       false
                                        t13  2       +-            3       false
                                        t14  3       A..Za..z      1       true

    (a) Database instance.

    WITH RECURSIVE
    run(compound, step, state, formula) AS (
      SELECT compound, 0, 0, formula
      FROM   compounds
        UNION ALL
      SELECT this.compound, this.step + 1 AS step,
             edge.target AS state, right(this.formula, -1) AS formula
      FROM   run AS this, fsm AS edge
      WHERE  length(this.formula) > 0
      AND    this.state = edge.source
      AND    strpos(edge.labels, left(this.formula, 1)) > 0
    )
    SELECT r.step, r.state, r.formula
    FROM   run AS r
    WHERE  r.compound = 'citrate'

    (b) Recursive SQL query driving the FSM.

    step  state  formula
    0     0      C6H5O7
    1     1      6H5O7
    2     1      H5O7
    3     1      5O7
    4     1      O7
    5     1      7
    6     1

    (c) Parsing trace.

    Figure 2: Advanced query example and provenance markers.

    Figure 3: FSM (states s0 to s3, with transitions labeled [A..Za..z], [0..9], and [+-] as encoded in table fsm).

1.4 Analysis Overview
Figure 4 provides a graphical overview of our analysis approach.

    SQL Query -> Translate SQL -> Imperative Code
    Imperative Code + Input Tables -> Dynamic Analysis -> Output Table, Logs
    Imperative Code + Logs -> Static Analysis -> Data Provenance

    Figure 4: Overview of the two-step analysis.

The actual provenance analysis happens within the dotted box. It requires the SQL query to be translated into imperative program code. For our prototype, we use a hand-crafted SQL compiler. Contemporary database systems like HyPer [9] perform such a translation internally.

The provenance analysis itself consists of two steps. At first, a dynamic analysis takes place which includes code instrumentation and execution. This step actually computes the same query result as a regular query processor would. As a side effect, two light-weight execution logs are written. They describe the execution flow during runtime and are a key element of this approach. In our second step, a static analysis is carried out, exploiting the runtime knowledge encoded within the logs. Our static analysis does absolutely no data processing: the data provenance is derived from program code and logs only. It is inspired by Program Slicing [2, 10]. In Section 3, all elements of our provenance analysis are explained in deeper detail.

2. SQL COMPILATION
Figure 5 shows a simplified yet executable translation of the basic SQL query in Figure 1(b). Ignore the logging statements until the subsequent section. The target language is kept minimal to just fit our needs: it can compute query results but has no support for I/O operations, for example. Due to space limitations, and as the presented code fragment consists of well-known language elements, we do not give a formal definition.

You find the table compounds of Figure 1(a) represented as a data structure (a list of dictionaries). The algorithm iterates over the input table (line 3) and if a tuple qualifies (line 5), its formula is appended to the result (line 7). Please note that we combined the input data (i.e., the database instance) and the computation algorithm into one program. In the regular case, both of them are kept separate (refer to Figure 4).

     1 data =
         [{"compound": "citrate",   "formula": "C6H5O7"},
          {"compound": "glucose",   "formula": "C6H12O6"},
          {"compound": "hydronium", "formula": "H3O+"}];
     2 res = [];
     3 foreach row in data do               put(logcf, true); put(logix, idxOf(row, data))
     4   c = row["compound"];               put(logix, "compound")
     5   if c == "glucose" then             put(logcf, true)
     6     t = {"formula": row["formula"]}; put(logix, "formula")
     7     append(res, t)                   put(logix, idxOf(t, output))
     8   else                               put(logcf, false)
     9     skip
    10   fi
    11 od                                   put(logcf, false)
    12 // res: [{"formula": "C6H12O6"}]

    Figure 5: The translated and instrumented SQL query.

3. PROVENANCE ANALYSIS
Before we get to the details of our approach, we shed some light on the theoretical limits of program analysis and the arising dilemma. Rice's theorem is a result of computability theory. Cast informally, the theorem states that in the Turing-complete computation model only trivial questions about the behavior of a program can be answered. A sample trivial question would be: how many lines does the program have? Non-trivial properties of a program (such as its data provenance), however, can only be addressed if the program actually is executed.

This gives rise to the following dilemma: to embrace a rich SQL dialect, we want to be Turing-complete (i.e., able to compute anything). Regarding program analysis, however, we want to avoid Turing completeness and its implications formulated in Rice's theorem. The approach illustrated next allows us to have the cake and eat it, too. It allows us to stay in the Turing-complete computation model during runtime and to switch into a weaker computation model for the provenance analysis.(1)

(1) As part of our future work, we seek to modify an existing database system and let it run the dynamic analysis simultaneously with query execution.

3.1 Two-Step Program Analysis
To make this switch possible we run consecutive dynamic and static analyses (compare Figure 4). During dynamic analysis, the behavior (not: the result) of certain program statements is recorded in logs. For example, an if-statement can branch into the then- or the else-block. We record this (binary) decision. During static analysis, this makes the behavior of an if-statement predetermined. The if no longer actively contributes to the computation and can be replaced by the according then- or else-branch. When applying this record & replace discipline to a relevant subset of a program's statements, we get an equivalent form of the original program computing the same result. But now the computation model has been simplified and is open for an exhaustive program analysis. In the remainder of this section we explain the two analysis steps in detail.

3.2 Dynamic Analysis
As motivated above, we aim to record the behavior of program statements during runtime. The following two logs are appended to:

• logcf (control flow): which/how often does a certain code branch get executed by if and foreach?

• logix (indices): at which locations are elements inside lists/dictionaries accessed?

During runtime, these properties are available and can easily be recorded. We use the technique of code instrumentation to create the two logs.

For an instrumented example, see Figure 5. The instrumentation instructions are placed on the right-hand side of the listing. The first argument of the put()-function is the log we want to append to; its second argument is the actual value being logged. Figure 6 lists the according logs. These are written (and read) sequentially and do not need any further meta-data, keeping the logs small.

    logcf = <true, false, true, true, true, false, false>
    logix = <0, "compound", 1, "compound", "formula", 0, 2, "compound">

    Figure 6: Log contents.

The logged data items are to be interpreted in the context of the (uninstrumented) source code. For example, the first entry of logcf corresponds to the first control flow decision in the program, at line 3. The foreach loop opened there can either execute its body (another time) or terminate and continue at the statement after line 11. We encode these decisions using Boolean values. The first true found in the log indicates that the body has been executed. The last false indicates that the foreach loop has exited. Similarly, an if-statement can decide between then (yields true) or else (yields false). List/dictionary element accesses get logged in logix. Note that foreach and append implicitly use numeric indices to read/write from/into lists and need to be included. The idxOf() function retrieves the ordinal position of a list element.

3.3 Static Analysis
Our static analysis does an abstract (value-less) interpretation of the uninstrumented source code. Instead of computing values, all input values are replaced by unique numeric identifiers. These pids are propagated during program interpretation and successively create a variable environment containing the data provenance information. Based on the basic query example of Figure 1 we present a simplified subset of our provenance derivation algorithm.

Figure 8 shows provenance inference rules denoted in operational semantics. The top Statements rule is the entry point for the interpretation. It takes the first statement s out of all statements ss to be interpreted. In general, interpretation of statements is triggered by the ⤇ symbol and leads to an update of the current variable environment Γ. The CF symbol represents the current data provenance of the control flow. The idea behind this is that reaching a certain code section depends on a number of branching decisions carried out by if/else statements. The dependencies of these decisions are collected in CF and propagated during program interpretation.

The numeric ids which represent a data provenance relationship are defined in Figure 7. There are two kinds, pide and pidy, which stand for Where- and Why-provenance, respectively. During analysis, these ids are created by the new()-function (for an example, see the Lit-Str rule). Initially, all pids are of the Where-type because any pide represents a certain value and a location of origin. During interpretation, they may be converted into the Why-type using the function Υ().

    P   := <c, e>                         l := any identifier
    c   := {pid1, ..., pidn}              γ(P) := c
    e   := {l1 ↦ P1, ..., ln ↦ Pn}        ε(P) := e
    pid ∈ {1e, 1y, 2e, 2y, 3e, 3y, ...}   Υ(pids) := {pidy | pide|y ∈ pids}

    Figure 7: Data structures used in provenance computation.

The main data structure is P. It can represent any value of any type of our programming language. Its second component e is used for container types (i.e., lists/dictionaries) to store the contained elements. The first component c is used for both containers and atomic values (e.g., strings); it represents the provenance of the value itself. The logs logcf and logix are read by the inference rules; see rules If-True and If-False, for example. The popf()-function reads and removes the first element of the according log.

The inference rules presented in Figure 8 are suitable to compute the data provenance of the basic query and finally yield the environment shown in Figure 9.
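The interplay of the two analysis steps can be condensed into a small Python sketch of the basic query: a dynamic pass that fills the two logs of Figure 6, followed by a value-less replay that derives provenance from the logs alone. The names log_cf, log_ix, and provenance are ours, and the (table, row, column) triples are a compressed stand-in for the pids tracked by the inference rules of Figure 8; this is illustrative, not the actual implementation.

```python
# Step 1 -- dynamic analysis: run the translated query (Figure 5) and
# record control-flow decisions (log_cf) and container indices (log_ix).
data = [{"compound": "citrate",   "formula": "C6H5O7"},
        {"compound": "glucose",   "formula": "C6H12O6"},
        {"compound": "hydronium", "formula": "H3O+"}]

log_cf, log_ix, res = [], [], []
for i, row in enumerate(data):
    log_cf.append(True); log_ix.append(i)       # foreach took its body
    log_ix.append("compound")                   # row["compound"] accessed
    if row["compound"] == "glucose":
        log_cf.append(True)                     # then-branch taken
        log_ix.append("formula")                # row["formula"] accessed
        res.append({"formula": row["formula"]})
        log_ix.append(len(res) - 1)             # append position in output
    else:
        log_cf.append(False)                    # else-branch taken
log_cf.append(False)                            # foreach exited

# Step 2 -- static analysis: a value-less replay. No data values are
# touched; the logs alone predetermine every branch and every index.
cf, ix = iter(log_cf), iter(log_ix)
provenance = {}                                 # output cell -> {where, why}
while next(cf):                                 # Foreach-True fired
    i = next(ix)                                # row bound to input tuple i
    next(ix)                                    # "compound" was inspected
    why = ("data", i, "compound")               # -> why-provenance
    if next(cf):                                # If-True fired
        next(ix)                                # "formula" was copied
        k = next(ix)                            # output position
        provenance[("res", k, "formula")] = {
            "where": {("data", i, "formula")},
            "why":   {why},
        }
```

For the basic query, this marks res[0]["formula"] as where-dependent on t2's formula cell and why-dependent on t2's compound cell, matching the roles of pids 5e and 4y in the environment of Figure 9.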
    Statements
      CF; Γ ⊢ s ⤇ Γ1    CF; Γ1 ⊢ ss ⤇ Γ2
      ───────────────────────────────────
      CF; Γ ⊢ s ; ss ⤇ Γ2

    PutVar                                    Skip
      CF; Γ ⊢ e ⤇ P    Γres = Γ + {v ↦ P}
      ───────────────────────────────────      ──────────────────
      CF; Γ ⊢ v = e ⤇ Γres                     CF; Γ ⊢ skip ⤇ Γ

    If-True
      popf(logcf)    CF; Γ ⊢ e ⤇ Pe    CFif = Υ(CF ∪ γ(Pe))    CFif; Γ ⊢ ss1 ⤇ Γres
      ──────────────────────────────────────────────────────────────────────────────
      CF; Γ ⊢ if e then ss1 else ss2 fi ⤇ Γres

    If-False (2)                              Foreach-False
      ¬popf(logcf)    ...                       ¬popf(logcf)
      ─────────────────────────                 ──────────────────────────────
      CF; Γ ⊢ if ... fi ⤇ Γres                  CF; Γ ⊢ foreach ... od ⤇ Γ

    Foreach-True
      popf(logcf)    CF; Γ ⊢ e ⤇ Pe    Pel = ε(Pe)[popf(logix)]
      Γfor = Γ + {v ↦ <γ(Pe) ∪ γ(Pel), ε(Pel)>}
      CF; Γfor ⊢ ss ; foreach v in e do ss od ⤇ Γres
      ──────────────────────────────────────────────
      CF; Γ ⊢ foreach v in e do ss od ⤇ Γres

    Append
      CF; Γ ⊢ e ⤇ Pe    Pv = Γ[v]
      P = <γ(Pv), ε(Pv) + {popf(logix) ↦ Pe}>    Γres = Γ + {v ↦ P}
      ───────────────────────────────────────────────────────────────
      CF; Γ ⊢ append(v, e) ⤇ Γres

    GetVar                                    Lit-Str
      P = Γ[v]    Pres = <γ(P) ∪ CF, ε(P)>      Pres = <{new()} ∪ CF, ∅>
      ────────────────────────────────────      ────────────────────────
      CF; Γ ⊢ v ⤇ Pres                          CF; Γ ⊢ c ⤇ Pres

    GetVar-Idx
      P = ε(Γ[v])[popf(logix)]    CF; Γ ⊢ e ⤇ Pe
      Pres = <γ(P) ∪ CF ∪ γ(Γ[v]) ∪ Υ(γ(Pe)), ε(P)>
      ──────────────────────────────────────────────
      CF; Γ ⊢ v[e] ⤇ Pres

    Lit-Dict
      |CF; Γ ⊢ ei ⤇ Pi| i=0..n    Pres = <{new()} ∪ CF, {ℓi ↦ Pi | i=0..n}>
      ───────────────────────────────────────────────────────────────────────
      CF; Γ ⊢ {ℓ0:e0, ..., ℓn:en} ⤇ Pres

    Lit-List
      |CF; Γ ⊢ ei ⤇ Pi| i=0..n    Pres = <{new()} ∪ CF, {i ↦ Pi | i=0..n}>
      ──────────────────────────────────────────────────────────────────────
      CF; Γ ⊢ [e0, ..., en] ⤇ Pres

    BinOp
      CF; Γ ⊢ e1 ⤇ P1    CF; Γ ⊢ e2 ⤇ P2    Pres = <γ(P1) ∪ γ(P2), ∅>
      ──────────────────────────────────────────────────────────────────
      CF; Γ ⊢ e1 ⊛ e2 ⤇ Pres

    Figure 8: Inference rules for data provenance.

    (2) Analogous to If-True: ss2 is interpreted.

    data : <{10e}, {0 ↦ <{3e}, {"compound" ↦ <{1e}, ∅>, "formula" ↦ <{2e}, ∅>}>,
                    1 ↦ <{6e}, {"compound" ↦ <{4e}, ∅>, "formula" ↦ <{5e}, ∅>}>,
                    2 ↦ <{9e}, {"compound" ↦ <{7e}, ∅>, "formula" ↦ <{8e}, ∅>}>}>
    row  : <{9e, 10e}, {"compound" ↦ <{7e}, ∅>, "formula" ↦ <{8e}, ∅>}>
    c    : <{7e, 9e, 10y}, ∅>
    t    : <{4y, 6y, 10y}, {"formula" ↦ <{5e, 4y, 6y, 10y}, ∅>}>
    res  : <∅, {0 ↦ <{4y, 6y, 10y}, {"formula" ↦ <{5e, 4y, 6y, 10y}, ∅>}>}>

    Pids of non-input values (> 10e|y) have been dropped.

    Figure 9: Resulting environment Γ after static analysis.

As the main result, we find four provenance relationships located in res[0]["formula"]. The highlighted pids 5e (which relates to t2: C6H12O6) and 4y (which relates to t2: glucose) constitute the data provenance visualized in Figure 1. 6y and 10y do not correspond to table cells and may be ignored. We already presented a visualization prototype in a recent demo paper [8].

3.4 Related Work
The strongest group of related work builds upon provenance propagation through query transformation on the algebraic layer. For example, there is the Provenance Semirings approach [7] as well as the PERM system [6]. In more recent work, both of them were extended to support aggregations and subqueries, respectively. These algebraic approaches are all limited in their expressiveness, and extending the number of supported algebraic operators is non-trivial.

4. CONCLUSIONS
The approach presented in this article pushes the boundaries of provenance analysis for SQL queries. Our prototype can analyse queries with advanced but timely SQL language features. Due to Turing-completeness, this approach can deal with any (non-updating) query translated into imperative code.

It is part of our future work to run this approach in the environment of a decent DBMS. In parallel, we pursue the derivation of How-provenance [3], i.e., to get each one of the computed provenance relations associated with the SQL clauses accountable for its existence.

5. REFERENCES
[1] P. Buneman, S. Khanna, and W.-C. Tan. Why and Where: A Characterization of Data Provenance. In Proc. ICDT, 2001.
[2] J. Cheney. Program Slicing and Data Provenance. IEEE Data Engineering Bulletin, 30(4):22–28, 2007.
[3] J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1(4), 2007.
[4] Y. Cui, J. Widom, and J. Wiener. Tracing the Lineage of View Data in a Warehousing Environment. ACM TODS, 25(2), 2000.
[5] B. Dietrich, T. Müller, and T. Grust. The Best Bang for Your Bu(ck)g. In Proc. EDBT, 2016.
[6] B. Glavic and G. Alonso. Perm: Processing Provenance and Data on the Same Data Model Through Query Rewriting. In Proc. ICDE, 2009.
[7] T. Green, G. Karvounarakis, and V. Tannen. Provenance Semirings. In Proc. PODS, 2007.
[8] T. Müller and T. Grust. Provenance for SQL Based on Abstract Interpretation: Value-less, but Worthwhile. In Proc. VLDB, Hawaii, USA, 2015.
[9] T. Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. In Proc. VLDB, 2011.
[10] M. Weiser. Program Slicing. IEEE Transactions on Software Engineering, SE-10(4), 1984.