<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Have Your Cake and Eat it, Too: Data Provenance for Turing-Complete SQL Queries</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Tobias</forename><surname>Müller</surname></persName>
							<email>to.mueller@uni-tuebingen.de</email>
							<affiliation key="aff0">
<orgName type="institution">University of Tübingen</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Have Your Cake and Eat it, Too: Data Provenance for Turing-Complete SQL Queries</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F5A954F5AD26B6CADC27A26FEC83E0CC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T00:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We report on our work on the computation of data provenance for feature-rich SQL. Among other constructs, our prototype supports correlated subqueries, aggregations, recursive queries, and window functions. Our analysis approach completely sidesteps relational algebra and instead requires a translation of the input query into an imperative-style program. Provided that the target language is Turing-complete, any SQL query can be covered. We employ a new variant of program analysis which consists of a dynamic and a static part. This two-step approach enables us to dodge the limitations that a Turing-complete computation model otherwise entails for program analyses. The derived data provenance directly reflects that of the original SQL query.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Data provenance <ref type="bibr" target="#b3">[3,</ref><ref type="bibr" target="#b4">4]</ref> is metadata, primarily about the origin of a certain piece of data. Everyday examples of desirable provenance information are the From: header field in an email or citations in academic papers. In these two cases, the provenance is trivial and does not need any clever algorithms for its computation (at least: it should not).</p><p>However, contemporary implementations of SQL in real-world relational database systems are deficient when it comes to provenance computation. SQL, the standard relational query language, supports advanced language constructs like recursive queries or window functions. Further, queries can be nested, for example, through (correlated) subqueries. These features make writing queries convenient but also make the data provenance of query results non-trivial in the general case. Concrete scenarios in which data provenance for SQL has proved relevant are the view update/maintenance problem <ref type="bibr" target="#b4">[4]</ref>, data warehouses <ref type="bibr" target="#b4">[4]</ref>, and debugging <ref type="bibr">[5]</ref>. The analysis approach we describe is capable of computing the data provenance of any non-updating SQL query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Provenance Model</head><p>We adopt the basic distinction between Where- and Why-provenance as originally introduced by Buneman et al. <ref type="bibr" target="#b1">[1]</ref>:</p><p>• Where-provenance: where did a certain piece of data originate? Exactly which table cells were copied or transformed to yield an output cell? • Why-provenance: why is a certain piece of data in the result? Which input table cells were inspected to decide about the existence or contents of an output cell?</p></div>
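To make the distinction concrete, here is an illustrative Python sketch (not the paper's implementation) that annotates the output of a simple filter-and-project query with both provenance kinds; the cell addresses like ("t2", "formula") are hypothetical identifiers invented for this example.

```python
# Where- vs. Why-provenance for SELECT output_col FROM rows
# WHERE predicate_col = wanted, tracked per output value.
compounds = [
    {"id": "t1", "compound": "citrate",   "formula": "C6H5O7 3-"},
    {"id": "t2", "compound": "glucose",   "formula": "C6H12O6"},
    {"id": "t3", "compound": "hydronium", "formula": "H3O+"},
]

def filter_with_provenance(rows, predicate_col, wanted, output_col):
    result = []
    for row in rows:
        if row[predicate_col] == wanted:
            result.append({
                "value": row[output_col],
                # Where: the cell that was copied into the output.
                "where": {(row["id"], output_col)},
                # Why: the cell inspected to decide the row's fate.
                "why":   {(row["id"], predicate_col)},
            })
    return result

out = filter_with_provenance(compounds, "compound", "glucose", "formula")
```

Only the glucose row qualifies, so its formula cell is the Where-provenance of the single output value, while the compound cell that was tested is its Why-provenance.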
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Basic Example</head><p>Figure <ref type="figure">1</ref> shows an intentionally simple SQL query and corresponding example tables. Mouse pointer 1 represents an inquiry into the data provenance of C 6 H 12 O 6 . In the following sections, we revisit this example and illustrate how this provenance is actually computed using our program analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Advanced Example</head><p>The provenance analysis of the query found in Figure <ref type="figure" target="#fig_3">2(b</ref>) is a unique feature of our approach: to the best of our knowledge, only our analysis approach can deal with recursive SQL queries.</p><p>The query syntax-checks molecular formulae. Technically, it implements the finite state machine depicted in Figure <ref type="figure">3</ref>. More interesting markers can be found within table fsm. The highlighted cells inside t 8 and t 9 indicate which state changes were triggered while parsing the first letters of the formula.</p></div>
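The idea behind the recursive query can be sketched imperatively: walk a transition table (the paper's table fsm) character by character. The states, character classes, and transitions below are invented for illustration; the paper's actual FSM (Figure 3) is not reproduced here.

```python
# Hypothetical sketch of syntax-checking a molecular formula with a
# finite state machine stored as a transition table.
def char_class(ch):
    if ch.isupper(): return "upper"
    if ch.islower(): return "lower"
    if ch.isdigit(): return "digit"
    return "other"

# (state, character class) -> next state; "start" is the initial state,
# and any state other than "start" counts as accepting in this sketch.
FSM = {
    ("start",   "upper"): "element",
    ("element", "upper"): "element",
    ("element", "lower"): "element",
    ("element", "digit"): "count",
    ("count",   "digit"): "count",
    ("count",   "upper"): "element",
}

def syntax_check(formula):
    state = "start"
    for ch in formula:
        key = (state, char_class(ch))
        if key not in FSM:       # no transition: reject
            return False
        state = FSM[key]
    return state != "start"
```

In the paper, this walk is expressed as a recursive SQL query over table fsm, and the highlighted fsm cells are exactly the transitions taken.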
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Analysis Overview</head><p>Figure <ref type="figure" target="#fig_1">4</ref> provides a graphical overview of our analysis approach. The actual provenance analysis happens within the dotted box. It requires the SQL query to be translated into imperative program code. For our prototype, we use a handcrafted SQL compiler. Contemporary database systems like HyPer <ref type="bibr" target="#b9">[9]</ref> perform such a translation internally.</p><p>The provenance analysis itself consists of two steps. First, a dynamic analysis takes place which includes code instrumentation and execution. This step actually computes the same query result as a regular query processor would.<ref type="foot" target="#foot_0">1</ref> As a side effect, two light-weight execution logs are written. They describe the execution flow at runtime and are a key element of this approach.</p><p>In our second step, a static analysis is carried out, exploiting the runtime knowledge encoded in the logs. Our static analysis does no data processing at all: the data provenance is derived from program code and logs only. It is inspired by Program Slicing <ref type="bibr" target="#b2">[2,</ref><ref type="bibr" target="#b10">10]</ref>.</p><p>In Section 3, all elements of our provenance analysis are explained in greater detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">SQL COMPILATION</head><p>Figure <ref type="figure" target="#fig_5">5</ref> shows a simplified yet executable translation of the basic SQL query in Figure <ref type="figure">1(b)</ref>. Ignore the logging statements until the next section.</p><p>The target language is kept minimal to just fit our needs: it can compute query results but has no support for I/O operations, for example. Due to space limitations, and as the presented code fragment consists of well-known language elements, we do not give a formal definition. The table compounds of Figure <ref type="figure">1</ref>(a) is represented as a data structure (a list of dictionaries). The algorithm iterates over the input table (line 3) and, if a tuple qualifies (line 5), appends its formula to the result (line 7).</p><p>Note that we have combined the input data (i.e., the database instance) and the computation algorithm into one program. In the regular case, both are kept separate (refer to Figure <ref type="figure" target="#fig_1">4</ref>).</p></div>
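A Python rendering of the translated query sketched above (logging statements omitted) could look as follows; the variable names and the exact filter predicate are assumptions based on the basic example, not the paper's verbatim code, and the formulae are simplified to plain ASCII.

```python
# Imperative translation of: SELECT formula FROM compounds
#                            WHERE compound = 'glucose'
data = [                                   # table compounds as a
    {"compound": "citrate",   "formula": "C6H5O7 3-"},  # list of
    {"compound": "glucose",   "formula": "C6H12O6"},    # dictionaries
    {"compound": "hydronium", "formula": "H3O+"},
]

res = []
for row in data:                           # iterate over the input table
    c = row["compound"]
    if c == "glucose":                     # tuple qualifies?
        t = {"formula": row["formula"]}    # project the formula column
        res.append(t)                      # append to the result
```

The result list res plays the role of the query's output table.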
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">PROVENANCE ANALYSIS</head><p>Before we get to the details of our approach, we shed some light on the theoretical limits of program analysis and the dilemma that arises. Rice's theorem is a result from computability theory. Cast informally, the theorem states that in the Turing-complete computation model only trivial questions about the behavior of a program can be answered. A sample trivial question would be: how many lines does the program have? Non-trivial properties of a program (such as data provenance), however, can only be addressed if the program is actually executed.</p><p>This gives rise to the following dilemma: to embrace a rich SQL dialect, we want to be Turing-complete (i.e., able to compute anything). Regarding program analysis, however, we want to avoid Turing completeness and its implications as formulated in Rice's theorem. The approach illustrated next allows us to have our cake and eat it, too: we stay in the Turing-complete computation model during runtime and switch into a weaker computation model for provenance analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Two-Step Program Analysis</head><p>To make this switch possible, we run consecutive dynamic and static analyses (compare Figure <ref type="figure" target="#fig_1">4</ref>).</p><p>During dynamic analysis, the behavior (not: the result) of certain program statements is recorded in logs. For example, an if-statement can branch into the then- or the else-block. We record this (binary) decision. During static analysis, this makes the behavior of the if-statement predetermined: the if no longer actively contributes to the computation and can be replaced by the corresponding then- or else-branch.</p><p>When applying this record &amp; replace discipline to a relevant subset of a program's statements, we get an equivalent form of the original program computing the same result. But now the computation model has been simplified and is open to an exhaustive program analysis. In the remainder of this section we explain the two analysis steps in detail.</p></div>
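The record &amp; replace discipline for a single if-statement can be sketched as follows (illustrative only; the paper instruments its dedicated target language, not Python):

```python
# Record & replace for one if-statement.
log_cf = []

def classify_recorded(x):
    """Dynamic analysis: execute normally, recording the branch taken."""
    taken = x > 0
    log_cf.append(taken)          # record the (binary) decision
    return "pos" if taken else "non-pos"

def classify_replayed():
    """Static analysis: the if no longer decides -- the log does."""
    taken = log_cf.pop(0)         # replace the test by the recorded value
    return "pos" if taken else "non-pos"

classify_recorded(7)              # records True
classify_recorded(-3)             # records False
```

Replaying now yields the same branch sequence without ever re-evaluating the condition, which is exactly what makes the static analysis independent of actual data values.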
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Dynamic Analysis</head><p>As motivated above, we aim to record the behavior of program statements at runtime. The following two logs are appended to:</p><p>• log cf (control flow): which code branch gets executed, and how often, by if and foreach? • log ix (indices): at which locations are elements inside lists/dictionaries accessed? At runtime, these properties are available and can easily be recorded. We use the technique of code instrumentation to create the two logs.</p><p>For an instrumented example, see Figure <ref type="figure" target="#fig_5">5</ref>. The instrumentation instructions are placed on the right-hand side of the listing. The first argument of the put()-function is the type of log we want to append to. Its second argument is the actual value being logged. Figure <ref type="figure" target="#fig_6">6</ref> lists the corresponding logs. These are written (and read) sequentially and do not need any further metadata, keeping the logs small.</p><p>The logged data items are to be interpreted in the context of the (uninstrumented) source code. For example, the first entry of log cf corresponds to the first control flow decision in the program at line 3. The foreach loop opened there can either execute its body (another time) or terminate and continue at the statement after line 11. We encode these decisions using Boolean values. The first true found in the log indicates that the body has been executed; the last false indicates that the loop has exited. List/dictionary element accesses are logged in log ix . Note that foreach and append implicitly use numeric indices to read/write from/into lists and need to be included. The idxOf() function retrieves the ordinal position of a list element.</p></div>
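An instrumented version of the translated basic query can be modeled in Python as below. The put() calls are placed where Figure 5 places them (loop entry/exit, branch decisions, and implicit indices of foreach and append); executing this sketch reproduces the log contents shown in Figure 6.

```python
# Instrumented translation of the basic query, writing log_cf and log_ix.
log_cf, log_ix = [], []

def put(log, value):
    log.append(value)            # sequential, metadata-free logging

data = [
    {"compound": "citrate",   "formula": "C6H5O7 3-"},
    {"compound": "glucose",   "formula": "C6H12O6"},
    {"compound": "hydronium", "formula": "H3O+"},
]

res = []
i = 0
while True:                      # models the foreach loop
    if i >= len(data):
        put(log_cf, False)       # foreach terminates
        break
    put(log_cf, True)            # foreach executes its body (again)
    put(log_ix, i)               # implicit list index used by foreach
    row = data[i]
    put(log_ix, "compound")      # dictionary key being read
    if row["compound"] == "glucose":
        put(log_cf, True)        # then-branch taken
        put(log_ix, "formula")
        res.append({"formula": row["formula"]})
        put(log_ix, len(res) - 1)  # implicit index used by append
    else:
        put(log_cf, False)       # else-branch taken
    i += 1
```

Note how log_cf interleaves loop and branch decisions in program order, and how log_ix records both explicit key lookups and the implicit numeric indices.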
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Static Analysis</head><p>Our static analysis performs an abstract (value-less) interpretation of the uninstrumented source code. Instead of computing values, all input values are replaced by unique numeric identifiers. These pids are propagated during program interpretation and successively create a variable environment containing the data provenance information. Based on the basic query example of Figure <ref type="figure">1</ref>, we present a simplified subset of our provenance derivation algorithm.</p><p>Figure <ref type="figure" target="#fig_9">8</ref> shows provenance inference rules denoted in operational semantics. The top Statements rule is the entry point of the interpretation. It takes the first statement s out of all statements ss to be interpreted. In general, interpretation of statements is triggered by the − − ⤇ symbol and leads to an update of the current variable environment Γ. The CF symbol represents the current data provenance of the control flow. The idea behind this is that reaching a certain code section depends on a number of branching decisions carried out by if/else statements. The dependencies of these decisions are collected in CF and propagated during program interpretation.</p><p>The numeric ids which represent a data provenance relationship are defined in Figure <ref type="figure" target="#fig_7">7</ref>. There are two kinds, pid e and pid y , which stand for Where- and Why-provenance, respectively. During analysis, these ids are created by the new ()-function (for an example, see the Lit-Str rule). Initially, all pids are of the Where-type because any pid e represents a certain value and a location of origin. During interpretation, they may be converted into the Why-type using the function Υ().</p><p>The main data structure is P . It can represent any value of any type of our programming language. Its second component e is used for container types (i.e., lists/dictionaries) to store the contained elements. The first component c is used for both containers and atomic values (e.g., strings); it represents the provenance of the value itself. The logs log cf and log ix are read by the inference rules; see rules If-True and If-False, for example. The popf ()-function reads and removes the first element of the corresponding log.</p><p>The inference rules presented in Figure <ref type="figure" target="#fig_9">8</ref> are suitable to compute the data provenance of the basic query and finally yield the environment shown in Figure <ref type="figure" target="#fig_10">9</ref>. As the main result, we find four provenance relationships located in res[0]["formula"]. The highlighted pids 5 e (relating to t 2 : C 6 H 12 O 6 ) and 4 y (relating to t 2 : glucose) constitute the data provenance visualized in Figure 1; 6 y and 10 y do not correspond to table cells and may be ignored. We already presented a visualization prototype in a recent demo paper <ref type="bibr" target="#b8">[8]</ref>.</p><formula xml:id="formula_0">Statements CF ; Γ ⊢ s − − ⤇ Γ 1 CF ; Γ 1 ⊢ ss − − ⤇ Γ 2 CF ; Γ ⊢ s ; ss − − ⤇ Γ 2 PutVar CF ; Γ ⊢ e ⤇ P Γ res = Γ + {v ↦ P } CF ; Γ ⊢ v = e − − ⤇ Γ res Skip CF ; Γ ⊢ skip − − ⤇ Γ If-True popf (log cf ) CF ; Γ ⊢ e ⤇ P e CF if = Υ(CF ∪ γ(P e )) CF if ; Γ ⊢ ss 1 − − ⤇ Γ res CF ; Γ ⊢ if e then ss 1 else ss 2 fi − − ⤇ Γ res If-False ¬popf (log cf ) ... 2 CF ; Γ ⊢ if ... fi − − ⤇ Γ res</formula></div>
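A toy Python model of the pid bookkeeping makes the rules more tangible. The names new and Υ (here why) follow Figures 7 and 8; modeling P as a (c, e) pair of a pid set and an element dictionary follows Figure 7, while the concrete encoding of pids as strings like "1e"/"1y" is an assumption of this sketch.

```python
# Simplified model of Lit-Str, BinOp, and the CF update of If-True.
from itertools import count

_fresh = count(1)

def new():
    """Create a fresh Where-pid such as '1e'."""
    return f"{next(_fresh)}e"

def why(pids):
    """Υ: convert any pid (Where or Why) into its Why-variant."""
    return {p[:-1] + "y" for p in pids}

# P is modeled as (c, e): c a set of pids, e the contained elements.
def lit_str(cf):
    return ({new()} | cf, {})           # rule Lit-Str

def binop(p1, p2):
    return (p1[0] | p2[0], {})          # rule BinOp

p_a = lit_str(set())      # first string literal
p_b = lit_str(set())      # second string literal
p_cmp = binop(p_a, p_b)   # a comparison depends on both operands
cf_then = why(p_cmp[0])   # If-True: CF_if = Υ(CF ∪ γ(P_e))
```

The then-branch is thus interpreted under a CF that carries Why-pids for every cell the branching condition inspected, which is how filter predicates end up as Why-provenance of the output.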
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Related Work</head><p>The most closely related work builds upon provenance propagation through query transformation on the algebraic layer. For example, there is the Provenance Semirings approach <ref type="bibr" target="#b7">[7]</ref> as well as the PERM system <ref type="bibr" target="#b6">[6]</ref>. In more recent work, both of them were extended to support aggregations and subqueries, respectively. These algebraic approaches are all limited in their expressiveness, and extending the set of supported algebraic operators is non-trivial.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSIONS</head><p>The approach presented in this article pushes the boundaries of provenance analysis for SQL queries. Our prototype can analyse queries with advanced, contemporary SQL language features. Due to Turing-completeness, this approach can deal with any (non-updating) query translated into imperative code.</p><p>It is part of our future work to run this approach in the environment of a full-fledged DBMS. In parallel, we pursue the derivation of How-provenance <ref type="bibr" target="#b3">[3]</ref>, i.e., associating each computed provenance relation with the SQL clauses accountable for its existence.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: FSM.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Overview of the two-step analysis.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>2 (</head><label>2</label><figDesc>c) Parsing trace.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Advanced query example and provenance markers.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The translated and instrumented SQL query.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Log contents.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Data structures used in provenance computation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head></head><label></label><figDesc>Foreach-False ¬popf (log cf ) CF ; Γ ⊢ foreach ... od − − ⤇ Γ Foreach-True popf (log cf ) CF ; Γ ⊢ e ⤇ P e P el = (P e )[popf (log ix )] Γ for = Γ + {v ↦ ⟨γ(P e ) ∪ γ(P el ), (P el )⟩} CF ; Γ for ⊢ ss ; foreach v in e do ss od − − ⤇ Γ res CF ; Γ ⊢ foreach v in e do ss od − − ⤇ Γ res Append CF ; Γ ⊢ e ⤇ P e P v = Γ[v] P = ⟨γ(P v ), (P v ) + {popf (log ix ) ↦ P e }⟩ Γ res = Γ + {v ↦ P } CF ; Γ ⊢ append(v, e) − − ⤇ Γ res GetVar P = Γ[v] P res = ⟨γ(P ) ∪ CF , (P )⟩ CF ; Γ ⊢ v ⤇ P res Lit-Str P res = ⟨{new ()} ∪ CF, ∅⟩ CF ; Γ ⊢ c ⤇ P res GetVar-Idx P = (Γ[v])[popf (log ix )] CF ; Γ ⊢ e ⤇ P e P res = ⟨γ(P ) ∪ CF ∪ γ(Γ[v]) ∪ Υ(γ(P e )), (P )⟩ CF ; Γ ⊢ v[e] ⤇ P res Lit-Dict |CF ; Γ ⊢ e i ⤇ P i | i=0...n P res = ⟨{new ()} ∪ CF , {| i ↦ P i | i=0...n }⟩ CF ; Γ ⊢ { 0 :e 0 , . . . , n :e n } ⤇ P res Lit-List |CF ; Γ ⊢ e i ⤇ P i | i=0...n P res = ⟨{new ()} ∪ CF , {|i ↦ P i | i=0...n }⟩ CF ; Γ ⊢ [e 0 , . . . , e n ] ⤇ P res BinOp CF ; Γ ⊢ e 1 ⤇ P 1 CF ; Γ ⊢ e 2 ⤇ P 2 P res = ⟨γ(P 1 ) ∪ γ(P 2 ), ∅⟩ CF ; Γ ⊢ e 1 ⊛ e 2 ⤇ P res</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Inference rules for data provenance.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Resulting environment Γ after static analysis. Pids of non-input values (&gt; 10 e|y ) were dropped.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>table cell t 4 : C 6 H 12 O 6 . According to the SQL query, two input columns are accessed: compound is used to decide if a tuple gets filtered or not. If a tuple qualifies, the value sitting in formula is copied over into the result table. Our provenance analysis accordingly finds the result being why-dependent on tuple t 2 : glucose and being where-dependent on t 2 : C 6 H 12 O 6 .</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head></head><label></label><figDesc>last false indicates that the foreach loop has exited. Similarly, an if-statement can decide between then (yields true) or else (yields false).</figDesc><table><row><cell>log cf</cell><cell>log ix</cell></row><row><cell>⟨ true,</cell><cell>⟨ 0,</cell></row><row><cell>false,</cell><cell>"compound",</cell></row><row><cell>true,</cell><cell>1,</cell></row><row><cell>true,</cell><cell>"compound",</cell></row><row><cell>true,</cell><cell>"formula",</cell></row><row><cell>false,</cell><cell>0,</cell></row><row><cell>false ⟩</cell><cell>2,</cell></row><row><cell></cell><cell>"compound" ⟩</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>P</head><label></label><figDesc>∶= ⟨c, e⟩ c ∶= {pid 1 , ..., pid n } pid ∈ {1 e , 1 y , 2 e , 2 y , 3 e , 3 y , ...} e ∶= {l 1 ↦ P 1 , ..., l n ↦ P n }</figDesc><table><row><cell>l ∶= any identifier</cell></row><row><cell>γ(P ) ∶= c</cell></row><row><cell>(P ) ∶= e</cell></row></table><note>Υ(pids) ∶= {pid y ∶ pid e|y ∈ pids}</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>2</head><label></label><figDesc>Analogous to If-True: ss 2 is interpreted. data ∶ ⟨{ 10 e }, {0 ↦ ⟨{3 e }, "compound" ↦ ⟨{1 e }, ∅⟩, "formula" ↦ ⟨{2 e }, ∅⟩⟩, 1 ↦ ⟨{ 6 e }, "compound" ↦ ⟨{ 4 e }, ∅⟩, "formula" ↦ ⟨{ 5 e }, ∅⟩⟩, 2 ↦ ⟨{9 e }, "compound" ↦ ⟨{7 e }, ∅⟩, "formula" ↦ ⟨{8 e }, ∅⟩⟩}⟩ row ∶ ⟨{9 e , 10 e }, {"compound" ↦ ⟨{7 e }, ∅⟩, "formula" ↦ ⟨{8 e }, ∅⟩}⟩ c ∶ ⟨{7 e , 9 e , 10 y }, ∅⟩ t ∶ ⟨{4 y , 6 y , 10 y }, {"formula" ↦ ⟨{5 e , 4 y , 6 y , 10 y }, ∅⟩}⟩ res ∶ ⟨∅, 0 ↦ ⟨{4 y , 6 y , 10 y }, {"formula" ↦ ⟨{ 5 e , 4 y , 6 y , 10 y }, ∅⟩}⟩⟩</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">As part of our future work, we seek to modify an existing database system and let it run the dynamic analysis simultaneously with query execution.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Why and Where: A Characterization of Data Provenance</title>
		<author>
			<persName><forename type="first">P</forename><surname>Buneman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khanna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-C</forename><surname>Tan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDT</title>
				<meeting>ICDT</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Program Slicing and Data Provenance</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cheney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Data Engineering Bulletin</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="22" to="28" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Provenance in Databases: Why, How, and Where</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cheney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chiticariu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-C</forename><surname>Tan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Foundations and Trends in Databases</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Tracing the Lineage of View Data in a Warehousing Environment</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Widom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wiener</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM TODS</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The best bang for your bu(ck)g</title>
		<author>
			<persName><forename type="first">B</forename><surname>Dietrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Grust</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. EDBT</title>
				<meeting>EDBT</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Perm: Processing Provenance and Data on the Same Data Model Through Query Rewriting</title>
		<author>
			<persName><forename type="first">B</forename><surname>Glavic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Alonso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDE</title>
				<meeting>ICDE</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Provenance Semirings</title>
		<author>
			<persName><forename type="first">T</forename><surname>Green</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Karvounarakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tannen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. PODS</title>
				<meeting>PODS</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Provenance for SQL Based on Abstract Interpretation: Value-less, but Worthwhile</title>
		<author>
			<persName><forename type="first">T</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Grust</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. VLDB</title>
				<meeting>VLDB<address><addrLine>Hawaii, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Efficiently Compiling Efficient Query Plans for Modern Hardware</title>
		<author>
			<persName><forename type="first">T</forename><surname>Neumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. VLDB</title>
				<meeting>VLDB</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Program Slicing</title>
		<author>
			<persName><forename type="first">M</forename><surname>Weiser</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Software Engineering</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="1984">1984</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
