<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>PhD Workshop, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Have Your Cake and Eat it, Too: Data Provenance for Turing-Complete SQL Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tobias Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(Advisor: Torsten Grust) University of Tübingen, Tübingen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>9</volume>
      <issue>2016</issue>
      <abstract>
        <p>We report on our work on the computation of data provenance for feature-rich SQL. Among further constructs, our prototype supports correlated subqueries, aggregations, recursive queries, and window functions. Our analysis approach completely sidesteps relational algebra and instead requires a translation of the input query into an imperative-style program. Provided that the target language is Turing-complete, any SQL query can be covered. We employ a new variant of program analysis which consists of a dynamic and a static part. This two-step approach enables us to dodge limitations that a Turing-complete computation model otherwise entails for program analyses. The derived data provenance directly reflects the data provenance of the original SQL query.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Data provenance [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ] is metadata, primarily about the
origin of a certain piece of data. Everyday examples of
desirable provenance information are the From: header field
of an email or the citations in academic papers. In these two
cases, the provenance is trivial and should not require any clever
algorithms for its computation.
      </p>
      <p>
        However, in the context of real-world relational database
systems, there is a deficiency in provenance
computation for contemporary implementations of SQL. SQL,
the standard relational query language, supports
advanced language constructs such as recursive queries
and window functions. Further, queries can be nested,
for example, through (correlated) subqueries. These
features make writing queries convenient, but they also make the
data provenance of query results non-trivial in the general
case. Concrete scenarios in which data provenance for SQL
has proved relevant are the view update/maintenance
problem [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], data warehouses [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and debugging purposes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The analysis approach we are going to describe is capable of
computing the data provenance for any non-updating SQL
query.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.1 Provenance Model</title>
      <p>
        We adopt the basic distinction between Where- and
Why-provenance as originally introduced by Buneman and Tan [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
• Where-provenance: where did a certain data piece
originate? Exactly which table cells were copied or
transformed to yield an output cell?
• Why-provenance: why is a certain data piece in the
result? Which input table cells were inspected to
decide about the existence or contents of an output cell?
      </p>
    </sec>
    <sec id="sec-3">
      <title>1.2 Basic Example</title>
      <p>Figure 1 shows an intentionally simple SQL query and
corresponding example tables. Mouse pointer 1 represents
an inquiry for the data provenance of table cell t4: C6H12O6 .</p>
      <p>According to the SQL query, two input columns are
accessed: compound is used to decide if a tuple gets filtered
or not. If a tuple qualifies, its value sitting in formula is
copied over into the result table. Our provenance
analysis accordingly finds the result to be why-dependent on
tuple t2: glucose and where-dependent on t2: C6H12O6 .</p>
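      <p>The exact query text is only given in Figure 1 (not reproduced here); the following sketch executes a plausible reconstruction of it with Python's sqlite3 module, so the filter/copy behavior described above can be observed directly. Table contents are taken from the example; the SQL text is our assumption.</p>
```python
import sqlite3

# Set up the example table compounds of Figure 1(a) in an in-memory DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compounds (compound TEXT, formula TEXT)")
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?)",
    [("citrate", "C6H5O73-"), ("glucose", "C6H12O6"), ("hydronium", "H3O+")],
)

# Plausible reconstruction of the Figure 1(b) query: compound decides
# the filter (why-provenance); formula is copied into the result
# (where-provenance).
rows = conn.execute(
    "SELECT formula FROM compounds WHERE compound = 'glucose'"
).fetchall()
print(rows)  # [('C6H12O6',)]
```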
      <p>In the following sections, we revisit this example and
illustrate how this outcome is actually computed using our
program analysis.</p>
    </sec>
    <sec id="sec-4">
      <title>1.3 Advanced Example</title>
      <p>The provenance analysis of the query found in Figure 2(b)
is a unique feature of our approach: to the best of the
author's knowledge, only our analysis approach can deal with
recursive SQL queries.</p>
      <p>The query syntax-checks molecular formulae. Technically,
the finite state machine depicted in Figure 3 is executed.
The encoded FSM and input formulae can be found in
Figure 2(a).</p>
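      <p>The paper's exact FSM is encoded in Figure 2(a) and depicted in Figure 3, neither of which is reproduced here. As a minimal sketch of the same idea, the following table-driven runner syntax-checks formulae; the transition table, state names, and accepting states are illustrative reconstructions, not the paper's FSM.</p>
```python
def char_class(ch):
    """Map a character to the class labels used on the FSM edges."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch in "+-":
        return "sign"
    return None

# Illustrative transition table: element symbols with optional counts,
# optionally ending in a charge sign (our reconstruction, not Figure 3).
trans = {
    ("s0", "letter"): "s1",
    ("s1", "letter"): "s1",
    ("s1", "digit"): "s1",
    ("s1", "sign"): "s3",
}

def accepts(formula, start="s0", final={"s1", "s3"}):
    """Run the FSM over the formula; reject on a missing transition."""
    state = start
    for ch in formula:
        state = trans.get((state, char_class(ch)))
        if state is None:
            return False
    return state in final

print(accepts("C6H12O6"))  # True
```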
      <p>We inspect result cell C6H5O73- (mouse pointer 2 ). The
according highlights within the compounds table ( citrate
and C6H5O73- ) show the same pattern as in the basic
example of Section 1.2.</p>
      <p>[Figure 1 residue: table compounds with tuples t1 citrate /
C6H5O73-, t2 glucose / C6H12O6, t3 hydronium / H3O+;
caption "(b) Query and result."]</p>
      <p>More interesting markers can be found within table fsm.
The highlighted cells inside t8 and t9 indicate which state
changes were triggered while parsing the first letters of the
formula.</p>
      <p>[Figure 3 residue: FSM with states s0 to s3 and transitions
labeled [A-Za-z0-9], [A-Za-z], [0-9], and [+-].]</p>
    </sec>
    <sec id="sec-5">
      <title>2. SQL COMPILATION</title>
      <p>As part of our future work, we seek to modify an existing
database system and let it run the dynamic analysis
simultaneously with query execution.</p>
      <p>[Figure 4: analysis pipeline. Translate SQL into
Imperative Code, then run the Provenance Analysis.]</p>
      <p>The table compounds of Figure 1(a) is represented as
a data structure (a list of dictionaries). The algorithm iterates
over the input table (line 3) and, if a tuple qualifies
(line 5), appends its formula to the result (line 7).</p>
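      <p>The translated listing itself appears in Figure 5 and is not reproduced here; a minimal Python analogue of the described algorithm, with input data and query algorithm combined into one program, might look like this:</p>
```python
# The database instance, encoded as a list of dictionaries
# (the data layout described in the text; tuple values from Figure 1(a)).
compounds = [
    {"compound": "citrate", "formula": "C6H5O73-"},
    {"compound": "glucose", "formula": "C6H12O6"},
    {"compound": "hydronium", "formula": "H3O+"},
]

result = []
for row in compounds:                 # iterate over the input table
    if row["compound"] == "glucose":  # the WHERE-style filter predicate
        # copy the qualifying formula into the result table
        result.append({"formula": row["formula"]})

print(result)  # [{'formula': 'C6H12O6'}]
```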
      <p>Please note that we have combined the input data (i.e., the
database instance) and the computation algorithm into one program.
Ordinarily, the two are kept separate (refer to
Figure 4).</p>
    </sec>
    <sec id="sec-5b">
      <title>3. PROVENANCE ANALYSIS</title>
      <p>Before we get to the details of our approach, we shed some
light on the theoretical limits of program analysis and the
dilemma that arises. Rice's theorem is a result of
computability theory. Cast informally, the theorem states that in
the Turing-complete computation model, only trivial
questions about the behavior of a program can be answered.
A sample trivial question would be: how many lines does
the program have? However, non-trivial properties of a program
(such as data provenance) can only be addressed if the
program is actually executed.</p>
      <p>This gives rise to the following dilemma: to embrace a
rich SQL dialect, we want to be Turing-complete (i.e., able to
compute anything). For program analysis, however, we
want to avoid Turing completeness and the implications
formulated in Rice's theorem. The approach illustrated
next allows us to have our cake and eat it, too: we
stay in the Turing-complete computation model at
runtime and switch into a weaker computation model for
the provenance analysis.</p>
      <p>[Table residue: compounds with tuples t5 citrate,
t6 glucose, t7 hydronium.]</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Two-Step Program Analysis</title>
      <p>To make this switch possible, we run consecutive dynamic
and static analyses (compare Figure 4).</p>
      <p>During dynamic analysis, the behavior (not: the result) of
certain program statements is recorded in logs. For example, an
if-statement can branch into the then- or the else-block;
we record this (binary) decision. During static analysis, this
makes the behavior of the if-statement predetermined: the
if no longer actively contributes to the computation
and can be replaced by the corresponding then- or else-branch.</p>
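      <p>The record &amp; replace discipline for a single if-statement can be sketched as follows; the helper names (log_cf, record_if) are ours, chosen for illustration only:</p>
```python
log_cf = []  # control-flow log, appended to during the dynamic run

def record_if(cond, then_fn, else_fn):
    """Dynamic analysis: log the branch decision, then behave normally."""
    log_cf.append(bool(cond))
    return then_fn() if cond else else_fn()

# -- dynamic run: the if actively decides --
x = 5
record_if(x > 3, lambda: "big", lambda: "small")

# -- static replay: the decision is now predetermined by the log,
# so the if can be replaced by the taken branch --
decision = log_cf.pop(0)
branch = "then" if decision else "else"
print(branch)  # then
```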
      <p>When we apply this record &amp; replace discipline to a
relevant subset of a program’s statements, we obtain an equivalent
form of the original program that computes the same result.
Now, however, the computation model has been simplified and
is open to an exhaustive program analysis. In the
remainder of this section, we explain the two analysis steps
in detail.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Dynamic Analysis</title>
      <p>As motivated above, we aim to record the behavior of
program statements during runtime. The following two logs
are appended to:
• logcf (control flow): which/how often does a certain
code branch get executed by if and foreach?
• logix (indices): at which locations are elements inside
lists/dictionaries accessed?</p>
      <p>During runtime, these properties are available and can
easily be recorded. We use the technique of code
instrumentation to create the two logs.</p>
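      <p>The actual instrumented listing is Figure 5; the following sketch mirrors its scheme with our own helper (put is modeled after the paper's put-function, but the exact call sites are our assumption):</p>
```python
logs = {"cf": [], "ix": []}  # the two logs of the dynamic analysis

def put(log, value):
    """Append a logged value to the named log (modeled after Figure 5)."""
    logs[log].append(value)

compounds = [{"compound": "citrate"}, {"compound": "glucose"}]
result = []
for i, row in enumerate(compounds):
    put("cf", True)            # loop body executes (another iteration)
    put("ix", i)               # numeric index implicitly used by foreach
    put("ix", "compound")      # dictionary key access
    if row["compound"] == "glucose":
        put("cf", True)        # then-branch taken
        result.append(row["compound"])
    else:
        put("cf", False)       # else-branch taken
put("cf", False)               # loop terminated

print(logs["cf"])  # [True, False, True, True, False]
```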
      <p>For an instrumented example, see Figure 5. The
instrumentation instructions are placed on the right-hand side of
the listing. The first argument of the put-function is the
type of log we want to append to; its second argument is
the actual value being logged. Figure 6 lists the resulting
logs. These are written (and read) sequentially and do not
need any further metadata, which keeps the logs small.</p>
      <p>The logged data items are to be interpreted in the context
of the (uninstrumented) source code. For example, the first
entry of logcf corresponds to the first control flow decision
in the program at line 3. The foreach loop opened there
can either execute its body (another time) or terminate and
continue at the statement after line 11. We encode these
decisions using Boolean values. The first true found in the
log indicates that the body has been executed. The last
false indicates that the foreach loop has exited. Similarly,
an if-statement can decide between then (yields true) or
else (yields false).</p>
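      <p>Replaying such a log during static analysis can be sketched as follows: the foreach no longer decides how often to loop, the log does (log contents chosen to match the encoding just described):</p>
```python
# A logged control-flow sequence: two body executions, then loop exit.
log_cf = [True, True, False]

iterations = 0
while log_cf.pop(0):  # popf-style: read and remove the first entry
    iterations += 1   # replay one loop body execution

print(iterations)  # 2
```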
      <p>List/dictionary element accesses get logged in logix. Note
that foreach and append implicitly use numeric indices to
read/write from/into lists and need to be included. The
idxOf() function retrieves the ordinal position of a list
element.</p>
      <p>[Figure 6: log contents. logcf: true; false; true; true;
true; false; false. logix: 0; "compound"; 1; "compound";
"formula"; 0; 2; "compound".]</p>
    </sec>
    <sec id="sec-7b">
      <title>3.3 Static Analysis</title>
      <p>Our static analysis performs an abstract (value-less)
interpretation of the uninstrumented source code. Instead of
computing values, all input values are replaced by unique
numeric identifiers. These pids are propagated during
program interpretation and successively create a variable
environment containing the data provenance information. Based
on the basic query example of Figure 1 we present a
simplified subset of our provenance derivation algorithm.</p>
      <p>Figure 8 shows provenance inference rules denoted in
operational semantics. The top Statements rule is the entry
point for the interpretation. It takes the first statement s out
of all statements ss to be interpreted. In general,
interpretation of statements is triggered by the evaluation relation and leads
to an update of the current variable environment. The
CF symbol represents the current data provenance for the
control flow. The idea behind this is that reaching a certain
code section depends on a number of branching decisions
carried out by if/else statements. The dependencies for
these decisions are collected in CF and propagated during
program interpretation.</p>
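      <p>The intended effect of CF can be sketched concretely: values produced inside a taken branch pick up the pids of the condition that guarded the branch. The pid numbers below are chosen to match the basic example's outcome (4 for the inspected compound cell, 5 for the copied formula cell); the encoding as plain sets is our simplification:</p>
```python
CF = set()  # provenance of all branching decisions on the current path

# Interpreting the if over compounds[1]["compound"] extends CF with the
# pids of the condition (a Why-dependency on that cell).
cond_pids = {4}
CF = CF | cond_pids

# The value appended under that decision carries its own Where-pid
# plus everything collected in CF.
value_pids = {5}
result_pids = value_pids | CF

print(sorted(result_pids))  # [4, 5]
```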
      <p>The numeric ids which represent a data provenance
relationship are defined in Figure 7. There are the two kinds
pid e and pid y which stand for Where- and Why-provenance,
respectively. During analysis, these ids are created by the
new -function (for an example, see the Lit-Str rule).
Initially, all pids are of the Where-type because any pid e
represents a certain value and a location of origin. During
interpretation, they may be converted into the Why-type by a
conversion function (cf. Figure 7).</p>
      <p>The main data structure is P . It can represent any value
of any type of our programming language. Its second
component e is used for container types (i.e., lists/dictionaries) to
store the contained elements. The first component c is used for
both containers and atomic values (e.g., strings); it
represents the provenance of the value itself. The logs logcf
and logix are read by the inference rules; see rules If-True
and If-False, for example. The popf-function reads and
removes the first element of the corresponding log.</p>
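      <p>One direct reading of this structure in code (field and helper names are ours; the authoritative definition is Figure 7):</p>
```python
from dataclasses import dataclass, field

@dataclass
class P:
    """Provenance of a value: c for the value itself, e for elements."""
    c: set = field(default_factory=set)    # pids of the value itself
    e: dict = field(default_factory=dict)  # label -> P of contained elements

def popf(log):
    """Read and remove the first element of a log."""
    return log.pop(0)

# A row whose "formula" cell carries pid 5 and whose own existence
# carries pid 4.
cell = P(c={5})
row = P(c={4}, e={"formula": cell})
print(row.e["formula"].c)  # {5}
```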
      <p>The inference rules presented in Figure 8 are suitable to
compute the data provenance of the basic query and
finally yield the environment shown in Figure 9.</p>
      <p>[Figure 7 residue: definition of the provenance data
structure P with components c and e, pid sets, and the
conversion of Where-pids into Why-pids.]</p>
      <p>[Figure 8 residue: provenance inference rules in operational
semantics, including Statements, PutVar, Foreach, Append,
If-True, If-False, Lit-Dict, Lit-List, and BinOp.]</p>
      <p>
        As the main result, we find four provenance relationships located
in res[0]["formula"]. The highlighted pids 5e (relating
to t2: C6H12O6 ) and 4y (relating to t2: glucose )
constitute the data provenance visualized in Figure 1. 6y and
10y do not correspond to table cells and may be ignored.
We already presented a visualization prototype in a recent
demo paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>3.4 Related Work</title>
      <p>
        The strongest group of related work builds upon
provenance propagation through query transformation on the
algebraic layer. For example, there is the Provenance
Semirings approach [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as well as the PERM system [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In more
recent work, both of them were extended to support
aggregations and subqueries, respectively.
      </p>
      <p>[Figure 9 residue: resulting variable environment; the
"compound" and "formula" cells of the tuples map to pids
such as 4e, 5e, 7e, and 8e.]</p>
      <p>The aforementioned algebraic approaches are all limited in
their expressiveness and extending the number of supported
algebraic operators is non-trivial.</p>
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSIONS</title>
      <p>The approach presented in this article pushes the
boundaries of provenance analysis for SQL queries. Our
prototype can analyse queries with advanced, contemporary SQL
language features. Thanks to Turing completeness, the approach
can deal with any (non-updating) query translated into
imperative code.</p>
      <p>
        It is part of our future work to run this approach in
the environment of a full-fledged DBMS. In parallel, we
pursue the derivation of How-provenance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], i.e., associating each
computed provenance relation with the SQL
clauses accountable for its existence.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Why and Where: A Characterization of Data Provenance</article-title>
          .
          <source>In Proc. ICDT</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          .
          <article-title>Program Slicing and Data Provenance</article-title>
          .
          <source>IEEE Data Engineering Bulletin</source>
          ,
          <volume>30</volume>
          (
          <issue>4</issue>
          ):
          <fpage>22</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Provenance in Databases: Why, How, and Where</article-title>
          .
          <source>Foundations and Trends in Databases</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiener</surname>
          </string-name>
          .
          <article-title>Tracing the Lineage of View Data in a Warehousing Environment</article-title>
          .
          <source>ACM TODS</source>
          ,
          <volume>25</volume>
          (
          <issue>2</issue>
          ),
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dietrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Müller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Grust</surname>
          </string-name>
          .
          <article-title>The best bang for your bu(ck)g</article-title>
          .
          <source>In Proc. EDBT</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Glavic</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Alonso</surname>
          </string-name>
          .
          <article-title>Perm: Processing Provenance and Data on the Same Data Model Through Query Rewriting</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karvounarakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          .
          <article-title>Provenance Semirings</article-title>
          .
          <source>In Proc. PODS</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Müller</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Grust</surname>
          </string-name>
          .
          <article-title>Provenance for SQL Based on Abstract Interpretation: Value-less, but Worthwhile</article-title>
          .
          <source>In Proc. VLDB</source>
          , Hawaii, USA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          .
          <article-title>Efficiently Compiling Efficient Query Plans for Modern Hardware</article-title>
          .
          <source>In Proc. VLDB</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Weiser</surname>
          </string-name>
          .
          <article-title>Program Slicing</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          , SE-
          <volume>10</volume>
          (
          <issue>4</issue>
          ),
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>