<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>PhD Workshop, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Have Your Cake and Eat it, Too: Data Provenance for Turing-Complete SQL Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tobias Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(Advisor: Torsten Grust) University of Tübingen, Tübingen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>9</volume>
      <issue>2016</issue>
      <abstract>
        <p>We report on our work on the computation of data provenance for feature-rich SQL. Among further constructs, our prototype supports correlated subqueries, aggregations, recursive queries, and window functions. Our analysis approach completely sidesteps relational algebra and instead requires a translation of the input query into an imperative-style program. Provided that the target language is Turing-complete, any SQL query can be covered. We employ a new variant of program analysis which consists of a dynamic and a static part. This two-step approach enables us to dodge limitations that a Turing-complete computation model otherwise entails for program analyses. The derived data provenance directly reflects the data provenance of the original SQL query.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Data provenance [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ] is metadata, primarily about the
origin of a certain piece of data. Everyday examples of
desirable provenance information are the From: header field
of an email or the citations in academic papers. In these two
cases, the provenance is trivial and should not require any clever
algorithms for its computation.
      </p>
      <p>
        However, in the context of real-world relational database
systems, there is a deficiency in provenance
computation for contemporary implementations of SQL. SQL,
the standard relational query language, supports
advanced language constructs such as recursive queries
and window functions. Further, queries can be nested,
for example, through (correlated) subqueries. These
features make writing queries convenient, but they also make the
data provenance of query results non-trivial in the general
case. Concrete scenarios in which data provenance for SQL
has proved relevant are the view update/maintenance
problem [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], data warehouses [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and debugging purposes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The analysis approach we are going to describe is capable of
computing the data provenance for any non-updating SQL
query.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.1 Provenance Model</title>
      <p>
        We adopt the basic distinction between Where- and
Why-provenance as originally introduced by Buneman and Tan [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
• Where-provenance: where did a certain data piece
originate? Exactly which table cells were copied or
transformed to yield an output cell?
• Why-provenance: why is a certain data piece in the
result? Which input table cells were inspected to
decide about the existence or contents of an output cell?
      </p>
    </sec>
    <sec id="sec-3">
      <title>1.2 Basic Example</title>
      <p>Figure 1 shows an intentionally simple SQL query and
corresponding example tables. Mouse pointer 1 represents
an inquiry for the data provenance of table cell t4: C6H12O6 .</p>
      <p>According to the SQL query, two input columns are
accessed: compound is used to decide if a tuple gets filtered
or not. If a tuple qualifies, its value sitting in formula is
copied over into the result table. Our provenance
analysis accordingly finds the result to be why-dependent on
tuple t2: glucose and where-dependent on t2: C6H12O6 .</p>
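      <p>The exact query text is only given in Figure 1 (not reproduced here); the following sketch executes a plausible reconstruction of it with Python's sqlite3 module, so the filter/copy behavior described above can be observed directly. Table contents are taken from the example; the SQL text is our assumption.</p>
```python
import sqlite3

# Set up the example table compounds of Figure 1(a) in an in-memory DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compounds (compound TEXT, formula TEXT)")
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?)",
    [("citrate", "C6H5O73-"), ("glucose", "C6H12O6"), ("hydronium", "H3O+")],
)

# Plausible reconstruction of the Figure 1(b) query: compound decides
# the filter (why-provenance); formula is copied into the result
# (where-provenance).
rows = conn.execute(
    "SELECT formula FROM compounds WHERE compound = 'glucose'"
).fetchall()
print(rows)  # [('C6H12O6',)]
```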
      <p>In the following sections, we revisit this example and
illustrate how this outcome is actually computed using our
program analysis.</p>
    </sec>
    <sec id="sec-4">
      <title>1.3 Advanced Example</title>
      <p>The provenance analysis of the query found in Figure 2(b)
is a unique feature of our approach: to the best of the
author's knowledge, only our analysis approach can deal with
recursive SQL queries.</p>
      <p>The query syntax-checks molecular formulae. Technically,
the finite state machine depicted in Figure 3 is executed.
The encoded FSM and input formulae can be found in
Figure 2(a).</p>
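      <p>The paper's exact FSM is encoded in Figure 2(a) and depicted in Figure 3, neither of which is reproduced here. As a minimal sketch of the same idea, the following table-driven runner syntax-checks formulae; the transition table, state names, and accepting states are illustrative reconstructions, not the paper's FSM.</p>
```python
def char_class(ch):
    """Map a character to the class labels used on the FSM edges."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch in "+-":
        return "sign"
    return None

# Illustrative transition table: element symbols with optional counts,
# optionally ending in a charge sign (our reconstruction, not Figure 3).
trans = {
    ("s0", "letter"): "s1",
    ("s1", "letter"): "s1",
    ("s1", "digit"): "s1",
    ("s1", "sign"): "s3",
}

def accepts(formula, start="s0", final={"s1", "s3"}):
    """Run the FSM over the formula; reject on a missing transition."""
    state = start
    for ch in formula:
        state = trans.get((state, char_class(ch)))
        if state is None:
            return False
    return state in final

print(accepts("C6H12O6"))  # True
```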
      <p>We inspect result cell C6H5O73- (mouse pointer 2 ). The
according highlights within the compounds table ( citrate
and C6H5O73- ) show the same pattern as in the basic
example of Section 1.2.</p>
      <p>[Figure 1 residue: table compounds with tuples t1 citrate /
C6H5O73-, t2 glucose / C6H12O6, t3 hydronium / H3O+;
caption "(b) Query and result."]</p>
      <p>More interesting markers can be found within table fsm.
The highlighted cells inside t8 and t9 indicate which state
changes were triggered while parsing the first letters of the
formula.</p>
      <p>[Figure 3 residue: FSM with states s0 to s3 and transitions
labeled [A-Za-z0-9], [A-Za-z], [0-9], and [+-].]</p>
    </sec>
    <sec id="sec-5">
      <title>2. SQL COMPILATION</title>
      <p>As part of our future work, we seek to modify an existing
database system and let it run the dynamic analysis
simultaneously with query execution.</p>
      <p>[Figure 4: analysis pipeline. Translate SQL into
Imperative Code, then run the Provenance Analysis.]</p>
      <p>The table compounds of Figure 1(a) is represented as
a data structure (a list of dictionaries). The algorithm iterates
over the input table (line 3) and, if a tuple qualifies
(line 5), appends its formula to the result (line 7).</p>
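      <p>The translated listing itself appears in Figure 5 and is not reproduced here; a minimal Python analogue of the described algorithm, with input data and query algorithm combined into one program, might look like this:</p>
```python
# The database instance, encoded as a list of dictionaries
# (the data layout described in the text; tuple values from Figure 1(a)).
compounds = [
    {"compound": "citrate", "formula": "C6H5O73-"},
    {"compound": "glucose", "formula": "C6H12O6"},
    {"compound": "hydronium", "formula": "H3O+"},
]

result = []
for row in compounds:                 # iterate over the input table
    if row["compound"] == "glucose":  # the WHERE-style filter predicate
        # copy the qualifying formula into the result table
        result.append({"formula": row["formula"]})

print(result)  # [{'formula': 'C6H12O6'}]
```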
      <p>Please note that we have combined the input data (i.e., the
database instance) and the computation algorithm into one program.
Ordinarily, the two are kept separate (refer to
Figure 4).</p>
    </sec>
    <sec id="sec-5b">
      <title>3. PROVENANCE ANALYSIS</title>
      <p>Before we get to the details of our approach, we shed some
light on the theoretical limits of program analysis and the
dilemma that arises. Rice's theorem is a result of
computability theory. Cast informally, the theorem states that in
the Turing-complete computation model, only trivial
questions about the behavior of a program can be answered.
A sample trivial question would be: how many lines does
the program have? However, non-trivial properties of a program
(such as data provenance) can only be addressed if the
program is actually executed.</p>
      <p>This gives rise to the following dilemma: to embrace a
rich SQL dialect, we want to be Turing-complete (i.e., able to
compute anything). For program analysis, however, we
want to avoid Turing completeness and the implications
formulated in Rice's theorem. The approach illustrated
next allows us to have our cake and eat it, too: we
stay in the Turing-complete computation model at
runtime and switch into a weaker computation model for
the provenance analysis.</p>
      <p>[Table residue: compounds with tuples t5 citrate,
t6 glucose, t7 hydronium.]</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Two-Step Program Analysis</title>
      <p>To make this switch possible, we run consecutive dynamic
and static analyses (compare Figure 4).</p>
      <p>During dynamic analysis, the behavior (not: the result) of
certain program statements is recorded in logs. For example, an
if-statement can branch into the then- or the else-block;
we record this (binary) decision. During static analysis, this
makes the behavior of the if-statement predetermined: the
if no longer actively contributes to the computation
and can be replaced by the corresponding then- or else-branch.</p>
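      <p>The record &amp; replace discipline for a single if-statement can be sketched as follows; the helper names (log_cf, record_if) are ours, chosen for illustration only:</p>
```python
log_cf = []  # control-flow log, appended to during the dynamic run

def record_if(cond, then_fn, else_fn):
    """Dynamic analysis: log the branch decision, then behave normally."""
    log_cf.append(bool(cond))
    return then_fn() if cond else else_fn()

# -- dynamic run: the if actively decides --
x = 5
record_if(x > 3, lambda: "big", lambda: "small")

# -- static replay: the decision is now predetermined by the log,
# so the if can be replaced by the taken branch --
decision = log_cf.pop(0)
branch = "then" if decision else "else"
print(branch)  # then
```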
      <p>When we apply this record &amp; replace discipline to a
relevant subset of a program’s statements, we obtain an equivalent
form of the original program that computes the same result.
Now, however, the computation model has been simplified and
is open to an exhaustive program analysis. In the
remainder of this section, we explain the two analysis steps
in detail.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Dynamic Analysis</title>
      <p>As motivated above, we aim to record the behavior of
program statements during runtime. The following two logs
are appended to:
• logcf (control flow): which/how often does a certain
code branch get executed by if and foreach?
• logix (indices): at which locations are elements inside
lists/dictionaries accessed?</p>
      <p>During runtime, these properties are available and can
easily be recorded. We use the technique of code
instrumentation to create the two logs.</p>
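      <p>The actual instrumented listing is Figure 5; the following sketch mirrors its scheme with our own helper (put is modeled after the paper's put-function, but the exact call sites are our assumption):</p>
```python
logs = {"cf": [], "ix": []}  # the two logs of the dynamic analysis

def put(log, value):
    """Append a logged value to the named log (modeled after Figure 5)."""
    logs[log].append(value)

compounds = [{"compound": "citrate"}, {"compound": "glucose"}]
result = []
for i, row in enumerate(compounds):
    put("cf", True)            # loop body executes (another iteration)
    put("ix", i)               # numeric index implicitly used by foreach
    put("ix", "compound")      # dictionary key access
    if row["compound"] == "glucose":
        put("cf", True)        # then-branch taken
        result.append(row["compound"])
    else:
        put("cf", False)       # else-branch taken
put("cf", False)               # loop terminated

print(logs["cf"])  # [True, False, True, True, False]
```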
      <p>For an instrumented example, see Figure 5. The
instrumentation instructions are placed on the right-hand side of
the listing. The first argument of the put-function is the
type of log we want to append to; its second argument is
the actual value being logged. Figure 6 lists the resulting
logs. These are written (and read) sequentially and do not
need any further metadata, which keeps the logs small.</p>
      <p>The logged data items are to be interpreted in the context
of the (uninstrumented) source code. For example, the first
entry of logcf corresponds to the first control flow decision
in the program at line 3. The foreach loop opened there
can either execute its body (another time) or terminate and
continue at the statement after line 11. We encode these
decisions using Boolean values. The first true found in the
log indicates that the body has been executed. The last
false indicates that the foreach loop has exited. Similarly,
an if-statement can decide between then (yields true) or
else (yields false).</p>
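      <p>Replaying such a log during static analysis can be sketched as follows: the foreach no longer decides how often to loop, the log does (log contents chosen to match the encoding just described):</p>
```python
# A logged control-flow sequence: two body executions, then loop exit.
log_cf = [True, True, False]

iterations = 0
while log_cf.pop(0):  # popf-style: read and remove the first entry
    iterations += 1   # replay one loop body execution

print(iterations)  # 2
```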
      <p>List/dictionary element accesses get logged in logix. Note
that foreach and append implicitly use numeric indices to
read/write from/into lists and need to be included. The
idxOf() function retrieves the ordinal position of a list
element.</p>
      <p>[Figure 6: log contents. logcf: true; false; true; true;
true; false; false. logix: 0; "compound"; 1; "compound";
"formula"; 0; 2; "compound".]</p>
    </sec>
    <sec id="sec-7b">
      <title>3.3 Static Analysis</title>
      <p>Our static analysis performs an abstract (value-less)
interpretation of the uninstrumented source code. Instead of
computing values, all input values are replaced by unique
numeric identifiers. These pids are propagated during
program interpretation and successively create a variable
environment containing the data provenance information. Based
on the basic query example of Figure 1 we present a
simplified subset of our provenance derivation algorithm.</p>
      <p>Figure 8 shows provenance inference rules denoted in
operational semantics. The top Statements rule is the entry
point for the interpretation. It takes the first statement s out
of all statements ss to be interpreted. In general,
interpretation of statements is triggered by the evaluation relation and leads
to an update of the current variable environment. The
CF symbol represents the current data provenance for the
control flow. The idea behind this is that reaching a certain
code section depends on a number of branching decisions
carried out by if/else statements. The dependencies for
these decisions are collected in CF and propagated during
program interpretation.</p>
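      <p>The intended effect of CF can be sketched concretely: values produced inside a taken branch pick up the pids of the condition that guarded the branch. The pid numbers below are chosen to match the basic example's outcome (4 for the inspected compound cell, 5 for the copied formula cell); the encoding as plain sets is our simplification:</p>
```python
CF = set()  # provenance of all branching decisions on the current path

# Interpreting the if over compounds[1]["compound"] extends CF with the
# pids of the condition (a Why-dependency on that cell).
cond_pids = {4}
CF = CF | cond_pids

# The value appended under that decision carries its own Where-pid
# plus everything collected in CF.
value_pids = {5}
result_pids = value_pids | CF

print(sorted(result_pids))  # [4, 5]
```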
      <p>The numeric ids which represent a data provenance
relationship are defined in Figure 7. There are the two kinds
pid e and pid y which stand for Where- and Why-provenance,
respectively. During analysis, these ids are created by the
new -function (for an example, see the Lit-Str rule).
Initially, all pids are of the Where-type because any pid e
represents a certain value and a location of origin. During
interpretation, they may be converted into the Why-type by a
conversion function (cf. Figure 7).</p>
      <p>The main data structure is P . It can represent any value
of any type of our programming language. Its second
component e is used for container types (i.e., lists/dictionaries) to
store the contained elements. The first component c is used for
both containers and atomic values (e.g., strings); it
represents the provenance of the value itself. The logs logcf
and logix are read by the inference rules; see rules If-True
and If-False, for example. The popf-function reads and
removes the first element of the corresponding log.</p>
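      <p>One direct reading of this structure in code (field and helper names are ours; the authoritative definition is Figure 7):</p>
```python
from dataclasses import dataclass, field

@dataclass
class P:
    """Provenance of a value: c for the value itself, e for elements."""
    c: set = field(default_factory=set)    # pids of the value itself
    e: dict = field(default_factory=dict)  # label -> P of contained elements

def popf(log):
    """Read and remove the first element of a log."""
    return log.pop(0)

# A row whose "formula" cell carries pid 5 and whose own existence
# carries pid 4.
cell = P(c={5})
row = P(c={4}, e={"formula": cell})
print(row.e["formula"].c)  # {5}
```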
      <p>The inference rules presented in Figure 8 are suitable to
compute the data provenance of the basic query and
finally yield the environment shown in Figure 9.</p>
      <p>[Figure 7 residue: definition of the provenance data
structure P with components c and e, pid sets, and the
conversion of Where-pids into Why-pids.]</p>
      <p>[Figure 8 residue: provenance inference rules in operational
semantics, including Statements, PutVar, Foreach, Append,
If-True, If-False, Lit-Dict, Lit-List, and BinOp.]</p>
      <p>
        As the main result, we find four provenance relationships located
in res[0]["formula"]. The highlighted pids 5e (relating
to t2: C6H12O6 ) and 4y (relating to t2: glucose )
constitute the data provenance visualized in Figure 1. 6y and
10y do not correspond to table cells and may be ignored.
We already presented a visualization prototype in a recent
demo paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>3.4 Related Work</title>
      <p>
        The strongest group of related work builds upon
provenance propagation through query transformation on the
algebraic layer. For example, there is the Provenance
Semirings approach [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as well as the PERM system [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In more
recent work, both of them were extended to support
aggregations and subqueries, respectively.
      </p>
      <p>[Figure 9 residue: resulting variable environment; the
"compound" and "formula" cells of the tuples map to pids
such as 4e, 5e, 7e, and 8e.]</p>
      <p>The aforementioned algebraic approaches are all limited in
their expressiveness and extending the number of supported
algebraic operators is non-trivial.</p>
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSIONS</title>
      <p>The approach presented in this article pushes the
boundaries of provenance analysis for SQL queries. Our
prototype can analyse queries with advanced, contemporary SQL
language features. Thanks to Turing completeness, the approach
can deal with any (non-updating) query translated into
imperative code.</p>
      <p>
        It is part of our future work to run this approach in
the environment of a full-fledged DBMS. In parallel, we
pursue the derivation of How-provenance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], i.e., associating each
computed provenance relation with the SQL
clauses accountable for its existence.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Why and Where: A Characterization of Data Provenance</article-title>
          .
          <source>In Proc. ICDT</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          .
          <article-title>Program Slicing and Data Provenance</article-title>
          .
          <source>IEEE Data Engineering Bulletin</source>
          ,
          <volume>30</volume>
          (
          <issue>4</issue>
          ):
          <fpage>22</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Provenance in Databases: Why, How, and Where</article-title>
          .
          <source>Foundations and Trends in Databases</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiener</surname>
          </string-name>
          .
          <article-title>Tracing the Lineage of View Data in a Warehousing Environment</article-title>
          .
          <source>ACM TODS</source>
          ,
          <volume>25</volume>
          (
          <issue>2</issue>
          ),
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dietrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Müller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Grust</surname>
          </string-name>
          .
          <article-title>The best bang for your bu(ck)g</article-title>
          .
          <source>In Proc. EDBT</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Glavic</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Alonso</surname>
          </string-name>
          .
          <article-title>Perm: Processing Provenance and Data on the Same Data Model Through Query Rewriting</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karvounarakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          .
          <article-title>Provenance Semirings</article-title>
          .
          <source>In Proc. PODS</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Müller</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Grust</surname>
          </string-name>
          .
          <article-title>Provenance for SQL Based on Abstract Interpretation: Value-less, but Worthwhile</article-title>
          .
          <source>In Proc. VLDB</source>
          , Hawaii, USA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          .
          <article-title>Efficiently Compiling Efficient Query Plans for Modern Hardware</article-title>
          .
          <source>In Proc. VLDB</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Weiser</surname>
          </string-name>
          .
          <article-title>Program Slicing</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          , SE-
          <volume>10</volume>
          (
          <issue>4</issue>
          ),
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>