<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Computing Provenance Using the Negated Chase</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Görres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supervised by Andreas Heuer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Rostock</institution>
          ,
          <addr-line>Universitätsplatz 1, 18055 Rostock</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Since different challenges of data processing are interconnected, we describe them in a unified manner using a classic algorithm of database theory: the Chase. Computing the origin of query results is one of the challenges considered in this research project. Previously, the Chase has been used to calculate why-provenance of simple conjunctive queries. However, applying the Chase to more realistic scenarios requires an extension of the algorithm, for example with negation. This work reveals opportunities for the extended Chase by calculating both why- and why-not provenance of conjunctive queries with negation.</p>
      </abstract>
      <kwd-group>
        <kwd>Chase</kwd>
        <kwd>negation</kwd>
        <kwd>why-provenance</kwd>
        <kwd>why-not provenance</kwd>
        <kwd>data science pipeline</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>When processing large amounts of data in a systematic fashion, we usually solve the arising issues with algorithms tailored towards the specific challenge. Even though this strategy leads to increased efficiency on a local level, we miss connections between the existing problems. For instance, privacy and provenance are contradictory requirements usually solved in isolation from each other.</p>
      <p>If we describe data processing with the concept of the data science pipeline, privacy and provenance can be mapped to individual steps of the pipeline, but at the same time, they are requirements for the other steps. Therefore, we need to consider them while designing the database schema, during schema evolution, while cleaning the data, while transforming the queries and while processing the results of data analysis. Formulating separate issues with a single consistent language makes this possible. In our research, we chose the language of the Chase algorithm.</p>
      <p>The Chase ∗(○) integrates a parameter ∗ into an object ○, in the classical applications using integrity constraints as ∗ and database schemas or queries as ○. The algorithm was introduced more than four decades ago, processing two seemingly unrelated use cases – query optimization and schema construction – in a unified way [1, 2]. In the following years, the number of its application areas increased even further, with the concept of “universal solution” connecting the different challenges [3]. Since then, intense research has led to a deeper understanding of the algorithm’s properties, for example its termination behavior. Despite its success in the field of theory, software tools making use of the algorithm are – for the most part – restricted to scientific prototypes. Ultimately, we intend to use the Chase to solve practice-related use cases, in particular, issues of privacy and provenance.</p>
      <p>In this work, we focus on our results concerning why- and why-not provenance. Previous studies explored Chase-based solutions concerning the why-provenance of simple queries. However, for more realistic scenarios, extensions of the algorithm are necessary. Unfortunately, this endangers confluence, termination and efficiency of the Chase. While we regard the Chase as a universal algorithm targeting a broad variety of objects, the semantics of its extensions depends on the Chase object and therefore on the use case. Privacy issues can be solved with the Chase on queries, whereas provenance computations require the Chase on database instances, which is therefore the focus of this work.</p>
      <sec id="sec-1-1">
        <title>1.1. Data Science Pipeline</title>
        <p>Data processing can be divided into a sequence of steps we call the data science pipeline. The initial step
comprises schema evolution, data migration and data
cleaning. The subsequent data analysis is abstracted as a
sequence of database queries. Finally, results are
interpreted and issues of privacy and provenance are tackled.</p>
        <p>In the following, we will take a closer look at the individual steps, highlighting contributions of the Chase algorithm. For data cleaning, the Chase parameter ∗ are integrity constraints (e.g. key constraints) and the Chase object ○ is the database instance. The Chase might substitute null values with constants by making use of functional dependencies, or insert missing tuples, e.g. to satisfy inclusion dependencies. For data migration, the Chase transforms a source instance to generate a target instance under a different schema. Again, ○ is a database instance, whereas ∗ are rules describing a mapping from source to target schema. For data analysis, ○ is again a database instance, while ∗ represents the database queries. Here, our interest in complex data analyses comprising e.g. statistical analysis warrants an extension of the current Chase formalism, for example with mathematical operators or negation. In the provenance step, data responsible for the analysis results is identified using provenance techniques. This way, reproducibility of those results is guaranteed. To compute provenance, we invert the Chase rules used in the previous data analysis step. Here, ○ refers to the achieved analysis results. However, privacy guidelines might prohibit direct access to the raw data, for example when personal information is concerned. We contribute to privacy by applying Chase rules (e.g. view definitions) to the queries used in the analysis step, thereby integrating them into the latter. By combining different Chase applications, we contribute to every step of the data science pipeline.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Research Data Management</title>
        <p>One of the real world application areas motivating our study is research data management. Many of our use cases originate from our long term co-operation with the Institute for Baltic Sea Research (IOW). Here, reproducibility of published research results is a requirement for good scientific conduct. However, the schema of recorded data changes over time. Thus, while tracing back results to the responsible data, we need to account for schema evolution. After identifying involved tuples, scientists provide them – and not the entire data set – to an external reviewer.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Negation in Data Analysis</title>
        <p>We interpret data analysis as a series of database queries, which in turn are expressed as Chase rules. In particular, we are interested in statistical analysis of scientific data, which often contains negation. On the one hand, negation might be an explicit part of the analysis algorithm, for example in the form of set difference. On the other hand, negation can be an implicit part of basic aggregate functions, for example the maximum function. However, negation is not part of the standard Chase.</p>
      </sec>
      <sec id="sec-1-4">
        <title>1.4. Contribution</title>
        <p>Using the Chase algorithm, we solve interconnected challenges of the data science pipeline in a coordinated manner. In this work, we study how previous results concerning provenance calculation are affected if we extend the Chase with negation. While the Conditional Chase we describe here has been studied in previous theoretical works [4], connecting it to Chase negation is rather uncommon. Furthermore, this extension allows the calculation of instance-based why-not provenance. While the computation of this provenance is already possible with specialized algorithms, our Chase-based solution is directly integrated into a framework solving a multitude of other data science challenges, for example schema evolution and query transformation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>The Chase is a fixpoint algorithm incorporating a set of rules, the Chase parameter, into a Chase object. In this work, the Chase parameter is a query or transformation rule encoded as a tuple generating dependency (tgd), while the Chase object is a database instance. Tgds are logical implications of the form φ(x⃗, y⃗) → ∃z⃗ : ψ(x⃗, z⃗). Here, φ and ψ are sets of relational atoms over the depicted variables. If the tgd encodes a query, we often existentially quantify y⃗, while at the same time prohibiting existentially quantified variables z⃗ in the rule head (its right-hand side). The inversion of a tgd is the logical implication ψ(x⃗, z⃗) → ∃y⃗ : φ(x⃗, y⃗).</p>
      <p>If there is a homomorphic mapping from a Chase rule’s body (the left-hand side) to the Chase object and the rule is not satisfied yet, the rule head is materialized under the homomorphism, generating a set of new tuples. Each existentially quantified variable of the head contributes a fresh marked null value ηᵢ (i ∈ ℕ). The Chase continues until a fixpoint is reached; however, termination is not guaranteed if existentially quantified variables in the rule head are allowed and the rules are cyclic. Still, several termination tests guarantee Chase termination even on cyclic rule sets. The Conditional Chase described later in this work, for instance, terminates on richly acyclic rule sets in polynomial time [4].</p>
      <p>In general, provenance can be seen as meta-data describing a production process [5]. For our generalizing research approach, why- and why-not data provenance are of particular interest. While there are tools (e.g. [6]) computing more detailed provenance information, they are not applicable to other challenges of the data science pipeline.</p>
      <p>Why-provenance provides tuples – the witnesses – justifying a certain result. Quite often, for example if tuples become indistinguishable after projection, there are alternative sets of tuples witnessing the same result. Thus, a witness basis is the set of all sets of witnesses. Alternatively, inverting the query might provide a single generic representation for alternative witness sets (compare the relaxed Chase-inverse in [7]).</p>
      <p>Why-not provenance explains the absence of expected result tuples. This can take three different forms: instance-based explanations, query-based explanations and refinement-based explanations [5]. Since the Chase algorithm allows neither tuple deletion nor update of constants, we restrict ourselves to explanations based on the insertion of tuples into the database. Similar to our approach, the algorithm described in [8] computes why-not provenance using conditions and c-tables. However, those concepts are not used in the context of the Conditional Chase discussed in this work. Most importantly, this solution is restricted to provenance and isolated from other challenges of the data science pipeline.</p>
      <p>Even though [9] formalizes both evolution rules and conjunctive queries using tgds, only basic analysis operations can be realized this way. Furthermore, only non-recursive queries are actually inverted using the Chase.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation</title>
      <p>With the Chase implementation ChaTEAU, different applications of the Chase are combined in a single toolkit [9]. Currently, extensions like generalized negation on database instances and queries under the conditional semantics discussed in this work are being integrated into the software.</p>
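      <p>To make the tgd formalism concrete, the following is a minimal sketch of a single Chase step on a database instance. It is an illustration written for this paper, not ChaTEAU code; the encoding (atoms as (relation, terms) pairs, variables as strings starting with “?”, marked nulls as eta_i strings) is ours.</p>

```python
from itertools import count

_nulls = count(1)

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def homomorphisms(body, instance, h=None):
    """Enumerate term mappings from the body atoms into the instance."""
    h = dict(h or {})
    if not body:
        yield h
        return
    (rel, terms), *rest = body
    for tup in instance.get(rel, set()):
        if len(tup) != len(terms):
            continue
        h2, ok = dict(h), True
        for t, v in zip(terms, tup):
            ok = (h2.setdefault(t, v) == v) if is_var(t) else (t == v)
            if not ok:
                break
        if ok:
            yield from homomorphisms(rest, instance, h2)

def chase_step(tgd, instance):
    """Materialize the tgd head under every homomorphism of its body;
    each existentially quantified head variable contributes a fresh
    marked null (oblivious variant: no satisfaction test for
    existential heads)."""
    body, head = tgd
    added = []
    for h in homomorphisms(body, instance):
        ext = {}  # fresh nulls, one per existential variable and trigger
        for rel, terms in head:
            tup = tuple(
                h[t] if is_var(t) and t in h
                else ext.setdefault(t, f"eta_{next(_nulls)}") if is_var(t)
                else t
                for t in terms
            )
            if tup not in instance.get(rel, set()):
                instance.setdefault(rel, set()).add(tup)
                added.append((rel, tup))
    return added

# tgd R(x) → ∃z : S(x, z) applied to the instance {R(1), R(2)}
tgd = ([("R", ("?x",))], [("S", ("?x", "?z"))])
I = {"R": {(1,), (2,)}}
chase_step(tgd, I)
# S now holds one tuple per R-tuple, each with a fresh marked null
```

      <p>A full implementation additionally needs the satisfaction test for existential heads and a termination check; the oblivious variant suffices for illustration.</p>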
    </sec>
    <sec id="sec-4">
      <title>4. Negation on Instances</title>
      <p>While we regard the Chase as a universal algorithm handling different kinds of parameters and objects in a unified manner, the semantics of negation still depends on the Chase object. For the Chase on instances, negation in a Chase rule requires the absence of a certain set of tuples. In contrast, negation in Chase rules applied to queries requires the presence of an explicitly negated set of atoms in the query. As a consequence, we differentiate between negation as a negated boolean subquery (from integrity constraint to database instance), and negation as a boolean subquery with inverted direction (from query to integrity constraint). In this work, we will focus on the first case, since it is more relevant for the use case data provenance. After finding a term mapping h for all variables of the positive body of an integrity constraint, we select the atoms ψ(x⃗, z⃗) of a negative body ¬∃z⃗ : ψ(x⃗, z⃗) (that is, a negated conjunction of relational atoms). If the result of the boolean query ψ(x⃗, z⃗) → res() is res(), we reject h; otherwise, we continue with the next negative body or complete the Chase step (e.g. generate some tuples).</p>
      <p>In this work, we will focus on semi-positive negation, that is, negation of base relations. Otherwise, we can no longer guarantee a single generic witness basis and calculations become rather complicated.</p>
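      <p>The rejection test can be sketched as follows, a simplified illustration rather than ChaTEAU code, with atoms encoded as (relation, terms) pairs and variables as strings starting with “?”: after a term mapping h is found for the positive body, each negative body is evaluated as a boolean query, and h is rejected on a match.</p>

```python
def matches(atoms, instance, h):
    """Boolean query: can the atoms be mapped into the instance,
    extending the term mapping h? (variables are strings like "?x")"""
    if not atoms:
        return True
    (rel, terms), *rest = atoms
    for tup in instance.get(rel, set()):
        if len(tup) != len(terms):
            continue
        h2 = dict(h)
        if all(
            (h2.setdefault(t, v) == v)
            if isinstance(t, str) and t.startswith("?") else (t == v)
            for t, v in zip(terms, tup)
        ) and matches(rest, instance, h2):
            return True
    return False

def admits(h, negative_bodies, instance):
    """Reject the term mapping h if any negated conjunction of atoms
    has an image in the instance (semi-positive negation)."""
    return not any(matches(body, instance, h) for body in negative_bodies)

# instance with R(1), R(2), S(2) and a rule body R(x) ∧ R(y) with ¬S(y)
I = {"R": {(1,), (2,)}, "S": {(2,)}}
neg = [[("S", ("?y",))]]
assert admits({"?x": 2, "?y": 1}, neg, I)      # S(1) is absent: h is kept
assert not admits({"?x": 2, "?y": 2}, neg, I)  # S(2) exists: h is rejected
```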
    </sec>
    <sec id="sec-5">
      <title>5. Conditions and Certain Answers</title>
      <p>The standard Chase (without negation) operates under certain answers semantics. Interpreting (marked) null values as unknown, but existing values, we consider only results justified under any valuation of null values with concrete constants. In general, we treat null values as constants unequal to any constant in the database instance. However, this naive interpretation is insufficient for negation under certain answers semantics. Consider a rule r1 which is triggered if null value η1 equals constant c. Under the naive interpretation, η1 is not equal to c and no result is generated. Let there be a second rule r2 which would be blocked by r1’s result. Clearly, r2 should not be triggered under certain answers semantics, since there is a valuation of null values (η1 ↦ c) not justifying the generation of this tuple. Therefore, some tuples are only generated under certain conditions. For this paper, we define conditions as conjunctions of logical comparisons t1 θ t2 (θ ∈ {=, ≠}, t1, t2 ∈ CONST ∪ NULL), with CONST being the set of all constants of the domain and NULL being the set of all marked null values. In the previous example, the blocking tuple would be generated under condition η1 = c, while the result of r2 is generated under condition η1 ≠ c. If the conditions of equivalent tuples form a tautology (e.g. η1 = c and η1 ≠ c), those tuples exist under certain answers semantics.</p>
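      <p>The consistency test on such condition sets can be sketched in a few lines, using our own encoding of the comparisons rather than ChaTEAU’s: equalities are collected into equivalence classes via union-find, then inequalities and clashes between distinct constants are checked against those classes.</p>

```python
def consistent(conditions):
    """Check a conjunction of comparisons (t1, op, t2) with op in
    {"=", "!="} over constants and marked nulls; in this sketch,
    marked nulls are the strings starting with "eta_"."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:            # union-find with path halving
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for lhs, op, rhs in conditions:      # merge classes along equalities
        if op == "=":
            parent[find(lhs)] = find(rhs)
    for lhs, op, rhs in conditions:      # inequality inside one class fails
        if op == "!=" and find(lhs) == find(rhs):
            return False
    roots = {}                           # two constants in one class fail
    for t in list(parent):
        if not str(t).startswith("eta_"):
            if roots.setdefault(find(t), t) != t:
                return False
    return True

# the example from the text: η1 = c is satisfiable, η1 = c ∧ η1 ≠ c is not
assert consistent([("eta_1", "=", "c")])
assert not consistent([("eta_1", "=", "c"), ("eta_1", "!=", "c")])
```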
    </sec>
    <sec id="sec-6">
      <title>6. Provenance</title>
      <sec id="sec-6-1">
        <title>6.1. Why-Provenance</title>
        <p>As mentioned before, we can use the standard Chase algorithm (without negation) to calculate the why-provenance of query results. The notion of why-provenance used here corresponds to the relaxed Chase-inverse found in [7].</p>
        <p>For this, we invert the Chase rules that created the result by switching the rule’s head and body. Attribute values not passed to the result correspond to existentially quantified variables of the inverted Chase rule. Applying the inverted query to the result generates the minimal witness basis [9]. This set of tuples is sufficient to create the original query result. However, it is usually smaller than the complete instance and contains marked null values in places irrelevant for the query.</p>
        <p>Extending this established algorithm with semi-positive negation is often without consequences. However, the same witness might play different roles during query execution. Consider query Q on instance I: I = {R(1), R(2), S(2)} and Q : R(x) ∧ R(y) ∧ ¬S(y) → T(x, y).</p>
        <p>If we ignore the semi-positive negation of Q, the witness basis is {R(1), R(2)}. Indeed, even if we materialize the image of Q’s negative body, we only learn that S(1) cannot exist, but we learn nothing about S(2). Consequently, we can justify the existence of all previously published results {T(1, 1), T(2, 1)}, but we could additionally justify the result T(2, 2). The absence of this expected result from the published results could be explained using why-not provenance. However, the instance-based why-not provenance presented in this work is restricted to insertions – clearly, there is no way to generate T(2, 2) by inserting additional tuples into the database.</p>
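        <p>The inversion step can be sketched as follows for a query rule with a single result atom (an illustration with our own encoding, not the ChaTEAU implementation; negation is ignored during inversion): swapping head and body yields the relaxed inverse, and applying it to the published result materializes the witness set, with fresh marked nulls for attributes not passed to the result.</p>

```python
from itertools import count

_nulls = count(1)

def invert(tgd):
    """Relaxed-inverse sketch: swap head and body of a tgd. Body
    variables that were not passed to the head become existentially
    quantified, i.e. fresh marked nulls when the inverse is applied."""
    body, head = tgd
    return (head, body)

def _image(term, h):
    # bound variables keep their value; unbound ones become marked nulls
    if isinstance(term, str) and term.startswith("?"):
        if term not in h:
            h[term] = f"eta_{next(_nulls)}"
        return h[term]
    return term

def why(result, inverted):
    """Apply an inverted query rule (single result atom assumed) to the
    published result; the materialized atoms form the witness set."""
    [(res_rel, res_terms)] = inverted[0]
    witnesses = set()
    for tup in result.get(res_rel, set()):
        h = dict(zip(res_terms, tup))
        for rel, terms in inverted[1]:
            witnesses.add((rel, tuple(_image(t, h) for t in terms)))
    return witnesses

# query rule R(x) ∧ R(y) → T(x, y) with published result {T(1, 1), T(2, 1)}
Q = ([("R", ("?x",)), ("R", ("?y",))], [("T", ("?x", "?y"))])
published = {"T": {(1, 1), (2, 1)}}
assert why(published, invert(Q)) == {("R", (1,)), ("R", (2,))}
```

        <p>The resulting set {R(1), R(2)} is exactly the witness basis of the example above.</p>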
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Why-Not Provenance</title>
        <p>Similarly to why-provenance, why-not provenance can be computed using the witness basis generated by the Chase. This witness basis (in the context of why-not provenance known as the “generic” witness basis) includes the query’s materialized negative bodies. Since we are interested in witnesses without representation in the database, we insert artificial representatives for each (positive) witness into the database. We interpret the witness basis as a query returning the tuple identifiers and apply it to the database. Every generated result tuple corresponds to one possible why-not explanation. We select results referring to a minimum of artificial representatives and return the respective representatives as the why-not explanation. If a witness mapped to its own artificial representative and a witness mapped to a tuple from the original database share an existentially quantified variable, this variable is mapped to a marked null value (from the artificial tuple) and a constant (from the original tuple). The Conditional Chase allows this mapping under the condition that null value and constant are equal. If those equality conditions are consistent, we interpret them as term mappings and substitute null values from explanation tuples with constants from the original database instance. A more detailed description of the algorithm outlined above can be found in [10].</p>
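        <p>For fully ground expected tuples, the insertion-based idea can be sketched in a strongly simplified form (our own simplification; it omits conditions, artificial representatives and marked nulls, which the full algorithm in [10] handles): the expected tuple binds the rule’s head variables, missing positive atoms become the candidate insertions, and a satisfied negative body means that no insertion-based explanation exists.</p>

```python
def why_not(expected, rule, instance):
    """Insertion-based why-not sketch for a ground expected tuple.
    Returns the missing positive atoms to insert, or None if the tuple
    is blocked by a present negated atom (tuples may not be deleted)."""
    pos_body, neg_bodies, (_head_rel, head_terms) = rule
    h = dict(zip(head_terms, expected))

    def ground(terms):
        return tuple(h.get(t, t) for t in terms)

    # a present negated atom blocks the expected tuple for good
    for atoms in neg_bodies:
        if all(ground(terms) in instance.get(rel, set()) for rel, terms in atoms):
            return None
    # otherwise the missing positive atoms are the candidate insertions
    return [(rel, ground(terms))
            for rel, terms in pos_body
            if ground(terms) not in instance.get(rel, set())]

# rule R(x) ∧ R(y) ∧ ¬S(y) → T(x, y) on the instance {R(1), R(2), S(2)}
I = {"R": {(1,), (2,)}, "S": {(2,)}}
rule = ([("R", ("?x",)), ("R", ("?y",))], [[("S", ("?y",))]], ("T", ("?x", "?y")))
assert why_not((2, 2), rule, I) is None           # blocked by S(2)
assert why_not((3, 1), rule, I) == [("R", (3,))]  # insert R(3) for T(3, 1)
```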
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Future Work</title>
      <p>Currently, we only consider semi-positive negation.
Otherwise, two queries triggering each other could lead to
nested negation in the generic witness basis, which
resolves to a disjunction of generic witnesses. However,
the use cases motivating our work are not restricted to
semi-positive negation or even stratifiable rule sets, so
an extension of our framework is necessary.</p>
      <p>In this work, the Chase object is a database instance.
However, other steps of the data science pipeline require
a query to be the Chase object. While Chase semantics
are very similar for both types of objects, the semantics of negation differ considerably.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>The Chase algorithm solves a multitude of data science challenges in a unified manner. Provenance, for instance, can be calculated while keeping track of schema evolution. However, real world applications require extensions of the algorithm. In this work, we explain the computation of why- and why-not provenance of conjunctive queries with semi-positive negation using the extended Chase. While we chose the Conditional Chase to realize certain answers semantics with negation, this decision was advantageous for the explanation of why-not provenance even in the absence of negation.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was supported by a scholarship of the Landesgraduiertenförderung Mecklenburg-Vorpommern.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Aho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sagiv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          ,
          <article-title>Efficient optimization of a class of relational expressions</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>4</volume>
          (
          <year>1979</year>
          )
          <fpage>435</fpage>
          -
          <lpage>454</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Mendelzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sagiv</surname>
          </string-name>
          ,
          <article-title>Testing implications of data dependencies</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>4</volume>
          (
          <year>1979</year>
          )
          <fpage>455</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Remmel</surname>
          </string-name>
          ,
          <article-title>The chase revisited</article-title>
          , in: PODS, ACM,
          <year>2008</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grahne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Onet</surname>
          </string-name>
          ,
          <article-title>On conditional chase termination</article-title>
          ,
          <source>AMW</source>
          <volume>11</volume>
          (
          <year>2011</year>
          )
          <fpage>46</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Herschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Diestelkämper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Ben</given-names>
            <surname>Lahmar</surname>
          </string-name>
          ,
          <article-title>A survey on provenance: What for? what form? what from?</article-title>
          ,
          <source>VLDB J</source>
          .
          <volume>26</volume>
          (
          <year>2017</year>
          )
          <fpage>881</fpage>
          -
          <lpage>906</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jachiet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maniu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ramusat</surname>
          </string-name>
          ,
          <article-title>Provsql: Provenance and probability management in postgresql</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>2034</fpage>
          -
          <lpage>2037</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Kolaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Schema mapping evolution through composition and inversion</article-title>
          , in: Schema Matching and Mapping,
          <source>DataCentric Systems and Applications</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Herschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <article-title>Explaining missing answers to SPJUA queries</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>3</volume>
          (
          <year>2010</year>
          )
          <fpage>185</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Auge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hanzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heuer</surname>
          </string-name>
          ,
          <article-title>Prosa pipeline: Provenance conquers the chase</article-title>
          ,
          <source>in: ADBIS (Short Papers)</source>
          , volume
          <volume>1652</volume>
          of Communications in Computer and Information Science, Springer,
          <year>2022</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Görres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heuer</surname>
          </string-name>
          ,
          <article-title>Computing Provenance Using the Negated Chase</article-title>
          ,
          <source>Technical Report CS 01-23</source>
          , Institut für Informatik, Universität Rostock,
          <year>2023</year>
          . Extended version of this work.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>