<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Computing Provenance Using the Negated Chase</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Görres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supervised by Andreas Heuer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Rostock</institution>
          ,
          <addr-line>Universitätsplatz 1, 18055 Rostock</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Since different challenges of data processing are interconnected, we describe them in a unified manner using a classic algorithm of database theory: the Chase. Computing the origin of query results is one of the challenges considered in this research project. Previously, the Chase has been used to calculate why-provenance of simple conjunctive queries. However, applying the Chase to more realistic scenarios requires an extension of the algorithm, for example with negation. This work reveals opportunities for the extended Chase by calculating both why- and why-not provenance of conjunctive queries with negation.</p>
      </abstract>
      <kwd-group>
        <kwd>Chase</kwd>
        <kwd>negation</kwd>
        <kwd>why-provenance</kwd>
        <kwd>why-not provenance</kwd>
        <kwd>data science pipeline</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>When processing large amounts of data in a systematic fashion, we usually solve the arising issues with algorithms tailored towards the specific challenge. Even though this strategy leads to increased efficiency on a local level, we miss connections between the existing problems. For instance, privacy and provenance are contradictory requirements usually solved in isolation from each other.</p>
      <p>If we describe data processing with the concept of the data science pipeline, privacy and provenance can be mapped to individual steps of the pipeline, but at the same time, they are requirements for the other steps. Therefore, we need to consider them while designing the database schema, during schema evolution, while cleaning the data, while transforming the queries and while processing the results of data analysis. Formulating separate issues with a single consistent language makes this possible. In our research, we chose the language of the Chase algorithm.</p>
      <p>The Chase ∗(○) integrates a parameter ∗ into an object ○, in the classical applications using integrity constraints as ∗ and database schemas or queries as ○. The algorithm was introduced more than four decades ago, processing two seemingly unrelated use cases – query optimization and schema construction – in a unified way [1, 2]. In the following years, the number of its application areas increased even further, with the concept of “universal solution” connecting the different challenges [3]. Since then, intense research has led to a deeper understanding of the algorithm’s properties, for example its termination behavior. Despite its success in the field of theory, software tools making use of the algorithm are – for the most part – restricted to scientific prototypes. Ultimately, we intend to use the Chase to solve practice-related use cases, in particular, issues of privacy and provenance.</p>
      <p>In this work, we focus on our results concerning why- and why-not provenance. Previous studies explored Chase-based solutions concerning the why-provenance of simple queries. However, for more realistic scenarios, extensions of the algorithm are necessary. Unfortunately, this endangers confluence, termination and efficiency of the Chase. While we regard the Chase as a universal algorithm targeting a broad variety of objects, the semantics of its extensions depends on the Chase object and therefore on the use case. Privacy issues can be solved with the Chase on queries, whereas provenance computations require the Chase on database instances, which is therefore the focus of this work.</p>
      <sec id="sec-1-1">
        <title>1.1. Data Science Pipeline</title>
        <p>Data processing can be divided into a sequence of steps we call the data science pipeline. The initial step
comprises schema evolution, data migration and data
cleaning. The subsequent data analysis is abstracted as a
sequence of database queries. Finally, results are
interpreted and issues of privacy and provenance are tackled.</p>
        <p>In the following, we will take a closer look at the individual steps, highlighting contributions of the Chase algorithm. For data cleaning, the Chase parameter ∗ are integrity constraints (e.g. key constraints) and the Chase object ○ is the database instance. The Chase might substitute null values with constants by making use of functional dependencies, or insert missing tuples, e.g. to satisfy inclusion dependencies. For data migration, the Chase transforms a source instance to generate a target instance under a different schema. Again, ○ is a database instance, whereas ∗ are rules describing a mapping from source to target schema. For data analysis, ○ is again a database instance, while ∗ represents the database queries. Here, our interest in complex data analyses comprising e.g. statistical analysis warrants an extension of the current Chase formalism, for example with mathematical operators or negation. In the provenance step, data responsible for the analysis results is identified using provenance techniques. This way, reproducibility of those results is guaranteed. To compute provenance, we invert the Chase rules used in the previous data analysis step. Here, ○ refers to the achieved analysis results. However, privacy guidelines might prohibit direct access to the raw data, for example when personal information is concerned. We contribute to privacy by applying Chase rules (e.g. view definitions) to the queries used in the analysis step, thereby integrating them into the latter. By combining different Chase applications, we contribute to every step of the data science pipeline.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Research Data Management</title>
        <p>One of the real world application areas motivating our study is research data management. Many of our use cases originate from our long term co-operation with the Institute for Baltic Sea Research (IOW). Here, reproducibility of published research results is a requirement for good scientific conduct. However, the schema of recorded data changes over time. Thus, while tracing back results to the responsible data, we need to account for schema evolution. After identifying involved tuples, scientists provide them – and not the entire data set – to an external reviewer.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Negation in Data Analysis</title>
        <p>We interpret data analysis as a series of database queries, which in turn are expressed as Chase rules. In particular, we are interested in statistical analysis of scientific data, which often contains negation. On the one hand, negation might be an explicit part of the analysis algorithm, for example in the form of set difference. On the other hand, negation can be an implicit part of basic aggregate functions, for example the maximum function. However, negation is not part of the standard Chase.</p>
      </sec>
      <sec id="sec-1-4">
        <title>1.4. Contribution</title>
        <p>Using the Chase algorithm, we solve interconnected challenges of the data science pipeline in a coordinated manner. In this work, we study how previous results concerning provenance calculation are affected if we extend the Chase with negation. While the Conditional Chase we describe here has been studied in previous theoretical works [4], connecting it to Chase negation is rather uncommon. Furthermore, this extension allows the calculation of instance-based why-not provenance. While the computation of this provenance is already possible with specialized algorithms, our Chase-based solution is directly integrated into a framework solving a multitude of other data science challenges, for example schema evolution and query transformation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>The Chase is a fixpoint algorithm incorporating a set of rules, the Chase parameter, into a Chase object. In this work, the Chase parameter is a query or transformation rule encoded as a tuple generating dependency (tgd), while the Chase object is a database instance. Tgds are logical implications of the form φ(x⃗, y⃗) → ∃z⃗ : ψ(x⃗, z⃗). Here, φ and ψ are sets of relational atoms over the depicted variables. If the tgd encodes a query, we often existentially quantify y⃗, while at the same time prohibiting existentially quantified variables z⃗ in the rule head (its right-hand side). The inversion of a tgd is the logical implication ψ(x⃗, z⃗) → ∃y⃗ : φ(x⃗, y⃗).</p>
      <p>If there is a homomorphic mapping from a Chase rule’s body (the left-hand side) to the Chase object and the rule is not satisfied yet, the rule head is materialized under the homomorphism, generating a set of new tuples. Each existentially quantified variable of the head contributes a fresh marked null value ηᵢ (i ∈ ℕ). The Chase continues until a fixpoint is reached; however, termination is not guaranteed if existentially quantified variables in the rule head are allowed and the rules are cyclic. Still, several termination tests guarantee Chase termination even on cyclic rule sets. The Conditional Chase described later in this work, for instance, terminates on richly acyclic rule sets in polynomial time [4].</p>
      <p>In general, provenance can be seen as meta-data describing a production process [5]. For our generalizing research approach, why- and why-not data provenance are of particular interest. While there are tools (e.g. [6]) computing more detailed provenance information, they are not applicable to other challenges of the data science pipeline.</p>
      <p>Why-provenance provides tuples – the witnesses – justifying a certain result. Quite often, for example if tuples become indistinguishable after projection, there are alternative sets of tuples witnessing the same result. Thus, a witness basis is the set of all sets of witnesses. Alternatively, inverting the query might provide a single generic representation for alternative witness sets (compare the relaxed Chase-inverse in [7]).</p>
      <p>Why-not provenance explains the absence of expected result tuples. This can take three different forms: instance-based explanations, query-based explanations and refinement-based explanations [5]. Since the Chase algorithm allows neither tuple deletion nor update of constants, we restrict ourselves to explanations based on the insertion of tuples into the database. Similar to our approach, the algorithm described in [8] computes why-not provenance using conditions and c-tables. However, those concepts are not used in the context of the Conditional Chase discussed in this work. Most importantly, this solution is restricted to provenance and isolated from other challenges of the data science pipeline.</p>
      <p>Even though [9] formalizes both evolution rules and conjunctive queries using tgds, only basic analysis operations can be realized this way. Furthermore, only non-recursive queries are actually inverted using the Chase.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation</title>
      <p>With the Chase implementation ChaTEAU, different applications of the Chase are combined in a single toolkit [9]. Currently, extensions like generalized negation on database instances and queries under the conditional semantics discussed in this work are being integrated into the software.</p>
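      <p>To make the tgd formalism concrete, the following is a minimal sketch of a single Chase step on a database instance. It is an illustration written for this paper, not ChaTEAU code; the encoding (atoms as (relation, terms) pairs, variables as strings starting with “?”, marked nulls as eta_i strings) is ours.</p>

```python
from itertools import count

_nulls = count(1)

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def homomorphisms(body, instance, h=None):
    """Enumerate term mappings from the body atoms into the instance."""
    h = dict(h or {})
    if not body:
        yield h
        return
    (rel, terms), *rest = body
    for tup in instance.get(rel, set()):
        if len(tup) != len(terms):
            continue
        h2, ok = dict(h), True
        for t, v in zip(terms, tup):
            ok = (h2.setdefault(t, v) == v) if is_var(t) else (t == v)
            if not ok:
                break
        if ok:
            yield from homomorphisms(rest, instance, h2)

def chase_step(tgd, instance):
    """Materialize the tgd head under every homomorphism of its body;
    each existentially quantified head variable contributes a fresh
    marked null (oblivious variant: no satisfaction test for
    existential heads)."""
    body, head = tgd
    added = []
    for h in homomorphisms(body, instance):
        ext = {}  # fresh nulls, one per existential variable and trigger
        for rel, terms in head:
            tup = tuple(
                h[t] if is_var(t) and t in h
                else ext.setdefault(t, f"eta_{next(_nulls)}") if is_var(t)
                else t
                for t in terms
            )
            if tup not in instance.get(rel, set()):
                instance.setdefault(rel, set()).add(tup)
                added.append((rel, tup))
    return added

# tgd R(x) → ∃z : S(x, z) applied to the instance {R(1), R(2)}
tgd = ([("R", ("?x",))], [("S", ("?x", "?z"))])
I = {"R": {(1,), (2,)}}
chase_step(tgd, I)
# S now holds one tuple per R-tuple, each with a fresh marked null
```

      <p>A full implementation additionally needs the satisfaction test for existential heads and a termination check; the oblivious variant suffices for illustration.</p>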
    </sec>
    <sec id="sec-4">
      <title>4. Negation on Instances</title>
      <p>While we regard the Chase as a universal algorithm handling different kinds of parameters and objects in a unified manner, the semantics of negation still depends on the Chase object. For the Chase on instances, negation in a Chase rule requires the absence of a certain set of tuples. In contrast, negation in Chase rules applied to queries requires the presence of an explicitly negated set of atoms in the query. As a consequence, we differentiate between negation as a negated boolean subquery (from integrity constraint to database instance), and negation as a boolean subquery with inverted direction (from query to integrity constraint). In this work, we will focus on the first case, since it is more relevant for the use case data provenance. After finding a term mapping h for all variables of the positive body of an integrity constraint, we select the atoms ψ(x⃗, z⃗) of a negative body ¬∃z⃗ : ψ(x⃗, z⃗) (that is, a negated conjunction of relational atoms). If the result of the boolean query ψ(x⃗, z⃗) → res() is res(), we reject h; otherwise, we continue with the next negative body or complete the Chase step (e.g. generate some tuples).</p>
      <p>In this work, we will focus on semi-positive negation, that is, negation of base relations. Otherwise, we can no longer guarantee a single generic witness basis and calculations become rather complicated.</p>
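      <p>The rejection test can be sketched as follows, a simplified illustration rather than ChaTEAU code, with atoms encoded as (relation, terms) pairs and variables as strings starting with “?”: after a term mapping h is found for the positive body, each negative body is evaluated as a boolean query, and h is rejected on a match.</p>

```python
def matches(atoms, instance, h):
    """Boolean query: can the atoms be mapped into the instance,
    extending the term mapping h? (variables are strings like "?x")"""
    if not atoms:
        return True
    (rel, terms), *rest = atoms
    for tup in instance.get(rel, set()):
        if len(tup) != len(terms):
            continue
        h2 = dict(h)
        if all(
            (h2.setdefault(t, v) == v)
            if isinstance(t, str) and t.startswith("?") else (t == v)
            for t, v in zip(terms, tup)
        ) and matches(rest, instance, h2):
            return True
    return False

def admits(h, negative_bodies, instance):
    """Reject the term mapping h if any negated conjunction of atoms
    has an image in the instance (semi-positive negation)."""
    return not any(matches(body, instance, h) for body in negative_bodies)

# instance with R(1), R(2), S(2) and a rule body R(x) ∧ R(y) with ¬S(y)
I = {"R": {(1,), (2,)}, "S": {(2,)}}
neg = [[("S", ("?y",))]]
assert admits({"?x": 2, "?y": 1}, neg, I)      # S(1) is absent: h is kept
assert not admits({"?x": 2, "?y": 2}, neg, I)  # S(2) exists: h is rejected
```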
    </sec>
    <sec id="sec-5">
      <title>5. Conditions and Certain Answers</title>
      <p>The standard Chase (without negation) operates under certain answers semantics. Interpreting (marked) null values as unknown, but existing values, we consider only results justified under any valuation of null values with concrete constants. In general, we treat null values as constants unequal to any constant in the database instance. However, this naive interpretation is insufficient for negation under certain answers semantics. Consider a rule r1 which is triggered if null value η1 equals constant c. Under the naive interpretation, η1 is not equal to c and no result is generated. Let there be a second rule r2 which would be blocked by r1’s result. Clearly, r2 should not be triggered under certain answers semantics, since there is a valuation of null values (η1 ↦ c) not justifying the generation of this tuple. Therefore, some tuples are only generated under certain conditions. For this paper, we define conditions as conjunctions of logical comparisons t1 θ t2 (θ ∈ {=, ≠}, t1, t2 ∈ CONST ∪ NULL), with CONST being the set of all constants of the domain and NULL being the set of all marked null values. In the previous example, the blocking tuple would be generated under condition η1 = c, while the result of r2 is generated under condition η1 ≠ c. If the conditions of equivalent tuples form a tautology (e.g. η1 = c and η1 ≠ c), those tuples exist under certain answers semantics.</p>
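      <p>The consistency test on such condition sets can be sketched in a few lines, using our own encoding of the comparisons rather than ChaTEAU’s: equalities are collected into equivalence classes via union-find, then inequalities and clashes between distinct constants are checked against those classes.</p>

```python
def consistent(conditions):
    """Check a conjunction of comparisons (t1, op, t2) with op in
    {"=", "!="} over constants and marked nulls; in this sketch,
    marked nulls are the strings starting with "eta_"."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:            # union-find with path halving
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for lhs, op, rhs in conditions:      # merge classes along equalities
        if op == "=":
            parent[find(lhs)] = find(rhs)
    for lhs, op, rhs in conditions:      # inequality inside one class fails
        if op == "!=" and find(lhs) == find(rhs):
            return False
    roots = {}                           # two constants in one class fail
    for t in list(parent):
        if not str(t).startswith("eta_"):
            if roots.setdefault(find(t), t) != t:
                return False
    return True

# the example from the text: η1 = c is satisfiable, η1 = c ∧ η1 ≠ c is not
assert consistent([("eta_1", "=", "c")])
assert not consistent([("eta_1", "=", "c"), ("eta_1", "!=", "c")])
```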
    </sec>
    <sec id="sec-6">
      <title>6. Provenance</title>
      <sec id="sec-6-1">
        <title>6.1. Why-Provenance</title>
        <p>As mentioned before, we can use the standard Chase algorithm (without negation) to calculate the why-provenance of query results. The notion of why-provenance used here corresponds to the relaxed Chase-inverse found in [7].</p>
        <p>For this, we invert the Chase rules that created the result by switching the rule’s head and body. Attribute values not passed to the result correspond to existentially quantified variables of the inverted Chase rule. Applying the inverted query to the result generates the minimal witness basis [9]. This set of tuples is sufficient to create the original query result. However, it is usually smaller than the complete instance and contains marked null values in places irrelevant for the query.</p>
        <p>Extending this established algorithm with semi-positive negation is often without consequences. However, the same witness might play different roles during query execution. Consider query Q on instance I: I = {R(1), R(2), S(2)} and Q : R(x) ∧ R(y) ∧ ¬S(y) → T(x, y).</p>
        <p>If we ignore the semi-positive negation of Q, the witness basis is {R(1), R(2)}. Indeed, even if we materialize the image of Q’s negative body, we only learn that S(1) cannot exist, but we learn nothing about S(2). Consequently, we can justify the existence of all previously published results {T(1, 1), T(2, 1)}, but we could additionally justify the result T(2, 2). The absence of this expected result from the published results could be explained using why-not provenance. However, the instance-based why-not provenance presented in this work is restricted to insertions – clearly, there is no way to generate T(2, 2) by inserting additional tuples into the database.</p>
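        <p>The inversion step can be sketched as follows for a query rule with a single result atom (an illustration with our own encoding, not the ChaTEAU implementation; negation is ignored during inversion): swapping head and body yields the relaxed inverse, and applying it to the published result materializes the witness set, with fresh marked nulls for attributes not passed to the result.</p>

```python
from itertools import count

_nulls = count(1)

def invert(tgd):
    """Relaxed-inverse sketch: swap head and body of a tgd. Body
    variables that were not passed to the head become existentially
    quantified, i.e. fresh marked nulls when the inverse is applied."""
    body, head = tgd
    return (head, body)

def _image(term, h):
    # bound variables keep their value; unbound ones become marked nulls
    if isinstance(term, str) and term.startswith("?"):
        if term not in h:
            h[term] = f"eta_{next(_nulls)}"
        return h[term]
    return term

def why(result, inverted):
    """Apply an inverted query rule (single result atom assumed) to the
    published result; the materialized atoms form the witness set."""
    [(res_rel, res_terms)] = inverted[0]
    witnesses = set()
    for tup in result.get(res_rel, set()):
        h = dict(zip(res_terms, tup))
        for rel, terms in inverted[1]:
            witnesses.add((rel, tuple(_image(t, h) for t in terms)))
    return witnesses

# query rule R(x) ∧ R(y) → T(x, y) with published result {T(1, 1), T(2, 1)}
Q = ([("R", ("?x",)), ("R", ("?y",))], [("T", ("?x", "?y"))])
published = {"T": {(1, 1), (2, 1)}}
assert why(published, invert(Q)) == {("R", (1,)), ("R", (2,))}
```

        <p>The resulting set {R(1), R(2)} is exactly the witness basis of the example above.</p>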
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Why-Not Provenance</title>
        <p>Similarly to why-provenance, why-not provenance can be computed using the witness basis generated by the Chase. This witness basis (in the context of why-not provenance known as the “generic” witness basis) includes the query’s materialized negative bodies. Since we are interested in witnesses without representation in the database, we insert artificial representatives for each (positive) witness into the database. We interpret the witness basis as a query returning the tuple identifiers and apply it to the database. Every generated result tuple corresponds to one possible why-not explanation. We select results referring to a minimum of artificial representatives and return the respective representatives as the why-not explanation. If a witness mapped to its own artificial representative and a witness mapped to a tuple from the original database share an existentially quantified variable, this variable is mapped to a marked null value (from the artificial tuple) and a constant (from the original tuple). The Conditional Chase allows this mapping under the condition that null value and constant are equal. If those equality conditions are consistent, we interpret them as term mappings and substitute null values from explanation tuples with constants from the original database instance. A more detailed description of the algorithm outlined above can be found in [10].</p>
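        <p>For fully ground expected tuples, the insertion-based idea can be sketched in a strongly simplified form (our own simplification; it omits conditions, artificial representatives and marked nulls, which the full algorithm in [10] handles): the expected tuple binds the rule’s head variables, missing positive atoms become the candidate insertions, and a satisfied negative body means that no insertion-based explanation exists.</p>

```python
def why_not(expected, rule, instance):
    """Insertion-based why-not sketch for a ground expected tuple.
    Returns the missing positive atoms to insert, or None if the tuple
    is blocked by a present negated atom (tuples may not be deleted)."""
    pos_body, neg_bodies, (_head_rel, head_terms) = rule
    h = dict(zip(head_terms, expected))

    def ground(terms):
        return tuple(h.get(t, t) for t in terms)

    # a present negated atom blocks the expected tuple for good
    for atoms in neg_bodies:
        if all(ground(terms) in instance.get(rel, set()) for rel, terms in atoms):
            return None
    # otherwise the missing positive atoms are the candidate insertions
    return [(rel, ground(terms))
            for rel, terms in pos_body
            if ground(terms) not in instance.get(rel, set())]

# rule R(x) ∧ R(y) ∧ ¬S(y) → T(x, y) on the instance {R(1), R(2), S(2)}
I = {"R": {(1,), (2,)}, "S": {(2,)}}
rule = ([("R", ("?x",)), ("R", ("?y",))], [[("S", ("?y",))]], ("T", ("?x", "?y")))
assert why_not((2, 2), rule, I) is None           # blocked by S(2)
assert why_not((3, 1), rule, I) == [("R", (3,))]  # insert R(3) for T(3, 1)
```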
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Future Work</title>
      <p>Currently, we only consider semi-positive negation.
Otherwise, two queries triggering each other could lead to
nested negation in the generic witness basis, which
resolves to a disjunction of generic witnesses. However,
the use cases motivating our work are not restricted to
semi-positive negation or even stratifiable rule sets, so
an extension of our framework is necessary.</p>
      <p>In this work, the Chase object is a database instance.
However, other steps of the data science pipeline require
a query to be the Chase object. While Chase semantics
are very similar for both types of objects, the semantics of negation differ considerably.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>The Chase algorithm solves a multitude of data science challenges in a unified manner. Provenance, for instance, can be calculated while keeping track of schema evolution. However, real world applications require extensions of the algorithm. In this work, we explain the computation of why- and why-not provenance of conjunctive queries with semi-positive negation using the extended Chase. While we chose the Conditional Chase to realize certain answers semantics with negation, this decision was advantageous for the explanation of why-not provenance even in the absence of negation.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was supported by a scholarship of the Landesgraduiertenförderung Mecklenburg-Vorpommern.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Aho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sagiv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          ,
          <article-title>Efficient optimization of a class of relational expressions</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>4</volume>
          (
          <year>1979</year>
          )
          <fpage>435</fpage>
          -
          <lpage>454</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Mendelzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sagiv</surname>
          </string-name>
          ,
          <article-title>Testing implications of data dependencies</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>4</volume>
          (
          <year>1979</year>
          )
          <fpage>455</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Remmel</surname>
          </string-name>
          ,
          <article-title>The chase revisited</article-title>
          , in: PODS, ACM,
          <year>2008</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grahne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Onet</surname>
          </string-name>
          ,
          <article-title>On conditional chase termination</article-title>
          ,
          <source>AMW</source>
          <volume>11</volume>
          (
          <year>2011</year>
          )
          <fpage>46</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Herschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Diestelkämper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Ben</given-names>
            <surname>Lahmar</surname>
          </string-name>
          ,
          <article-title>A survey on provenance: What for? what form? what from?</article-title>
          ,
          <source>VLDB J</source>
          .
          <volume>26</volume>
          (
          <year>2017</year>
          )
          <fpage>881</fpage>
          -
          <lpage>906</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jachiet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maniu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ramusat</surname>
          </string-name>
          ,
          <article-title>Provsql: Provenance and probability management in postgresql</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>2034</fpage>
          -
          <lpage>2037</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Kolaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Schema mapping evolution through composition and inversion</article-title>
          , in: Schema Matching and Mapping,
          <source>DataCentric Systems and Applications</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Herschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <article-title>Explaining missing answers to SPJUA queries</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>3</volume>
          (
          <year>2010</year>
          )
          <fpage>185</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Auge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hanzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heuer</surname>
          </string-name>
          ,
          <article-title>Prosa pipeline: Provenance conquers the chase</article-title>
          ,
          <source>in: ADBIS (Short Papers)</source>
          , volume
          <volume>1652</volume>
          of Communications in Computer and Information Science, Springer,
          <year>2022</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Görres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heuer</surname>
          </string-name>
          ,
          <article-title>Computing Provenance Using the Negated Chase</article-title>
          ,
          <source>Technical Report CS 01-23</source>
          , Institut für Informatik, Universität Rostock,
          <year>2023</year>
          . Extended version of this work.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>