Extended Provenance Management for Data Science Applications

Tanja Auge
Supervised by Prof. Andreas Heuer
University of Rostock, Germany
tanja.auge@uni-rostock.de

Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020, Tokyo, Japan.

ABSTRACT
Research data management deals with tracking and archiving of data collected during scientific projects, experiments or observations. The path from data collection to publication should thus be kept comprehensible, reconstructable and plausible. The continuous growth of data, frequent schema changes as well as the varied evaluation of the data make the storage of every possible database state a very complicated and lengthy task.

With the help of data provenance, however, we can determine which part of the primary research data must be stored long-term in order to ensure the reproducibility of the evaluations. It should also be possible to recalculate changes to data and schemata so that old data records do not have to be archived completely. In addition, the stored data must not conflict with existing privacy guidelines.

Figure 1: Query Evaluation, Provenance and Schema Evolution

1. INTRODUCTION
The presentation and publication of research results increasingly requires the publication of the corresponding research data, which ensures findability, accessibility, interoperability, and reusability in the sense of the FAIR Data Principles (https://www.go-fair.org/). In our research, we concentrate on structured data, e.g. resulting from measurement series, experiments, or always-on sensors, stored in a relational database. The FAIR principles do not necessarily mean that the entire database of a project has to be stored and published, but only that part of the database which is necessary for the traceability and/or reproducibility of the respective publication. The difficulty now is to generate exactly this minimal part of the original research database, called the sub-database. The need for data reduction can be due to high costs in collecting or evaluating the data (expensive or elaborately produced), to privacy aspects when evaluating personal data, or to intellectual property preservation. In our case, the evaluation of the research database is restricted to a relational query language, starting from conjunctive queries and adding arithmetic or aggregation functions later on.

Thus, our goal is not only to store the evaluation query and the query result itself, but also the relevant source data. However, if data and/or schema change frequently, the original database must be "frozen" and saved after each evaluation carried out on the dataset. To avoid this, we use provenance management techniques [7, 6] to calculate the sub-database before (red highlighted) or after evolution (blue highlighted) that is required to reproduce the query result (green highlighted in Figure 1).

After the General Data Protection Regulation (GDPR) came into force, it became apparent that the storage of research data, even without containing personal data, may fall under the aspect of privacy. Reasons for the additional privacy requirements are the high costs, the large amount of time as well as the great effort required to collect the research data. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy) for reasons of competition.

All in all, this results in three central research questions, which are summarized in Figure 1:
(I.) How to calculate the minimal part of the original research database that has to be stored permanently to achieve replicable research? (red highlighted)
(II.) How to unify the theories behind data provenance and schema evolution? (blue highlighted)
(III.) How to combine data provenance and privacy aspects in the case of query inversion? (locked tuple)
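Question (I.) can be made concrete with a small sketch. The toy example below (our own illustration with invented relations, not the actual implementation) evaluates a simple select-join query over two relations, records the witness tuples behind every result tuple, and verifies that these witnesses alone, i.e. the sub-database, reproduce the query result:

```python
# Toy sketch (invented relations, not the paper's system): for a simple
# select-join query we record, for every result tuple, the source tuples
# that produced it (its witness base).  The union of all witnesses forms
# a sub-database that suffices to reproduce the result.

def evaluate_with_witnesses(measurements, sensors):
    """Join measurements with sensors and keep readings above a threshold.

    Returns the query result together with the witness tuples used.
    """
    result, witnesses = set(), set()
    for m in measurements:           # m = (sensor_id, value)
        for s in sensors:            # s = (sensor_id, location)
            if m[0] == s[0] and m[1] > 20:
                result.add((s[1], m[1]))
                witnesses.add(("M", m))
                witnesses.add(("S", s))
    return result, witnesses

measurements = {(1, 25), (1, 10), (2, 30)}
sensors = {(1, "north"), (2, "south"), (3, "west")}

full_result, wit = evaluate_with_witnesses(measurements, sensors)

# Reconstruct the sub-database from the witness base and re-run the query:
sub_m = {t for rel, t in wit if rel == "M"}
sub_s = {t for rel, t in wit if rel == "S"}
sub_result, _ = evaluate_with_witnesses(sub_m, sub_s)

assert sub_result == full_result          # reproducible from the sub-database
assert len(sub_m) < len(measurements)     # strictly smaller than the original
```

Note that this only shows sufficiency of the witness base; the boundary conditions on minimality and homomorphic mappability discussed in Section 2 are not checked here.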
Copyright (C) 2020 for this paper by its authors. Copying permitted for private and academic purposes.

2. PROBLEM DESCRIPTION
Let us take a more detailed look at the different research questions. A detailed description of problems (I.) and (II.) can be found in [1]. In the course of the PhD project the third question (III.) arose, which illustrates the practical relevance of the conflict between provenance and privacy.

Calculation of a minimal Sub-database (Figure 2). Provided that the query result (green highlighted) and the evaluation query are archived, one specific problem is to determine the minimal (additional) information that is required for the reconstruction of the sub-database (red highlighted). Occasionally, we have to save entire tuples or parts of the database directly. Using provenance management, we can specify this necessary information. The minimal sub-database calculated in this way is able to reconstruct the results of the evaluation query under the following boundary conditions: (1) the number of tuples of the original is retained, (2) the sub-database can be homomorphically mapped to the original database, and (3) the sub-database is an intensional description of the original database.

Figure 2: Calculation of a minimal sub-database

Unification of Provenance and Evolution (Figure 3). Previous provenance queries have usually been processed on a given fixed database and an evaluation query. The combination of data provenance with schema and data evolution should enable the evaluation of provenance queries under changing schemata. Under evolution, the new query evaluation (green dotted) can be calculated directly as a composition of the original query evaluation and the inverse evolution (black). It is therefore sufficient to memorize one of the two minimal sub-databases I* (red highlighted) or J* (blue highlighted); the other sub-database can be calculated with the help of the inverse.

Figure 3: Unification of Provenance and Evolution

Privacy in the case of Query Inversion (Figure 4). The determination of a sub-database (red highlighted) based on the query result (green highlighted) is not always possible or permitted. For example, aggregated data cannot be inverted without storing additional information. Personal data, on the other hand, may not be published without a certain anonymization (see locked tuple). It is therefore necessary to generate a partial or generalized database that satisfies the provenance criteria on the one hand and does not contradict the privacy aspect on the other. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy).

Figure 4: Privacy in the case of query inversion

3. STATE OF THE ART
Dependencies. The best known conditions are key dependencies, functional dependencies (FD) or join dependencies (JD). These can be extended to much more general dependencies called (source-to-target) tuple generating dependencies ((s-t) tgds) and equality generating dependencies (egds). While s-t tgds are used as a kind of inter-database dependency, tgds (an s-t tgd on only one database schema) as well as egds can be seen as intra-database dependencies representing integrity constraints within a database [9, 10].

CHASE. The CHASE is a procedure that modifies a given object ◯ by incorporating a parameter ★. We represent this by chase★(◯) = ◯★. While the object can represent both queries and instances, we understand the parameter ★ as a set of dependencies like (s-t) tgds and/or egds. There are already first approaches to generalize the CHASE to (arbitrary) objects and parameters [3]. In our use case, s-t tgds create new tuples, and tgds/egds clean the database by replacing null values until the CHASEd database satisfies all given dependencies [13, 5].
The CHASE on instances can be used for data exchange, data integration, query answering on incomplete databases, or data cleaning, among others.

CHASE-inverse. A CHASE-inverse is an inverse function calculated via the CHASE algorithm. Weaker variants like the relaxed CHASE-inverse [10] and the tuple-preserving relaxed and result equivalent CHASE-inverses [2] do not guarantee an exact and unique inverse.

Data Provenance. Given a database instance I and an evaluation query Q, data provenance describes (1) where a result tuple r comes from (where-provenance), (2) why and (3) how r exists in the result Q(I). Why-provenance [6] specifies a witness base that identifies the tuples involved in the calculation of r. The question of how a result tuple r is calculated is answered by how-provenance using provenance polynomials. These polynomials give a concrete calculation of r. They are defined over a commutative semiring (N[X], +, ·, 0, 1) with + for union and projection as well as · for natural join [14]. Why and where can be derived from the result of the how-provenance. For this we can define a reduction based on the information content: where ⪯ why ⪯ how. Therefore, we often concentrate on how-provenance only. When including privacy aspects, however, why- and where-provenance should not be neglected.
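The semiring evaluation can be sketched directly. In the toy encoding below (our own illustration, not the implementation of [14]), a polynomial is a dictionary from monomials to natural-number coefficients; + models union and projection, · models natural join, and the witness base of why-provenance falls out by dropping coefficients and multiplicities:

```python
# Sketch of how-provenance over the semiring (N[X], +, ·, 0, 1), with
# polynomials encoded as {monomial: coefficient} and monomials as sorted
# tuples of annotation variables.  Toy encoding for illustration only.

def poly(var):
    return {(var,): 1}                       # a single annotation variable

def add(p, q):                               # union / projection: +
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + c
    return r

def mul(p, q):                               # natural join: ·
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = tuple(sorted(m1 + m2))
            r[m] = r.get(m, 0) + c1 * c2
    return r

# A result tuple derived by joining r1 with r2, or alternatively by
# joining r1 with r3, gets the polynomial  r1·r2 + r1·r3:
p = add(mul(poly("r1"), poly("r2")), mul(poly("r1"), poly("r3")))
assert p == {("r1", "r2"): 1, ("r1", "r3"): 1}

# The why-provenance (witness base) is recovered by dropping coefficients
# and multiplicities: each monomial's variable set is one witness.
witnesses = {frozenset(m) for m in p}
assert witnesses == {frozenset({"r1", "r2"}), frozenset({"r1", "r3"})}
```

The last step illustrates the reduction where ⪯ why ⪯ how: the polynomial determines the witness base, but not vice versa.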
Provenance under Schema Evolution. The description of schema development using schema modification operators (SMOs) such as CREATE table, ADD or DROP column enables schema and corresponding data changes [8]. First approaches to combining schema evolution and provenance are given in [12] and [11], the latter supporting a total of three types of provenance queries: (1) data provenance queries, (2) schema provenance queries and (3) statistics queries.

Privacy. Privacy refers to the protection of (personal) data against unauthorized collection, storage and publication. Important criteria in this context are, for example, k-anonymity and l-diversity [15]. These are necessary, since a tuple can often be uniquely identified by apparently harmless attributes, so-called quasi-identifiers. The goal of our research is to reconstruct the original database as accurately as possible. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy) [4].

4. PREVIOUS RESULTS
In order to unify the different theories, we represent evaluation queries, provenance queries and evolution functions as s-t tgds, so that the CHASE algorithm can be applied as a technique. The CHASE is thus a formalization of the evaluation as well as the evolution of the research database. In a second step, called the BACKCHASE process, we use the CHASE again to generate a provenance query based on the result of the evaluation query. Our theories of inverse functions can be developed from the already existing work of Fagin [10].

Calculation of a minimal Sub-database. Let I be a database instance, M a schema mapping defined by an s-t tgd, and I* = chaseM*(chaseM(I)) the minimal sub-database calculated by applying the CHASE twice, red highlighted in Figure 2. While an exact CHASE-inverse always reconstructs the original database itself, weaker variants only require data exchange equivalence and the existence of a homomorphism between I* and I. To preserve the number of tuples, we define the tuple-preserving relaxed CHASE-inverse (tp-relaxed). To specify a CHASE-inverse for aggregate functions as well, we define the result equivalent CHASE-inverse [2]. Both definitions are based on the theory of Fagin [10]. The result equivalent CHASE-inverse is the weakest CHASE-inverse. Overall, this results in the reduction

result equivalent ⪯ relaxed ⪯ tp-relaxed ⪯ exact,

which forms the sufficient conditions for the existence of a CHASE-inverse. The necessary conditions, on the other hand, refer to the existence of homomorphisms, an equal number of tuples as well as result equivalence [2].

An exact (=), (tp-)relaxed or result equivalent (↔) CHASE-inverse can be specified for each basic operation like π, ⋈, σ or AVG. Adding provenance information such as provenance polynomials [14] and (minimal) witness bases [6] allows the specification of stronger CHASE-inverse schema mappings than without. Thus, in the case of a projection, formalized as R(a, b, c) → S(a, c), the inverse function S(a, c) → ∃d : R(a, d, c) is tp-relaxed instead of relaxed. For other operations such as selection, on the other hand, the inverse type cannot be improved despite additional information [2].

Unification of Provenance and Evolution. Given a database instance I and its evolution J, we can differentiate between 15 schema modifications like CREATE table, ADD or DROP column. These modifications can be formalized using the schema modification operators defined in [11]. We examined the most common schema modification operators with reference to their CHASE-inverses and extended them with why- and how-provenance as well as additional annotations. The most common operators are DECOMPOSE, JOIN and MERGE Table as well as MERGE Column. Currently we are evaluating the remaining 11 operators for their CHASE-inverse functions without and with the use of data provenance. Among other things, we found that in some research institutions the SMOs defined by Zaniolo et al. are not sufficient [11]. In cooperation with the Leibniz Institute for Baltic Sea Research Warnemünde, we were able to determine that their schema modifications contain a lot of merging and splitting operations. We therefore define two operators, MERGE Column and SPLIT Column, as sequences of ADD and DROP operations. The merge function can be formalized as R(a, b, c) → S(b, f(a, c)) with an inverse function S(b, f(a, c)) → ∃d, e : R(d, b, e). Depending on whether provenance information is used or not, we can generate a tp-relaxed or an exact CHASE-inverse, resulting in a better reconstructed sub-database.

Privacy in the case of Query Inversion. For us, the term privacy goes beyond the term of (usually personal) data protection. Rather, we refer to the protection of research data in general. Reasons for protecting research data include economic (company protection), personal (personal data) or financial aspects. The creation of such data is often time-consuming and expensive. The identification of personal or internal company information should also be strictly prevented.

When reconstructing the minimal sub-database, only those tuples may be reconstructed which do not contradict privacy aspects. Depending on the selected provenance, different data protection problems have to be considered [4]: (1) Using relation names as where-provenance, there is generally not enough data worth protecting, and reproducibility of the data is not guaranteed. Data protection aspects are therefore negligible. (2) In the case of why-provenance, we may encounter privacy problems if the variance of the distribution of attribute values is equal to zero. However, this only applies in special cases not known to the user interpreting the results of the provenance queries. (3) How-provenance often calculates too much recoverable information, so that privacy aspects are likely to be a major problem with this approach. (4) If we interpret where as tuple names and save not only the schema but the tuple itself, this can lead to major privacy problems. However, this second where approach is the subject of our current work.
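Case (2) can be illustrated with the k-anonymity criterion from Section 3. The following sketch (invented column names and data, not a tool of ours) groups tuples by their quasi-identifiers, reports the smallest group size, and shows how a sensitive attribute with zero variance in a group leaks information even in a k-anonymous table:

```python
# Hedged sketch of a k-anonymity check: tuples are grouped by their
# quasi-identifier values, and the table is k-anonymous if every group
# contains at least k tuples.  A group whose sensitive values all
# coincide (variance zero) still leaks information even when
# k-anonymity holds.  Column names and data are illustrative only.

from collections import defaultdict

def k_anonymity(rows, quasi_ids):
    """Smallest group size over the quasi-identifier columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)].append(row)
    return min(len(g) for g in groups.values())

rows = [
    {"zip": "180**", "age": "2*", "diagnosis": "flu"},
    {"zip": "180**", "age": "2*", "diagnosis": "flu"},
    {"zip": "181**", "age": "3*", "diagnosis": "flu"},
    {"zip": "181**", "age": "3*", "diagnosis": "cold"},
]

assert k_anonymity(rows, ["zip", "age"]) == 2   # table is 2-anonymous

# Homogeneity: in the first group every sensitive value is "flu", so an
# attacker knowing someone's quasi-identifiers learns the diagnosis anyway.
group1 = [r["diagnosis"] for r in rows if r["zip"] == "180**"]
assert len(set(group1)) == 1
```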
For solving the problems generated by the different provenance queries, various approaches such as generalization and suppression, permutation of attribute values, differential privacy, and intensional (instead of extensional) answers have been developed. The next step is now to examine them for their compatibility with where-, why- and how-provenance.

Generalization of the CHASE. We are currently working on adapting the CHASE variant presented in [5] to (arbitrary) objects ◯ and parameters ★. Initial approaches to this are described in [3]. The parameter ★ is always represented as an intra- or inter-database dependency, and the predefined hierarchy of dependencies JD ⪯ tgd ⪯ s-t tgd and FD ⪯ egd allows all dependencies to be displayed as s-t tgds or egds, respectively. Views can also be displayed as such dependencies. This allows the usage of one general parameter for all of today's relevant CHASE applications like semantic optimization, answering queries using views, data exchange and data cleaning, query rewriting and many more.

The CHASE object is either a query Q or a database instance I. In both cases, variables/null values can be replaced by other variables/null values or constants. The variable substitution depends on certain conditions shown in [3]. Our goal here is to develop a tool that executes multiple CHASE applications. To the best of our knowledge, all currently existing tools are designed for only one use case.

5. FUTURE WORK
Both in the calculation of a minimal sub-database and in the unification of provenance and evolution, there are still open questions to be answered. First, a concrete BACKCHASE process must be defined using provenance polynomials and witness bases. Secondly, we have to evaluate the remaining 11 SMOs for their CHASE-inverse functions without and with the use of data provenance. Further, we need to examine the different privacy approaches for their compatibility with where-, why-, and how-provenance.

In addition to these specific questions, we also want to know whether our results are applicable to use cases other than research data management. And of course we hope to further develop the theory of the generalized CHASE a little bit.

6. CONCLUSION
We presented the current status of and plans to extend our work in the field of provenance management considering evolution and privacy aspects. We defined the tp-relaxed and result equivalent CHASE-inverses, examined the different evaluation and evolution operators for their inverse types without and with the use of data provenance, and started to critically study the correlation between provenance and privacy.

7. ACKNOWLEDGMENTS
Special thanks go to my PhD supervisor Andreas Heuer and my mentor Goetz Graefe as well as to my colleagues from the database chair of the University of Rostock for their support during my PhD studies so far. Thanks to the Leibniz Institute for Baltic Sea Research Warnemünde for providing their research data, and thanks also to my students for the many interesting discussions about privacy, provenance, evolution and the CHASE algorithm.

8. REFERENCES
[1] T. Auge and A. Heuer. Combining Provenance Management and Schema Evolution. In IPAW, volume 11017 of Lecture Notes in Computer Science, pages 222–225. Springer, 2018.
[2] T. Auge and A. Heuer. The Theory behind Minimizing Research Data — Result equivalent CHASE-inverse Mappings. In LWDA, volume 2191 of CEUR Workshop Proceedings, pages 1–12. CEUR-WS.org, 2018.
[3] T. Auge and A. Heuer. ProSA — Using the CHASE for Provenance Management. In ADBIS, volume 11695 of Lecture Notes in Computer Science, pages 357–372. Springer, 2019.
[4] T. Auge, N. Scharlau, and A. Heuer. Privacy Aspects of Provenance Queries. Accepted for ProvenanceWeek, 2020.
[5] M. Benedikt, G. Konstantinidis, G. Mecca, B. Motik, P. Papotti, D. Santoro, and E. Tsamoura. Benchmarking the Chase. In PODS, pages 37–52. ACM, 2017.
[6] P. Buneman, S. Khanna, and W. C. Tan. Why and Where: A Characterization of Data Provenance. In ICDT, volume 1973 of Lecture Notes in Computer Science, pages 316–330. Springer, 2001.
[7] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1(4):379–474, 2009.
[8] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo. Update rewriting and integrity constraint maintenance in a schema evolution support system: PRISM++. Proc. VLDB Endow., 4(2):117–128, 2010.
[9] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data Exchange: Semantics and Query Answering. Theor. Comput. Sci., 336(1):89–124, 2005.
[10] R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Schema Mapping Evolution Through Composition and Inversion. In Schema Matching and Mapping, pages 191–222. Springer, 2011.
[11] S. Gao and C. Zaniolo. Provenance Management in Databases Under Schema Evolution. In TaPP. USENIX Association, 2012.
[12] B. Glavic, G. Alonso, R. J. Miller, and L. M. Haas. TRAMP: Understanding the Behavior of Schema Mappings through Provenance. Proc. VLDB Endow., 3(1):1314–1325, 2010.
[13] S. Greco, C. Molinaro, and F. Spezzano. Incomplete Data and Data Dependencies in Relational Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012.
[14] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40. ACM, 2007.
[15] P. Samarati. Protecting Respondents' Identities in Microdata Release. IEEE Trans. Knowl. Data Eng., 13(6):1010–1027, 2001.