Extended Provenance Management for Data Science Applications

Tanja Auge
Supervised by Prof. Andreas Heuer
University of Rostock, Germany
tanja.auge@uni-rostock.de

Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020, Tokyo, Japan.

ABSTRACT
Research data management deals with tracking and archiving of data collected during scientific projects, experiments or observations. The path from data collection to publication should thus be kept comprehensible, reconstructable and plausible. The continuous growth of data, frequent schema changes as well as the varied evaluation of the data make the storage of every possible database state a very complicated and lengthy task.

With the help of data provenance, however, we can determine which part of the primary research data must be stored long-term in order to ensure the reproducibility of the evaluations. It should also be possible to recalculate changes to data and schemata so that old data records do not have to be archived completely. In addition, the stored data must not conflict with existing privacy guidelines.

Figure 1: Query Evaluation, Provenance and Schema Evolution

1. INTRODUCTION
The presentation and publication of research results increasingly requires the publication of the corresponding research data, which ensures findability, accessibility, interoperability, and reusability in the sense of the FAIR Data Principles (https://www.go-fair.org/). In our research, we concentrate on structured data, e.g. resulting from measurement series, experiments, or always-on sensors, stored in a relational database. The FAIR principles do not necessarily mean that the entire database of a project has to be stored and published, but only that part of the database which is necessary for the traceability and/or reproducibility of the respective publication. The difficulty now is to generate exactly this minimal part of the original research database, called the sub-database. The need for data reduction can be due to high costs in collecting or evaluating the data (expensive or elaborately produced), to privacy aspects when evaluating personal data, or to intellectual property preservation. In our case, the evaluation of the research database is restricted to a relational query language, starting from conjunctive queries and adding arithmetic or aggregation functions later on.

Thus, our goal is not only to store the evaluation query and the query result itself, but also the relevant source data. However, if data and/or schema change frequently, the original database must be "frozen" and saved after each evaluation carried out on the dataset. To avoid this, we use provenance management techniques [7, 6] to calculate the sub-database before (red highlighted) or after evolution (blue highlighted) that is required to reproduce the query result (green highlighted in Figure 1).

After the General Data Protection Regulation (GDPR) came into force, it became apparent that the storage of research data, even without containing personal data, may fall under the aspect of privacy. Reasons for the additional privacy requirements are the high costs, the large amount of time as well as the great effort required to collect the research data. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy) for reasons of competition.

All in all, this results in three central research questions, which are summarized in Figure 1:
(I.) How to calculate the minimal part of the original research database that has to be stored permanently to achieve replicable research? (red highlighted)
(II.) How to unify the theories behind data provenance and schema evolution? (blue highlighted)
(III.) How to combine data provenance and privacy aspects in the case of query inversion? (locked tuple)
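Question (I.) can be made concrete with a small sketch. The toy example below (our own illustration with invented relations, not the actual implementation) evaluates a simple select-join query over two relations, records the witness tuples behind every result tuple, and verifies that these witnesses alone, i.e. the sub-database, reproduce the query result:

```python
# Toy sketch (invented relations, not the paper's system): for a simple
# select-join query we record, for every result tuple, the source tuples
# that produced it (its witness base).  The union of all witnesses forms
# a sub-database that suffices to reproduce the result.

def evaluate_with_witnesses(measurements, sensors):
    """Join measurements with sensors and keep readings above a threshold.

    Returns the query result together with the witness tuples used.
    """
    result, witnesses = set(), set()
    for m in measurements:           # m = (sensor_id, value)
        for s in sensors:            # s = (sensor_id, location)
            if m[0] == s[0] and m[1] > 20:
                result.add((s[1], m[1]))
                witnesses.add(("M", m))
                witnesses.add(("S", s))
    return result, witnesses

measurements = {(1, 25), (1, 10), (2, 30)}
sensors = {(1, "north"), (2, "south"), (3, "west")}

full_result, wit = evaluate_with_witnesses(measurements, sensors)

# Reconstruct the sub-database from the witness base and re-run the query:
sub_m = {t for rel, t in wit if rel == "M"}
sub_s = {t for rel, t in wit if rel == "S"}
sub_result, _ = evaluate_with_witnesses(sub_m, sub_s)

assert sub_result == full_result          # reproducible from the sub-database
assert len(sub_m) < len(measurements)     # strictly smaller than the original
```

Note that this only shows sufficiency of the witness base; the boundary conditions on minimality and homomorphic mappability discussed in Section 2 are not checked here.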
Copyright (C) 2020 for this paper by its authors. Copying permitted for private and academic purposes.

2. PROBLEM DESCRIPTION
Let us take a more detailed look at the different research questions. A detailed description of problems (I.) and (II.) can be found in [1]. In the course of the PhD project the third question (III.) arose, which illustrates the practical relevance of the conflict between provenance and privacy.

Calculation of a minimal Sub-database (Figure 2). Provided that the query result (green highlighted) and the evaluation query are archived, one specific problem is to determine the minimal (additional) information that is required for the reconstruction of the sub-database (red highlighted). Occasionally, we have to save entire tuples or parts of the database directly. Using provenance management, we can specify this necessary information. The minimal sub-database calculated in this way is able to reconstruct the results of the evaluation query under the following boundary conditions: (1) the number of tuples of the original is retained, (2) the sub-database can be homomorphically mapped to the original database, and (3) the sub-database is an intensional description of the original database.

Figure 2: Calculation of a minimal sub-database

Unification of Provenance and Evolution (Figure 3). Previous provenance queries have usually been processed on a given fixed database and an evaluation query. The combination of data provenance with schema and data evolution should enable the evaluation of provenance queries under changing schemata. Under evolution, the new query evaluation (green dotted) can be calculated directly as a composition of the original query evaluation and the inverse evolution (black). It is therefore sufficient to memorize one of the two minimal sub-databases I* (red highlighted) or J* (blue highlighted); the other sub-database can be calculated with the help of the inverse.

Figure 3: Unification of Provenance and Evolution

Privacy in the case of Query Inversion (Figure 4). The determination of a sub-database (red highlighted) based on the query result (green highlighted) is not always possible or permitted. For example, aggregated data cannot be inverted without storing additional information. Personal data, on the other hand, may not be published without a certain anonymization (see locked tuple). It is therefore necessary to generate a partial or generalized database that satisfies the provenance criteria on the one hand and does not contradict the privacy aspect on the other. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy).

Figure 4: Privacy in the case of query inversion

3. STATE OF THE ART
Dependencies. The best known conditions are key dependencies, functional dependencies (FD) or join dependencies (JD). These can be extended to much more general dependencies called (source-to-target) tuple generating dependencies ((s-t) tgds) and equality generating dependencies (egds). While s-t tgds are used as a kind of inter-database dependency, tgds (an s-t tgd on only one database schema) as well as egds can be seen as intra-database dependencies representing integrity constraints within a database [9, 10].

CHASE. The CHASE is a procedure that modifies a given object ◯ by incorporating a parameter ★. We represent this by chase★(◯) = ◯★. While the object can represent both queries and instances, we understand the parameter ★ as a set of dependencies like (s-t) tgds and/or egds. There are already first approaches to generalize the CHASE to (arbitrary) objects and parameters [3]. In our use case, s-t tgds create new tuples, and tgds/egds clean the database by replacing null values until the CHASEd database satisfies all given dependencies [13, 5].
The CHASE on instances can be used for data exchange, data integration, query answering on incomplete databases, or data cleaning, among others.

CHASE-inverse. A CHASE-inverse is an inverse function calculated via the CHASE algorithm. Weaker variants like the relaxed CHASE-inverse [10] and the tuple-preserving relaxed and result equivalent CHASE-inverses [2] do not guarantee an exact and unique inverse.

Data Provenance. Given a database instance I and an evaluation query Q, data provenance describes (1) where a result tuple r comes from (where-provenance), (2) why and (3) how r exists in the result Q(I). Why-provenance [6] specifies a witness base that identifies the tuples involved in the calculation of r. The question of how a result tuple r is calculated is answered by how-provenance using provenance polynomials. These polynomials give a concrete calculation of r. They are defined over a commutative semiring (N[X], +, ·, 0, 1) with + for union and projection as well as · for natural join [14]. Why and where can be derived from the result of the how-provenance. For this we can define a reduction based on the information content: where ⪯ why ⪯ how. Therefore, we often concentrate on how-provenance only. When including privacy aspects, however, why- and where-provenance should not be neglected.
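The semiring evaluation can be sketched directly. In the toy encoding below (our own illustration, not the implementation of [14]), a polynomial is a dictionary from monomials to natural-number coefficients; + models union and projection, · models natural join, and the witness base of why-provenance falls out by dropping coefficients and multiplicities:

```python
# Sketch of how-provenance over the semiring (N[X], +, ·, 0, 1), with
# polynomials encoded as {monomial: coefficient} and monomials as sorted
# tuples of annotation variables.  Toy encoding for illustration only.

def poly(var):
    return {(var,): 1}                       # a single annotation variable

def add(p, q):                               # union / projection: +
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + c
    return r

def mul(p, q):                               # natural join: ·
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = tuple(sorted(m1 + m2))
            r[m] = r.get(m, 0) + c1 * c2
    return r

# A result tuple derived by joining r1 with r2, or alternatively by
# joining r1 with r3, gets the polynomial  r1·r2 + r1·r3:
p = add(mul(poly("r1"), poly("r2")), mul(poly("r1"), poly("r3")))
assert p == {("r1", "r2"): 1, ("r1", "r3"): 1}

# The why-provenance (witness base) is recovered by dropping coefficients
# and multiplicities: each monomial's variable set is one witness.
witnesses = {frozenset(m) for m in p}
assert witnesses == {frozenset({"r1", "r2"}), frozenset({"r1", "r3"})}
```

The last step illustrates the reduction where ⪯ why ⪯ how: the polynomial determines the witness base, but not vice versa.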
Provenance under Schema Evolution. The description of schema development using schema modification operators (SMOs) such as CREATE table, ADD or DROP column enables schema and corresponding data changes [8]. First approaches to combining schema evolution and provenance are given in [12] and [11], the latter supporting a total of three types of provenance queries: (1) data provenance queries, (2) schema provenance queries and (3) statistics queries.

Privacy. Privacy refers to the protection of (personal) data against unauthorized collection, storage and publication. Important criteria in this context are, for example, k-anonymity and l-diversity [15]. These are necessary, since a tuple can often be uniquely identified by apparently harmless attributes, so-called quasi-identifiers. The goal of our research is to reconstruct the original database as accurately as possible. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy) [4].

4. PREVIOUS RESULTS
In order to unify the different theories, we represent evaluation queries, provenance queries and evolution functions as s-t tgds, so that the CHASE algorithm can be applied as a technique. The CHASE is thus a formalization of the evaluation as well as the evolution of the research database. In a second step, called the BACKCHASE process, we use the CHASE again to generate a provenance query based on the result of the evaluation query. Our theories of inverse functions can be developed from the already existing work of Fagin [10].

Calculation of a minimal Sub-database. Let I be a database instance, M a schema mapping defined by an s-t tgd, and I* = chaseM*(chaseM(I)) the minimal sub-database calculated by applying the CHASE twice, red highlighted in Figure 2. While an exact CHASE-inverse always reconstructs the original database itself, weaker variants only require data exchange equivalence and the existence of a homomorphism between I* and I. To preserve the number of tuples, we define the tuple-preserving relaxed CHASE-inverse (tp-relaxed). To specify a CHASE-inverse for aggregate functions as well, we define the result equivalent CHASE-inverse [2]. Both definitions are based on the theory of Fagin [10]. The result equivalent CHASE-inverse is the weakest CHASE-inverse. Overall, this results in the reduction

result equivalent ⪯ relaxed ⪯ tp-relaxed ⪯ exact,

which forms the sufficient conditions for the existence of a CHASE-inverse. The necessary conditions, on the other hand, refer to the existence of homomorphisms, an equal number of tuples as well as result equivalence [2].

An exact (=), (tp-)relaxed or result equivalent (↔) CHASE-inverse can be specified for each basic operation like π, ⋈, σ or AVG. Adding provenance information such as provenance polynomials [14] and (minimal) witness bases [6] allows the specification of stronger CHASE-inverse schema mappings than without. Thus, in the case of a projection, formalized as R(a, b, c) → S(a, c), the inverse function S(a, c) → ∃d : R(a, d, c) is tp-relaxed instead of relaxed. For other operations such as selection, on the other hand, the inverse type cannot be improved despite additional information [2].

Unification of Provenance and Evolution. Given a database instance I and its evolution J, we can differentiate between 15 schema modifications like CREATE table, ADD or DROP column. These modifications can be formalized using the schema modification operators defined in [11]. We examined the most common schema modification operators with reference to their CHASE-inverses and extended them with why- and how-provenance as well as additional annotations. The most common operators are DECOMPOSE, JOIN and MERGE Table as well as MERGE Column. Currently we are evaluating the remaining 11 operators for their CHASE-inverse functions without and with the use of data provenance. Among other things, we found that in some research institutions the SMOs defined by Zaniolo et al. are not sufficient [11]. In cooperation with the Leibniz Institute for Baltic Sea Research Warnemünde, we were able to determine that their schema modifications contain a lot of merging and splitting operations. We therefore define two operators, MERGE Column and SPLIT Column, as sequences of ADD and DROP operations. The merge function can be formalized as R(a, b, c) → S(b, f(a, c)) with an inverse function S(b, f(a, c)) → ∃d, e : R(d, b, e). Depending on whether provenance information is used or not, we can generate a tp-relaxed or an exact CHASE-inverse, resulting in a better reconstructed sub-database.

Privacy in the case of Query Inversion. For us, the term privacy goes beyond the term of (usually personal) data protection. Rather, we refer to the protection of research data in general. Reasons for protecting research data include economic (company protection), personal (personal data) or financial aspects. The creation of such data is often time-consuming and expensive. The identification of personal or internal company information should also be strictly prevented.

When reconstructing the minimal sub-database, only those tuples may be reconstructed which do not contradict privacy aspects. Depending on the selected provenance, different data protection problems have to be considered [4]: (1) Using relation names as where-provenance, there is generally not enough data worth protecting, and reproducibility of the data is not guaranteed. Data protection aspects are therefore negligible. (2) In the case of why-provenance, we may encounter privacy problems if the variance of the distribution of attribute values is equal to zero. However, this only applies in special cases not known to the user interpreting the results of the provenance queries. (3) How-provenance often calculates too much recoverable information, so that privacy aspects are likely to be a major problem with this approach. (4) If we interpret where as tuple names and save not only the schema but the tuple itself, this can lead to major privacy problems. However, this second where approach is the subject of our current work.
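Case (2) can be illustrated with the k-anonymity criterion from Section 3. The following sketch (invented column names and data, not a tool of ours) groups tuples by their quasi-identifiers, reports the smallest group size, and shows how a sensitive attribute with zero variance in a group leaks information even in a k-anonymous table:

```python
# Hedged sketch of a k-anonymity check: tuples are grouped by their
# quasi-identifier values, and the table is k-anonymous if every group
# contains at least k tuples.  A group whose sensitive values all
# coincide (variance zero) still leaks information even when
# k-anonymity holds.  Column names and data are illustrative only.

from collections import defaultdict

def k_anonymity(rows, quasi_ids):
    """Smallest group size over the quasi-identifier columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)].append(row)
    return min(len(g) for g in groups.values())

rows = [
    {"zip": "180**", "age": "2*", "diagnosis": "flu"},
    {"zip": "180**", "age": "2*", "diagnosis": "flu"},
    {"zip": "181**", "age": "3*", "diagnosis": "flu"},
    {"zip": "181**", "age": "3*", "diagnosis": "cold"},
]

assert k_anonymity(rows, ["zip", "age"]) == 2   # table is 2-anonymous

# Homogeneity: in the first group every sensitive value is "flu", so an
# attacker knowing someone's quasi-identifiers learns the diagnosis anyway.
group1 = [r["diagnosis"] for r in rows if r["zip"] == "180**"]
assert len(set(group1)) == 1
```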
For solving the problems generated by the different provenance queries, various approaches such as generalization and suppression, permutation of attribute values, differential privacy, and intensional (instead of extensional) answers have been developed. The next step is now to examine them for their compatibility with where-, why- and how-provenance.

Generalization of the CHASE. We are currently working on adapting the CHASE variant presented in [5] to (arbitrary) objects ◯ and parameters ★. Initial approaches to this are described in [3]. The parameter ★ is always represented as an intra- or inter-database dependency, and the predefined hierarchy of dependencies JD ⪯ tgd ⪯ s-t tgd and FD ⪯ egd allows all dependencies to be displayed as s-t tgds or egds, respectively. Views can also be displayed as such dependencies. This allows the usage of one general parameter for all of today's relevant CHASE applications like semantic optimization, answering queries using views, data exchange and data cleaning, query rewriting and many more.

The CHASE object is either a query Q or a database instance I. In both cases, variables/null values can be replaced by other variables/null values or constants. The variable substitution depends on certain conditions shown in [3]. Our goal here is to develop a tool that executes multiple CHASE applications. To the best of our knowledge, all currently existing tools are designed for only one use case.

5. FUTURE WORK
Both in the calculation of a minimal sub-database and in the unification of provenance and evolution, there are still open questions to be answered. First, a concrete BACKCHASE process must be defined using provenance polynomials and witness bases. Secondly, we have to evaluate the remaining 11 SMOs for their CHASE-inverse functions without and with the use of data provenance. Further, we need to examine the different privacy approaches for their compatibility with where-, why-, and how-provenance.

In addition to these specific questions, we also want to know whether our results are applicable to use cases other than research data management. And of course we hope to further develop the theory of the generalized CHASE a little bit.

6. CONCLUSION
We presented the current status of and plans to extend our work in the field of provenance management considering evolution and privacy aspects. We defined the tp-relaxed and result equivalent CHASE-inverses, examined the different evaluation and evolution operators for their inverse types without and with the use of data provenance, and started to critically study the correlation between provenance and privacy.

7. ACKNOWLEDGMENTS
Special thanks go to my PhD supervisor Andreas Heuer and my mentor Goetz Graefe as well as to my colleagues from the database chair of the University of Rostock for their support during my PhD studies so far. Thanks to the Leibniz Institute for Baltic Sea Research Warnemünde for providing their research data, and thanks also to my students for the many interesting discussions about privacy, provenance, evolution and the CHASE algorithm.

8. REFERENCES
[1] T. Auge and A. Heuer. Combining Provenance Management and Schema Evolution. In IPAW, volume 11017 of Lecture Notes in Computer Science, pages 222–225. Springer, 2018.
[2] T. Auge and A. Heuer. The Theory behind Minimizing Research Data — Result equivalent CHASE-inverse Mappings. In LWDA, volume 2191 of CEUR Workshop Proceedings, pages 1–12. CEUR-WS.org, 2018.
[3] T. Auge and A. Heuer. ProSA — Using the CHASE for Provenance Management. In ADBIS, volume 11695 of Lecture Notes in Computer Science, pages 357–372. Springer, 2019.
[4] T. Auge, N. Scharlau, and A. Heuer. Privacy Aspects of Provenance Queries. Accepted for ProvenanceWeek, 2020.
[5] M. Benedikt, G. Konstantinidis, G. Mecca, B. Motik, P. Papotti, D. Santoro, and E. Tsamoura. Benchmarking the Chase. In PODS, pages 37–52. ACM, 2017.
[6] P. Buneman, S. Khanna, and W. C. Tan. Why and Where: A Characterization of Data Provenance. In ICDT, volume 1973 of Lecture Notes in Computer Science, pages 316–330. Springer, 2001.
[7] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1(4):379–474, 2009.
[8] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo. Update rewriting and integrity constraint maintenance in a schema evolution support system: PRISM++. Proc. VLDB Endow., 4(2):117–128, 2010.
[9] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data Exchange: Semantics and Query Answering. Theor. Comput. Sci., 336(1):89–124, 2005.
[10] R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Schema Mapping Evolution Through Composition and Inversion. In Schema Matching and Mapping, pages 191–222. Springer, 2011.
[11] S. Gao and C. Zaniolo. Provenance Management in Databases Under Schema Evolution. In TaPP. USENIX Association, 2012.
[12] B. Glavic, G. Alonso, R. J. Miller, and L. M. Haas. TRAMP: Understanding the Behavior of Schema Mappings through Provenance. Proc. VLDB Endow., 3(1):1314–1325, 2010.
[13] S. Greco, C. Molinaro, and F. Spezzano. Incomplete Data and Data Dependencies in Relational Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012.
[14] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40. ACM, 2007.
[15] P. Samarati. Protecting Respondents' Identities in Microdata Release. IEEE Trans. Knowl. Data Eng., 13(6):1010–1027, 2001.