Fine-grained Provenance for High-quality Data Science (Discussion Paper)

Adriane Chapman1, Paolo Missier2, Giulia Simonelli3 and Riccardo Torlone3
1 University of Southampton  2 Newcastle University  3 Università Roma Tre

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
adriane.chapman@soton.ac.uk (A. Chapman); paolo.missier@ncl.ac.uk (P. Missier); giulia.simonelli@uniroma3.it (G. Simonelli)

Abstract
In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.

1. Introduction
Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models' accuracy and performance, as well as on other properties, such as model fairness. However, while substantial recent research has produced techniques for model explanation that focus primarily on the model itself (e.g., [1, 2]), relatively little work has gone into explaining models in terms of the transformations that occur before the data is used for learning. In this work, we enable explanations of the effect of each transformation in a pre-processing pipeline on the data that is ultimately fed into a model. We consider transformations that apply to commonly used tabular or relational datasets and across application domains. These steps have been systematically enumerated in multiple reviews (see, e.g., [3]) and include, among others: feature selection; engineering of new features; imputation of missing values, or listwise deletion (excluding an entire record if data is missing on any variable for that record); downsampling or upsampling of data subsets in order to achieve better balance, typically on the class labels (for classification tasks) or on the distribution of the outcome variable (for regression tasks); outlier detection and removal; smoothing and normalisation; de-duplication; as well as steps that preserve the original information but are required by some algorithms, such as “one-hot” encoding of categorical variables. A complex pipeline may include some or all of these steps, and different techniques, algorithms, and choices of algorithm-specific parameters may be available for each of them. These are often grounded in established literature, but variations can be created by data scientists to suit specific needs. We consider the space of all configured pipelines that can potentially be composed out of these operators, and we focus on relational datasets, which are arguably the most common data structures in popular analytics-friendly scripting languages and frameworks like R, Spark, and Python (where they are called dataframes). In this framework, we propose a formalisation and categorisation of a core set of these operators.
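As a point of reference, the kind of preprocessing pipeline we target can be sketched in a few lines of pandas. The snippet below is only illustrative: the dataset, column names, and specific choices of operations are hypothetical, not taken from the paper's experiments.

```python
import pandas as pd

# Hypothetical toy dataset with missing values and a categorical feature.
df = pd.DataFrame({
    "Gender": ["F", "M", None, "F"],
    "Age":    [24, 28, None, 44],
    "Zip":    ["98567", None, "32768", "32768"],
})

# Imputation of a missing numeric value with the column mean.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Listwise deletion: drop records that still contain missing values.
df = df.dropna()

# "One-hot" encoding of a categorical variable.
df = pd.get_dummies(df, columns=["Gender"])

# Engineering of a new feature from an existing one.
df["AgeRange"] = (df["Age"] < 25).map({True: "young", False: "adult"})
```

Each of these steps silently rewrites, removes, or creates individual dataset elements; the goal of this work is to record exactly those element-level changes.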
Data derivations for such operators are expressed at the level of the atomic elements in the dataset using the PROV data model [4], a standard and widely adopted ontology. Then, with each of the core operators we associate a provenance pattern that describes its effect on the data at the appropriate level of detail, i.e., on individual dataset elements, columns, rows, or collections of those. The provenance patterns defined in this work for data science operators play a role similar to that of provenance polynomials [5], i.e., annotations associated with relational algebra operators to describe the fine-grained provenance of their results. Finally, we associate a provenance function pfo() with each operator o, which generates a provenance document pfo(𝐷) when a dataset 𝐷 is processed using o. The document is an instance of the pattern associated with o. Provenance functions are implemented as part of a Python module. Collecting all the provenance documents from each operator's execution results in a seamless, end-to-end provenance document that contains the detailed history of each dataset element in the final training set, including its creation (e.g., as a new derived feature), transformation (value imputation, for example) and possibly deletion (e.g., by feature selection or removal of null values). In the rest of this paper we illustrate the models we have adopted for describing data, operators, and provenance (Section 2) as well as the way in which provenance for a data preprocessing pipeline is captured (Section 3). Further details on the query capabilities of the resulting provenance and on the scalability of the overall approach can be found in the extended version of the paper [6].

2. Models for Data, Operators, and Provenance

2.1. Data model
A (dataset) schema 𝑆 is an array of distinct names called features: 𝑆 = [a1, . . . , a𝑛]. Each feature is associated with a domain of atomic values (such as numbers, strings, and timestamps). A dataset 𝐷 over a schema 𝑆 = [a1, . . . , a𝑛] is an ordered collection of rows (or records) of the form: 𝑖 : (𝑑𝑖1, . . . , 𝑑𝑖𝑛) where 𝑖 is the unique index of the row and each element 𝑑𝑖𝑗 (for 1 ≤ 𝑗 ≤ 𝑛) is either a value in the domain of the feature a𝑗 or the special symbol ⊥, denoting a missing value. Given a dataset 𝐷 over a schema 𝑆, we denote by 𝐷𝑖a the element for the feature a of 𝑆 occurring in the 𝑖-th row of 𝐷. We also denote by 𝐷𝑖* the 𝑖-th row of 𝐷, and by 𝐷*a the column of 𝐷 associated with the feature a of 𝑆.

Example 1. A possible dataset 𝐷 over the schema 𝑆 = [CId, Gender, Age, Zip] is as follows:

      CId   Gender   Age   Zip
  1   113   F        24    98567
  2   241   M        28    ⊥
  3   375   C        ⊥     32768
  4   578   F        44    32768

𝐷*Age and 𝐷2* denote the third column and the second row of 𝐷, respectively. □
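For concreteness, the dataset of Example 1 can be rendered as a pandas DataFrame, with NaN standing in for the missing-value symbol ⊥ and the DataFrame index playing the role of the row index 𝑖. This mapping is our own illustration (the formal model is independent of any particular library); the accessors shown simply mirror the notation 𝐷𝑖a, 𝐷𝑖*, and 𝐷*a.

```python
import numpy as np
import pandas as pd

# The dataset D of Example 1; NaN plays the role of the missing value ⊥.
D = pd.DataFrame(
    {"CId": [113, 241, 375, 578],
     "Gender": ["F", "M", "C", "F"],
     "Age": [24, 28, np.nan, 44],
     "Zip": [98567, np.nan, 32768, 32768]},
    index=[1, 2, 3, 4],   # the row index i of the formal model
)

element = D.at[2, "Age"]   # D_{2,Age}: the element for feature Age in row 2
row     = D.loc[2]         # D_{2*}:    the second row of D
column  = D["Age"]         # D_{*Age}:  the column for feature Age
```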
2.2. Data manipulation model
The data transformation operators that are available in packages for building data preprocessing pipelines (e.g., Orange [7] and SciKit [8]) can be classified in three main classes, according to the type of manipulation performed on the input dataset 𝐷 over a schema 𝑆, as follows.

Data reductions. They reduce the size of 𝐷 by eliminating rows (without changing 𝑆) or columns (changing 𝑆 to 𝑆′ ⊂ 𝑆) from 𝐷. Two basic data reduction operators are defined over datasets. They are simple extensions of two well-known relational operators.

𝜋𝐶 : the (conditional) projection of 𝐷 on a set of features of 𝑆 that satisfy a boolean condition 𝐶 over 𝑆, denoted by 𝜋𝐶(𝐷), is the dataset obtained from 𝐷 by including only the columns 𝐷*a of 𝐷 such that a is a feature of 𝑆 that satisfies 𝐶;

𝜎𝐶 : the selection of 𝐷 with respect to a boolean condition 𝐶 over 𝑆, denoted by 𝜎𝐶(𝐷), is the dataset obtained from 𝐷 by including the rows 𝐷𝑖* of 𝐷 satisfying 𝐶.

The condition of both the projection and the selection operators can refer to the values in 𝐷, as shown in the following example, which uses an intuitive syntax for the condition.

Example 2. Consider the dataset 𝐷 in Example 1. The result of the expression 𝜋{features without nulls}(𝜎Age<30(𝐷)) is the following dataset:

      CId   Gender   Age
  1   113   F        24
  2   241   M        28
□

Data augmentations. They increase the size of 𝐷 by adding rows (without changing 𝑆) or columns (changing 𝑆 to 𝑆′ ⊃ 𝑆) to 𝐷. Two basic data augmentation operators are defined over datasets. They allow the addition of columns and rows to a dataset, respectively.

𝛼→𝑓(𝑋):𝑌 : the vertical augmentation of 𝐷 to 𝑌 using a function 𝑓 over a subset of features 𝑋 = [a1 . . . a𝑘] of 𝑆 is obtained by adding to 𝐷 a new set of features 𝑌 = [a′1 . . . a′𝑙] whose new values 𝑑𝑖a′1 . . . 𝑑𝑖a′𝑙 for the 𝑖-th row are obtained by applying 𝑓 to 𝑑𝑖a1 . . . 𝑑𝑖a𝑘 ;

𝛼↓𝑋:𝑓(𝑌) : the horizontal augmentation of 𝐷 using an aggregative function 𝑓 is obtained by adding one or more new rows to 𝐷, obtained by first grouping over the features in 𝑋 and then, for each group, applying 𝑓 to 𝜋𝑌(𝐷) (extending the result to 𝑆 with nulls if needed).

Example 3. Consider again the dataset 𝐷 in Example 1 and the following functions: (i) 𝑓1, which associates the string young with an age less than 25 and the string adult otherwise, and (ii) 𝑓2, which computes the average of a set of numbers. Then, the expression 𝐸1 = 𝛼→𝑓1(Age):ageRange(𝐷) produces the following dataset:

      CId   Gender   Age   Zip     ageRange
  1   113   F        24    98567   young
  2   241   M        28    ⊥       adult
  3   375   C        ⊥     32768   ⊥
  4   578   F        44    32768   adult

whereas 𝐸2 = 𝛼↓Gender:𝑓2(Age)(𝐷) produces the dataset:

      CId   Gender   Age   Zip
  1   113   F        24    98567
  2   241   M        28    ⊥
  3   375   C        ⊥     32768
  4   578   F        44    32768
  5   ⊥     F        34    ⊥
  6   ⊥     M        28    ⊥
□

Data transformation. The goal is to transform (some of) the elements in 𝐷 without changing its size or its schema. One basic data transformation operator is defined over datasets:

𝜏𝑓(𝑋) : the transformation of a set of features 𝑋 of 𝐷 using a function 𝑓 is obtained by replacing each value 𝑑𝑖a with 𝑓(𝑑*a), for each a occurring in 𝑋.

Example 4. Let 𝐷 be the dataset in Example 1 and 𝑓 be an imputation function that associates with the ⊥’s occurring in a feature a the most frequent value occurring in 𝐷*a. Then, the result of 𝜏𝑓(Zip)(𝐷) is the following dataset:

      CId   Gender   Age   Zip
  1   113   F        24    98567
  2   241   M        28    32768
  3   375   C        ⊥     32768
  4   578   F        44    32768
□

In [6] we illustrate how a large variety of pre-processing operators that are often used in data preparation workflows (including feature and instance selection, data repair, binarization, normalization, discretization, imputation, space transformation, and one-hot encoding) can be suitably expressed as compositions of the basic operators introduced in this section.
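To make the semantics of the five core operators concrete, the following is a minimal sketch of how they could be rendered over pandas DataFrames. This is our own illustration, not the authors' implementation; the function names and signatures are ours, and the handling of null values is simplified.

```python
import pandas as pd

def project(D, col_condition):
    """pi_C: keep only the columns whose Series satisfies col_condition."""
    return D[[a for a in D.columns if col_condition(D[a])]]

def select(D, row_condition):
    """sigma_C: keep only the rows satisfying row_condition."""
    return D[D.apply(row_condition, axis=1)]

def vertical_augmentation(D, f, X, Y):
    """alpha-> : add a new feature Y computed by f from the features X, row by row."""
    D = D.copy()
    D[Y] = D[X].apply(lambda row: f(*row), axis=1)
    return D

def horizontal_augmentation(D, X, f, Y):
    """alpha-down: append one row per group over X, with f aggregating column Y."""
    agg = D.groupby(X, as_index=False)[Y].agg(f)
    return pd.concat([D, agg], ignore_index=True)

def transform(D, f, X):
    """tau: replace each column in X with f applied to that column."""
    D = D.copy()
    for a in X:
        D[a] = f(D[a])
    return D
```

Under this rendering, project(select(D, lambda r: r["Age"] < 30), lambda col: col.notna().all()) reproduces Example 2; the augmentation and transformation operators are deliberately simplified (for instance, groups whose values are all null receive no special treatment).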
2.3. Data provenance model
The purpose of data provenance is to extract relatively simple explanations for the existence (or the absence) of some piece of data in the result of complex data manipulations. Along this line, we adopt as the provenance model a subset of the PROV model [9] from the W3C. In PROV, an entity represents an element 𝑑 of a dataset 𝐷 and is uniquely identified by 𝐷 and the coordinates of 𝑑 in 𝐷 (i.e., the corresponding row index and feature). An activity represents any pre-processing data manipulation that operates over datasets. For each element 𝑑 in a dataset 𝐷′ generated by an operation o over a dataset 𝐷 we represent the facts that: (i) 𝑑 wasGeneratedBy o, and (ii) 𝑑 wasDerivedFrom a set of elements in 𝐷. In addition, we represent: (iii) all the elements 𝑑 of 𝐷 such that 𝑑 was used by o and (iv) all the elements 𝑑 of 𝐷 such that 𝑑 wasInvalidatedBy (i.e., deleted by) o (if any).

Example 5. Let 𝐸 be the first expression in Example 3 and 𝐷′ = 𝐸(𝐷). A fragment of the provenance generated by this operation is reported in Figure 1. □

Figure 1: A fragment of provenance data for the operation in Example 5.

3. Capturing Provenance

3.1. Provenance templates
In order to capture the provenance of a pipeline 𝑝 consisting of a sequence of preprocessing operations o1, . . . , o𝑛, we associate a provenance-generating function (p-gen) with each operation o𝑘 occurring in 𝑝. Each such function generates a collection of provenance data whenever a dataset is processed using o𝑘, which describes the effect of o𝑘 on the data at the appropriate level of detail. As a large variety of preprocessing operators can be defined in terms of our five core pipeline operators [6], it is enough to define a p-gen function for each of these operators. Each p-gen function takes inputs 𝐷, 𝐷′ (the input and output of its associated operator) along with a description of the operator itself, and produces a PROV document that describes the transformation produced by the operator on each element of 𝐷. The output PROV document is obtained by instantiating an appropriate provenance template [10], which is designed to capture the transformation at the most granular level, i.e., at the level of individual elements of 𝐷, or its rows or columns, as appropriate. A template is simply a PROV document where: (i) variables, indicated by the namespace var:, are used as placeholders for values, and (ii) a set of rules is used to specify how the “used” and the “generated” sides of the template are repeatedly instantiated, by binding the variables to each of the data items involved in the transformation. We refer to each instantiated template produced by a p-gen function as a provlet. Take for example the case of Vertical Augmentation (VA): 𝛼→𝑓1(Age):ageRange(𝐷), which we used in Example 3, where attribute Age is binarised into {young, adult} based on a pre-defined cutoff, defined as part of 𝑓1(). The p-gen function for VA will produce a collection of small PROV documents, one for each input-output pair ⟨𝐷𝑖,Age, 𝐷′𝑖,ageRange⟩, as shown in the example. As these documents all share the same structure, we specify p-gen by giving two elements. First, a single PROV template for VA, shown in Figure 2, where we use the generic attribute names 𝑋, 𝑌 to indicate the old and new feature names, as per the operator's general definition in Section 2.2. Second, the bindings that map the template variables to the dataset elements involved in each transformation, which yield the instances shown in Figure 2.

Figure 2: PROV template for Vertical Augmentation and corresponding instances.
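As an illustration of what a p-gen function might emit, the sketch below generates one provlet per row for the vertical augmentation operator, using the W3C PROV Python package (prov). The identifiers, namespace, and function signature are our own assumptions and do not reproduce the authors' template of Figure 2.

```python
from prov.model import ProvDocument

def pgen_vertical_augmentation(D, X, Y, op_label="va"):
    """Illustrative p-gen sketch: for each row i, the new element D'_{i,Y}
    wasGeneratedBy the VA activity and wasDerivedFrom each D_{i,a}, a in X.
    Identifiers and attributes are hypothetical, not the authors' template."""
    provlets = []
    for i in D.index:
        doc = ProvDocument()
        doc.add_namespace("ex", "http://example.org/")
        act = doc.activity(f"ex:{op_label}_{i}")
        new_entity = doc.entity(f"ex:Dout_{i}_{Y}")
        doc.wasGeneratedBy(new_entity, act)
        for a in X:
            old_entity = doc.entity(f"ex:Din_{i}_{a}")
            doc.used(act, old_entity)
            doc.wasDerivedFrom(new_entity, old_entity)
        provlets.append(doc)
    return provlets
```

For the pipeline step of Example 3, a call such as pgen_vertical_augmentation(D, ["Age"], "ageRange") would yield one small PROV document per input row, mirroring the per-element derivations described above.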
3.2. Code instrumentation
Approaches for automated provenance capture, such as those based on the Python call stack, fail to capture data provenance at the level of the individual element within a dataframe. To accomplish this, in our initial prototype, we opted for explicit and analyst-controlled instrumentation at the script level. We have packaged the implementation of the p-gen functions described in the previous section as a Python library that analysts can add to their code wherever provenance capture is desired. Note also that it may be possible to automate function call injection, at least in part, by leveraging mature code annotation tools. While this does not completely eliminate the need for manual intervention, it reduces it to a simple comment/annotation effort (which can be driven by a smart UI) rather than requiring additional programming.

3.3. Generating provenance documents
A complete provenance document is produced by combining the collection of provlets that results from calling the p-gen functions. Specifically, one provlet is generated for every transformation and for every element in the dataframe that is affected by that transformation. The document is represented by this collection of provlets, where entity identifiers match across provlets, and it never needs to be fully materialised, as explained shortly. To illustrate how provlets are generated, consider the following pipeline: 𝜎𝐶(𝛼→𝑓1(Age):ageRange(𝐷)) where 𝐶 = {ageRange ≠ ‘young’} and 𝐷 is the dataset of Ex. 3. The corresponding provenance document is represented in Figure 3.

Figure 3: Provlet composition.

Applying vertical augmentation produces one provlet for each record in the input dataframe, showing the derivation from Age to ageRange. The second step, selecting records for ‘not young’ people, produces the new set of provlets on the right, indicating the invalidation of the first record. Note that the “used” side on the left refers to existing entities, which are created either when the input dataset enters the pipeline or by an upstream data generation operator. Provlet composition requires looking up the set of entities already produced whenever a new provlet is added to the document. For this, we have built a bespoke architecture that allows lazy provenance composition. Each p-gen function generates a set of provlets, one for each element in the dataframe (in the worst case), constructs a partial document, and stores it in a persistent MongoDB back end. This allows the provenance to be collected quickly during the execution of each script and to be assembled later, minimizing execution dependencies and possible bottlenecks during the actual execution of the pipeline. Concretely, each p-gen function creates, at query time, a provenance object containing all provlets, and an input JSON file representing the input dataframe. By capturing provlets from each p-gen function, it is possible to compose these provlets into a complete graph, which can be traversed as a bipartite graph for any 𝑑𝑖𝑗. The process of composing provlets and tracing the influence (either direct or indirect) of data and operations on 𝑑𝑖𝑗 can then support “why provenance” [11].
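The composition and traversal step can be pictured with the following simplified sketch, which works over plain (generated, used) derivation pairs extracted from the provlets, rather than over the MongoDB-backed documents described above; the element identifiers are illustrative.

```python
from collections import defaultdict

def compose(derivation_edges):
    """Index derivation edges by generated element: a simplified, in-memory
    stand-in for the MongoDB-backed provlet composition described above."""
    derived_from = defaultdict(set)
    for generated, used in derivation_edges:
        derived_from[generated].add(used)
    return derived_from

def why(element_id, derived_from):
    """Trace every element that directly or indirectly influenced element_id."""
    frontier, seen = [element_id], set()
    while frontier:
        for src in derived_from.get(frontier.pop(), ()):
            if src not in seen:
                seen.add(src)
                frontier.append(src)
    return seen

# e.g. why("Dout_1_ageRange", compose([("Dout_1_ageRange", "Din_1_Age")]))
```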
4. Conclusions
In this work, we have illustrated an approach for producing fine-grained data provenance in machine learning pipelines, irrespective of the pipeline tool used. Indeed, because substantial effort goes into selecting and preparing data for use in modelling, and because changes made during preparation can affect the ultimate model, it is important to be able to trace what is happening to the data at such a level of detail. Using an implementation of this system, we have demonstrated the utility and performance of our approach over real-world ML benchmark pipelines. In order to investigate scalability issues with our design, we also used the TPC-DI generator and applied several operators over that data at scale. Our results indicate that we can collect fine-grained provenance that is both useful and performant [6].

References
[1] A. M. Alaa, M. van der Schaar, Demystifying black-box models with symbolic metamodels, in: Advances in Neural Information Processing Systems, 2019, pp. 11301–11311.
[2] M. T. Ribeiro, S. Singh, C. Guestrin, “Why should I trust you?”: Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[3] S. García, et al., Big data preprocessing: methods and prospects, Big Data Analytics 1 (2016).
[4] L. Moreau, P. Missier, K. Belhajjame, R. B’Far, J. Cheney, S. Coppens, S. Cresswell, Y. Gil, P. Groth, G. Klyne, et al., PROV-DM: The PROV data model. W3C Recommendation REC-prov-dm-20130430, WWW Consortium (2013).
[5] T. J. Green, G. Karvounarakis, V. Tannen, Provenance semirings, in: PODS, 2007, pp. 31–40.
[6] A. Chapman, P. Missier, G. Simonelli, R. Torlone, Capturing and querying fine-grained provenance of preprocessing pipelines in data science, Proc. VLDB Endow. 14 (2020) 507–520.
[7] J. Demšar, et al., Orange: Data mining toolbox in Python, Journal of Machine Learning Research 14 (2013) 2349–2353.
[8] F. Pedregosa, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[9] L. Moreau, J. Cheney, P. Missier, Constraints of the PROV data model, 2013. URL: http://www.w3.org/TR/2013/REC-prov-constraints-20130430/.
[10] L. Moreau, B. V. Batlajery, T. D. Huynh, D. T. Michaelides, H. S. Packer, A templating system to generate provenance, IEEE Transactions on Software Engineering 44 (2018) 103–121.
[11] P. Buneman, S. Khanna, T. Wang-Chiew, Why and where: A characterization of data provenance, in: ICDT, Springer, 2001, pp. 316–330.