Fine-grained Provenance for High-quality Data Science (Discussion Paper)

Adriane Chapman1, Paolo Missier2, Giulia Simonelli3 and Riccardo Torlone3
1 University of Southampton  2 Newcastle University  3 Università Roma Tre

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
adriane.chapman@soton.ac.uk (A. Chapman); paolo.missier@ncl.ac.uk (P. Missier); giulia.simonelli@uniroma3.it (G. Simonelli)

Abstract
In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.

1. Introduction
Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models' accuracy and performance, as well as on other properties, such as model fairness. However, while substantial recent research has produced techniques for model explanation that focus primarily on the model itself (e.g., [1, 2]), relatively little work has gone into explaining models in terms of the transformations that occur before the data is used for learning. In this work, we enable explanations of the effect of each transformation in a pre-processing pipeline on the data that is ultimately fed into a model. We consider transformations that apply to commonly used tabular or relational datasets and across application domains. These steps have been systematically enumerated in multiple reviews (see, e.g., [3]) and include, among others: feature selection; engineering of new features; imputation of missing values, or listwise deletion (excluding an entire record if data is missing on any variable for that record); downsampling or upsampling of data subsets in order to achieve better balance, typically on the class labels (for classification tasks) or on the distribution of the outcome variable (for regression tasks); outlier detection and removal; smoothing and normalisation; de-duplication; as well as steps that preserve the original information but are required by some algorithms, such as “one-hot” encoding of categorical variables. A complex pipeline may include some or all of these steps, and different techniques, algorithms, and choices of algorithm-specific parameters may be available for each of them. These are often grounded in established literature, but variations can be created by data scientists to suit specific needs. We consider the space of all configured pipelines that can potentially be composed out of these operators, and we focus on relational datasets, which are arguably the most common data structures in popular analytics-friendly scripting languages and frameworks like R, Spark, and Python (where they are called dataframes). In this framework, we propose a formalisation and categorisation of a core set of these operators.
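As a point of reference, the kind of preprocessing pipeline we target can be sketched in a few lines of pandas. The snippet below is only illustrative: the dataset, column names, and specific choices of operations are hypothetical, not taken from the paper's experiments.

```python
import pandas as pd

# Hypothetical toy dataset with missing values and a categorical feature.
df = pd.DataFrame({
    "Gender": ["F", "M", None, "F"],
    "Age":    [24, 28, None, 44],
    "Zip":    ["98567", None, "32768", "32768"],
})

# Imputation of a missing numeric value with the column mean.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Listwise deletion: drop records that still contain missing values.
df = df.dropna()

# "One-hot" encoding of a categorical variable.
df = pd.get_dummies(df, columns=["Gender"])

# Engineering of a new feature from an existing one.
df["AgeRange"] = (df["Age"] < 25).map({True: "young", False: "adult"})
```

Each of these steps silently rewrites, removes, or creates individual dataset elements; the goal of this work is to record exactly those element-level changes.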
Data derivations for such operators are expressed at the level of the atomic elements in the dataset using the PROV data model [4], a standard and widely adopted ontology. Then, with each of the core operators we associate a provenance pattern that describes its effect on the data at the appropriate level of detail, i.e., on individual dataset elements, columns, rows, or collections of those. The provenance patterns defined in this work for data science operators play a role similar to that of provenance polynomials [5], i.e., annotations associated with relational algebra operators to describe the fine-grained provenance of their results. Finally, we associate a provenance function pfo() with each operator o, which generates a provenance document pfo(𝐷) when a dataset 𝐷 is processed using o. The document is an instance of the pattern associated with o. Provenance functions are implemented as part of a Python module. Collecting all the provenance documents from each operator's execution results in a seamless, end-to-end provenance document that contains the detailed history of each dataset element in the final training set, including its creation (e.g., as a new derived feature), transformation (value imputation, for example) and possibly deletion (e.g., by feature selection or removal of null values). In the rest of this paper we illustrate the models we have adopted for describing data, operators, and provenance (Section 2) as well as the way in which provenance for a data preprocessing pipeline is captured (Section 3). Further details on the query capabilities of the resulting provenance and on the scalability of the overall approach can be found in the extended version of the paper [6].

2. Models for Data, Operators, and Provenance

2.1. Data model
A (dataset) schema 𝑆 is an array of distinct names called features: 𝑆 = [a1, . . . , a𝑛]. Each feature is associated with a domain of atomic values (such as numbers, strings, and timestamps). A dataset 𝐷 over a schema 𝑆 = [a1, . . . , a𝑛] is an ordered collection of rows (or records) of the form: 𝑖 : (𝑑𝑖1, . . . , 𝑑𝑖𝑛) where 𝑖 is the unique index of the row and each element 𝑑𝑖𝑗 (for 1 ≤ 𝑗 ≤ 𝑛) is either a value in the domain of the feature a𝑗 or the special symbol ⊥, denoting a missing value. Given a dataset 𝐷 over a schema 𝑆, we denote by 𝐷𝑖a the element for the feature a of 𝑆 occurring in the 𝑖-th row of 𝐷. We also denote by 𝐷𝑖* the 𝑖-th row of 𝐷, and by 𝐷*a the column of 𝐷 associated with the feature a of 𝑆.

Example 1. A possible dataset 𝐷 over the schema 𝑆 = [CId, Gender, Age, Zip] is as follows:

      CId   Gender   Age   Zip
  1   113   F        24    98567
  2   241   M        28    ⊥
  3   375   C        ⊥     32768
  4   578   F        44    32768

𝐷*Age and 𝐷2* denote the third column and the second row of 𝐷, respectively. □
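For concreteness, the dataset of Example 1 can be rendered as a pandas DataFrame, with NaN standing in for the missing-value symbol ⊥ and the DataFrame index playing the role of the row index 𝑖. This mapping is our own illustration (the formal model is independent of any particular library); the accessors shown simply mirror the notation 𝐷𝑖a, 𝐷𝑖*, and 𝐷*a.

```python
import numpy as np
import pandas as pd

# The dataset D of Example 1; NaN plays the role of the missing value ⊥.
D = pd.DataFrame(
    {"CId": [113, 241, 375, 578],
     "Gender": ["F", "M", "C", "F"],
     "Age": [24, 28, np.nan, 44],
     "Zip": [98567, np.nan, 32768, 32768]},
    index=[1, 2, 3, 4],   # the row index i of the formal model
)

element = D.at[2, "Age"]   # D_{2,Age}: the element for feature Age in row 2
row     = D.loc[2]         # D_{2*}:    the second row of D
column  = D["Age"]         # D_{*Age}:  the column for feature Age
```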
2.2. Data manipulation model
The data transformation operators that are available in packages for building data preprocessing pipelines (e.g., Orange [7] and SciKit [8]) can be classified in three main classes, according to the type of manipulation performed on the input dataset 𝐷 over a schema 𝑆, as follows.

Data reductions. They reduce the size of 𝐷 by eliminating rows (without changing 𝑆) or columns (changing 𝑆 to 𝑆′ ⊂ 𝑆) from 𝐷. Two basic data reduction operators are defined over datasets. They are simple extensions of two well-known relational operators.

𝜋𝐶 : the (conditional) projection of 𝐷 on a set of features of 𝑆 that satisfy a boolean condition 𝐶 over 𝑆, denoted by 𝜋𝐶(𝐷), is the dataset obtained from 𝐷 by including only the columns 𝐷*a of 𝐷 such that a is a feature of 𝑆 that satisfies 𝐶;

𝜎𝐶 : the selection of 𝐷 with respect to a boolean condition 𝐶 over 𝑆, denoted by 𝜎𝐶(𝐷), is the dataset obtained from 𝐷 by including the rows 𝐷𝑖* of 𝐷 satisfying 𝐶.

The condition of both the projection and the selection operators can refer to the values in 𝐷, as shown in the following example, which uses an intuitive syntax for the condition.

Example 2. Consider the dataset 𝐷 in Example 1. The result of the expression 𝜋{features without nulls}(𝜎Age<30(𝐷)) is the following dataset:

      CId   Gender   Age
  1   113   F        24
  2   241   M        28
□

Data augmentations. They increase the size of 𝐷 by adding rows (without changing 𝑆) or columns (changing 𝑆 to 𝑆′ ⊃ 𝑆) to 𝐷. Two basic data augmentation operators are defined over datasets. They allow the addition of columns and rows to a dataset, respectively.

𝛼→𝑓(𝑋):𝑌 : the vertical augmentation of 𝐷 to 𝑌 using a function 𝑓 over a subset of features 𝑋 = [a1 . . . a𝑘] of 𝑆 is obtained by adding to 𝐷 a new set of features 𝑌 = [a′1 . . . a′𝑙] whose new values 𝑑𝑖a′1 . . . 𝑑𝑖a′𝑙 for the 𝑖-th row are obtained by applying 𝑓 to 𝑑𝑖a1 . . . 𝑑𝑖a𝑘 ;

𝛼↓𝑋:𝑓(𝑌) : the horizontal augmentation of 𝐷 using an aggregative function 𝑓 is obtained by adding one or more new rows to 𝐷, obtained by first grouping over the features in 𝑋 and then, for each group, applying 𝑓 to 𝜋𝑌(𝐷) (extending the result to 𝑆 with nulls if needed).

Example 3. Consider again the dataset 𝐷 in Example 1 and the following functions: (i) 𝑓1, which associates the string young with an age less than 25 and the string adult otherwise, and (ii) 𝑓2, which computes the average of a set of numbers. Then, the expression 𝐸1 = 𝛼→𝑓1(Age):ageRange(𝐷) produces the following dataset:

      CId   Gender   Age   Zip     ageRange
  1   113   F        24    98567   young
  2   241   M        28    ⊥       adult
  3   375   C        ⊥     32768   ⊥
  4   578   F        44    32768   adult

whereas 𝐸2 = 𝛼↓Gender:𝑓2(Age)(𝐷) produces the dataset:

      CId   Gender   Age   Zip
  1   113   F        24    98567
  2   241   M        28    ⊥
  3   375   C        ⊥     32768
  4   578   F        44    32768
  5   ⊥     F        34    ⊥
  6   ⊥     M        28    ⊥
□

Data transformation. The goal is to transform (some of) the elements in 𝐷 without changing its size or its schema. One basic data transformation operator is defined over datasets:

𝜏𝑓(𝑋) : the transformation of a set of features 𝑋 of 𝐷 using a function 𝑓 is obtained by replacing each value 𝑑𝑖a with 𝑓(𝑑*a), for each a occurring in 𝑋.

Example 4. Let 𝐷 be the dataset in Example 1 and 𝑓 be an imputation function that associates with the ⊥’s occurring in a feature a the most frequent value occurring in 𝐷*a. Then, the result of 𝜏𝑓(Zip)(𝐷) is the following dataset:

      CId   Gender   Age   Zip
  1   113   F        24    98567
  2   241   M        28    32768
  3   375   C        ⊥     32768
  4   578   F        44    32768
□

In [6] we illustrate how a large variety of pre-processing operators that are often used in data preparation workflows (including feature and instance selection, data repair, binarization, normalization, discretization, imputation, space transformation, and one-hot encoding) can be suitably expressed as compositions of the basic operators introduced in this section.
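To make the semantics of the five core operators concrete, the following is a minimal sketch of how they could be rendered over pandas DataFrames. This is our own illustration, not the authors' implementation; the function names and signatures are ours, and the handling of null values is simplified.

```python
import pandas as pd

def project(D, col_condition):
    """pi_C: keep only the columns whose Series satisfies col_condition."""
    return D[[a for a in D.columns if col_condition(D[a])]]

def select(D, row_condition):
    """sigma_C: keep only the rows satisfying row_condition."""
    return D[D.apply(row_condition, axis=1)]

def vertical_augmentation(D, f, X, Y):
    """alpha-> : add a new feature Y computed by f from the features X, row by row."""
    D = D.copy()
    D[Y] = D[X].apply(lambda row: f(*row), axis=1)
    return D

def horizontal_augmentation(D, X, f, Y):
    """alpha-down: append one row per group over X, with f aggregating column Y."""
    agg = D.groupby(X, as_index=False)[Y].agg(f)
    return pd.concat([D, agg], ignore_index=True)

def transform(D, f, X):
    """tau: replace each column in X with f applied to that column."""
    D = D.copy()
    for a in X:
        D[a] = f(D[a])
    return D
```

Under this rendering, project(select(D, lambda r: r["Age"] < 30), lambda col: col.notna().all()) reproduces Example 2; the augmentation and transformation operators are deliberately simplified (for instance, groups whose values are all null receive no special treatment).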
2.3. Data provenance model
The purpose of data provenance is to extract relatively simple explanations for the existence (or the absence) of some piece of data in the result of complex data manipulations. Along this line, we adopt as the provenance model a subset of the PROV model [9] from the W3C. In PROV, an entity represents an element 𝑑 of a dataset 𝐷 and is uniquely identified by 𝐷 and the coordinates of 𝑑 in 𝐷 (i.e., the corresponding row index and feature). An activity represents any pre-processing data manipulation that operates over datasets. For each element 𝑑 in a dataset 𝐷′ generated by an operation o over a dataset 𝐷 we represent the facts that: (i) 𝑑 wasGeneratedBy o, and (ii) 𝑑 wasDerivedFrom a set of elements in 𝐷. In addition, we represent: (iii) all the elements 𝑑 of 𝐷 such that 𝑑 was used by o and (iv) all the elements 𝑑 of 𝐷 such that 𝑑 wasInvalidatedBy (i.e., deleted by) o (if any).

Example 5. Let 𝐸 be the first expression in Example 3 and 𝐷′ = 𝐸(𝐷). A fragment of the provenance generated by this operation is reported in Figure 1. □

Figure 1: A fragment of provenance data for the operation in Example 5.

3. Capturing Provenance

3.1. Provenance templates
In order to capture the provenance of a pipeline 𝑝 consisting of a sequence of preprocessing operations o1, . . . , o𝑛, we associate a provenance-generating function (p-gen) with each operation o𝑘 occurring in 𝑝. Each such function generates a collection of provenance data whenever a dataset is processed using o𝑘, which describes the effect of o𝑘 on the data at the appropriate level of detail. As a large variety of preprocessing operators can be defined in terms of our five core pipeline operators [6], it is enough to define a p-gen function for each of these operators. Each p-gen function takes inputs 𝐷, 𝐷′ (the input and output of its associated operator) along with a description of the operator itself, and produces a PROV document that describes the transformation produced by the operator on each element of 𝐷. The output PROV document is obtained by instantiating an appropriate provenance template [10], which is designed to capture the transformation at the most granular level, i.e., at the level of individual elements of 𝐷, or its rows or columns, as appropriate. A template is simply a PROV document where: (i) variables, indicated by the namespace var:, are used as placeholders for values, and (ii) a set of rules is used to specify how the “used” and the “generated” sides of the template are repeatedly instantiated, by binding the variables to each of the data items involved in the transformation. We refer to each instantiated template produced by a p-gen function as a provlet. Take for example the case of Vertical Augmentation (VA): 𝛼→𝑓1(Age):ageRange(𝐷), which we used in Example 3, where attribute Age is binarised into {young, adult} based on a pre-defined cutoff, defined as part of 𝑓1(). The p-gen function for VA will produce a collection of small PROV documents, one for each input-output pair ⟨𝐷𝑖,Age, 𝐷′𝑖,ageRange⟩, as shown in the example. As these documents all share the same structure, we specify p-gen by giving two elements. First, a single PROV template for VA, shown in Figure 2, where we use the generic attribute names 𝑋, 𝑌 to indicate the old and new feature names, as per the operator's general definition in Section 2.2. Second, the bindings that map the template variables to the dataset elements involved in each transformation, which yield the instances shown in Figure 2.

Figure 2: PROV template for Vertical Augmentation and corresponding instances.
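As an illustration of what a p-gen function might emit, the sketch below generates one provlet per row for the vertical augmentation operator, using the W3C PROV Python package (prov). The identifiers, namespace, and function signature are our own assumptions and do not reproduce the authors' template of Figure 2.

```python
from prov.model import ProvDocument

def pgen_vertical_augmentation(D, X, Y, op_label="va"):
    """Illustrative p-gen sketch: for each row i, the new element D'_{i,Y}
    wasGeneratedBy the VA activity and wasDerivedFrom each D_{i,a}, a in X.
    Identifiers and attributes are hypothetical, not the authors' template."""
    provlets = []
    for i in D.index:
        doc = ProvDocument()
        doc.add_namespace("ex", "http://example.org/")
        act = doc.activity(f"ex:{op_label}_{i}")
        new_entity = doc.entity(f"ex:Dout_{i}_{Y}")
        doc.wasGeneratedBy(new_entity, act)
        for a in X:
            old_entity = doc.entity(f"ex:Din_{i}_{a}")
            doc.used(act, old_entity)
            doc.wasDerivedFrom(new_entity, old_entity)
        provlets.append(doc)
    return provlets
```

For the pipeline step of Example 3, a call such as pgen_vertical_augmentation(D, ["Age"], "ageRange") would yield one small PROV document per input row, mirroring the per-element derivations described above.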
3.2. Code instrumentation
Approaches for automated provenance capture, such as those based on the Python call stack, fail to capture data provenance at the level of the individual element within a dataframe. To accomplish this, in our initial prototype, we opted for explicit and analyst-controlled instrumentation at the script level. We have packaged the implementation of the p-gen functions described in the previous section as a Python library that analysts can add to their code wherever provenance capture is desired. Note also that it may be possible to automate function call injection, at least in part, by leveraging mature code annotation tools. While this does not completely eliminate the need for manual intervention, it reduces it to a simple comment/annotation effort (which can be driven by a smart UI) rather than requiring additional programming.

3.3. Generating provenance documents
A complete provenance document is produced by combining the collection of provlets that results from calling the p-gen functions. Specifically, one provlet is generated for every transformation and for every element in the dataframe that is affected by that transformation. The document is represented by this collection of provlets, where entity identifiers match across provlets, and it never needs to be fully materialised, as explained shortly. To illustrate how provlets are generated, consider the following pipeline: 𝜎𝐶(𝛼→𝑓1(Age):ageRange(𝐷)) where 𝐶 = {ageRange ≠ ‘young’} and 𝐷 is the dataset of Ex. 3. The corresponding provenance document is represented in Figure 3.

Figure 3: Provlet composition.

Applying vertical augmentation produces one provlet for each record in the input dataframe, showing the derivation from Age to ageRange. The second step, selecting records for ‘not young’ people, produces the new set of provlets on the right, indicating the invalidation of the first record. Note that the “used” side on the left refers to existing entities, which are created either when the input dataset enters the pipeline or by an upstream data generation operator. Provlet composition requires looking up the set of entities already produced whenever a new provlet is added to the document. For this, we have built a bespoke architecture that allows lazy provenance composition. Each p-gen function generates a set of provlets, one for each element in the dataframe (in the worst case), constructs a partial document, and stores it in a persistent MongoDB back end. This allows the provenance to be collected quickly during the execution of each script and to be assembled later, minimizing execution dependencies and possible bottlenecks during the actual execution of the pipeline. Concretely, each p-gen function creates, at query time, a provenance object containing all provlets, and an input JSON file representing the input dataframe. By capturing provlets from each p-gen function, it is possible to compose these provlets into a complete graph, which can be traversed as a bipartite graph for any 𝑑𝑖𝑗. The process of composing provlets and tracing the influence (either direct or indirect) of data and operations on 𝑑𝑖𝑗 can then support “why provenance” [11].
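The composition and traversal step can be pictured with the following simplified sketch, which works over plain (generated, used) derivation pairs extracted from the provlets, rather than over the MongoDB-backed documents described above; the element identifiers are illustrative.

```python
from collections import defaultdict

def compose(derivation_edges):
    """Index derivation edges by generated element: a simplified, in-memory
    stand-in for the MongoDB-backed provlet composition described above."""
    derived_from = defaultdict(set)
    for generated, used in derivation_edges:
        derived_from[generated].add(used)
    return derived_from

def why(element_id, derived_from):
    """Trace every element that directly or indirectly influenced element_id."""
    frontier, seen = [element_id], set()
    while frontier:
        for src in derived_from.get(frontier.pop(), ()):
            if src not in seen:
                seen.add(src)
                frontier.append(src)
    return seen

# e.g. why("Dout_1_ageRange", compose([("Dout_1_ageRange", "Din_1_Age")]))
```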
4. Conclusions
In this work, we have illustrated an approach for producing fine-grained data provenance in machine learning pipelines, irrespective of the pipeline tool used. Indeed, because substantial effort goes into selecting and preparing data for use in modelling, and because changes made during preparation can affect the ultimate model, it is important to be able to trace what is happening to the data at such a level of detail. Using an implementation of this system, we have demonstrated the utility and performance of our approach over real-world ML benchmark pipelines. In order to investigate scalability issues with our design, we also used the TPC-DI generator and applied several operators over that data at scale. Our results indicate that we can collect fine-grained provenance that is both useful and performant [6].

References
[1] A. M. Alaa, M. van der Schaar, Demystifying black-box models with symbolic metamodels, in: Advances in Neural Information Processing Systems, 2019, pp. 11301–11311.
[2] M. T. Ribeiro, S. Singh, C. Guestrin, “Why should I trust you?”: Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[3] S. García, et al., Big data preprocessing: methods and prospects, Big Data Analytics 1 (2016).
[4] L. Moreau, P. Missier, K. Belhajjame, R. B’Far, J. Cheney, S. Coppens, S. Cresswell, Y. Gil, P. Groth, G. Klyne, et al., PROV-DM: The PROV data model. W3C Recommendation REC-prov-dm-20130430, WWW Consortium (2013).
[5] T. J. Green, G. Karvounarakis, V. Tannen, Provenance semirings, in: PODS, 2007, pp. 31–40.
[6] A. Chapman, P. Missier, G. Simonelli, R. Torlone, Capturing and querying fine-grained provenance of preprocessing pipelines in data science, Proc. VLDB Endow. 14 (2020) 507–520.
[7] J. Demšar, et al., Orange: Data mining toolbox in Python, Journal of Machine Learning Research 14 (2013) 2349–2353.
[8] F. Pedregosa, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[9] L. Moreau, J. Cheney, P. Missier, Constraints of the PROV data model, 2013. URL: http://www.w3.org/TR/2013/REC-prov-constraints-20130430/.
[10] L. Moreau, B. V. Batlajery, T. D. Huynh, D. T. Michaelides, H. S. Packer, A templating system to generate provenance, IEEE Transactions on Software Engineering 44 (2018) 103–121.
[11] P. Buneman, S. Khanna, T. Wang-Chiew, Why and where: A characterization of data provenance, in: ICDT, Springer, 2001, pp. 316–330.