<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fine-grained Provenance for High-quality Data Science</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adriane Chapman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Missier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Simonelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Torlone</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Newcastle University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università Roma Tre</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data processing pipelines that are designed to clean, transform and alter data in preparation for
learning predictive models, have an impact on those models’ accuracy and performance, as well
on other properties, such as model fairness.</p>
      <p>
        However, while substantial recent research has produced techniques for model explanation
that focus primarily on the model itself (e.g., [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]), relatively little work has gone into
explaining models in terms of the transformations that the data undergoes before it is used for
learning.
      </p>
      <p>
        In this work, we enable the explanation of the effect of each transformation in a pre-processing
pipeline on the data that is ultimately fed into a model. We consider transformations that apply
to commonly used tabular or relational datasets and across application domains. These steps
have been systematically enumerated in multiple reviews (see, e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) and include, among others:
feature selection, engineering of new features; imputation of missing values, or listwise deletion
(excluding an entire record if data is missing on any variable for that record); downsampling
or upsampling of data subsets in order to achieve better balance, typically on the class labels
(for classification tasks) or on the distribution of the outcome variable (for regression tasks);
outlier detection and removal; smoothing and normalisation; de-duplication, as well as steps
that preserve the original information but are required by some algorithms, such as “one-hot”
encoding of categorical variables. A complex pipeline may include some or all of these steps, and
different techniques, algorithms, and choices of algorithm-specific parameters may be available
for each of them. These are often grounded in established literature but variations can be created
by data scientists to suit specific needs. We consider the space of all configured pipelines that
can potentially be composed out of these operators, and we focus on relational datasets, which
are arguably the most common data structures in popular analytics-friendly scripting languages
like R, Spark, and Python (where they are called dataframes).
      </p>
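      <p>As a point of reference, the following sketch (not part of our library; it assumes pandas, with
illustrative column names, data values, and cutoffs) shows how several of the steps listed above
typically appear in a Python script:</p>
      <preformat>import pandas as pd

# Illustrative toy data: the values and column names are assumptions.
df = pd.DataFrame({
    "CId":    [113, 241, 375, 375],
    "Gender": ["M", "F", "F", "F"],
    "Age":    [24, 28, 31, 31],
    "Zip":    ["10021", None, "10021", "10021"],
})

df = df.drop_duplicates()                           # de-duplication
df["Zip"] = df["Zip"].fillna(df["Zip"].mode()[0])   # imputation of missing values
df = df[df["Age"] &lt; 65]                             # selection (row filtering)
df = pd.get_dummies(df, columns=["Gender"])         # "one-hot" encoding</preformat>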
      <p>
        In this framework, we propose a formalisation and categorisation of a core set of these
operators. Data derivations for such operators are expressed at the level of the atomic elements
in the dataset using the PROV data model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a standard and widely adopted ontology. Then,
with each of the core operators we associate a provenance pattern that describes their effect on
the data at the appropriate level of detail, i.e., on individual dataset elements, columns, rows,
or collections of those. Provenance patterns defined in this work for data science operators
play a similar role to that of provenance polynomials [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], i.e., annotations that are associated with
relational algebra operators to describe the fine-grained provenance of their results.
Finally, we associate a provenance function pf_o() with each operator o, which generates
a provenance document pf_o(D) when a dataset D is processed using o. The document is an
instance of the pattern associated with o. Provenance functions are implemented as part of
a Python module. Collecting all the provenance documents from each operator’s execution
results in a seamless, end-to-end provenance document that contains the detailed history of
each dataset element in the final training set, including their creation (e.g. as a new derived
feature), transformation (value imputation, for example) and possibly deletion (e.g., by feature
selection, removal of null values).
      </p>
      <p>
        In the rest of this paper we illustrate the models we have adopted for describing data, operators,
and provenance (Section 2) as well as the way in which provenance for a data preprocessing
pipeline is captured (Section 3). Further details on the query capabilities of the resulting
provenance and on the scalability of the overall approach can be found in the extended version
of the paper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Models for Data, Operators, and Provenance</title>
      <sec id="sec-2-1">
        <title>2.1. Data model</title>
        <p>A (dataset) schema S is an array of distinct names called features: S = [a1, . . . , an]. Each
feature is associated with a domain of atomic values (such as numbers, strings, and timestamps).
A dataset D over a schema S = [a1, . . . , an] is an ordered collection of rows (or records) of
the form ri : (v1, . . . , vn), where i is the unique index of the row and each element vj (for
1 ≤ j ≤ n) is either a value in the domain of the feature aj or the special symbol ⊥, denoting a
missing value. Given a dataset D over a schema S, we denote by D_i,a the element for the feature
a of S occurring in the i-th row of D. We also denote by D_i,* the i-th row of D, and by D_*,a the
column of D associated with the feature a of S.</p>
        <p>Example 1. A possible dataset D over the schema S = [CId, Gender, Age, Zip] is a table of
customer records with these four features. D_*,Age and D_2,* denote the third column and the
second row of D, respectively. □</p>
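        <p>To make the notation concrete, the following sketch (assuming pandas dataframes, the
representation targeted by our prototype; the values are illustrative) shows the element, row, and
column accesses corresponding to D_i,a, D_i,*, and D_*,a:</p>
        <preformat>import numpy as np
import pandas as pd

# Schema S = [CId, Gender, Age, Zip]; np.nan plays the role of the symbol ⊥.
D = pd.DataFrame(
    {"CId": [113, 241], "Gender": ["M", "F"], "Age": [24, 28], "Zip": ["10021", np.nan]},
    index=[1, 2],  # 1-based row indices, to match the formal model
)

element = D.loc[2, "Age"]   # D_2,Age : the Age value in the second row
row     = D.loc[2, :]       # D_2,*   : the second row of D
column  = D.loc[:, "Age"]   # D_*,Age : the column for feature Age</preformat>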
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data manipulation model</title>
        <p>
          The data transformation operators that are available in packages for building data preprocessing
pipelines (e.g., Orange [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and SciKit [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) can be classified into three main classes, according to
the type of manipulation done on the input dataset D over a schema S, as follows.
Data reductions. They reduce the size of D by eliminating rows (without changing S) or
columns (changing S to S′ ⊂ S) from D. Two basic data reduction operators are defined over
datasets. They are simple extensions of two well-known relational operators.
π: the (conditional) projection of D on the set of features of S that satisfy a boolean condition C
over D, denoted by π_C(D), is the dataset obtained from D by including only the columns
D_*,a of D such that a is a feature of S that satisfies C;
σ: the selection of D with respect to a boolean condition C over S, denoted by σ_C(D), is the
dataset obtained from D by including the rows D_i,* of D satisfying C.
        <p>The condition of both the projection and the selection operators can refer to the values in D, as
shown in the following example, which uses an intuitive syntax for the condition.</p>
        <sec id="sec-2-2-1">
          <title>Example 2. Consider the dataset  in Example 1.</title>
          <p>{features without nulls}( Age&lt;30()) is the following dataset:
CId Gender Age
1 113  24
2 241  28</p>
        </sec>
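        <p>A minimal sketch of the two reduction operators over a pandas dataframe; since the condition
language is left abstract above, conditions are encoded here as Python callables (an assumption of
this sketch, with illustrative data):</p>
        <preformat>import numpy as np
import pandas as pd

D = pd.DataFrame({"CId": [113, 241, 375], "Gender": ["M", "F", "F"],
                  "Age": [24, 28, 31], "Zip": ["10021", "10040", np.nan]})

def select(D, row_cond):
    """sigma_C(D): keep the rows D_i,* that satisfy the boolean condition C."""
    return D[D.apply(row_cond, axis=1)]

def project(D, col_cond):
    """pi_C(D): keep the columns D_*,a whose feature a satisfies C."""
    return D[[a for a in D.columns if col_cond(D[a])]]

# Example 2: pi_{features without nulls}(sigma_{Age&lt;30}(D))
result = project(select(D, lambda r: r["Age"] &lt; 30),
                 lambda col: col.notna().all())</preformat>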
        <sec id="sec-2-2-2">
          <title>The result of the expression</title>
          <p>Example 1. A possible dataset  over the schema  = [CId, Gender, Age, Zip] is as follows:
Data augmentations. They increase the size of  by adding rows (without changing ) or
columns (changing  to ′ ⊃ ) to . Two basic data augmentation operators are defined over
datasets. They allow the addition of columns and rows to a dataset, respectively.
 →(): : the vertical augmentation of  to  using a function  over a subset of features
 = [a1 . . . a] of  is obtained by adding to  a new set of features  = [a′1 . . . a′]
whose new values a′1 . . . a′ for the -th row are obtained by applying  to a1 . . . a ;
 ↓:( ): the horizontal augmentation of  using an aggregative function  is obtained by
adding one or more new rows to  obtained by first grouping over the features in 
and then, for each group, by applying  to   () (extending the result to  with nulls if
needed).
□
□
Example 3. Consider again the dataset  in Example 1 and the following functions: (i) 1, which
associates the string young to an age less than 25 and the string adult otherwise, and (ii) 2, which
computes the average of a set of numbers. Then, the expression  →1(Age):ageRange() produces the
following dataset:
Data transformation. The goal is to transform (some of) the elements in  without changing
its size or its schema. One basic data transformation operator is defined over datasets:
 (): the transformation of a set of features  of  using a function  is obtained by replacing
each value a with  (* a), for each a occurring in .</p>
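        <p>The two augmentation operators admit a direct pandas sketch (again an illustration under
assumed encodings, with f1 as in Example 3; the single-new-feature case is shown for vertical
augmentation):</p>
        <preformat>import pandas as pd

def vertical_augment(D, f, F, new_feature):
    """->_f(F):G(D), single new feature: apply f to the F-values of each row."""
    out = D.copy()
    out[new_feature] = D[F].apply(lambda row: f(*row), axis=1)
    return out

def horizontal_augment(D, F, g):
    """↓_F:g(D): group on F, aggregate each group with g, append the new rows."""
    extra = D.groupby(F, dropna=False).agg(g).reset_index()
    return pd.concat([D, extra], ignore_index=True)

D = pd.DataFrame({"CId": [113, 241], "Age": [24, 28]})
f1 = lambda age: "young" if age &lt; 25 else "adult"   # f1 from Example 3
D2 = vertical_augment(D, f1, ["Age"], "ageRange")
D3 = horizontal_augment(D2[["ageRange", "Age"]], ["ageRange"], "mean")</preformat>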
        <p>Data transformation. The goal is to transform (some of) the elements in D without changing
its size or its schema. One basic data transformation operator is defined over datasets:
τ_f(T)(D): the transformation of a set of features T of D using a function f is obtained by
replacing the values of each column D_*,a with those computed by f(D_*,a), for each feature a
occurring in T.</p>
        <p>
          Example 4. Let D be the dataset in Example 1 and f be an imputation function that associates
with the ⊥’s occurring in a feature a the most frequent value occurring in D_*,a. Then, the result
of τ_f(Zip)(D) is the dataset obtained from D by replacing each ⊥ in the Zip column with the
most frequent value occurring in that column. □
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] we illustrate how a large variety of pre-processing operators that are often used in
data preparation workflows (including feature and instance selection, data repair, binarization,
normalization, discretization, imputation, space transformation, and one-hot encoding) can be
suitably expressed as compositions of the basic operators introduced in this section.
        </p>
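        <p>A sketch of the transformation operator, instantiated as the mode imputation of Example 4
(pandas assumed; the function names are ours, for illustration):</p>
        <preformat>import numpy as np
import pandas as pd

def transform(D, T, f):
    """tau_f(T)(D): replace the values of column D_*,a by f(D_*,a), for each a in T."""
    out = D.copy()
    for a in T:
        out[a] = f(D[a])
    return out

# Example 4: impute the ⊥'s of Zip with the most frequent value of the column.
impute_mode = lambda col: col.fillna(col.mode().iloc[0])
D = pd.DataFrame({"Zip": ["10021", np.nan, "10021", "10027"]})
D_prime = transform(D, ["Zip"], impute_mode)</preformat>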
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Data provenance model</title>
        <p>
          The purpose of data provenance is to extract relatively simple explanations for the existence
(or the absence) of some piece of data in the result of complex data manipulations. Along
this line, we adopt as the provenance model a subset of the PROV model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] from the W3C. In
PROV, an entity represents an element d of a dataset D and is uniquely identified by D and the
coordinates of d in D (i.e., the corresponding row index and feature). An activity represents
any pre-processing data manipulation that operates over datasets. For each element d in a
dataset D′ generated by an operation o over a dataset D we represent the facts that: (i) d
wasGeneratedBy o, and (ii) d wasDerivedFrom a set of elements in D. In addition, we represent:
(iii) all the elements u of D such that u was used by o and (iv) all the elements w of D such that
w wasInvalidatedBy (i.e., was deleted by) o (if any).
        </p>
        <p>Example 5. Let e be the first expression in Example 3 and D′ = e(D). A fragment of the
provenance generated by this operation is reported in Figure 1. □</p>
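        <p>The four relations above can be recorded, for a single element, roughly as follows; this
sketch assumes the prov Python package and an illustrative naming scheme for entity identifiers
(dataset, row, feature), neither of which is mandated by our approach:</p>
        <preformat>from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

# An entity is identified by the dataset and the element's coordinates in it.
d_in  = doc.entity("ex:D_2_Age")            # element D_2,Age of the input dataset
d_out = doc.entity("ex:Dprime_2_ageRange")  # element D'_2,ageRange of the output
op    = doc.activity("ex:vertical_augmentation_1")

doc.used(op, d_in)               # (iii) the operation used the input element
doc.wasGeneratedBy(d_out, op)    # (i)   the new element wasGeneratedBy the operation
doc.wasDerivedFrom(d_out, d_in)  # (ii)  ... and wasDerivedFrom the input element
# (iv) elements deleted by the operation, if any, are recorded as invalidated:
# doc.wasInvalidatedBy(d_in, op)
print(doc.get_provn())</preformat>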
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Capturing Provenance</title>
      <sec id="sec-3-1">
        <title>3.1. Provenance templates</title>
        <p>
          In order to capture the provenance of a pipeline P consisting of a sequence of preprocessing
operations o1, . . . , on, we associate a provenance-generating function (p-gen) with each operation oi
occurring in P. Each such function generates a collection of provenance data whenever a dataset
is processed using oi, which describes the effect of oi on the data at the appropriate level of
detail. As a large variety of preprocessing operators can be defined in terms of our five core
pipeline operators [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], it is enough to define a p-gen function for each of these operators.
        </p>
        <p>
          Each p-gen function takes as inputs D and D′ (the input and output of its associated operator)
along with a description of the operator itself, and produces a PROV document that describes
the transformation produced by the operator on each element of D. The output PROV document
is obtained by instantiating an appropriate provenance template [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which is designed to
capture the transformation at the most granular level, i.e., at the level of individual elements of
D, or its rows or columns, as appropriate. A template is simply a PROV document where: (i)
variables, indicated by the namespace var:, are used as placeholders for values and (ii) a set of
rules is used to specify how the “used” and the “generated” sides of the template are repeatedly
instantiated, by binding the variables to each of the data items involved in the transformation.
We refer to each instantiated template produced by a p-gen function as a provlet.
        </p>
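        <p>The following sketch shows the idea of template instantiation (in the spirit of [10]) on a
deliberately simplified encoding: a template as a plain Python structure whose var:-prefixed
strings are placeholders, and a binding step that yields one provlet. The encoding is an assumption
of this sketch, not our internal format:</p>
        <preformat># A simplified PROV template: var:-prefixed strings are placeholders.
TEMPLATE = {
    "entity":         ["var:used", "var:generated"],
    "activity":       ["var:operation"],
    "used":           [("var:operation", "var:used")],
    "wasGeneratedBy": [("var:generated", "var:operation")],
    "wasDerivedFrom": [("var:generated", "var:used")],
}

def instantiate(template, bindings):
    """Produce a provlet by substituting each var: placeholder with its binding."""
    def subst(x):
        if isinstance(x, str) and x.startswith("var:"):
            return bindings[x]
        if isinstance(x, (list, tuple)):
            return type(x)(subst(v) for v in x)
        return x
    return {k: subst(v) for k, v in template.items()}

provlet = instantiate(TEMPLATE, {
    "var:used":      "D/2/Age",
    "var:generated": "Dprime/2/ageRange",
    "var:operation": "vertical_augmentation_1",
})</preformat>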
        <p>Take for example the case of Vertical Augmentation (VA): →_f1(Age):ageRange(D), which we
used in Example 3, where attribute Age is binarised into {young, adult} based on a pre-defined
cutoff, defined as part of f1(). The p-gen function for VA will produce a collection of small PROV
documents, one for each input-output pair ⟨D_i,Age, D′_i,ageRange⟩, as shown in the example. As
these documents all share the same structure, we specify p-gen by giving two elements: first, a
single PROV template for VA, as shown in Figure 2, where we use the generic attribute names F
and G to indicate the old and new feature names, as per the operator’s general definition in
Section 2.2; second, the rules that bind the template variables to each such pair.</p>
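        <p>Continuing the template sketch above, a p-gen function for VA can then emit one provlet per
input-output pair (the function name and identifier conventions are again ours, for illustration):</p>
        <preformat>import pandas as pd

def pgen_vertical_augmentation(D, D_prime, old_feature, new_feature):
    """Emit one provlet per row i: D_i,old used by VA, D'_i,new derived from it."""
    return [
        instantiate(TEMPLATE, {   # TEMPLATE/instantiate from the sketch above
            "var:used":      f"D/{i}/{old_feature}",
            "var:generated": f"Dprime/{i}/{new_feature}",
            "var:operation": f"VA_{old_feature}_{new_feature}",
        })
        for i in D.index
    ]

D = pd.DataFrame({"Age": [24, 28]})
D_prime = D.assign(ageRange=["young", "adult"])
provlets = pgen_vertical_augmentation(D, D_prime, "Age", "ageRange")</preformat>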
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Code instrumentation</title>
        <p>Approaches for automated provenance capture, such as those based on the Python call stack, fail to
capture data provenance at the level of the individual element within a dataframe. To accomplish
this, in our initial prototype, we opted for explicit, analyst-controlled instrumentation at
the script level. We have packaged the implementation of the p-gen functions described in
the previous section as a Python library that analysts can add to their code where provenance
capture is desired. Note also that it may be possible to automate function call injection, at least
in part, by leveraging mature code annotation tools. While this does not completely eliminate
the need for manual intervention, it reduces it to a simple comment/annotation effort (which can
be driven by a smart UI) rather than requiring additional programming.</p>
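        <p>For concreteness, instrumentation might look as follows at the script level; the module name
(provlib), the file name, and the call signatures are hypothetical placeholders, not the actual API
of our library:</p>
        <preformat>import pandas as pd
# import provlib   # hypothetical name for the provenance-capture library

df = pd.read_csv("training_data.csv")         # illustrative input
# tracker = provlib.Tracker(df)               # analyst opts in to capture

df["ageRange"] = df["Age"].apply(lambda a: "young" if a &lt; 25 else "adult")
# tracker.capture("vertical_augmentation", used=["Age"],
#                 generated=["ageRange"], output=df)

df = df[df["ageRange"] != "young"]
# tracker.capture("selection", condition="ageRange != 'young'", output=df)</preformat>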
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Generating provenance documents</title>
        <p>A complete provenance document is produced by combining the collection of provlets that results
from calling the p-gen functions. Specifically, one provlet is generated for every transformation and
every element in the dataframe that is affected by that transformation. The document is
represented by this collection of provlets, where entity identifiers match across provlets, and never
needs to be fully materialised, as explained shortly. To illustrate how provlets are generated,
consider the following pipeline: σ_C(→_f1(Age):ageRange(D)), where C = {ageRange ≠ ‘young’}
and D is the dataset of Ex. 3. The corresponding provenance document is represented in Figure 3.</p>
        <p>Applying vertical augmentation produces one provlet for each record in the input dataframe,
showing the derivation from Age to ageRange. The second step, selecting records for ‘not
young’ people, produces the new set of provlets on the right, to indicate invalidation of the first
record. Note that the “used” side on the left refers to existing entities, which were created either
upon entry of the input dataset into the pipeline, or by an upstream data generation operator.</p>
        <p>Provlet composition requires looking up the set of entities already produced, whenever a new
provlet is added to the document. For this, we have built a bespoke architecture that allows lazy
provenance composition. Each p-gen function generates a set of provlets, one for each element
in the dataframe (in the worst case), constructs a partial document, and stores it to a persistent
MongoDB back end. This allows the provenance to be collected quickly at execution of each
script, and be assembled later, minimizing execution dependencies and possible bottlenecks
during the actual execution of the pipeline.</p>
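        <p>A sketch of this store-now, assemble-later pattern, assuming pymongo; the database,
collection, and field names are illustrative:</p>
        <preformat>from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
store = client["provenance"]["provlets"]   # illustrative database/collection names

def persist(provlets, operator_name, step):
    """Store the provlets of one operator execution; composition happens later."""
    store.insert_many(
        [{"step": step, "operator": operator_name, "provlet": p} for p in provlets]
    )</preformat>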
        <p>
          Concretely, each p-gen function creates, at query time, a provenance object containing all
provlets, and an input JSON file representing the input dataframe. By capturing provlets from
each p-gen function, it is possible to compose these provlets into a complete graph, which can
be traversed as a bipartite graph for any element of interest. The process of composing provlets,
and tracing the influence (either direct or indirect) of data and operations on an element, can then
support “why provenance” [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
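        <p>A sketch of the query-time side: provlets are joined on matching entity identifiers into a
derivation graph, and the direct or indirect sources of an element are found by traversal, supporting
why-provenance-style queries. The provlet encoding follows the earlier template sketch:</p>
        <preformat>from collections import defaultdict, deque

def compose(provlets):
    """Join provlets on matching entity identifiers into a derivation graph."""
    derived_from = defaultdict(set)
    for p in provlets:
        for generated, used in p.get("wasDerivedFrom", []):
            derived_from[generated].add(used)
    return derived_from

def trace(entity, derived_from):
    """All entities that directly or indirectly influenced the given entity."""
    seen, frontier = set(), deque([entity])
    while frontier:
        for src in derived_from.get(frontier.popleft(), ()):
            if src not in seen:
                seen.add(src)
                frontier.append(src)
    return seen</preformat>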
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>
        In this work, we have illustrated an approach for producing fine-grained data provenance in
machine learning pipelines, irrespective of the pipeline tool used. Indeed, because a substantial
effort goes into selecting and preparing data for use in modelling, and because changes made
during preparation can affect the ultimate model, it is important to be able to trace what is
happening to the data at such a level of detail. Using an implementation of this system, we
have demonstrated the utility and performance of our approach over real-world ML benchmark
pipelines. In order to investigate scalability issues with our design, we also used the TPC-DI
generator and applied several operators over that data at scale. Our results indicate that we can
collect fine-grained provenance that is both useful and efficient to collect [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Alaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van der Schaar</surname>
          </string-name>
          ,
          <article-title>Demystifying black-box models with symbolic metamodels</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>11301</fpage>
          -
          <lpage>11311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>“Why should I trust you?”: Explaining the predictions of any classifier</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>García</surname>
          </string-name>
          , et al.,
          <article-title>Big data preprocessing: methods and prospects</article-title>
          ,
          <source>Big Data Analytics</source>
          <volume>1</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>B'Far</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coppens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cresswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Klyne</surname>
          </string-name>
          , et al.,
          <article-title>PROV-DM: The PROV data model</article-title>
          ,
          <source>W3C Recommendation REC-prov-dm-20130430</source>
          , World Wide Web Consortium
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karvounarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>Provenance semirings</article-title>
          ,
          <source>in: PODS</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Simonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Capturing and querying fine-grained provenance of preprocessing pipelines in data science</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>507</fpage>
          -
          <lpage>520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          , et al.,
          <article-title>Orange: Data mining toolbox in python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>14</volume>
          (
          <year>2013</year>
          )
          <fpage>2349</fpage>
          -
          <lpage>2353</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , et al.,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <article-title>Constraints of the PROV data model</article-title>
          ,
          <year>2013</year>
          . URL: http://www.w3.org/TR/2013/REC-prov-constraints-20130430/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. V.</given-names>
            <surname>Batlajery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Michaelides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Packer</surname>
          </string-name>
          ,
          <article-title>A templating system to generate provenance</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>44</volume>
          (
          <year>2018</year>
          )
          <fpage>103</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Why and where: A characterization of data provenance</article-title>
          ,
          <source>in: ICDT</source>
          , Springer,
          <year>2001</year>
          , pp.
          <fpage>316</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>