<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From why-provenance to why+provenance: Towards addressing deep data explanations in Data-Centric AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Missier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Torlone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università Roma Tre</institution>
          ,
          <addr-line>Roma, Dipartimento di Ingegneria</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Birmingham, School of Computer Science</institution>
          ,
          <addr-line>Birmingham</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this position paper we discuss the problem of exploiting data provenance to provide explanations in data-centric AI processes, where the emphasis of model development is placed on the quality of data. In particular, we show how a classification of the main operators used in the data preparation phase provides an effective and powerful means for the production of increasingly detailed explanations at the needed level of data granularity.</p>
      </abstract>
      <kwd-group>
<kwd>Data-centric AI</kwd>
        <kwd>Data Engineering pipeline</kwd>
        <kwd>Data Provenance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In provenance theory, the notion of why-provenance has been introduced primarily in the
context of relational models and algebra with reference to the set of tuples in source relations
that contribute to producing the results of a (SQL) query. These have been known as witness
tuples [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] and define the lineage of a tuple that appears in the result of a query. The clear
semantics associated with relational algebra operators made it possible to develop formal,
elegant models for representing why-provenance, and its extensions to how-provenance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Further extensions have subsequently been developed, for instance to capture the provenance
of results of aggregated queries [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        A parallel strand of research focused on the definition of provenance as it applies to datasets
that undergo a series of transformations, where arbitrary operators are typically arranged into
a Directed Acyclic Graph topology. Provenance in this setting can itself be expressed as a graph
of data derivations, where each derivation is mediated by one operator. This became known
as “coarse-grained” provenance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in contrast to the tuple-level provenance grounded in the
relational framework. As the term suggests, the “black-box” nature of the operators makes
granular provenance hard to determine directly [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This is because why- and how-provenance
are grounded in the precise semantics of the query operators, which is not available when using
arbitrary data transformers. The result is a provenance graph that is limited to dataset-level
derivations, i.e., through each of the processors. Existing approaches that attempt to circumvent
this limitation and reconstruct “high-fidelity” provenance, using system-level events recorded
during execution (“provenance record and replay” [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), are designed to apply to completely
unstructured processes and may incur significant computational overhead.
      </p>
      <p>
        In this position paper, we focus on the provenance of data transformations that occur in
the context of Data Science, where dataflow-structured data analytics pipelines are common,
and where their elements are not simple relational operators (as was assumed e.g. in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]),
but they are not completely arbitrary, either. In this middle-ground setting, we consider the
problem of using provenance to generate explanations that justify and account for the observed
transformations. Looking at the semantics of the processors involved in the pipeline reveals a
range of automatic provenance-generation capabilities, with relational, how-provenance on
one extreme, and arbitrary data manipulation code, on the other.
      </p>
      <p>We suggest that an interesting region within this spectrum is occupied by a new generation
of operators, defined in the context of so-called Data-Centric AI (DCAI). These are
sophisticated operators specifically designed to produce training sets from raw datasets, where
data processing is often interleaved with model training in an iterative fashion. While
their semantics are not formally defined, these operators tend to fall into a few categories: data
transformations such as incremental data cleaning; data augmentation, for instance through
upsampling algorithms; and data selection, including feature selection and elimination of
redundant data points.</p>
      <p>
        In the rest of this paper, we present exemplar use cases of DCAI processes taken from recent
literature, and propose a categorisation of operators that builds on our previous work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
showing how the use cases relate to these categories. We then discuss options for generating
“provenance narratives” to describe these operators’ behavior. Somewhat provocatively, we
refer to these narratives as why+provenance, to indicate that they can be used to answer “why”
type questions with reference to complex but well-defined data transformations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Example use cases</title>
      <p>We present four use cases, where data transformation and data selection exhibit two kinds
of complexity: either they are entangled with the modelling itself, or they implement some
bespoke data manipulation strategy that is not captured by typical data processing operators.</p>
      <sec id="sec-2-1">
        <title>2.1. Model-driven incremental data cleaning</title>
        <p>
          ActiveClean [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is one of several incremental data cleaning algorithms, surveyed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],
targeted specifically at training sets. It provides a good example of an iterative approach
designed to progressively clean a “dirty” training set D by balancing the cost of selecting and
cleaning items in D with the benefits of learning a usable model, despite being trained on a
training set that is still partially dirty [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. A “dirty” multidimensional data point is one where
one or more of its components is inaccurate, for instance a wrong numerical figure or a wrongly
spelled name. As observed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], such dirty data lead to a sub-optimal training process, where
the model parameters are optimised, but for a loss function that corresponds to a misleading
training set, potentially rendering the model useless in practice.
        </p>
        <p>Cleaning is generally an expensive operation, especially when it requires manual inspection.
The idea behind ActiveClean is to start by selecting a subset of dirty data points from D,
manually clean them to generate D1, and use D1 to retrain the model. This is repeated until a
stop condition is reached, producing a sequence D → D1, D2, . . . , D′ of training sets. There is
an assumption that the dirty/clean status of a data point can be automatically detected, and
that one can optimise the procedure by choosing the data points to be cleaned that most affect
the model, with the goal of minimising the number of iterations.</p>
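        <p>As a minimal sketch (not ActiveClean's published algorithm: in particular, items to clean are sampled at random here rather than by their estimated effect on the model), the clean-and-retrain loop can be rendered as follows; all function names are hypothetical placeholders:</p>
        <preformat>
```python
import random

def iterative_clean(dirty, is_dirty, clean_item, train, score, target, max_rounds=10):
    """Sketch of an ActiveClean-style loop: retrain on partially clean data,
    then manually clean a small batch of dirty items, until the model is
    good enough or no dirty items remain."""
    data = list(dirty)
    history = []
    for _ in range(max_rounds):
        model = train(data)                 # retrain on the partially clean set
        history.append(data.copy())
        dirty_idx = [i for i, x in enumerate(data) if is_dirty(x)]
        if not dirty_idx or score(model) >= target:
            break
        batch = random.sample(dirty_idx, min(2, len(dirty_idx)))
        for i in batch:                     # simulated manual cleaning
            data[i] = clean_item(data[i])
    return model, history
```
        </preformat>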
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Training set debugging</title>
        <p>
          This use case is similar to the previous one, but here we assume that the ground truth labels
associated with some of the data points, as opposed to the features, may be incorrect. The challenge, as
it was presented by the DataPerf group [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] (https://www.dataperf.org/) as part of ML Commons
(https://github.com/mlcommons), is to devise a strategy by which a sufficiently accurate model
can be trained by correcting the fewest possible labels. Specifically, the challenge used annotated
image data from the OpenImage V7 dataset (https://storage.googleapis.com/openimages/web),
which contains millions of images with various levels of annotations (bounding boxes,
relationships, image-level, point-level labels, etc.). Given a training set D taken from OpenImage
with perfect annotations and a predefined classification task T, a model M(D) is trained
and its performance P (according to some agreed-upon metric) is used as a benchmark for
the challenge. Some of the labels in D are then randomly corrupted, leading to a noisy set
E. The performance Q of a model obtained by training a classifier for T using E will in
general be sub-optimal, Q &lt; P. The challenge is to devise a strategy for selecting the smallest
possible subset E′ ⊂ E such that, by correcting the labels in E′ and then retraining, the new
performance Q′ will approximate P within some predefined threshold ε: P − Q′ &lt; ε.
        </p>
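        <p>The winning DataPerf strategies are not published; purely as an illustration of the challenge's evaluation loop, the following sketch corrects labels in random batches until performance is within ε of the clean benchmark. All names are hypothetical:</p>
        <preformat>
```python
import random

def correct_fewest(noisy_labels, true_labels, train_and_eval, p_star, eps, batch=5, seed=0):
    """Naive baseline for the debugging challenge: correct labels in random
    batches until performance is within eps of the clean benchmark p_star."""
    rng = random.Random(seed)
    labels = list(noisy_labels)
    wrong = [i for i in range(len(labels)) if labels[i] != true_labels[i]]
    rng.shuffle(wrong)
    corrected = []
    while True:
        p = train_and_eval(labels)
        if eps > p_star - p or not wrong:   # close enough, or nothing left
            break
        for i in wrong[:batch]:
            labels[i] = true_labels[i]      # simulated manual correction
            corrected.append(i)
        wrong = wrong[batch:]
    return corrected, p
```
        </preformat>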
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Training set optimisation</title>
        <p>
          The next two use cases are motivated by the well-known “power laws” observation, common in
deep learning applications, that model performance (test loss) correlates positively with training
set size according to a power law [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. However, increasing training set size ultimately incurs
diminishing returns in terms of loss reduction. This motivates trying to prune D to reduce its
size. Two approaches stand out, both proposed by the same researchers.
        </p>
        <p>Firstly, in [15] the idea is to map D to an embedded space, using pre-trained foundation
models, then cluster all data points in that space using a standard clustering algorithm (k-means).
Neighbouring points within each cluster (according to some distance metric) are considered
redundant and candidates for pruning, resulting in the final D′.</p>
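        <p>A sketch of this first approach, under the assumption that embeddings have already been computed (the paper uses pre-trained foundation models; plain vectors stand in here), might use scikit-learn's KMeans and drop near-duplicate neighbours within each cluster:</p>
        <preformat>
```python
import numpy as np
from sklearn.cluster import KMeans

def prune_redundant(emb, k=2, threshold=0.1, seed=0):
    """Cluster embeddings with k-means and, within each cluster, drop any
    point lying within `threshold` of an already-kept point (a stand-in for
    the semantic-deduplication idea, not the paper's exact procedure)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(emb)
    keep = []
    for c in set(labels):
        idx = np.where(labels == c)[0]
        kept = []
        for i in idx:
            dists = [np.linalg.norm(emb[i] - emb[j]) for j in kept]
            if not dists or min(dists) > threshold:
                kept.append(i)              # first point of each "clump" survives
        keep.extend(kept)
    return sorted(keep)
```
        </preformat>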
        <p>The second approach [16] is based on the concept of data points that are easy or hard to
learn from. Two main results underpin the pruning method. Firstly, the authors claim that the
difficulty of each data point is proportional to the distance of that point from the centroid of a
k-means cluster, where the clustering is performed in an embedded space. Once the points are
ranked in terms of their difficulty, the second claim is that hard examples should be preserved
for large training sets and, conversely, the easier ones should be preferred for smaller training
sets (details are in the paper). It should be noted that the experiments supporting this claim
were only performed using a dedicated self-supervised model pre-trained on ImageNet, and
may not generalise well to other contexts.</p>
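        <p>A sketch of the underlying ranking, with a single cluster standing in for the k-means clustering and difficulty taken as distance from the centroid (our simplification of the method in [16]):</p>
        <preformat>
```python
import numpy as np

def prune_by_difficulty(emb, keep_fraction, keep_hard=True):
    """Rank points by 'difficulty' (distance from the centroid of their
    cluster; one cluster here, for brevity) and keep a fraction of them:
    the hardest when the retained set is large, the easiest when small."""
    centroid = emb.mean(axis=0)
    difficulty = np.linalg.norm(emb - centroid, axis=1)
    order = np.argsort(difficulty)          # easiest first
    if keep_hard:
        order = order[::-1]                 # hardest first
    n_keep = max(1, int(round(keep_fraction * len(emb))))
    return sorted(order[:n_keep].tolist())
```
        </preformat>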
        <p>[Table: summary of the use cases, the operators involved, and notes (reconstructed from a flattened extraction). Model-driven incremental data cleaning: item transformations x → x′; cleaning is incremental and interleaved with model retraining. Training set debugging: item transformations y → y′ on labels; items are selected from the training set for label correction, aiming to rank data points and minimize manual corrections; the re-labelling strategy is incremental and interleaved with model retraining, but the winning strategy was not published and thus its generalizability is not clear. Training set optimization, reducing redundancy by removing similar points: items are pruned from the training set by clustering data points in an embedded space and selecting representatives from each cluster; pruning happens before model training. Training set optimization, reducing redundancy by pruning hard/easy examples: items are pruned by identifying simple/hard examples and sampling from those depending on training set size; pruning happens before model training.]</p>
        <p>3. Representing provenance at multiple levels of detail
With reference to the examples just presented, we would like to provide provenance support for
answering the following types of questions. Firstly, which data transformations were applied to
the raw input dataset(s) to generate the final training set used for modelling? Secondly, which of the
individual data items were affected by each of the transformations, and what was the effect?
And thirdly, why was a specific data item chosen for transformation or inclusion/exclusion, and,
in the case of transformations, how was a specific new value chosen? These questions address
issues of reproducibility, specifically when the operator is part of a processing pipeline, and
explainability, both at the level of the entire training set and of individual data points.</p>
        <p>Viewed at a high level, the examples fall into the broad categories of data transformation,
D → D′, and selection, D′ ⊂ D. At this level, it is straightforward to record the provenance
of D′ as a derivation from D, which is mediated by some abstract activity A that represents
the cleaning or pruning operations. Using the formal notation provided by the PROV data
model [17], this can be written simply as:
activity(A)            # an activity represents a data operator
entity(D), entity(D′)  # entities represent datasets
Used(A, D)             # A consumes input dataset D
WasGeneratedBy(D′, A)  # A produces output dataset D′
WasDerivedFrom(D′, D)  # data-data derivations</p>
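        <p>The dataset-level assertions above can be emitted mechanically. A stdlib-only sketch, with assertions stored as plain tuples (a real implementation would use a PROV serialisation library instead):</p>
        <preformat>
```python
def dataset_level_provenance(input_name, output_name, operator_name):
    """Emit the PROV-style assertions for one abstract operator: the
    operator is an activity, the datasets are entities, and the output
    is derived from the input."""
    return [
        ("activity", operator_name),                      # a data operator
        ("entity", input_name), ("entity", output_name),  # the datasets
        ("used", operator_name, input_name),              # A consumes D
        ("wasGeneratedBy", output_name, operator_name),   # A produces D'
        ("wasDerivedFrom", output_name, input_name),      # data-data derivation
    ]
```
        </preformat>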
        <p>Fig. 2(a) shows a corresponding graph representation for these derivations. This high-level
provenance is not very informative, however, if we want to account for how A operates on
each data item. In the first two examples, A performs 1-1, item-wise transformations, i.e.,
x ∈ D → x′ ∈ D′ where either x′ = x or x′ is a clean version of x. Our PROV notation can be
extended to account for this more granular level, as follows (see also Fig. 2(b)):
{entity(x)}x∈D                # items in D
{entity(x′)}x′∈D′             # items in D′
activity(A)                   # A is still an atomic operator
{Used(A, x)}x∈D               # x that have been affected by A
{WasGeneratedBy(x′, A)}x′∈D′  # and their new values x′
{WasDerivedFrom(x′, x)}       # data-data derivations
Using these assertions, one can reconstruct the derivations for any data item, from the initial
x to the final x′, along a whole sequence of operators, through simple traversal queries.</p>
        <p>[Fig. 2: graph views of the PROV assertions: (a) dataset-level derivation, D used by activity A, which generates D′; (b) item-level derivations x → x′1 . . . x′n; (c) the chain of iterations A1 . . . An over D0 . . . Dn; (d) the ActiveClean cycle of Assessment, Cleaning (via cleaning targets CTi), Training and Model Mi.]</p>
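        <p>The "simple traversal queries" can be illustrated with a minimal sketch in which 1-1 item derivations are stored as a dictionary (a hypothetical representation, not a full PROV store):</p>
        <preformat>
```python
def derivation_chain(derivations, item):
    """Traverse wasDerivedFrom assertions backwards, reconstructing the
    lineage of `item` across a whole sequence of operators. `derivations`
    maps each derived item to the item it was derived from."""
    chain = [item]
    while item in derivations:
        item = derivations[item]
        chain.append(item)
    return chain
```
        </preformat>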
        <p>In this simple example, the notation is used to represent item-wise transformations, i.e.,
by creating instances of the Used, WasGeneratedBy, and WasDerivedFrom relationships for
corresponding items x, x′. Note, however, that this can also be used, more generally, to capture
M-N transformations, for example to represent the effects of data imputation based on aggregate
statistics that affect multiple data points simultaneously. This can be achieved by adding
relationship instances as needed. For example, when a single value x ∈ D is used to produce
multiple values x′1, . . . , x′k, the derivation can be written as {WasDerivedFrom(x′i, x)}i:1,...,k.</p>
        <p>We can further account for the incremental nature of cleaning in ActiveClean, by breaking
down D into D0 . . . Dn and explicitly representing n iterations (Fig. 2(c)):</p>
        <p>{entity(Di)}i:0,...,n              # each Di is the result of one iteration
{activity(Ai)}i:1,...,n            # Ai represents one cleaning round
{WasDerivedFrom(Di, Di−1)}i:1,...,n  # data-data derivations for one iteration
{Used(Ai, Di−1)}i:1,...,n          # Ai consumes Di−1
{WasGeneratedBy(Di, Ai)}i:1,...,n  # Ai produces Di</p>
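        <p>The per-iteration assertions of Fig. 2(c) follow a regular pattern and can be generated mechanically; a tuple-based sketch (names illustrative):</p>
        <preformat>
```python
def activeclean_iterations(n):
    """Generate per-iteration assertions: each round i has an activity A_i
    that consumes dataset D_{i-1} and produces dataset D_i."""
    prov = [("entity", "D0")]
    for i in range(1, n + 1):
        prov += [
            ("entity", f"D{i}"),
            ("activity", f"A{i}"),
            ("used", f"A{i}", f"D{i-1}"),              # Ai consumes Di-1
            ("wasGeneratedBy", f"D{i}", f"A{i}"),      # Ai produces Di
            ("wasDerivedFrom", f"D{i}", f"D{i-1}"),    # one round's derivation
        ]
    return prov
```
        </preformat>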
        <p>A similar formal notation can be used to represent the selection operations in the second two
examples, both at a dataset level and at item level (details omitted for brevity).</p>
        <p>
          In previous work [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] we have shown how these representations can be automatically generated
in the common case where the operators are implemented in Python / Pandas / scikit-learn, and
the datasets are Pandas dataframes. We also presented a prototype-level tool [18] to show that item-level
provenance within each pair D, D′ can be accurately inferred by observing the differences in
schema and content between D and D′.
        </p>
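        <p>The inference step can be illustrated by a naive dataframe diff; this is a sketch in the spirit of that tool, not its actual implementation:</p>
        <preformat>
```python
import pandas as pd

def infer_item_provenance(df_in, df_out):
    """Diff an input/output dataframe pair, reporting schema changes
    (added/dropped columns), dropped rows, and changed cells, from which
    item-level derivations can be asserted."""
    dropped_cols = [c for c in df_in.columns if c not in df_out.columns]
    added_cols = [c for c in df_out.columns if c not in df_in.columns]
    dropped_rows = [i for i in df_in.index if i not in df_out.index]
    shared_cols = [c for c in df_in.columns if c in df_out.columns]
    shared_rows = [i for i in df_in.index if i in df_out.index]
    changed = []
    for i in shared_rows:
        for c in shared_cols:
            if df_in.at[i, c] != df_out.at[i, c]:
                changed.append((i, c, df_in.at[i, c], df_out.at[i, c]))
    return {"dropped_cols": dropped_cols, "added_cols": added_cols,
            "dropped_rows": dropped_rows, "changed_cells": changed}
```
        </preformat>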
        <p>These results effectively address the first two of the three questions above. Addressing the
third “why” question is harder, however, as it requires capturing the internal processing logic of
complex operators, at some level of abstraction. For instance, the why+provenance of a data item
that was cleaned using ActiveClean would include not only the before/after values, but also an
explanation of why that item was selected for cleaning. Similarly, the why+provenance of an
item that was included/discarded as part of a training set optimisation process would provide
insight into why that particular item was identified, for instance as being an easy/hard example,
or as being redundant.</p>
        <p>Our position is that this new level of detail will become increasingly relevant, as Data
Science pipelines expand their scope from well-understood operators to sophisticated black-box
algorithms that affect training sets in complex ways.</p>
        <p>
          This is the focus of the rest of the paper. In the next Section we summarise and extend
the high-level classification of operators from [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], as a starting point for discussing options to
describe more DCAI-specific patterns.
4. A classification of pipeline operators
Based on an analysis of the main Python libraries used in data science, we have observed
that the majority of pre-processing operations, including the most widely used in practice, can be
implemented by combining a rather small set of basic operators for data manipulation over
datasets, belonging to four main classes, as follows.
        </p>
        <p>Data reductions: operations that take as input a dataset D and reduce its size by eliminating
rows or columns from D. These are simple extensions of two well-known relational operators:
the (conditional) projection of D on a set of features in its schema S, given a boolean condition c, is the
dataset obtained from D by including only the columns of S that satisfy c; and the selection
of D, given a boolean condition c, is the dataset obtained from D by including the rows of D
satisfying c.</p>
        <p>Data augmentations: operations that take as input a dataset D on a schema S and increase the
size of D by adding rows or columns to D. These two operators allow the addition of columns
and rows to a dataset, respectively: the vertical augmentation of D using a function f over
a set X of features of S is obtained by adding to each row of D a new set of features whose
values are obtained by applying f to the features in X; and the horizontal augmentation of D
using an aggregative function g is obtained by adding one or more new rows to D, obtained by
first grouping over a set of features of D and then by applying g to each group.
Data transformation: the transformation of a set of features X of D using a function f is
obtained by applying f to all the values occurring in X.</p>
        <p>Data fusion: operations that take as input two datasets D1 and D2 and combine them into
a new dataset D: the join of the two datasets based on a boolean condition c is the dataset
obtained by applying a standard join operation (inner, (left/right/full) outer) based on the
condition c; the append of the two datasets is the dataset obtained by appending D2 to D1 and
possibly extending the result with nulls on the mismatching columns.</p>
        <p>Figure 1 reports some common data pre-processing operators and the way in which they can
be implemented by combining the above basic operators.</p>
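        <p>To make the classification concrete, here is a hedged sketch of how some of the basic operators could be rendered in Pandas (the operator names come from the classification above; the specific function signatures are our assumptions):</p>
        <preformat>
```python
import pandas as pd

# Selection: keep the rows of D satisfying a boolean condition c.
def selection(df, c):
    return df[df.apply(c, axis=1)]

# Vertical augmentation: add a new feature computed by f over existing features.
def vertical_augmentation(df, new_col, f, cols):
    out = df.copy()
    out[new_col] = df[cols].apply(f, axis=1)
    return out

# Transformation: apply f to all values of a set of features.
def transformation(df, cols, f):
    out = df.copy()
    out[cols] = out[cols].apply(lambda s: s.map(f))
    return out

# Append (data fusion): concatenate two datasets, padding mismatching
# columns with nulls.
def append(d1, d2):
    return pd.concat([d1, d2], ignore_index=True)
```
        </preformat>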
        <sec id="sec-2-3-1">
          <title>Figure 1: common pre-processing operations, the basic operators implementing them, and descriptions</title>
          <p>Feature Selection → Conditional projection. One or more features are removed.
Instance Drop → Selection. One or more records are removed.
Feature Augmentation → Vertical Augmentation. One or more features are added.
Space Transformation → Vertical Augmentation + Conditional projection. New features are derived from old features, which can be later dropped.
One-hot encoding → Vertical Augmentation + Conditional projection. New features are derived from old features, which can be later dropped.
Instance Generation → Horizontal Augmentation. One or more records are added.
Imputation, Data Type Conversion, Renaming, Normalization, Scaler, Encoding → Transformation. Values are modified using various functions.
Dimensionality Reduction → (basic-operator mapping not recoverable from the extracted text).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Towards Why+provenance</title>
      <p>Given the classification just presented, we can frame the general provenance granularity problem
in terms of two orthogonal dimensions: detail of data derivation, from dataset to item level, and
detail of processor behaviour, from class to internal logic (Fig. 3). When the processors are
described using the classification just given, this results in example provenance assertions
such as those in Sec. 3. The Figure summarises these for the transformation and selection processors in
our use cases, respectively. Representations become more challenging in the bottom part of
the Figure, where we aim to represent processor logic.</p>
      <p>When operating at the dataset level, using ActiveClean as an example, the provenance would
need to describe processor logic as consisting of three components: “assessment” (A), “cleaning”
(C), and “training” (T), and assert that input dataset D is assessed using A, generating a list of
data cleaning targets, which is used by C, producing a new version D′ of D. D′ is used by T to
train a model M, which in turn is again used by A in conjunction with D′ in the next iteration
of incremental cleaning. Note that PROV is able to support these relationships, after providing
the required breakdown of the whole strategy into processes and iterations (Fig. 2(d)).</p>
      <p>[Fig. 3: the two orthogonal dimensions of provenance granularity. Data detail ranges from dataset level (D → D′, D′ ⊆ D) to item level ({x → x′}x∈D, x′∈D′, {x ∈ D | c(x) = True}); processor detail ranges from class level (transformation, selection) to logic level, where the item-level questions “Why x?” (transformation, selection) and “Why x′?” (transformation, augmentation) arise.]</p>
      <p>Supporting the actual why questions at item level remains challenging, however. This is the
bottom right quadrant in the Figure, where for each x ∈ D, we ask “why did the assessor A
choose x for cleaning?” and “how did the cleaner C choose the replacement value?”</p>
      <p>The corresponding why questions for the selection processes are similar, namely “why did
x ∈ D get selected for removal from the training set?”. Note that here the full explanation may
be quite involved, as the processor logic involves learning an embedding for D, then clustering
in the embedded space, and finally choosing data points based on their distance from the cluster
centroids.</p>
      <p>While the problem of automatically generating suitable provenance at this level is not fully
addressed, it seems that two elements are needed. Firstly, a vocabulary and language, or
perhaps a small knowledge graph if relationships are included, to be able to express the concepts
mentioned in the provenance narrative above. Such a vocabulary would include a choice
of abstraction level, grounded in the baseline classification described in Sec. 4. Secondly,
a mechanism to generate provenance assertions that involves active participation from the
processors themselves, as simply observing the process execution from the outside would not
be enough.</p>
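      <p>As a purely illustrative sketch of the first element, a why+provenance assertion might pair a standard derivation with a reason drawn from a small controlled vocabulary (all terms and names below are hypothetical):</p>
      <preformat>
```python
# A hypothetical why+provenance assertion: a standard wasDerivedFrom
# derivation extended with a "reason" drawn from a small controlled
# vocabulary. Vocabulary terms are illustrative, not a proposed standard.
VOCABULARY = {"redundant", "easy-example", "hard-example", "dirty-feature", "noisy-label"}

def why_derivation(new_item, old_item, operator, reason, detail=None):
    if reason not in VOCABULARY:
        raise ValueError(f"unknown reason: {reason}")
    return {"wasDerivedFrom": (new_item, old_item),
            "activity": operator,
            "reason": reason,       # why this item was chosen
            "detail": detail}       # e.g. distance from a cluster centroid
```
      </preformat>
      <p>Crucially, the reason and detail fields would have to be supplied by the processor itself, in line with the active participation argued for above.</p>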
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions</title>
      <p>In this position paper we started from the observation that data processing workflows for
Data Science applications now include sophisticated processors, whose operations are often
interleaved with model training. We have suggested increasingly detailed levels of provenance
representation, aimed not only at recording granular data derivations, but also at explaining why
each of these derivations occurred. At the deepest level, this may require processors that actively
interact with the provenance subsystem during execution, providing the necessary details.</p>
      <p>Experimenting with provenance capture models at this level is work in progress. Importantly,
we believe this research should be driven by user studies to determine, for different stakeholders,
what kinds of explanations are actually expected and desirable.</p>
      <p>[14] J. Wu, D. Amodei, Scaling Laws for Neural Language Models, 2020. URL: http://arxiv.org/abs/2001.08361. doi:10.48550/arXiv.2001.08361. arXiv:2001.08361 [cs, stat].
[15] A. Abbas, K. Tirumala, D. Simig, S. Ganguli, A. S. Morcos, SemDeDup: Data-efficient learning at web-scale through semantic deduplication, 2023. URL: http://arxiv.org/abs/2303.09540. arXiv:2303.09540 [cs].
[16] B. Sorscher, R. Geirhos, S. Shekhar, S. Ganguli, A. Morcos, Beyond neural scaling laws: beating power law scaling via data pruning, Advances in Neural Information Processing Systems 35 (2022) 19523–19536. URL: https://proceedings.neurips.cc/paper_files/paper/2022/hash/7b75da9b61eda40fa35453ee5d077df6-Abstract-Conference.html.
[17] L. Moreau, P. Missier, K. Belhajjame, R. B’Far, J. Cheney, S. Coppens, S. Cresswell, Y. Gil, P. Groth, G. Klyne, T. Lebo, J. McCusker, S. Miles, J. Myers, S. Sahoo, C. Tilmes, PROV-DM: The PROV Data Model, Technical Report, World Wide Web Consortium, 2012. URL: http://www.w3.org/TR/prov-dm/.
[18] A. Chapman, P. Missier, L. Lauro, R. Torlone, DPDS: Assisting Data Science with Data Provenance, PVLDB 15 (2022) 3614–3617. URL: https://vldb.org/pvldb/vol15/p3614-torlone.pdf. doi:10.14778/3554821.3554857.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang-Chiew</surname>
          </string-name>
          ,
          <article-title>Why and where: A characterization of data provenance</article-title>
          , in: J.
          <string-name>
            <surname>Van den Bussche</surname>
          </string-name>
          , V. Vianu (Eds.),
          <source>Database Theory - ICDT</source>
          <year>2001</year>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2001</year>
          , pp.
          <fpage>316</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          , W.-C. Tan, Provenance in databases: Why, how, and where,
          <source>Foundations and Trends® in Databases 1</source>
          (
          <year>2009</year>
          )
          <fpage>379</fpage>
          -
          <lpage>474</lpage>
          . URL: http://dx.doi.org/10.1561/1900000006. doi:10.1561/1900000006.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karvounarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>Provenance semirings</article-title>
          ,
          <source>in: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems</source>
          , PODS '07, Association for Computing Machinery, New York, NY, USA,
          <year>2007</year>
          , p.
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          . URL: https://doi.org/10.1145/1265530.1265535. doi:10.1145/1265530.1265535.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Amsterdamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>Provenance for aggregate queries</article-title>
          ,
          <source>in: Proceedings of the 30th ACM SIGMOD Symposium on Principles of Database Systems</source>
          , PODS '11, Association for Computing Machinery, New York, NY, USA,
          <year>2011</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>164</lpage>
          . URL: https://doi.org/10.1145/1989284.1989302. doi:10.1145/1989284.1989302.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ocaña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Braganholo</surname>
          </string-name>
          ,
          <article-title>Analyzing provenance across heterogeneous provenance graphs</article-title>
          ,
          <source>in: Procs. 6th International Provenance and Annotation Workshop</source>
          , IPAW 2016, McLean, VA, USA, volume
          <volume>9672</volume>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>70</lpage>
          . doi:10.1007/978-3-319-40593-3_5.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          ,
          <article-title>Understanding provenance black boxes</article-title>
          ,
          <source>Distributed and Parallel Databases</source>
          <volume>27</volume>
          (
          <year>2010</year>
          )
          <fpage>139</fpage>
          -
          <lpage>167</lpage>
          . URL: https://doi.org/10.1007/s10619-009-7058-3. doi:10.1007/s10619-009-7058-3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stamatogiannakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Athanasopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Prov2r: Practical provenance analysis of unstructured processes</article-title>
          ,
          <source>ACM Trans. Internet Technol.</source>
          <volume>17</volume>
          (
          <year>2017</year>
          ). URL: https://doi.org/10.1145/3062176. doi:10.1145/3062176.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Amsterdamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>Putting lipstick on pig: Enabling database-style workflow provenance</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>5</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lauro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Supporting better insights of data science pipelines with fine-grained provenance</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          . (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3644385. doi:10.1145/3644385. Just Accepted.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>ActiveClean: interactive data cleaning for statistical modeling</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>9</volume>
          (
          <year>2016</year>
          )
          <fpage>948</fpage>
          -
          <lpage>959</lpage>
          . URL: https://doi.org/10.14778/2994509.2994514. doi:10.14778/2994509.2994514.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Neutatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>From Cleaning before ML to Cleaning for ML</article-title>
          ,
          <source>IEEE Data Eng. Bull.</source>
          <volume>44</volume>
          (
          <year>2021</year>
          )
          <fpage>24</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>ActiveClean: interactive data cleaning for statistical modeling</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>9</volume>
          (
          <year>2016</year>
          )
          <fpage>948</fpage>
          -
          <lpage>959</lpage>
          . URL: https://dl.acm.org/doi/10.14778/2994509.2994514. doi:10.14778/2994509.2994514.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Banbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Karlaš</surname>
          </string-name>
          ,
          <string-name>
            <surname>Rojas</surname>
          </string-name>
          , et al.,
          <source>DataPerf: Benchmarks for Data-Centric AI Development</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2207.10062. doi:10.48550/arXiv.2207.10062, arXiv:2207.10062 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      </ref>
    </ref-list>
  </back>
</article>