From why-provenance to why+provenance: Towards addressing deep data explanations in Data-Centric AI

Paolo Missier1,*, Riccardo Torlone2
1 University of Birmingham, School of Computer Science, Birmingham, UK
2 Università Roma Tre, Dipartimento di Ingegneria, Roma, Italy
* Corresponding author. p.missier@bham.ac.uk (P. Missier); riccardo.torlone@uniroma3.it (R. Torlone)
SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy

Abstract
In this position paper we discuss the problem of exploiting data provenance to provide explanations in data-centric AI processes, where the emphasis of model development is placed on the quality of data. In particular, we show how a classification of the main operators used in the data preparation phase provides an effective and powerful means for producing increasingly detailed explanations at the needed level of data granularity.

Keywords
Data-centric AI, Data Engineering pipeline, Data Provenance

1. Introduction

In provenance theory, the notion of why-provenance has been introduced primarily in the context of relational models and algebra, with reference to the set of tuples in source relations that contribute to producing the results of a (SQL) query. These have been known as witness tuples [1, 2] and define the lineage of a tuple that appears in the result of a query. The clear semantics associated with relational algebra operators made it possible to develop formal, elegant models for representing why-provenance, and its extension to how-provenance [3]. Further extensions have subsequently been developed, for instance to capture the provenance of the results of aggregate queries [4].

A parallel strand of research focused on the definition of provenance as it applies to datasets that undergo a series of transformations, where arbitrary operators are typically arranged into a Directed Acyclic Graph topology. Provenance in this setting can itself be expressed as a graph of data derivations, where each derivation is mediated by one operator. This became known as "coarse-grained" provenance [5], in contrast to the tuple-level provenance grounded in the relational framework. As the term suggests, the "black-box" nature of the operators makes granular provenance hard to determine directly [6]. This is because why- and how-provenance are grounded in the precise semantics of the query operators, which is not available when using arbitrary data transformers. The result is a provenance graph that is limited to dataset-level derivations, i.e., derivations mediated by each of the processors. Existing approaches that attempt to circumvent this limitation and reconstruct "high-fidelity" provenance using system-level events recorded during execution ("provenance record and replay" [7]) are designed to apply to completely unstructured processes and may incur significant computational overhead.

In this position paper, we focus on the provenance of data transformations that occur in the context of Data Science, where dataflow-structured data analytics pipelines are common, and where their elements are not simple relational operators (as was assumed e.g. in [8]), but are not completely arbitrary either.
In this middle-ground setting, we consider the problem of using provenance to generate explanations that justify and account for the observed transformations. Looking at the semantics of the processors involved in the pipeline reveals a range of automatic provenance-generation capabilities, with relational how-provenance at one extreme and arbitrary data manipulation code at the other. We suggest that an interesting region within this spectrum is occupied by a new generation of operators, defined in the context of so-called Data-Centric AI (DCAI). These are sophisticated operators, specifically designed to produce training sets from raw datasets, in which data processing is often interleaved with model training in an iterative fashion. While their semantics is not formally defined, they tend to fall into a few categories, for example data transformations such as incremental data cleaning, data augmentation, for instance through upsampling algorithms, and data selection, including feature selection and the elimination of redundant data points. In the rest of this paper, we present exemplar use cases of DCAI processes taken from the recent literature, and propose a categorisation of operators that extends our previous work [9], showing how the use cases relate to these categories. We then discuss options for generating "provenance narratives" to describe these operators' behaviour. Somewhat provocatively, we refer to these narratives as why+provenance, to indicate that they can be used to answer "why"-type questions with reference to complex but well-defined data transformations.

2. Example use cases

We present four use cases, where data transformation and data selection exhibit two kinds of complexity: either they are entangled with the modelling itself, or they implement some bespoke data manipulation strategy that is not captured by typical data processing operators.

2.1. Model-driven incremental data cleaning

ActiveClean [10] is one of several incremental data cleaning algorithms, surveyed in [11], targeted specifically at training sets. It provides a good example of an iterative approach designed to progressively clean a "dirty" training set 𝐷 by balancing the cost of selecting and cleaning items in 𝐷 with the benefit of learning a usable model, despite it being trained on a training set that is still partially dirty [12]. A "dirty" multidimensional data point is one where one or more of its components is inaccurate, for instance a wrong numerical figure or a misspelled name. As observed in [11], such dirty data lead to a sub-optimal training process, where the model parameters are optimised, but for a loss function that corresponds to a misleading training set, potentially rendering the model useless in practice. Cleaning is generally an expensive operation, especially when it requires manual inspection. The idea behind ActiveClean is to start by selecting a subset of dirty data points from 𝐷, manually clean them to generate 𝐷1, and use 𝐷1 to retrain the model. This is repeated until a stop condition is reached, producing a sequence 𝐷 → 𝐷1, 𝐷2, . . . , 𝐷′ of training sets. There is an assumption that the dirty/clean status of a data point can be automatically detected, and that one can optimise the procedure by choosing the data points to be cleaned that most affect the model, with the goal of minimising the number of iterations.
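To make the iterative structure concrete, the following is a minimal Python sketch of an ActiveClean-style loop, not the published ActiveClean algorithm: the clean_fn callable stands in for manual cleaning, and prioritising misclassified dirty records is only a crude stand-in for ActiveClean's model-driven selection of the most impactful points.

import numpy as np
from sklearn.linear_model import SGDClassifier

def active_clean_sketch(X, y, dirty_mask, clean_fn, batch_size=50, max_iters=10):
    """Iteratively clean a dirty training set while retraining the model.

    X, y       : feature matrix and labels (numpy arrays)
    dirty_mask : boolean array flagging the records assumed to be dirty
    clean_fn   : callable standing in for manual cleaning of a batch of records
    """
    model = SGDClassifier(max_iter=1000).fit(X, y)   # initial model, trained on dirty data
    versions = [(X.copy(), y.copy())]                # D, D1, D2, ... : one snapshot per iteration

    for _ in range(max_iters):
        dirty_idx = np.flatnonzero(dirty_mask)
        if dirty_idx.size == 0:                      # stop condition: nothing left to clean
            break
        # prioritise dirty records the current model gets wrong, as a crude proxy
        # for "records whose cleaning is most likely to change the model"
        wrong_first = np.argsort(model.predict(X[dirty_idx]) == y[dirty_idx])
        batch = dirty_idx[wrong_first[:batch_size]]

        X[batch] = clean_fn(X[batch])                # "manual" cleaning of the selected batch
        dirty_mask[batch] = False
        model = SGDClassifier(max_iter=1000).fit(X, y)   # retrain on the partially cleaned set
        versions.append((X.copy(), y.copy()))

    return model, versions

The list of snapshots returned by the loop corresponds to the sequence 𝐷 → 𝐷1, . . . , 𝐷′ of training set versions whose provenance we discuss in Sec. 3.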
2.2. Training set debugging

This use case is similar to the previous one, but here we assume that the ground truth labels associated with some of the data points, as opposed to their features, may be incorrect. The challenge, as presented by the DataPerf group [13] (https://www.dataperf.org/) as part of ML Commons (https://github.com/mlcommons), is to devise a strategy by which a sufficiently accurate model can be trained by correcting the fewest possible labels. Specifically, the challenge used annotated image data from the OpenImage V7 dataset (https://storage.googleapis.com/openimages/web), which contains millions of images with various levels of annotation (bounding boxes, relationships, image-level and point-level labels, etc.). Given a training set 𝐷𝑡𝑟 taken from OpenImage with perfect annotations and a predefined classification task 𝑇, a model 𝑀(𝐷𝑡𝑟) is trained and its performance 𝑃 (according to some agreed-upon metric) is used as a benchmark for the challenge. Some of the labels in 𝐷𝑡𝑟 are then randomly corrupted, leading to a noisy set 𝐷𝑛. The performance 𝑃𝑛 of a model obtained by training a classifier for 𝑇 using 𝐷𝑛 will in general be sub-optimal, 𝑃𝑛 < 𝑃. The challenge is to devise a strategy for selecting the smallest possible subset 𝐷𝑛′ ⊂ 𝐷𝑛 such that, by correcting the labels in 𝐷𝑛′ and then retraining, the performance 𝑃′ of the new model approximates 𝑃 within some predefined threshold 𝜏: 𝑃 − 𝑃′ < 𝜏.

2.3. Training set optimisation

The next two use cases are motivated by the well-known "power laws" observation, common in deep learning applications, that model performance improves with training set size, with the test loss decreasing according to a power law [14]. However, increasing the training set size ultimately incurs diminishing returns in terms of loss reduction. This motivates trying to prune 𝐷 to reduce its size. Two approaches stand out, both proposed by the same researchers. Firstly, in [15] the idea is to map 𝐷 to an embedded space, using pre-trained foundation models, and then cluster all data points in that space using a standard clustering algorithm (k-means). Neighbouring points within each cluster (according to some distance metric) are considered redundant and candidates for pruning, resulting in the final 𝐷′. The second approach [16] is based on the concept of data points that are easy or hard to learn from. Two main results underpin the pruning method. Firstly, the authors claim that the difficulty of each data point is proportional to the distance of that point from the centroid of its k-means cluster, where the clustering is performed in an embedded space. Once the points are ranked in terms of their difficulty, the second claim is that hard examples should be preserved for large training sets and, conversely, easier ones should be preferred for smaller training sets (details are in the paper). It should be noted that the experiments supporting this claim were only performed using a dedicated self-supervised model pre-trained on ImageNet, and may not generalise well to other contexts.
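To illustrate the kind of logic these pruning operators embed, here is a minimal Python sketch of clustering-based pruning in an embedded space, inspired by [15, 16] but not reproducing either method exactly; the number of clusters, the keep fraction, and the use of centroid distance as a difficulty score are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def prune_by_clusters(embeddings, keep_fraction=0.7, n_clusters=100, keep_hard=True):
    """Rank points by their distance to the centroid of their k-means cluster
    (a proxy for example difficulty) and keep a fraction of them.

    embeddings    : (n_points, dim) array produced by a pre-trained encoder
    keep_fraction : fraction of the training set to retain
    keep_hard     : keep hard examples (far from centroids) for large training sets,
                    easy ones (close to centroids) for small ones
    Returns the indices of the retained points and the per-point difficulty scores,
    which could later be attached to provenance records as explanations (Sec. 5).
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    own_centroid = km.cluster_centers_[km.labels_]
    difficulty = np.linalg.norm(embeddings - own_centroid, axis=1)

    order = np.argsort(difficulty)                   # easiest first
    if keep_hard:
        order = order[::-1]                          # hardest first
    kept = np.sort(order[: int(len(embeddings) * keep_fraction)])
    return kept, difficulty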
Context | Type of operation | Strategy | Data processing and model training
ActiveClean | Select items from the training set for manual cleaning; item transformation: x → x′ | Iterative batch cleaning strategy driven by SGD | ActiveClean processing is interleaved with model training; both stop at the same time
Training set debugging | Select items from the training set for label correction; item transformation: y → y′ | Aims to rank data points and minimize manual corrections | The re-labelling strategy is incremental and interleaved with model retraining; however, the winning strategy was not published and thus its generalizability is not clear
Training set optimization, reducing redundancy by removing similar points | Prune items from the training set; filtering: remove(y) | Cluster data points in embedded space, select representatives from each cluster | Training set pruning happens before model training
Training set optimization, pruning hard/easy examples | Prune items from the training set | Identify simple/hard examples, sample from those depending on training set size | Training set pruning happens before model training

Figure 1: Summary of data interventions for the example use cases

3. Representing provenance at multiple levels of detail

With reference to the examples just presented, we would like to provide provenance support for answering the following types of questions. Firstly, which data transformations were applied to the raw input dataset(s) to generate the final training set used for modelling? Secondly, which of the individual data items were affected by each of the transformations, and what was the effect? And thirdly, why was a specific data item chosen for transformation or inclusion/exclusion, and, in the case of transformations, how was a specific new value chosen? These questions address issues of reproducibility, specifically when the operator is part of a processing pipeline, and explainability, both at the level of the entire training set and at the level of individual data points.

Viewed at a high level, the examples fall into the broad categories of data transformation, 𝐷 → 𝐷′, and selection, 𝐷′ ⊂ 𝐷. At this level, it is straightforward to record the provenance of 𝐷′ as a derivation from 𝐷, mediated by some abstract activity 𝐴 that represents the cleaning or pruning operation. Using the formal notation provided by the PROV data model [17], this can be written simply as:

activity(𝐴)               # an activity represents a data operator
entity(𝐷), entity(𝐷′)     # entities represent datasets
Used(𝐴, 𝐷)                # 𝐴 consumes input dataset 𝐷
WasGeneratedBy(𝐷′, 𝐴)     # 𝐴 produces output dataset 𝐷′
WasDerivedFrom(𝐷′, 𝐷)     # data-data derivation

Fig. 2(a) shows a corresponding graph representation for these derivations.

Figure 2: Provenance patterns for the examples in the text (graph representations (a)-(d) of the PROV assertions discussed in this section)
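These dataset-level assertions can also be produced programmatically. The following is a minimal sketch using the prov Python package (assuming it is available); the namespace and identifiers are illustrative, and this is not the capture mechanism of [9, 18].

from prov.model import ProvDocument

def dataset_level_provenance(op_name, input_id, output_id):
    """Build the PROV assertions of Fig. 2(a): D' was generated by activity A, which used D."""
    doc = ProvDocument()
    doc.add_namespace("ex", "http://example.org/dcai#")   # illustrative namespace

    a = doc.activity(f"ex:{op_name}")        # the data operator (e.g. a cleaning step)
    d_in = doc.entity(f"ex:{input_id}")      # input dataset D
    d_out = doc.entity(f"ex:{output_id}")    # output dataset D'

    doc.used(a, d_in)                        # A consumed D
    doc.wasGeneratedBy(d_out, a)             # A produced D'
    doc.wasDerivedFrom(d_out, d_in)          # data-data derivation
    return doc

if __name__ == "__main__":
    print(dataset_level_provenance("ActiveClean", "D", "D_prime").get_provn())

Serialising the document (e.g. with get_provn()) yields the assertions listed above in PROV-N syntax.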
This high-level provenance is not very informative, however, if we want to account for how 𝐴 operates on each data item. In the first two examples, 𝐴 performs 1-1, item-wise transformations, i.e., 𝑥 ∈ 𝐷 → 𝑥′ ∈ 𝐷′ where either 𝑥′ = 𝑥 or 𝑥′ is a clean version of 𝑥. Our PROV notation can be extended to account for this more granular level, as follows (see also Fig. 2(b)):

{entity(𝑥)}𝑥∈𝐷                  # items in 𝐷
{entity(𝑥′)}𝑥′∈𝐷′               # items in 𝐷′
activity(𝐴)                     # 𝐴 is still an atomic operator
{Used(𝐴, 𝑥)}𝑥∈𝐷                 # the items 𝑥 that have been affected by 𝐴
{WasGeneratedBy(𝑥′, 𝐴)}𝑥′∈𝐷′    # and their new values 𝑥′
{WasDerivedFrom(𝑥′, 𝑥)}          # data-data derivations

Using these assertions, one can reconstruct the derivations for any data item, from the initial 𝐷 to the final 𝐷′, along a whole sequence of operators, through simple traversal queries. In this simple example, the notation is used to represent item-wise transformations, i.e., by creating instances of the Used, WasGeneratedBy, and WasDerivedFrom relationships for each pair of corresponding items 𝑥, 𝑥′. Note, however, that this can also be used, more generally, to capture M-N transformations, for example to represent the effects of data imputation based on aggregate statistics that affect multiple data points simultaneously. This can be achieved by adding relationship instances as needed. For example, when a single value 𝑦 ∈ 𝐷 is used to produce multiple values 𝑥′1, . . . , 𝑥′𝑛, the derivation can be written as {WasDerivedFrom(𝑥′𝑖, 𝑦)}𝑖:1,𝑛.

We can further account for the incremental nature of cleaning in ActiveClean, by breaking 𝐴 down into 𝐴1 . . . 𝐴𝑛 and explicitly representing 𝑛 iterations (Fig. 2(c)):

{entity(𝐷𝑖)}𝑖:0,𝑛                    # each 𝐷𝑖 is the result of one iteration
{activity(𝐴𝑖)}𝑖:1,𝑛                  # 𝐴𝑖 represents one cleaning round
{WasDerivedFrom(𝐷𝑖, 𝐷𝑖−1)}𝑖:1,𝑛      # data-data derivations for one iteration
{Used(𝐴𝑖, 𝐷𝑖−1)}𝑖:1,𝑛                # 𝐴𝑖 consumes 𝐷𝑖−1
{WasGeneratedBy(𝐷𝑖, 𝐴𝑖)}𝑖:1,𝑛        # 𝐴𝑖 produces 𝐷𝑖

A similar formal notation can be used to represent the selection operations in the last two examples, both at the dataset level and at the item level (details omitted for brevity). In previous work [9] we have shown how these representations can be automatically generated in the common case where the operators are implemented in Python / Pandas / scikit-learn, and the datasets 𝐷 are Pandas dataframes. We also presented a prototype-level tool [18] to show that item-level provenance within each pair 𝐷, 𝐷′ can be accurately inferred by observing the differences in schema and content between 𝐷 and 𝐷′. These results effectively address the first two of the three questions above.
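To illustrate this observation-based inference of item-level provenance, here is a minimal sketch (not the DPDS tool of [18]) that diffs two dataframe versions sharing a row index and emits item-level derivation records; the record format and the treatment of dropped rows and columns are illustrative assumptions.

import pandas as pd

def infer_item_provenance(df_before, df_after, activity="A"):
    """Infer item-level derivations by comparing two versions of a dataframe.

    Assumes the two versions share a row index, so that changed cells can be matched.
    Returns plain dictionaries; in practice these would be turned into PROV statements.
    """
    records = []
    shared_cols = df_before.columns.intersection(df_after.columns)
    shared_idx = df_before.index.intersection(df_after.index)

    # cell-level transformations: x -> x'
    for col in shared_cols:
        before = df_before.loc[shared_idx, col]
        after = df_after.loc[shared_idx, col]
        changed = before.ne(after) & ~(before.isna() & after.isna())
        for i in changed[changed].index:
            records.append({"relation": "WasDerivedFrom", "item": (i, col),
                            "old": before.loc[i], "new": after.loc[i],
                            "activity": activity})

    # dropped rows (selection) and dropped columns (projection)
    for i in df_before.index.difference(df_after.index):
        records.append({"relation": "WasInvalidatedBy", "row": i, "activity": activity})
    for col in df_before.columns.difference(df_after.columns):
        records.append({"relation": "WasInvalidatedBy", "column": col, "activity": activity})
    return records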
Addressing the third, "why" question is harder, however, as it requires capturing the internal processing logic of complex operators, at some level of abstraction. For instance, the why+provenance of a data item that was cleaned using ActiveClean would include not only the before/after values, but also an explanation of why that item was selected for cleaning. Similarly, the why+provenance of an item that was included or discarded as part of a training set optimisation process would provide insight into why that particular item was identified, for instance as being an easy/hard example, or as being redundant. Our position is that this new level of detail will become increasingly relevant, as Data Science pipelines expand their scope from well-understood operators to sophisticated black-box algorithms that affect training sets in complex ways. This is the focus of the rest of the paper. In the next Section we summarise and extend the high-level classification of operators from [9], as a starting point for discussing options to describe more DCAI-specific patterns.

4. A classification of pipeline operators

Based on an analysis of the main Python libraries used in data science, we have observed that the majority of pre-processing operations, including those most used in practice, can be implemented by combining a rather small set of basic data manipulation operators over datasets, belonging to four main classes, as follows.

Data reductions: operations that take as input a dataset 𝐷 and reduce its size by eliminating rows or columns from 𝐷. These are simple extensions of two well-known relational operators: the (conditional) projection of 𝐷, given a boolean condition 𝐶 over its features, is the dataset obtained from 𝐷 by including only the columns of 𝐷 that satisfy 𝐶; and the selection of 𝐷, given a boolean condition 𝐶, is the dataset obtained from 𝐷 by including only the rows of 𝐷 satisfying 𝐶.

Data augmentations: operations that take as input a dataset 𝐷 over a schema 𝑆 and increase the size of 𝐷 by adding rows or columns to 𝐷. The following two operators allow the addition of columns and rows, respectively: the vertical augmentation of 𝐷 with a new set of features 𝑌, using a function 𝑓 over a set 𝑋 of features of 𝐷, is obtained by adding to each row of 𝐷 the new features in 𝑌, whose values are obtained by applying 𝑓 to the features in 𝑋; and the horizontal augmentation of 𝐷 using an aggregative function 𝑓 is obtained by adding to 𝐷 one or more new rows, obtained by first grouping over a set of features of 𝐷 and then applying 𝑓 to each group.

Data transformation: the transformation of a set of features 𝑋 of 𝐷 using a function 𝑓 is obtained by applying 𝑓 to all the values occurring in 𝑋.

Data fusion: operations that take as input two datasets 𝐷1 and 𝐷2 and combine them into a new dataset 𝐷: the join of the two datasets based on a boolean condition 𝐶 is the dataset obtained by applying a standard join operation (inner, or left/right/full outer) based on the condition 𝐶; the append of the two datasets is the dataset obtained by appending 𝐷2 to 𝐷1, possibly extending the result with nulls on the mismatching columns.

Table 1 reports some common data pre-processing operations and the way in which they can be implemented by combining the above basic operators.

Pre-processing Operation | Basic Operators | Description
Feature Selection | Conditional projection | One or more features are removed.
Instance Drop | Selection | One or more records are removed.
Feature Augmentation | Vertical Augmentation | One or more features are added.
Space Transformation | Vertical Augmentation + Conditional projection | New features are derived from old features, which can later be dropped.
One-hot Encoding | Vertical Augmentation + Conditional projection | New features are derived from old features, which can later be dropped.
Instance Generation | Horizontal Augmentation | One or more records are added.
Imputation, Data Type Conversion, Renaming, Normalization, Scaler, Encoding | Transformation | Values are modified using various functions.
Dimensionality Reduction | Transformation + Conditional projection | Some features are modified, others are removed.
Integration, Cartesian Product | Join | Two or more datasets are combined based on a common attribute or key.
Concatenate | Append | Two datasets are combined by taking their union.

Table 1: Common data pre-processing operations and the corresponding (combinations of) basic operators
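To show how the four classes map onto dataframe code, here is a minimal Pandas sketch; it is a plausible rendering of the basic operators rather than the exact definitions used in [9], and all function and column arguments are illustrative.

import pandas as pd

# Data reductions
def conditional_projection(df, col_condition):
    """Keep only the columns that satisfy the condition."""
    return df[[c for c in df.columns if col_condition(c)]]

def selection(df, row_condition):
    """Keep only the rows that satisfy the condition."""
    return df[df.apply(row_condition, axis=1)]

# Data augmentations
def vertical_augmentation(df, new_feature, f, features):
    """Add a new feature computed row-wise from existing features."""
    out = df.copy()
    out[new_feature] = df[features].apply(f, axis=1)
    return out

def horizontal_augmentation(df, group_features, agg):
    """Add new rows obtained by grouping and applying an aggregative function."""
    new_rows = df.groupby(group_features, as_index=False).agg(agg)
    return pd.concat([df, new_rows], ignore_index=True)

# Data transformation
def transformation(df, features, f):
    """Apply f to every value of the selected features."""
    out = df.copy()
    out[features] = out[features].apply(lambda col: col.map(f))
    return out

# Data fusion
def join(df1, df2, on, how="inner"):
    """Combine two datasets on a common attribute or key."""
    return df1.merge(df2, on=on, how=how)

def append(df1, df2):
    """Append df2 to df1, padding mismatching columns with nulls."""
    return pd.concat([df1, df2], ignore_index=True)

Each operator consumes a dataframe and returns a new one, which is what makes it possible to infer item-level provenance by diffing the input and output, as discussed in Sec. 3.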
5. Towards Why+provenance

Given the classification just presented, we can frame the general provenance granularity problem in terms of two orthogonal dimensions: the detail of data derivations, from dataset level to item level, and the detail of processor behaviour, from its class to its internal logic (Fig. 3). When the processors are described using the classification just given, the result is provenance assertions like the ones shown in Sec. 3. The Figure summarises these for the transformation and selection processors in our use cases, respectively.

Figure 3: Data and Processor details and corresponding provenance (dimensions: data detail, dataset ⟶ item; processor detail, class level ⟶ logic level)

Representations become more challenging in the lower part of the Figure, where we aim to represent processor logic. When operating at the dataset level, using ActiveClean as an example, the provenance would need to describe the processor logic as consisting of three components, "assessment" (𝐴), "cleaning" (𝐶), and "training" (𝑇), and assert that the input dataset 𝐷 is assessed using 𝐴, generating a list of data cleaning targets, which is used by 𝐶, producing a new version 𝐷′ of 𝐷. 𝐷′ is used by 𝑇 to train a model 𝑀, which in turn is used again by 𝐴, in conjunction with 𝐷′, in the next iteration of incremental cleaning. Note that PROV is able to support these relationships, after providing the required breakdown of the whole strategy into processes and iterations (Fig. 2(d)).

Supporting the actual why questions at the item level remains challenging, however. This is the bottom-right quadrant in the Figure, where for each 𝑥 ∈ 𝐷 we ask "why did the assessor 𝐴 choose 𝑥 for cleaning?" and "how did the cleaner 𝐶 choose the replacement value?". The corresponding why questions for the selection processes are similar, namely "why did 𝑥 ∈ 𝐷 get selected for removal from the training set?". Note that here the full explanation may be quite involved, as the processor logic involves learning an embedding for 𝐷, then clustering in the embedded space, and finally choosing data points based on their distance from the cluster centroids. While the problem of automatically generating suitable provenance at this level is not fully addressed, it seems that two elements are needed. Firstly, a vocabulary and language, or perhaps a small knowledge graph if relationships are included, to be able to express the concepts mentioned in the provenance narrative above. Such a vocabulary would include a choice of abstraction level, grounded in the baseline classification described in Sec. 4. Secondly, a mechanism to generate provenance assertions that involves active participation from the processors themselves, as simply observing the process execution from the outside would not be enough.
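As a concrete, and entirely hypothetical, illustration of what such active participation might look like, a pruning operator could emit one record per discarded item, carrying the reason and the evidence behind the decision; the record structure and vocabulary below are illustrative assumptions, not a proposed standard.

from dataclasses import dataclass, field

@dataclass
class WhyProvRecord:
    """One item-level derivation, annotated with the processor's reason for it."""
    item_id: str
    relation: str                 # e.g. "WasInvalidatedBy" for a pruned item
    activity: str                 # the operator that made the decision
    reason: str                   # human-readable explanation ("why")
    evidence: dict = field(default_factory=dict)   # the numbers behind the reason

def prune_with_provenance(items, difficulty, threshold, activity="EmbeddingPruner"):
    """Prune easy items (below a difficulty threshold) and emit why+provenance records."""
    kept, prov = [], []
    for item_id, score in zip(items, difficulty):
        if score < threshold:
            prov.append(WhyProvRecord(
                item_id=item_id,
                relation="WasInvalidatedBy",
                activity=activity,
                reason="item judged redundant/easy: close to its cluster centroid",
                evidence={"distance_to_centroid": float(score), "threshold": threshold}))
        else:
            kept.append(item_id)
    return kept, prov

Records of this kind could be attached as additional attributes to the item-level PROV statements of Sec. 3, turning plain derivations into why+provenance.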
6. Conclusions

In this position paper we started from the observation that data processing workflows for Data Science applications now include sophisticated processors, whose operations are often interleaved with model training. We have suggested increasingly detailed levels of provenance representation, aimed not only at recording granular data derivations, but also at explaining why each of these derivations occurred. At the deepest level, this may require processors that actively interact with the provenance subsystem during execution, providing the necessary details. Experimenting with provenance capture models at this level is work in progress. Importantly, we believe this research should be driven by user studies to determine, for different stakeholders, what kinds of explanations are actually expected and desirable.

References

[1] P. Buneman, S. Khanna, T. Wang-Chiew, Why and where: A characterization of data provenance, in: J. Van den Bussche, V. Vianu (Eds.), Database Theory — ICDT 2001, Springer Berlin Heidelberg, Berlin, Heidelberg, 2001, pp. 316–330.
[2] J. Cheney, L. Chiticariu, W.-C. Tan, Provenance in databases: Why, how, and where, Foundations and Trends® in Databases 1 (2009) 379–474. URL: http://dx.doi.org/10.1561/1900000006. doi:10.1561/1900000006.
[3] T. J. Green, G. Karvounarakis, V. Tannen, Provenance semirings, in: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 31–40. URL: https://doi.org/10.1145/1265530.1265535. doi:10.1145/1265530.1265535.
[4] Y. Amsterdamer, D. Deutch, V. Tannen, Provenance for aggregate queries, in: Proceedings of the 30th ACM SIGMOD Symposium on Principles of Database Systems, PODS '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 153–164. URL: https://doi.org/10.1145/1989284.1989302. doi:10.1145/1989284.1989302.
[5] W. Oliveira, P. Missier, K. Ocaña, D. de Oliveira, V. Braganholo, Analyzing provenance across heterogeneous provenance graphs, in: Procs. 6th International Provenance and Annotation Workshop, IPAW 2016, McLean, VA, USA, volume 9672, Springer, 2016, pp. 57–70. doi:10.1007/978-3-319-40593-3_5.
[6] A. Chapman, H. V. Jagadish, Understanding provenance black boxes, Distributed and Parallel Databases 27 (2010) 139–167. URL: https://doi.org/10.1007/s10619-009-7058-3. doi:10.1007/s10619-009-7058-3.
[7] M. Stamatogiannakis, E. Athanasopoulos, H. Bos, P. Groth, Prov2r: Practical provenance analysis of unstructured processes, ACM Trans. Internet Technol. 17 (2017). URL: https://doi.org/10.1145/3062176. doi:10.1145/3062176.
[8] Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, V. Tannen, Putting lipstick on pig: Enabling database-style workflow provenance, Proceedings of the VLDB Endowment 5 (2011).
[9] A. Chapman, L. Lauro, P. Missier, R. Torlone, Supporting better insights of data science pipelines with fine-grained provenance, ACM Trans. Database Syst. (2024). URL: https://doi.org/10.1145/3644385. doi:10.1145/3644385. Just accepted.
[10] S. Krishnan, J. Wang, E. Wu, M. J. Franklin, K. Goldberg, ActiveClean: interactive data cleaning for statistical modeling, Proc. VLDB Endow. 9 (2016) 948–959. URL: https://doi.org/10.14778/2994509.2994514. doi:10.14778/2994509.2994514.
[11] F. Neutatz, B. Chen, Z. Abedjan, E. Wu, From Cleaning before ML to Cleaning for ML, IEEE Data Eng. Bull. 44 (2021) 24–41.
[12] S. Krishnan, J. Wang, E. Wu, M. J. Franklin, K. Goldberg, ActiveClean: interactive data cleaning for statistical modeling, Proceedings of the VLDB Endowment 9 (2016) 948–959. URL: https://dl.acm.org/doi/10.14778/2994509.2994514. doi:10.14778/2994509.2994514.
[13] M. Mazumder, C. Banbury, X. Yao, B. Karlaš, et al., DataPerf: Benchmarks for Data-Centric AI Development, 2023. URL: http://arxiv.org/abs/2207.10062. doi:10.48550/arXiv.2207.10062. arXiv:2207.10062 [cs].
[14] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling Laws for Neural Language Models, 2020.
URL: http://arxiv.org/abs/2001.08361. doi:10.48550/arXiv.2001.08361. arXiv:2001.08361 [cs, stat].
[15] A. Abbas, K. Tirumala, D. Simig, S. Ganguli, A. S. Morcos, SemDeDup: Data-efficient learning at web-scale through semantic deduplication, 2023. URL: http://arxiv.org/abs/2303.09540. arXiv:2303.09540 [cs].
[16] B. Sorscher, R. Geirhos, S. Shekhar, S. Ganguli, A. Morcos, Beyond neural scaling laws: beating power law scaling via data pruning, Advances in Neural Information Processing Systems 35 (2022) 19523–19536. URL: https://proceedings.neurips.cc/paper_files/paper/2022/hash/7b75da9b61eda40fa35453ee5d077df6-Abstract-Conference.html.
[17] L. Moreau, P. Missier, K. Belhajjame, R. B'Far, J. Cheney, S. Coppens, S. Cresswell, Y. Gil, P. Groth, G. Klyne, T. Lebo, J. McCusker, S. Miles, J. Myers, S. Sahoo, C. Tilmes, PROV-DM: The PROV Data Model, Technical Report, World Wide Web Consortium, 2012. URL: http://www.w3.org/TR/prov-dm/.
[18] A. Chapman, P. Missier, L. Lauro, R. Torlone, DPDS: Assisting Data Science with Data Provenance, PVLDB 15 (2022) 3614–3617. URL: https://vldb.org/pvldb/vol15/p3614-torlone.pdf. doi:10.14778/3554821.3554857.