11th International Workshop on Science Gateways (IWSG 2019), 12-14 June 2019



Towards Traceability in Data Ecosystems using a Bill of Materials Model

Iain Barclay, Alun Preece, Ian Taylor
Crime and Security Research Institute, Cardiff University, Cardiff, UK
Email: BarclayIS@cardiff.ac.uk

Dinesh Verma
IBM TJ Watson Research Center, 1110 Kitchawan Road, Yorktown Heights, NY 10598, USA



Abstract—Researchers and scientists use aggregations of data from a diverse combination of sources, including partners, open data providers and commercial data suppliers. As the complexity of such data ecosystems increases, and in turn leads to the generation of new reusable assets, it becomes ever more difficult to track data usage, and to maintain a clear view on where data in a system has originated and makes onward contributions. Reliable traceability on data usage is needed for accountability, both in demonstrating the right to use data, and in having assurance that the data is as it is claimed to be. Society is demanding more accountability in data-driven and artificial intelligence systems deployed and used commercially and in the public sector. This paper introduces the conceptual design of a model for data traceability based on a Bill of Materials scheme, widely used for supply chain traceability in manufacturing industries, and presents details of the architecture and implementation of a gateway built upon the model. Use of the gateway is illustrated through a case study, which demonstrates how data and artifacts used in an experiment would be defined and instantiated to achieve the desired traceability goals, and how blockchain technology can facilitate accurate recording of transactions between contributors.

I. INTRODUCTION

Scientists and researchers increasingly assemble and use rich data ecosystems [1] in their experimentation. As these ecosystems expand in capability and leverage data from a diverse combination of internal sources, partners and third-party data suppliers, it is becoming necessary for users and curators of data to have reliable traceability on its origins and uses. This is important for accountability [2], such as proving ownership or legitimate usage of the source data, as well as for identifying quality or supply problems, alerting users, and seeking redress when things go awry.

Using a gateway to provide traceability on data used within experiments offers mechanisms for demonstrating where data, and assets derived from the data, are used, as well as aiding understanding of where data contributing to a system has come from. By coupling the traceability trail with distributed ledger or blockchain technology, it is possible to provide a distributed store that records digital data or events in a way that makes them immutable, non-repudiable and identifiable, thereby leading to a trustworthy record of fact.

Research into manufacturing, agricultural and food industries, where the need for traceability of products and their component parts is well-established, has informed the design and development of a gateway which enables data ecosystems to be described in terms of sub-assemblies of their constituent data components and supporting artifacts, in a Bill of Materials (BoM) format. Artifacts in a BoM might include data licenses, software descriptions and versions, and lists of staff or other human resources involved in producing the outputs. When the system described by the BoM is run, the BoM is instantiated, queried for the locations of data sources, and populated with any dynamic values for the data or artifacts of each run, generating a Bill of Lots (BoL). The BoM and BoL together provide a record of the static and dynamic elements of the system for an invocation at a particular point in time. This allows for later inspection of the data and the supporting environment, and provides a means for scientists to trace data and artifact usage through and across experiments - for example, identifying all uses of a particular IoT sensor, all runs using a particular version of a machine learning model, or all uses of data generated by a particular researcher.

A pilot gateway, dataBoM, has been developed to allow scientists to describe a data ecosystem as a Bill of Materials, containing pipelines of assemblies detailing sets of data sources and artifacts, and to instantiate the BoM into a BoL for each run of an experiment. The dataBoM gateway has been developed using GraphQL [3], which facilitates the rapid development of cross-platform applications and web services which scientists can use to generate and query BoMs, and to populate and store BoL records. Integration of the dataBoM gateway with blockchain or distributed ledger technologies can provide dynamic behaviour in data acquisition, as well as a permanent audit trail of both the data used and its supporting environment.

The remainder of this paper is structured as follows: Section II discusses the context in which the BoM model for data ecosystem traceability has been derived; the architecture and implementation of the dataBoM gateway is discussed in Section III, with Section IV describing a case study illustrating how a scientist could use the pilot gateway to conduct research using data from several sources to identify traffic congestion. Section V considers areas for future work, and Section VI concludes the paper.

Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).


II. REQUIREMENTS

In manufacturing industries it has been standard practice since the late twentieth century to track a product through its life-cycle, from its origin as raw materials, through component assembly, to finished goods in a store, with the relationships and information flows between suppliers and customers recorded and tracked using supply chain management (SCM) processes [4]. In agri-food industries, traceability through the supply chain is necessary to give visibility from a product on a supermarket shelf back to the farm and to the batch of foodstuff, as well as to other products in which the same batch has been used.

Describing data ecosystems in terms of the data supply chain provides a mechanism to identify data sources and the assets which contribute to the development of the data components, or which are produced as the results of intermediate processes. As new assets are created and used in other systems - perhaps by other parties - the supply chain mapping can be extended to give traceability on the extended data ecosystem.

A definition for traceability is provided by Opara [5], as "the collection, documentation, maintenance, and application of information related to all processes in the supply chain in a manner that provides guarantee to the consumer and other stakeholders on the origin, location and life history of a product as well as assisting in crises management in the event of a safety and quality breach."

Further helpful terminology is provided by Kelepouris, Pramatari and Doukidis [6] when discussing the traceability of information in terms of the direction of analysis of the supply chain. Petroff and Hill [7] define tracing as the ability to work backwards from any point in the supply chain to find the origin of a product (i.e., 'where-from' relationships) and tracking as the ability to work forwards, finding products made up of given constituents (i.e., 'where-used' relationships). Thus, an effective traceability solution should support both tracing and tracking; providing effectiveness in one direction does not necessarily deliver effectiveness in the other [6].

Jansen-Vullers, van Dorp, and Beulens [8] and van Dorp [9] discuss the composition of products in terms of a Bill of Materials (BoM) and a Bill of Lots (BoL). The BoM is the list of types of component needed to make a finished item of a certain type, whereas the BoL lists the actual components used to create an instance of the item. In other words, the BoM might specify a sub-assembly to be used, and the BoL would identify the exact batch from which the sub-assembly used in building a particular instance of a product was drawn. Furthermore, a BoM can be multi-level, wherein components can be used to create sub-assemblies which are subsequently used in several different product types.

The notion of using a BoM to identify and record component parts of assets in an IT context is already established, with the US Department of Commerce working on the NTIA Software Component Transparency initiative (https://www.ntia.doc.gov/SoftwareTransparency) to provide a standardised Software BoM format to detail the sub-components in software applications. The intent is to give visibility on the underlying components used in software applications and processes, such that vulnerable out-of-date modules can easily be identified and replaced. Tools such as CycloneDX (https://cyclonedx.org), SPDX (https://spdx.org), and SWID (https://www.iso.org/standard/65666.html) are defining formats for identifying and tracking such sub-components.

As well as the data used, and the efforts made through standards [10] and research to secure its provenance in workflows [11], there are many supporting assets which can be considered useful supplementary information when recording the characteristics of a data ecosystem, and which Singh, Cobbe and Norval [12] have described as providing decision provenance. Hind et al. describe a document based on a Supplier's Declaration of Conformity [13] as a suitable vehicle for providing an overview of an AI system, detailing the purpose, performance, safety, security, and provenance characteristics of the overall system. At the component level, Gebru et al. explore the benefits of developing and maintaining Datasheets for Datasets [14], which replicate the specification documents that often accompany physical components, and Mitchell et al. propose a document format for AI model specifications and benchmarks [15]. Schelter, Böse, Kirschnick, Klein and Seufert [16] describe a system to automatically document the parameters of machine learning experiments by extracting and archiving the metadata from the model generation process, which would be appropriate information to store alongside the data used in a system.

Members of the scientific community are familiar with the use of workflow systems, such as Node-RED [17] and Pegasus WMS [18], to define and execute the processes for their experiments. The BoM model proposed herein is intended to augment a workflow by providing a means to add contextual traceability as the workflow progresses, such that it can be archived, and the supporting conditions retrieved and inspected later. Workflow blocks typically describe a job or a service, and do not allow other contributing artifacts to be described. The proposed BoM model describes a rich set of information per node, which can better represent the data supply chain and the associated documents and payloads that are contained at each stage. By maintaining a BoM model alongside a workflow, researchers can populate and capture a record of the data for each run, as well as the supporting artifacts for each run, giving traceability of the data and the circumstances in which it was obtained and used. In practical terms, a function could be written to populate the BoL with dynamic data, and invoked at appropriate points in the workflow, as sketched below.
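As a concrete illustration of such a checkpoint function, the following minimal, in-memory sketch shows how application code might append dynamic values to the shadow data items of a BoL as a workflow progresses. The helper name record_lot and the record structure are assumptions for illustration; in the dataBoM gateway this functionality is exposed through a GraphQL API, as described in Section III.

from datetime import datetime, timezone

def record_lot(bol: dict, item_name: str, **dynamic_values) -> None:
    """Append dynamic values to the shadow data item that the BoL
    keeps for the named data source or artifact."""
    shadow = bol.setdefault("shadow", {}).setdefault(item_name, [])
    shadow.append({
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        **dynamic_values,
    })

# Invoked at appropriate checkpoints in a workflow (values illustrative):
bol = {"bom_id": "example-bom", "run_id": 1}
record_lot(bol, "Input Data", url="https://example.org/batch-17.csv")
record_lot(bol, "Result", value=0.73)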






Distributed ledger technologies, such as those afforded by blockchain platforms [19], [20], provide a means of recording information and transactions between parties who do not have formal trust relationships [21], such as inter-organisational or commercial data sharing entities. The design of a blockchain system ensures that data, once written, cannot be changed, providing a level of immutability and non-repudiation which is well suited to keeping an auditable record of events and transactions which occur between parties. Furthermore, the use of a public blockchain platform, such as the Ethereum Project [20], provides an archival resource which remains in existence long after the resources of a project have been retired. State-of-the-art blockchain platforms, including the Ethereum Project, allow for the deployment of so-called smart contracts, which can be considered to be "autonomous agents stored in the blockchain, encoded as part of a creation transaction that introduces a contract to the blockchain" [22]. Such smart contracts enable blockchain platforms to facilitate non-repudiable dynamic behaviours alongside their immutable storage capabilities.

III. A DATA TRACEABILITY GATEWAY

In this section the design and implementation of dataBoM, a gateway capable of supporting levels of tracking and tracing appropriate for providing traceability in multi-party decentralised data ecosystems, is described. The solution uses a model based on a Bill of Materials scheme, where data and supporting materials are treated as constituent components of a deployed system, which is instantiated into a unique Bill of Lots each time the deployment is run.

A. Conceptual Model

The dataBoM gateway employs a BoM model, such that each experiment utilising the system is described in terms of its data supply chain. The BoM consists of a collection of assemblies, with each assembly being an aggregation of contributing input components and an output component.

An assembly will typically have at least one data input, and can produce new data as its output. Data output from one assembly can be used as a data input in a subsequent assembly within the current BoM, or used in other systems by being referenced in their BoM. To reflect this, data inputs and outputs are defined as data sources.

Assemblies can also contain artifacts, which are pertinent software components, ML models, and documentation such as licenses, staff lists, policy documentation, etc. Including artifacts in assemblies in the BoM definition ensures that each BoL retains a full record of its heritage and dependencies.

An assembly can produce a new artifact as its output; for example, an assembly which described the training of an AI model would produce the trained model as its output. The trained model would then be considered an artifact, which could be used as an input to other assemblies.

[Fig. 1. Assemblies can be chained in a BoM]

Figure 1 shows two assemblies that are chained to produce a data component (Data 1') and an artifact (Artifact 2') as outputs. Such a BoM could be used by a scientist to describe a simple AI model training process containing two assemblies. Assembly 1 represents the data labelling process, and Assembly 2 the model training process. Data 1 is an input data source, which could be training data. Artifact 1 might be a roster of the staff employed to label the data, and the central data source, Data 1' (which, as illustrated, is both the output of the data labelling assembly and the input to the model training assembly) could be a labelled data set. In the second assembly, Artifact 2 would be relevant to the model training process, for example the parameters used in training. The output artifact, Artifact 2', would be the trained model. Note that both the intermediate output, Data 1', and the final output, Artifact 2', could be further used as inputs by other processes and specified as inputs to subsequent assemblies.

The BoM defines a map of the structure of the system by providing a record of the connections between the assemblies, and provides a framework to enumerate a system's data sources and artifacts, as well as any static data that applies to the contained data sources or artifacts. This static information could include a location for access to the data, for example a Digital Object Identifier (DOI) or an API URL, and metadata specifying acceptable data threshold levels or response requirements for active quality of service (QoS) monitoring.

Each time the process described by the BoM is run, the application code for the process will instantiate a new BoL for the given BoM. In order to provide on-going traceability, a shadow data item is created for each data source and artifact in the BoM when it is instantiated in a BoL. The shadow items in the BoL are used to maintain a record of the dynamic elements of each run.

By storing and then later referencing the assemblies, data sources and artifacts in a BoM, and all the instantiations of the BoM in each BoL, along with the shadow data, it is possible to derive an overview of the history of the data lifecycle of the system, such that any item can be traced back to its origins or tracked forward to find all its consumers.
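The tracing ('where-from') and tracking ('where-used') operations of Section II map naturally onto this assembly structure. The following sketch uses the item names from Figure 1 and treats each assembly simply as a mapping from inputs to outputs; it illustrates the idea rather than the gateway's implementation.

# Assemblies modelled as edges from input items to output items.
assemblies = [
    {"name": "Data Labelling", "inputs": ["Data 1", "Artifact 1"], "outputs": ["Data 1'"]},
    {"name": "Model Training", "inputs": ["Data 1'", "Artifact 2"], "outputs": ["Artifact 2'"]},
]

def trace_back(item: str) -> set:
    """Work backwards to find every item the given item derives from."""
    origins = set()
    for a in assemblies:
        if item in a["outputs"]:
            for src in a["inputs"]:
                origins.add(src)
                origins |= trace_back(src)
    return origins

def track_forward(item: str) -> set:
    """Work forwards to find every item the given item contributes to."""
    consumers = set()
    for a in assemblies:
        if item in a["inputs"]:
            for out in a["outputs"]:
                consumers.add(out)
                consumers |= track_forward(out)
    return consumers

print(trace_back("Artifact 2'"))  # {"Data 1'", "Artifact 2", "Data 1", "Artifact 1"}
print(track_forward("Data 1"))    # {"Data 1'", "Artifact 2'"}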





One of the roles for the data source elements specified in the BoM is to store the means to access the data when the experiment is run. In many cases this will be via a URL parameterised dynamically at runtime - the static parts of the URL could be stored in the data source as part of the BoM, with the dynamic parameters and the results stored in the shadow data item of the BoL. The intent of the design is that there is flexibility of type, so any metadata could be stored in the BoM and retrieved and interpreted in the application process. Uses of this metadata could include storing encrypted information, which is decrypted and subsequently used by the client application. Further, the metadata could include information to initiate an asynchronous data request, together with an endpoint to which the data should be delivered. The intent is to provide a flexible storage slot for static data about the data, which can be retrieved, interpreted and used by the client application. Experimentally, it has been possible to use the dataBoM gateway pilot to store and retrieve an encoded blockchain contract address and function interface from a data source, and to use this information to initiate a blockchain transaction from the client application to retrieve data at runtime. Such a transaction could be used to provide immutable proof of a data request, or to give gateway users a means to access third-party data on a pay-per-use basis, which is discussed further in Section VI.
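To make this blockchain-backed data access more tangible, the sketch below shows how client code might use a contract address and function interface retrieved from a data source to make a read call at runtime. It is a hedged sketch assuming the web3.py library (v6 naming); the metadata field names, node URL, contract address and latestReading() function are all hypothetical.

import json
from web3 import Web3

# Metadata as it might be retrieved from a data source via the gateway;
# the field names and values here are illustrative only.
metadata = {
    "contract_address": "0x0000000000000000000000000000000000000000",
    "abi": '[{"name": "latestReading", "type": "function", "inputs": [],'
           ' "outputs": [{"name": "", "type": "uint256"}],'
           ' "stateMutability": "view"}]',
}

w3 = Web3(Web3.HTTPProvider("https://ethereum-node.example"))  # hypothetical node
contract = w3.eth.contract(
    address=Web3.to_checksum_address(metadata["contract_address"]),
    abi=json.loads(metadata["abi"]),
)
reading = contract.functions.latestReading().call()  # read-only call, no transaction

A read-only call leaves no on-chain record; to obtain the immutable proof of a data request described above, the client would instead submit a state-changing transaction from a funded account (via .transact() in web3.py).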
                                                                      the data needed. Furthermore, the gateway’s API can be
B. The dataBoM Gateway

The dataBoM gateway provides a working implementation of the conceptual data ecosystem BoM model [23] and enables researchers to declare BoMs to describe the data components of their experiments, and to instantiate BoLs which preserve contextual records for each run, providing traceability.

[Fig. 2. The dataBoM Pilot Gateway]

The architecture of the dataBoM gateway is shown in Figure 2. The gateway is to be offered as a web service, with interactions between the gateway and researchers conducted through a web interface or via an API.

The pilot version of the dataBoM gateway stores data in a MongoDB (https://www.mongodb.com) database, such that queries can be written to provide traceability on data sourcing and data use for any BoM. Further development of the gateway will explore off-loading the archival of the BoMs and BoLs to commons-based decentralised storage, such as IPFS [24], with indexing secured on a public ledger or blockchain. This will serve to preserve records beyond the lifetime of the gateway, and provide an immutable record of events, suitable for later audit or inspection.

The dataBoM gateway is initially hosted on an intranet, and it is envisaged that future versions of the gateway will be migrated to public-facing web services, or serverless [25] environments, such as AWS AppSync (https://aws.amazon.com/appsync/), to provide a robust and reliable service.

The gateway server is written in Node.js (https://nodejs.org/en/), using Apollo GraphQL Server (https://www.apollographql.com/docs/apollo-server/), which acts as an abstraction layer above the gateway's MongoDB database store.

GraphQL allows developers to specify a data schema, and to define queries and mutations, which are interfaces that allow reading and writing of the data, respectively. The GraphQL data schema, queries and mutations are public interfaces, which hide the details of the underlying data storage from users of the interfaces. The server's data store does not have to match the GraphQL schema, as the server code which implements the queries and mutations performs the mapping to read and write the correct data to its database. GraphQL is intended to provide an efficient transfer of data between client and server, as queries can be written to request only the data needed. Furthermore, the gateway's API can be enhanced by extending the queries and mutations offered, without implications for existing users.

The GraphQL interface is self-documenting, and can be queried by client application developers to find out the data structures, queries and mutations available to them. The dataBoM gateway offers access to its GraphQL server via an https endpoint for API access, as illustrated below.
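For illustration, a client might query a BoM over the https endpoint as follows. This is a hedged sketch: the endpoint URL and the bom(id:) query with the fields shown are assumptions, since the actual schema is published by the gateway itself and is discoverable through GraphQL introspection.

import requests

GATEWAY = "https://databom.example/graphql"  # hypothetical endpoint

QUERY = """
query GetBom($id: ID!) {
  bom(id: $id) {            # request only the fields needed
    name
    assemblies {
      name
      inputData { name dataAccess }
    }
  }
}
"""

resp = requests.post(GATEWAY, json={"query": QUERY, "variables": {"id": "bom-123"}})
resp.raise_for_status()
bom = resp.json()["data"]["bom"]  # response shaped exactly as requested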
C. Integration with Client Applications

To take advantage of the traceability capabilities provided by the dataBoM gateway, scientists should use the supplied API to define a BoM for their experiments, detailing the assemblies, data sources and artifacts required in their processes, passing the desired parameters and retaining the identifiers which are returned by the API calls in order to chain entities together - for example, when creating a data source item, the identifier that is returned should be retained so that it can be used as a parameter when creating an assembly.

Once the BoM is defined, the researcher should instantiate the BoM whenever they run their experiment, and then use the API from their application code to query the experiment's BoM for static factors such as the locations of data assets, with any dynamic state arising during experimentation (e.g. data values) being written to the BoL via the API as the experiment progresses.

Use of the API requires the researcher to integrate a GraphQL client library with their application code or workflow scripts; support is available for popular web and mobile platforms, including Python, Node.js, iOS and Android.

The steps in the integration would typically include the following (a sketch of the sequence follows the list):

• Define data sources, artifacts and assemblies in the BoM
• Use the BoM's ID to instantiate a new BoL for a new run
• Access data source metadata for the data location or endpoint
• On receipt of data, populate the data source's shadow in the BoL

In this way, the BoM and the BoL combine to generate an evidence trail of the dynamic data values and the static components of the data and supporting artifacts which contributed to each run of an experiment.
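A minimal end-to-end sketch of these steps is given below. The mutation and query names (createDataSource, createAssembly, instantiateBom, updateShadow) are hypothetical stand-ins for the operations published in the gateway's GraphQL schema.

import requests

GATEWAY = "https://databom.example/graphql"  # hypothetical endpoint

def gql(operation: str, variables: dict) -> dict:
    """POST a GraphQL operation to the gateway and return its data."""
    resp = requests.post(GATEWAY, json={"query": operation, "variables": variables})
    resp.raise_for_status()
    return resp.json()["data"]

# 1. Define data sources, artifacts and assemblies in the BoM,
#    retaining the returned identifiers to chain entities together.
ds = gql('mutation($n: String!) { createDataSource(name: $n) { id } }',
         {"n": "Traffic Scene"})["createDataSource"]
asm = gql('mutation($d: ID!) { createAssembly(inputData: [$d]) { id } }',
          {"d": ds["id"]})["createAssembly"]

# 2. Use the BoM's ID to instantiate a new BoL for this run.
bol = gql('mutation($b: ID!) { instantiateBom(bomId: $b) { id } }',
          {"b": "bom-123"})["instantiateBom"]

# 3. Access the data source metadata for the data location or endpoint.
loc = gql('query($d: ID!) { dataSource(id: $d) { dataAccess } }',
          {"d": ds["id"]})["dataSource"]["dataAccess"]

# 4. On receipt of the data, populate the data source's shadow in the BoL.
gql('mutation($b: ID!, $d: ID!, $v: String!) '
    '{ updateShadow(bolId: $b, dataSourceId: $d, value: $v) { id } }',
    {"b": bol["id"], "d": ds["id"], "v": loc})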






Section IV, below, describes a case study implementation, to provide further insight into and explanation of dataBoM integration and usage.

IV. CASE STUDY

By way of illustration of the use of the dataBoM gateway, consider a simple software application which serves to provide a 'traffic congestion score' for a fixed location, e.g., Hyde Park Corner, depending on how much traffic the application determines is currently at the location. This simple process has a single assembly, Traffic Scene Analysis, an input data source Location Photo, an ML model artifact Congestion Model and an output data source Congestion Score (Figure 3).

[Fig. 3. The components of a simple traffic congestion system]

In defining the BoM for the Hyde Park Corner (HPC) congestion rating process, the scientist should give each element a name and an optional description, and declare static elements, such as the URL to be used to retrieve a live photo from the location of interest. Encoding this simple single-assembly process as a BoM through the gateway's API gives a data model as shown in Listing 1, which is the result of a GraphQL query on the BoM's entry.

"bom": {
    "name": "HPC Congestion",
    "description": "Determine congestion levels on Hyde Park Corner",
    "assemblies": [
      {
        "name": "Traffic Scene Analysis",
        "description": "Determine congestion at Hyde Park Corner",
        "inputData": [
          {
            "name": "Traffic Scene",
            "dataAccess": "https://xyz.com/00001.06514.jpg"
          }
        ],
        "outputData": [
          {
            "name": "Result"
          }
        ],
        "inputArtifacts": [
          {
            "name": "Congestion Model"
          }
        ]
      }
    ]
}

Listing 1. Result of a GraphQL query on the HPC Congestion BoM

In the application code for the experiment, the BoM should be instantiated via its identifier to generate a new BoL for the run. As the code runs, it should refer to its BoM (via the instantiated BoL) to get locations for the data it needs to access, and write any dynamic information to its BoL for permanent archival.

In the HPC congestion scoring example, the data source for the traffic scene holds a static URL for a live camera. The scientist's code would retrieve this information through the dataBoM API and access the photo, and (if desired) store a permanent copy of the photo in its own archives, writing a reference to the location of the archived copy to the shadow data item, such that it will be saved as part of the archival of the BoL. The resultant congestion score should also be written to the BoL, by referencing the appropriate data source item.

Thus, each data source and artifact in every BoL would have any dynamic values recorded and stored in a database as a persistent record of the run, so that each of the assemblies in the BoL would have traceable input and output data values which could be accessed at a later date.
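The run described above might look as follows in the scientist's application code. This sketch assumes hypothetical helpers standing in for the gateway mutations of Section III-C; score_congestion() is a placeholder for applying the Congestion Model artifact, and the archive location is illustrative.

import requests

def write_shadow(bol_id: str, item: str, **values) -> None:
    ...  # issue the gateway mutation that records values in the BoL shadow item

def score_congestion(image_bytes: bytes) -> float:
    ...  # apply the Congestion Model artifact to the photo

bol_id = "bol-001"                              # returned when the BoM is instantiated
camera_url = "https://xyz.com/00001.06514.jpg"  # static URL read from the BoM

photo = requests.get(camera_url).content        # retrieve the live photo
archive_ref = "s3://my-archive/hpc/run-001.jpg" # optional permanent copy (illustrative)
write_shadow(bol_id, "Traffic Scene", archived_copy=archive_ref)

score = score_congestion(photo)
write_shadow(bol_id, "Result", congestion_score=score)  # persisted for later audit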





V. DISCUSSION

There are a number of interesting directions in which future development of the dataBoM gateway could be taken. Interaction with the gateway is currently provided by a GraphQL API, which provides good integration with the application code at runtime; however, initial definition of the BoM and its elements would be more intuitive if it were facilitated through a visual UI. Thus, the BoM could be authored using a visual interface via a web browser, with the runtime invocation and interaction with the BoL remaining an API-driven task. There is a similar opportunity to add a visual interface to the overview of each experiment logged by the gateway. Such an interface would provide a means to explore the composition of the data and artifact elements of each experiment, and help to satisfy the traceability goals of the gateway, by providing a convenient means of exploring the nodes in the BoM and each BoL.

Integration of the dataBoM gateway with the workflow manager systems that are popular in the research community will facilitate smoother integration of the gateway into experiment workflows, and help to foster acceptance of the benefits of the BoM model in providing traceability in scientific data ecosystems.

There is scope to extend and deepen the integration of the gateway and its BoM and BoL models with blockchain technologies, such as the programmable smart contracts provided by the Ethereum blockchain platform. By associating smart contracts with the data sources and artifacts from the BoM model, novel dynamic behaviour in data ecosystems can be explored. Such dynamic behaviours might include runtime selection of the most appropriate data source sets, along with automatic remuneration and sanctioning, based on dynamic measures of data quality. Further development of the dataBoM gateway could provide a means by which scientists are able to share data and artifacts with their peers, and a blockchain platform might underpin this. Related to blockchain integration is the motivation to explore traceability on the human side of the experimental process, using Decentralised Identifiers (DIDs, https://w3c-ccg.github.io/did-spec/) to associate researchers or crowd-workers with components of the system, and to provide a means to trace their activity and the data and artifacts they are associated with.

VI. CONCLUSION

The dataBoM gateway provides scientists and developers with a means to map the overall structure of the components that make up the complex data ecosystems used in their experiments. By going beyond the data, and considering other contributing factors such as the software and hardware which produce or manage the data, the licenses which govern the use and sharing of the data, and the policies which contributed to the generation of the data, the development of a BoM for each system provides a mechanism to archive the ecosystem for each experiment. Instantiating the BoM into a BoL each time the system runs augments the static parts list with a dynamic and traceable view into every invocation of the system, such that the data inputs, data outputs and any artifacts which are used or produced by the system can be archived, readily identified and traced back to their source. Similarly, future users of produced data and artifacts, such as models, can be identified, which could prove to be very important if errors are later found and are notifiable. Storing metadata capable of identifying smart contracts on the blockchain further enables immutable recording of the action and timing of requests for data provision, along with the potential for encoding quality of service requirements and providing automatic payment for services.

ACKNOWLEDGMENT

This research was sponsored by the U.S. Army Research Laboratory and the UK Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the UK Ministry of Defence or the UK Government. The U.S. and UK Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES

[1] M. I. S. Oliveira, G. d. F. B. Lima, and B. F. Lóscio, "Investigations into data ecosystems: a systematic mapping study," Knowledge and Information Systems, pp. 1-42, 2019.
[2] N. Diakopoulos, "Accountability in algorithmic decision making," Communications of the ACM, vol. 59, no. 2, pp. 56-62, 2016.
[3] L. Byron, "GraphQL: A data query language." [Online]. Available: https://code.facebook.com/posts/1691455094417024/graphql-a-data-query-language
[4] D. M. Lambert, M. C. Cooper, and J. D. Pagh, "Supply chain management: implementation issues and research opportunities," The International Journal of Logistics Management, vol. 9, no. 2, pp. 1-20, 1998.
[5] L. U. Opara, "Traceability in agriculture and food supply chain: a review of basic concepts, technological implications, and future prospects," Journal of Food Agriculture and Environment, vol. 1, pp. 101-106, 2003.
[6] T. Kelepouris, K. Pramatari, and G. Doukidis, "RFID-enabled traceability in the food supply chain," Industrial Management & Data Systems, vol. 107, no. 2, pp. 183-200, 2007.
[7] J. N. Petroff and A. V. Hill, "A framework for the design of lot-tracing systems for the 1990s," Production and Inventory Management Journal, vol. 32, no. 2, p. 55, 1991.
[8] M. H. Jansen-Vullers, C. A. van Dorp, and A. J. Beulens, "Managing traceability information in manufacture," International Journal of Information Management, vol. 23, no. 5, pp. 395-413, 2003.
[9] C. van Dorp, "A traceability application based on gozinto graphs," in Proceedings of EFITA 2003 Conference, 2003, pp. 280-285.
[10] P. Missier, K. Belhajjame, and J. Cheney, "The W3C PROV family of specifications for modelling provenance metadata," in Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013, pp. 773-776.
[11] S. B. Davidson and J. Freire, "Provenance and scientific workflows: challenges and opportunities," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008, pp. 1345-1350.
[12] J. Singh, J. Cobbe, and C. Norval, "Decision provenance: Harnessing data flow for accountable systems," IEEE Access, vol. 7, pp. 6562-6574, 2019.
[13] M. Hind, S. Mehta, A. Mojsilovic, R. Nair, K. N. Ramamurthy, A. Olteanu, and K. R. Varshney, "Increasing trust in AI services through supplier's declarations of conformity," arXiv preprint arXiv:1808.07261, 2018. [Online]. Available: https://arxiv.org/pdf/1808.07261.pdf
[14] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford, "Datasheets for datasets," arXiv preprint arXiv:1803.09010, 2018. [Online]. Available: https://arxiv.org/abs/1803.09010
[15] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, "Model cards for model reporting," in Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM, 2019, pp. 220-229.
[16] S. Schelter, J.-H. Böse, J. Kirschnick, T. Klein, and S. Seufert, "Automatically tracking metadata and provenance of machine learning experiments," in Machine Learning Systems Workshop at NIPS, 2017.
[17] "Node-RED: Flow-based programming for the Internet of Things." [Online]. Available: https://nodered.org/
[18] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. F. Da Silva, M. Livny et al., "Pegasus, a workflow management system for science automation," Future Generation Computer Systems, vol. 46, pp. 17-35, 2015.
[19] S. Nakamoto, "Bitcoin: A peer-to-peer electronic cash system," 2008.
[20] G. Wood, "Ethereum: A secure decentralised generalised transaction ledger," Ethereum Project Yellow Paper, vol. 151, pp. 1-32, 2014.
[21] D. Tapscott and A. Tapscott, "How blockchain will change organizations," MIT Sloan Management Review, vol. 58, no. 2, p. 10, 2017.
[22] L. Luu, D.-H. Chu, H. Olickel, P. Saxena, and A. Hobor, "Making smart contracts smarter," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 254-269.
[23] I. Barclay, A. Preece, I. Taylor, and D. Verma, "A conceptual architecture for contractual data sharing in a decentralised environment," arXiv preprint arXiv:1904.03045, 2019.
[24] J. Benet, "IPFS - content addressed, versioned, P2P file system," arXiv preprint arXiv:1407.3561, 2014.
[25] E. Jonas, J. Schleier-Smith, V. Sreekanti, C.-C. Tsai, A. Khandelwal, Q. Pu, V. Shankar, J. Carreira, K. Krauth, N. Yadwadkar et al., "Cloud programming simplified: A Berkeley view on serverless computing," arXiv preprint arXiv:1902.03383, 2019.




