Dataset and Feature-Level Provenance Integration for Spatial Datasets

Nicholas J Car
Geoscience Australia, Symonston, ACT, Australia; Email: nicholas.car@ga.gov.au

SUMMARY

Large, multi-agency projects such as the Foundational Spatial Data Framework are interested in capturing the provenance of their spatial datasets as they are processed and combined to form products. Additionally, work is underway at the CRC for Spatial Information and elsewhere to track the provenance of the production of individual elements (features) within spatial datasets. How can we reconcile these provenance situations, given the different levels of granularity? Can we relate the provenance from lower-level systems to higher levels? Can we use common tools and methodologies? This paper and talk present provenance modelling work that has taken place at Geoscience Australia and CSIRO to solve these issues. The differing levels of granularity can be related; however, for interoperability, a standard must be used, and we have used PROV.

Keywords: spatial dataset, provenance, multi-granularity, spatial data infrastructure, transparency

INTRODUCTION

Transparency of process and some measure of reproducibility are requirements for information that is to engender a high degree of trust among its users. A system-independent international standard known as PROV [1] now exists to generically represent the provenance of things (i.e. anything that was produced) and can be used to describe the production of national spatial datasets. The use of such standards ensures the interoperability of provenance descriptions across systems and the longevity of the understanding of such descriptions. A presentation by this author at a previous Locate conference [2] demonstrated the standardised provenance representation of a single map’s production, down to the ‘layers’ level, using a formulation of PROV, PROV-O.
More recent work by the Cooperative Research Centre for Spatial Information (CRC-SI) has represented the individual geoprocessing toolkit actions undertaken to produce elements within spatial datasets, using an extension to PROV-O that they made called GeoPROV [3]. Additionally, the Foundational Spatial Data Framework (FSDF) project¹ intends to use PROV to represent the overall information flow from base data to FSDF data products.

[Figure 1 diagram. Panel A: the Entity, Activity and Agent classes linked by used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith and actedOnBehalfOf relationships. Panel B: nodes labelled Raster, Vector mask, Config, Clip, Clipped Raster, ArcGIS and Person X.]

Figure 1. A: The basic PROV-O classes and their relationships. B: A simple implementation of PROV-O describing the clipping of a raster image using ArcGIS².

¹ http://www.anzlic.gov.au/foundation_spatial_data_framework
² https://esriaustralia.com.au/products-arcgis-software

Proc. of the 3rd Annual Conference of Research@Locate

These three bodies of work all use PROV at different granularities and for slightly different purposes; however, all three intend to enhance the transparency of the production of spatial products. In this paper we demonstrate how standardized provenance information recorded by different processes at different levels of granularity can be conceptually combined. Such combination is necessary in order to provide point-of-truth provenance information for data products.

USING THE PROV DATA MODEL

PROV-O provenance depiction

The PROV Data Model [1] consists of three main classes of concepts: Entities (things), Activities (events that act on Entities) and Agents (people or systems that trigger Activities). A diagram of these classes and their basic relationships is given in Figure 1A. An implementation of PROV-O for a simple geoprocessing task, exhibiting a granularity similar to the examples in [2], is given in Figure 1B.
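As a rough illustration of the Entity/Activity/Agent structure, a clipping scenario of the kind shown in Figure 1B might be written down as subject-predicate-object statements. The Python sketch below uses plain tuples rather than an RDF library. The ex: identifiers and the particular set of links are assumptions made for illustration and are not read off the figure; only the prov: property names come from PROV-O.

```python
# A Figure 1B-style scenario as (subject, predicate, object) triples.
# All ex: identifiers and the choice of links are illustrative assumptions.
TRIPLES = {
    ("ex:clip", "rdf:type", "prov:Activity"),
    ("ex:raster", "rdf:type", "prov:Entity"),
    ("ex:vectorMask", "rdf:type", "prov:Entity"),
    ("ex:config", "rdf:type", "prov:Entity"),
    ("ex:clippedRaster", "rdf:type", "prov:Entity"),
    ("ex:arcgis", "rdf:type", "prov:Agent"),
    ("ex:personX", "rdf:type", "prov:Agent"),
    ("ex:clip", "prov:used", "ex:raster"),
    ("ex:clip", "prov:used", "ex:vectorMask"),
    ("ex:clip", "prov:used", "ex:config"),
    ("ex:clippedRaster", "prov:wasGeneratedBy", "ex:clip"),
    ("ex:clippedRaster", "prov:wasDerivedFrom", "ex:raster"),
    ("ex:clip", "prov:wasAssociatedWith", "ex:arcgis"),
    ("ex:arcgis", "prov:actedOnBehalfOf", "ex:personX"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def inputs_of(entity, triples=TRIPLES):
    """Entities used by the activity (or activities) that generated `entity`."""
    used = set()
    for activity in objects(entity, "prov:wasGeneratedBy", triples):
        used.update(objects(activity, "prov:used", triples))
    return sorted(used)


def responsible_agents(entity, triples=TRIPLES):
    """Agents associated with the generating activity, plus those they acted for."""
    agents = set()
    for activity in objects(entity, "prov:wasGeneratedBy", triples):
        for agent in objects(activity, "prov:wasAssociatedWith", triples):
            agents.add(agent)
            agents.update(objects(agent, "prov:actedOnBehalfOf", triples))
    return sorted(agents)
```

Under these assumptions, asking for the inputs of ex:clippedRaster walks one wasGeneratedBy link and then the used links of the Clip activity, which is the basic traversal that the rest of the paper builds on.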
PROV-O representations of provenance are graph-based in structure. Graphs³, by their nature and unlike relational databases, contain their schema within the data [4]. This allows for arbitrarily detailed and arbitrarily large representations of systems’ provenance, with the schema of the graph not limiting extensions of the information stored about items in it or of the links between items. Real limits on the information stored are imposed only by the ability of users to capture provenance information and of storage systems to physically cater for its management.

Additions to provenance graphs can be made by inserting new data into the graph, joining on appropriate prov:Activity⁴, prov:Entity or prov:Agent nodes. Since PROV-O uses a Resource Description Framework (RDF)⁵-based graph, each node’s identity is given as a URI⁶; thus one only needs to discover the URI for a node in order to make additions to the graph at that node.

[Figure 2 diagram. Panels A and B: dataset nodes joined by prov:wasDerivedFrom links. Panels C and D: datasets linked to Activities by used and wasGeneratedBy relationships.]

Figure 2. A: A high-level dataset provenance graph. B: Two datasets from A with intermediate datasets shown. C: A ‘black box’ Activity consuming 3 datasets and producing 1. D: The same datasets as C with the ‘black box’ broken down into two parts and an intermediate dataset shown.
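Because nodes are identified by URI, adding to a provenance graph can be thought of as a set union of statements: two reports that mention the same URI join at that node without any further bookkeeping. The sketch below illustrates this with plain Python sets standing in for an RDF store. The http://example.org/ namespace and the dataset names are hypothetical.

```python
BASE = "http://example.org/dataset/"  # hypothetical URI namespace
DERIVED = "prov:wasDerivedFrom"

# Coarse provenance as first reported: the target derives from an ancestor.
coarse = {
    (BASE + "target", DERIVED, BASE + "ancestor-1"),
}

# A later, more detailed report that names an intermediate dataset.
detail = {
    (BASE + "target", DERIVED, BASE + "intermediate-a"),
    (BASE + "intermediate-a", DERIVED, BASE + "ancestor-1"),
}


def merge(*graphs):
    """Union of triple sets; identical triples collapse, shared URIs join."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def nodes(graph):
    """All URIs appearing as a subject or an object in the graph."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


combined = merge(coarse, detail)
join_points = nodes(coarse) & nodes(detail)
```

Here join_points holds the two URIs that both reports share, which are the nodes at which the detailed graph attaches to the coarse one. The coarse statement is kept alongside the detailed path rather than being replaced by it.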
PROV-O used at different levels of granularity

Detail insertion

If a system records the provenance of a dataset at a high level, perhaps just recording which datasets are a target dataset’s ancestors (see Figure 2A), and this information is stored, additions can be made to it later that fill in intermediate steps (see Figure 2B). Additionally, if a process records high-level provenance noting an activity that has taken place and that consumes (prov:used) and produces (prov:generated) datasets (see Figure 2C), and that provenance is then stored, it too can be added to later by recording activities at a finer granularity together with any intermediate datasets (Figure 2D). The intermediate datasets do not necessarily have to be persisted: their existence may only be represented.

As well as increasing the granularity of provenance graphs by filling in details, detailed provenance graphs can have their granularity decreased by querying. The SPARQL query language⁷ is for RDF-based graph databases what SQL is for relational databases. It is able to skip over nodes in provenance graphs by using path-based, transitive queries. This skipping of intermediate nodes allows one to, for example, discover the ultimate ancestor of a dataset regardless of how many intermediate ancestors there are. For the scenario shown in Figure 2B, a path-based SPARQL query can tell the user that “Ancestor Dataset 1” is the ancestor of “Target Dataset”.

³ https://en.wikipedia.org/wiki/Graph_(abstract_data_type)
⁴ PROV-O objects are denoted prov:{CLASS_NAME}, e.g. a PROV Agent is denoted prov:Agent
⁵ https://en.wikipedia.org/wiki/Resource_Description_Framework
⁶ https://en.wikipedia.org/wiki/Uniform_Resource_Identifier

Dataset Subsetting

Representing dataset subsetting is important for linking provenance at different granularities, as subsets can be the tie-in points for systems reporting provenance at different scales.
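As an aside on the path-based, transitive queries described under Detail insertion above: in SPARQL 1.1 such a query would typically use a property path such as prov:wasDerivedFrom+, which matches one or more derivation links. The Python sketch below mimics that behaviour over plain tuples so that it can be run without a graph database. It is a stand-in for a SPARQL query rather than one, and the chain of ex: datasets is a hypothetical arrangement loosely modelled on Figure 2B.

```python
from collections import deque

DERIVED = "prov:wasDerivedFrom"

# Hypothetical derivation chain; dataset names are illustrative only.
GRAPH = {
    ("ex:target", DERIVED, "ex:datasetA"),
    ("ex:datasetA", DERIVED, "ex:datasetB"),
    ("ex:datasetB", DERIVED, "ex:ancestor1"),
}


def ancestors(node, graph=GRAPH, predicate=DERIVED):
    """All nodes reachable from `node` via one or more `predicate` links."""
    found = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for s, p, o in graph:
            if s == current and p == predicate and o not in found:
                found.add(o)
                queue.append(o)
    return found


def ultimate_ancestors(node, graph=GRAPH, predicate=DERIVED):
    """Ancestors of `node` that are not themselves derived from anything."""
    derived = {s for s, p, _ in graph if p == predicate}
    return {a for a in ancestors(node, graph, predicate) if a not in derived}
```

Calling ultimate_ancestors("ex:target") skips over the two intermediate datasets and returns only ex:ancestor1, which is the coarse-grained answer that a detailed graph can still provide.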
There is a range of options for recording the provenance of datasets that are subsets of other datasets. The PROV data model does not directly prescribe how one should represent the subsetting of datasets or how a part of a dataset is related to the larger whole: such instructions require far more detail than the generic PROV data model can deliver.

One method of representing detailed dataset subsetting is shown in Figure 3A. As per that diagram, a dataset subset is created via a prov:Activity subsetting procedure, with instructions as to how the subsetting was undertaken recorded in a prov:Plan object, which is a specialised prov:Entity used to denote methodology. The prov:Plan object could hold computer code, a detailed manual methodology or other instructions.

Another method for representing subsetting is shown in Figure 3B. In this formulation, instructions for performing the subsetting are not given as additional input data but are described by typing the subsetting prov:Activity. An example could be a prov:Activity of a hypothetical class such as “TemporalExtentSubsetting”, instances of which always subset the Large Dataset by some selection of a temporal extent. Sufficient metadata for the typed subsetting activity, such as the actual temporal extents, would need to be provided elsewhere (i.e. not in the provenance graph) in order to remove ambiguity from the action. One location for such metadata could be a register of typed activities maintained for use by a certain set of workflows.

Figure 3C presents a combined formulation in which the typed prov:Activity demands that certain inputs to the subsetting action, in addition to the dataset from which a subset was taken, be represented in the provenance graph.
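The difference between the first two subsetting formulations can be made concrete by writing out the statements each would produce. The sketch below is one possible reading, not a prescribed encoding: it assumes the prov:Plan is linked to the activity as an ordinary input via prov:used, and all ex: identifiers, including the class ex:TemporalExtentSubsetting, are hypothetical.

```python
def subset_with_plan(parent, subset, activity, plan):
    """Plan-based style: instructions live in a prov:Plan the activity uses."""
    return {
        (activity, "rdf:type", "prov:Activity"),
        (plan, "rdf:type", "prov:Plan"),
        (activity, "prov:used", parent),
        (activity, "prov:used", plan),  # assumption: plan linked as an input
        (subset, "prov:wasGeneratedBy", activity),
        (subset, "prov:wasDerivedFrom", parent),
    }


def subset_with_typed_activity(parent, subset, activity, activity_class):
    """Typed style: the activity's class says what kind of subsetting it is."""
    return {
        (activity, "rdf:type", "prov:Activity"),
        (activity, "rdf:type", activity_class),
        (activity, "prov:used", parent),
        (subset, "prov:wasGeneratedBy", activity),
        (subset, "prov:wasDerivedFrom", parent),
    }


def inputs(graph, activity):
    """Sorted list of everything the activity is recorded as having used."""
    return sorted(o for s, p, o in graph if s == activity and p == "prov:used")


plan_graph = subset_with_plan(
    "ex:large", "ex:subset", "ex:subsetting1", "ex:instructions1"
)
typed_graph = subset_with_typed_activity(
    "ex:large", "ex:subset", "ex:subsetting1", "ex:TemporalExtentSubsetting"
)
```

In the plan-based graph the instructions appear as a second input to the activity, whereas in the typed graph the only input is the parent dataset and the details of the subsetting would have to be looked up elsewhere, for example in a register of typed activities.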
[Figure 3 diagram. Panels A, B and C: a Parent Dataset, a Subsetting Activity and a Dataset Subset, with a prov:Plan Subsetting Instruction in A, a typed Subsetting Activity in B, and a classed Subsetting Activity instance with a Subsetting Instruction and an additional required input in C.]

Figure 3. PROV-O representations of subsetting actions. A: Using a prov:Plan object to hold subsetting instructions. B: By classifying the subsetting prov:Activity instance. C: Formulation combining A & B where required inputs are specified by the typed subsetting prov:Activity.

⁷ https://en.wikipedia.org/wiki/SPARQL

Dataset Merging & Splitting

Dataset merging and splitting can be modelled like dataset subsetting, with either prov:Plan objects or typed prov:Activities, or a combination of the two, providing the instructions for the action. It follows that the representations of dataset merging and splitting are akin to those of dataset subsetting shown in Figure 3, but with multiple input (merging) or multiple output (splitting) datasets.

REPRESENTING FEATURE AND DATASET PROVENANCE

Limited sets of typed actions for features

Where the provenance of features manipulated via a limited set of actions is to be represented, the representation shown in Figure 3A or B may be used and then aggregated to dataset-level provenance. Figure 4 shows a representation of a hypothetical set of feature manipulation actions using the formulation given in Figure 3B: “selected”, “not-selected”, “merged”, “split” and the generic “alter” typed prov:Activities are shown. These actions may have been carried out against features in one or more datasets and the results stored in a resultant dataset. They may be carried out by specialized spatial tools, such as ArcGIS, certain actions of which are modelled using PROV-O in [3]. For a scenario in which features from one dataset (perhaps classes of vectors in a cadastral dataset) are manipulated to form features in another dataset, such actions and their associated features may be represented as in Figure 4.
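To make the idea of aggregating typed feature actions to dataset-level provenance more tangible, the following sketch records a few hypothetical feature manipulation actions and derives a dataset-level statement from them. The action types, the ex: identifiers and the convention that a feature identifier is its dataset identifier followed by a slash and a local name are all assumptions made for illustration.

```python
# Each record: (action id, action type, input features, output features).
# Identifiers and types are hypothetical.
ACTIONS = [
    ("ex:act1", "ex:Selected", ["ex:in/f1"], ["ex:out/f1"]),
    ("ex:act2", "ex:NotSelected", ["ex:in/f2"], []),
    ("ex:act3", "ex:Merged", ["ex:in/f3", "ex:in/f4"], ["ex:out/f2"]),
]


def to_triples(actions):
    """Feature-level provenance: feature -> action and action -> feature links."""
    triples = set()
    for action, action_type, inputs, outputs in actions:
        triples.add((action, "rdf:type", action_type))
        for feature in inputs:
            triples.add((action, "prov:used", feature))
        for feature in outputs:
            triples.add((feature, "prov:wasGeneratedBy", action))
    return triples


def dataset_of(feature):
    """Dataset identifier under the assumed 'dataset/feature' convention."""
    return feature.rsplit("/", 1)[0]


def dataset_level(actions):
    """Collapse feature-level actions into dataset-level derivations."""
    derived = set()
    for _, _, inputs, outputs in actions:
        for out_feature in outputs:
            for in_feature in inputs:
                derived.add(
                    (dataset_of(out_feature), "prov:wasDerivedFrom", dataset_of(in_feature))
                )
    return derived
```

With these records, dataset_level(ACTIONS) reduces to a single statement that ex:out was derived from ex:in, while to_triples(ACTIONS) keeps every feature-level link, including the non-selection of ex:in/f2, which produces no output feature.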
Figure 4A shows feature-level manipulation and parts B, C & D show dataset-level integration of feature-level provenance.

[Figure 4 diagram. Panel A: input features passing through “Selected”, “Not-selected”, “Merged”, “Split” and “Alter” actions to output features. Panels B, C and D: an Input Dataset, a Feature Manipulation activity and an Output Dataset, with feature-action and action-feature mappings in B, a feature-action-feature mapping in C and an annotated Output Dataset in D.]

Figure 4. A: Feature manipulation actions as per Figure 3B. B: Aggregation of manipulated features into datasets, with feature/action mappings preserved as prov:Activity inputs and outputs. C: Aggregation of manipulated features into datasets, with feature/feature mappings preserved as a prov:Plan input to the prov:Activity. D: Aggregation of manipulated features into datasets, with feature/feature mappings preserved by annotating output features with links to the actions performed and to features within the input dataset.

Identifier handling

The three feature-level provenance integration strategies presented in Figure 4B, C & D all rely on feature identification in order to link input and output features to their manipulation actions and to each other. All three strategies are therefore dependent on either a mechanism for minting IDs for features that, although they are part of a dataset, are referenceable from outside that dataset, or a feature register that records feature identity independently of any particular dataset. The first case is implementable by URI patterns in accordance with Linked Data⁸ principles, where the feature-level URIs are mapped to a higher-level, dataset-level URI via a relative, logical path.
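A URI pattern of this first kind could be as simple as appending a fixed path segment and a feature identifier to the dataset URI, so that the dataset can always be recovered from the feature URI. The sketch below assumes a hypothetical "/feature/" segment and an example.org namespace; real deployments would choose their own pattern.

```python
from urllib.parse import quote

FEATURE_SEGMENT = "/feature/"  # hypothetical path convention


def feature_uri(dataset_uri, feature_id):
    """Mint a feature URI beneath its dataset URI."""
    # Percent-encode the identifier so it cannot introduce extra path segments.
    return dataset_uri.rstrip("/") + FEATURE_SEGMENT + quote(str(feature_id), safe="")


def dataset_of(uri):
    """Recover the dataset URI from a feature URI minted by feature_uri()."""
    head, sep, tail = uri.rpartition(FEATURE_SEGMENT)
    if not sep or not tail:
        raise ValueError(f"not a feature URI: {uri}")
    return head


cadastre = "http://example.org/dataset/cadastre/"  # hypothetical dataset URI
lot = feature_uri(cadastre, 42)
```

Because the mapping is purely a matter of path structure, a system reporting feature-level provenance and a system reporting dataset-level provenance can agree on how the two relate without exchanging anything beyond the URIs themselves.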
The second case requires a master feature register that can mint identifiers for features which can be referred to by any dataset containing them. Such a register may provide access to authoritative copies of their data, but this is not necessary.

In addition to the requirements listed above, the part B scenario also relies on the identification and storage of the instance of each typed prov:Activity in order to preserve feature-level provenance, since the feature linking is not directly coupled: it is in two parts, input feature(s) → action and then action → output feature(s). The part C scenario conceptualizes the input and output feature mapping as a prov:Plan object, which is appropriate for such a mapping if it contains feature-to-action-to-feature mappings that act as the entire instructions for the “Feature Manipulation” prov:Activity. The part D scenario annotates each feature in the output dataset with the identity of its relevant manipulation action instance as well as the input features manipulated. Such a formulation is also dependent on the identification and storage of the instance of each typed prov:Activity, as per part B, but it also has a shortcoming not present in parts B & C: actions that result in no output feature, such as feature non-selection, will not be identifiable in the annotated output dataset.

FSDF DATASET PRODUCTION CASE STUDY

Detail insertion and dataset subsetting, aggregating and splitting actions, as described two sections above, can readily be used in specific spatial data scenarios. Feature-level action recording and feature/action mapping, as outlined in the section above, can be applied to spatial datasets if the feature manipulation systems are able to record them and if the dependencies, also outlined above, are met. Figure 5 shows the processing of two hypothetical FSDF source datasets (A & B) into an FSDF product.
Part A shows simple dataset-level provenance and part B shows dataset-level provenance with a more detailed PROV-O formulation, as per Figure 3A. Part C implements many of the techniques described above, specifically:

• The whole of 5C shows detail insertion (Figure 2D);
• The path from Source Dataset A to Intermediate X shows detail addition (3A) and either a 4B or a 4C formulation, depending on whether feature-action plus action-feature mapping (4B) or feature-action-feature mapping (4C) is used;
• The Intermediate X to Intermediate Y path shows the typed prov:Activity formulation (3B) and could use annotated output dataset (4D) mapping;
• Intermediate Y plus Source Dataset B fusing to form the FSDF product could be a 3C-type exercise where the typed prov:Activity, “Merging”, specifies two input datasets and a feature-mapping prov:Plan which preserves feature origin knowledge. This formulation is also a feature-action-feature mapping (4C).

[Figure 5 diagram. Panel A: Source Dataset A and Source Dataset B linked to the FSDF Product by wasDerivedFrom. Panel B: the source datasets, an FSDF Production activity with a Plan, and the FSDF Product. Panel C: Source Dataset A, Feature Manipulation activities with Selection Criteria and an ancestor/descendant mapping, Intermediate X, Intermediate Y, a Merging activity, Source Dataset B and the FSDF Product.]

Figure 5. A hypothetical FSDF product generation scenario modelled with different amounts of detail and at different levels of granularity.

⁸ http://www.w3.org/TR/ld-bp/

PROVENANCE DATA MANAGEMENT

It is not possible to write in generalities about provenance data collection or generation, since in-depth knowledge of specific systems is required in order to make sensible descriptions, and collecting provenance data in standardised formats is far harder than managing and storing it [5, see Discussion]. Once collected, however, there is a range of generic tools available to manage and manipulate it.
The PROMS family of tools and their associated methodology [6]⁹ allow any number of systems to report PROV-O-based provenance information and have it stored in a graph database. The system will automatically join provenance graphs where the same node URIs are used; thus detail insertion, as per Figure 2, can easily be achieved. Similarly, small provenance graphs can be joined into larger super-graphs, which allows independent systems to assemble continuous graphs across their individual processes, as long as they can share dataset or feature identifiers in order to report against them.

Most RDF-based graph databases allow querying via SPARQL; thus the abstraction of detailed graphs into simpler ones can take place once detail insertion or multi-process reporting has taken place. Installations of PROMS Server make the SPARQL endpoint of the underlying RDF graph database available for such use, thus allowing fine-to-coarse granularity translation out of the box.

⁹ See http://promsns.org for up-to-date information on the PROMS family of provenance tools

CONCLUSIONS

We have presented a range of PROV-O-based modelling formulations (ontology design patterns) to help provenance data managers meld provenance information at varying levels of granularity. We focused on dataset- and feature-level provenance, as these are the two obvious granularities for spatial data products, but the principles could apply to information at other granularities. We have presented alternative methods for the integration of provenance information of different granularities and pointed out some of the logical and system dependencies that certain patterns require. We have given a very brief FSDF case study implementing many of the techniques and, finally, described several aspects of provenance data management with reference to a particular tool.

ACKNOWLEDGEMENTS

This paper is published with the permission of the CEO, Geoscience Australia.

REFERENCES

[1] Moreau, L. & Missier, P. (eds.) PROV-DM: The PROV Data Model. W3C Recommendation 30 April 2013. W3C (2013). Online at http://www.w3.org/TR/prov-dm/. Accessed 2015-12-08.
[2] Car, N.J. Map data lineage: provenance concepts, tools and future shared infrastructure. Locate2015 Conference presentation (2015).
[3] Sadiq, M.A., West, G., Arnold, L., McMeekin, D.A. & Moncrieff, S. Spatial data supply chain provenance modelling for next generation spatial infrastructures using semantic web technologies. MODSIM2015, Gold Coast, Australia, 29th Nov – 4th Dec, 2015 (2015). Online at http://mssanz.org.au/modsim2015/.
[4] Robinson, I., Webber, J. & Eifrem, E. Graph Databases. O’Reilly Media (2013). ISBN 978-1-4493-5626-2. Online at http://graphdatabases.com. Accessed 2015-12-11.
[5] Wise, C., Car, N.J., Fraser, R. & Squire, G. Standard Provenance Reporting and Scientific Software Management in Virtual Laboratories. MODSIM2015, Gold Coast, Australia, 29th Nov – 4th Dec, 2015 (2015). Online at http://mssanz.org.au/modsim2015/.
[6] Car, N.J., Stenson, M., Hartcher, M., Cox, S., Fitch, P. & Lemon, D. A provenance management methodology and example architecture for science projects containing heterogeneous automated and manual processes. HIC 2014 – 11th International Conference on Hydroinformatics, New York, USA, page 8. International Water Association (2014). Online at http://academicworks.cuny.edu/cc_conf_hic/57/. Accessed 2015-12-11.