=Paper=
{{Paper
|id=Vol-2969/paper18-DEMO
|storemode=property
|title=AMP: An Automated Metadata Pipeline
|pdfUrl=https://ceur-ws.org/Vol-2969/paper18-DEMO.pdf
|volume=Vol-2969
|authors=Beth Huffer,Simon Handley
|dblpUrl=https://dblp.org/rec/conf/jowo/HufferH21
}}
==AMP: An Automated Metadata Pipeline==
Beth Huffer and Simon Handley
Lingua Logica LLC, Denver, Colorado, United States

Abstract
Making data more FAIR (Findable, Accessible, Interoperable, and Reusable) is key to helping data consumers make use of NASA data. The FAIR doctrine embraces the principle that facilitating machine-driven research activities is critical to supporting scientific research in the 21st century. Our automated metadata pipeline, AMP, generates syntactically and semantically consistent metadata records for U.S. National Aeronautics and Space Administration (NASA) Earth science datasets using ontologies and machine learning techniques. AMP addresses issues of usability and scalability for data providers and metadata curators who are asked to create robust metadata records to describe their data products, but who find it difficult to do so because of the lack of available tools. AMP auto-generates information-rich, semantically consistent metadata records for NASA datasets by sending the data through a semantic annotation pipeline that uses ontologies and machine learning techniques to generate sets of RDF assertions that describe each dataset in detail. We demonstrate an end-to-end metadata curator's workflow, the final metadata records produced by AMP, and a data discovery and access application that makes use of those records.

Keywords
Ontologies, Machine Learning, Semantic Interoperability, Data Discovery, Earth Science, FAIR Data

1. Introduction
Our automated metadata pipeline, AMP, generates syntactically and semantically consistent metadata records for U.S. National Aeronautics and Space Administration (NASA) Earth science datasets using ontologies and machine learning techniques. Using AMP as it is deployed on Amazon Web Services (AWS), we demonstrate an end-to-end metadata curator's workflow, the final metadata records produced by AMP, and a data discovery and access application that makes use of those records.
We will include interludes that provide insights into the back-end systems that make the production pipeline possible.

FOIS 2021 Demonstrations, held at FOIS 2021 - 12th International Conference on Formal Ontology in Information Systems, September 13-17, 2021, Bolzano, Italy
EMAIL: beth@lingualogica.net (B. Huffer); simon@lingualogica.net (S. Handley)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. Background
NASA is a champion of free and open access to scientific data. Among the objectives identified in NASA's 2018 Strategic Plan are safeguarding and improving life on Earth, and providing data and applications for operational use across a diverse set of communities of practice [1]. Achieving NASA's science goals - whether improving our ability to predict climate change, or using Earth system science research to inform climate-related policy and planning decisions - depends on researchers, climate modelers, policy makers, environmental planners, and others making optimal use of the full range of Earth science data that NASA has to offer. Making data more FAIR (Findable, Accessible, Interoperable, and Reusable) is key to helping data consumers make use of NASA data. The FAIR doctrine embraces the principle that facilitating machine-driven research activities is critical to supporting scientific research in the 21st century. Yet the Earth science community still struggles to realize the FAIR objectives: scientific data discovery services remain inadequate, and scientists continue to spend a significant percentage of their time finding and preparing data for use in their research. In a study of science data users, Gregory et al.
[2] found that researchers and environmental policy and decision makers "need information from different locations and time, but they have difficulty accessing the information, or finding the right type [of information]… Integrating diverse data is problematic across the environmental sciences. Data collected at different scales and using different nomenclatures are difficult to merge (Dow et al., 2015; Maier, et al., 2014; Bowker, 2000b)."

Implementing effective data discovery services and fully automated, machine-driven transactions starts with creating metadata that provides the information that users - both humans and machines - need in order to understand what is in a given dataset and how to use the data correctly. But such metadata records are uncommon. More often, metadata records are inadequately contextualized, incomplete, or simply do not exist. Metadata requirements at data centers tend to be minimal, to ease the burden on data producers and metadata curators, and the metadata that does exist lacks adequate semantic underpinnings. As a result, tools and services for discovering and using data are, at best, syntactically interoperable; they lack the semantic understanding necessary to achieve FAIR objectives. Unless the problem of semantic interoperability is addressed, scientific data discovery tools will continue to struggle to provide relevant search results, researchers will continue to struggle to make their data interoperate with their analytical tools, and NASA's goal of optimizing use of the full range of Earth science data that it has to offer will remain unrealized.

3. Automating the Metadata Production Process

3.1. Overview
Contributing to the problem of inadequate metadata is the fact that tools for generating metadata rely largely on manual curation and have little or no shared semantics.
In some cases, metadata curators may be asked to pick from a controlled list of keywords, but this approach does not scale, and consistency is difficult to enforce. NASA archives currently hold over 8,000 data collections, many of which contain 100 or more individual datasets needing metadata. Manual curation of metadata for NASA datasets is not feasible. If metadata are to provide the means to address scientific data discovery and interoperability challenges at NASA, then scalable, user-friendly tools that can generate FAIR-compliant metadata are needed.

Our Automated Metadata Pipeline (AMP) addresses issues of usability and scalability for data producers and metadata curators who are asked to create robust metadata records to describe their data products, but who find it difficult to do so because of the lack of available tools. AMP auto-generates information-rich, semantically consistent metadata records for NASA datasets by sending the data through a semantic annotation pipeline that uses ontologies and machine learning techniques to generate sets of RDF assertions that describe each dataset in detail.

To annotate the datasets in NASA collections, we look to the data itself to tell us what it represents. Our metadata production pipeline retrieves, parses, and subsets the datasets in large NASA Earth science data collections, and stages them in AWS S3 containers, where our machine learning (ML) component accesses and analyzes them. The ML component has been trained on a set of well-understood, well-characterized, and carefully chosen datasets that we call the "gold standard" datasets. A model of the temporal and spatial patterns in the data is derived from the gold standard datasets; using this model, the ML component determines, for each new dataset (the "de novo dataset"), which of the gold standards the temporal-spatial patterns of the de novo dataset are most similar to.
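This similarity-based matching can be sketched in a few lines. The sketch below is illustrative only: the distance metric (Euclidean) and the use of a single representative snapshot per gold standard are assumptions, since the paper does not specify either.

```python
# Hypothetical sketch of gold-standard matching: a de novo dataset,
# reduced to temporal snapshots (here, flat lists of grid values), is
# assigned the label of the most similar gold-standard dataset.
# Euclidean distance is an assumption; the paper does not name the metric.
import math

def snapshot_distance(a, b):
    """Euclidean distance between two equally sized snapshots."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(de_novo_snapshots, gold_standards):
    """Return (label, mean_distance) of the closest gold standard.

    gold_standards maps a label to one representative snapshot.
    """
    scores = {}
    for name, gold_snapshot in gold_standards.items():
        dists = [snapshot_distance(s, gold_snapshot)
                 for s in de_novo_snapshots]
        scores[name] = sum(dists) / len(dists)
    label = min(scores, key=scores.get)
    return label, scores[label]

# Toy example: a de novo dataset clearly closer to the "precip" pattern.
gold = {"precip": [0.0, 1.0, 2.0], "temp": [10.0, 11.0, 12.0]}
label, score = best_match([[0.1, 1.1, 2.1], [0.0, 0.9, 2.0]], gold)
```

The mean distance plays the role of a confidence signal: only a sufficiently close match would trigger the downstream inference rules.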
Given enough confidence in this similarity estimate, inference rules in the AMP Ontology attribute to the de novo dataset the same set of RDF assertions that describe the gold standard dataset. This set of assertions provides sufficient detail about the dataset to support highly precise data discovery, and to enable a downstream system to use the data. See Figure 1.

Figure 1. The AMP Workflow

3.2. Automating the Metadata Production Pipeline
Much of NASA's Earth science data is packaged in large collections, which bundle numerous individual datasets together for storage in an opaque format called HDF (Hierarchical Data Format). HDF is well suited to cold-storage archiving, but poorly suited to data analysis. To overcome this, we developed a sophisticated, serverless back-end system that retrieves and prepares the data for analysis by the ML component. The back-end system connects to NASA's 12 Distributed Active Archive Centers (DAACs) via Application Programming Interfaces (APIs), using the AWS API Gateway to accept incoming requests from a web-based metadata curator console. It executes a complex workflow that 1) populates a queuing system with file download URLs, 2) auto-scales AWS Fargate containers to download each file, 3) parses each file to extract the individual datasets (called "slicing"), 4) extracts information needed by the AMP ontology to generate inference rules, and 5) populates S3 buckets with the sliced datasets so they can be accessed by the ML component. A series of scripts converts the information extracted by the back-end, and the output of the ML component, into inference rules and assertions about the antecedent conditions that trigger the rules. These are pushed to the AMP ontology and used to generate the metadata records for each of the sliced datasets.

The AMP Ontology constitutes a conceptual model of the relationships between datasets and the Earth System observations that they record.
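Stepping back to the queue-driven back-end workflow, its shape can be approximated locally. The sketch below is a hypothetical stand-in, not AMP's implementation: Python's standard-library queue and threads take the place of the AWS queuing service and Fargate containers, a list stands in for an S3 bucket, and the slicing function is a placeholder for real HDF parsing.

```python
# Local, hypothetical stand-in for the serverless workflow: a queue of
# file URLs is drained by worker threads that "download" and slice each
# file into individual datasets, which are staged for the ML component.
import queue
import threading

def slice_file(filename):
    """Placeholder for HDF parsing: pretend each file holds two datasets."""
    return [f"{filename}/dataset_a", f"{filename}/dataset_b"]

def run_pipeline(urls, num_workers=4):
    work = queue.Queue()
    for url in urls:
        work.put(url)
    staged = []               # stands in for the S3 bucket of sliced data
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return
            datasets = slice_file(url)   # download + slice one file
            with lock:
                staged.extend(datasets)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return staged

staged = run_pipeline(["granule1.h5", "granule2.h5"])
```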
Each dataset has an identifiable and well-defined type, with a unique set of membership criteria: i.e., a set of salient facts that determine, for any dataset and any dataset type, whether or not the dataset is of that type. Following the design pattern of the Semantic Sensor Network Ontology [3], the salient facts about a dataset include, for example: the Earth System feature of interest or phenomenon that was measured by the sensor (or derived from a sensor measurement), e.g., water evaporation; the particular quantity of that feature that was measured (or derived), e.g., mass flux rate; the medium or context in which it was measured, e.g., the Earth's atmosphere; the vertical profile of the measurement, e.g., the Earth's surface; the process involved, if any, e.g., sublimation; and the spatial and temporal resolution of the data. Datasets that share those sets of facts are of the same type.

The ML component classifies a de novo dataset in terms of the set of gold-standard datasets: a de novo dataset X is predicted to have the same label (and therefore the same properties) as a gold standard Y if X is more similar to Y than to any of the other gold-standard datasets. Each dataset is a time series of gridded measurement values that represent daily, monthly, 1-hourly, or 3-hourly averages, each of which can be thought of as a temporal snapshot of the measured phenomenon. Each temporal snapshot in the gold standard datasets becomes a training example, and is given the same label as the dataset it belongs to. All examples - gold standard and de novo - are spatially resampled to a 1.0°×1.0° grid, producing a 180×360 input matrix for datasets with global coverage. The classification of the de novo dataset is determined by decomposing it into a set of temporal snapshots, where each snapshot inherits its label from the dataset it belongs to. When training, each example is an input (the temporal snapshot, a 180×360 matrix) along with an output (the label or class).
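The resampling and labeling of training examples can be sketched as below. The block-averaging scheme is an assumption (the paper does not specify the resampling method), and the toy grids are far smaller than the 180×360 matrices used in practice.

```python
# Hypothetical sketch of the spatial resampling step: snapshots on a
# finer grid are block-averaged down to a common coarser grid (1.0 degree,
# i.e. 180x360 for global coverage, in the real pipeline).

def block_average(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2-D list."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for i in range(0, rows, factor):
        row = []
        for j in range(0, cols, factor):
            block = [grid[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def make_examples(snapshots, label, factor):
    """Each resampled snapshot becomes one labeled training example."""
    return [(block_average(s, factor), label) for s in snapshots]

# Toy example: a 4x4 snapshot on a 0.5-degree grid, resampled by factor 2
# to a 2x2 snapshot on a 1.0-degree grid.
snap = [[1, 1, 2, 2],
        [1, 1, 2, 2],
        [3, 3, 4, 4],
        [3, 3, 4, 4]]
examples = make_examples([snap], "gold_standard_label", factor=2)
```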
When doing inference, we compute a softmax probability distribution for each temporal snapshot from the de novo dataset, combine these probability distributions into a distribution for the entire dataset, and use the per-dataset distribution to predict the class of the de novo dataset. The initial training uses 49 individual datasets from 8 NASA data collections, which gives us 189,269 examples (labeled temporal snapshots), of which 50% were used for training, 25% for testing, and the remaining 25% as a hold-out set.

Once a de novo dataset has been classified with sufficient confidence, the ML component posts an assertion to the ontology via a SPARQL endpoint, indicating which gold standard dataset the de novo dataset is most similar to. Upon receiving this information from the ML component, a set of inference rules, written as constructors in SPARQL Inferencing Notation (SPIN), is executed to generate the appropriate assertions for the de novo dataset. For example, this rule generates the set of assertions that indicate the feature of interest, measured quantity, and measurement context of the dataset:

CONSTRUCT {
    ?deNovo ?attribute ?value
}
WHERE {
    ?deNovo AMP:matches ?goldStandard .
    ?attribute rdf:type AMP:AMPScienceProperty .
    ?goldStandard ?attribute ?value
}

Other inference rules assert properties specific to the datasets in virtue of the collection they belong to. These rules are generated automatically using information extracted by the back-end pre-processing platform. Every data collection (i.e., a group of datasets packaged together in a set of HDF files) is created as a class in the ontology, of which the individual datasets are instances, and datasets in the collection share many common properties, such as spatial resolution.
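The CONSTRUCT rule for AMP:matches amounts to copying every science-property assertion from the matched gold standard onto the de novo dataset. A pure-Python reading of that semantics, modeling the triple store as a set of (subject, predicate, object) tuples (all names below are illustrative):

```python
# Sketch of the match rule's effect: for every AMP:matches link, copy the
# gold standard's attributes onto the de novo dataset, but only for
# attributes typed as AMP:AMPScienceProperty.

def apply_match_rule(triples):
    """Copy AMPScienceProperty-typed attributes across AMP:matches links."""
    science_props = {s for (s, p, o) in triples
                     if p == "rdf:type" and o == "AMP:AMPScienceProperty"}
    new = set()
    for (de_novo, p, gold) in triples:
        if p != "AMP:matches":
            continue
        for (s, attr, value) in triples:
            if s == gold and attr in science_props:
                new.add((de_novo, attr, value))
    return triples | new

store = {
    ("AMP:featureOfInterest", "rdf:type", "AMP:AMPScienceProperty"),
    ("gold1", "AMP:featureOfInterest", "AMP:WaterEvaporation"),
    ("gold1", "AMP:label", "Gold standard 1"),  # not a science property
    ("deNovo1", "AMP:matches", "gold1"),
}
store = apply_match_rule(store)
```

Note that non-science properties (such as the label above) are deliberately not propagated, mirroring the rule's restriction to AMP:AMPScienceProperty attributes.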
This rule, for example, about the spatial resolution of the data in the GPM_3CMB_DAY_V06 data collection is attached to an ontology class of the same name, so that each dataset belonging to the collection will be annotated with an appropriate assertion about its spatial resolution:

CONSTRUCT {
    ?this AMP:spatialResolution AMP:Grid.25X.25Degree .
}
WHERE { }

This rule about the instrument used to collect the data in the GPM_3CMB_DAY_V06 collection is also attached to it:

CONSTRUCT {
    ?this AMP:instrument AMP:GMIInstrument .
}
WHERE { }

Using the combined techniques of the AMP end-to-end pipeline, we are able to auto-generate robust, detailed, and semantically consistent metadata records that drive a faceted search capability, enabling users to specify precisely what kind of data they are looking for and get back a set of results that actually satisfies their criteria. For example, a user can request measurements of black carbon mass concentration in the atmosphere, while excluding both organic carbon concentration in soil and optical depth due to black carbon. This allows a researcher to quickly identify the highly relevant datasets, and to filter them further by particular spatial resolutions, temporal resolutions, or instruments. We are implementing the faceted search capability in our prototype data discovery and access platform, which will be included in our demonstration.

AMP metadata records provide additional support for semantic interoperability by linking concepts referred to in AMP-generated metadata records to terms from well-established external ontologies and glossaries such as the OBO Foundry's Environment Ontology (ENVO) [4], the Chemical Entities of Biological Interest (ChEBI) Ontology [5], the W3C Ontology for Quantity Kinds and Units [6], and the American Meteorological Society's Glossary of Meteorology [7].
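The faceted search described above can be sketched as a filter over metadata records. The facet names and values below are illustrative placeholders, not AMP's actual vocabulary, and each record is simplified to a flat dict of facets.

```python
# Hypothetical sketch of faceted search over AMP-style metadata records:
# a query combines required facet values with explicitly excluded ones.

def faceted_search(records, require, exclude=()):
    """Return ids of records matching all `require` pairs and no `exclude` pairs."""
    hits = []
    for rec_id, facets in records.items():
        if (all(facets.get(f) == v for f, v in require.items())
                and not any(facets.get(f) == v for f, v in exclude)):
            hits.append(rec_id)
    return sorted(hits)

records = {
    "ds1": {"quantity": "mass concentration", "substance": "black carbon",
            "medium": "atmosphere"},
    "ds2": {"quantity": "mass concentration", "substance": "organic carbon",
            "medium": "soil"},
    "ds3": {"quantity": "optical depth", "substance": "black carbon",
            "medium": "atmosphere"},
}
# Black carbon mass concentration, excluding soil measurements: only ds1
# should survive (ds2 fails on substance, ds3 on quantity).
result = faceted_search(
    records,
    require={"substance": "black carbon", "quantity": "mass concentration"},
    exclude=(("medium", "soil"),),
)
```

Because every record carries the same semantically consistent facets, exclusions like "not optical depth" fall out of simple equality tests rather than keyword heuristics.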
These mappings help ensure that external systems already making use of those vocabularies can correctly interpret AMP-generated metadata. The URIs also provide additional information to metadata consumers about the definitions of the terms used in the metadata records, which we plan to make use of in future development to implement text search.

Our demonstration features an end-to-end run of the semantic annotation pipeline, with insights into the back-end mechanisms that make it work. We will also demonstrate our data discovery and access service, NASA Made Simple, which uses the metadata records we produce with AMP to drive a faceted search service that returns highly relevant results in response to user inputs.

4. Acknowledgements
This work was made possible through Grant No. 80NSSC20K0209 from NASA's Earth Science Technology Office Advanced Information Systems Technology Program.

5. References
[1] NASA 2018 Strategic Plan, 2018. URL: https://www.nasa.gov/sites/default/files/atoms/files/nasa_2018_strategic_plan.pdf
[2] Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., and Wyatt, S. (2019). Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines. Journal of the Association for Information Science and Technology, 70: 419-432. doi:10.1002/asi.24165
[3] Atkinson, R., García-Castro, R., Lieberman, J., and Stadler, C. Semantic Sensor Network Ontology. URL: https://www.w3.org/TR/vocab-ssn/
[4] Buttigieg, P. L., Morrison, N., Smith, B., Mungall, C. J., and Lewis, S. E. (2013). The environment ontology: contextualising biological and biomedical entities. Journal of Biomedical Semantics, 4(1), 43. doi:10.1186/2041-1480-4-43
[5] Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukrishnan, V., Turner, S., Swainston, N., Mendes, P., and Steinbeck, C. (2016). ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research. doi:10.1093/nar/gkv1031
[6] Lefort, L. (2010). Ontology for Quantity Kinds and Units: units and quantities definitions. URL: https://www.w3.org/2005/Incubator/ssn/ssnx/qu/qu-rec20.html
[7] American Meteorological Society, Glossary of Meteorology. URL: https://glossary.ametsoc.org/wiki/Welcome