=Paper=
{{Paper
|id=Vol-3073/paper22
|storemode=property
|title=Leveraging Ontologies within the National Microbiome Data Collaborative
|pdfUrl=https://ceur-ws.org/Vol-3073/paper22.pdf
|volume=Vol-3073
|authors=William D. Duncan,Faiza Ahmed,Fnu Anubhav,Jeffrey Baumes,Jonathan Beezley,Mark Borkum,Lisa Bramer,Shane Canon,Patrick Chain,Danielle Christianson,Yuri Corilo,Karen Davenport,Brandon Davis,Meghan Drake,Kjiersten Fagnan,Mark Flynn,David Hays,Bin Hu,Marcel Huntemann,Julia Kelliher,Sofya Lebedeva,Po-E Li,Mary Lipton,Chien-Chi Lo,Douglas Mans,Stanton Martin,Lee Ann McCue,David Millard,Kayd Miller,Nigel Mouncey,Paul Piehowski,Elais Player Jackson,Anastasiya Prymolenna,Samuel Purvine,Tbk Reddy,Rachel Richardson,Migun Shakya,Montana Smith,Jagadish Chandrabose Sundaramurthi,Mark A. Miller,Deepak Unni,Pajau Vangay,Bruce Wilson,Donald Winston,Elisha Wood-Charlson,Yan Xu,Emiley Eloe-Fadrosh,Christopher J. Mungall
|dblpUrl=https://dblp.org/rec/conf/icbo/DuncanAABBBBCCC21
}}
==Leveraging Ontologies within the National Microbiome Data Collaborative==
William D. Duncan¹, Faiza Ahmed², Fnu Anubhav³, Jeffrey Baumes², Jonathan Beezley², Mark Borkum³, Lisa Bramer³, Shane Canon¹, Patrick Chain⁴, Danielle Christianson¹, Yuri Corilo³, Karen Davenport⁴, Brandon Davis², Meghan Drake⁵, Kjiersten Fagnan¹, Mark Flynn⁴, David Hays¹, Bin Hu⁴, Marcel Huntemann¹, Julia Kelliher⁴, Sofya Lebedeva¹, Po-E Li⁴, Mary Lipton³, Chien-Chi Lo⁴, Douglas Mans³, Stanton Martin⁵, Lee Ann McCue³, David Millard³, Kayd Miller¹, Nigel Mouncey¹, Paul Piehowski³, Elais Player Jackson⁴, Anastasiya Prymolenna³, Samuel Purvine³, TBK Reddy¹, Rachel Richardson³, Migun Shakya⁴, Montana Smith³, Jagadish Chandrabose Sundaramurthi¹, Mark A. Miller¹, Deepak Unni¹, Pajau Vangay¹, Bruce Wilson⁵, Donald Winston⁶, Elisha Wood-Charlson¹, Yan Xu⁴, Emiley Eloe-Fadrosh¹ and Christopher J. Mungall¹

¹ Lawrence Berkeley National Laboratory, Berkeley, CA, USA; ² Kitware, Clifton Park, NY, USA; ³ Pacific Northwest National Laboratory, Richland, WA, USA; ⁴ Los Alamos National Laboratory, Los Alamos, NM, USA; ⁵ Oak Ridge National Laboratory, Oak Ridge, TN, USA; ⁶ Polyneme LLC, New York, NY, USA

Abstract

The National Microbiome Data Collaborative (NMDC) is a multi-organizational effort to integrate microbiome data across diverse areas in environmental science. Data provided by the NMDC can then undergo advanced analysis and provide new insights into metagenomics, metatranscriptomics, metaproteomics, and metabolomics. To address these challenges, we have developed our schema using the Linked data Modeling Language (LinkML). This allows us to easily map data to existing standards and ontologies.

Keywords

Ontology, environmental science, environmental metagenomics

1. Introduction

The National Microbiome Data Collaborative (NMDC) is a multi-organizational effort to integrate microbiome data across diverse areas in environmental science.
Data provided by the NMDC can then undergo advanced analysis and provide new insights into metagenomics, metatranscriptomics, metaproteomics, and metabolomics. A major challenge for the NMDC is that data are heterogeneous and complex, and existing standards and ontologies are lacking or incomplete. To address these challenges, we have developed our schema using the Linked data Modeling Language (LinkML). This allows us to easily map data to existing standards where appropriate. This includes mapping of both schema elements and data values. For instance, in the NMDC schema, the LinkML syntax specification maps the NMDC term biosample processing to the Ontology for Biomedical Investigations’ (OBI) term material processing:

<pre>
biosample processing:
  aliases:
    - material processing
  is_a: named thing
  description: >-
    A process that takes one or more biosamples as inputs
    and generates one or as outputs.
  slots:
    - has input
  broad_mappings:
    - OBI:0000094
</pre>

In this example, outputs of the biosample processing are not specified, since not every biosample process will necessarily have an output, and the mapping relation is a “broad mapping”, meaning that the OBI term is more general than the NMDC term. LinkML’s broad_mappings relation implements the broadMatch predicate from the Simple Knowledge Organization System (SKOS) namespace. Thus, by using SKOS predicates, the NMDC schema (via LinkML) leverages a well-established standard for mapping terms.

International Conference on Biomedical Ontologies 2021, September 16–18, 2021, Bozen-Bolzano, Italy. EMAIL: wdduncan@gmail.com. ORCID: 0000-0001-9625-1899. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

For each biosample in the NMDC database, we record a number of important properties about the biosample’s environment.
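
As an aside on the broad mapping discussed above, the following minimal Python sketch shows one way the broad_mappings entry could be read as a SKOS broadMatch statement. It is an illustration under stated assumptions, not LinkML's own tooling: the dictionary stands in for the parsed YAML, the `nmdc` base IRI and the CamelCase class IRI are assumed, and the function names are hypothetical.

```python
# IRI of the SKOS broadMatch predicate.
SKOS_BROAD_MATCH = "http://www.w3.org/2004/02/skos/core#broadMatch"

# Prefix expansions. The OBI expansion follows the usual OBO PURL pattern;
# the nmdc base IRI is an assumption made for this sketch.
PREFIXES = {
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
    "nmdc": "https://w3id.org/nmdc/",
}


def expand_curie(curie: str) -> str:
    """Expand a CURIE such as 'OBI:0000094' into a full IRI."""
    prefix, local_id = curie.split(":", 1)
    return PREFIXES[prefix] + local_id


def broad_match_triples(class_name: str, class_def: dict) -> list:
    """Return one (subject, predicate, object) triple per broad mapping.

    The subject IRI is built by CamelCasing the class name, which is an
    assumption of this sketch rather than a documented NMDC convention.
    """
    subject = PREFIXES["nmdc"] + class_name.title().replace(" ", "")
    return [
        (subject, SKOS_BROAD_MATCH, expand_curie(curie))
        for curie in class_def.get("broad_mappings", [])
    ]


# Stand-in for the parsed YAML definition shown in the text.
biosample_processing = {
    "aliases": ["material processing"],
    "is_a": "named thing",
    "slots": ["has input"],
    "broad_mappings": ["OBI:0000094"],
}

triples = broad_match_triples("biosample processing", biosample_processing)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

Read as a SKOS statement, the triple says that the object (the OBI term) is the broader concept, which matches the description of the mapping given above.
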
We standardize this information in two ways. First, we utilize the env_broad_scale, env_local_scale, and env_medium terms defined by the Genomic Standards Consortium (GSC) MIxS (Minimum Information about any (x) Sequence) standard (note that MIxS is in the process of migrating to LinkML). Roughly speaking, env_medium defines the material containing the microorganism, env_local_scale defines geographic features of the material, and env_broad_scale defines the biosample’s biome. Second, we use terms from the Environment Ontology (EnvO) to provide values for the aforementioned MIxS terms. For example, the following JSON-formatted record from the NMDC database shows the material, geographic feature, and biome of the biosample identified by the compact URI (i.e., CURIE) gold:Gb0115217:

<pre>
{
  "id": "gold:Gb0115850",
  "env_medium": {"has_raw_value": "ENVO:00005802"},       # bulk soil
  "env_local_scale": {"has_raw_value": "ENVO:00000291"},  # drainage basin
  "env_broad_scale": {"has_raw_value": "ENVO:00000446"},  # terrestrial biome
  ...
}
</pre>

In other words, the microorganisms within this biosample were found in a portion of bulk soil taken from a drainage basin in a terrestrial biome. Moreover, using EnvO’s terms to define the environmental context of biosamples also permits us to leverage the ontology’s semantics. For instance, since bulk soil is a kind of soil, and a drainage basin is a kind of (geographic) depression, we can make use of EnvO’s hierarchy to find other kinds of soil (e.g., dry soil) that are found in other kinds of depressions (e.g., dry lake).

Finally, in the NMDC, we track many aspects of data provenance. This is especially important for computational workflows that produce files used for genomic analysis, such as metabolomics files. For this, we make use of the Provenance Ontology (PROV).
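
The hierarchy-based lookup described above (soils found in depressions) can be sketched in a few lines of Python. This is a toy illustration only: the `IS_A` table is a hand-written excerpt containing just the relations named in the text, it uses labels rather than EnvO identifiers, it assumes a single parent per term (EnvO itself allows several), and the sample records and function names are hypothetical.

```python
# Toy excerpt of is_a links, keyed by label. Only the relations mentioned
# in the text are included; this is not loaded from EnvO.
IS_A = {
    "bulk soil": "soil",
    "dry soil": "soil",
    "drainage basin": "depression",
    "dry lake": "depression",
}


def is_kind_of(term, ancestor):
    """Walk is_a links upward from term; True if ancestor is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)  # None once the top of the excerpt is reached
    return False


# Hypothetical biosample records, with labels in place of EnvO CURIEs.
SAMPLES = [
    {"id": "sample-1", "env_medium": "bulk soil", "env_local_scale": "drainage basin"},
    {"id": "sample-2", "env_medium": "dry soil", "env_local_scale": "dry lake"},
    {"id": "sample-3", "env_medium": "sea water", "env_local_scale": "drainage basin"},
]


def soil_in_depression(samples):
    """Return ids of samples whose medium is a soil and whose local scale is a depression."""
    return [
        s["id"]
        for s in samples
        if is_kind_of(s["env_medium"], "soil")
        and is_kind_of(s["env_local_scale"], "depression")
    ]


matches = soil_in_depression(SAMPLES)
print(matches)
```

A query written against the general terms (soil, depression) picks up both the bulk soil and the dry soil samples without listing each subtype explicitly, which is the benefit of annotating with ontology terms rather than free text.
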
In the following record, we use PROV’s wasGeneratedBy predicate to specify that the file resulted from the activity identified as nmdc:6fdeaf901c4c4c8fa19ec94696a2d03a:

<pre>
{
  "id": "nmdc:eadcd3b883c0da0f9d42a9fb1162ffcf",
  "name": "Froze_Core_2015_S2_0_10_7_Metab.csv",
  "description": "MetaMS GC-MS metabolomics output detail CSV file",
  "file_size_bytes": 565558,
  "md5_checksum": "eadcd3b883c0da0f9d42a9fb1162ffcf",
  "url": "https://nmdcdemo.emsl.pnnl.gov/metabolomics/results/Froze_Core_2015_S2_0_10_7_Metab.csv",
  "was_generated_by": "nmdc:6fdeaf901c4c4c8fa19ec94696a2d03a"
}
</pre>

2. Acknowledgements

This work is supported by the Genomic Science Program in the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER) under contract numbers DE-AC02-05CH11231 (LBNL), 89233218CNA000001 (LANL), DE-AC05-00OR22725 (ORNL), and DE-AC05-76RL01830 (PNNL).
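
Returning to the metabolomics file record shown in the introduction, the short Python sketch below illustrates how its was_generated_by field could be read as a PROV wasGeneratedBy statement linking the file to the activity that produced it. The `nmdc` base IRI and the function names are assumptions made for illustration; the sketch does not describe how the NMDC itself serializes provenance.

```python
# IRI of the PROV wasGeneratedBy predicate.
PROV_WAS_GENERATED_BY = "http://www.w3.org/ns/prov#wasGeneratedBy"

# Assumed expansion of the nmdc prefix, used only for this sketch.
NMDC_BASE = "https://w3id.org/nmdc/"

# Subset of the data object record shown in the text.
record = {
    "id": "nmdc:eadcd3b883c0da0f9d42a9fb1162ffcf",
    "name": "Froze_Core_2015_S2_0_10_7_Metab.csv",
    "was_generated_by": "nmdc:6fdeaf901c4c4c8fa19ec94696a2d03a",
}


def expand_nmdc(curie: str) -> str:
    """Expand an nmdc CURIE into a full IRI; reject other prefixes."""
    prefix, local_id = curie.split(":", 1)
    if prefix != "nmdc":
        raise ValueError(f"unexpected prefix: {prefix}")
    return NMDC_BASE + local_id


def provenance_triple(rec: dict) -> tuple:
    """Return (entity, prov:wasGeneratedBy, activity) for a data object record."""
    return (
        expand_nmdc(rec["id"]),
        PROV_WAS_GENERATED_BY,
        expand_nmdc(rec["was_generated_by"]),
    )


triple = provenance_triple(record)
print("<{}> <{}> <{}> .".format(*triple))
```
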