=Paper= {{Paper |id=Vol-3073/paper22 |storemode=property |title=Leveraging Ontologies within the National Microbiome Data Collaborative |pdfUrl=https://ceur-ws.org/Vol-3073/paper22.pdf |volume=Vol-3073 |authors=William D. Duncan,Faiza Ahmed,Fnu Anubhav,Jeffrey Baumes,Jonathan Beezley,Mark Borkum,Lisa Bramer,Shane Canon,Patrick Chain,Danielle Christianson,Yuri Corilo,Karen Davenport,Brandon Davis,Meghan Drake,Kjiersten Fagnan,Mark Flynn,David Hays,Bin Hu,Marcel Huntemann,Julia Kelliher,Sofya Lebedeva,Po-E Li,Mary Lipton,Chien-Chi Lo,Douglas Mans,Stanton Martin,Lee Ann McCue,David Millard,Kayd Miller,Nigel Mouncey,Paul Piehowski,Elais Player Jackson,Anastasiya Prymolenna,Samuel Purvine,Tbk Reddy,Rachel Richardson,Migun Shakya,Montana Smith,Jagadish Chandrabose Sundaramurthi,Mark A. Miller,Deepak Unni,Pajau Vangay,Bruce Wilson,Donald Winston,Elisha Wood-Charlson,Yan Xu,Emiley Eloe-Fadrosh,Christopher J. Mungall |dblpUrl=https://dblp.org/rec/conf/icbo/DuncanAABBBBCCC21 }} ==Leveraging Ontologies within the National Microbiome Data Collaborative== https://ceur-ws.org/Vol-3073/paper22.pdf
Leveraging Ontologies within the National Microbiome Data
Collaborative
William D. Duncan¹, Faiza Ahmed², Fnu Anubhav³, Jeffrey Baumes², Jonathan Beezley², Mark
Borkum³, Lisa Bramer³, Shane Canon¹, Patrick Chain⁴, Danielle Christianson¹, Yuri Corilo³,
Karen Davenport⁴, Brandon Davis², Meghan Drake⁵, Kjiersten Fagnan¹, Mark Flynn⁴, David
Hays¹, Bin Hu⁴, Marcel Huntemann¹, Julia Kelliher⁴, Sofya Lebedeva¹, Po-E Li⁴, Mary
Lipton³, Chien-Chi Lo⁴, Douglas Mans³, Stanton Martin⁵, Lee Ann McCue³, David Millard³,
Kayd Miller¹, Nigel Mouncey¹, Paul Piehowski³, Elais Player Jackson⁴, Anastasiya
Prymolenna³, Samuel Purvine³, TBK Reddy¹, Rachel Richardson³, Migun Shakya⁴, Montana
Smith³, Jagadish Chandrabose Sundaramurthi¹, Mark A. Miller¹, Deepak Unni¹, Pajau
Vangay¹, Bruce Wilson⁵, Donald Winston⁶, Elisha Wood-Charlson¹, Yan Xu⁴, Emiley Eloe-
Fadrosh¹ and Christopher J. Mungall¹
¹ Lawrence Berkeley National Laboratory, Berkeley, CA, USA
² Kitware, Clifton Park, NY, USA
³ Pacific Northwest National Laboratory, Richland, WA, USA
⁴ Los Alamos National Laboratory, Los Alamos, NM, USA
⁵ Oak Ridge National Laboratory, Oak Ridge, TN, USA
⁶ Polyneme LLC, New York, NY, USA

                               Abstract
                               The National Microbiome Data Collaborative (NMDC) is a multi-organizational effort to
                               integrate microbiome data across diverse areas in environmental science. Data provided by the
                               NMDC can then undergo advanced analysis and provide new insights into metagenomics,
                               metatranscriptomics, metaproteomics, and metabolomics. Because these data are
                               heterogeneous and complex, and existing standards and ontologies are lacking or
                               incomplete, we have developed our schema using the Linked data Modeling Language
                               (LinkML). This allows us to easily map data to existing standards and ontologies.

                               Keywords
                               Ontology, environmental science, environmental metagenomics




1. Introduction

   The National Microbiome Data Collaborative (NMDC) is a multi-organizational effort to integrate
microbiome data across diverse areas in environmental science. Data provided by the NMDC can then
undergo advanced analysis and provide new insights into metagenomics, metatranscriptomics,
metaproteomics, and metabolomics.
   A major challenge for the NMDC is that data are heterogeneous and complex, and existing standards
and ontologies are lacking or incomplete. To address these challenges, we have developed our schema
using the Linked data Modeling Language (LinkML). This allows us to easily map data to existing
standards where appropriate. This includes mapping of both schema elements and data values. For
instance, in the NMDC schema, the LinkML syntax specification maps the NMDC term biosample
processing to the Ontology for Biomedical Investigations’ (OBI) term material processing.


International Conference on Biomedical Ontologies 2021, September 16–18, 2021, Bozen-Bolzano, Italy
EMAIL: wdduncan@gmail.com
ORCID: 0000-0001-9625-1899
                            © 2021 Copyright for this paper by its authors.
                            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                            CEUR Workshop Proceedings (CEUR-WS.org)
        biosample processing:
          aliases:
            - material processing
          is_a: named thing
          description: >-
            A process that takes one or more biosamples as inputs and generates one or
            more as outputs.
          slots:
            - has input
          broad_mappings:
            - OBI:0000094

    In this example, outputs of the biosample processing are not specified, since not every biosample
process will necessarily have an output, and the mapping relation is a “broad mapping”, meaning that
the OBI term is more general than the NMDC term. LinkML’s broad_mappings relation implements the
broadMatch predicate from the Simple Knowledge Organization System (SKOS). Thus, by using SKOS
predicates, the NMDC schema (via LinkML) leverages a well-established standard for mapping terms.
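
    To make the broad mapping concrete, the following minimal Python sketch turns a
broad_mappings entry into a SKOS broadMatch statement. It is an illustration only and is not
part of the NMDC or LinkML tooling: the dictionary is a hand-copied subset of the schema
fragment, the NMDC base IRI and the underscore-based subject naming are assumptions, and the
function names are hypothetical.

```python
# Hand-copied subset of the schema fragment (illustrative only).
SCHEMA_CLASS = {
    "name": "biosample processing",
    "is_a": "named thing",
    "broad_mappings": ["OBI:0000094"],
}

# Prefix expansions. The OBO PURL pattern and the SKOS namespace are standard;
# the NMDC base IRI is an assumption made for this sketch.
PREFIXES = {
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "nmdc": "https://w3id.org/nmdc/",
}


def expand_curie(curie: str) -> str:
    """Expand a CURIE such as 'OBI:0000094' into a full IRI."""
    prefix, local_id = curie.split(":", 1)
    return PREFIXES[prefix] + local_id


def broad_match_triples(cls: dict) -> list:
    """Return one (subject, predicate, object) triple per broad mapping."""
    # Assumption: the subject IRI is the class name with spaces replaced by underscores.
    subject = PREFIXES["nmdc"] + cls["name"].replace(" ", "_")
    predicate = expand_curie("skos:broadMatch")
    return [
        (subject, predicate, expand_curie(curie))
        for curie in cls.get("broad_mappings", [])
    ]


triples = broad_match_triples(SCHEMA_CLASS)
for s, p, o in triples:
    # Print each statement in an N-Triples-like form.
    print(f"<{s}> <{p}> <{o}> .")
```

    Read as a statement, the resulting triple says that the OBI term is a broader match for the
NMDC term, which is the direction described above.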
    For each biosample in the NMDC database, we record a number of important properties about the
biosample’s environment. We standardize this information in two ways. First, we utilize the
env_broad_scale, env_local_scale, and env_medium terms defined by the Genomic Standards
Consortium (GSC) MIxS (Minimum Information about any (x) Sequence) standard (note that MIxS is
in the process of migrating to LinkML). Roughly speaking, the env_medium defines the material
containing the microorganism, the env_local_scale defines the geographic feature from which the
material was taken, and the env_broad_scale defines the biosample’s biome. Second, we use terms
from the Environment Ontology (EnvO) to provide values for the aforementioned MIxS terms. For
example, the following JSON-formatted record from the NMDC database shows the material, geographic
feature, and biome of the biosample identified by the compact URI (i.e., CURIE) gold:Gb0115850:

        {
            "id": "gold:Gb0115850",
            "env_medium": { "has_raw_value": "ENVO:00005802" },      # bulk soil
            "env_local_scale": { "has_raw_value": "ENVO:00000291" }, # drainage basin
            "env_broad_scale": { "has_raw_value": "ENVO:00000446" }, # terrestrial biome
            ...
        }

    In other words, the microorganisms within this biosample were found in a portion of bulk soil taken
from a drainage basin in a terrestrial biome. Moreover, using EnvO’s terms to define the environmental
context of biosamples also permits us to leverage the ontology’s semantics. For instance, since bulk soil
is a kind of soil, and a drainage basin is a kind of (geographic) depression, we can make use of EnvO’s
hierarchy to find other kinds of soil (e.g., dry soil) that are found in other kinds of depressions (e.g., dry
lake).
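
    The kind of hierarchy-based lookup described here can be sketched in a few lines of Python.
This is a toy illustration, not a query against EnvO itself: the is_a table contains only the
four relations named in the text, keyed by label rather than by EnvO identifier, and the function
names are hypothetical.

```python
# Toy child -> parent table holding only the relations named in the text.
# Assumption: a real implementation would load the full EnvO hierarchy instead.
IS_A = {
    "bulk soil": "soil",
    "dry soil": "soil",
    "drainage basin": "depression",
    "dry lake": "depression",
}


def ancestors(term: str) -> list:
    """Walk the is_a chain upward and return the ancestors, nearest first."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result


def descendants(term: str) -> set:
    """Return every term that is directly or transitively a kind of `term`."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in IS_A.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def related_kinds(term: str) -> set:
    """Return the other terms that share the immediate parent of `term`."""
    parent = IS_A.get(term)
    if parent is None:
        return set()
    return descendants(parent) - {term}


print(related_kinds("bulk soil"))       # other kinds of soil
print(related_kinds("drainage basin"))  # other kinds of depression
```

    Starting from the terms recorded for a biosample, the same two lookups give the candidate
soils and depressions to search for in other biosample records.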

    Finally, in the NMDC, we track many aspects of data provenance. This is especially important for
computational workflows that produce files used for genomic analysis, such as metabolomics files. For
this, we make use of the PROV Ontology (PROV-O). In the following record, we use PROV’s
wasGeneratedBy predicate to specify that the file resulted from the activity identified as
nmdc:6fdeaf901c4c4c8fa19ec94696a2d03a:

        {
            "id": "nmdc:eadcd3b883c0da0f9d42a9fb1162ffcf",
            "name": "Froze_Core_2015_S2_0_10_7_Metab.csv",
            "description": "MetaMS GC-MS metabolomics output detail CSV file",
            "file_size_bytes": 565558,
            "md5_checksum": "eadcd3b883c0da0f9d42a9fb1162ffcf",
            "url": "https://nmdcdemo.emsl.pnnl.gov/metabolomics/results/Froze_Core_2015_S2_0_10_7_Metab.csv",
            "was_generated_by": "nmdc:6fdeaf901c4c4c8fa19ec94696a2d03a"
        }
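
    As a rough sketch of how such provenance links might be used, the Python below restates the
was_generated_by field as a PROV wasGeneratedBy statement and groups files by the activity that
produced them. The record is an abbreviated copy of the one above; the function names are
hypothetical and do not correspond to any NMDC API.

```python
from collections import defaultdict

# IRI of the PROV wasGeneratedBy predicate (W3C PROV namespace).
PROV_WAS_GENERATED_BY = "http://www.w3.org/ns/prov#wasGeneratedBy"

# Abbreviated copy of the file record (illustrative only).
DATA_OBJECTS = [
    {
        "id": "nmdc:eadcd3b883c0da0f9d42a9fb1162ffcf",
        "name": "Froze_Core_2015_S2_0_10_7_Metab.csv",
        "was_generated_by": "nmdc:6fdeaf901c4c4c8fa19ec94696a2d03a",
    },
]


def provenance_statements(records: list) -> list:
    """Return (file, predicate, activity) triples for records with provenance."""
    return [
        (record["id"], PROV_WAS_GENERATED_BY, record["was_generated_by"])
        for record in records
        if "was_generated_by" in record
    ]


def files_by_activity(records: list) -> dict:
    """Group file identifiers by the activity that generated them."""
    index = defaultdict(list)
    for record in records:
        activity = record.get("was_generated_by")
        if activity is not None:
            index[activity].append(record["id"])
    return dict(index)


statements = provenance_statements(DATA_OBJECTS)
by_activity = files_by_activity(DATA_OBJECTS)
print(statements)
print(by_activity)
```

    Indexing by activity makes it straightforward to list every file produced by a given
workflow run.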

2. Acknowledgements

   This work is supported by the Genomic Science Program in the U.S. Department of Energy, Office
of Science, Office of Biological and Environmental Research (BER) under contract numbers
DE-AC02-05CH11231 (LBNL), 89233218CNA000001 (LANL), DE-AC05-00OR22725 (ORNL), and
DE-AC05-76RL01830 (PNNL).