KG-Microbe: A Reference Knowledge-Graph and Platform for
Harmonized Microbial Information
Marcin P. Joachimiak 1, Harshad Hegde 1, William D. Duncan 1, Justin T. Reese 1, Luca
Cappelletti 2, Anne E. Thessen 3, Christopher J. Mungall 1
1 Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
2 University of Milan, Milan, Italy
3 Oregon State University, Beaverton, Oregon, USA


Abstract
   Microorganisms (microbes) are incredibly diverse, spanning all major divisions of life, and
represent the greatest fraction of known species. A vast amount of knowledge about microbes is
available in the literature, across experimental datasets, and in established data resources. While the
genomic and biochemical pathway data about microbes is well-structured and annotated using standard
ontologies, broader information about microbes and their ecological traits is not. We created the
KG-Microbe (github.com/Knowledge-Graph-Hub/kg-microbe) resource in order to extract and integrate
diverse knowledge about microbes from a variety of structured and unstructured sources. Initially, we
are harmonizing and linking prokaryotic data for phenotypic traits, taxonomy, functions, chemicals, and
environment descriptors, to construct a knowledge graph with over 266,000 entities linked by 432,000
relations. The effort is supported by a knowledge graph construction platform (KG-Hub) for rapid
development of knowledge graphs using available data, knowledge modeling principles, and software
tools. KG-Microbe is a microbe-centric Knowledge Graph (KG) to support tasks such as querying and
graph link prediction in many use cases including microbiology, biomedicine, and the environment.
KG-Microbe fulfills a need for standardized and linked microbial data, allowing the broader community
to contribute, query, and enrich analyses and algorithms.

Keywords
Knowledge graph, microbiology, ontology, graph learning, data standardization, data science,
semantic technology


1. Introduction

   Microbes are the most abundant and diverse life forms. They are also found in the greatest range of
environments and possess the largest metabolic and functional potential, which is only beginning to be
harnessed for biomedicine and biomanufacturing. A vast amount of knowledge about microbes is
available in the literature, across experimental datasets, and in established data resources. While the
genomic and biochemical pathway data about microbes is well-structured and annotated using standard
ontologies, broader information about microbes and their ecological traits is not.
   We draw inspiration from the biomedical domain, whose rich set of ontologies, controlled
vocabularies, and data schemas has been deployed in multiple biomedical knowledge resources.
Examples include data schemas such as OMOP [1], MeSH [2], and the Biolink Model [3], collections of
ontologies such as the OBO Foundry [4] and the NCBO BioPortal [5], and a growing ecosystem of graph
and Natural Language Processing (NLP) tools. These resources have helped drive the standardization
and interoperability of information across research domains. Together, these concepts and tools
provide a path to semantic harmonization of knowledge in other domains.

International Conference on Biomedical Ontologies 2021, September 16–18, 2021, Bozen-Bolzano, Italy
EMAIL: MJoachimiak@lbl.gov (A. 1); hhegde@lbl.gov (A. 2); wdduncan@lbl.gov (A. 3); JustinReese@lbl.gov (A. 4);
luca.cappelletti1@unimi.it (A. 5); thessena@oregonstate.edu (A. 6); CJMungall@lbl.gov (A. 7)
ORCID: 0000-0001-8175-045X (A. 1); 0000-0002-2411-565X (A. 2); 0000-0001-9625-1899 (A. 3); 0000-0002-2170-2250 (A. 4);
0000-0002-1269-2038 (A. 5); 0000-0002-2908-3327 (A. 6); 0000-0002-6601-2165 (A. 7)
© 2021 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Main

    The KG-Microbe knowledge graph (KG) resource was created to support extraction and
integration of diverse knowledge about microbes. Initially, our focus has been on harmonizing and
linking prokaryotic data for phenotypic traits, taxonomy, functions, chemicals, and environment
descriptors. Based on this information we constructed a knowledge graph, which in its current release
(9/2/21) contains over 266,000 entities linked by 432,000 relations, classified into 9 Biolink Model
entity categories and 31 Biolink Model relation categories. The KG-Microbe effort is supported by a
knowledge graph construction platform, KG-Hub [6], designed for rapid development and deployment
of knowledge graphs using available data, common knowledge modeling principles, and software tools.
A key concept of KG-Hub is to connect unstructured data to broader knowledge through links to
structured resources such as ontologies.
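
As a minimal sketch of how entity and relation counts like those above can be tallied, the following assumes the graph is available as two tab-separated tables: a node table with `id`, `category`, and `name` columns, and an edge table with `subject`, `predicate`, and `object` columns. This layout, the `EX:` identifiers, and the tiny inline sample are illustrative assumptions and are not taken from the KG-Microbe release.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature tables; identifiers, categories, and predicates
# are illustrative only and do not come from KG-Microbe.
NODES_TSV = (
    "id\tcategory\tname\n"
    "EX:taxon1\tbiolink:OrganismTaxon\texample bacterium A\n"
    "EX:taxon2\tbiolink:OrganismTaxon\texample bacterium B\n"
    "EX:env1\tbiolink:EnvironmentalFeature\tsoil\n"
    "EX:chem1\tbiolink:ChemicalEntity\tglucose\n"
)
EDGES_TSV = (
    "subject\tpredicate\tobject\n"
    "EX:taxon1\tbiolink:occurs_in\tEX:env1\n"
    "EX:taxon2\tbiolink:occurs_in\tEX:env1\n"
    "EX:taxon1\tbiolink:interacts_with\tEX:chem1\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def summarize(nodes, edges):
    """Count nodes per entity category and edges per relation category."""
    node_counts = Counter(row["category"] for row in nodes)
    edge_counts = Counter(row["predicate"] for row in edges)
    return node_counts, edge_counts


nodes = read_tsv(NODES_TSV)
edges = read_tsv(EDGES_TSV)
node_counts, edge_counts = summarize(nodes, edges)
print(len(nodes), "entities in", len(node_counts), "categories")
print(len(edges), "relations in", len(edge_counts), "categories")
```

The same two functions would apply unchanged to full-size tables read from disk, provided the column names match the assumed layout.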
    KG-Microbe (github.com/Knowledge-Graph-Hub/kg-microbe) is a microbe-centric KG that supports
tasks such as querying and graph link prediction in a variety of use cases, including microbiology,
biomedicine, and the environment. We use Named Entity Recognition (NER) and Natural Language
Processing (NLP) tools to identify, annotate, and normalize terms found in raw data. The harmonized
data in KG-Microbe provide rich, standardized labels for building, training, and evaluating machine
learning models. The resulting graph can answer questions such as which microbes are enriched in soil
environments. It can also be used to train models that predict microbial traits, and it can report
enriched features for a given set of taxa or taxon features. We demonstrate example applications of
KG-Microbe with predictive models for microbial shape and metabolism that use embeddings from
graph learning (Figure 1). Many other types of link prediction are possible with the available
KG-Microbe entity and relation categories, allowing prediction of data that are otherwise difficult to
obtain without resource-intensive field and laboratory experiments, such as metabolic characterization
or cell imaging. KG-Microbe fulfills a need for standardized and linked microbial data, allowing the
broader community to contribute to the resource as well as to enrich analyses and algorithms.
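
To make the querying and link prediction use cases concrete, here is a minimal sketch over a toy list of (subject, predicate, object) triples. The triples, the `EX:` identifiers, and both helper functions are hypothetical. The first helper is a plain lookup of taxa linked to an environment, not a statistical enrichment test. The second lists taxa that lack a shape annotation, which are the kind of missing links a trained model could be asked to predict.

```python
# Hypothetical triples for illustration; not taken from KG-Microbe.
TRIPLES = [
    ("EX:taxon1", "EX:found_in", "EX:soil"),
    ("EX:taxon2", "EX:found_in", "EX:soil"),
    ("EX:taxon3", "EX:found_in", "EX:marine_water"),
    ("EX:taxon1", "EX:has_shape", "EX:rod"),
    ("EX:taxon3", "EX:has_shape", "EX:coccus"),
]


def subjects_linked_to(triples, predicate, obj):
    """Return sorted subjects that have an edge (subject, predicate, obj)."""
    return sorted({s for s, p, o in triples if p == predicate and o == obj})


def subjects_missing(triples, predicate):
    """Return sorted subjects with no outgoing edge of the given predicate."""
    all_subjects = {s for s, _p, _o in triples}
    with_predicate = {s for s, p, _o in triples if p == predicate}
    return sorted(all_subjects - with_predicate)


# Query: which taxa are linked to the soil environment?
soil_taxa = subjects_linked_to(TRIPLES, "EX:found_in", "EX:soil")

# Link prediction targets: which taxa have no shape annotation yet?
unlabeled = subjects_missing(TRIPLES, "EX:has_shape")

print("soil taxa:", soil_taxa)
print("taxa without a shape edge:", unlabeled)
```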
Figure 1: Visualization of t-SNE [7] dimensionality reduction of KG-Microbe graph node (A) and edge
(B) embeddings. Each point corresponds to a node (A) or an edge (B) and is colored by its Biolink
Model category. Graph embeddings were generated with the embiggen package [8] using the SkipGram
method.
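
SkipGram node embeddings are commonly trained on (center, context) node pairs sampled from random walks over the graph. The sketch below illustrates only that pair-generation step on a toy undirected graph, under the assumption of uniform random walks. It does not use the embiggen API, the function names and `EX:` identifiers are hypothetical, and model training is omitted.

```python
import random

# Hypothetical edges for illustration; not taken from KG-Microbe.
EDGES = [
    ("EX:taxon1", "EX:found_in", "EX:soil"),
    ("EX:taxon2", "EX:found_in", "EX:soil"),
    ("EX:taxon1", "EX:has_shape", "EX:rod"),
]


def build_adjacency(edges):
    """Build an undirected adjacency list, ignoring edge predicates."""
    adjacency = {}
    for subject, _predicate, obj in edges:
        adjacency.setdefault(subject, []).append(obj)
        adjacency.setdefault(obj, []).append(subject)
    return adjacency


def random_walk(adjacency, start, length, rng):
    """Walk up to `length` nodes, choosing each next neighbor uniformly."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def skipgram_pairs(walk, window):
    """Pair each node with every node within `window` steps of it."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


adjacency = build_adjacency(EDGES)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
walk = random_walk(adjacency, "EX:taxon1", 5, rng)
pairs = skipgram_pairs(walk, window=2)
print(walk)
print(len(pairs), "training pairs")
```

A SkipGram model would then learn one vector per node by predicting the context node from the center node over many such pairs.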

3. Acknowledgements

  This work was supported by a grant from the Laboratory Directed Research and Development
(LDRD) Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy
Contract No. DE-AC02-05CH11231.

4. References
[1] E.A. Voss, R. Makadia, A. Matcho, Q. Ma, C. Knoll, M. Schuemie, F.J. DeFalco, A. Londhe, V.
    Zhu, and P.B. Ryan. 2015. Feasibility and Utility of Applications of the Common Data Model to
    Multiple, Disparate Observational Health Databases. Journal of the American Medical
    Informatics Association: JAMIA 22 (3): 553–64.
[2] P. Agarwal, and D.B. Searls. 2009. Can Literature Analysis Identify Innovation Drivers in Drug
    Discovery? Nature Reviews. Drug Discovery 8 (11): 865–78.
[3] C.J. Mungall, et al. 2021. Biolink Model. URL: https://github.com/biolink/biolink-model.
[4] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L.J. Goldberg, et al. 2007. The
    OBO Foundry: Coordinated Evolution of Ontologies to Support Biomedical Data Integration.
    Nature Biotechnology 25 (11): 1251–55.
[5] P.L. Whetzel, N.F. Noy, N.H. Shah, P.R. Alexander, C. Nyulas, T. Tudorache, and M.A. Musen.
    2011. BioPortal: Enhanced Functionality via New Web Services from the National Center for
    Biomedical Ontology to Access and Use Ontologies in Software Applications. Nucleic Acids
    Research 39 (Web Server issue): W541–45.
[6] C.J. Mungall, et al. 2021. KG-Hub: a knowledge graph hub. URL:
    https://knowledge-graph-hub.github.io/.
[7] L. van der Maaten and G. Hinton. 2008. Visualizing Data Using t-SNE. Journal of Machine
    Learning Research: JMLR 9 (Nov): 2579–2605.
[8] L. Cappelletti, T. Fontana, E. Casiraghi, V. Ravanmehr, T.J. Callahan, M.P. Joachimiak, C.J.
    Mungall, P.N. Robinson, J. Reese, and G. Valentini. 2021. GraPE: fast and scalable Graph
    Processing and Embedding. arXiv:2110.06196.