KG-Microbe: A Reference Knowledge-Graph and Platform for Harmonized Microbial Information

Marcin P. Joachimiak 1, Harshad Hegde 1, William D. Duncan 1, Justin T. Reese 1, Luca Cappelletti 2, Anne E. Thessen 3, Christopher J. Mungall 1

1 Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
2 University of Milan, Milan, Italy
3 Oregon State University, Beaverton, Oregon, USA

Abstract
Microorganisms (microbes) are incredibly diverse, spanning all major divisions of life, and represent the greatest fraction of known species. A vast amount of knowledge about microbes is available in the literature, across experimental datasets, and in established data resources. While the genomic and biochemical pathway data about microbes is well-structured and annotated using standard ontologies, broader information about microbes and their ecological traits is not. We created the KG-Microbe (github.com/Knowledge-Graph-Hub/kg-microbe) resource in order to extract and integrate diverse knowledge about microbes from a variety of structured and unstructured sources. Initially, we are harmonizing and linking prokaryotic data for phenotypic traits, taxonomy, functions, chemicals, and environment descriptors, to construct a knowledge graph with over 266,000 entities linked by 432,000 relations. The effort is supported by a knowledge graph construction platform (KG-Hub) for rapid development of knowledge graphs using available data, knowledge modeling principles, and software tools. KG-Microbe is a microbe-centric Knowledge Graph (KG) to support tasks such as querying and graph link prediction in many use cases including microbiology, biomedicine, and the environment. KG-Microbe fulfills a need for standardized and linked microbial data, allowing the broader community to contribute, query, and enrich analyses and algorithms.
Keywords
Knowledge graph, microbiology, ontology, graph learning, data standardization, data science, semantic technology

International Conference on Biomedical Ontologies 2021, September 16–18, 2021, Bozen-Bolzano, Italy
EMAIL: MJoachimiak@lbl.gov (A. 1); hhegde@lbl.gov (A. 2); wdduncan@lbl.gov (A. 3); JustinReese@lbl.gov (A. 4); luca.cappelletti1@unimi.it (A. 5); thessena@oregonstate.edu (A. 6); CJMungall@lbl.gov (A. 7)
ORCID: 0000-0001-8175-045X (A. 1); 0000-0002-2411-565X (A. 2); 0000-0001-9625-1899 (A. 3); 0000-0002-2170-2250 (A. 4); 0000-0002-1269-2038 (A. 5); 0000-0002-2908-3327 (A. 6); 0000-0002-6601-2165 (A. 7)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction
Not only are microbes the most abundant and diverse life forms, but they are also found in the greatest range of environments and possess the largest metabolic and functional potential, which is just beginning to be harnessed for biomedicine and biomanufacturing. A vast amount of knowledge about microbes is available in the literature, across experimental datasets, and in established data resources. While the genomic and biochemical pathway data about microbes is well-structured and annotated using standard ontologies, broader information about microbes and their ecological traits is not. We draw inspiration from the biomedical domain, which has a rich set of ontologies, controlled vocabularies, and data schemas that have been deployed in multiple biomedical knowledge resources. Examples include data schemas and controlled vocabularies such as OMOP [1], MeSH [2], and the Biolink Model [3], collections of ontologies such as the OBO Foundry [4] and the NCBO BioPortal [5], and a growing ecosystem of graph and Natural Language Processing (NLP) tools.
These resources have helped drive the standardization and interoperability of information across research domains. This set of concepts and tools provides a path to useful semantic harmonization of knowledge in other domains.

2. Main
The KG-Microbe knowledge graph (KG) resource was created to support extraction and integration of diverse knowledge about microbes. Initially, our focus has been on harmonizing and linking prokaryotic data for phenotypic traits, taxonomy, functions, chemicals, and environment descriptors. Based on this information we constructed a knowledge graph, which in its current release (9/2/21) contains over 266,000 entities linked by 432,000 relations, classified into 9 Biolink Model entity categories and 31 Biolink Model relation categories.

The KG-Microbe effort is supported by a knowledge graph construction platform, KG-Hub [6], designed for rapid development and deployment of knowledge graphs using available data, common knowledge modeling principles, and software tools. One key concept for KG-Hub is to connect unstructured data into broader knowledge using links to structured data such as ontologies. KG-Microbe (github.com/Knowledge-Graph-Hub/kg-microbe) is a microbe-centric KG that supports tasks such as querying and graph link prediction in a variety of use cases including microbiology, biomedicine, and the environment.

We use Named Entity Recognition (NER) and Natural Language Processing (NLP) tools to identify, annotate, and normalize terms found in raw data. The harmonized data contained in KG-Microbe provide rich and standardized labels for building, training, and evaluating machine learning models. The resulting graph can answer questions such as which microbes are enriched in soil environments. It can also be used to train models for various microbial trait predictions, and it can report enriched features for a given set of taxa or taxa features.
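
The kind of question described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list is toy data, the identifiers (`taxon:A`, `env:soil`, `trait:rod`) and predicates (`occurs_in`, `has_trait`) are hypothetical placeholders rather than KG-Microbe content, and the one-sided hypergeometric test is our choice of enrichment statistic, not necessarily the one used with KG-Microbe.

```python
from math import comb

# Toy edge list in (subject, predicate, object) form.
# All identifiers and predicates are hypothetical placeholders.
EDGES = [
    ("taxon:A", "occurs_in", "env:soil"),
    ("taxon:B", "occurs_in", "env:soil"),
    ("taxon:C", "occurs_in", "env:soil"),
    ("taxon:D", "occurs_in", "env:marine"),
    ("taxon:E", "occurs_in", "env:marine"),
    ("taxon:F", "occurs_in", "env:marine"),
    ("taxon:A", "has_trait", "trait:rod"),
    ("taxon:B", "has_trait", "trait:rod"),
    ("taxon:C", "has_trait", "trait:rod"),
    ("taxon:D", "has_trait", "trait:rod"),
    ("taxon:E", "has_trait", "trait:coccus"),
    ("taxon:F", "has_trait", "trait:coccus"),
]


def subjects_with(edges, predicate, obj):
    """Return all subjects linked to `obj` through `predicate`."""
    return {s for s, p, o in edges if p == predicate and o == obj}


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable X (sampling without replacement)."""
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(population, draws)


def trait_enrichment(edges, taxa_of_interest, all_taxa, trait):
    """Count taxa of interest carrying `trait` and test for over-representation."""
    with_trait = subjects_with(edges, "has_trait", trait) & all_taxa
    k = len(with_trait & taxa_of_interest)
    p_value = hypergeom_sf(k, len(all_taxa), len(with_trait), len(taxa_of_interest))
    return k, p_value


# Background: every taxon with a recorded environment.
all_taxa = {s for s, p, _ in EDGES if p == "occurs_in"}
# Query: taxa linked to the (hypothetical) soil environment node.
soil_taxa = subjects_with(EDGES, "occurs_in", "env:soil")
k, p_value = trait_enrichment(EDGES, soil_taxa, all_taxa, "trait:rod")
print(sorted(soil_taxa), k, p_value)
```

With only six taxa the test has little power; the point is the shape of the query (select a taxon set by one relation, then test a feature linked by another), which carries over to a full node and edge table.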

We demonstrate example applications of KG-Microbe with predictive models for microbial shape and metabolism, using embeddings from graph learning (Figure 1). Many other types of link prediction are possible based on the available KG-Microbe entity and relation categories, allowing prediction of data that are difficult to obtain without resource-intensive field and laboratory experiments, such as metabolic characterization or cell imaging. KG-Microbe fulfills a need for standardized and linked microbial data, allowing the broader community to contribute as well as enrich analyses and algorithms.

Figure 1: Visualization of t-SNE [7] dimensionality reduction of KG-Microbe graph node (A) and edge (B) embeddings. Each point corresponds to a node (A) or an edge (B) and is colored by its Biolink Model category. Graph embeddings were generated with the embiggen package [8], using the SkipGram method.

3. Acknowledgements
This work was supported by a grant from the Laboratory Directed Research and Development (LDRD) Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231.

4. References
[1] E.A. Voss, R. Makadia, A. Matcho, Q. Ma, C. Knoll, M. Schuemie, F.J. DeFalco, A. Londhe, V. Zhu, and P.B. Ryan. 2015. Feasibility and Utility of Applications of the Common Data Model to Multiple, Disparate Observational Health Databases. Journal of the American Medical Informatics Association: JAMIA 22 (3): 553–64.
[2] P. Agarwal and D.B. Searls. 2009. Can Literature Analysis Identify Innovation Drivers in Drug Discovery? Nature Reviews Drug Discovery 8 (11): 865–78.
[3] C.J. Mungall, et al. 2021. Biolink Model. URL: https://github.com/biolink/biolink-model.
[4] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L.J. Goldberg, et al. 2007. The OBO Foundry: Coordinated Evolution of Ontologies to Support Biomedical Data Integration. Nature Biotechnology 25 (11): 1251–55.
[5] P.L. Whetzel, N.F. Noy, N.H. Shah, P.R. Alexander, C. Nyulas, T. Tudorache, and M.A. Musen. 2011. BioPortal: Enhanced Functionality via New Web Services from the National Center for Biomedical Ontology to Access and Use Ontologies in Software Applications. Nucleic Acids Research 39 (Web Server issue): W541–45.
[6] C.J. Mungall, et al. 2021. KG-Hub: a knowledge graph hub. URL: https://knowledge-graph-hub.github.io/.
[7] L. van der Maaten and G. Hinton. 2008. Visualizing Data Using t-SNE. Journal of Machine Learning Research: JMLR 9 (Nov): 2579–2605.
[8] L. Cappelletti, T. Fontana, E. Casiraghi, V. Ravanmehr, T.J. Callahan, M.P. Joachimiak, C.J. Mungall, P.N. Robinson, J. Reese, and G. Valentini. 2021. GraPE: fast and scalable Graph Processing and Embedding. arXiv:2110.06196.