COVIDGraph: Connecting biomedical COVID-19 resources and computational biology models

Lea Gütebier. University Medicine Greifswald, Greifswald, Germany. lea.guetebier@stud.uni-greifswald.de
Ron Henkel. University Medicine Greifswald, Greifswald, Germany. ron.henkel@uni-greifswald.de
Alexander Jarasch. German Center for Diabetes Research, Munich, Germany. jarasch@dzd-ev.de
Tim Bleimehl. German Center for Diabetes Research, Munich, Germany. tim.bleimehl@helmholtz-muenchen.de
Sebastian Müller. yWorks, Tübingen, Germany. sebastian.mueller@yworks.com
Jamie Munro. Munro Consulting, London, UK. jamie@munro.consulting
Martin Preusse, and the HealthEcco Team. Kaiser & Preusse, Freiburg, Germany. martin@kaiser-preusse.com
Dagmar Waltemath. University Medicine Greifswald, Greifswald, Germany. dagmar.waltemath@uni-greifswald.de

ABSTRACT
The COVID-19 pandemic has changed life across the globe. In January 2020, little was known about SARS-CoV-2, but the rapidly increasing number of infections and the uncontrolled spread demanded fast medical action. Within a year, over 4 million publications relating to COVID-19 appeared in the scientific literature. Additionally, patents have been registered, ontologies have been extended, simulation studies for prediction of disease spread and underlying bioinformatics mechanisms have been built, and health studies have been designed. To support the exploration of COVID-19 data, the CovidGraph project was initiated as a non-profit, collaborative and open project driven by researchers, software developers, data scientists and medical professionals. In this article we outline the history, goals and scope of CovidGraph. Using the example of computational biology models, we show how additional resources can be integrated with the knowledge graph to extend the scope of the CovidGraph, for example, to systems biology data.

Reference Format:
Lea Gütebier, Ron Henkel, Alexander Jarasch, Tim Bleimehl, Sebastian Müller, Jamie Munro, Martin Preusse, and the HealthEcco Team, and Dagmar Waltemath. COVIDGraph: Connecting biomedical COVID-19 resources and computational biology models. In the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).

PVLDB Artifact Availability:
The source code, data, and/or other artefacts have been made available at https://github.com/covidgraph/documentation.

Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, Copenhagen, Denmark) on CEUR-WS.org.

1 INTRODUCTION
CovidGraph is a research and communication platform that encompasses publications, case statistics, genes and functions, molecular data and more. It is developed and maintained by HealthECCO, a non-profit collaboration of researchers, software developers, data scientists and medical professionals (https://healthecco.org/). Our aim is to help researchers quickly and efficiently find their way through COVID-19 datasets using tools that implement artificial intelligence methods, advanced visualisation techniques, and intuitive user interfaces.

Through CovidGraph, users can explore papers, patents, treatments and medications covering the family of coronaviruses. In addition to literature data, we connect information from biological entities (namely genes, proteins and their functions), spanning a large, densely connected knowledge network. The latest addition to the CovidGraph is a set of systems biology models (Fig. 1).

Figure 1: Overview: CovidGraph data sources with the integrated systems biology nodes (cyan box).
In recent years, NoSQL approaches such as key-value stores, BigTable, document databases, triple stores, or graph databases [1], together with semantic web applications, have become more popular within the life sciences. Graph databases offer a storage concept based on nodes, (directed) edges, properties and labels. Nodes can be labelled and are connected by edges, and both can contain properties. They also allow easy horizontal scaling and fast graph traversal. Finally, graph databases are schema optional, a feature that is much appreciated when storing heterogeneous, highly connected, cross-domain data items from different sources. The HealthECCO project integrates such heterogeneous resources and compiles a knowledge base targeted at COVID-19 data (https://healthecco.org/covidgraph/), and potentially other diseases in future versions. The underlying graph database is Neo4j [18].

2 DATA RESOURCES
Previous versions of the CovidGraph already integrated data from five categories (Fig. 2 (A)): Patents, Papers, BioMedical (ontologies and controlled vocabularies), Clinical Trials and Statistical & Geographic. Categories are cross-linked by relationships. For example, items from the "Papers" category are linked to items from the "Patents" category.

One paper source is the COVID-19 Open Research Dataset (CORD-19), a collection of research papers relating to COVID-19 (and coronaviruses) [24]. It is the main data source for information about papers in the CovidGraph and contains publications from PubMed, medRxiv and bioRxiv. Papers and related information are stored and linked in multiple nodes in the CovidGraph. Each paper node has author nodes connected to affiliation nodes that, in turn, are linked to location nodes. Papers can be linked to COVID-19 patents. The Lens (https://about.lens.org/covid-19/) provides datasets of patent documents and literature concerning human coronaviruses and COVID-19.

The CovidGraph furthermore contains information about clinical COVID-19 studies from the ClinicalTrials.gov registry. Studies are represented as clinical trial nodes, which are linked to multiple other nodes representing more detailed information about each study. Also included in the CovidGraph are case statistics and case data from Johns Hopkins University [7] and population estimates from the United Nations World Population Prospects (https://population.un.org/wpp/). Nodes include city, country, province, daily report and age group.

Biomedical data encodes information about genes, proteins, pathways and different diseases associated with COVID-19. The data comprises information from various biological and biomedical resources and is connected to Gene Ontology terms. The Gene Ontology is a resource for computational representation of the function of genes and gene products [4]. Information about genes from the NCBI Gene Database [2] is stored in Gene nodes, which are connected to other nodes describing the underlying biology. The connected nodes include gene symbols according to the Ensembl Genome Browser, a genome database [10]. The gene symbols are mapped to synonyms. Since genes are expressed in various tissues, the gene nodes are linked to Gtex Tissue nodes containing gene expression data from the GTEx Portal [14]. For genes that are part of a pathway, there is a relation between the corresponding gene node and pathway node. The data included in the CovidGraph describes which genes are members of a pathway according to the Reactome pathway knowledgebase, a database for molecular information about biological pathways [11]. As part of transcription and translation in humans, genes code for transcripts, which in turn code for proteins. In the CovidGraph these processes are described by relationships between gene nodes, transcript nodes and protein nodes. The data for the transcript nodes is taken from the NCBI Reference Sequence Database [17]; the Universal Protein Resource (UniProt) provides a resource of protein sequences and annotation data [5]. Proteins associated with annotation data from the Gene Ontology are linked to GO term nodes. The last node type connected to gene nodes is the disease node. Disease nodes are in turn associated with anatomy nodes. The corresponding data is provided by Hetionet, an integrative network of biomedical data including connections between diseases and anatomies [9].

The knowledge in the graph is primarily centred on coronaviruses but is steadily being extended to other connected diseases as part of the HealthECCO project. The latest addition to CovidGraph is a resource of computational biology models. We will introduce the systems biology node in detail in Section 4.

3 COVIDGRAPH FRAMEWORK
The CovidGraph infrastructure is built as a labelled property graph based on the Neo4j Enterprise edition v4.2. Textual information, such as publications, clinical studies or ontology term descriptions, is processed and enriched by a pipeline based on natural language processing and named entity recognition (BioBERT [13]). The graph currently contains 36 million nodes and 59 million relationships and is still growing, as the modular software framework encourages adding and integrating new data sources. On the server side, CovidGraph relies on Docker containers. To integrate a new data source, the source needs to be wrapped in a container and needs to provide information such as connection data and mapping information (https://github.com/covidgraph/data_template). An ETL process (https://git.connect.dzd-ev.de/dzdtools/motherlode) subsequently extracts the data from the new source, transforms the data in accordance with the provided mapping information, and loads the data into the main CovidGraph.

4 INTEGRATION OF SIMULATION STUDIES
Via the aforementioned ETL process, we connected the CovidGraph and the Management System for Models and Simulations (MaSyMoS, [8]). MaSyMoS is a Neo4j graph database for storing and retrieving data items describing biomedical simulation studies. The data is extracted from repositories for computational biology models (BioModels [15] and Physiome Model Repository 2 [25]) and integrated in a single graph (Fig. 2 (B)). We consider a computational biology model to be a mathematical model written in a formal machine-readable language, such that it can be systematically parsed and employed by simulation and analysis software without further human translation [12]. We consider a biomedical simulation study to be any calculation performed on a model that describes the evolution of the represented biological system, for instance over spatial and/or temporal dimensions [23]. MaSyMoS links simulation studies, their results and corresponding models. Curated simulation studies are furthermore annotated with meta-data, primarily reference publications and ontological terms from bio-ontologies [4–6, 11]. MaSyMoS provides access to over 1000 manually curated simulation studies originally published in BioModels. This set contains highly curated studies targeting COVID-19 disease and spreading (https://www.ebi.ac.uk/biomodels/covid-19). The resulting knowledge graph offers domain-specific retrieval and similarity measures, and it enables efficient access and reuse. As all models have been shown to reproduce the published results, they are a valuable resource for biomedical investigations.

Figure 2: (A) Original CovidGraph data model with data from i) Patents, ii) an index for biomedical terms (BioBERT [13]), iii) BioMedical Ontologies [2, 4, 5, 9–11, 17, 21, 22], iv) COVID-19 related papers [3, 24], v) Clinical Trials [26], and vi) Statistical & Geographic information [7, 16]. (B) A simplified MaSyMoS [8] meta graph containing i) simulation models formerly encoded in SBML and CellML (not shown) [20], ii) simulation descriptions formerly encoded in SED-ML [20], iii) bio-ontologies encoded in OWL, and iv) links to publications in PubMed.

The integration of MaSyMoS data with CovidGraph was twofold: first, we matched papers (publications) from both domains. Then we connected biomedical ontology terms from both resources, thereby linking disease knowledge and biomedical simulation studies. The Paper data set (cf. Fig. 2 (A)) in CovidGraph is represented by different nodes (e.g., the abstract, authors, paper ID). In MaSyMoS a paper is represented by a single publication node containing the same set of information about a publication. Consequently, we mapped the corresponding IDs (PubMedID and DOI) from CovidGraph paper ID nodes and MaSyMoS publication nodes, thus connecting relevant publications from both data sets. This mapping resulted in 19 connections. This result is within the range we expected, as the underlying publication corpora cover different areas of interest (e.g. cell cycle, MAPK and apoptosis for simulation models; clinical trials, respiratory studies and diseases for CovidGraph).

The BioMedical data set in the CovidGraph represents different ontologies with relevance for COVID-19 research. These ontologies have possible connections and overlap with ontological terms used to annotate simulation studies in MaSyMoS (cf. Figure 2 (B)). Our analyses showed that most overlap can be observed in gene information, chemical entities, proteins and diseases. Consequently, we mapped ontological terms in MaSyMoS and CovidGraph for Gene Ontology (1810 connections), ChEBI (1211 connections), UniProt (911 connections) and Disease Ontology (72 connections) by their IDs (cf. Figure 1). For Gene Ontology, ChEBI and Disease Ontology, more than 94% of the terms stored in MaSyMoS were connected to terms in the CovidGraph. The UniProt coverage reached 41%.

Example: COVID-19 spread in Wuhan city. The simulation study by Roda et al. [19] investigates the COVID-19 spread in Wuhan city at the beginning of 2020. Figure 3 shows a Neo4j excerpt of the model in MaSyMoS and the association to disease information in the CovidGraph. The association is built from a matching reference publication and a matching ontology entry from the Disease Ontology. More specifically, the model is linked (in the middle, dark green) to several resources (pink). For example, one annotation refers to an ontology term from the Disease Ontology and is associated with the corresponding entry in the CovidGraph (on the right, brown). Another example is the reference publication, which links to the corresponding publication in the CovidGraph (on the right, blue). We consider this example a first step towards bridging the gap between medical research and systems biology.

5 TAKEAWAYS & FUTURE WORK
The CovidGraph project integrates COVID-related data from heterogeneous data sources, mainly from the medical and health domains, into a single knowledge graph. We demonstrate that even for fairly distinct scientific domains such as computational biology modeling and clinical research, it is possible to link knowledge graphs and thereby quickly provide new data sources. The presented version of CovidGraph provides a tool set and a single access point to previously disconnected data sources. Biomedical and clinical scientists can explore a rich set of data items, which are not connected in any other resource. CovidGraph is only one example of rapid integration of knowledge. The HealthECCO infrastructure offers solutions for integration and exploration of other diseases, building on the same integration workflow showcased in this paper.

[Figure 3, Neo4j graph excerpt. Legible node labels include "Roda2020 - SIR model of COVID-…", "COVID-19", "Wuhan" and the paper ID 32289100; legible relationship types include MASYMOS_HAS_MODEL, MASYMOS_BELONGS_TO, PAPER_HAS_PAPERID, MASYMOS_RESOURCE_DESCRIBES_PAPERID and MASYMOS_DOID_DESCRIBES_….]
Figure 3: Simulation study by Roda et al. [19] represented in MaSyMoS (model in light blue) with links to CovidGraph.

The CovidGraph team hopes to motivate other data providers to link up with our resource, and we would also like to discuss the applicability of our graph database infrastructure to existing data silos.

ACKNOWLEDGMENTS
The work presented here is the result of the HealthEcco Team (https://healthecco.org/team/). The COVID-19 collection in BioModels was built with the help of EOSC COVID-19 Fast Track funding.

REFERENCES
[1] Renzo Angles and Claudio Gutierrez. 2008. Survey of graph database models. ACM Computing Surveys (CSUR) 40, 1 (2008), 1.
[2] Garth R Brown, Vichet Hem, Kenneth S Katz, Michael Ovetsky, Craig Wallin, Olga Ermolaeva, Igor Tolstoy, Tatiana Tatusova, Kim D Pruitt, Donna R Maglott, et al. 2015. Gene: a gene-centered information resource at NCBI. Nucleic acids research 43, D1 (2015), D36–D42. https://doi.org/10.1093/nar/gku1055
[3] Kathi Canese and Sarah Weis. 2013. PubMed: the bibliographic database. The NCBI Handbook 2 (2013), 1.
[4] The Gene Ontology Consortium. 2021. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Research 49, D1 (2021), D325–D334. https://doi.org/10.1093/nar/gkaa1113
[5] UniProt Consortium. 2019. UniProt: a worldwide hub of protein knowledge. Nucleic acids research 47, D1 (2019), D506–D515. https://doi.org/10.1093/nar/gky1049
[6] Paula de Matos, Adriano Dekker, Marcus Ennis, Janna Hastings, Kenneth Haug, Steve Turner, and Christoph Steinbeck. 2010. ChEBI: a chemistry ontology and database. Journal of cheminformatics 2, 1 (2010), 1–1.
[7] Ensheng Dong, Hongru Du, and Lauren Gardner. 2020. An interactive web-based dashboard to track COVID-19 in real time. The Lancet infectious diseases 20, 5 (2020), 533–534. https://doi.org/10.1016/S1473-3099(20)30120-1
[8] Ron Henkel, Olaf Wolkenhauer, and Dagmar Waltemath. 2015. Combining computational models, semantic annotations and simulation experiments in a graph database. Database 2015 (2015), bau130.
[9] Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini. 2017. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife 6 (Sept. 2017), e26726. https://doi.org/10.7554/elife.26726
[10] Tim Hubbard, Daniel Barker, Ewan Birney, Graham Cameron, Yuan Chen, L Clark, Tony Cox, J Cuff, Val Curwen, Thomas Down, et al. 2002. The Ensembl genome database project. Nucleic acids research 30, 1 (2002), 38–41. https://doi.org/10.1093/nar/30.1.38
[11] Bijay Jassal, Lisa Matthews, Guilherme Viteri, Chuqiao Gong, Pascual Lorente, Antonio Fabregat, Konstantinos Sidiropoulos, Justin Cook, Marc Gillespie, Robin Haw, et al. 2020. The reactome pathway knowledgebase. Nucleic acids research 48, D1 (2020), D498–D503. https://doi.org/10.1093/nar/gkz1031
[12] Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, et al. 2005. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature biotechnology 23, 12 (2005), 1509–1515.
[13] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
[14] John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. 2013. The genotype-tissue expression (GTEx) project. Nature genetics 45, 6 (2013), 580–585. https://doi.org/10.1038/ng.2653
[15] Rahuman S Malik-Sheriff, Mihai Glont, Tung VN Nguyen, Krishna Tiwari, Matthew G Roberts, Ashley Xavier, Manh T Vu, Jinghao Men, Matthieu Maire, Sarubini Kananathan, et al. 2020. BioModels – 15 years of sharing computational models in life science. Nucleic acids research 48, D1 (2020), D407–D415.
[16] United Nations. 2019. World population prospects 2019: highlights.
[17] Kim D Pruitt, Tatiana Tatusova, and Donna R Maglott. 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 35, suppl_1 (2007), D61–D65. https://doi.org/10.1093/nar/gki025
[18] Ian Robinson, Jim Webber, and Emil Eifrem. 2013. Graph Databases. O’Reilly Media, CA, USA.
[19] Weston C Roda, Marie B Varughese, Donglin Han, and Michael Y Li. 2020. Why is it difficult to accurately predict the COVID-19 epidemic? Infectious Disease Modelling 5 (2020), 271–281.
[20] Falk Schreiber, Björn Sommer, Tobias Czauderna, Martin Golebiewski, Thomas E Gorochowski, Michael Hucka, Sarah M Keating, Matthias König, Chris Myers, David Nickerson, et al. 2020. Specifications of standards in systems and synthetic biology: status and developments in 2020. Journal of integrative bioinformatics 17, 2-3 (2020).
[21] Lynn Marie Schriml, Cesar Arze, Suvarna Nadendla, Yu-Wei Wayne Chang, Mark Mazaitis, Victor Felix, Gang Feng, and Warren Alden Kibbe. 2012. Disease Ontology: a backbone for disease semantic integration. Nucleic acids research 40, D1 (2012), D940–D946.
[22] The GTEx Portal. 2020. GTEx Portal Documentation. https://gtexportal.org/home/documentationPage. Online, accessed 12 October 2020.
[23] Dagmar Waltemath, Richard Adams, Daniel A Beard, Frank T Bergmann, Upinder S Bhalla, Randall Britten, Vijayalakshmi Chelliah, Michael T Cooling, Jonathan Cooper, Edmund J Crampin, et al. 2011. Minimum information about a simulation experiment (MIASE). PLoS computational biology 7, 4 (2011), e1001122.
[24] Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, et al. 2020. Cord-19: The covid-19 open research dataset. arXiv preprint arXiv:2004.10706v2 (2020).
[25] Tommy Yu, Catherine M Lloyd, David P Nickerson, Michael T Cooling, Andrew K Miller, Alan Garny, Jonna R Terkildsen, James Lawson, Randall D Britten, Peter J Hunter, et al. 2011. The physiome model repository 2. Bioinformatics 27, 5 (2011), 743–744.
[26] Deborah A Zarin, Tony Tse, Rebecca J Williams, Robert M Califf, and Nicholas C Ide. 2011. The ClinicalTrials.gov results database - update and key issues. New England Journal of Medicine 364, 9 (2011), 852–860.
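
As a rough illustration of the ID-based linking described in Section 4, the following Python sketch matches publication records from two sources by PubMed ID or DOI and computes the share of annotation terms from one source that find a counterpart in the other. It is not the project's ETL code: the Publication record, the function names and all sample IDs are hypothetical, and the normalisation rules (trimming whitespace, lower-casing DOIs, stripping resolver prefixes) are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple

# Assumed set of common DOI prefixes; not taken from the paper.
_DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip one leading resolver or 'doi:' prefix."""
    if not doi:
        return None
    value = doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if value.startswith(prefix):
            value = value[len(prefix):]
            break
    return value or None


def normalize_pmid(pmid: Optional[str]) -> Optional[str]:
    """Return the PubMed ID as a trimmed string, or None if it is empty."""
    if pmid is None:
        return None
    value = str(pmid).strip()
    return value or None


@dataclass(frozen=True)
class Publication:
    """Hypothetical flat view of a paper node with its identifiers."""
    node_id: str
    pmid: Optional[str] = None
    doi: Optional[str] = None


def match_publications(
    covidgraph: Iterable[Publication], masymos: Iterable[Publication]
) -> Set[Tuple[str, str]]:
    """Pair node IDs that share a PubMed ID or, failing that, a DOI."""
    by_pmid, by_doi = {}, {}
    for pub in covidgraph:
        pmid, doi = normalize_pmid(pub.pmid), normalize_doi(pub.doi)
        if pmid:
            by_pmid.setdefault(pmid, pub.node_id)
        if doi:
            by_doi.setdefault(doi, pub.node_id)

    links = set()
    for pub in masymos:
        pmid, doi = normalize_pmid(pub.pmid), normalize_doi(pub.doi)
        target = by_pmid.get(pmid) if pmid else None
        if target is None and doi:
            target = by_doi.get(doi)
        if target is not None:
            links.add((target, pub.node_id))
    return links


def coverage(source_terms: Iterable[str], target_terms: Iterable[str]) -> float:
    """Fraction of distinct source term IDs that also occur in the target."""
    source = set(source_terms)
    if not source:
        return 0.0
    return len(source & set(target_terms)) / len(source)


if __name__ == "__main__":
    # Toy records with made-up identifiers.
    covidgraph_papers = [
        Publication("cg-1", pmid="111", doi="10.1000/ABC"),
        Publication("cg-2", doi="10.1000/xyz"),
    ]
    masymos_publications = [
        Publication("ms-1", pmid=" 111 "),
        Publication("ms-2", doi="https://doi.org/10.1000/XYZ"),
        Publication("ms-3", pmid="999"),
    ]
    print(sorted(match_publications(covidgraph_papers, masymos_publications)))
    print(coverage(["t1", "t2", "t3", "t4"], ["t1", "t2", "t9"]))
```

In a graph database setting, each returned pair would correspond to one relationship created between a CovidGraph node and a MaSyMoS node; trying the PubMed ID first and the DOI second is a design choice of this sketch, not something the text prescribes.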