Using SPARQL to unify queries over data, ontologies, and machine learning models in the PhenomeBrowser knowledgebase Ali Syed1[0000−0002−5103−9058] , Şenay Kafkas1[0000−0001−7509−5786] , Maxat Kulmanov1[0000−0003−1710−1820] , and Robert Hoehndorf1[0000−0001−8149−5890] 1 Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Division, King Abdullah University of Science and Technology, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia {ali.syed, senay.kafkas, maxat.kulmanov, robert.hoehndorf}@kaust.edu.sa Abstract. We have developed the PhenomeBrowser knowledge base to integrate phenotype associations from a variety of sources into a single knowledge base. We use SPARQL as a unifying query language to ac- cess RDF data, perform Description Logic queries over ontologies, and compute the semantic similarity between entities in the knowledge base. Keywords: Phenotype · Semantic Similarity · SPARQL 1 Introduction The similarity between phenotypes associated with entities studied in the life sciences can be used to reveal interactions between biomedical entities at a molecular level [12]. Entities that are similar phenotypically are often related to each other on a molecular level as well [15], and this principle can be used to suggest or discover novel relations between these entities. There are several databases that have been developed for integrating phenotypes and exploring relations between them such as Online Mendelian Inheritance in Men (OMIM) [5] and ClinVar [9], as well as integrated databases such as Monarch [10]. Key challenges in integrating and exploring phenotype data is the use of integrated phenotype vocabularies or ontologies that can systematically relate phenotype classes between different contexts such as the entity studied or the species in which phenotypes are observed (human, model organism, or non-model organ- ism) [4]; the computation of semantic (phenotypic) similarity or relatedness [11]; and the ability to query phenotype-related information using a uniform and (ideally) standardized query language. We developed the PhenomeBrowser knowledgebase as a semantic framework that combines an RDF-based knowledge base of phenotype associations col- lected from community resources and from in-house curation with the ability to perform Description Logic queries over phenotype (and other) ontologies as well as to perform some basic operations on a type of machine learning model. Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 A. Syed et al. The framework used to develop PhenomeBrowser is based on using SPARQL as query language for any structured data and Apache Lucene indices and queries (implemented in the form of ElasticSearch) for natural language information. PhenomeBrowser currently contains over four million phenotype associations for genes, diseases, drugs, pathogens and chemical entities (metabolites). We in- corporate the Vec2SPARQL method [8] over the Bio2Vec knowledge graph em- bedding repository (https://bio2vec.cbrc.kaust.edu.sa) to perform queries incorporating semantic similarity and machine learning, and we rely on the AberOWL services [7] to perform Description Logic Queries within SPARQL queries. The interface for PhenomeBrowser is based on these SPARQL queries and we provide access to PhenomeBrowser through SPARQL. The Phenome- browser web portal further implements queries for specific tasks such as finding gene–disease associaitons, host–pathogen or drug–target interactions, all based on phenotypic similarity. The PhenomeBrowser software and underlying compo- nents are available as Free Software (phenomebrowser.net) and can serve as an initial model on how to combine graph databases, Description Logic queries, and machine learning within a single framework unified through SPARQL as query language. 2 Design and Implementation Fig. 1: Core Components of the PhenomeBrowser framework. Core components of the PhenomeBrowser architecture are shown in Figure 1. Phenotype annotations data from community resources and in-house curation is transformed into RDF format. The transformed data is subsequently stored in an RDF store. We implemented a data intake workflow using snakemake [1] to achieve reproducible and robust automation. As data model for phenotype associations, we rely on community standards for phenotypes developed by the OBO Foundry initiative [14] and the Monarch project [10]. We use the Dublin PhenomeBrowser 3 Core vocabularies [2] to encode provenance information, and the OBO Relation Ontology [13] to encode relations. We use the integrated phenotype ontology PhenomeNet [6] to integrate phenotypes across different contexts. The data in- take workflows also generate text indices for entity as well as classes and relations from the PhenomeNet ontology. Text indices are Apache Lucene indices imple- mented in ElasticSearch, and we make search of text indices available through a REST API that is complementary to the SPARQL endpoint for querying struc- tured data. Data that is ingested from public sources is passed to the ontology-based ma- chine learning method DL2Vec [3] to generate embeddings for entities (and ontol- ogy classes) that can be used to compute similarity. The embeddings are added to the Bio2Vec repository which stores the embeddings and makes them avail- able for similarity-based queries through a REST API and a special SPARQL endpoint implementing the Vec2SPARQL extensions [8]. When querying data, we use the AberOWL [7] SPARQL endpoint to exe- cute queries that incorporate deductive inference over Semantic Web ontologies. AberOWL is an ontology repository that provides reasoning over ontologies as a service. The queries further federate to the Vec2SPARQL endpoints provided by Bio2Vec, and therefore combine querying RDF phenotype data, phenotype ontologies (through AberOWL), and semantic similarity (through Bio2Vec). 3 Querying using SPARQL One application of computing phenotype similarity is ranking candidate genes for a disease [15]. Using PhenomeBrowser’s integrated SPARQL endpoint, we can perform this operation through SPARQL and therefore suggest gene–disease associations. Figure 2 shows a query for finding genes that are phenotypically similar to ventricular septal defect (HP:0011623). In the first section of the query, the content of the FILTER block performs a Description Logic Query to retrieve all classes that are equivalent to or subclasses of the ventricular septal defect phenotype from the HPO; the query is performed using the AberOWL ontology repository and reasoning service which expands the query and replaces it with the URIs of the classes resulting from the query. Subclasses of ventricular septal defect in the HPO include Tetralogy of Fallot as well as several more specific forms of Tetralogy of Fallot, and also includes ventricular septal defect itself (as the query is reflexive). The second section of the query contains a federated query to the Bio2Vec SPARQL endpoint and uses the mostSimilar function; the mostSimilar func- tion is implemented by the Vec2SPARQL method and executes the phenotypic similarity search for the diseases selected in the first section of the query on the Bio2Vec SPARQL endpoint. The mostSimilar function is a custom SPARQL function that takes as arguments the dataset identify in Bio2Vec, the identifier for the entity within the dataset, the number of entities to retrieve (in our case, we retrieve the three most similar entities to our query), and the type (using rdf:type) of the entity (in our case, the entity type is gene). In the third sec- 4 A. Syed et al. PREFIX rdf: PREFIX pb: PREFIX b2v: PREFIX b2vd: > SELECT ?simGene ?simGeneLabel ?genePhenotype ?genePhenotypeLabel WHERE { { SELECT ?disease FROM WHERE { ?association rdf:type rdf:Statement . ?association rdf:object ?phenotype . FILTER ( ?phenotype in ( OWL subeq { ’ventricular septal defect’ } ) ) . ?association rdf:subject ?disease . ?disease rdf:type pb:Disease . } LIMIT 20 } . SERVICE { (?simGene ?val ?x ?y) b2v:mostSimilar(b2vd:dataset_4 ?disease 3 pb:Gene) . } GRAPH { ?simGene rdfs:label ?simGeneLabel . ?geneAssociation rdf:subject ?simGene . ?geneAssociation rdf:object ?genePhenotype . ?genePhenotype rdfs:label ?genePhenotypeLabel . } } ORDER BY asc(?simGeneLabel) Fig. 2: SPARQL query finding genes that are phenotypically similar to ventric- ular septal defect. PhenomeBrowser 5 tion of the query, we add labels to genes found in the second section of the query and their associated phenotypes. 4 Conclusion We developed the PhenomeBrowser knowledgebase as a semantic framework that integrates querying over graph databases, ontologies, and knowledge graph embeddings, using SPARQL as a unifying and standardized query language. PhenomeBrowser is accessible at http://phenomebrowser.net. Acknowledgements We acknowledge use of the resources of the KAUST Supercomputing Core Lab- oratories. Funding This work was supported by funding from King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/3790-01-01, URF/1/4355-01-01, FCC/1/1976-08-01, and FCC/1/1976- 08-08. References 1. Snakemake. 2. Dcmi metadata terms, Jan 2020. 3. Jun Chen, Azza Althagafi, and Robert Hoehndorf. Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinformatics, 37(6):853–860, 10 2020. 4. Georgios V Gkoutos, Paul N Schofield, and Robert Hoehndorf. The anatomy of phenotype ontologies: principles, properties and applications. Briefings in Bioin- formatics, 19(5):1008–1021, April 2017. 5. Ada Hamosh, Alan F. Scott, Joanna Amberger, David Valle, and Victor A. McKu- sick. Online mendelian inheritance in man (omim). Human Mutation, 15(1):57–61, 2000. 6. Robert Hoehndorf et al. Phenomenet: a whole-phenome approach to disease gene discovery. Nucleic Acids Research, 39(18):e119, 2011. 7. Robert Hoehndorf, Luke Slater, Paul N Schofield, and Georgios V Gkoutos. Aber- OWL: a framework for ontology-based data access in biology. BMC Bioinformatics, 16:26, 2015. 8. Maxat Kulmanov, Senay Kafkas, Andreas Karwath, Alexander Malic, Georgios V Gkoutos, Michel Dumontier, and Robert Hoehndorf. Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings. bioRxiv, 2018. 6 A. Syed et al. 9. Melissa J. Landrum, Jennifer M. Lee, George R. Riley, Wonhee Jang, Wendy S. Rubinstein, Deanna M. Church, and Donna R. Maglott. Clinvar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Research, 2013. 10. Christopher J. Mungall, Julie A. McMurry, Sebastian Köhler, James P. Bal- hoff, Charles Borromeo, Matthew Brush, Seth Carbon, Tom Conlin, Nathan Dunn, Mark Engelstad, Erin Foster, J. P. Gourdine, Julius O. B. Jacobsen, Dan Keith, Bryan Laraway, Suzanna E. Lewis, Jeremy NguyenXuan, Kent Shefchek, Nicole Vasilevsky, Zhou Yuan, Nicole Washington, Harry Hochheiser, Tudor Groza, Damian Smedley, Peter N. Robinson, and Melissa A. Haendel. The monarch initia- tive: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic acids research, 45(D1):D712–D722, Jan 2017. 11. Catia Pesquita, Daniel Faria, André O. Falcão, Phillip Lord, and Francisco M. Couto. Semantic similarity in biomedical ontologies. PLoS Computational Biology, 5(7):e1000443, July 2009. 12. Slavé Petrovski and David B. Goldstein. Phenomics and the interpretation of personal genomes. Science Translational Medicine, 6(254):254fs35–254fs35, 2014. 13. B. Smith, W. Ceusters, B. Klagges, J. Köhler, A. Kumar, J. Lomax, C. Mungall, F. Neuhaus, A. L. Rector, and C. Rosse. Relations in biomedical ontologies. Genome Biol, 6(5):R46, 2005. 14. Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J. Mungall, Neocles Leontis, Philippe R. Serra, Alan Ruttenberg, Susanna A. San- sone, Richard H. Scheuermann, Nigam Shah, Patricia L. Whetzel, and Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomed- ical data integration. Nat Biotech, 25(11):1251–1255, 2007. 15. Nicole L. Washington, Melissa A. Haendel, Christopher J. Mungall, Michael Ash- burner, Monte Westerfield, and Suzanna E. Lewis. Linking human diseases to an- imal models using ontology-based phenotype annotation. PLOS Biology, 7(11):1– 20, 11 2009.