<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using SPARQL to unify queries over data, ontologies, and machine learning models in the PhenomeBrowser knowledgebase</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences &amp; Engineering (CEMSE) Division, King Abdullah University of Science and Technology, 4700 King Abdullah University of Science and Technology</institution>
          ,
          <addr-line>Thuwal 23955-6900</addr-line>
          ,
          <country country="SA">Saudi Arabia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>We have developed the PhenomeBrowser knowledge base to integrate phenotype associations from a variety of sources into a single knowledge base. We use SPARQL as a unifying query language to access RDF data, perform Description Logic queries over ontologies, and compute the semantic similarity between entities in the knowledge base.</p>
      </abstract>
      <kwd-group>
        <kwd>Phenotype</kwd>
        <kwd>Semantic Similarity</kwd>
        <kwd>SPARQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The similarity between phenotypes associated with entities studied in the life
sciences can be used to reveal interactions between biomedical entities at a
molecular level [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Entities that are similar phenotypically are often related
to each other on a molecular level as well [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and this principle can be used
to suggest or discover novel relations between these entities. There are several
databases that have been developed for integrating phenotypes and exploring
relations between them such as Online Mendelian Inheritance in Man (OMIM)
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and ClinVar [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], as well as integrated databases such as Monarch [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Key
challenges in integrating and exploring phenotype data are the use of integrated
phenotype vocabularies or ontologies that can systematically relate phenotype
classes between different contexts, such as the entity studied or the species in
which phenotypes are observed (human, model organism, or non-model
organism) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; the computation of semantic (phenotypic) similarity or relatedness [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ];
and the ability to query phenotype-related information using a uniform and
(ideally) standardized query language.
      </p>
      <p>We developed the PhenomeBrowser knowledgebase as a semantic framework
that combines an RDF-based knowledge base of phenotype associations
collected from community resources and from in-house curation with the ability
to perform Description Logic queries over phenotype (and other) ontologies as
well as to perform basic operations, such as similarity search, on machine
learning models in the form of knowledge graph embeddings.</p>
      <p>Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>The framework used to develop PhenomeBrowser uses SPARQL as the
query language for all structured data, and Apache Lucene indices and queries
(implemented in the form of ElasticSearch) for natural language information.</p>
      <p>
        PhenomeBrowser currently contains over four million phenotype associations
for genes, diseases, drugs, pathogens and chemical entities (metabolites). We
incorporate the Vec2SPARQL method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] over the Bio2Vec knowledge graph
embedding repository (https://bio2vec.cbrc.kaust.edu.sa) to perform queries
incorporating semantic similarity and machine learning, and we rely on the
AberOWL services [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to perform Description Logic Queries within SPARQL
queries. The interface for PhenomeBrowser is based on these SPARQL queries
and we provide access to PhenomeBrowser through SPARQL. The
PhenomeBrowser web portal further implements queries for specific tasks such as finding
gene–disease associations, host–pathogen or drug–target interactions, all based
on phenotypic similarity. The PhenomeBrowser software and underlying
components are available as Free Software (phenomebrowser.net) and can serve as an
initial model of how to combine graph databases, Description Logic queries, and
machine learning within a single framework unified through SPARQL as the query
language.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Design and Implementation</title>
      <p>
        Core components of the PhenomeBrowser architecture are shown in Figure
1. Phenotype annotation data from community resources and in-house curation
is transformed into RDF format. The transformed data is subsequently stored
in an RDF store. We implemented a data intake workflow using Snakemake [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
to achieve reproducible and robust automation. As the data model for phenotype
associations, we rely on community standards for phenotypes developed by the
OBO Foundry initiative [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and the Monarch project [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We use the Dublin
Core vocabularies [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to encode provenance information, and the OBO Relation
Ontology [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to encode relations. We use the integrated phenotype ontology
PhenomeNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to integrate phenotypes across different contexts. The data
intake workflows also generate text indices for entities as well as for classes and relations
from the PhenomeNet ontology. Text indices are Apache Lucene indices
implemented in ElasticSearch, and we make search over the text indices available through a
REST API that is complementary to the SPARQL endpoint for querying
structured data.
      </p>
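      <p>As a rough illustration of the RDF representation, the query in Figure 2 matches each phenotype association as a reified rdf:Statement whose rdf:subject is the annotated entity and whose rdf:object is the phenotype class. The following Python sketch serializes one such association as N-Triples. The association and disease IRIs and the helper name association_triples are hypothetical and chosen only for illustration; the actual PhenomeBrowser workflow and IRI scheme may differ.</p>
      <preformat><![CDATA[
```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PB = "http://phenomebrowser.net/"  # namespace used for the pb: prefix in Figure 2


def association_triples(assoc_iri, subject_iri, phenotype_iri):
    """Return N-Triples lines for one reified phenotype association."""
    triples = [
        (assoc_iri, RDF + "type", RDF + "Statement"),
        (assoc_iri, RDF + "subject", subject_iri),
        (assoc_iri, RDF + "object", phenotype_iri),
    ]
    return ["<{}> <{}> <{}> .".format(s, p, o) for s, p, o in triples]


# Hypothetical association and disease IRIs; the phenotype IRI uses the
# OBO PURL pattern for the HPO identifier quoted in the text.
example = association_triples(
    PB + "association/example-1",
    PB + "disease/example-disease",
    "http://purl.obolibrary.org/obo/HP_0011623",
)

for line in example:
    print(line)
```
]]></preformat>
      <p>Provenance (Dublin Core) and relation (OBO Relation Ontology) statements would be attached to the same association node; they are omitted here because the text does not name the specific properties used.</p>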
      <p>
        Data that is ingested from public sources is passed to the ontology-based
machine learning method DL2Vec [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to generate embeddings for entities (and
ontology classes) that can be used to compute similarity. The embeddings are added
to the Bio2Vec repository which stores the embeddings and makes them
available for similarity-based queries through a REST API and a special SPARQL
endpoint implementing the Vec2SPARQL extensions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
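      <p>Similarity search over embeddings can be sketched as a nearest-neighbour ranking. The Python example below assumes cosine similarity and a toy two-dimensional embedding table; the similarity measure, the identifiers, and the function names are assumptions made for illustration and are not taken from the Bio2Vec implementation.</p>
      <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(embeddings, query_id, k):
    """Return the k entities most similar to query_id as (id, score) pairs."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vec), entity_id)
        for entity_id, vec in embeddings.items()
        if entity_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(entity_id, score) for score, entity_id in scored[:k]]


# Toy embeddings with made-up identifiers and values.
toy = {
    "disease:D1": [1.0, 0.0],
    "gene:G1": [0.9, 0.1],
    "gene:G2": [0.5, 0.5],
    "gene:G3": [0.0, 1.0],
}

ranked = most_similar(toy, "disease:D1", 2)
print(ranked)
```
]]></preformat>
      <p>A production service would typically restrict candidates by entity type and use an index rather than a linear scan; this sketch only shows the ranking idea.</p>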
      <p>
        When querying data, we use the AberOWL [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] SPARQL endpoint to
execute queries that incorporate deductive inference over Semantic Web ontologies.
AberOWL is an ontology repository that provides reasoning over ontologies as
a service. The queries further federate to the Vec2SPARQL endpoints provided
by Bio2Vec, and therefore combine querying RDF phenotype data, phenotype
ontologies (through AberOWL), and semantic similarity (through Bio2Vec).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Querying using SPARQL</title>
      <p>
        One application of computing phenotype similarity is ranking candidate genes
for a disease [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Using PhenomeBrowser's integrated SPARQL endpoint, we
can perform this operation through SPARQL and therefore suggest gene–disease
associations. Figure 2 shows a query for finding genes that are phenotypically
similar to ventricular septal defect (HP:0011623). In the first section of the query,
the content of the FILTER block performs a Description Logic query to retrieve
all classes that are equivalent to or subclasses of the ventricular septal defect
phenotype from the HPO; the query is performed using the AberOWL ontology
repository and reasoning service, which expands the query and replaces it with
the URIs of the classes resulting from the query. Subclasses of ventricular septal
defect in the HPO include Tetralogy of Fallot as well as several more specific
forms of Tetralogy of Fallot; the result also includes ventricular septal defect itself (as
the query is reflexive).
      </p>
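      <p>The effect of a reflexive "subeq" query can be approximated by a reflexive, transitive closure over subclass edges. The Python sketch below walks a small, invented hierarchy; the class names other than ventricular septal defect and Tetralogy of Fallot are placeholders, and the function name subeq is borrowed from the query keyword only for readability. AberOWL uses an OWL reasoner, which also accounts for equivalent classes and inferred subsumptions that this sketch does not model.</p>
      <preformat><![CDATA[
```python
def subeq(parents, target):
    """Return target together with all of its direct and indirect subclasses.

    parents maps each class to the list of its asserted superclasses.
    """
    # Invert the child -> parents map into parent -> children.
    children = {}
    for child, supers in parents.items():
        for sup in supers:
            children.setdefault(sup, set()).add(child)

    result = {target}  # reflexive: the queried class is part of the answer
    stack = [target]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


# Toy hierarchy for illustration only; not the real HPO structure.
toy_parents = {
    "ventricular septal defect": ["placeholder parent class"],
    "Tetralogy of Fallot": ["ventricular septal defect"],
    "placeholder specific form of Tetralogy of Fallot": ["Tetralogy of Fallot"],
    "placeholder unrelated phenotype": ["placeholder parent class"],
}

classes = subeq(toy_parents, "ventricular septal defect")
print(sorted(classes))
```
]]></preformat>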
      <p>The second section of the query contains a federated query to the Bio2Vec
SPARQL endpoint and uses the mostSimilar function. This function is implemented
by the Vec2SPARQL method and executes the phenotypic similarity search on the
Bio2Vec SPARQL endpoint for the diseases selected in the first section of the query.
The mostSimilar function is a custom SPARQL
function that takes as arguments the dataset identifier in Bio2Vec, the identifier
for the entity within the dataset, the number of entities to retrieve (in our case,
we retrieve the three most similar entities to our query), and the type (using
rdf:type) of the entity (in our case, the entity type is gene). In the third
section of the query, we add labels to the genes found in the second section of the query
and to their associated phenotypes.</p>
      <preformat>PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX pb: &lt;http://phenomebrowser.net/&gt;
PREFIX b2v: &lt;http://bio2vec.net/function#&gt;
PREFIX b2vd: &lt;http://bio2vec.net/dataset#&gt;
SELECT ?simGene ?simGeneLabel ?genePhenotype ?genePhenotypeLabel
WHERE {
  # Section 1: diseases annotated with ventricular septal defect or one of its subclasses
  {
    SELECT ?disease
    FROM &lt;http://phenomebrowser.net&gt;
    WHERE {
      ?association rdf:type rdf:Statement .
      ?association rdf:object ?phenotype .
      FILTER ( ?phenotype in (
        OWL subeq &lt;http://phenomebrowser.net/sparql&gt; &lt;HP&gt; {
          'ventricular septal defect'
        }
      ) ) .
      ?association rdf:subject ?disease .
      ?disease rdf:type pb:Disease .
    } LIMIT 20
  } .
  # Section 2: the three genes most similar to each selected disease (Bio2Vec)
  SERVICE &lt;https://bio2vec.cbrc.kaust.edu.sa/ds/query&gt; {
    (?simGene ?val ?x ?y) b2v:mostSimilar(b2vd:dataset_4 ?disease 3 pb:Gene) .
  }
  # Section 3: labels for the genes and their associated phenotypes
  GRAPH &lt;http://phenomebrowser.net&gt; {
    ?simGene rdfs:label ?simGeneLabel .
    ?geneAssociation rdf:subject ?simGene .
    ?geneAssociation rdf:object ?genePhenotype .
    ?genePhenotype rdfs:label ?genePhenotypeLabel .
  }
} ORDER BY asc(?simGeneLabel)</preformat>
      <p>Fig. 2: SPARQL query finding genes that are phenotypically similar to
ventricular septal defect.</p>
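      <p>To make the argument order of mostSimilar explicit, the following Python sketch assembles the corresponding pattern as a string. The helper name most_similar_pattern and its parameter names are hypothetical; the output variables are copied from Figure 2, and the text does not describe the meaning of the variables after the first one, so they are passed through unchanged.</p>
      <preformat><![CDATA[
```python
def most_similar_pattern(dataset, entity, k, entity_type,
                         result_vars=("?simGene", "?val", "?x", "?y")):
    """Build the b2v:mostSimilar pattern in the form used in Figure 2.

    dataset     -- dataset identifier in Bio2Vec, e.g. "b2vd:dataset_4"
    entity      -- identifier or variable of the query entity
    k           -- number of similar entities to retrieve
    entity_type -- rdf:type of the entities to return, e.g. "pb:Gene"
    """
    return "({}) b2v:mostSimilar({} {} {} {}) .".format(
        " ".join(result_vars), dataset, entity, k, entity_type
    )


pattern = most_similar_pattern("b2vd:dataset_4", "?disease", 3, "pb:Gene")
print(pattern)
```
]]></preformat>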
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We developed the PhenomeBrowser knowledgebase as a semantic framework
that integrates querying over graph databases, ontologies, and knowledge graph
embeddings, using SPARQL as a unifying and standardized query language.
PhenomeBrowser is accessible at http://phenomebrowser.net.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>We acknowledge use of the resources of the KAUST Supercomputing Core
Laboratories.</p>
    </sec>
    <sec id="sec-6">
      <title>Funding References</title>
      <p>This work was supported by funding from King Abdullah University of Science
and Technology (KAUST) Office of Sponsored Research (OSR) under Award Nos.
URF/1/3790-01-01, URF/1/4355-01-01, FCC/1/1976-08-01, and
FCC/1/1976-08-08.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Snakemake.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. DCMI Metadata Terms, January <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Jun Chen, Azza Althagafi, and Robert Hoehndorf. <article-title>Predicting candidate genes from phenotypes, functions and anatomical site of expression</article-title>. <source>Bioinformatics</source>, <volume>37</volume>(<issue>6</issue>):<fpage>853</fpage>–<lpage>860</lpage>, October <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Georgios V Gkoutos, Paul N Schofield, and Robert Hoehndorf. <article-title>The anatomy of phenotype ontologies: principles, properties and applications</article-title>. <source>Briefings in Bioinformatics</source>, <volume>19</volume>(<issue>5</issue>):<fpage>1008</fpage>–<lpage>1021</lpage>, April <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Ada Hamosh, Alan F. Scott, Joanna Amberger, David Valle, and Victor A. McKusick. <article-title>Online Mendelian Inheritance in Man (OMIM)</article-title>. <source>Human Mutation</source>, <volume>15</volume>(<issue>1</issue>):<fpage>57</fpage>–<lpage>61</lpage>, <year>2000</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Robert Hoehndorf et al. <article-title>PhenomeNET: a whole-phenome approach to disease gene discovery</article-title>. <source>Nucleic Acids Research</source>, <volume>39</volume>(<issue>18</issue>):<fpage>e119</fpage>, <year>2011</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Robert Hoehndorf, Luke Slater, Paul N Schofield, and Georgios V Gkoutos. <article-title>AberOWL: a framework for ontology-based data access in biology</article-title>. <source>BMC Bioinformatics</source>, <volume>16</volume>:<fpage>26</fpage>, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Maxat Kulmanov, Senay Kafkas, Andreas Karwath, Alexander Malic, Georgios V Gkoutos, Michel Dumontier, and Robert Hoehndorf. <article-title>Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings</article-title>. <source>bioRxiv</source>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Melissa J. Landrum, Jennifer M. Lee, George R. Riley, Wonhee Jang, Wendy S. Rubinstein, Deanna M. Church, and Donna R. Maglott. <article-title>ClinVar: public archive of relationships among sequence variation and human phenotype</article-title>. <source>Nucleic Acids Research</source>, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Christopher J. Mungall, Julie A. McMurry, Sebastian Kohler, James P. Balhoff, Charles Borromeo, Matthew Brush, Seth Carbon, Tom Conlin, Nathan Dunn, Mark Engelstad, Erin Foster, J. P. Gourdine, Julius O. B. Jacobsen, Dan Keith, Bryan Laraway, Suzanna E. Lewis, Jeremy NguyenXuan, Kent Shefchek, Nicole Vasilevsky, Zhou Yuan, Nicole Washington, Harry Hochheiser, Tudor Groza, Damian Smedley, Peter N. Robinson, and Melissa A. Haendel. <article-title>The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species</article-title>. <source>Nucleic Acids Research</source>, <volume>45</volume>(<issue>D1</issue>):<fpage>D712</fpage>–<lpage>D722</lpage>, January <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Catia Pesquita, Daniel Faria, Andre O. Falcão, Phillip Lord, and Francisco M. Couto. <article-title>Semantic similarity in biomedical ontologies</article-title>. <source>PLoS Computational Biology</source>, <volume>5</volume>(<issue>7</issue>):<fpage>e1000443</fpage>, July <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Slave Petrovski and David B. Goldstein. <article-title>Phenomics and the interpretation of personal genomes</article-title>. <source>Science Translational Medicine</source>, <volume>6</volume>(<issue>254</issue>):<fpage>254fs35</fpage>, <year>2014</year>.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. B. Smith, W. Ceusters, B. Klagges, J. Kohler, A. Kumar, J. Lomax, C. Mungall, F. Neuhaus, A. L. Rector, and C. Rosse. <article-title>Relations in biomedical ontologies</article-title>. <source>Genome Biol</source>, <volume>6</volume>(<issue>5</issue>):<fpage>R46</fpage>, <year>2005</year>.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J. Mungall, Neocles Leontis, Philippe R. Serra, Alan Ruttenberg, Susanna A. Sansone, Richard H. Scheuermann, Nigam Shah, Patricia L. Whetzel, and Suzanna Lewis. <article-title>The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration</article-title>. <source>Nat Biotech</source>, <volume>25</volume>(<issue>11</issue>):<fpage>1251</fpage>–<lpage>1255</lpage>, <year>2007</year>.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Nicole L. Washington, Melissa A. Haendel, Christopher J. Mungall, Michael Ashburner, Monte Westerfield, and Suzanna E. Lewis. <article-title>Linking human diseases to animal models using ontology-based phenotype annotation</article-title>. <source>PLOS Biology</source>, <volume>7</volume>(<issue>11</issue>):<fpage>1</fpage>–<lpage>20</lpage>, November <year>2009</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>