<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Phenopackets for the Semantic Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rajaram Kaliyaperumal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gurnoor Singh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>lt-Rosin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jumamurat Bayjanov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter-Bram 't Ho</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>o Roos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leiden University Medical Center</institution>
          ,
          <addr-line>Leiden, 2333 ZA</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Radboud University Medical Center</institution>
          ,
          <addr-line>6525 GA Nijmegen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The GA4GH Phenopackets standard facilitates integrated analysis of genomic and phenomic data from patients. Specifically, it allows phenotypic profiles to be represented in a computable, machine-readable exchange format. However, opportunities for integration with resources not represented in GA4GH standards are limited because the standard is not compatible with Semantic Web standards. Here, we present Semantic Phenopackets, an RDF schema for the Phenopackets schema that is interoperable with Semantic Web technologies. Using an ontological modelling approach driven by a use case, we show how to represent and query Phenopackets described as RDF graphs.</p>
      </abstract>
      <kwd-group>
        <kwd>Phenotypes</kwd>
        <kwd>GA4GH</kwd>
        <kwd>Phenopackets</kwd>
        <kwd>FAIR</kwd>
        <kwd>Ontologies</kwd>
        <kwd>Semantic Web</kwd>
        <kwd>RDF</kwd>
        <kwd>SPARQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Integrative analysis of phenotypic and genomic data facilitates the
understanding of genotype-phenotype relationships. Standard exchange formats for
genomic data have existed for some years and are widely used in genomics research. However,
phenotype information is usually captured in different data format standards
and ontologies. There are vocabularies and ontologies to represent phenotype
information in clinical contexts (e.g., SNOMED CT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) and in research contexts
(e.g., HPO [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). Furthermore, there is no standard way to link genotype to
phenotype information for discovery. To create a common representation and
exchange format for clinical and research settings, the Global Alliance for
Genomics and Health (GA4GH) initiative established Phenopackets as the
phenotypic standard [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Importantly, it allows this rich phenotypic
description to be linked with files containing genomic data.
      </p>
      <p>
        Phenotypes are of special importance for the rare disease community. A
detailed description of computable phenotypes benefits research by allowing
computational mining and new discoveries. To support efficient research in the rare disease
field, the European Joint Programme on Rare Diseases (EJP RD) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is
developing a virtual platform of relevant clinical and biomedical tools and data resources
that adopt the FAIR principles [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Federated data discovery is a cornerstone of
the project, and hence a semantic interoperability layer for data and metadata
based on ontologies, RDF, Linked Data and Semantic Web technologies is also
a key part of the FAIR strategy. Rare disease patient registries are relevant
clinical data resources for rare disease research and can potentially help solve complex
problems in biomedicine. These registries describe a set of well-defined common
data elements (CDEs) that, thanks to a collaborative effort among European
Reference Network (ERN) data managers, data stewards and FAIR experts, are
being translated into Linked Data by means of ontological models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>While the GA4GH Phenopackets standard allows the use of ontologies, the
schema per se is not interoperable with the Semantic Web. Our research question
was therefore how to make Phenopackets interoperable with the EJP RD
virtual platform and thus with the Semantic Web. Here, we present Semantic
Phenopackets, the 'ontologized' version of the GA4GH Phenopackets
schema. Our approach relies on ontological modelling and the use of Semantic
Web technologies. We provide Semantic Phenopackets (RDF schema) for
phenotypic representation and for analyses that leverage Linked Data technologies.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>Data</title>
        <p>
          We used a rare disease dataset on congenital anomalies of the kidney
and urinary tract (CAKUT) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as the driving use case for the semantic
modelling of Phenopackets blocks. CAKUT involves a broad spectrum of renal
and urinary tract malformation phenotypes, ranging from complete renal
agenesis (the most severe) to renal hypodysplasia and multicystic kidney dysplasia.
The dataset contains clinical data from 178 individuals with bilateral CAKUT. For
each individual it includes personal information, sample information, disease information,
phenotypic features and pathology reports.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Semantic modelling</title>
        <p>
          We created semantic models for the GA4GH Phenopackets schema version 1
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Our models are based on the Semanticscience Integrated Ontology (SIO),
which is an upper-level ontology [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It contains various ontological classes and
properties to describe entities and their attributes, and specifies simple design
patterns to represent them uniformly. We used the entity-attribute subpattern
within the measurements design pattern to semantically model the Phenopackets
blocks. To represent the semantic models of the Phenopackets blocks we used
Shape Expressions (ShEx) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. For each Phenopackets block model, we also
provide example RDF files serialized in Turtle (Terse RDF Triple Language)
format [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>RDF creation and query federation</title>
        <p>
          To create RDF graphs of the CAKUT dataset we used OpenRefine [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], a
software application for data wrangling. OpenRefine's
RDF extension provides functionality to transform the content of an
OpenRefine project to RDF. For the CAKUT dataset transformation we used the
OpenRefine software and its RDF extension version 3.4.1. To demonstrate federated
querying of Semantic Phenopackets models with the EJP RD virtual platform, we
used example RDF Turtle files provided on the EJP RD CDE model GitHub
page [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The CDE model describes patients and their attributes in RDF
according to the definition of CDEs for rare disease registries by the Joint Research
Centre [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Semantic Phenopackets</title>
        <p>
          Out of the 18 blocks in version 1 of the Phenopackets schema [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ],
we modelled the following 9: 'Individual', 'Biosample',
'Disease', 'Sex', 'KaryotypicSex', 'Age', 'File', 'Procedure' and 'PhenotypicFeature'.
We chose these 9 blocks because they are the most relevant to the content of the
CAKUT dataset. For these blocks we created 21 atomic
semantic models in total. The models are based on the
SIO entity-attribute pattern and are publicly available on GitHub<sup>3</sup>. For each
of the 21 models we provide a separate GitHub markdown file
containing its ShEx shapes, which describe the structure of the RDF graph, an
example RDF file and a graphical representation. Figure 1 shows an example
RDF instance for the Phenopackets block 'Sex'. We used the generic sio:is_about
(sio:SIO_000332) object property to describe a specific attribute of the entity, and
the data property sio:has_value (sio:SIO_000300) to describe the value
of the attribute. We used ontologies recommended by the Phenopackets schema
to represent the attributes' values.
        </p>
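        <p>The entity-attribute pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published model: the example.org IRIs and the literal value are hypothetical, and the sketch assumes that the attribute node is the subject of both the 'is about' (SIO_000332) and the 'has value' (SIO_000300) statements. The ShEx shapes in the GitHub repository remain the authoritative description.</p>
        <preformat><![CDATA[
```python
SIO = "http://semanticscience.org/resource/"
IS_ABOUT = SIO + "SIO_000332"   # SIO 'is about', the object property named in the text
HAS_VALUE = SIO + "SIO_000300"  # SIO 'has value', the data property named in the text


def attribute_triples(entity_iri, attribute_iri, value):
    """Return two N-Triples lines for one attribute of one entity.

    Assumption: the attribute node is the subject of both statements.
    """
    # Escape backslashes and double quotes so the literal stays valid N-Triples.
    literal = value.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{attribute_iri}> <{IS_ABOUT}> <{entity_iri}> .",
        f'<{attribute_iri}> <{HAS_VALUE}> "{literal}" .',
    ]


# Hypothetical IRIs and value, chosen only for illustration.
sex_triples = attribute_triples(
    "https://example.org/individual/1",
    "https://example.org/individual/1/sex",
    "FEMALE",
)
print("\n".join(sex_triples))
```
]]></preformat>
        <p>Because every attribute is expressed with the same two properties, the same helper would serve any other block in this sketch; only the IRIs and the value change.</p>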
      </sec>
      <sec id="sec-3-2">
        <title>Query Semantic Phenopackets</title>
        <p>To demonstrate the simplicity of our semantic model we created two SPARQL
queries that retrieve information about individuals. The first query<sup>4</sup> retrieves
all individuals and all their attributes, whereas the second query<sup>5</sup> retrieves
all individuals and only their date of birth attribute. Together, the two queries
show that Semantic Phenopackets RDF graphs support both generic and
specific data retrieval with only minor modifications to the query. To
demonstrate interoperability with the EJP RD virtual platform we created a
federated SPARQL query<sup>6</sup>. The query matches on the diseases of individuals in the
Semantic Phenopackets RDF and patients from the RDF graphs of the CDE
model. Further, the query lists biobanks from the CDE model RDF graph for
the matched patients.</p>
        <p><sup>3</sup> https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema/wiki</p>
        <p><sup>4</sup> https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema/blob/master/example-queries/query1.rq</p>
        <p><sup>5</sup> https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema/blob/master/example-queries/query2.rq</p>
        <p><sup>6</sup> https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema/blob/master/example-queries/query3.rq</p>
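        <p>The step from generic to specific retrieval can be illustrated with a small Python helper that assembles a SPARQL query string. This is a hedged sketch, not a copy of the queries in the repository: it assumes that attribute nodes point to individuals via SIO_000332 and carry their value via SIO_000300, and the class IRI used for the date of birth is hypothetical.</p>
        <preformat><![CDATA[
```python
PREFIXES = "PREFIX sio: <http://semanticscience.org/resource/>\n"


def individual_query(attribute_class=None):
    """Build a SELECT query over attribute nodes that are about individuals.

    attribute_class: optional full IRI used to keep only one kind of attribute.
    """
    patterns = [
        "?attribute sio:SIO_000332 ?individual .",  # SIO 'is about'
        "?attribute sio:SIO_000300 ?value .",       # SIO 'has value'
    ]
    if attribute_class is not None:
        # The only change needed for the specific query in this sketch.
        patterns.append(f"?attribute a <{attribute_class}> .")
    body = "\n  ".join(patterns)
    return f"{PREFIXES}SELECT ?individual ?attribute ?value\nWHERE {{\n  {body}\n}}"


generic = individual_query()
# Hypothetical class IRI standing in for a date-of-birth attribute type.
specific = individual_query("https://example.org/DateOfBirth")
print(specific)
```
]]></preformat>
        <p>In this sketch the specific query differs from the generic one by a single triple pattern, which mirrors the minor modification described in the text.</p>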
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Semantic Phenopackets is a more machine-readable and interoperable version
of the GA4GH Phenopackets schema. It aims to capture, for machines, what
the elements in a phenopacket mean. It can be used directly in Semantic Web
queries or as a reference for other phenopacket schemas. By
using a simple entity-attribute ontological design pattern we can represent
different Phenopackets blocks in a uniform way. This facilitates data retrieval
and enables interoperability with the EJP RD virtual platform and with the
Semantic Web by means of the SPARQL query language. Moreover, the reuse
of ontological design patterns is a recommended good practice in
knowledge engineering. We provide the community with a first set of 21 atomic Semantic
Phenopackets models as ShEx, RDF and graphical files that are openly and publicly
available on GitHub. Furthermore, using semantic models to represent Phenopackets
makes some of the blocks obsolete, in the sense that there is no need to model
their attributes explicitly: these are already described when the IRI is resolved
(if the RDF description follows Semantic Web best practices). For instance, the
Phenopackets block 'OntologyClass' only requires the identifier (as a
CURIE-style string) and the label (as a string).</p>
      <p>We developed Semantic Phenopackets as the 'ontologized' version of the
GA4GH Phenopackets schema that is interoperable with the Semantic Web. A
use case driven by rare disease data was useful for prioritizing the set of Phenopackets blocks
to model. As future work, we envision updating the models to cover the full, newly
released Phenopackets schema version 2. Moreover, we will build a tool to
automate the conversion of Phenopackets to RDF and the translation of SPARQL
query results into a YAML serialization, to facilitate interoperability with other
GA4GH Phenopackets clients and tools.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Our work is supported by funding from the European Union's Horizon 2020
research and innovation program under the EJP RD COFUND-EJP N° 825575.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>EJP RD CDE model GitHub page</article-title>
          . https://github.com/ejp-rd-vp/CDE-semantic-model/tree/develop, last accessed 27 September 2021
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <article-title>EJP RD homepage</article-title>
          . https://www.ejprarediseases.org/, last accessed 24 August 2020
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>GA4GH Phenopackets schema version 1</article-title>
          . https://phenopacket-schema.readthedocs.io/en/1.0.0/, last accessed 27 September 2021
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <article-title>GA4GH Phenopackets standard</article-title>
          . https://www.ga4gh.org/news/phenopackets-standardizing-and-exchanging-patient-phenotypic-data/, last accessed 27 September 2021
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <article-title>JRC CDE homepage</article-title>
          . https://eu-rd-platform.jrc.ec.europa.eu/set-of-common-data-elements_en, last accessed 27 September 2021
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <article-title>OpenRefine homepage</article-title>
          . https://openrefine.org/, last accessed 27 September 2021
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Beckett</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carothers</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>RDF 1.1 Turtle</article-title>
          . World Wide Web Consortium, pp.
          <fpage>18</fpage>
          –
          <lpage>31</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Donnelly</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
          <article-title>SNOMED-CT: The advanced terminology and coding system for eHealth</article-title>
          .
          <source>Studies in Health Technology and Informatics</source>
          <volume>121</volume>
          ,
          <fpage>279</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baran</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callahan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chepelev</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz-Toledo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Rio</surname>
            ,
            <given-names>N.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duck</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furlong</surname>
            ,
            <given-names>L.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keath</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , et al.:
          <article-title>The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>11</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buffin-Meyer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boizard</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moussaoui</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lescat</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breuil</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedou</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feuillet</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casemayou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neau</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al.:
          <article-title>Amniotic fluid peptides predict postnatal kidney survival in developmental kidney disease</article-title>
          .
          <source>Kidney International</source>
          <volume>99</volume>
          (
          <issue>3</issue>
          ),
          <fpage>737</fpage>
          –
          <lpage>749</lpage>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labra Gayo</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solbrig</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Shape expressions: an RDF validation and transformation language</article-title>
          . In:
          <source>Proceedings of the 10th International Conference on Semantic Systems</source>
          . pp.
          <fpage>32</fpage>
          –
          <lpage>40</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mundlos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>The Human Phenotype Ontology</article-title>
          .
          <source>Clinical Genetics</source>
          <volume>77</volume>
          (
          <issue>6</issue>
          ),
          <fpage>525</fpage>
          –
          <lpage>534</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aalbersberg</surname>
            ,
            <given-names>I.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Appleton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Axton</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomberg</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boiten</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>da Silva Santos</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bourne</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          , et al.:
          <article-title>The FAIR guiding principles for scientific data management and stewardship</article-title>
          .
          <source>Scientific Data</source>
          <volume>3</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>