=Paper= {{Paper |id=Vol-3127/paper-4 |storemode=property |title=Phenopackets for the Semantic Web |pdfUrl=https://ceur-ws.org/Vol-3127/paper-4.pdf |volume=Vol-3127 |dblpUrl=https://dblp.org/rec/conf/swat4ls/KaliyaperumalSQ22 }} ==Phenopackets for the Semantic Web== https://ceur-ws.org/Vol-3127/paper-4.pdf
                            Phenopackets for the Semantic Web

          Rajaram Kaliyaperumal1[0000-0002-1215-167X], Gurnoor Singh2[0000-0003-1615-4197],
          Núria Queralt-Rosinach1[0000-0003-0169-8159], Jumamurat Bayjanov2[0000-0002-0637-9950],
          Peter-Bram ’t Hoen2[0000-0003-4450-3112], and Marco Roos1[0000-0002-8691-772X]

          1 Leiden University Medical Center, Leiden, 2333 ZA, The Netherlands
          2 Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands



                       Abstract. The GA4GH Phenopackets standard facilitates integrated
                       analysis of genomics and phenomics from patients. Specifically, it allows
                       the representation of phenotypic profiles in a computable,
                       machine-readable exchange format. However, opportunities for
                       integration with resources not represented in GA4GH standards are
                       limited because the standard is not compatible with Semantic Web
                       standards. Here, we present Semantic Phenopackets, an RDF schema
                       for the Phenopackets schema that is interoperable with Semantic Web
                       technologies. Using a use-case-driven ontological modelling approach,
                       we show how to represent and query Phenopackets described as RDF
                       graphs.

                       Keywords: Phenotypes · GA4GH · Phenopackets · FAIR · Ontologies
                       · Semantic Web · RDF · SPARQL


             1     Introduction

             Integrative analysis of phenotypic and genomic data facilitates the
             understanding of genotype-phenotype relationships. Standard exchange formats
             for genomic data have existed for some years and are widely used in genomics
             research. Phenotype information, however, is usually captured in a variety
             of data formats and ontologies. There are vocabularies and ontologies to
             represent phenotype information in clinical contexts (e.g., SNOMED CT [8])
             and in research contexts (e.g., HPO [12]). Furthermore, there is no standard
             way to link genotype to phenotype information for discovery. To create a
             common representation and exchange format for clinical and research settings,
             the Global Alliance for Genomics and Health (GA4GH) initiative established
             Phenopackets as the phenotypic standard [4]. Importantly, it allows this rich
             phenotypic description to be linked with files containing genomic data.
                 Phenotypes are of special importance for the rare disease community. A
             detailed description of computable phenotypes benefits research by enabling
             computational mining and new discoveries. To support efficient research in
             the rare disease field, the European Joint Programme on Rare Diseases
             (EJP RD) [2] is developing a virtual platform of relevant clinical and
             biomedical tools and data resources that adopt the FAIR principles [13].
             Federated data discovery is a cornerstone of




Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

the project, and hence a semantic interoperability layer for data and metadata
based on ontologies, RDF, Linked Data and Semantic Web technologies is also
a key part of the FAIR strategy. Rare disease patient registries are relevant
clinical data resources for rare disease research and can potentially help solve
complex problems in biomedicine. These registries describe a set of well-defined
common data elements (CDEs) that, thanks to a collaborative effort among European
Reference Network (ERN) data managers, data stewards and FAIR experts, are
being translated into Linked Data by means of ontological models [1].
    While the GA4GH Phenopackets standard allows the use of ontologies, the
schema per se is not interoperable with the Semantic Web. Our research question
was therefore how to make Phenopackets interoperable with the EJP RD virtual
platform and thus with the Semantic Web. Here, we present Semantic
Phenopackets, the ‘ontologized’ version of the GA4GH Phenopackets schema.
Our approach relies on ontological modelling and the use of Semantic Web
technologies. We provide Semantic Phenopackets (an RDF schema) for phenotypic
representation and for analyses that leverage Linked Data technologies.

2     Methods
2.1   Data
We used a rare disease dataset on congenital anomalies of the kidney and
urinary tract (CAKUT) [10] as the driving use case for the semantic modelling
of phenopackets blocks. CAKUT involves a broad spectrum of renal and urinary
tract malformation phenotypes, ranging from complete renal agenesis (the most
severe) to renal hypodysplasia and multicystic kidney dysplasia. The dataset
contains clinical data from 178 individuals with bilateral CAKUT. For each
individual it includes personal information, sample information, disease
information, phenotypic features and pathology reports.

2.2   Semantic modelling
We created semantic models for the GA4GH Phenopackets schema version 1
[3]. Our models are based on the Semanticscience Integrated Ontology (SIO),
which is an upper-level ontology [9]. It contains various ontological classes and
properties to describe entities and their attributes and specifies simple design
patterns to uniformly represent them. We used the entity-attribute subpattern
within the measurements design pattern to semantically model the phenopackets
blocks. To represent the semantic models of the phenopackets blocks we used
Shape Expressions (ShEx) [11]. For each phenopackets block model, we also
provide example RDF files serialized in Turtle (Terse RDF Triple Language)
format [7].

2.3   RDF creation and query federation
To create RDF graphs of the CAKUT dataset we used OpenRefine [6], a software
application for data wrangling. OpenRefine’s RDF extension provides
functionality to transform the content of an OpenRefine project to RDF. For
the CAKUT dataset transformation we used the OpenRefine software and its RDF
extension version 3.4.1. To demonstrate federated querying of Semantic
Phenopackets models with the EJP RD virtual platform, we used the example RDF
Turtle files provided on the EJP RD CDE model GitHub page [1]. The CDE model
describes patients and their attributes in RDF according to the definition of
CDEs for rare disease registries by the Joint Research Centre [5].
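
As an illustration of the kind of mapping such a transformation performs, the
sketch below turns table rows into entity-attribute triples in plain Python.
It is not OpenRefine or its RDF extension. The column names, the base IRI
http://example.org/cakut/, the sample rows and the direction of the ‘is about’
link are all assumptions made for the example; only the two SIO property
identifiers come from the paper.

```python
import csv
import io

SIO = "http://semanticscience.org/resource/"
IS_ABOUT = SIO + "SIO_000332"    # SIO 'is about'
HAS_VALUE = SIO + "SIO_000300"   # SIO 'has value'
BASE = "http://example.org/cakut/"  # hypothetical base IRI

# Hypothetical two-column extract; the real CAKUT columns are not shown here.
TABLE = "individual_id,sex\nP001,FEMALE\nP002,MALE\n"


def row_to_triples(row):
    """Map one table row to two triples (assumed link direction:
    the attribute node 'is about' the individual)."""
    individual = f"<{BASE}individual/{row['individual_id']}>"
    attribute = f"<{BASE}individual/{row['individual_id']}/sex>"
    return [
        (attribute, f"<{IS_ABOUT}>", individual),
        (attribute, f"<{HAS_VALUE}>", f'"{row["sex"]}"'),
    ]


def table_to_ntriples(text):
    """Serialize every row of a CSV string as N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        for s, p, o in row_to_triples(row):
            lines.append(f"{s} {p} {o} .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(table_to_ntriples(TABLE))
```

Each row yields one attribute node with two statements, so the two sample rows
produce four lines of output. A real mapping would use ontology term IRIs
rather than plain string literals for values such as sex.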


3     Results
3.1   Semantic Phenopackets
Out of 18 phenopackets blocks in version 1 of the Phenopackets schema [3],
we modelled the following 9: ‘Individual’, ‘Biosample’, ‘Disease’, ‘Sex’,
‘KaryotypicSex’, ‘Age’, ‘File’, ‘Procedure’ and ‘PhenotypicFeature’.
We chose these 9 blocks because they are the most relevant to the content of
the CAKUT dataset. For these blocks we created 21 atomic semantic models in
total, all based on the SIO entity-attribute pattern and publicly available
on GitHub 3 . For each of the 21 semantic models we provide a separate GitHub
markdown file that contains its ShEx shapes, which describe the structure of
the RDF graph, an example RDF file and a graphical representation. Figure 1
shows an example RDF instance for the phenopackets block ‘Sex’. We used the
generic object property ‘is about’ (sio:SIO_000332) to describe a specific
attribute of the entity, and the data property ‘has value’ (sio:SIO_000300)
to describe the value of the attribute. We used ontologies recommended by the
phenopackets schema to represent the attributes’ values.
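
The entity-attribute pattern and the role of a shape can be sketched in a few
lines of plain Python. This is a hand-rolled stand-in, not ShEx and not the
published models: the namespace http://example.org/, the value IRI, the
function names and the assumption that the attribute node points to the
individual via ‘is about’ are all hypothetical.

```python
SIO = "http://semanticscience.org/resource/"
IS_ABOUT = SIO + "SIO_000332"    # SIO 'is about'
HAS_VALUE = SIO + "SIO_000300"   # SIO 'has value'
EX = "http://example.org/"       # hypothetical namespace


def sex_block(individual_id, value_iri):
    """Return the triples for a 'Sex' attribute attached to one individual.
    Assumed direction: attribute --is about--> individual."""
    individual = EX + "individual/" + individual_id
    attribute = individual + "/sex"
    return {
        (attribute, IS_ABOUT, individual),
        (attribute, HAS_VALUE, value_iri),
    }


def conforms(graph, node):
    """Shape-like check: the node must have exactly one 'is about'
    statement and exactly one 'has value' statement."""
    about = [o for s, p, o in graph if s == node and p == IS_ABOUT]
    value = [o for s, p, o in graph if s == node and p == HAS_VALUE]
    return len(about) == 1 and len(value) == 1


# Hypothetical instance; the value IRI is a placeholder, not a real term.
graph = sex_block("P001", EX + "term/female")
node = EX + "individual/P001/sex"
print(conforms(graph, node))  # True for the well-formed block
```

Adding a second ‘has value’ statement to the same node makes `conforms`
return False, which is the kind of structural constraint a shape is meant to
express.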

3.2   Query Semantic Phenopackets
Fig. 1. Example RDF instance for the phenopacket block ‘Sex’. The diamond
represents an RDF instance and the rectangle represents an IRI value.

To demonstrate the simplicity of our semantic model we created two SPARQL
queries that retrieve information about individuals. The first query 4
retrieves all individuals and all their attributes, whereas the second query 5
retrieves all individuals and only their date of birth attribute. Together
they show that generic and specific data retrieval from Semantic Phenopackets
RDF graphs differ only by minor modifications to the query. To demonstrate
interoperability with the EJP RD virtual platform we created a federated
SPARQL query 6 . The query matches the diseases of individuals in the Semantic
Phenopackets RDF with those of patients in the RDF graphs of the CDE model.
Further, the query lists biobanks from the CDE model RDF graph for the
matched patients.

3
  https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema/wiki
4
  https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema/blob/master/example-queries/query1.rq
5
  https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema/blob/master/example-queries/query2.rq
6
  https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema/blob/master/example-queries/query3.rq
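
The difference between the generic and the specific retrieval described in
Section 3.2 can be sketched without a SPARQL engine. The code below is not a
transcription of the published queries. It assumes that each attribute node
carries an rdf:type and points to its individual via ‘is about’, and every
IRI and value in the sample graph is invented for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SIO = "http://semanticscience.org/resource/"
IS_ABOUT = SIO + "SIO_000332"    # SIO 'is about'
HAS_VALUE = SIO + "SIO_000300"   # SIO 'has value'
EX = "http://example.org/"       # hypothetical namespace

# Hypothetical sample graph as a set of (subject, predicate, object) tuples.
GRAPH = {
    (EX + "p1/dob", RDF_TYPE, EX + "DateOfBirth"),
    (EX + "p1/dob", IS_ABOUT, EX + "p1"),
    (EX + "p1/dob", HAS_VALUE, "1990-01-01"),
    (EX + "p1/sex", RDF_TYPE, EX + "Sex"),
    (EX + "p1/sex", IS_ABOUT, EX + "p1"),
    (EX + "p1/sex", HAS_VALUE, EX + "term/female"),
    (EX + "p2/dob", RDF_TYPE, EX + "DateOfBirth"),
    (EX + "p2/dob", IS_ABOUT, EX + "p2"),
    (EX + "p2/dob", HAS_VALUE, "1985-06-15"),
}


def objects(graph, subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def attributes(graph, attribute_type=None):
    """Return sorted (individual, attribute type, value) rows.
    With attribute_type=None every attribute is returned (generic case);
    otherwise only attributes of that type are kept (specific case)."""
    rows = []
    for attr, predicate, individual in graph:
        if predicate != IS_ABOUT:
            continue
        for kind in objects(graph, attr, RDF_TYPE):
            if attribute_type is None or kind == attribute_type:
                for value in objects(graph, attr, HAS_VALUE):
                    rows.append((individual, kind, value))
    return sorted(rows)


if __name__ == "__main__":
    for row in attributes(GRAPH):                       # generic retrieval
        print(row)
    for row in attributes(GRAPH, EX + "DateOfBirth"):   # specific retrieval
        print(row)
```

The specific case differs from the generic one only by the extra type
constraint, which mirrors the point that a single uniform pattern keeps the
two queries almost identical.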


4   Discussion

Semantic Phenopackets is a more machine-readable and interoperable version
of the GA4GH Phenopackets schema. It aims to capture, for machines, what
the elements in a phenopacket mean. It can be used directly in Semantic Web
queries or as a reference for other phenopacket schemas. By using a simple
entity-attribute ontological design pattern we can represent different
Phenopackets blocks in a uniform way, which facilitates data retrieval and
enables interoperability with the EJP RD virtual platform and with the
Semantic Web by means of the SPARQL query language. Moreover, the reuse of
ontological design patterns is a recommended good practice in knowledge
engineering. We provided the community with a first set of 21 atomic Semantic
Phenopackets models as ShEx, RDF and graphical files that are openly and
publicly available on GitHub. Furthermore, using semantic models to represent
Phenopackets makes some of the blocks obsolete, in the sense that there is no
need to model their attributes explicitly: they are already described when
the IRI is resolved (if the RDF description follows Semantic Web best
practices). For instance, the Phenopackets block ‘OntologyClass’ only requires
the identifier (as a CURIE-style string) and the label (as a string), both of
which are available from the ontology term’s IRI.
    We developed Semantic Phenopackets as the ‘ontologized’ version of the
GA4GH Phenopackets schema that is interoperable with the Semantic Web. A
rare-disease-driven use case was useful to prioritize the set of Phenopackets
blocks to model. As future work, we envision updating our models and modelling
the full, newly released Phenopackets schema version 2. Moreover, we will
build a tool to automate the conversion of Phenopackets to RDF and the
translation of SPARQL query results into YAML serialization, to facilitate
interoperability with other GA4GH Phenopackets clients and tools.


Acknowledgements
Our work is supported by funding from the European Union’s Horizon 2020
research and innovation program under the EJP RD COFUND-EJP N° 825575.


References
 1. EJP RD CDE model GitHub page. https://github.com/ejp-rd-vp/CDE-semantic-model/tree/develop,
    last accessed 27 September 2021
 2. EJP RD homepage. https://www.ejprarediseases.org/, last accessed 24 August 2020
 3. GA4GH Phenopackets schema version 1. https://phenopacket-schema.readthedocs.io/en/1.0.0/,
    last accessed 27 September 2021
 4. GA4GH Phenopackets standard.
    https://www.ga4gh.org/news/phenopackets-standardizing-and-exchanging-patient-phenotypic-data/,
    last accessed 27 September 2021
 5. JRC CDE homepage. https://eu-rd-platform.jrc.ec.europa.eu/set-of-common-data-elements_en,
    last accessed 27 September 2021
 6. OpenRefine homepage. https://openrefine.org/, last accessed 27 September 2021
 7. Beckett, D., Berners-Lee, T., Prud’hommeaux, E., Carothers, G.: RDF 1.1 Turtle.
    World Wide Web Consortium pp. 18–31 (2014)
 8. Donnelly, K., et al.: SNOMED-CT: The advanced terminology and coding system for
    eHealth. Studies in Health Technology and Informatics 121, 279 (2006)
 9. Dumontier, M., Baker, C.J., Baran, J., Callahan, A., Chepelev, L., Cruz-Toledo,
    J., Del Rio, N.R., Duck, G., Furlong, L.I., Keath, N., et al.: The Semanticscience
    Integrated Ontology (SIO) for biomedical research and knowledge discovery. Journal
    of Biomedical Semantics 5(1), 1–11 (2014)
10. Klein, J., Buffin-Meyer, B., Boizard, F., Moussaoui, N., Lescat, O., Breuil, B.,
    Fedou, C., Feuillet, G., Casemayou, A., Neau, E., et al.: Amniotic fluid peptides
    predict postnatal kidney survival in developmental kidney disease. Kidney
    International 99(3), 737–749 (2021)
11. Prud’hommeaux, E., Labra Gayo, J.E., Solbrig, H.: Shape expressions: an RDF
    validation and transformation language. In: Proceedings of the 10th International
    Conference on Semantic Systems. pp. 32–40 (2014)
12. Robinson, P.N., Mundlos, S.: The Human Phenotype Ontology. Clinical Genetics
    77(6), 525–534 (2010)
13. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M.,
    Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.:
    The FAIR guiding principles for scientific data management and stewardship.
    Scientific Data 3 (2016)