Take CARE of your patient data. Clinical And Registry Entries (CARE) Semantic Model

Pablo Alarcón-Moreno¹, Mark Denis Wilkinson¹

¹ Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas. Universidad Politécnica de Madrid (UPM)–Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria-CSIC (INIA-CSIC), Campus Montegancedo 28223 Pozuelo de Alarcón (Madrid), Spain

Abstract

The Clinical And Registry Entries Semantic Model (CARE-SM) is designed to represent healthcare information stored in patient data registries through the use of Semantic Web technologies, with the objective of facilitating reasoning over federated data sources. Evolving from its origins as the Common Data Element Semantic Model (CDE-SM), CARE-SM improves on this prior art by the standardization and homogenization of its core structure, and also by the addition of a contextual metadata layer with temporal and event-based information. Consistency between data elements’ representations allows several implementation improvements, including simplified data transformation and improved data discoverability.

Keywords

CARE-SM, Semantic Web, FAIR, Semantic model, Common Data Elements, Interoperability

1. Introduction

The “Big Data” era provides unprecedented opportunities to undertake large-scale analytics over combined clinical and molecular data. To achieve this, however, there is a need for standardized and interoperable healthcare data models such that federated exploration and analysis can be more easily achieved. There is an increasing number of sensitive registered patient data sources that are intended to be used for research purposes, but the lack of interoperability between data repositories thwarts this goal, causing researchers to invest valuable time finding, preparing, filtering, and combining datasets.
[1, 2] The FAIR Data Principles [1] call for data to be findable, accessible, interoperable, and reusable (FAIR), such that the value of data can be fully realized. Many of the FAIR objectives are realized through a combination of Web and Semantic Web technologies. For example, globally unique identifiers, such as URLs, are a requirement of FAIR, and the use of shared vocabularies (i.e. ontologies) and machine-readable syntaxes such as Resource Description Framework (RDF) [3] are hallmarks of most Semantic Web data architectures. CARE-SM [4] is intended to assist domain experts in achieving “FAIRness” by providing a pre-defined, generic data model that is inherently FAIR, and leverages the more complex features of the Semantic Web such as upper ontologies - specifically, the SemanticScience Integrated Ontology (SIO) [5] - which will help ensure that federated data can be used for logical reasoning.

SWAT4HCLS, February 26-29, 2024, Leiden, NL
Email: pabloalarconmoreno@gmail.com (P. Alarcón-Moreno); mark.wilkisnon@upm.es (M. D. Wilkinson)
Web: https://github.com/pabloalarconm (P. Alarcón-Moreno); https://github.com/markwilkinson (M. D. Wilkinson)
ORCID: 0000-0001-5974-589X (P. Alarcón-Moreno); 0000-0001-6960-357X (M. D. Wilkinson)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. The Genesis of CARE-SM

CARE-SM is an expanded and enriched representation of a prior model, which was primarily drafted to be capable of representing the Common Data Elements (CDE) for Rare Disease Registration [6] from the European Commission. Since that earlier work, the project was faced with the need to expand the number and variety of data elements that should be modeled in a FAIR manner.
This included modeling treatments and interventions, imaging, and Patient Reported Outcomes (PROs) [7], as well as keeping a longitudinal history of patient events. These had a level of complexity that did not exist in the CDEs, and thus necessitated an extensive re-consideration and revision of the earlier CDE semantic model. Nevertheless, the requirement to be able to support semantic reasoning in the future remained, and thus we retained the use of SIO as the “semantic backbone” for CARE-SM. SIO has a well-defined set of design patterns [8] for modeling scholarly data, which guide the entity-relationships that can exist in a SIO-based data representation. For the CDE model, the core design pattern was: an Identifier identified a Role; a Person played the Role; that Role was materialized in a Process; the Process had an Output; that Output was (generally) the measurement of an Attribute; the Attribute was an attribute of the Person. This core set of entity-relationships needed to be expanded to suit the broader range of observational and molecular data that needed to be modeled by CARE-SM. A full description of these extensions and revisions follows in the next section.

3. CARE-SM Overview

3.1. Core structure

Compared to the CDE-SM, there was a need to expand the core set of entity-relationships that were captured. Two additions to the core model were of particular relevance. First, the process (e.g. a clinical procedure) is now related to several additional entities beyond the process output, including inputs, agents, routes, protocols and targets (see Figure 1). Second, the output in the CDE model is now enhanced with a unit of measurement. Various kinds of observations will use different combinations of these new elements depending on the data element being modeled. CARE-SM uses ontologies from the Open Biological and Biomedical Ontology (OBO) Foundry [9, 10] to provide domain-specific ontological classes for every data element.
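The entity-relationship chain described above can be sketched as plain subject-predicate-object tuples. This is only a minimal illustration under stated assumptions, not the project's implementation: the `ex:` identifiers, the predicate names, and the `core_pattern` helper are hypothetical placeholders and are not actual SIO or OBO IRIs.

```python
from typing import Dict, List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Optional process-level entities added in CARE-SM (hypothetical predicate names).
PROCESS_FACETS = ("input", "agent", "route", "protocol", "target")


def core_pattern(
    pid: str,
    process_type: str,
    attribute_type: str,
    value: Optional[str] = None,
    unit: Optional[str] = None,
    facets: Optional[Dict[str, str]] = None,
) -> List[Triple]:
    """Build the identifier -> role -> process -> output -> attribute chain."""
    ident, person = f"ex:{pid}_id", f"ex:{pid}_person"
    role, process = f"ex:{pid}_role", f"ex:{pid}_process"
    output, attribute = f"ex:{pid}_output", f"ex:{pid}_attribute"

    triples = [
        Triple(ident, "ex:identifies", role),
        Triple(person, "ex:plays", role),
        Triple(role, "ex:materialized_in", process),
        Triple(process, "rdf:type", process_type),
        Triple(process, "ex:has_output", output),
        Triple(output, "ex:measures", attribute),
        Triple(attribute, "rdf:type", attribute_type),
        Triple(attribute, "ex:attribute_of", person),
    ]
    # The output may carry a value and, new in CARE-SM, a unit of measurement.
    if value is not None:
        triples.append(Triple(output, "ex:has_value", value))
    if unit is not None:
        triples.append(Triple(output, "ex:has_unit", unit))
    # Each data element uses only the process facets that apply to it.
    for name, target in (facets or {}).items():
        if name not in PROCESS_FACETS:
            raise ValueError(f"unknown process facet: {name}")
        triples.append(Triple(process, f"ex:has_{name}", target))
    return triples


if __name__ == "__main__":
    for t in core_pattern("p1", "ex:BodyWeightMeasurement", "ex:BodyWeight",
                          value="72.5", unit="ex:kilogram",
                          facets={"protocol": "ex:protocol_1"}):
        print(t.subject, t.predicate, t.obj)
```

Because every data element follows the same chain, only the class arguments and the optional facets change from one data element to the next.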
The combined use of SIO and OBO terms has been standardized relative to the previous CDE-SM, which used an arbitrary number of ontological classes to annotate each data element subcomponent. This standardization increases the data model consistency for transformation and querying. Non-OBO ontologies such as the Orphanet Rare Disease Ontology (ORDO) [11] are also present in CARE-SM to annotate clinical conditions.

Figure 1: CARE-SM core structure.

Figure 2 provides an example of the application of the core semantic model to a specific type of data - in this case, a tumor resection surgery of a patient: A person who has the role of a patient, denoted by a patient identifier, is participating in a tumor resection process. Several entities are associated with this process, such as the intervention protocol and the anatomic structure targeted by the surgery (defined as lung tissue in this example). Furthermore, a drug administered during the intervention (denoted by a drug identifier) is recorded, together with its route of administration. Intervention comments can also be added to the clinical process to enrich the contextual information in a human-readable way.

Figure 2: Exemplar tumoral resection surgery.

3.2. An added layer of metadata

One of the most consequential changes to the overall model when comparing CARE-SM to its predecessor is the introduction of a metadata layer that imparts context on each data element. Semantically, the contextual metadata layer groups every instance, class or property used to describe each data element. As shown in Figure 3, temporal information, in the form of time points or time intervals, is an example of the use of this layer, allowing the definition of a timeline of patient clinical encounters. An encounter identifier can be added to the model to further relate several of these data elements under the same clinical episode or event, for example, a treatment regimen.
As is typical for RDF data, context is modeled using RDF-Quads [12] - that is, a fourth URI element accompanies every RDF triple. This context URI can then be used as the subject for additional triples in order to, for example, add temporal or administrative information about that data element, or to group sets of triples into other higher-level structures.

Figure 3: Contextual metadata layer.

4. CARE-SM in Action

4.1. The CARE-SM implementation

Although CARE-SM only specifies a generic data model, we have generated a set of tools and guidelines to assist with the implementation of this model over patient data. The European Joint Programme on Rare Diseases (EJP-RD) [13] has implemented an automated workflow for transforming tabular data into an RDF representation, which has been adapted to the requirements of the CARE-SM models in a variety of ways since its initial use with the CDE-SM. The CDE-SM workflow consumed data-element-specific CSV tables, where the CSV columns were referenced in data-element-specific templates structured using the YARRRML specification [14]. These YARRRML templates were transformed into the RML mapping language [3] by a YARRRML parser, and this mapping was applied to the CSV to generate the final RDF representation. This same “backbone” remains in the CARE-SM implementation. However, since all data elements now conform to a single overarching model, every data type can now be represented using a common CSV template, with a common YARRRML template. The only remaining data-element specificity is the set of columns that are required/optional for each data type, and these requirements are documented on the project’s GitHub [15]. The flexibility of allowing more optional data facets introduced some additional complexity in the transformation templates - in particular, the use of “conditionals” (if/then) within the YARRRML to decide when an RDF statement should be generated.
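The context mechanism and the conditional statement generation described above can be illustrated with a short sketch. It is written in plain Python rather than YARRRML/RML, so it stands in for the idea and not for the actual templates. The column names (`pid`, `uniqid`, `value`, `unit`, `startdate`, `enddate`), the `https://example.org/` base IRI, the `ex:` predicates, and the choice to place statements about the context in the default graph (`None`) are all assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List, Optional, Tuple

# subject, predicate, object, graph (context); None means the default graph.
Quad = Tuple[str, str, str, Optional[str]]

BASE = "https://example.org/"  # hypothetical base IRI


def add_if(quads: List[Quad], subject: str, predicate: str,
           value: Optional[str], graph: Optional[str]) -> None:
    """Emit a quad only when the CSV cell is non-empty (the 'conditional')."""
    if value and value.strip():
        quads.append((subject, predicate, value.strip(), graph))


def row_to_quads(row: Dict[str, str]) -> List[Quad]:
    """Turn one row of the common CSV template into quads."""
    uid = row["uniqid"]
    context = BASE + "context/" + uid
    process = BASE + "process/" + uid
    output = BASE + "output/" + uid

    # Statements about the data element carry the context as fourth element.
    quads: List[Quad] = [
        (BASE + "person/" + row["pid"], "ex:participates_in", process, context),
        (process, "ex:has_output", output, context),
    ]
    add_if(quads, output, "ex:has_value", row.get("value"), context)
    add_if(quads, output, "ex:has_unit", row.get("unit"), context)

    # The context URI is itself the subject of temporal metadata.
    add_if(quads, context, "ex:start_date", row.get("startdate"), None)
    add_if(quads, context, "ex:end_date", row.get("enddate"), None)
    return quads


def transform(csv_text: str) -> List[Quad]:
    quads: List[Quad] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        quads.extend(row_to_quads(row))
    return quads


CSV_TEXT = """pid,uniqid,value,unit,startdate,enddate
p1,e1,72.5,kg,2023-01-10,
p1,e2,,,2023-02-01,2023-02-03
"""

if __name__ == "__main__":
    for quad in transform(CSV_TEXT):
        print(quad)
```

In this sketch, the second row has no value or unit, so no output-value statements are generated for it, while its start and end dates are still attached to its context URI.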
In addition, the transition from RDF Triples to RDF Quads required the addition of a new element (“graph”) in the YARRRML templates. Finally, a toolkit has been created to perform quality control, data manipulations, and other pre-processing steps to reduce the burden of accurate CSV generation by the users. This toolkit reorganizes the user-provided CSV template into its final form, compatible with the YARRRML template, prior to the RDF transformation step. All of these components have been linked into a larger data transformation and publication workflow called FAIR-in-a-Box (FiaB) [16], which utilizes a custom daemon to sequentially execute each transformation step within the confines of a Docker network, minimizing the exposure of any individual component to the internet, and finally loads the CARE-SM data into a GraphDB-based triplestore. Thus, the users of CARE-SM within FiaB need only generate a CSV file in order to become FAIR data publishers.

4.2. Mapping activities using CARE-SM

CARE-SM and its predecessor have been used in mapping activities against other standardized data models in the clinical data semantic community. Early initiatives had the objective of schema integration and harmonization between CDE-SM and both RDF and non-RDF-based schemas. One of these initiatives, in collaboration with the Critical Path Institute, led to the creation of common SPARQL [17] queries that could map both CDE-SM and the Critical Path Institute’s semantic schema by leveraging a “Rosetta Stone” of shared Biolink schema concepts [18]. Other initiatives are currently under development, such as the creation of an Extract, Transform, Load (ETL) workflow from CARE-SM to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) v5.3 [19], which we hope will prove capable of transforming any encounter-based data represented in CARE-SM to an equivalent Observational Health Data Sciences and Informatics (OHDSI) [20] representation.
Other initiatives focus on the creation of federated query tools. A Beacon-v2-compatible API [21] has been created for data discoverability and federation, parsing JSON-based Beacon requests into SPARQL queries executed over triplestores. The generic adaptation of CARE-SM to the Beacon API was possible due to having a common, predictable semantic data pattern for every Beacon data filter.

5. Conclusions

Compared with its predecessor, CARE-SM simplifies many aspects of FAIR Data publishing and reuse in the clinical space. Having a single CSV template means the data provider does not have to create a separate export routine for each data element, reducing the time required to generate the data extraction layer. Moreover, this allowed the consolidation of the numerous CDE-SM YARRRML templates into a single template, enabling easier maintenance and evolution. The data model consistency achieved by reusing a single design pattern simplifies querying, since the primary difference between data elements is the set of ontological classes that define the various sub-elements of a data type. Thus, through minor adjustments to an overall SPARQL query template, any of the CARE-SM data elements can be explored in the same way. This harmonization assists the creation of toolkits and APIs around the model, for example, the implementation of the Beacon API capable of transforming non-semantic JSON calls into a set of templated SPARQL queries. Through the “context” node of RDF-Quads, CARE-SM allows arbitrary data elements to be grouped, producing linkages between multiple data models, for example, the multiple data elements that arise from a single patient encounter with the healthcare system. Few resources in the rare disease space seem to be taking advantage of this RDF-Quad technology, despite it having been a well-documented and official W3C standard for RDF representation for about a decade.

References

[1] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N.
Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. G. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. C. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons, The FAIR Guiding Principles for scientific data management and stewardship, Sci Data 3 (2016) 160018. URL: https://www.nature.com/articles/sdata201618. doi:10.1038/sdata.2016.18, number: 1 Publisher: Nature Publishing Group.
[2] P. v. Damme, P. A. Moreno, C. H. Bernabé, A. C. Ballesteros, C. M. A. L. Cornec, B. D. S. Vieira, K. J. v. d. Velde, S. Zhang, C. Carta, R. Cornet, P. A. C. ’t Hoen, A. Jacobsen, M. A. Swertz, M. Roos, N. Benis, A Resource for Guiding Data Stewards to Make European Rare Disease Patient Registries FAIR, Data Science Journal 22 (2023) 12. URL: https://datascience.codata.org/articles/10.5334/dsj-2023-012. doi:10.5334/dsj-2023-012, number: 1 Publisher: Ubiquity Press.
[3] RDF 1.2 Concepts and Abstract Syntax. URL: https://www.w3.org/TR/rdf12-concepts/.
[4] Clinical And Registry Entries (CARE) Semantic Model, 2023. URL: https://github.com/CARE-SM/CARE-Semantic-Model, original-date: 2023-10-05T12:51:43Z.
[5] M. Dumontier, C. J. Baker, J. Baran, A. Callahan, L. Chepelev, J. Cruz-Toledo, N. R. Del Rio, G. Duck, L. I. Furlong, N. Keath, D. Klassen, J. P. McCusker, N. Queralt-Rosinach, M. Samwald, N. Villanueva-Rosales, M. D. Wilkinson, R. Hoehndorf, The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery, Journal of Biomedical Semantics 5 (2014) 14. URL: https://doi.org/10.1186/2041-1480-5-14. doi:10.1186/2041-1480-5-14.
[6] R. Kaliyaperumal, M. D. Wilkinson, P. A. Moreno, N. Benis, R. Cornet, B. dos Santos Vieira, M. Dumontier, C. H. Bernabé, A. Jacobsen, C. M. A. Le Cornec, M. P. Godoy, N. Queralt-Rosinach, L. J. Schultze Kool, M. A. Swertz, P. van Damme, K. J. van der Velde, N. Lalout, S. Zhang, M. Roos, Semantic modelling of common data elements for rare disease registries, and a prototype workflow for their deployment over registry data, Journal of Biomedical Semantics 13 (2022) 9. URL: https://doi.org/10.1186/s13326-022-00264-6. doi:10.1186/s13326-022-00264-6.
[7] T. Weldring, S. M. Smith, Article Commentary: Patient-Reported Outcomes (PROs) and Patient-Reported Outcome Measures (PROMs), Health Serv Insights 6 (2013) HSI.S11093. URL: https://doi.org/10.4137/HSI.S11093. doi:10.4137/HSI.S11093, publisher: SAGE Publications Ltd STM.
[8] Design Patterns · MaastrichtU-IDS/semanticscience Wiki. URL: https://github.com/MaastrichtU-IDS/semanticscience/wiki/Design-Patterns.
[9] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg, K. Eilbeck, A. Ireland, C. J. Mungall, N. Leontis, P. Rocca-Serra, A. Ruttenberg, S.-A. Sansone, R. H. Scheuermann, N. Shah, P. L. Whetzel, S. Lewis, The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration, Nat Biotechnol 25 (2007) 1251–1255. URL: https://www.nature.com/articles/nbt1346. doi:10.1038/nbt1346, number: 11 Publisher: Nature Publishing Group.
[10] OBO Foundry. URL: https://obofoundry.org/.
[11] Orphanet Rare Disease Ontology - Summary | NCBO BioPortal. URL: https://bioportal.bioontology.org/ontologies/ORDO.
[12] RDF 1.2 N-Quads. URL: https://www.w3.org/TR/rdf12-n-quads/.
[13] EJP RD – European Joint Programme on Rare Diseases. URL: https://www.ejprarediseases.org/.
[14] YARRRML. URL: https://rml.io/yarrrml/spec/.
[15] CARE Semantic Model Implementation, 2023.
URL: https://github.com/CARE-SM/CARE-SM-Implementation, original-date: 2023-10-09T15:57:30Z.
[16] FiaB: FAIR-in-a-box, 2022. URL: https://github.com/ejp-rd-vp/FiaB, original-date: 2022-12-12T10:34:05Z.
[17] SPARQL 1.1 Overview. URL: https://www.w3.org/TR/sparql11-overview/.
[18] P. Alarcon, I. Braun, E. Hartley, D. Olson, N. Benis, R. Cornet, M. Wilkinson, R. L. Walls, Leveraging Biolink as a “Rosetta Stone” Between C-Path and EJP-RD Semantic Models Provides Emergent Interoperability, Journal of the Society for Clinical Data Management 3 (2023). URL: https://www.jscdm.org/article/id/130/. doi:10.47912/jscdm.130, number: 1 Publisher: Society for Clinical Data Management.
[19] OMOP CDM v5.3. URL: https://ohdsi.github.io/CommonDataModel/cdm53.html.
[20] OHDSI – Observational Health Data Sciences and Informatics. URL: https://www.ohdsi.org/.
[21] J. Rambla, M. Baudis, R. Ariosa, T. Beck, L. A. Fromont, A. Navarro, R. Paloots, M. Rueda, G. Saunders, B. Singh, J. D. Spalding, J. Törnroos, C. Vasallo, C. D. Veal, A. J. Brookes, Beacon v2 and Beacon networks: A “lingua franca” for federated data discovery in biomedical genomics, and beyond, Human Mutation 43 (2022) 791–799. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/humu.24369. doi:10.1002/humu.24369, _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/humu.24369.