Privacy-Preserving Dashboard for F.A.I.R Head and Neck Cancer data supporting multi-centered collaborations

Varsha Gouthamchand1,2,∗, Ananya Choudhury1,2, Frank Hoebers1, Frederik Wesseling1, Mattea Welch3, Sejin Kim3, Benjamin Haibe-Kains3, Joanna Kazmierska4, Andre Dekker1,2, Johan van Soest5,1 and Leonard Wee1,2

1 Dept of Radiation Oncology (Maastro), GROW School of Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands
2 Clinical Data Science, Faculty of Health Medicine and Life Sciences, Maastricht University, Maastricht, The Netherlands
3 Radiation Medicine Program, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
4 Dept of Radiation Oncology, Greater Poland Cancer Centre II, Poznan, Poland
5 Brightlands Institute for Smart Society, Faculty of Science and Engineering, Maastricht University, Heerlen, The Netherlands

Abstract

Research in modern healthcare requires vast volumes of data from various healthcare centers across the globe. It is not always feasible to centralize clinical data without compromising privacy. A tool that addresses these issues and facilitates reuse of clinical data is therefore needed. The Federated Learning approach, governed by a set of agreements such as the Personal Health Train (PHT), tackles these concerns by distributing models to the data centers instead of following the traditional approach of centralizing datasets. One of the prerequisites of PHT is that datasets are semantically interoperable, so that the models are able to find them. FAIR (Findable, Accessible, Interoperable, Reusable) principles help in building interoperable and reusable data by adding knowledge representation and providing descriptive metadata. However, the process of making data FAIR is not always easy or straightforward. Our main objective is to simplify this process by using domain and technical expertise, and to prepare data for federated learning.
This paper introduces applications that are easily deployable as Docker containers, which automate parts of the aforementioned process and significantly simplify the task of creating FAIR clinical data. Our method removes the need for clinical researchers to have a high degree of technical skill. We demonstrate the FAIR-ification process by applying it to five Head and Neck cancer datasets (four public and one private). The PHT paradigm is explored by building a distributed visualization dashboard from the aggregated summaries of the FAIR-ified datasets. Using the PHT infrastructure to exchange only statistical summaries or model coefficients allows researchers to explore data from multiple centers without breaching privacy.

Keywords
FAIR, Knowledge graphs, Linked Data, Semantic Web, Ontologies, SPARQL, RDF, Federated Learning

SWAT4HCLS 2023: The 14th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences
∗ Corresponding author.
Email: varsha.gouthamchand@maastro.nl (V. Gouthamchand)
ORCID: 0000-0002-4756-2866 (V. Gouthamchand); 0000-0001-9847-8165 (A. Choudhury); 0000-0002-4317-9181 (F. Hoebers); 0000-0002-5887-9826 (M. Welch); 0000-0002-0422-7996 (A. Dekker); 0000-0003-2548-0330 (J. v. Soest); 0000-0003-1612-9055 (L. Wee)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org, http://ceur-ws.org), ISSN 1613-0073

1. Background

Real-world clinical data is defined as data relating to an individual person’s health status and/or the delivery of healthcare to a population that is routinely collected from a variety of sources. Such data is increasingly being used as evidence of treatment effectiveness as well as to guide clinical decision-making through the development of predictive prognostic models [2]. Re-use of real-world clinical data at scale presents two challenges.
First is a lack of syntactic interoperability, i.e., technical differences due to database organization and divergence of human languages. A more flexible alternative is an “open world” approach focusing on semantic interoperability, where the data can be queried and retrieved by independent external researchers [13] without having to know in advance the details of the database structure or the native coding schema of the data. Second, clinical data will usually be horizontally partitioned; healthcare institutions each own similar sets of data fields, but exclusively on their own human subjects. Due to the highly sensitive nature of patient medical data, great care must be taken if it is shared. Concerns over patient confidentiality and data controllership imply that it is not always an attractive option to aggregate all individual-level data into a few centralized repositories.

A federated learning paradigm attempts to address some of the privacy concerns. A privacy-by-design paradigm, e.g., the Personal Health Train (PHT) [14], exchanges only aggregated statistical information. A necessary consequence of using the PHT is that the data must first be made semantically interoperable for algorithms to analyze the data remotely and autonomously.

The FAIR (Findable, Accessible, Interoperable, and Reusable) data principles were developed to maximize the value of digital assets, including real-world clinical data, and they further emphasize making data interoperable for machine processors and not only for humans [20]. It is important to emphasize that FAIR data does not imply open data, and open data itself may not be FAIR. The purpose of FAIR is that a given community, e.g., researchers and cancer clinicians, can achieve a high degree of interoperability and reusability on each other’s data [8].
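
To make the aggregate-only exchange over horizontally partitioned data concrete, the following is a minimal sketch and not the PHT or Vantage6 implementation. It assumes two hypothetical sites that each hold age values for their own patients; the names `SiteSummary`, `local_summary` and `pooled_summary`, and the toy numbers, are illustrative assumptions. Each site reduces its records to a count, a sum and the extremes, and only those aggregates reach the central aggregator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SiteSummary:
    """Aggregates a site releases; no individual-level rows are included."""
    n: int
    total: float
    minimum: float
    maximum: float


def local_summary(ages: List[float]) -> SiteSummary:
    """Runs inside a site: reduce individual-level values to aggregates."""
    return SiteSummary(n=len(ages), total=sum(ages),
                       minimum=min(ages), maximum=max(ages))


def pooled_summary(summaries: List[SiteSummary]) -> Dict[str, object]:
    """Runs at the aggregator: combine per-site aggregates only."""
    n = sum(s.n for s in summaries)
    return {
        "n": n,
        "mean": sum(s.total for s in summaries) / n,
        "range": (min(s.minimum for s in summaries),
                  max(s.maximum for s in summaries)),
    }


# Hypothetical toy values, not taken from any dataset in this paper.
site_a = local_summary([61.0, 70.0, 55.0])
site_b = local_summary([48.0, 66.0])
pooled = pooled_summary([site_a, site_b])
print(pooled)
```

Sending the sum and count, rather than a per-site mean, lets the aggregator compute an exact pooled mean even when the sites differ in size.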
The Linked Data [3] concept elegantly captures some of the needs of FAIR by assigning machine-readable uniform resource identifiers (URIs) to data elements, as well as capturing the relationships between them. This attribute of linked data can be exploited to integrate disparate pools of data, even if they comprise different domains, e.g., clinical examinations and image-based biomarkers extracted from radiology scans. Independent but linked databases readable by machines over Hypertext Transfer Protocol (HTTP) make up a worldwide “Semantic Web” of FAIR data. Semantic web standards define a set of essential tools such as the Resource Description Framework (RDF) [10] and the SPARQL Protocol and RDF Query Language (SPARQL) [12], for storing and querying data, respectively. RDF represents data using a series of statements known as “triples”, i.e., subject-predicate-object.

This work defines a partly automated procedure to make structured real-world patient data more FAIR according to Semantic Web standards. Specialist competencies are leveraged collaboratively at the right place, i.e., clinicians and data creators focus on annotation using rich descriptions; domain experts and ontologists encode knowledge representation with data graph construction; and data scientists instantiate these annotations and integrate data from disparate sources using a single SPARQL query. Privacy-preserving federated learning is used to generate a visualization of aggregated cohort statistics across five private and public datasets in head-and-neck cancer, without revealing individual-level data.

2. Implementation

A schematic illustration of the general workflow to convert structured raw data to FAIR semantic web data is provided in Fig 1.
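
As a rough illustration of the first stage of this workflow, the sketch below turns a small table into subject-predicate-object triples and prints them in N-Triples style. It does not reproduce the output layout of any particular tool; the `http://example.org/` namespace, the function names and the two-row table with Dutch labels are all hypothetical.

```python
import csv
import io
from typing import List, Tuple
from urllib.parse import quote

Triple = Tuple[str, str, str]

# Hypothetical namespace, used only for this illustration.
BASE = "http://example.org/data/"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a quoted literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def table_to_triples(csv_text: str, table: str) -> List[Triple]:
    """Emit one triple per cell: row URI, column URI, cell value."""
    triples: List[Triple] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        subject = f"<{BASE}{quote(table)}/row/{row_index}>"
        for column, value in row.items():
            predicate = f"<{BASE}{quote(table)}/column/{quote(column)}>"
            triples.append((subject, predicate, f'"{escape_literal(value)}"'))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples one statement per line, each ending with ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical input: the column label and values are in Dutch on purpose.
csv_text = "id,geslacht\n0,vrouw\n1,man\n"
triples = table_to_triples(csv_text, "patients")
print(to_ntriples(triples))
```

At this stage the triples carry only the local column names and values; nothing yet says that `geslacht` means biological sex, which is the role of the annotation step.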
Our tooling consists of three principal parts: (1) a graphical user interface (GUI) to select data for serialization as RDF triples and for the data owner to attach some descriptive information, (2) a collaborative annotation step to attach one or more relevant domain ontologies and hence define a fit-for-purpose graph data structure, and (3) a means to query data across multiple FAIR data graphs via a single federated SPARQL query. We packaged these tools as Docker containers to make them platform-independent and easier to deploy; these are made open access (see Availability section).

2.1. Structured data conversion to FAIR data graph

The first step takes existing structured data, either comma-separated values (CSV) files or relational databases (any generic SQL format), and converts it into RDF. This component is provided with a GUI running locally where (non-technical) data owners simply browse and choose their dataset for processing. Triplifier [11] is a Java-based resource (integrated with the GUI) that automatically serializes a table into RDF as a Terse RDF Triple Language (Turtle, TTL) file [16] and compiles the schema of the ingested table as a database-specific Web Ontology Language (OWL) file [9]. The same GUI also asks data owners to add descriptions and metadata, such as the data type of each field (continuous, discrete, categorical ordinal, or patient identifier). The data owner is also able to attach some preselected definitions to their data fields and/or include supplementary information in free-text comments. The data owner’s pre-annotations are written directly into the aforementioned OWL schema file. The resulting TTL and OWL files are automatically saved in a graph database (we included a free version of GraphDB in our Docker deployment).

2.2. Annotation of datasets with ontologies

The procedure was specifically designed to decouple the contents of the database (in the TTL) from the dictionary/coding of the database (in the OWL), thus allowing an external collaborator to work on semantic annotations using the OWL without needing to read the actual contents of the data, which do contain person-specific and highly privacy-sensitive information. The data owner may thus share the OWL file either publicly, or privately with a defined collaboration of researchers, domain experts, clinicians and other data owners. For each distinct dataset, a semantically meaningful mapping of the database-specific entities onto a publicly accessible domain ontology (e.g., the Radiation Oncology Ontology (ROO) and the National Cancer Institute Thesaurus (NCIT)) is made through consensus and correspondence (e.g., through extended email discussions). We provide Python scripts that add unique dataset-specific annotations as a graph object (“annotation.local”) directly into the local GraphDB.

Importantly, if the use case changes (or if a new research question emerges) such that an alternative annotation is required, this can be easily implemented. Re-annotation of the data is always possible because the annotation sits on top of the original OWL and TTL, without needing to edit or modify any of the original schema or original contents. A SPARQL query external to the dataset then references the equivalencies in “annotation.local” in order to extract the correct entities and relations specific to the dataset.

2.3. Tagged release of software

The code for this project is made open access on GitHub (see Availability section); instructions for use are provided in associated markdown documents. We have provided a Docker installation containing a Jupyter notebook, GraphDB and Triplifier. All the Python scripts needed are packaged into the distribution as Jupyter notebooks.

2.4. PHT infrastructure

An open source Vantage6 infrastructure [7] (v0.2.4) implementing the PHT method has previously been used to develop and validate a federated Cox proportional hazards (CPH) model in anal cancer patients across three countries [15]. Full details of Vantage6 are given in its accompanying technical documentation [19].

2.5. Distributed dashboard aggregation of head and neck cancer data

A demonstration was previously made available as a preprint [4] and remains available from our GitHub repository as the “MedRxiv” branch. To recapitulate briefly, we created an automated process in which the clinical case-mix data in four different open access datasets on The Cancer Imaging Archive (TCIA) [1] were serialized using Triplifier, and each set of TTL and OWL files was inserted into its respective GraphDB database. On top of each of these GraphDB databases, Python scripts were executed that inserted local annotations on top of the TTL and OWL files, utilizing class entities and predicates from the NCIT and ROO ontologies.

This work extends the previous one by adding a hitherto unpublished private dataset (HN3). We used the aforementioned GUI and Triplifier to serialize the data and create the custom semantic ontology annotation for its RDF graph. We placed each of these five datasets in geographically dispersed Ubuntu virtual machines, with unique public IP addresses and network firewalls, but all connected to a Vantage6 infrastructure (illustrated schematically in Fig 2). We distributed a single SPARQL query through the PHT infrastructure to obtain aggregated cohort statistics (e.g., mean, range, etc.) from the federated datasets, then presented these in two ways: (i) an interactive Python visual dashboard built with the Plotly and Dash libraries, and (ii) a case-mix summary data frame that could be downloaded from the Vantage6 aggregation server as a comma-separated values (CSV) file.

3. Results

For open data, the interested reader can get the original clinical data frame directly from TCIA. An example fragment of TTL and OWL generated by Triplifier is shown in Fig 3. Note that (for ease of understanding) we have compressed the namespaces using standard semantic web notation in the bottom-left corner of the figure. A graphical representation of as-serialized TTL content for one subject is shown on the left side of Fig 4. For argument’s sake, the original contents might not be syntactically usable to the reader (e.g., the labels are in the Dutch language) and are not yet semantically interoperable. On the right side of Fig 4, we show how dataset-specific annotations mapped to the ROO and NCIT ontologies (including new descriptive predicates) render this data more FAIR. The inserted annotations are strictly additive, i.e., they do not alter or overwrite the as-serialized contents. For simplicity of visualization, we masked some of the schema classes and predicates auto-generated by Triplifier.

One can readily look up the unique URIs C25364, C28421 and C16576 in the NCIT and find definitions for “patient identifier”, “sex” and “female”, respectively. Likewise, with the ROO, the URIs P100061, P100018 and P100042 are resolved as predicates “has_identifier”, “has_biological_sex” and “has_value”, respectively. Where needed, numerical values may be supplemented by extra predicates and classes indicating the exact units of the measure, e.g., age in years, and follow-up time as intervals of days, months, or years. Additionally, if concepts like dates (Date of Birth/Death) need to be re-formatted to an agreed style (e.g., DD/MM/YYYY), an appropriate formatting task can be sent using the Vantage6 server to the data nodes.

A snapshot of an interactive visual dashboard is given as Fig 5, and we retrieved a case-mix aggregated summary table directly from the Vantage6 server as a CSV file.
The latter was then reformatted and tidied to provide Table 1 in the Supplementary material.

4. Discussion and Conclusion

We have produced a partly automated procedure that makes structured clinical data FAIR and available for distributed applications. Semantic interoperability and data linking have been achieved by local annotation with semantic ontologies, such that a single global SPARQL query using those ontologies will correctly filter the data. This illustrative use case was selected to show how common clinically relevant questions may be addressed via privacy-preserving federated learning. In our distributed dashboard approach, we envisage that partners in an established collaboration will be able to safely explore and inter-compare each other’s private data repositories for suitable subsets of patients, without violating patient privacy. If open datasets are also annotated and made FAIR in the abovementioned manner, then published via an accessible web address, they can also be efficiently queried en masse with a single query.

One of the most important and effective ways to make data FAIR is to assign a globally unique and persistent identifier to both the data repository and its linked metadata. The four open datasets here are unambiguously referenced using a Digital Object Identifier (DOI). Though the private dataset HN3 is not openly accessible, we do openly disseminate the readable description of the dataset plus the schema (OWL) and its semantic ontology annotations (“annotation.local”) in an open Zenodo repository with its unique persistent DOI for the metadata.

Assuming that re-casting the data into a universal master schema is not already done, our method proposes a flexible and adaptable means of applying semantic interoperability by means of annotation with open semantic ontologies.
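
The way a local annotation lets one ontology-level query filter differently coded datasets can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's scripts or its SPARQL query: plain Python tuples stand in for the RDF graphs, the two sites and their local class names (`s1:vrouw`, `s2:sex_F`) are hypothetical, and the only term taken from the text is the NCIT code C16576 for “female”.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

TYPE = "rdf:type"
EQUIVALENT = "owl:equivalentClass"
FEMALE = "ncit:C16576"  # NCIT code for "female", as cited in the Results


def count_instances(data: List[Triple], annotation: List[Triple],
                    concept: str) -> int:
    """Count subjects whose local class is annotated as equivalent to concept."""
    local_classes: Set[str] = {
        s for s, p, o in annotation if p == EQUIVALENT and o == concept
    }
    return len({s for s, p, o in data if p == TYPE and o in local_classes})


# Hypothetical site 1: as-serialized data with Dutch labels.
site1_data = [
    ("s1:patient0", TYPE, "s1:vrouw"),
    ("s1:patient1", TYPE, "s1:man"),
    ("s1:patient2", TYPE, "s1:vrouw"),
]
# The annotation is a separate graph; the data triples stay untouched.
site1_annotation = [("s1:vrouw", EQUIVALENT, FEMALE)]

# Hypothetical site 2: a different local coding of the same concept.
site2_data = [("s2:p7", TYPE, "s2:sex_F")]
site2_annotation = [("s2:sex_F", EQUIVALENT, FEMALE)]

sites = {
    "site1": (site1_data, site1_annotation),
    "site2": (site2_data, site2_annotation),
}
# The same ontology-level question is asked of every site.
per_site_counts: Dict[str, int] = {
    name: count_instances(data, annotation, FEMALE)
    for name, (data, annotation) in sites.items()
}
print(per_site_counts)
```

Because the mapping lives in its own graph, swapping in a different annotation for a new research question would leave the as-serialized triples unchanged.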
In the Linked Data paradigm, each data entity as well as its relationship to other data entities is traceably and collectively mapped to a unique and persistent identifier. Every instance of the same identifier must mean semantic equivalence, entirely irrespective of the human-readable label, which is generally in the data owner’s own language. The ontologies not only establish the terminology and definitions but also include some knowledge representation that allows machine-assisted logical inferencing to be applied.

Figure 1: Making clinical data accessible as a FAIR graph database object

Availability

Project name: Flyover (tagged release: v1.0, preprint demonstration project branch: MedRxiv)
Project home page: https://github.com/MaastrichtU-CDS/projects_flyover_project
Zenodo Repository DOI: https://doi.org/10.5281/zenodo.7190551

Four public datasets were obtained from TCIA. RADIOMICS-HN1 [18] comprises clinical data, volumetric CT and PET of 137 patients with laryngeal carcinoma and oropharyngeal cancer (OPC) treated by radiotherapy (RT) alone or concurrently with either cisplatin or cetuximab. HNSCC contains clinical data and contrast-enhanced CT scans of 627 OPC patients [5]. OPC-Radiomics has clinical data and CT scans of 606 OPC subjects, treated by either radiotherapy or chemoradiotherapy between 2005 and 2010 [6]. HEAD-NECK-PET-CT [17] comprises 298 subjects with multiple subsites of head and neck cancer (HNC), each with clinical descriptors, PET and planning CT, treated between April 2006 and November 2014. The HN3 dataset is not publicly available at the present time due to material that is potentially identifiable to an individual.

Figure 2: Schematic illustration of the Vantage6 infrastructure used in this work to show a likely clinical use case

Figure 3: Example showing the expected output of Triplifier processing. A hypothetical input table is shown on the left. Namespace aliases are used to improve readability.
A fragment of the serialized database contents is shown top right in the TTL file, and a part of the database schema is shown with a database-specific ontology in the OWL file at bottom right.

Figure 4: Examples showing the knowledge graph of a patient “0” with classes ID and biological sex. The image on the left is from Triplifier after the data has been converted to RDF. The image on the right shows the graph after the annotation graph has been added. Double-sided green arrows with double crossing bars are the predicate owl:equivalentClass.

Figure 5: Distributed Dashboard

References

[1] Kenneth Clark et al. “The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository”. In: Journal of Digital Imaging 26.6 (Dec. 2013), pp. 1045–1057. issn: 1618-727X. doi: 10.1007/s10278-013-9622-7. url: https://doi.org/10.1007/s10278-013-9622-7.
[2] Office of the Commissioner. Real-World Evidence. Publisher: FDA. Oct. 2022. url: https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence.
[3] Data - W3C. url: https://www.w3.org/standards/semanticweb/data.
[4] FAIR-IFICATION OF STRUCTURED CLINICAL DATA | medRxiv. url: https://www.medrxiv.org/content/10.1101/2021.07.23.21261032v3.full.
[5] Aaron Grossberg et al. HNSCC. Version number: 2. Type: dataset. 2020. doi: 10.7937/K9/TCIA.2020.A8SH-7363. url: https://wiki.cancerimagingarchive.net/x/sIN5Ag.
[6] Jennifer Yin Yee Kwan et al. Data from Radiomic Biomarkers to Refine Risk Models for Distant Metastasis in Oropharyngeal Carcinoma. Type: dataset. 2019. doi: 10.7937/TCIA.2019.8DHO2GLS. url: https://wiki.cancerimagingarchive.net/x/XAQGAg.
[7] Arturo Moncada-Torres et al. “VANTAGE6: an open source priVAcy preserviNg federaTed leArninG infrastructurE for Secure Insight eXchange”. In: AMIA Annual Symposium Proceedings 2020 (Jan. 2021), pp. 870–877. issn: 1942-597X. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8075508/.
[8] Open Data and FAIR Data: differences and similarities | Plataforma OGoov. May 2019. url: https://www.ogoov.com/en/blog/open-data-and-fair-data-differences-and-similarities/.
[9] OWL - Semantic Web Standards. url: https://www.w3.org/OWL/.
[10] RDF - Semantic Web Standards. url: https://www.w3.org/RDF/.
[11] Johan van Soest et al. “Annotation of existing databases using Semantic Web technologies: making data more FAIR”. In: (), p. 8.
[12] SPARQL - Semantic Web Standards. url: https://www.w3.org/2001/sw/wiki/SPARQL.
[13] Syntactic and Semantic Interoperability | Electrosoft. url: https://www.electrosoft-inc.com/resources/syntactic-and-semantic-interoperability.
[14] The Personal Health Train Network | The Personal Health Train. url: https://pht.health-ri.nl/personal-health-train-network.
[15] Stelios Theophanous et al. “Development and validation of prognostic models for anal cancer outcomes using distributed learning: protocol for the international multi-centre atomCAT2 study”. In: Diagnostic and Prognostic Research 6.1 (Aug. 2022), p. 14. issn: 2397-7523. doi: 10.1186/s41512-022-00128-8. url: https://doi.org/10.1186/s41512-022-00128-8.
[16] Turtle - Terse RDF Triple Language. url: https://www.w3.org/TeamSubmission/turtle/.
[17] Martin Vallières et al. Data from Head-Neck-PET-CT. Type: dataset. 2017. doi: 10.7937/K9/TCIA.2017.8OJE5Q00. url: https://wiki.cancerimagingarchive.net/x/24pyAQ.
[18] Leonard Wee and Andre Dekker. Data from Head-Neck-Radiomics-HN1. Type: dataset. 2019. doi: 10.7937/TCIA.2019.8KAP372N. url: https://wiki.cancerimagingarchive.net/x/iBglAw.
[19] Welcome. url: https://docs.vantage6.ai/.
[20] Mark D. Wilkinson et al. “The FAIR Guiding Principles for scientific data management and stewardship”. In: Scientific Data 3.1 (Mar. 2016), p. 160018. Publisher: Nature Publishing Group. issn: 2052-4463. doi: 10.1038/sdata.2016.18. url: https://www.nature.com/articles/sdata201618.
Supplementary Table 1: Patient Demographics Table from five data nodes

                            HN1      HNSCC    OPC      HEAD-NECK   HN3
Sample size                 137      492      606      298         165
Age in years
  Mean                      61.9     57.8     60.5     63.3        62.6
  Range                     44-83    28-87    33-89    18-90       29-84
Sex
  Female                    26       69       125      71          43
  Male                      111      423      481      227         122
Tumour stage
  T1                        35       92       103      39          14
  T2                        32       203      198      109         31
  T3                        24       117      183      94          68
  T4                        46       80       122      46          52
  Tx                        -        -        -        10          -
Nodal stage
  N0                        60       45       101      59          48
  N1                        16       53       61       40          45
  N2                        58       378      397      180         54
  N3                        3        16       47       19          18
  Nx                        -        -        -        -           -
Metastasis stage
  M0                        136      492      606      294         165
  M1                        1        0        -        0           0
  Mx                        -        -        -        4           -
Overall stage (7th ed.)
  I                         24       3        11       4           -
  II                        11       16       38       27          -
  III                       23       67       85       61          -
  IV                        79       406      472      204         -
  Unspecified               -        -        -        2           -
Tumour location
  Nasopharynx               -        -        -        28          -
  Oropharynx                88       492      606      203         63
  Hypopharynx               -        -        -        13          31
  Larynx                    49       -        -        45          64
  Unknown                   -        -        -        9           -
HPV status
  Positive                  23       248      356      78          34
  Negative                  58       44       143      46          29
  Unknown                   56       200      107      174         102
Radiotherapy type
  Radiotherapy              100      57       309      48          104
  Chemoradiotherapy         37       435      297      250         61
Survival status
  Censored                  63       376      347      242         77
  Deceased                  74       116      259      56          88