FAIR quantitative imaging in oncology: how Semantic Web and Ontologies will support reproducible science A. Traverso1, Z. Shi1, L. Wee1, A. Dekker1 1Department of Radiation Oncology (MAASTRO), GROW – School for Oncology and Development Biology, Maastricht University Medical Center, Maastricht, the Netherlands Abstract. The automated extraction of quantitative imaging biomarkers from patient’s scans, could augment physician decision making in radiation oncology. Unfortunately, lack of reproducibility and robust methodology current limits this promising field to be applied in the clinic. In this paper, we state how the combination of quantitative medical imaging with Semantic Web and Ontologies techniques could speed up the role of quanti- tative imaging. Keywords: Ontologies, Semantic Web, Quantitative Imaging, Radiation Oncol- ogy. 1 Introduction 1.1 A new era of medical imaging: from images to big data Medical imaging has expanded its fundamental role in radiation oncology since the advent of the first Computed Tomography (CT) scans in the 70s, followed by PET (Positron Emission Tomography) and MRI (Magnetic Resonance Imaging). Radiolog- ical examination has moved from purely descriptive to semi-quantitative and fully automated analysis. In the recent years, the availability of enterprise digital imaging and the overflowing role of AI (Artificial Intelligence, like Machine Learning) domain (e.g. machine learning) led to the development of many quantitative imaging models aimed at assisting and augmenting physician decision-making. The term “radiomics” was first created in 2012 and it describes the process of advanced quantitative clinical imaging analysis in medicine. The hypothesis behind radiomics is that tumor biological properties, often obtained by invasive techniques such as tissue biopsies, can be measured in a non-invasive fashion via extracting image-based descriptors (referred as ‘features’) from medical images [1]. After 2012, the number of radiomics computational packages has increased [2]. How- ever, no consensus has been reached: a) on the optimal configuration that should be used to extract these features for a problem; b) about the robustness of radiomics features when evaluated in different contexts. Therefore, most of the users simultaneously extract features using different parameters, leading to an increase of the number of features. Typical radiomic studies often extract from 500 to 10000 features while starting only from 100 unique features [3] . We are now facing the same Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 “data explosion” defined by Rubin about multi-detector row CT scanners. One main difference divides the two processes: if the CT data explosion was mainly driven by an advance in hardware development, producing more images faster than expected; the new quantitative imaging data explosion is driven by automated imaging analysis computational pipelines that produce a large amount of processed data (e.g. radiomic features) from medical images. This data seems mimicking all the attributes of big data: a) volume: the large amount of data to be processed and analyzed via machine learning requires now dedicated computational power and powerful machine learning able to deal with a large hyperspace of parameters; b) velocity: new data are generated faster as soon as new computational radiomics software become available, with a larger hyperspace of parameters that can be tuned for features extraction; c) variety: not only singe features should be stored in quantitative imaging, but also information about the original source (image, region of interest, computational details) making the data variety larger; d) veracity: in the hyperspace determined by features and associated metadata, some information could be redundant and only meaningful one should be extrapolated [3]. For all the above-mentioned reasons, quantitative imaging strictly connects to the world of big data. We believe that extending the usage of ontologies and Semantic Web technologies to quantitative imaging could help solving some of the issues that would be presented in the next paragraph and further speed up the adoption and acceptance of new image based quantitative biomarkers in the clinic. 1.2 Reproducibility crisis in quantitative imaging Still a strong unbalance exists between published radiomics-based prediction models and their real usage as decision support systems in the clinic [4]. The lack of reproducibility and transparency in radiomics is the major slowdown of its applicability in the clinic [5]. The lack of reproducibility mainly relates to the fact that most radiomics-based models are built on limited-datasets and often vali- dated in one single institution, with no guarantee of generalizability power when ap- plied to multiple centers. This evidence also seems colliding with recommendations from the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for In- dividual Prognosis or Diagnosis), suggesting and encouraging TRIPOD IV-type mod- els, which are fully validated on completely independent external datasets [5]. TRIPOD IV models are based on the possibility for an external user to fully reproduce and vali- date a previously developed model. Unfortunately, this reproducibility crisis reflects not only on the difficulty for external users to fully reproduce a radiomics experiments developed in another institution, but also within the same institution. This issue mainly connects to the previously mentioned concept of lack of transparency. In absence of a standardized and structured way of describing radiomics studies, most of them only report single feature names or values, with no further details on how the model was developed, how the features were computed and which where the computa- tional parameters used (metadata). Even in presence of publications that made available software and datasets, re-usability and inter-operability remain issues. It is not unlikely 3 that two software could call a radiomic feature with the same name but meaning a to- tally different quantitative descriptor. On the other hand, two features could express the same quantitative descriptor but show different values when computed with different software. Without then associated metadata, it is impossible to find the reasons behind this discrepancy, which probably lie in a different choice of hyperparameters. It becomes then clear that quantitative imaging is far behind the FAIR principles that are taking the scene in clinical data science as incentive for reproducible and transparent science [6]. However, the absence of FAIR guiding principles represents a unique op- portunity for the imaging community to propose a new paradigm for a new era repro- ducible quantitative imaging. We believe that ontologies and Semantic Web techniques should guide this effort toward reproducible, transparent quantitative imaging. On the other side, the imaging community needs to accept the challenge to work closely with the data science community and re-use as much as possible available tools. A possible framework and the ongoing actions taken by our group are presented in the following paragraph. 2. Proposed solution 2.1 Ontologies for quantitative imaging: a dynamic body of knowledge to enhance consensus Ontologies represent a formal specification of the terms related to a specific domain and the relations among them [7]. In this specific case, an ontology for quantitative imaging should mimic the workflow that happens during a radiomic study: from image pre-processing, region of interest definition, computational settings definition and fi- nally features extraction, as presented in [6]. Therefore, the ontology not only should include the main radiomic features and their corresponding units, but also all the metadata that relate to the above-mentioned workflow. In this view, building this on- tology is a joint exercise between imaging research groups to represent the state of the art of the knowledge related to the quantitative imaging domain. The ontology acts as harmonizer and standardizer, eliminating barriers related to different nomenclature or labels. In fact, each concept in the ontology is universally defined and the whole com- munity agrees on its meaning. For example, the ontology universally defines the radi- omics features by describing them and associating a unique identifier and their prove- nance. In this view, it enhances consensus and creates a shared knowledge domain. It represents a dynamic body of knowledge that can be expanded with new concepts as the quantitative imaging field evolved (for example by introducing and defining new imaging features or computational methods). Our group took the lead in developing an extensive radiomics ontology (RO), released on the BioPortal (https://bioportal.bioon- tology.org/ontologies/RO) as door-opener for FAIR quantitative imaging. Recently, we published a modular python tool for making radiomics computations FAIR [8]. Finally, ontologies express concepts in a machine-readable language and therefore, when data and metadata are transformed via the ontology, they can be automatically parsed by 4 machines. This becomes of fundamental utility when comparing results computed from different software or under different conditions. If each radiomics computational pack- age is setup to produce ontologies-labelled data and metadata, then automated meta analyzes can be performed and this will open the path to data-driven standardization and harmonization. A summary of the concept behind the RO and possible applications is depicted in Figure 1. Figure1: the RO workflow. Not only standardization of the radiomics workflow is achieved, but the same instrument can be used to enhance the reproducibility and vali- dation of radiomics-based prediction studies 2.1 Semantic Web: linking quantitative imaging with multiple domains Semantic Web has the power to extract knowledge from data labelled via ontologies, using dedicated SPARQL language. If radiomics data and metadata are transformed via the Radiomics Ontology and published on the Semantic Web, then they can be queried using the universal concepts defined by the ontology, without any prior knowledge on the original labels present in the original software. Also, the combination of ontologies and Semantic Web tech- niques allows parsing and joining data and metadata from multiple sources, such as different databases. For example, in a typical radiomics-based prediction study it could be interesting to query a) the value of a certain feature b) computed on an imaging modality c) referring to a patient with a certain disease; d) finding patients with sim- ilar feature values but different clinical outcomes for comparison. As it is clear from this example, that type of query requires merging radiomics data (a); DICOM metadata (b); clinical data (c), and data from other clinics (d). Sooner, additional sources of data 5 such as for example genomics data or pathology data, for better predictions and for exploring connections with medical images will be needed. Our group has developed a portfolio of ontologies for guaranteeing the road to FAIR compliant and transparent prediction models in radiation oncology: the ROO (Radiation Oncology Ontology) [7], the SEDI (Semantic DICOM Ontology) [8] and the presented RO. We successfully showed how this workflow can be used in combination with Se- mantic Web for winning barriers related to data sharing and build more accurate models (distributed learning) [9]. For example, we successfully reproduced a classical cen- tralized radiomics study [10] in a distributed fashion using the above-mentioned ontol- ogies combined with Semantic Web [8]. By using only SPARQL queries we could re- trieve the model and computational details of the model trained at one local institution and externally validated on the second one. We believe the upcoming effort should focus on developing additional ontologies that could link the quantitative imaging domain with data from multiple sources pre- sented above. Finally, we state that ontologies and Semantic Web are the key for speeding up re- producible science. Therefore, the quantitative imaging community should work closely with experts from the semantics, FAIR and data science fields to provide a sus- tainable infrastructure for medical imaging and derived big data. References [1] R. J. Gillies, P. E. Kinahan, and H. Hricak, ‘Radiomics: Images Are More than Pictures, They Are Data’, Radiology, vol. 278, no. 2, pp. 563–577, Feb. 2016. [2] L. E. Court, X. Fave, D. Mackin, J. Lee, J. Yang, and L. Zhang, ‘Computational resources for radiomics’, Transl. Cancer Res., vol. 5, no. 4, pp. 340–348, Aug. 2016. [3] V. Kumar et al., ‘Radiomics: the process and the challenges’, Magnetic Resonance Imaging, vol. 30, no. 9, pp. 1234–1248, Nov. 2012. [4] I. Buvat and F. Orlhac, ‘The Dark Side of Radiomics: On the Paramount Im- portance of Publishing Negative Results’, J Nucl Med, vol. 60, no. 11, pp. 1543– 1544, Nov. 2019. [5] A. Traverso, L. Wee, A. Dekker, and R. Gillies, ‘Repeatability and Reproducibility of Radiomic Features: A Systematic Review’, International Journal of Radiation Oncology*Biology*Physics, vol. 102, no. 4, pp. 1143–1158, Nov. 2018. [6] M. D. Wilkinson et al., ‘The FAIR Guiding Principles for scientific data manage- ment and stewardship’, Scientific Data, vol. 3, p. 160018, Mar. 2016. [7] ‘Ontologies’, in Ontology Learning and Population from Text, Springer US, 2006, pp. 9–17. [8] Z. Shi et al., ‘Distributed radiomics as a signature validation study using the Per- sonal Health Train infrastructure’, Sci Data, vol. 6, no. 1, p. 218, Dec. 2019.