Reliable Central Nervous System Tumor Differentiation on MRI Images with Deep Neural Networks and Conformal Prediction Luis Balderas1,* , María Moreno de Castro1 , Miguel Lastra2 , Jose P. Martínez3,4 , Francisco J. Pérez3 , Antonio Laínez3,4 , Antonio Arauzo-Azofra5 and José M. Benítez1,4 1 School of Sciences, Technology and Engineering’s Doctorates. Department of Computer Science and Artificial Intelligence, DiCITS, DaSCI, iMUDS, University of Granada, 18071, Granada, Spain 2 Department of Software Engineering, DiCITS, DaSCI, iMUDS, University of Granada, 18071, Granada, Spain 3 Radiodiagnosis Service, Hospital Universitario “Virgen de las Nieves”, Granada, Spain 4 ibs.Granada: Instituto Biosanitario de Granada, Granada, Spain 5 Rural Engineering Department, DiCITS Lab, University of Córdoba, 14005, Córdoba, Spain Abstract Central nervous system tumors, particularly gliomas, rank among the top 10 causes of cancer-related deaths worldwide. Thus, precise differentiation of these tumors is crucial for effective treatment, which can reduce patient suffering and lower mortality rates. We propose a deep learning-based technique for glioma differentiation using MRI imaging, which incorporates two novel approaches. First, we represent perfusion sequences as time series, which serve as inputs for a Deep Neural Network (DNN) classifier. This classifier is trained to distinguish the sequences between low-grade glioma (LGG) and high-grade glioma (HGG). Second, we employ Conformal Prediction to calibrate the results, ensuring they include the true category with a 90% probability. This approach has been rigorously tested to evaluate its performance. Our experimental results demonstrate that our deep neural network not only provides accurate but also ensures trustworthy predictions. 1. Introduction Central nervous system (CNS) tumors have traditionally been regarded as a lethal form of cancer, ranking among the top 10 causes of deaths globally. The GLOBOCAN project using nationwide data from 184 countries estimated the annual incidence of malignant-only CNS tumors to be 3.4 per 100,000 individuals [1]. Within the variety of tumor types, gliomas are the most frequent primary intracerebral tumors in adults. Despite their high fatality rates, the prognosis of CNS tumors has considerably improved over the last decades, possibly because of the prompt detection, the optimization of treatment protocols including the introduction of temozolomide and the advances in neurosurgical procedures. Accurate preoperative differentiation of primary CNS tumors is crucial because the treatment strate- gies differ substantially (e.g. stereotactic biopsy + chemotherapy vs. gross total resection + chemotherapy in lymphoma vs. glioblastoma). Magnetic Resonance Imaging (MRI) is the most important non-invasive technique in the diagnosis and differentiation of CNS tumors. In this context, MRI-perfusion is an imaging technique frequently used to assess the vascularization of brain lesions in a minimally invasive way. There are different approaches for the study of perfusion and lesions of CNS. Machine learning has been employed in numerous medical problems as part of expert decision support systems. However, in most instances, these systems do not provide any information regarding the confidence of their predictions. In this paper we propose an innovative approach that treats perfusion curves as time series data and apply Deep Neural Networks (DNN) and Conformal Predictors (CP) to assess the type of glioma a patient suffers, along with measures of confidence. EXPLIMED - First Workshop on Explainable Artificial Intelligence for the medical domain - 19-20 October 2024, Santiago de Compostela, Spain * Corresponding author.  0000-0002-3845-8848 (L. Balderas); 0000-0003-0440-0864 (M. Moreno de Castro); 0000-0002-7278-2668 (M. Lastra); 0000-0002-0663-2856 (A. Laínez); 0000-0002-2486-5792 (A. Arauzo-Azofra); 0000-0002-2346-0793 (J. M. Benítez) © 2024 Copyright © for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings In essence, this involves framing the differentiation problem as a classification task using the perfusion curve as input. This requires finding a suitable representation for the series and then identifying a technique that delivers high classification performance. On top of the classifier, a conformal predictor is built to quantify the level of uncertainty inherent in the model’s predictions using specific metrics. Our research has resulted in an effective automated procedure to assist radiologists in their work, contributing to the quantification of uncertainty at the patient level, with applications in personalized and precision medicine. The structure of the article is as follows: In Section 2, we present state-of-the-art methods that use deep learning and conformal prediction for glioma differentiation, including medical context for a better understanding of the problem. In Section 3 we detail Conformal Predictors. In Section 4, we present our proposal for glioma differentiation. In Section 5, we include the empirical study and discussion of the results. Finally, Section 6 contains the conclusions of the work. 2. Related work 2.1. Medical contextualization As noted in the introduction, gliomas are the most frequent primary intracerebral tumors in adults. The age-adjusted annual incidence of histologic verified glioma has been reported to be 7.3 cases per 100,000 person-years [2]. Most patients with gliomas have a fatal prognosis, and the disease has considerable impact on patients and their families’ physical, psychological, and social status. A recent study on the epidemiology of gliomas found that high-grade gliomas —HGG, grades 3 and 4 according to World Health Organization (WHO)— were present in 85% and low-grade glioma (LGG, WHO 1 and 3 grades) in 15% of the cases, with 5-year overall survival of 82, 54, 22 and 3% for grade 1, 2, 3, and 4, respectively [3]. Accurate preoperative differentiation of primary CNS tumors is essential because the treatment strategies differ substantially. MRI is the most important non-invasive technique in the diagnosis of CNS tumors. In particular, MRI-perfusion is frequently used to assess the vascularization of brain lesions. There are different approaches for the study of perfusion and lesions of CNS. The most widely used MRI-perfusion approach in clinical practice is T2 gradient echo or dynamic susceptibility contrast (DSC) due to its high sensitivity and specificity to differentiate lesions and its short acquisition time. DSC perfusion allows the study of changes in the signal intensity of the tissues derived from the passage of an intravenous gadolinium bolus administered in a peripheral vein. The passage of the contrast bolus leads to specific imaging patterns in different tissues and tumors, with changes in their signal intensity proportional to the amount of gadolinium that passes through or is deposited in them. Changes in signal intensity over time allow a curve to be obtained, from which hemodynamic parameters are inferred. At present, pathological anatomy techniques are essential in the diagnosis and genetic and histological differentiation of gliomas. However, in the presence of lesions such as metastases, an adequate diagnostic imaging can avoid a brain biopsy. The procedure of stereotactic brain biopsy entails a considerable risk of complications (e.g., haemorrhage, infection, wound breakdown), which can be as high as 15.3% in tumors located in midline areas, while 28.8% of post-procedural intracranial haemorrhages have been reported [4]. A recent Cochrane review [5] reported limited available evidence on the ability of perfusion MRI to distinguish HGG from LGG, which precludes reliable estimation of the performance of DSC MRI perfusion-derived parameters (e.g., relative cerebral blood volume or rCBV) for determining the tumor grade, specifically in untreated solid and non-enhancing LGG vs. HGG. MRI-perfusion has also been advocated as a useful tool to differentiate between gliomas and other tumors of the CNS, such as lym- phomas or metastases, but with variable diagnostic yields depending on the publications. Furthermore, it demonstrated to be useful in differentiating between tumor recurrence of gliomas and post-treatment changes such as pseudo progression or radionecrosis. However, the available evidence is restricted to isolated non-clinically validated studies. Most of these publications have focused on a classical analysis of MRI perfusion data, with special emphasis on the quantification of rCBV (an estimate of tumor perfusion), which is not quantifiable in absolute terms (thus requiring comparison with areas of normal cerebral parenchyma). A number or publications that have applied Artificial Intelligence algorithms to the analysis of brain perfusion can be found. They are discussed in section 2.2. 2.2. State-of-art Artificial Intelligence methods In this section, we present the most relevant articles in the state of the art related to the topic. To the best of our knowledge, there is no technique in the literature that utilizes the perfusion curve as input for a time series classification model in low-grade gliomas (LGG) and high-grade gliomas (HGG). Moreover, there are numerous approaches in the literature that apply CP to quantify the uncertainty of deep learning models ([6], [7]). In fact, the combination of deep learning and CP is especially useful in biomedical problems. For example, in skin lesion classification ([8]), for the diagnosis and grading of prostate biopsies ([9]), for rating breast density in mammography ([10]), for grading the severity of spinal stenosis in lumbar spine MRI ([11]); among other applications ([12], [13], [14]), we have not encountered in the literature any work that combines DNN and CP methods in the differentiation of central nervous system tumors. However, the non-invasive differentiation of gliomas through the application of machine learning, specifically distinguishing between LGG and HGG gliomas, has been extensively investigated in recent years. For instance, in [15], a substantial number of radiological features were extracted from MRI sequences, including T1-weighted, T1-weighted contrast-enhanced, T2-weighted, and FLAIR, across a total of 285 cases (210 HGG, 75 LGG). These features were used to train three classifiers (logistic regression, random forest, and support vector machine) to determine the glioma type, achieving an average AUC of 0.9030 for test cohorts. In [16], a deep multi-scale 3D Convolutional Neural Network (CNN) architecture was proposed to categorize gliomas into LGG and HGG using volumetric T1 contrast-enhanced MRI sequences, achieving an accuracy of 96.49%. Recently, some studies advocate mixed approaches incorporating molecular profiling for the differen- tiation of LGG and HGG ([17]). For instance, in the work by authors in [18], clinical and laboratory data were integrated to create a tool for predicting the molecular status (ATRX, IDH1/2, MGMT, and 1p19q co-deletion), distinguishing between low-grade and high-grade gliomas. The system achieved an AUC of 0.885 for this specific learning task. Finally, other studies such as [19], [20], [21], and [22] utilized cases of high-grade or low-grade gliomas to conduct specific studies on various features, but did not present tools for classification between these two grades of gliomas. 3. Conformal predictors In spite of their generally accurate performance, many machine learning models are known to estimate poorly prediction probabilities. To deliver reliable estimates and avoid over or under-confidence predictions, prediction uncertainty must be quantified. This decision aligns with the demand of trustworthy Artificial Intelligence, which, as expressed in recent European Parliament AI regulations [23], includes as a requirement the models ability to quantify and communicate their confidence in their outputs ([24], [6]) with special focus on high-risk applications as AI-supported medicine [25]. Performance metrics, such as accuracy or F1-score, cannot tell about the confidence or uncertainty of the model’s predictions. Even for a model that is right most of the time, not all the cases are equally easy to classify. Some of them might even be doubtful (e.g., they belong to a different distribution or their kind was mostly under represented, overlooked, or misclassified). For example, we can think about patients images from a different hospital, taken from a different MRI scanner, or labeled by another team. The model may provide unnoticed overly confident predictions because machine learning algorithms do not include a built-in warning mechanism to prevent well-informed predictions from looking the same as wild guesses [26] and refers to the alignment between predicted probabilities directly provided by the model and relative frequencies in the actual data. Well-calibrated models are those where the predicted probabilities match the empirical probabilities. Thus, it is about calculating “the probability that the predicted probability is right”. For example, given an MRI scan, if the model predicts class A with 80% probability, then class A should occur approximately 80 times out of 100 predictions. More formally, a well-calibrated classifier satisfies the following formula [27]: 𝑃 (actual class is A|predicted probability of A is 𝑝) ≈ 𝑝 (1) To provide confidence to classification results, calibration models should be applied. Two of the most popular calibration methods are Platt’s Scaling and Isotonic Regression. Platt’s Scaling [28] is a very popular method that maps each prediction to its empirical frequency by passing it through a sigmoid. The method is therefore parametric and presupposes normally distributed and heteroscedastic per-class scores, a notably limiting assumption [29]. On the other hand, Isotonic Regression [30] is more general because the map is performed with an isotonic (monotonically increasing) function. It assumes that the classifier perfectly ranks objects in the test set, essentially implying an ROC AUC of 1 and it is not recommended for small sets [29]. A detailed explanation of how both calibration methods can be found in [31]. One step ahead of calibration is conformalization. To conformalize means that each single-point prediction is not only calibrated but also provided with a prediction interval, with statistical guarantees of including the true value at patient-level ([27], [32], [29]). Conformal Prediction (CP) methods conformalize the prediction. CP methods, which are distribution-free and lightweight, have shown some successful application in health-related domains, mostly for cancer prediction ([33], [34], [35], [32],[36], [34], [37], [38], [39]). Venn-ABERS methods ([40], [27]) are members of the CP family [41], [42], [43]). They share the skill to evaluate how different is an unseen instance from the training dataset. These methods were created to work on top of binary classifiers (although new implementations work for multi-class problems too [44]). They conformalize the predicted probability to match the actual frequency of its class, while delivering in the process the upper (𝑝1 ) and lower (𝑝0 ) bounds of the probability interval that includes the true class (a property known as coverage or validity) with statistical guarantee. This prediction interval quantifies the uncertainty of predictions at instance level: the larger the interval, the lower the confidence the model has in that prediction. Venn-ABERS are an adaptation of Isotonic Regression (where it is applied twice, to fit the probability of each class) and a special case of Venn Predictors ([45], [46] ,[29]), from which they inherit guaranteed validity in the form of well-calibrated probability prediction. There are two possible implementations of the Venn-ABERS algorithm depending on whether or not an independent calibration set is required separate from the training set. On one hand, an inductive approach, or Inductive Venn-ABERS (I-V-A), where the calibration is trained over a hold-on split of the training set, thus the conformalization is model-agnostic [46]. On the other hand, Cross Venn- ABERS (C-V-A) where the splits are performed via k-fold, thus there is not need to reserve data just for calibration as in the inductive case and we use all available data to train the classifier but the classifier must be retrained each time we predict and conformalize the probabilitiy of a new patient [33]. 4. Our proposal This work presents a novel methodology for the non-invasive differentiation of a specific family of CNS tumors, specifically distinguishing between high-grade (HGG) and low-grade gliomas (LGG). Patient data are 3D MRI scans focused on the T2 sequence. The T2 sequence is designed to measure the evolution of tissue signal intensity as the contrast bolus passes through. From these changes in signal intensity in the image, once processed, we construct the perfusion curve, which serves as input for our machine learning model. Thus, we transform an image processing problem into a time series classification one, which we address by combining two impactful technologies: on one hand, a deep Preprocessing CP-DNN Segmentation and Conformal Predictor normalization of the based on a Deep Neural images Network "LGG" Perfusion sequence Perfusion curve Prediction T2 gradient echo or Mean luminance value of Characterization of gliomas Dynamic Susceptibility each voxel per snapshot into low grade (LGG) and Contrast Sequences high grade (HGG) Figure 1: Complete processing pipeline for glioma differentiation between low-grade and high-grade. The process initiates with the reception of a new perfusion sequence. After preprocessing the images, we extract the perfusion curve by calculating the mean luminance of the voxels at different time points. Subsequently, a CP-DNN is trained to extract the final prediction and quantify its uncertainty. neural network capable of extracting patterns with high precision; and on the other hand, conformal predictors for quantifying the uncertainty of predictions. Consequently, we develop a decision support system with significant generalization capacity, providing valuable and calibrated information to the radiologist. The presented pipeline is divided into three phases: preprocessing of the sequence, generation of the perfusion curve, and finally, glioma differentiation with a machine learning model, specifically, a conformal predictor based on a deep neural network classifier. Below, we provide a detailed description of each phase. Figure 1 show the complete pipeline. 4.1. Preprocessing the perfusion MRI sequence The differentiation process begins with the arrival of a new perfusion sequence. The first phase, related to preprocessing, involves normalizing the image in both size and grayscale levels using the Statistical Parametric Mapping technique [47]. Additionally, it is crucial that the image is segmented. Despite numerous initiatives aiming to develop deep learning models for brain segmentation in MRI (as illustrated in [48] and other reviews), there is, to the best of our knowledge, no automatic technique for segmenting an image in the perfusion sequence. Therefore, in our approach, cases have been segmented by radiologists using the 3D Slicer tool (https://www.slicer.org, [49]). 4.2. Generating the perfusion curve Once the perfusion sequence is normalized and segmented, we generate the curve. The perfusion sequence consists of a series of snapshots taken over a specified period. This procedure generates 𝑛 3D images of the brain and, specifically, the glioma. By processing each of these 𝑛 images, calculating the mean luminance value of each voxel within the image in each snapshot, we generate a time series with as many points as snapshots. This way, a time series is generated out of the perfusion sequence, the perfusion curve. After that, the time series is normalized by subtracting the mean of the points and dividing by their standard deviation. 4.3. Characterizing the Glioma with a CP based on DNN The differentiation of gliomas into low grade (LGG) and high grade (HGG) requires a machine learning system capable of extracting relevant patterns to distinguish between the two tumor types. Consequently, we constructed a deep dense neural network that takes a perfusion curve as input and classifies it into LGG and HGG. This is an unbalanced classification problem due to the prevalence of the HGG class in the data set over the LGG class. Thus, to evaluate the DNN performance, we calculate the accuracy and F1-score. To enhance the trustworthiness of the classifier outputs, its probabilities must be calibrated. We calibrate the predictions of the model applying four independent calibration techniques: Platt’s Scaling (PS), Isotonic Regression (IR), Cross Venn-ABERS (C-V-A), and Inductive Venn-ABERS (I-V-A). The performance of the calibrators is evaluated with classification and calibration metrics. In this way, physicians have access not only to artificial intelligence-based tools with a high level of accuracy but also to the ability to assess the confidence that these models have in each of the predictions associated with the patients, lending credibility to those predictions in which the model has sufficient confidence. To achieve this, we employ a Conformal Predictor with a specified confidence level, 𝜂, (e.g. 90%) on top of the DNN. Thus, the radiologist will know, with each prediction, whether the model has confidence exceeding 𝜂 in determining whether the tumor is a low-grade or high-grade glioma, what would increase his or her trust on the differentiation model performance. To the best of our knowledge, there is no state-of-the-art method that addresses the problem of glioma differentiation by applying explainability and trustworthiness methods, using perfusion sequences as time series and applying DNN and CP in this binary classification context. 5. Empirical study To demonstrate the ability of our method to differentiate between high-grade and low-grade gliomas satisfactorily, we have conducted a rigorous experimentation. It is detailed in this section along with the results and their analysis. 5.1. Dataset We have compiled a cohort of 58 patients from the radiology department of the Virgen de las Nieves Hospital in Granada, Spain, selected for their exceptional quality and homogeneity in the MRI scans taken for their diagnosis and treatment. Due to the rather small size of the dataset, we have designed a data augmentation strategy to effectively train our machine learning models. For the training partition, we generate perfusion curves not only at the level of the entire volume but also across the different 2D slices that compose it along the Z-axis. All slices of the same sequence inherit the label, either LGG or HGG, from the original sequence. This extends the training set to exceed 700 curves. Besides the anticipated imbalanced character of the problem arises. The distribution of instances between the two classes is 70% (HGG) and 30% (LGG). To address this issue, the models need to be adjusted to weigh the prediction error, making it more costly to make errors in predicting the LGG class than the HGG class. 5.2. Deep Neural Network: architecture and hyperparameters Taking into account that the patient cohort is not very extensive, the dense neural network to be used cannot be very large, as we might incur in overfitting issues. After different attempts, an architecture with five layers (input, output, and three hidden layers with 70, 30, and 5 neurons) is considered, which yields competitive results. For the selection of hyperparameters, a Grid Search Cross Validation is applied to choose the op- timization algorithm and the parameters 𝛼 (strength of the L2 regularization term), initial learning rate, 𝛽1 (exponential decay rate for fist moment vector), 𝛽2 (exponential decay rate for second moment vector), and 𝜖 (value for numerical stability). The result of this hyperparameter tuning process produced the following value selection: • Optimization algorithm: Adam • Initial Learning rate: 0.001 • 𝛼 : 0.0001 • 𝛽1 : 0.9 • 𝛽2 : 0.9 • 𝜖 : 0.000000001 5.3. Metrics Our proposal combines deep learning and conformal prediction. We use classic metrics to evaluate the goodness of the prediction and calibration metrics to assess the level of uncertainty in the predictions. For classification metrics, we use accuracy and F1-Score. For calibration metrics, we use Brier Score and Log Loss. The first one is defined as: 𝑛 1 ∑︁ 𝐵𝑆 = (𝑝𝑖 − 𝑜𝑖 )2 , (2) 𝑛 𝑖=1 where 𝑝 is the prediction probability of occurrence of the event 𝑖, and 𝑜𝑖 is defined as follows: if event 𝑖 ocurred {︂ 1 𝑜𝑖 = (3) 0 if event 𝑖 not ocurred Log Loss is defined as 𝐿𝑙𝑜𝑔 (𝑦, 𝑝) = −(𝑦 log(𝑝) + (1 − 𝑦) log(1 − 𝑝)) (4) with 𝑦 ∈ {0, 1} and a probability estimate 𝑝 = 𝑃 𝑟(𝑦 = 1). For both of them, Brier Score and Log Loss, the smaller the value, the better is the calibration. 5.4. Results and analysis We calculated the accuracy and F1-score to compare the performance of our novel DNN for time series classification approach with the leading state-of-the-art method for time series classification, which is a 1NN with Dynamic Time Warping ([50], [51]). We have designed a data strategy to ensure the rigor of the study. In all cases, we guarantee the separation between patients is maintained. Firstly, to confirm the reliability of the results, a Stratified 5-fold Cross Validation is used. This involves generating five datasets where patients are split into 80% for training and 20% for testing each time, maintaining class proportions in each fold. The values in Tables 1 and 2 represent the mean values from the five runs for each metric on the test set, demonstrating that our DNN approach outperforms the state-of-the-art 1NN-DTW in both performance and calibration. These results confirm previous analyses where Inductive Venn-ABERS and other CP methods outperformed Platt’s Scaling and Isotonic Regression [46]. In order to unleash the true power of the DNN+CP (actually, Inductive Venn-ABERS) combination, we conduct another experiment where the dataset is partitioned into training and test sets, with 80% of the patients used for training and 20% for testing, again in a stratified manner. We plot in Figure 2 the predicted and conformalized probabilities for the 12 instances (MRI scans) in the test set. On the X-axis, we can see the 12 predictions made by the model. On the Y-axis, we find the probability expressed by the model, with 0 representing a prediction of the LGG class and 1 representing a prediction of the HGG class. The black dots indicate the actual labels of the test examples. As can be seen, there are 3 Method Acc F1-Score Brier Score Log Loss DNN 0.642 0.769 0.289 1.168 DNN + IR 0.733 0.842 0.2 1.185 DNN + PS 0.733 0.844 0.2 0.612 DNN + C-V-A 0.677 0.787 0.249 0.755 DNN + I-V-A 0.752 0.857 0.197 0.598 Table 1 Performance metrics include accuracy and F1-score (the higher the better), and calibration metrics include Brier Score and Log Loss (the smaller the better) for the DNN model, which outperforms the state-of-the-art 1NN-DTW model (see Table 2). The last four rows correspond to the implementations of the four calibration methods: IR stands for Isotonic Regression, PS stands for Platt’s Scaling, C-V-A stands for Cross Venn-ABERS, and I-V-A stands for Inductive Venn-ABERS. All methods improve model performance, with Inductive Venn-ABERS significantly reducing calibration errors. The best results are highlighted in bold. Method Acc F1-Score Brier Score Log Loss 1NN-DTW 0.692 0.769 0.288 10.376 1NN-DTW + IR 0.714 0.813 0.233 0.636 1NN-DTW + PS 0.73 0.844 0.248 0.634 1NN-DTW + C-V-A 0.675 0.786 0.27 0.857 1NN-DTW + I-V-A 0.746 0.848 0.201 0.62 Table 2 Similar to Table 1, here are the results for the state-of-the-art 1NN-DTW model. The classifier shows lower performance and calibration metrics compared to our approach. IR stands for Isotonic Regression, PS stands for Platt’s Scaling, C-V-A stands for Cross Venn-ABERS, and I-V-A stands for Inductive Venn-ABERS. The best results are highlighted in bold. cases of LGG and 9 cases of HGG, maintaining the stratification of the original dataset. The larger light blue dot corresponds to the predictions made by the dense neural network without conformalization. When the black dot and the blue dot coincide, it indicates that the neural network correctly predicted the class. Conversely, if they do not coincide, the neural network made an incorrect prediction. The smaller blue dot (𝑝0 ) and the red dot (𝑝1 ) represent the probability interval containing each of the calibrated predictions (𝑝), depicted by the orange dot. This orange dot represents the predictions of our DNN+CP model expressed in terms of probability, where a probability 𝑝 ≥ 0.5 results in a predicted label of 1 (HGG class). Conversely, if 𝑝 < 0.5, the predicted label is 0 (LGG class). The distance between 𝑝0 and 𝑝1 (𝑝1 − 𝑝0 ) indicates the uncertainty in the model’s decision: the greater the distance between them, the higher the uncertainty in the prediction. This uncertainty is represented by the pink curve. As observed, the neural network without conformalization makes two errors (the second and the eleventh cases). Thanks to the conformalization of the predictions, the DNN+CP model correctly predicts the second case, as the predicted probability is below 0.5, resulting in an LGG class label. Regarding the eleventh case, the DNN+CP, like the non-conformalized DNN, makes the error of categorizing the case as HGG when it is actually LGG. However, it can be seen that the uncertainty in that prediction is very high, showing confidence of less than 60%. Similarly, predictions with some uncertainty are made in the first (slightly over 60%) and third cases (close to 80%). The rest of predictions are more reliable, showing a confidence ranging from 85% to 99%. Therefore, the combination of DNN+CP is not only more accurate, as it corrects errors that an uncalibrated neural network would make, but it also provides very valuable information to doctors about the confidence of the predictions, allowing specialists to assess each case with much more trustworthy information. 1.0 0.8 non-calibrated Probability 0.6 p1 upper probability p0 lower probability conformal probability 0.4 true label p1-p0 interval width 0.2 0.0 Test set sorted by the conformal probability Figure 2: Probability vs conformal probability for MRI scans in the test set. Only thanks to the conformalization, the misclassified patient (second from the left) is correctly classified as a high glioma case. Large intervals show a low confidence in those predictions. 6. Conclusions In this article, we address a global public health issue, ranked among the top 10 causes of cancer- related mortality worldwide: the detection of central nervous system tumors. Specifically, we focus on differentiating gliomas into their two types: high-grade gliomas (HGG) and low-grade gliomas (LGG). To achieve this, we introduce a novel methodology based on time series classification. By utilizing MRI-perfusion, we transform the image into a time series, called the perfusion curve, which reflects tissue vascularization in a non-invasive manner. To classify these perfusion curves between LGG and HGG, a conformal predictor based on a deep neural network is trained. We have compared our proposal with the state-of-the-art technique in time series classification through rigorous experimentation, based on a dataset of patients from the radiology department of the “Virgen de las Nieves Hospital” in Granada, Spain, obtaining satisfactory results. We demonstrate that our methodology is not only innovative in the way it transforms the image problem into sequences, but also that the combination of deep neural networks and conformal prediction for time series classification generates an ideal tool for radiologists This tool displays exceptional generalization capability and the reduced uncertainty in its predictions, making it useful, reliable and trustworthy. This study contributes to quantifying uncertainty at patient-level with applications in personalized and precision medicine. Besides, by improving the confidence in AI-medical applications, this study also aligns with the requirements of the European Parliament AI regulation to achieve Trustworthy AI. Acknowledgments This research has been partially funded by Spanish Ministry of Economy, Industry and Competitiveness (PID2020-118224RB-100 and PID2023-151336OB-I00), con-financed by the European Union (FEDER). References [1] J. Ferlay, I. Soerjomataram, R. Dikshit, S. Eser, C. Mathers, M. Rebelo, et al., Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012, Int. J. Cancer 136 (2015) E359e86. [2] Q. Ostrom, L. Bauchet, F. Davis, et al., The epidemiology of glioma in adults: a “state of the science” review, Neuro Oncology 16 (2014) 894–913. [3] B. Rasmussen, S. Hansen, R. Larsen, M. Kosteljanetz, H. Schultz, B. Bøgård, et al., Epidemiology of glioma: clinical characteristics, symptoms, and predictors of glioma pationes grade I–IV in the danish neuro-oncology registry, Journal of Neuro-Oncology 135 (2017) 571–549. doi:10.1007/ s11060-017-2607-5. [4] Y. Mizobuchi, K. Nkajima, T. Fujihara, K. Matsuzaki, H. Mure, S. Nagahiro, Y. Takagi, The risk of hemorrhage in steriotactic biopsy for brain tumorus, J. Medical Investigation 66 (2019) 317–318. doi:10.2152/jmi.66.314. [5] J. Abrigo, D. Fountain, J. Provenzale, E. Law, J. Kwong, M. Hart, W. Tam, Magnetic resonance perfusion for differentiating low-grade frmo high-grade gliomas at first presentation (Review), Cochrane Database of Systematic Reviews (2018). doi:10.1002/14651858.CD011551.pub2. [6] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya, V. Makarenkov, S. Nahavandi, A review of uncertainty quantification in deep learning: Techniques, applications and challenges, Information Fusion 76 (2021) 243–297. URL: https://www.sciencedirect.com/science/article/pii/S1566253521001081. doi:https://doi. org/10.1016/j.inffus.2021.05.008. [7] H. Karimi, R. Samavi, Quantifying deep learning model uncertainty in conformal prediction, Proceedings of the AAAI Symposium Series 1 (2023) 142–148. URL: http://dx.doi.org/10.1609/ aaaiss.v1i1.27492. doi:10.1609/aaaiss.v1i1.27492. [8] J. Fayyad, S. Alijani, H. Najjaran, Empirical validation of conformal prediction for trustworthy skin lesions classification, Computer Methods and Programs in Biomedicine 253 (2024) 108231. URL: https://www.sciencedirect.com/science/article/pii/S0169260724002268. doi:https://doi. org/10.1016/j.cmpb.2024.108231. [9] H. Olsson, K. Kartasalo, N. Mulliqi, M. Capuccini, P. Ruusuvuori, H. Samaratunga, B. Delahunt, C. Lindskog, E. A. M. Janssen, A. Blilie, L. Egevad, O. Spjuth, M. Eklund, I. P. I. E. Panel, Estimating diagnostic uncertainty in artificial intelligence assisted pathology using conformal prediction, Nature Communications 13 (2022) 7761. URL: https://doi.org/10.1038/s41467-022-34945-8. doi:10. 1038/s41467-022-34945-8. [10] C. Lu, K. Chang, P. Singh, J. Kalpathy-Cramer, Three applications of conformal predic- tion for rating breast density in mammography, 2022. URL: https://arxiv.org/abs/2206.12008. arXiv:2206.12008. [11] C. Lu, A. N. Angelopoulos, S. Pomerantz, Improving trustworthiness of ai disease severity rating in medical imaging with ordinal conformal prediction sets, in: L. Wang, Q. Dou, P. T. Fletcher, S. Speidel, S. Li (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, Springer Nature Switzerland, Cham, 2022, pp. 545–554. [12] J. Vazquez, J. C. Facelli, Conformal prediction in clinical medical sciences, Journal of Healthcare Informatics Research 6 (2022) 241–252. URL: https://doi.org/10.1007/s41666-021-00113-8. doi:10. 1007/s41666-021-00113-8. [13] C. Lu, A. Lemay, K. Chang, K. Höbel, J. Kalpathy-Cramer, Fair conformal predictors for applications in medical imaging, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 12008– 12016. URL: https://ojs.aaai.org/index.php/AAAI/article/view/21459. doi:10.1609/aaai.v36i11. 21459. [14] H. Wieslander, P. J. Harrison, G. Skogberg, S. Jackson, M. Fridén, J. Karlsson, O. Spjuth, C. Wählby, Deep learning with conformal prediction for hierarchical analysis of large-scale whole-slide tissue images, IEEE Journal of Biomedical and Health Informatics 25 (2021) 371–380. doi:10.1109/ JBHI.2020.2996300. [15] H.-H. Cho, S.-H. Lee, J. Kim, H. Park, Classification of the glioma grading using radiomics analysis, PeerJ 6 (2018) e5982. [16] H. Mzoughi, I. Njeh, A. Wali, M. B. Slima, A. BenHamida, C. Mhiri, K. B. Mahfoudhe, Deep multi-scale 3d convolutional neural network (cnn) for mri gliomas brain tumor classification, Journal of Digital Imaging 33 (2020) 903–915. URL: https://doi.org/10.1007/s10278-020-00347-9. doi:10.1007/s10278-020-00347-9. [17] S. C. P. D. E. P. A. I. Agusti Alentorn, Alberto Duran-Peña, S. Kesari, Molecular profiling of gliomas: potential therapeutic implications, Expert Review of Anticancer Therapy 15 (2015) 955–962. URL: https://doi.org/10.1586/14737140.2015.1062368. doi:10.1586/14737140.2015.1062368. arXiv:https://doi.org/10.1586/14737140.2015.1062368. [18] J. Haubold, R. Hosch, V. Parmar, M. Glas, N. Guberina, O. A. Catalano, D. Pierscianek, K. Wrede, C. Deuschl, M. Forsting, F. Nensa, N. Flaschel, L. Umutlu, Fully automated mr based virtual biopsy of cerebral gliomas, Cancers 13 (2021). URL: https://www.mdpi.com/2072-6694/13/24/6186. doi:10.3390/cancers13246186. [19] M. Kim, S. Y. Jung, J. E. Park, Y. Jo, S. Y. Park, S. J. Nam, J. H. Kim, H. S. Kim, Diffusion- and perfusion-weighted mri radiomics model may predict isocitrate dehydrogenase (idh) mutation and tumor aggressiveness in diffuse lower grade glioma, European Radiology 30 (2020) 2142–2151. URL: https://doi.org/10.1007/s00330-019-06548-3. doi:10.1007/s00330-019-06548-3. [20] Y. Ren, X. Zhang, W. Rui, H. Pang, T. Qiu, J. Wang, Q. Xie, T. Jin, H. Zhang, H. Chen, Y. Zhang, H. Lu, Z. Yao, J. Zhang, X. Feng, Noninvasive prediction of idh1 mutation and atrx expression loss in low-grade gliomas using multiparametric mr ra- diomic features, Journal of Magnetic Resonance Imaging 49 (2019) 808–817. URL: https: //onlinelibrary.wiley.com/doi/abs/10.1002/jmri.26240. doi:https://doi.org/10.1002/jmri. 26240. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/jmri.26240. [21] Z. A. Shboul, J. Chen, K. M. Iftekharuddin, Prediction of molecular mutations in diffuse low-grade gliomas using mr imaging features, Scientific Reports 10 (2020) 3711. URL: https://doi.org/10.1038/ s41598-020-60550-0. doi:10.1038/s41598-020-60550-0. [22] E. Calabrese, J. D. Rudie, A. M. Rauschecker, J. E. Villanueva-Meyer, J. L. Clarke, D. A. Solomon, S. Cha, Combining radiomics and deep convolutional neural network features from preoperative MRI for predicting clinically rele- vant genetic biomarkers in glioblastoma, Neuro-Oncology Advances 4 (2022) vdac060. URL: https://doi.org/10.1093/noajnl/vdac060. doi:10.1093/noajnl/vdac060. arXiv:https://academic.oup.com/noa/article-pdf/4/1/vdac060/43778051/vdac060.pdf. [23] AI Act — digital-strategy.ec.europa.eu, https://digital-strategy.ec.europa.eu/en/policies/ regulatory-framework-ai, 2024. [Accessed 24-04-2024]. [24] N. Díaz-Rodríguez, J. Del Ser, M. Coeckelbergh, M. López de Prado, E. Herrera-Viedma, F. Herrera, Connecting the dots in trustworthy artificial intelligence: From ai principles, ethics, and key requirements to responsible ai systems and regulation, Information Fusion 99 (2023) 101896. URL: https://www.sciencedirect.com/science/article/pii/S1566253523002129. doi:https://doi.org/ 10.1016/j.inffus.2023.101896. [25] CHAI — coalitionforhealthai.org, https://www.coalitionforhealthai.org/, 2024. [Accessed 24-04- 2024]. [26] C. Molnar, Introduction to conformal prediction with python, 2023. [27] V. Vovk, I. Petej, Venn-abers predictors, in: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, AUAI Press, Arlington, Virginia, USA, 2014, p. 829–838. [28] J. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, 1999. URL: https://api.semanticscholar.org/CorpusID:56563878. [29] V. Manokhin, Practical guide to applied conformal prediction in python: Learn and apply the best uncertainty frameworks to your industry applications, 2023. [30] B. Zadrozny, C. Elkan, Transforming classifier scores into accurate multiclass probability estimates, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, Association for Computing Machinery, New York, NY, USA, 2002, p. 694–699. URL: https://doi.org/10.1145/775047.775151. doi:10.1145/775047.775151. [31] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning, in: Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, Association for Computing Machinery, New York, NY, USA, 2005, p. 625–632. URL: https://doi.org/10.1145/ 1102351.1102430. doi:10.1145/1102351.1102430. [32] S. Arvidsson, O. Spjuth, L. Carlsson, P. Toccaceli, Prediction of metabolic transformations using cross Venn-ABERS predictors, in: A. Gammerman, V. Vovk, Z. Luo, H. Papadopoulos (Eds.), Proceedings of the Sixth Workshop on Conformal and Probabilistic Prediction and Applications, volume 60 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 118–131. URL: https: //proceedings.mlr.press/v60/arvidsson17a.html. [33] I. Nouretdinov, S. G. Costafreda, A. Gammerman, A. Chervonenkis, V. Vovk, V. Vapnik, C. H. Y. Fu, Machine learning classification with confidence: application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression, Neuroimage 56 (2010) 809–813. [34] D. Devetyarov, I. Nouretdinov, B. Burford, S. Camuzeaux, A. Gentry-Maharaj, A. Tiss, C. Smith, Z. Luo, A. Chervonenkis, R. Hallett, V. Vovk, M. Waterfield, R. Cramer, J. F. Timms, J. Sinclair, U. Menon, I. Jacobs, A. Gammerman, Conformal predictors in early diagnostics of ovarian and breast cancers, Progress in Artificial Intelligence 1 (2012) 245–257. URL: https://doi.org/10.1007/ s13748-012-0021-y. doi:10.1007/s13748-012-0021-y. [35] H. Papadopoulos, A. Gammerman, V. Vovk, Reliable diagnosis of acute abdominal pain with confor- mal prediction, International journal of engineering intelligent systems for electrical engineering and communications 17 (2009) 127–137. URL: https://api.semanticscholar.org/CorpusID:18515829. [36] A. Lambrou, H. Papadopoulos, A. Gammerman, Evolutionary conformal prediction for breast can- cer diagnosis, in: 2009 9th International Conference on Information Technology and Applications in Biomedicine, 2009, pp. 1–4. doi:10.1109/ITAB.2009.5394447. [37] L. Alnemer, L. Rajab, I. Aljarah, Conformal prediction technique to predict breast cancer surviv- ability, International journal of advanced science and technology 96 (2016) 1–10. doi:10.14257/ ijast.2016.96.01. [38] A. S. Millar, J. Arnn, S. Himes, J. C. Facelli, Uncertainty in breast cancer risk prediction: A conformal prediction study of race stratification (2024). URL: http://dx.doi.org/10.3233/SHTI231113. doi:10.3233/shti231113. [39] S. Hernandez-Hernandez, Q. Guo, P. J. Ballester, Conformal prediction of molecule-induced cancer cell growth inhibition challenged by strong distribution shifts, bioRxiv (2024). URL: https://www. biorxiv.org/content/early/2024/03/17/2024.03.15.585269. doi:10.1101/2024.03.15.585269. arXiv:https://www.biorxiv.org/content/early/2024/03/17/2024.03.15.585269.full.pdf. [40] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, E. Silverman, An Empirical Distribution Function for Sampling with Incomplete Information, The Annals of Mathematical Statistics 26 (1955) 641 – 647. URL: https://doi.org/10.1214/aoms/1177728423. doi:10.1214/aoms/1177728423. [41] A. Gammerman, V. Vapnik, V. Vovk, Learning by transduction, in: Proceedings of the Fourteenth Conference on Uncertainty in Articial Intelligence, Morgan Kaufmann, 1998, pp. 148–156. [42] C. Saunders, A. Gammerman, V. Vovk, Transduction with confidence and credibility, in: Sixteenth International Joint Conference on Artificial Intelligence (IJCAI ’99) (01/01/99), 1999, pp. 722–726. URL: https://eprints.soton.ac.uk/258961/. [43] V. Vovk, A. Gammerman, C. Saunders, Machine-learning applications of algorithmic randomness, in: Proceedings of the Sixteenth International Conference on Machine Learning, Morgan Kaufmann, 1999, pp. 444–453. [44] V. Manokhin, Multi-class probabilistic classification using inductive and cross Venn–Abers predictors 60 (2017) 228–240. URL: https://proceedings.mlr.press/v60/manokhin17a.html. [45] V. VOVK, Algorithmic learning in a random world, 2023. [46] T. Pereira, S. Cardoso, M. Guerreiro, A. Mendonça, de, S. C. Madeira, Alzheimer’s Disease Neu- roimaging Initiative, Targeting the uncertainty of predictions at patient-level using an ensemble of classifiers coupled with calibration methods, Venn-ABERS, and conformal predictors: A case study in AD, J. Biomed. Inform. 101 (2020) 103350. [47] K. J. Friston, Statistical Parametric Mapping, Springer US, Boston, MA, 2003, pp. 237–250. URL: https://doi.org/10.1007/978-1-4615-1079-6-16. doi:10.1007/978-1-4615-1079-6-16. [48] R. Ranjbarzadeh, A. Caputo, E. B. Tirkolaee, S. Jafarzadeh Ghoushchi, M. Bendechache, Brain tumor segmentation of mri images: A comprehensive review on the application of artificial intelligence tools, Computers in Biology and Medicine 152 (2023) 106405. URL: https://www.sciencedirect. com/science/article/pii/S0010482522011131. doi:https://doi.org/10.1016/j.compbiomed. 2022.106405. [49] A. Fedorov, R. Beichel, J. Kalpathy-Cramer, J. Finet, J.-C. Fillion-Robin, S. Pujol, C. Bauer, D. Jennings, F. Fennessy, M. Sonka, J. Buatti, S. Aylward, J. V. Miller, S. Pieper, R. Kikinis, 3d slicer as an image computing platform for the quantitative imaging network, Magnetic Resonance Imaging 30 (2012) 1323–1341. URL: https://www.sciencedirect.com/science/article/pii/S0730725X12001816. doi:https://doi.org/10.1016/j.mri.2012.05.001, quantitative Imaging in Cancer. [50] A. Bagnall, A. Bostrom, J. Large, J. Lines, The great time series classification bake off: An experimental evaluation of recently proposed algorithms. extended version, 2016. URL: https: //arxiv.org/abs/1602.01711. arXiv:1602.01711. [51] B. Dhariyal, T. L. Nguyen, G. Ifrim, Back to basics: A sanity check on modern time series classifi- cation algorithms, 2023. URL: https://arxiv.org/abs/2308.07886. arXiv:2308.07886.