Reliable Central Nervous System Tumor Differentiation on MRI Images with Deep Neural Networks and Conformal Prediction

20 October 2024

Luis Balderas, School of Sciences, Technology and Engineering's Doctorates, Department of Computer Science and Artificial Intelligence, DiCITS, DaSCI, iMUDS, University of Granada

18071 Granada Spain

María Moreno de Castro, School of Sciences, Technology and Engineering's Doctorates, Department of Computer Science and Artificial Intelligence, DiCITS, DaSCI, iMUDS, University of Granada

18071 Granada Spain

Miguel Lastra, Department of Software Engineering, DiCITS, DaSCI, iMUDS, University of Granada

18071 Granada Spain

Jose P. Martínez, Radiodiagnosis Service, Hospital Universitario "Virgen de las Nieves"

Granada Spain

ibs Granada Instituto Biosanitario de Granada

Granada Spain

Francisco J. Pérez, Radiodiagnosis Service, Hospital Universitario "Virgen de las Nieves"

Granada Spain

Antonio Laínez, Radiodiagnosis Service, Hospital Universitario "Virgen de las Nieves"

Granada Spain

ibs Granada Instituto Biosanitario de Granada

Granada Spain

Antonio Arauzo-Azofra, Rural Engineering Department, DiCITS Lab, University of Córdoba

14005 Córdoba Spain

José M. Benítez, School of Sciences, Technology and Engineering's Doctorates, Department of Computer Science and Artificial Intelligence, DiCITS, DaSCI, iMUDS, University of Granada

18071 Granada Spain

ibs Granada Instituto Biosanitario de Granada

Granada Spain

Santiago de Compostela Spain

ISSN 1613-0073

ORCID: 0000-0002-3845-8848 (L. Balderas), 0000-0003-0440-0864 (M. Moreno de Castro), 0000-0002-7278-2668 (M. Lastra), 0000-0002-0663-2856 (A. Laínez), 0000-0002-2486-5792 (A. Arauzo-Azofra), 0000-0002-2346-0793 (J. M. Benítez)

Central nervous system tumors, particularly gliomas, rank among the top 10 causes of cancer-related deaths worldwide. Thus, precise differentiation of these tumors is crucial for effective treatment, which can reduce patient suffering and lower mortality rates. We propose a deep learning-based technique for glioma differentiation using MRI imaging, which incorporates two novel approaches. First, we represent perfusion sequences as time series, which serve as inputs for a Deep Neural Network (DNN) classifier. This classifier is trained to classify the sequences as low-grade glioma (LGG) or high-grade glioma (HGG). Second, we employ Conformal Prediction to calibrate the results, ensuring they include the true category with a 90% probability. This approach has been rigorously tested to evaluate its performance. Our experimental results demonstrate that our deep neural network provides predictions that are not only accurate but also trustworthy.

Introduction

Central nervous system (CNS) tumors have traditionally been regarded as a lethal form of cancer, ranking among the top 10 causes of cancer-related deaths globally. The GLOBOCAN project, using nationwide data from 184 countries, estimated the annual incidence of malignant-only CNS tumors to be 3.4 per 100,000 individuals [1]. Within the variety of tumor types, gliomas are the most frequent primary intracerebral tumors in adults. Despite their high fatality rates, the prognosis of CNS tumors has improved considerably over the last decades, possibly because of prompt detection, the optimization of treatment protocols (including the introduction of temozolomide) and advances in neurosurgical procedures.

Accurate preoperative differentiation of primary CNS tumors is crucial because the treatment strategies differ substantially (e.g. stereotactic biopsy + chemotherapy vs. gross total resection + chemotherapy in lymphoma vs. glioblastoma). Magnetic Resonance Imaging (MRI) is the most important non-invasive technique in the diagnosis and differentiation of CNS tumors. In this context, MRI-perfusion is an imaging technique frequently used to assess the vascularization of brain lesions in a minimally invasive way. There are different approaches for the study of perfusion and lesions of CNS.

Machine learning has been employed in numerous medical problems as part of expert decision support systems. However, in most instances, these systems do not provide any information regarding the confidence of their predictions. In this paper we propose an innovative approach that treats perfusion curves as time series data and applies Deep Neural Networks (DNN) and Conformal Predictors (CP) to assess the type of glioma a patient suffers from, along with measures of confidence.

In essence, this involves framing the differentiation problem as a classification task using the perfusion curve as input. This requires finding a suitable representation for the series and then identifying a technique that delivers high classification performance. On top of the classifier, a conformal predictor is built to quantify the level of uncertainty inherent in the model's predictions using specific metrics. Our research has resulted in an effective automated procedure to assist radiologists in their work, contributing to the quantification of uncertainty at the patient level, with applications in personalized and precision medicine.

The structure of the article is as follows: In Section 2, we present state-of-the-art methods that use deep learning and conformal prediction for glioma differentiation, including medical context for a better understanding of the problem. In Section 3 we detail Conformal Predictors. In Section 4, we present our proposal for glioma differentiation. In Section 5, we include the empirical study and discussion of the results. Finally, Section 6 contains the conclusions of the work.

Related work

Medical contextualization

As noted in the introduction, gliomas are the most frequent primary intracerebral tumors in adults. The age-adjusted annual incidence of histologically verified glioma has been reported to be 7.3 cases per 100,000 person-years [2]. Most patients with gliomas have a fatal prognosis, and the disease has considerable impact on the physical, psychological, and social status of patients and their families. A recent study on the epidemiology of gliomas found that high-grade gliomas (HGG, grades 3 and 4 according to the World Health Organization, WHO) were present in 85% of the cases and low-grade gliomas (LGG, WHO grades 1 and 2) in 15%, with 5-year overall survival of 82, 54, 22 and 3% for grades 1, 2, 3, and 4, respectively [3].

Accurate preoperative differentiation of primary CNS tumors is essential because the treatment strategies differ substantially. MRI is the most important non-invasive technique in the diagnosis of CNS tumors. In particular, MRI-perfusion is frequently used to assess the vascularization of brain lesions. There are different approaches for the study of perfusion and lesions of CNS. The most widely used MRI-perfusion approach in clinical practice is T2 gradient echo or dynamic susceptibility contrast (DSC) due to its high sensitivity and specificity to differentiate lesions and its short acquisition time. DSC perfusion allows the study of changes in the signal intensity of the tissues derived from the passage of an intravenous gadolinium bolus administered in a peripheral vein. The passage of the contrast bolus leads to specific imaging patterns in different tissues and tumors, with changes in their signal intensity proportional to the amount of gadolinium that passes through or is deposited in them. Changes in signal intensity over time allow a curve to be obtained, from which hemodynamic parameters are inferred.

At present, pathological anatomy techniques are essential in the diagnosis and the genetic and histological differentiation of gliomas. However, in the presence of lesions such as metastases, adequate diagnostic imaging can avoid a brain biopsy. Stereotactic brain biopsy entails a considerable risk of complications (e.g., haemorrhage, infection, wound breakdown), which can be as high as 15.3% in tumors located in midline areas, and a post-procedural intracranial haemorrhage rate of 28.8% has been reported [4].

A recent Cochrane review [5] reported limited available evidence on the ability of perfusion MRI to distinguish HGG from LGG, which precludes reliable estimation of the performance of DSC MRI perfusion-derived parameters (e.g., relative cerebral blood volume or rCBV) for determining the tumor grade, specifically in untreated solid and non-enhancing LGG vs. HGG. MRI-perfusion has also been advocated as a useful tool to differentiate between gliomas and other tumors of the CNS, such as lymphomas or metastases, but with diagnostic yields that vary across publications. Furthermore, it has been shown to be useful in differentiating between tumor recurrence of gliomas and post-treatment changes such as pseudoprogression or radionecrosis. However, the available evidence is restricted to isolated studies that have not been clinically validated.

Most of these publications have focused on a classical analysis of MRI perfusion data, with special emphasis on the quantification of rCBV (an estimate of tumor perfusion), which is not quantifiable in absolute terms (thus requiring comparison with areas of normal cerebral parenchyma). A number of publications have applied Artificial Intelligence algorithms to the analysis of brain perfusion. They are discussed in Section 2.2.

State-of-the-art Artificial Intelligence methods

In this section, we present the most relevant articles in the state of the art related to the topic. To the best of our knowledge, there is no technique in the literature that uses the perfusion curve as input for a time series classification model to distinguish low-grade gliomas (LGG) from high-grade gliomas (HGG). Moreover, there are numerous approaches in the literature that apply CP to quantify the uncertainty of deep learning models ([6], [7]). In fact, the combination of deep learning and CP is especially useful in biomedical problems: for example, in skin lesion classification ([8]), the diagnosis and grading of prostate biopsies ([9]), rating breast density in mammography ([10]), and grading the severity of spinal stenosis in lumbar spine MRI ([11]), among other applications ([12], [13], [14]). Yet we have not encountered in the literature any work that combines DNN and CP methods in the differentiation of central nervous system tumors.

However, the non-invasive differentiation of gliomas through the application of machine learning, specifically distinguishing between LGG and HGG gliomas, has been extensively investigated in recent years. For instance, in [15], a substantial number of radiological features were extracted from MRI sequences, including T1-weighted, T1-weighted contrast-enhanced, T2-weighted, and FLAIR, across a total of 285 cases (210 HGG, 75 LGG). These features were used to train three classifiers (logistic regression, random forest, and support vector machine) to determine the glioma type, achieving an average AUC of 0.9030 for test cohorts.

In [16], a deep multi-scale 3D Convolutional Neural Network (CNN) architecture was proposed to categorize gliomas into LGG and HGG using volumetric T1 contrast-enhanced MRI sequences, achieving an accuracy of 96.49%.

Recently, some studies have advocated mixed approaches incorporating molecular profiling for the differentiation of LGG and HGG ([17]). For instance, in [18], clinical and laboratory data were integrated to create a tool for predicting the molecular status (ATRX, IDH1/2, MGMT, and 1p19q co-deletion), distinguishing between low-grade and high-grade gliomas. The system achieved an AUC of 0.885 for this specific learning task.

Finally, other studies such as [19], [20], [21], and [22] utilized cases of high-grade or low-grade gliomas to conduct specific studies on various features, but did not present tools for classification between these two grades of gliomas.

Conformal predictors

In spite of their generally accurate performance, many machine learning models are known to estimate prediction probabilities poorly. To deliver reliable estimates and avoid over- or under-confident predictions, prediction uncertainty must be quantified. This need aligns with the demand for trustworthy Artificial Intelligence, which, as expressed in recent European Parliament AI regulations [23], includes as a requirement the model's ability to quantify and communicate its confidence in its outputs ([24], [6]), with special focus on high-risk applications such as AI-supported medicine [25].

Performance metrics, such as accuracy or F1-score, say nothing about the confidence or uncertainty of the model's predictions. Even for a model that is right most of the time, not all cases are equally easy to classify. Some of them might even be doubtful (e.g., they belong to a different distribution or their kind was mostly under-represented, overlooked, or misclassified). For example, we can think of patients' images from a different hospital, taken with a different MRI scanner, or labeled by another team. The model may provide overly confident predictions that go unnoticed, because machine learning algorithms do not include a built-in warning mechanism to prevent well-informed predictions from looking the same as wild guesses [26]. Calibration refers to the alignment between the predicted probabilities directly provided by the model and the relative frequencies in the actual data. Well-calibrated models are those where the predicted probabilities match the empirical probabilities. Thus, it is about calculating "the probability that the predicted probability is right". For example, given a set of MRI scans for which the model predicts class A with 80% probability, class A should occur in approximately 80 out of every 100 such predictions. More formally, a well-calibrated classifier satisfies the following formula [27]:

$$P(\text{actual class is } A \mid \text{predicted probability of } A \text{ is } p) \approx p \tag{1}$$
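Equation (1) can be checked empirically by grouping predictions with similar probabilities and comparing the mean predicted probability with the observed class frequency in each group. The following is a minimal sketch of such a check, not part of the original study; the function name `reliability_table`, the equal-width binning, and the toy data are illustrative assumptions.

```python
import numpy as np

def reliability_table(probs, labels, n_bins=10):
    """Per bin: (mean predicted probability, observed frequency, count).

    Hypothetical helper: equal-width bins on [0, 1] are an assumption.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in np.unique(bin_ids):
        in_bin = bin_ids == b
        rows.append((float(probs[in_bin].mean()),
                     float(labels[in_bin].mean()),
                     int(in_bin.sum())))
    return rows

# Synthetic toy data: ten predictions of 0.8, eight of which are class A (label 1)
toy_probs = [0.8] * 10
toy_labels = [1] * 8 + [0] * 2
table = reliability_table(toy_probs, toy_labels)
```

For a well-calibrated classifier, the first two entries of each row should be close to each other, as they are in this toy case.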

To provide confidence in classification results, calibration models should be applied. Two of the most popular calibration methods are Platt's Scaling and Isotonic Regression. Platt's Scaling [28] maps each prediction to its empirical frequency by passing it through a sigmoid. The method is therefore parametric and presupposes normally distributed and homoscedastic per-class scores, a notably limiting assumption [29]. On the other hand, Isotonic Regression [30] is more general because the mapping is performed with an isotonic (monotonically increasing) function. It assumes that the classifier perfectly ranks objects in the test set, essentially implying an ROC AUC of 1, and it is not recommended for small sets [29]. A detailed explanation of how both calibration methods work can be found in [31].

One step ahead of calibration is conformalization. To conformalize means that each single-point prediction is not only calibrated but also provided with a prediction interval, with statistical guarantees of including the true value at patient level ([27], [32], [29]). Conformal Prediction (CP) methods conformalize the prediction. CP methods, which are distribution-free and lightweight, have been applied successfully in health-related domains, mostly for cancer prediction ([33], [34], [35], [32], [36], [37], [38], [39]).

Venn-ABERS methods ([40], [27]) are members of the CP family ([41], [42], [43]). They share the ability to evaluate how different an unseen instance is from the training dataset. These methods were created to work on top of binary classifiers (although new implementations work for multi-class problems too [44]). They conformalize the predicted probability to match the actual frequency of its class, while delivering in the process the upper ($p_1$) and lower ($p_0$) bounds of the probability interval that includes the true class (a property known as coverage or validity) with statistical guarantee. This prediction interval quantifies the uncertainty of predictions at instance level: the larger the interval, the lower the confidence the model has in that prediction. Venn-ABERS predictors are an adaptation of Isotonic Regression (which is applied twice, to fit the probability of each class) and a special case of Venn Predictors ([45], [46], [29]), from which they inherit guaranteed validity in the form of well-calibrated probability predictions.

There are two possible implementations of the Venn-ABERS algorithm, depending on whether or not an independent calibration set, separate from the training set, is required. On one hand, in the inductive approach, or Inductive Venn-ABERS (I-V-A), the calibrator is fitted on a hold-out split of the training set, so the conformalization is model-agnostic [46]. On the other hand, in Cross Venn-ABERS (C-V-A) the splits are performed via k-fold, so there is no need to reserve data just for calibration as in the inductive case and all available data are used to train the classifier. However, the classifier must be retrained each time we predict and conformalize the probability of a new patient [33].
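The inductive variant described above can be sketched as follows: isotonic regression is fitted twice on the hold-out calibration scores, once with the new instance tentatively labelled 0 and once labelled 1, which yields the bounds $p_0$ and $p_1$. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function names are hypothetical, the isotonic fit is a plain pool-adjacent-violators routine, and the single-value merge $p = p_1 / (1 - p_0 + p_1)$ is one commonly used choice that the text above does not specify.

```python
import numpy as np

def isotonic_fit(scores, labels):
    """Non-decreasing least-squares fit; returns (unique scores, fitted values)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    xs, inverse = np.unique(scores, return_inverse=True)
    weights = np.bincount(inverse).astype(float)
    means = np.bincount(inverse, weights=labels) / weights
    # Pool adjacent violators; each block is [value, weight, n_unique_scores]
    blocks = []
    for m, w in zip(means, weights):
        blocks.append([m, w, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = np.concatenate([np.full(n, v) for v, _, n in blocks])
    return xs, fitted

def venn_abers_interval(cal_scores, cal_labels, test_score):
    """Return (p0, p1, p) for one test score given a hold-out calibration set."""
    bounds = []
    for hypothetical_label in (0, 1):
        xs, fitted = isotonic_fit(np.append(cal_scores, test_score),
                                  np.append(cal_labels, hypothetical_label))
        bounds.append(float(fitted[np.searchsorted(xs, test_score)]))
    p0, p1 = bounds
    p = p1 / (1.0 - p0 + p1)  # assumed merge of the interval into one probability
    return p0, p1, p

# Synthetic calibration scores and labels (0 = LGG, 1 = HGG)
cal_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 0, 1, 1, 1]
p0_a, p1_a, p_a = venn_abers_interval(cal_scores, cal_labels, 0.5)
p0_b, p1_b, p_b = venn_abers_interval(cal_scores, cal_labels, 0.95)
```

In this toy example, the score 0.5 falls in a region with no calibration data and receives the widest possible interval, whereas the score 0.95 receives a narrower interval close to 1.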

Our proposal

This work presents a novel methodology for the non-invasive differentiation of a specific family of CNS tumors, namely distinguishing between high-grade (HGG) and low-grade gliomas (LGG). Patient data are 3D MRI scans focused on the T2 sequence. The T2 sequence is designed to measure the evolution of tissue signal intensity as the contrast bolus passes through. From these changes in signal intensity in the image, once processed, we construct the perfusion curve, which serves as input for our machine learning model. Thus, we transform an image processing problem into a time series classification one, which we address by combining two impactful technologies: on one hand, a deep neural network capable of extracting patterns with high precision; and on the other hand, conformal predictors for quantifying the uncertainty of predictions. Consequently, we develop a decision support system with significant generalization capacity, providing valuable and calibrated information to the radiologist. The presented pipeline is divided into three phases: preprocessing of the sequence, generation of the perfusion curve, and finally, glioma differentiation with a machine learning model, specifically, a conformal predictor based on a deep neural network classifier. Below, we provide a detailed description of each phase. Figure 1 shows the complete pipeline.

Preprocessing the perfusion MRI sequence

The differentiation process begins with the arrival of a new perfusion sequence. The first phase, related to preprocessing, involves normalizing the image in both size and grayscale levels using the Statistical Parametric Mapping technique [47]. Additionally, it is crucial that the image is segmented. Despite numerous initiatives aiming to develop deep learning models for brain segmentation in MRI (as illustrated in [48] and other reviews), there is, to the best of our knowledge, no automatic technique for segmenting an image in the perfusion sequence. Therefore, in our approach, cases have been segmented by radiologists using the 3D Slicer tool (https://www.slicer.org, [49]).

Generating the perfusion curve

Once the perfusion sequence is normalized and segmented, we generate the curve. The perfusion sequence consists of a series of snapshots taken over a specified period. This procedure generates 𝑛 3D images of the brain and, specifically, the glioma. By processing each of these 𝑛 images and calculating the mean luminance value over the voxels within the image in each snapshot, we generate a time series with as many points as snapshots. In this way, the perfusion sequence is turned into a time series, the perfusion curve. After that, the time series is normalized by subtracting the mean of its points and dividing by their standard deviation.
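The curve generation and normalization just described can be sketched with NumPy. This is a minimal illustration, not the authors' code: the array layout `(T, Z, Y, X)`, the function name `perfusion_curve`, and the choice to average only over the voxels of the segmentation mask are assumptions made for the example.

```python
import numpy as np

def perfusion_curve(sequence, mask):
    """Build a z-normalized perfusion curve from a perfusion sequence.

    sequence: array of shape (T, Z, Y, X), one 3D image per snapshot (assumed layout)
    mask:     boolean array of shape (Z, Y, X), the segmented region (assumption)
    """
    sequence = np.asarray(sequence, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Mean luminance over the selected voxels at each time point -> shape (T,)
    curve = sequence[:, mask].mean(axis=1)
    # Subtract the mean of the points and divide by their standard deviation
    return (curve - curve.mean()) / curve.std()

# Synthetic toy sequence: 6 snapshots of a 4 x 8 x 8 volume
rng = np.random.default_rng(0)
toy_sequence = rng.random((6, 4, 8, 8))
toy_mask = np.zeros((4, 8, 8), dtype=bool)
toy_mask[1:3, 2:5, 2:5] = True
toy_curve = perfusion_curve(toy_sequence, toy_mask)
```

A constant curve has zero standard deviation and would need special handling, which this sketch omits.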

Characterizing the Glioma with a CP based on DNN

The differentiation of gliomas into low grade (LGG) and high grade (HGG) requires a machine learning system capable of extracting relevant patterns to distinguish between the two tumor types. Consequently, we constructed a deep dense neural network that takes a perfusion curve as input and classifies it into LGG and HGG. This is an unbalanced classification problem due to the prevalence of the HGG class in the data set over the LGG class. Thus, to evaluate the DNN performance, we calculate the accuracy and F1-score.

To enhance the trustworthiness of the classifier outputs, its probabilities must be calibrated. We calibrate the predictions of the model by applying four independent calibration techniques: Platt's Scaling (PS), Isotonic Regression (IR), Cross Venn-ABERS (C-V-A), and Inductive Venn-ABERS (I-V-A).

The performance of the calibrators is evaluated with classification and calibration metrics. In this way, physicians have access not only to artificial intelligence-based tools with a high level of accuracy but also to the ability to assess the confidence that these models have in each of the predictions associated with the patients, lending credibility to those predictions in which the model has sufficient confidence. To achieve this, we employ a Conformal Predictor with a specified confidence level, 𝜂 (e.g., 90%), on top of the DNN. Thus, the radiologist will know, with each prediction, whether the model has confidence exceeding 𝜂 in determining whether the tumor is a low-grade or high-grade glioma, which would increase their trust in the performance of the differentiation model.

To the best of our knowledge, there is no state-of-the-art method that addresses the problem of glioma differentiation by applying explainability and trustworthiness methods, using perfusion sequences as time series and applying DNN and CP in this binary classification context.

Empirical study

To demonstrate the ability of our method to differentiate between high-grade and low-grade gliomas satisfactorily, we have conducted rigorous experiments. They are detailed in this section along with the results and their analysis.

Dataset

We have compiled a cohort of 58 patients from the radiology department of the Virgen de las Nieves Hospital in Granada, Spain, selected for their exceptional quality and homogeneity in the MRI scans taken for their diagnosis and treatment.

Due to the rather small size of the dataset, we have designed a data augmentation strategy to effectively train our machine learning models. For the training partition, we generate perfusion curves not only at the level of the entire volume but also across the different 2D slices that compose it along the Z-axis. All slices of the same sequence inherit the label, either LGG or HGG, from the original sequence. This extends the training set to more than 700 curves. In addition, the anticipated class imbalance of the problem arises: the distribution of instances between the two classes is 70% (HGG) and 30% (LGG). To address this issue, the models need to be adjusted to weigh the prediction error, making errors in predicting the LGG class more costly than errors in predicting the HGG class.
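The slice-level augmentation and the error weighting described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the `(T, Z, Y, X)` array layout, the use of a segmentation mask to select voxels, the function names, and the inverse-frequency weighting scheme are all hypothetical choices.

```python
import numpy as np

def slice_level_curves(sequence, mask, label):
    """One raw perfusion curve per 2D slice along Z; every slice inherits `label`."""
    sequence = np.asarray(sequence, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    curves = []
    for z in range(mask.shape[0]):
        if not mask[z].any():
            continue  # this slice does not intersect the segmented region
        # Mean luminance of the selected voxels of slice z at each time point
        curves.append(sequence[:, z][:, mask[z]].mean(axis=1))
    return np.array(curves), [label] * len(curves)

def inverse_frequency_weights(labels):
    """Class weights inversely proportional to class frequency (assumed scheme)."""
    classes, counts = np.unique(labels, return_counts=True)
    total = len(labels)
    return {str(c): float(total / (len(classes) * n)) for c, n in zip(classes, counts)}

# Synthetic example: 6 snapshots, 4 slices, segmentation present in 2 slices
rng = np.random.default_rng(0)
toy_sequence = rng.random((6, 4, 8, 8))
toy_mask = np.zeros((4, 8, 8), dtype=bool)
toy_mask[1:3, 2:5, 2:5] = True
toy_curves, toy_curve_labels = slice_level_curves(toy_sequence, toy_mask, "HGG")

# A 70% / 30% split, as in the class distribution reported above
weights = inverse_frequency_weights(["HGG"] * 7 + ["LGG"] * 3)
```

With a 70/30 distribution, this scheme makes an error on an LGG instance about 2.3 times as costly as an error on an HGG instance.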

Deep Neural Network: architecture and hyperparameters

Taking into account that the patient cohort is not very extensive, the dense neural network to be used cannot be very large, as we might run into overfitting issues. After several attempts, we settled on an architecture with five layers (input, output, and three hidden layers with 70, 30, and 5 neurons), which yields competitive results.

For the selection of hyperparameters, a Grid Search Cross Validation is applied to choose the optimization algorithm and the parameters $\alpha$ (strength of the L2 regularization term), initial learning rate, $\beta_1$ (exponential decay rate for the first moment vector), $\beta_2$ (exponential decay rate for the second moment vector), and $\epsilon$ (value for numerical stability). The hyperparameter tuning process produced the following value selection:

• Optimization algorithm: Adam

• Initial learning rate: 0.001

• $\alpha$: 0.0001

• $\beta_1$: 0.9

• $\beta_2$: 0.9

• $\epsilon$: 0.000000001
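The architecture and hyperparameters listed above can be written down, for illustration, with scikit-learn's `MLPClassifier`, whose parameter names happen to match this list. The text does not state which library was used, so this choice is an assumption, as are `max_iter`, `random_state`, the curve length of 30 points, and the synthetic data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Three hidden layers (70, 30, 5) and the tuned values from the list above
dnn = MLPClassifier(
    hidden_layer_sizes=(70, 30, 5),
    solver="adam",
    learning_rate_init=0.001,
    alpha=0.0001,      # strength of the L2 regularization term
    beta_1=0.9,        # decay rate for the first moment vector
    beta_2=0.9,        # decay rate for the second moment vector
    epsilon=1e-9,      # value for numerical stability
    max_iter=500,      # assumption: not given in the text
    random_state=0,    # assumption: fixed only for reproducibility
)

# Synthetic stand-in for z-normalized perfusion curves (40 curves, 30 points each)
rng = np.random.default_rng(0)
toy_curves = rng.normal(size=(40, 30))
toy_labels = (toy_curves[:, :15].mean(axis=1)
              > toy_curves[:, 15:].mean(axis=1)).astype(int)

dnn.fit(toy_curves, toy_labels)
toy_probs = dnn.predict_proba(toy_curves)  # column 1: probability of class 1
```

The uncalibrated probabilities returned by `predict_proba` are the scores that a calibrator or conformal predictor would then take as input.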

Metrics

Our proposal combines deep learning and conformal prediction. We use classic metrics to evaluate the goodness of the prediction and calibration metrics to assess the level of uncertainty in the predictions. For classification metrics, we use accuracy and F1-Score. For calibration metrics, we use Brier Score and Log Loss. The first one is defined as:

$$BS = \frac{1}{n} \sum_{i=1}^{n} (p_i - o_i)^2, \tag{2}$$

where $n$ is the number of predictions, $p_i$ is the predicted probability of occurrence of event $i$, and $o_i$ is defined as follows:

$$o_i = \begin{cases} 1 & \text{if event } i \text{ occurred} \\ 0 & \text{if event } i \text{ did not occur} \end{cases} \tag{3}$$

Log Loss is defined as

$$L_{\log}(y, p) = -\left(y \log(p) + (1 - y) \log(1 - p)\right) \tag{4}$$

with $y \in \{0, 1\}$ and a probability estimate $p = \Pr(y = 1)$.

For both of them, Brier Score and Log Loss, the smaller the value, the better is the calibration.
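Both calibration metrics are short enough to be computed directly from their definitions. The sketch below is illustrative: the function names are hypothetical, the Log Loss is averaged over the instances, and probabilities are clipped away from 0 and 1 to keep the logarithm finite, a practical convention that the definitions above do not mention.

```python
import numpy as np

def brier_score(p, o):
    """Mean squared difference between predicted probabilities and outcomes."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return float(np.mean((p - o) ** 2))

def log_loss(y, p, eps=1e-15):
    """Per-instance Log Loss averaged over all instances.

    Clipping with eps is an assumption added to avoid log(0).
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

# Synthetic example: a maximally unsure prediction for one positive, one negative
bs_unsure = brier_score([0.5, 0.5], [1, 0])
ll_unsure = log_loss([1, 0], [0.5, 0.5])
bs_perfect = brier_score([1.0, 0.0], [1, 0])
```

A classifier that outputs hard 0/1 probabilities is penalized very heavily by the Log Loss on every error, which is why the clipping constant matters for such models.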

Results and analysis

We calculated the accuracy and F1-score to compare the performance of our novel DNN for time series classification approach with the leading state-of-the-art method for time series classification, which is a 1NN with Dynamic Time Warping ( [50], [51]).
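For reference, the 1NN with Dynamic Time Warping baseline can be sketched in a few lines. This is a generic textbook formulation, not the implementation used in the experiments: the absolute difference as local cost, the absence of a warping window, and the function names are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1D series (no window constraint)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def predict_1nn(train_series, train_labels, query):
    """Label of the training series closest to `query` under DTW."""
    distances = [dtw_distance(s, query) for s in train_series]
    return train_labels[int(np.argmin(distances))]

# Synthetic toy series with hypothetical labels "A" and "B"
train_series = [[0, 1, 2, 1, 0], [2, 1, 0, 1, 2]]
train_labels = ["A", "B"]
predicted = predict_1nn(train_series, train_labels, [0, 0, 1, 2, 1, 0])
```

Because DTW allows points to be repeated along the time axis, the query above matches the first training series with zero distance even though the two have different lengths.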

We have designed a data strategy to ensure the rigor of the study. In all cases, we guarantee the separation between patients is maintained. Firstly, to confirm the reliability of the results, a Stratified 5-fold Cross Validation is used. This involves generating five datasets where patients are split into 80% for training and 20% for testing each time, maintaining class proportions in each fold.
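A patient-level stratified split of this kind can be sketched as follows. The splitting is done over patients rather than over curves, so that all curves derived from one patient fall on the same side of the split. The function name, the round-robin assignment, and the per-class patient counts in the example (41 and 17) are illustrative assumptions; only the total of 58 patients comes from the text.

```python
import numpy as np

def stratified_patient_folds(patient_labels, n_splits=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs over patient indices."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(patient_labels)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        # Shuffle the patients of this class and deal them round-robin into folds
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return [(np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k))
            for k in range(n_splits)]

# Synthetic cohort of 58 patients; the per-class counts are hypothetical
toy_patient_labels = np.array(["HGG"] * 41 + ["LGG"] * 17)
folds = stratified_patient_folds(toy_patient_labels)
```

Slice-level augmentation would then be applied only to the patients in each training index set, which keeps every test patient unseen during training.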

The values in Tables 1 and 2 represent the mean values from the five runs for each metric on the test set, demonstrating that our DNN approach outperforms the state-of-the-art 1NN-DTW in both performance and calibration. These results confirm previous analyses where Inductive Venn-ABERS and other CP methods outperformed Platt's Scaling and Isotonic Regression [46].

To illustrate the benefit of the DNN+CP (specifically, Inductive Venn-ABERS) combination, we conduct another experiment where the dataset is partitioned into training and test sets, with 80% of the patients used for training and 20% for testing, again in a stratified manner. We plot in Figure 2 the predicted and conformalized probabilities for the 12 instances (MRI scans) in the test set. On the X-axis, we can see the 12 predictions made by the model. On the Y-axis, we find the probability expressed by the model, with 0 representing a prediction of the LGG class and 1 representing a prediction of the HGG class. The black dots indicate the actual labels of the test examples. As can be seen, there are 3 cases of LGG and 9 cases of HGG, maintaining the stratification of the original dataset. The larger light blue dots correspond to the predictions made by the dense neural network without conformalization. When the black dot and the light blue dot coincide, the neural network correctly predicted the class. Conversely, if they do not coincide, the neural network made an incorrect prediction.

The smaller blue dot ($p_0$) and the red dot ($p_1$) represent the probability interval containing each of the calibrated predictions ($p$), depicted by the orange dot. This orange dot represents the predictions of our DNN+CP model expressed in terms of probability, where a probability $p \geq 0.5$ results in a predicted label of 1 (HGG class). Conversely, if $p < 0.5$, the predicted label is 0 (LGG class). The distance between $p_0$ and $p_1$ ($p_1 - p_0$) indicates the uncertainty in the model's decision: the greater the distance between them, the higher the uncertainty in the prediction. This uncertainty is represented by the pink curve.

As observed, the neural network without conformalization makes two errors (the second and the eleventh cases). Thanks to the conformalization of the predictions, the DNN+CP model correctly predicts the second case, as the predicted probability is below 0.5, resulting in an LGG class label. Regarding the eleventh case, the DNN+CP, like the non-conformalized DNN, makes the error of categorizing the case as HGG when it is actually LGG. However, it can be seen that the uncertainty in that prediction is very high, showing confidence of less than 60%. Similarly, predictions with some uncertainty are made in the first (slightly over 60%) and third cases (close to 80%). The remaining predictions are more reliable, showing a confidence ranging from 85% to 99%.

Therefore, the combination of DNN+CP is not only more accurate, as it corrects errors that an uncalibrated neural network would make, but it also provides very valuable information to doctors about the confidence of the predictions, allowing specialists to assess each case with much more trustworthy information.

Conclusions

In this article, we address a global public health issue, ranked among the top 10 causes of cancer-related mortality worldwide: the detection of central nervous system tumors. Specifically, we focus on differentiating gliomas into their two types: high-grade gliomas (HGG) and low-grade gliomas (LGG). To achieve this, we introduce a novel methodology based on time series classification. By utilizing MRI-perfusion, we transform the image into a time series, called the perfusion curve, which reflects tissue vascularization in a non-invasive manner. To classify these perfusion curves as LGG or HGG, a conformal predictor based on a deep neural network is trained. We have compared our proposal with the state-of-the-art technique in time series classification through rigorous experimentation, based on a dataset of patients from the radiology department of the "Virgen de las Nieves" Hospital in Granada, Spain, obtaining satisfactory results. We demonstrate that our methodology is not only innovative in the way it transforms the image problem into sequences, but also that the combination of deep neural networks and conformal prediction for time series classification generates an ideal tool for radiologists. This tool displays exceptional generalization capability and reduced uncertainty in its predictions, making it useful, reliable and trustworthy. This study contributes to quantifying uncertainty at patient level, with applications in personalized and precision medicine. Moreover, by improving confidence in AI-based medical applications, this study also aligns with the requirements of the European Parliament AI regulation to achieve Trustworthy AI.

Figure 1: Complete processing pipeline for glioma differentiation between low-grade and high-grade. The process initiates with the reception of a new perfusion sequence. After preprocessing the images, we extract the perfusion curve by calculating the mean luminance of the voxels at different time points. Subsequently, a CP-DNN is trained to extract the final prediction and quantify its uncertainty.

Figure 2: Probability vs conformal probability for MRI scans in the test set. Only thanks to the conformalization, the misclassified patient (second from the left) is correctly classified as a low-grade glioma case. Large intervals show a low confidence in those predictions.

Table 1: Performance metrics include accuracy and F1-score (the higher the better), and calibration metrics include Brier Score and Log Loss (the smaller the better), for the DNN model, which outperforms the state-of-the-art 1NN-DTW model (see Table 2). The last four rows correspond to the implementations of the four calibration methods: IR stands for Isotonic Regression, PS stands for Platt's Scaling, C-V-A stands for Cross Venn-ABERS, and I-V-A stands for Inductive Venn-ABERS. All methods improve model performance, with Inductive Venn-ABERS significantly reducing calibration errors. The best results are highlighted in bold.

| Method | Acc | F1-Score | Brier Score | Log Loss |
|---|---|---|---|---|
| DNN | 0.642 | 0.769 | 0.289 | 1.168 |
| DNN + IR | 0.733 | 0.842 | 0.2 | 1.185 |
| DNN + PS | 0.733 | 0.844 | 0.2 | 0.612 |
| DNN + C-V-A | 0.677 | 0.787 | 0.249 | 0.755 |
| DNN + I-V-A | **0.752** | **0.857** | **0.197** | **0.598** |

Table 2: Similar to Table 1, here are the results for the state-of-the-art 1NN-DTW model. The classifier shows lower performance and calibration metrics compared to our approach. IR stands for Isotonic Regression, PS stands for Platt's Scaling, C-V-A stands for Cross Venn-ABERS, and I-V-A stands for Inductive Venn-ABERS. The best results are highlighted in bold.

| Method | Acc | F1-Score | Brier Score | Log Loss |
|---|---|---|---|---|
| 1NN-DTW | 0.692 | 0.769 | 0.288 | 10.376 |
| 1NN-DTW + IR | 0.714 | 0.813 | 0.233 | 0.636 |
| 1NN-DTW + PS | 0.73 | 0.844 | 0.248 | 0.634 |
| 1NN-DTW + C-V-A | 0.675 | 0.786 | 0.27 | 0.857 |
| 1NN-DTW + I-V-A | **0.746** | **0.848** | **0.201** | **0.62** |
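For readers less familiar with the two calibration metrics reported in the tables, the sketch below shows how the Brier score and the log loss are commonly computed for a binary problem. The labels and probabilities are made up for illustration. The clipping constant `eps` is an assumption, and the exact conventions used in the study's evaluation are not stated here.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to
    [eps, 1 - eps] so that hard 0/1 outputs give a finite value."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Made-up example: 1 = HGG, 0 = LGG, with predicted probabilities of HGG.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.4]

example_brier = brier_score(y_true, y_prob)
example_log_loss = log_loss(y_true, y_prob)

# A classifier that outputs hard 0/1 labels is penalised heavily by the
# log loss whenever it is wrong, because the clipped probability is tiny.
hard_log_loss = log_loss([1, 0], [0.0, 0.0])
```

Both metrics are zero for perfect, fully confident predictions. The log loss grows without bound as a wrong prediction approaches certainty, whereas the Brier score is bounded by 1.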

Acknowledgments

This research has been partially funded by the Spanish Ministry of Economy, Industry and Competitiveness (PID2020-118224RB-100 and PID2023-151336OB-I00), co-financed by the European Union (FEDER).
