<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-contrast Medical Image Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tianyi Ren</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juampablo Heras Rivera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hitender Oswal</string-name>
          <email>hitender@uw.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yutong Pan</string-name>
          <email>ypan4@cs.washington.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agamdeep Chopra</string-name>
          <email>achopra4@uw.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacob Ruzevick</string-name>
          <email>ruzevick@uw.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehmet Kurt</string-name>
          <email>mkurtr@uw.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mechanical Engineering, University of Washington</institution>
          ,
          <addr-line>3900 E Stevens Way NE, Seattle, WA 98195</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Neurological Surgery, University of Washington</institution>
          ,
          <addr-line>1959 NE Pacific Street, Seattle, WA 98195</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Paul G. Allen School of Computer Science, University of Washington</institution>
          ,
          <addr-line>185 E Stevens Way NE, Seattle, WA 98195</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep learning has been successfully applied to medical image segmentation, enabling accurate identification of regions of interest such as organs and lesions. This approach works effectively across diverse datasets, including those with single-contrast, multi-contrast, and multimodal imaging data. To improve human understanding of these black-box models, there is a growing need for Explainable AI (XAI) techniques that provide model transparency and accountability. Previous research has primarily focused on post hoc pixel-level explanations, using gradient-based and perturbation-based approaches, which rely on gradients or input perturbations to explain model predictions. However, these pixel-level explanations often struggle with the complexity inherent in multi-contrast magnetic resonance imaging (MRI) segmentation tasks, and the sparsely distributed explanations have limited clinical relevance. In this study, we propose using contrast-level Shapley values to explain state-of-the-art models with respect to standard metrics used in brain tumor segmentation. Our results demonstrate that Shapley analysis provides valuable insights into the behavior of different models used for tumor segmentation. We demonstrate a bias in U-Net towards over-weighing T1-contrast and FLAIR, while Swin-UNETR provides cross-contrast understanding with a balanced Shapley distribution.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Segmentation</kwd>
        <kwd>XAI</kwd>
        <kwd>Shapley Value</kwd>
        <kwd>MRI</kwd>
        <kwd>Brain Tumor</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Segmentation is a fundamental task in medical imaging, involving identifying regions of interest (ROIs)
such as organs, lesions, and tissues. By precisely outlining anatomical and pathological structures,
segmentation plays a pivotal role in computer-aided diagnosis, ultimately improving diagnostic precision
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Typically, segmentation tasks are carried out using multi-contrast MRI or multi-modal imaging
datasets, due to the necessity of identifying unique microstructural features, such as in gliomas [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
that are only apparent in some MRI contrasts, but not others. Many deep learning models, including
those used for segmentation, are considered black boxes, offering limited interpretability, resulting
in a lack of transparency and accountability [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Various Explainable AI (XAI) techniques have been
developed in the literature [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to tackle this problem, primarily categorized into gradient-based and
perturbation-based methods.
      </p>
      <p>
        Gradient-based techniques, such as saliency maps [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Grad-CAM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], visualize deep learning
predictions by identifying influential regions in input data, while perturbation-based approaches (Shapley
values [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and LIME [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) observe model behavior by systematically perturbing inputs and measuring
impact. These methods have been applied successfully to explain classification problems; however,
explaining segmentation still presents significant challenges. There is ongoing debate about whether
explanations are necessary for segmentation, as the masks themselves may serve as explanations.
Furthermore, there remains uncertainty regarding which components should be explained: when using
gradient-based approaches for models like U-Net, no consensus exists on which layer to target, nor,
in clinical applications, on which MRI contrasts to explain. Moreover, pixel-level explanations, typically
represented as discretized heatmaps, require further interpretation for grouping analysis [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>[Figure 1: (a) Brain tumor segmentation from multi-contrast MRI (t1c, t1n, t2w, t2f) with ground truth (GT). (b) Clinical questions: which image contrast or modality conveys the most information, and which image features does the model attend to? Comparison of (1) cross-contrast understanding (proposed method) and (2) pixel-level understanding (e.g., Grad-CAM); clinical benefits: which method is more intuitive and comparative, and which result conveys more information?]</p>
      <p>
        Since in clinical practice radiologists detect lesions by analyzing differences between different MRI
contrasts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an explainability framework that reveals deep learning model behavior with regard to how
models weigh different MRI contrasts in the segmentation process would be immediately clinically relevant.
Therefore, the main objective of this paper is to establish a framework for explaining the contributions
of different MRI contrasts in the segmentation process, with an application in brain tumor segmentation.
This method delivers intuitive quantitative model explanations and enables effective comparisons at
multiple levels: between contrasts within a subject (see Figure 4), and between model architectures for
comprehensive interpretation of model behavior (see Section 3). We perform systematic experiments to
explain how state-of-the-art models such as U-Net and the transformer-based Swin-UNETR weigh different
MRI contrasts with respect to different evaluation metrics such as Dice and HD95. We conduct statistical
analyses to provide an in-depth understanding of how and why different model architectures weigh
MRI contrasts differently, even when they achieve similar segmentation performance. In summary,
our paper is, to the best of our knowledge, the first study to propose a clinically relevant explanation
framework for brain tumor segmentation in multi-contrast MRI.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Dataset and Learning Objectives</title>
        <p>
          The training dataset is sourced from the Brain Tumor Segmentation (BraTS) Challenge 2024 GoAT
challenge [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], consisting of 1,351 subjects. For each subject, four MRI contrasts were given: Native T1-weighted
(t1n), Post-contrast T1-weighted (t1c), T2-weighted (t2w), and T2 Fluid Attenuated Inversion Recovery
(t2f). The ground truth annotations consist of three disjoint classes: Enhancing tumor (ET), Peritumoral
edematous tissue (ED), and Necrotic tumor core (NCR). The detailed preprocessing and training pipeline
can be found in our previous research [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Model Architectures and Evaluation Metrics</title>
        <p>
          Several state-of-the-art model architectures are tested in this study, including U-Net [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], SegResNet
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], UNETR [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and Swin-UNETR [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. To evaluate the segmentation quality, we used common
metrics, including the Dice coefficient and the 95th percentile Hausdorff distance (HD95).
        </p>
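        <p>For reference, a minimal sketch of these two metrics on binary 3D masks is given below. It is illustrative only, assuming surface-based distances computed with NumPy/SciPy; the function names, smoothing constant, and voxel-spacing handling are assumptions rather than the exact implementation used in our pipeline.</p>
        <preformat>
# Illustrative sketch of Dice and HD95 on binary 3D masks (not the exact
# implementation used in the training pipeline).
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_coefficient(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + 1e-8)

def _surface(mask):
    # boundary voxels: foreground voxels removed by a single erosion step
    return np.logical_and(mask, np.logical_not(binary_erosion(mask)))

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    pred_surf, gt_surf = _surface(pred.astype(bool)), _surface(gt.astype(bool))
    # distance of every voxel to the nearest surface voxel of the other mask
    dt_gt = distance_transform_edt(np.logical_not(gt_surf), sampling=spacing)
    dt_pred = distance_transform_edt(np.logical_not(pred_surf), sampling=spacing)
    d_pred_to_gt = dt_gt[pred_surf]
    d_gt_to_pred = dt_pred[gt_surf]
    return np.percentile(np.concatenate([d_pred_to_gt, d_gt_to_pred]), 95)
        </preformat>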
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Contrast-Level Shapley Value</title>
        <p>Consider a training dataset comprised of the pairs $\{(X_i, Y_i)\}_{i=1}^{N}$, where $X_i \in \mathbb{R}^{4 \times H \times W \times D}$ represents the
four 3D MRI contrasts as a multi-channel input and $Y_i \in \mathbb{R}^{3 \times H \times W \times D}$ represents the associated one-hot
encoded segmentation mask with the three tumor labels ED, NCR, and ET, as described in Section 2.1. The
deep learning models $f(\cdot)$ were trained to predict the tumor labels $\hat{Y}_i$ given the input $X_i$:
$\hat{Y}_i = f(X_i)$.</p>
        <p>
          Derived from the Shapley value [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], the contrast-level Shapley value $\phi_c(\mathcal{M})$ was then evaluated
with respect to each specific metric $\mathcal{M}$ as:
$$\phi_c(\mathcal{M}) = \sum_{S \subseteq C \setminus \{c\}} \frac{|S|!\,(|C| - |S| - 1)!}{|C|!} \Big( \mathcal{M}(S \cup \{c\}) - \mathcal{M}(S) \Big),$$
where $C$ is the set of all MRI contrasts; $|C|$ is the total number of contrasts; $S$ is a subset of MRI
contrasts excluding a given contrast $c$ ($S \subseteq C \setminus \{c\}$); $|S|$ is the number of contrasts in $S$; and $\mathcal{M}(S)$ is
the target metric evaluated on the subset $S$.</p>
        <p>The contrast-level Shapley values are examined to assess whether observed differences (group means
and variances) across folds or between models are statistically significant. Test for equal variance:
Levene's test is applied to assess homogeneity of variance even when the normality assumption cannot
be guaranteed. Test for equal mean: if the normality assumption cannot be guaranteed, the
Kruskal-Wallis test is used instead of ANOVA, and Dunn's test is applied for post-hoc analysis instead of Tukey's
test. Confidence interval of the difference: if a significant difference in means is observed, we
further generate the confidence interval of the mean difference between groups when the normality
assumption is not violated.</p>
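        <p>A minimal sketch of this computation over the four contrasts is given below. The masking strategy for excluded contrasts (zeroing the channel) and the evaluate_metric, model, and dice placeholders are illustrative assumptions rather than the exact procedure of our pipeline.</p>
        <preformat>
# Illustrative sketch of the contrast-level Shapley value defined above,
# assuming excluded contrasts are masked by zeroing their input channel.
from itertools import combinations
from math import factorial

CONTRASTS = ["t1n", "t1c", "t2w", "t2f"]

def contrast_shapley(evaluate_metric, contrasts=CONTRASTS):
    """evaluate_metric(subset) returns the metric (e.g., Dice) when only the
    contrasts in `subset` are kept as model input (the others are masked)."""
    n = len(contrasts)
    shapley = {}
    for c in contrasts:
        others = [x for x in contrasts if x != c]
        phi = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (evaluate_metric(set(subset) | {c})
                                 - evaluate_metric(set(subset)))
        shapley[c] = phi
    return shapley

def make_evaluator(model, image, gt, dice):
    """Hypothetical helper: `model`, `image` (4, H, W, D), `gt`, and `dice`
    stand in for the trained network, one subject, and the metric."""
    def evaluate(subset):
        masked = image.copy()
        for i, name in enumerate(CONTRASTS):
            if name not in subset:
                masked[i] = 0.0  # assumption: mask a contrast by zeroing it
        return dice(model(masked[None]), gt)  # add a batch dimension
    return evaluate
        </preformat>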
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <p>The contrast-level Shapley values are computed using four model architectures across five data folds.
We define the matrix of contrast-level Shapley values $\Phi^{m}_{\cdot,\cdot,f}(M) \in \mathbb{R}^{4 \times N_f}$ for each combination of
metric $M \in \{\text{Dice}, \text{HD95}\}$, model $m \in \{\text{U-Net}, \text{SegResNet}, \text{UNETR}, \text{Swin-UNETR}\}$, and fold
$f = 1, \ldots, 5$, where the entry $\phi^{m}_{c,i,f}(M)$ represents the Shapley value for the $i$-th subject in fold $f$,
given contrast $c$, model $m$, and metric $M$, and $N_f$ denotes the total number of subjects in fold $f$.
For a given combination $(m, M, f)$, the contrast-wise vector $\mathbf{C}^{m}_{c,f}(M)$ ($c \in \{\text{t1n}, \text{t1c}, \text{t2w}, \text{t2f}\}$)
and the subject-wise vector $\mathbf{S}^{m}_{i,f}(M)$ ($i = 1, \ldots, N_f$) are defined as follows:
$$\mathbf{C}^{m}_{c,f}(M) = \Phi^{m}_{c,\cdot,f}(M) = \big( \phi^{m}_{c,1,f}(M), \phi^{m}_{c,2,f}(M), \ldots, \phi^{m}_{c,N_f,f}(M) \big)^{\top} \in \mathbb{R}^{N_f}, \quad (3)$$
$$\mathbf{S}^{m}_{i,f}(M) = \Phi^{m}_{\cdot,i,f}(M) = \big( \phi^{m}_{\text{t1n},i,f}(M), \phi^{m}_{\text{t1c},i,f}(M), \phi^{m}_{\text{t2w},i,f}(M), \phi^{m}_{\text{t2f},i,f}(M) \big)^{\top} \in \mathbb{R}^{4}. \quad (4)$$</p>
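      <p>A minimal sketch of these two slicing operations is given below, assuming the per-fold Shapley values have already been collected into an array of shape (4, N_f) whose rows follow the contrast order t1n, t1c, t2w, t2f; the variable names are illustrative.</p>
      <preformat>
# Illustrative slicing of the Shapley matrix into the vectors of Eqs. (3)-(4);
# `phi` is an assumed numpy array of shape (4, N_f) for one (m, M, f).
import numpy as np

CONTRASTS = ["t1n", "t1c", "t2w", "t2f"]

def contrast_vector(phi, contrast):
    """C_{c,f}(M): Shapley values of one contrast across all subjects."""
    return phi[CONTRASTS.index(contrast), :]

def subject_vector(phi, subject_idx):
    """S_{i,f}(M): Shapley values of the four contrasts for one subject."""
    return phi[:, subject_idx]
      </preformat>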
      <p>In this study, we utilized four NVIDIA A40 GPUs to train our deep learning model and calculate the
Shapley value. The evaluation time for each fold and model is approximately 1–2 minutes per subject.</p>
      <sec id="sec-3-1">
        <title>3.1. Shapley-based prediction insights: a clustering analysis</title>
        <p>To analyze how segmentation performance overlaps with model weighting of MRI contrasts via
contrast-level Shapley values, we applied k-means clustering. For each model-metric pair $(m, M)$, clustering
was performed on the subject-wise vectors $\mathbf{S}^{m}_{i,f}(M)$ pooled across the five folds, i.e.,
$\bigcup_{f=1}^{5} \bigcup_{i=1}^{N_f} \{\mathbf{S}^{m}_{i,f}(M)\}$.</p>
        <p>We then use UMAP to visualize the clusters of Shapley value embeddings. Figure 2 illustrates
an example with a significant pattern: for U-Net and Swin-UNETR, the Shapley embedding clusters
differentiate subjects with higher Dice scores from those with lower Dice scores.</p>
        <p>[Figure 2: (a) UNETR; (b) SegResNet.]</p>
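        <p>A minimal sketch of this clustering and visualization step is given below, assuming the subject-wise Shapley vectors have been pooled into an array S of shape (num_subjects, 4) with matching Dice scores; the number of clusters and UMAP settings shown are illustrative assumptions, and umap-learn is assumed to be available.</p>
        <preformat>
# Illustrative k-means clustering and UMAP embedding of subject-wise
# contrast-level Shapley vectors (hyperparameters are assumptions).
import numpy as np
from sklearn.cluster import KMeans
import umap  # umap-learn
import matplotlib.pyplot as plt

def cluster_and_embed(S, dice_scores, n_clusters=2, seed=0):
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(S)
    embedding = umap.UMAP(random_state=seed).fit_transform(S)  # 2-D embedding
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dice_scores, cmap="viridis", s=8)
    plt.colorbar(label="Dice")
    plt.title("UMAP of subject-wise contrast-level Shapley vectors")
    return labels, embedding
        </preformat>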
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Shapley-based model prediction consistency: a comparative analysis</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Does each model learn consistent explanations?</title>
          <p>To assess the consistency of explanations across folds for each model, we analyzed the distribution of
$\mathbf{C}^{m}_{c,f}(M)$. The group standard deviation $\sigma$ and mean $\mu$ are key factors for determining distribution
similarity, and statistical tests were applied to these quantities:
$$H_0(\sigma \mid c, m, M) : \sigma(\mathbf{C}^{m}_{c,1}(M)) = \sigma(\mathbf{C}^{m}_{c,2}(M)) = \sigma(\mathbf{C}^{m}_{c,3}(M)) = \sigma(\mathbf{C}^{m}_{c,4}(M)) = \sigma(\mathbf{C}^{m}_{c,5}(M)), \quad (5)$$
$$H_0(\mu \mid c, m, M) : \mu(\mathbf{C}^{m}_{c,1}(M)) = \mu(\mathbf{C}^{m}_{c,2}(M)) = \mu(\mathbf{C}^{m}_{c,3}(M)) = \mu(\mathbf{C}^{m}_{c,4}(M)) = \mu(\mathbf{C}^{m}_{c,5}(M)).$$
If significant differences in mean or standard deviation are found, we conclude that inconsistent
explanations are present across folds for a given combination $(c, m, M)$.</p>
          <p>Since the normality assumption for the Shapley value distribution $\mathbf{C}^{m}_{c,f}(M)$ could not be guaranteed
for some contrasts $c$, as indicated by the normality tests and non-zero skewness (Figure 3), Levene's
test, Kruskal-Wallis, and Dunn's post-hoc tests were applied.</p>
          <p>For all combinations of $(c, m, M)$, we get $p &lt; 0.01$ in all 32 Levene's tests, rejecting $H_0(\sigma \mid c, m, M)$ and
indicating unequal variances across the five folds. Similarly, all 32 Kruskal-Wallis tests yield $p &lt; 0.01$,
rejecting $H_0(\mu \mid c, m, M)$ and suggesting unequal means. These results invalidate the assumption that
"Model $m$ learns consistent explanations across all five folds using contrast $c$ for metric $M$ evaluation,"
indicating significant differences in variance and means for at least one fold pair of each $(c, m, M)$
combination.</p>
          <p>Post-hoc tests are conducted to evaluate which fold pairs $(f, f')$ show consistent explanations under the
following null hypothesis:
$$H_0(\mu \mid c, m, M, (f, f')) : \mu(\mathbf{C}^{m}_{c,f}(M)) = \mu(\mathbf{C}^{m}_{c,f'}(M)), \quad f, f' \in \{1, 2, \ldots, 5\}, \; f \neq f'. \quad (6)$$
Dunn's post-hoc tests reveal no significant differences in the t1n explanation between fold pairs 1 &amp; 5, 2 &amp;
3, 2 &amp; 4, and 4 &amp; 5 for Swin-UNETR, while significant differences exist in all other tests (Table 3). For
example, with $p = 0.038$ in the 1st column of Table 3, the null hypothesis
$\mu(\mathbf{C}^{\text{Swin-UNETR}}_{\text{t1n},1}(M)) = \mu(\mathbf{C}^{\text{Swin-UNETR}}_{\text{t1n},5}(M))$
is not rejected, indicating that "Swin-UNETR learns consistent t1n contrast-level explanations between the
1st and 5th folds."</p>
          <p>[Figure: (a) U-Net; (b) Swin-UNETR.]</p>
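          <p>A minimal sketch of these fold-consistency tests is given below, assuming the per-fold contrast-wise Shapley vectors are available as a list of one-dimensional arrays (one per fold) and that SciPy and scikit-posthocs are available; the significance level shown is the one used above.</p>
          <preformat>
# Illustrative fold-consistency tests for one (c, m, M) combination.
from scipy.stats import levene, kruskal
import scikit_posthocs as sp

def fold_consistency_tests(folds, alpha=0.01):
    """folds: list of 5 arrays, each holding C_{c,f}(M) for one fold f."""
    p_var = levene(*folds).pvalue    # H0: equal variances across folds
    p_mean = kruskal(*folds).pvalue  # H0: equal means (non-parametric)
    dunn_p = sp.posthoc_dunn(list(folds))  # pairwise fold comparisons
    return {
        "equal_variance_rejected": p_var &lt; alpha,
        "equal_mean_rejected": p_mean &lt; alpha,
        "pairwise_p_values": dunn_p,
    }
          </preformat>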
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Do diferent models learn consistent explanations?</title>
          <p>We first visualize the contrast-level Shapley values across all five folds for U-Net, $\mathbf{C}^{\text{U-Net}}_{c,f}(\text{Dice})$, and
Swin-UNETR, $\mathbf{C}^{\text{Swin-UNETR}}_{c,f}(\text{Dice})$, using violin plots in Figure 3. We observe that t1c and t2f
are the most important image contrasts, with the highest contrast-level Shapley values; this finding is
consistent with the clinical explanation that t2f suppresses the cerebrospinal fluid signal, making edema
and infiltration more visible, while t1c provides clear delineation of enhancing tumor (see Section 2.1).
We can also observe from this figure that Swin-UNETR weights t1n significantly higher than U-Net.</p>
          <p>To further investigate how model explanations differ within folds, we follow the procedure
from Section 3.2.1, with the key difference being that we compare results across multiple models while
fixing the fold, unlike the previous tests where the models were fixed:
$$H_0(\sigma \mid c, M, f) : \sigma(\mathbf{C}^{\text{U-Net}}_{c,f}(M)) = \sigma(\mathbf{C}^{\text{SegResNet}}_{c,f}(M)) = \sigma(\mathbf{C}^{\text{UNETR}}_{c,f}(M)) = \sigma(\mathbf{C}^{\text{Swin-UNETR}}_{c,f}(M)),$$
$$H_0(\mu \mid c, M, f) : \mu(\mathbf{C}^{\text{U-Net}}_{c,f}(M)) = \mu(\mathbf{C}^{\text{SegResNet}}_{c,f}(M)) = \mu(\mathbf{C}^{\text{UNETR}}_{c,f}(M)) = \mu(\mathbf{C}^{\text{Swin-UNETR}}_{c,f}(M)). \quad (7)$$
For all combinations of $(c, M, f)$, the assumption that "within each fold $f$, all models learned consistent
explanations when using contrast $c$ for metric $M$" is invalid [Levene's test ($p &lt; 0.01$) and Kruskal-Wallis
test ($p &lt; 0.01$) for all tests]. However, the post-hoc tests do not reveal generalizable patterns across the
models similar to the conclusion we presented in Table 3. To highlight performance differences, we
provide confidence intervals.</p>
          <p>Since the distributions of Shapley values are independent across models and, for each input, the
differences between Shapley values, $\phi^{m}_{c,i,f}(M) - \phi^{m'}_{c,i,f}(M)$ ($m \neq m'$), passed the normality test, we further
assess the difference between models by evaluating the confidence interval $CI_{\alpha}\big(\mu(\mathbf{C}^{(m,m')}_{c,f}(M))\big)$ at
a desired level $\alpha$, where we define
$$\mathbf{C}^{(m,m')}_{c,f}(M) = \big( \phi^{m}_{c,1,f}(M) - \phi^{m'}_{c,1,f}(M), \ldots, \phi^{m}_{c,N_f,f}(M) - \phi^{m'}_{c,N_f,f}(M) \big)^{\top}, \quad (8)$$
with $N_f$ denoting the total number of subjects in fold $f$ from Definition (3).</p>
          <p>Here, we focus on the model difference in t1n, to test the hypothesis that Swin-UNETR has a higher
contrast-level Shapley value compared to other models, indicating a more balanced Shapley value distribution
and less bias toward t1c and t2f. The confidence intervals for the mean difference in Shapley values
(Swin-UNETR minus the other models) indicate a significant positive difference at a confidence level
of 0.95, suggesting that Swin-UNETR places more attention on the t1n contrast (Figure 3).</p>
          <p>To understand how transformer-based models differ from convolutional neural networks, we analyze
cases where the Swin-UNETR model achieves a Dice score at least 20% higher than U-Net and vice
versa. Specifically, we examine cases where the Swin-UNETR model achieves a Dice score 25% higher
than U-Net (Figure 4), and U-Net achieves a Dice score 23% higher than Swin-UNETR (Figure 4).
This comparison highlights the advantages and limitations of each architecture in medical image
segmentation tasks.</p>
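          <p>A minimal sketch of the confidence-interval computation in Equation (8) is given below, assuming two aligned arrays of per-subject Shapley values for the same contrast, fold, and metric; the t-based interval is an illustrative choice consistent with the normality check described above.</p>
          <preformat>
# Illustrative confidence interval for the mean per-subject Shapley
# difference (model A minus model B) within one fold, cf. Eq. (8).
import numpy as np
from scipy import stats

def shapley_diff_ci(phi_model_a, phi_model_b, confidence=0.95):
    diff = np.asarray(phi_model_a) - np.asarray(phi_model_b)
    mean = diff.mean()
    sem = stats.sem(diff)  # standard error of the mean difference
    lower, upper = stats.t.interval(confidence, len(diff) - 1, loc=mean, scale=sem)
    return mean, (lower, upper)
          </preformat>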
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>In this study, we systematically investigated the Shapley value for model explanation in multi-contrast
medical image segmentation. Our proposed contrast-level Shapley explainability framework has three
key contributions: (1) It is the first study to use Shapley analysis to explain multi-contrast medical
image segmentation; (2) It is the first paper to analyze how different network structures weigh various
MRI contrasts when making segmentation decisions; (3) It enhances clinical relevance by providing
deeper insights into model performance with aggregate contributions of each MRI contrast in the tumor
segmentation process, which is inherently interpretable by neuroradiologists, as they detect lesions by
analyzing differences between different MRI contrasts in clinical practice.</p>
      <p>[Figure: GT, U-Net, and Swin-UNETR segmentations.]</p>
      <p>Specifically, the contrast-level Shapley value reveals the (in)consistency of each model’s explanations.
The statistics indicate that Swin-UNETR is the most robust among all tested architectures. Despite
being trained on different folds, Swin-UNETR consistently learns invariant representations across data
subsets, whereas other models show variations in their explanations across folds (Table 1).</p>
      <p>Moreover, the contrast-level Shapley value provides insights into the differences among model
architectures. As shown in Figure 3, the model explanations indicate that U-Net exhibits a bias toward
features from t1c and t2f, while Swin-UNETR distributes its explanations more evenly across contrasts.
This was further confirmed by comparing t1n Shapley values across different models, which revealed
statistically higher Shapley values for Swin-UNETR (Table 3).</p>
      <p>We also present a case in Figure 4 to demonstrate how explanations of different models could provide
key insights into model failure. As discussed before, the training data includes three different tumor subtypes
(see Section 2.1). The innermost component of the tumor (shown in red in Figure 4) is necrotic tissue in
glioblastoma and meningioma; however, in metastasis, the innermost component is defined as
any tumor component that is not enhancing (but not necrotic). This implies that in t2f images, the
necrotic core will appear dark, but the non-enhancing metastatic tumor core and edema will appear bright.</p>
      <p>Due to its dependence on the contrasts with the highest intensity differences, namely t1c and t2f, the
U-Net architecture fails to accurately capture the innermost component (NCR). This suggests a potential
bias towards t1c and t2f, as indicated by the distributions of $\mathbf{C}^{m}_{\text{t1c},f}(\text{Dice})$ and $\mathbf{C}^{m}_{\text{t2f},f}(\text{Dice})$ exhibiting a
significantly higher central tendency compared to $\mathbf{C}^{m}_{\text{t1n},f}(\text{Dice})$ and $\mathbf{C}^{m}_{\text{t2w},f}(\text{Dice})$ across all folds $f$ and
models $m \in \{\text{U-Net}, \text{SegResNet}, \text{UNETR}, \text{Swin-UNETR}\}$, as shown in Figure 2 and supported by
statistical tests in Section 3.2. This bias may contribute to confusion with edema prediction, causing
over-prediction relying on t2f (edema appears bright, as shown in Figure 4). However, Swin-UNETR
effectively learns both local and global relationships within different contrasts through its self-attention
mechanism, and was able to more accurately localize the tumor core in this challenging case.</p>
      <p>Finally, for this case, we provide a comparison between GradCAM and our proposed contrast-level
Shapley values. As seen in Figure 4, the pixel-level explanations provided by GradCAM on each MRI contrast show
model differences in terms of the pixel-level features used. The heatmap of Swin-UNETR is smoother,
while the heatmap of U-Net highlights only a few regions, but both explanations fail to capture
clinically relevant information regarding contrast-level importance. For example, for Swin-UNETR,
GradCAM exhibits higher attention to t1c compared to t2f. However, the contrast-level Shapley value reveals that
t1c negatively impacts the final Dice score, with a lower impact magnitude compared to t2f.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we propose Contrast Shapley for multi-contrast glioma segmentation. This method
provides a quantitative framework for model explanation, offering insights into the fundamental
characteristics of different deep learning architectures.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Hesamian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          ,
          <article-title>Deep learning techniques for medical image segmentation: achievements and challenges</article-title>
          ,
          <source>Journal of digital imaging 32</source>
          (
          <year>2019</year>
          )
          <fpage>582</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Clinical inspired mri lesion segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2502.16032</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <article-title>Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead</article-title>
          ,
          <source>Nature machine intelligence</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>206</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Van der Velden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Kuijf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Gilhuijs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Viergever</surname>
          </string-name>
          ,
          <article-title>Explainable artificial intelligence (xai) in deep learning-based medical image analysis</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>79</volume>
          (
          <year>2022</year>
          )
          <fpage>102470</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Deep inside convolutional networks: Visualising image classification models and saliency maps</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6034</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <article-title>Grad-cam: Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>arXiv:1705.07874</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>" why should i trust you?" explaining the predictions of any classifier</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Hasany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mériaudeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Petitjean</surname>
          </string-name>
          ,
          <article-title>Misure is all you need to explain your image segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2406.12173</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>U.</given-names>
            <surname>Baid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghodasara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Calabrese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Colak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Farahani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalpathy-Cramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Kitamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pati</surname>
          </string-name>
          , et al.,
          <article-title>The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification</article-title>
          ,
          <source>arXiv preprint arXiv:2107.02314</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Honey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rebala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kurt</surname>
          </string-name>
          ,
          <article-title>An optimization framework for processing and transfer learning for the brain tumor segmentation</article-title>
          ,
          <source>arXiv:2402.07008</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. H.</given-names>
            <surname>Rivera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Rebala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Honey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kurt</surname>
          </string-name>
          ,
          <article-title>Re-DiffiNet: Modeling discrepancy in tumor segmentation using diffusion models</article-title>
          ,
          <source>in: Medical Imaging with Deep Learning</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Çiçek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdulkadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Lienkamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <article-title>3d u-net: Learning dense volumetric segmentation from sparse annotation</article-title>
          ,
          <source>arXiv preprint arXiv:1606.06650</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Myronenko</surname>
          </string-name>
          ,
          <article-title>3d mri brain tumor segmentation using autoencoder regularization</article-title>
          ,
          <source>in: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part II</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hatamizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images</article-title>
          , in: International MICCAI Brainlesion Workshop, Springer,
          <year>2021</year>
          , pp.
          <fpage>272</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hatamizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Myronenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Landman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Unetr: Transformers for 3d medical image segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF winter conference on applications of computer vision</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>574</fpage>
          -
          <lpage>584</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>