Assesing the Interpretability of the Statistical Radiomic Features via Image Saliency Maps in Medical Image Classification Tasks Oleksandr Davydko1,* 1 Technological University Dublin Abstract The presented research aims to improve the interpretability of medical image classification models trained with statistical radiomic features. While showing classification results comparable with state- of-the-art convolutional neural network models, statistical radiomic features’ interpretability is still understudied. Neural network models use saliency map approaches to provide a human operator with intuitive visualisation of the model’s attention, but statistical radiomic-based models still have no such tools developed. This research aims to eliminate this gap and allow the saliency map generation for models trained with statistical radiomic features. Preliminary results show that the proposed approach may generate faithful saliency maps for the ResNet-50 classification model trained the first-order statistical radiomic features. Keywords Medical image classification, Texture analysis, Statistical radiomic features, Saliency maps 1. Context and motivation 1975 Haralik et al. introduced a way to obtain high-order image features from texture. These features describe the image texture properties using statistics. They allowed the large image to be compressed into a set of 14 features, which allowed solving the image classification tasks, as computation capabilities were limited then. Even though those feature-forming methods were originally developed for aerial photograph and satellite image classification, the most popular application of those methods is medical image classification. Even after introducing high-performance GPUs and convolutional neural networks, some research like [1] still indicate that using statistical radiomic features can show near state-of-the-art results in medical image classification tasks. However, a lack of classification result interpretability could be a reason to choose automatically extracted features instead of statistical radiomic. In the current literature, the interpretability of statistical radiomic features is mainly addressed by attributing ad-hoc high-order statistical radiomic features, while neural network solutions utilize saliency map methods, which produce much more understandable Late-breaking work, Demos and Doctoral Consortium, colocated with The 2nd World Conference on eXplainable Artificial Intelligence: July 17–19, 2024, Valletta, Malta * Corresponding author. $ d22125337@mytudublin.ie (O. Davydko) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings visual explanations than standard numerical feature importance. This research aims to introduce a method for generating a saliency map when a classification model is trained with statistical radiomic features, subsequently improving the explainability of statistical-radiomic-based classification models for medical images. 2. Related work The statistical radiomic features were applied to solve many different medical image classification tasks, showing near state-of-the-art classification performance. Authors of the research [2] have attempted to fuse grey-level co-occurrence matrix (GLCM), grey-level run length matrix (GLRLM), and segmentation-based fractal texture analysis (SFTA) features to detect the COVID-19 lesions presence on the chest X-ray images. The fusion of features combined with feature selection done by principle components analysis allowed reaching 0.94 F1-score while distinguishing between healthy and COVID pneumonia lung images. Similar results were observed in the same task when using first-order statistics (FOS), GLCM, GLRLM, and grey-level size zone matrix (GLSZM) feature extraction methods in the work [1]. Here, the authors report the F1-score at a 0.98 rate. In research [3] authors report 0.975 accuracy while performing a classification of brain tumors on magnetic resonance images (MRI) Current advances in the interpretability of radiomic-based models mostly include interpreting importances of used high-order radiomic features. Authors of the research in [4] use SHAP [5], which allowed them to identify the most influential features. In another research work [6], authors interpret radiomic feature groups importances by analysing logistic regression coefficients. A study [7] uses SHAP to reveal the most influential features to diagnose schizophrenia by brain magnetic resonance images (MRI). The authors of the work [8] use the same technique to find the connection between particular features and panic disorder signs. At the same time, researchers utilise saliency map methods such as Integrated Gradients [9], layer-wise propagation [10], DeepLIFT [11], GradCAM [12] for interperting convolutional neural network predictions. The saliency map is much easier to understand from the point of view of human perception. For radiomic-based models, a little work discusses some analogs of saliency maps. In the research [13], authors discuss the interpretability of tumor tissue signature identification when local statistical radiomic features are used. The problem of interpretability was tackled by visualizing feature activation maps for a single high-order feature. It can be stated, that the problem of statistical radiomic features interpretability is definitely in focus of researchers but requires further investigations. The main problem to investigate is the generation of understandable image saliency maps for radiomic-based models. 3. Design and methodology This research describes methods for saliency map generation of the first- and high-order statistical radiomics features. 3.1. Mapping first-order radiomic features’ attributions First-order statistical radiomic features are formed by computing the frequency of some texture substructure appearing in the image. Some examples of such substructures are: 1. Pair of pixels with intensities 𝑖, 𝑗 2. Run-length of pixels with the same intensity 𝑖 3. Cluster of connected pixels with same intensity 𝑖 A proposed method of saliency map generation includes the computation of features’ attributions and subsequently adding those attribution values to pixels involved in forming particular feature values (figure 1). Regarding the size of the statistical radiomics features matrices (256x256 at least), it is proposed to use convolutional neural networks to build the classifiers and gradient- based methods to obtain attributions (such as Integrated Gradients). Figure 1: A visualisation of the first-order statistical features’ attributions mapping to saliency map values 3.2. Mapping high-order radiomic features’ attributions A mapping of high-order statistical radiomic features could be defined as an extension of the first-order features mapping procedure. It is a noticeable fact that all high-order feature formulas are represented with differentiable functions. That allows the usage of gradient-based methods (such as Integrated Gradients, GradCAM, DeepLIFT) directly to obtain first-order feature attributions and subsequently map them to medical image pixel attributions by applying procedure from section 3.1. The graphical representation of this process is displayed in figure 2. This approach is feasible if the classification model is of a fully-connected neural network type. Known model-agnostic methods such as SHAP cannot be used to attribute the massive number (at least 65536 for GLCM and more for other methods) of the first-order features due to slowness in the calculation process, which leaves the usage of other classification models as an open question out of this research scope. 3.3. Experiment The experiment is designed to test the faithfulness of the saliency maps generated by meth- ods described in 3.2 and 3.2 and compare them with those generated by existing methods in Figure 2: A visualisation of the high-order statistical features’ attributions mapping to saliency map values different classification tasks on X-ray and MRI images. The experiment’s scheme is presented in the figure 3. The described experiment is run separately with different datasets to test the generalisability of the proposed approach. Figure 3: The experiment scheme. High-order radiomic features are computed from first-order statistical radiomic features (A) with deep learning (B) or ad-hoc formulas (C) and become an input for the classification of the multi-layer perceptron model (D). Attributions for the features are computed with GradCAM, DeepLIFT, and Integrated Gradients (E) and mapped to the image saliency map (F). The same is performed for plain image models (G and H). 3.3.1. Data Preparation Each image in the dataset is converted into greyscale if needed, as statistical radiomic features are defined only for greyscale textures in the current literature. Images are left intact for the baseline pipeline (steps G and H in the figure 3). 3.3.2. Feature extraction First-order statistical radiomic features are extracted with GLCM, GLRLM, GLSZM, grey-level dependency matrix (GLDM), and neighboring grey-tone difference matrix (NGTDM) methods into matrices, as described in works [14, 15, 16, 17, 18]. While calculating statistical radiomics features, background pixels are not taken into account. The parameters for the mentioned methods are to be found with a hyperparameter search procedure. High-order statistical radiomics features are extracted out of matrices containing first-order radiomic features with formulas defined in [14, 15, 16, 17, 18] and concatenated into a single feature vector. The list of features calculated is provided in table 1 Table 1 The high-order statistical radiomic features employed in this research Method Features Total Reference GLCM Angular Second Moment, Contrast, Correlation, Sum of Squares, Inverse 13 [14] Difference Moment, Sum Average, Sum Variance, Sum Entropy, Entropy, Difference Variance, Difference Entropy, Information Measures of Correla- tion (2x) GLRLM Short Run Emphasis, Long Run Emphasis, Grey Level Non-Uniformity, 16 [15] Gray Level Non-Uniformity, Run Length Non-Uniformity, Run Length Non- Uniformity Normalized, Run Percentage, Grey Level Variance, Run Variance, Run Entropy, Low Gray Level Run Emphasis, High Gray Level Run Emphasis, Short Run Low Gray Level Emphasis, Short Run High Gray Level Emphasis, Long Run Low Gray Level Emphasis, Long Run High Gray Level Emphasis GLSZM Small Area Emphasis, Large Area Emphasis, Gray Level Non-Uniformity, 16 [16] Gray Level Non-Uniformity Normalized, Size-Zone Non-Uniformity, Size- Zone Non-Uniformity Normalized, Zone Percentage, Gray Level Variance, Zone Variance, Zone Entropy, Low Gray Level Zone Emphasis, High Gray Level Zone Emphasis, Small Area Low Gray Level Emphasis, Small Area High Gray Level Emphasis, Large Area Low Gray Level Emphasis, Large Area High Gray Level Emphasis GLDM Small Dependence Emphasis, Large Dependence Emphasis, Gray Level Non- 14 [17] Uniformity, Dependence Non-Uniformity, Dependence Non-Uniformity Normalized, Gray Level Variance, Dependence Variance, Dependence En- tropy, Low Gray Level Emphasis, High Gray Level Emphasis, Small De- pendence Low Gray Level Emphasis, Small Dependence High Gray Level Emphasis, Large Dependence Low Gray Level Emphasis, Large Dependence High Gray Level Emphasis NGTDM Coarseness, Contrast, Busyness, Complexity, Strength 5 [18] Each of the described features has its unique formula for calculation. For example, a formula for the Contrast feature of the GLCM matrix is defined as: 𝑁𝑔 𝑁𝑔 ∑︁ ∑︁ Contrast = (𝑖 − 𝑗)2 𝐺𝐿𝐶𝑀 (𝑖, 𝑗) 𝑖=1 𝑗=1 Here 𝑁 𝑔 - number of grey shades in the image (usually equals 256), 𝐺𝐿𝐶𝑀 - matrix containing grey-level co-occurrence first-order radiomic features. 3.3.3. Classification models and their training Three types of classification models are trained. A first type receives first-order statistical radiomic feature matrices. Models of the first type are represented with convolutional neural networks. In this particular experiment, ResNet-50, VGG-16, and EfficientNet architectures are used. Models of the second type are represented with special architecture, which implements high-order statistical formulas from [14, 15, 16, 17, 18]. A multi-layer perceptron follows the implementation of the block with formulas along with a sigmoid or softmax layer. The baseline models receive plain images as input and are represented by ResNet-50, VGG-16, and EfficientNet architectures. Each classification model of every type is trained from scratch. The training is held until the F1-score on the development set stops to improve for ten subsequent epochs. Adam algorithm is used as an optimizer with a 1𝑒 − 4 learning rate. 3.3.4. Statistical radiomic feature attribution and saliency maps As all classification models described in 3.3.3 are represented by neural networks, is is possible to use gradient-based methods for feature attributing. In this research, it is proposed to use Integrated Gradients, GradCAM, and DeepLIFT without any modifications. Subsequently, for radiomic-based models, an additional mapping, which is described in section 3.1, is conducted to obtain the saliency map. For plain images, attributions may be used as ready saliency map without additional transformations. 3.3.5. Evaluation The classification models’ performance are measured with accuracy and F1-score metrics. The faithfulness of the saliency maps is measured numerically by Increase-In-Confidence and Average Drop metrics [19] and compared to the same metrics for plain statistical radiomic features attributions. Additionally, the same evaluation is conducted with Insertion Correlation (IC) and Deletion Correlation (DC) metrics [20] as they also taking into account magnitudes of saliency values. However, for IC and DC some iterations in the computation procedure will be merged to drastically reduce the number of iterations, as the 256x256 saliency map requires more than 3000 predictions to compute these metrics. 3.3.6. Datasets for experiment Schenzen tuberculosis open-access dataset [21] contains 662 x-ray scans. The dataset is balanced; there are 326 images of the healthy lungs and 336 images of the lung with the signs of tuberculosis. No additional transformations are applied to this dataset, except those described in 3.3.1. During the experiment, the task of distinguishing between healthy and tuberculosis lungs was assessed with this dataset.COVIDx CXR-4 dataset [22] contains 84,818 chest X-ray scans. There are 65,681 scans containing COVID-19 lesions, and 19,137 are healthy lungs. Test and validation sets for this dataset were formed balanced while leaving the train set unbalanced to ensure faithful classification metrics. No additional transformations are applied to this dataset, except those described in 3.3.1. During the experiment, the task of distinguishing between healthy and COVID-19-damaged lungs was solved with this dataset. Cancer-Net BCa contains 253 breast MRI scans with evidence of breast cancer. During the experiment, the task of full remission prediction is considered. 4. Research question and hypothesis This research aims to answer the next question: how a medical image saliency map could be generated to explain a classification result when a classification model is trained with the statistical radiomic features? According to this, the research hypothesis may be defined as follows: Research hypothesis IF neural network trained with first- or high-order statistical radiomic features to perform the medical image classification AND aforementioned features attributed with Integrated Gradients, GradCAM, DeepLIFT AND image saliency map generated with proposed mapping method THEN Increase-In-Confidence, Average Drop, Insertion Correlation, and Deletion Correlation metrics for generated image saliency maps will be at least the same or statistically significantly higher as for direct feature attributions. 5. Preliminary results The preliminary results, described in the author’s previous work [23], indicate that the saliency maps generated method described in 3.1 could be considered faithful in terms of a numerical evaluation, maintaining Increase-In-Confidence metric at 50% − 80% level and Average Drop at 10% − 38%. Also, results indicate that the ResNet-50 classification model trained with only first-order statistical radiomic features yields the same classification quality as the ResNet-50 model with raw image input, indicating that results are eligible for practical usage. 6. Expected final contribution to knowledge The final contribution of the described research is expected to be a method for visually explaining a classification result via saliency maps when the classification model is trained with first- or high-order statistical radiomic features. The newly proposed method should allow for the explanation and validation of the results of the previous work which uses statistical radiomic features to solve medical image classification problems. References [1] H. Koyuncu, M. Barstuğan, Covid-19 discrimination framework for x-ray images by considering radiomics, selective information, feature ranking, and a novel hybrid classifier, Signal Processing: Image Communication 97 (2021) 116359. URL: https://www.sciencedirect.com/science/article/pii/S092359652100165X. doi:https: //doi.org/10.1016/j.image.2021.116359. [2] Ş. Öztürk, U. Özkaya, M. Barstuğan, Classification of coronavirus (covid-19) from x-ray and ct images using shrunken features, International Journal of Imaging Systems and Technology 31 (2020) 5–15. doi:10.1002/ ima.22469. [3] N. Zulpe, V. Pawar, Glcm textural features for brain tumor classification, IJ CSI 9 (2012) 354–359. [4] J.-Y. Ye, P. Fang, Z.-P. Peng, X.-T. Huang, J.-Z. Xie, X.-Y. Yin, A radiomics-based interpretable model to predict the pathological grade of pancreatic neuroendocrine tumors, European Radiology 34 (2023) 1994–2005. doi:10.1007/s00330-023-10186-1. [5] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 4765–4774. [6] M. R. Orton, E. Hann, S. J. Doran, S. T. C. Shepherd, D. Ap Dafydd, C. E. Spencer, J. I. López, V. Albarrán- Artahona, F. Comito, H. Warren, J. Shur, C. Messiou, J. Larkin, S. Turajlic, D.-M. Koh, Interpretability of radiomics models is improved when using feature group selection strategies for predicting molecular and clinical targets in clear-cell renal cell carcinoma: insights from the tracerx renal study, Cancer Imaging 23 (2023). doi:10.1186/s40644-023-00594-3. [7] M. Bang, J. Eom, C. An, S. Kim, Y. W. Park, S. S. Ahn, J. Kim, S.-K. Lee, S.-H. Lee, An interpretable multiparametric radiomics model for the diagnosis of schizophrenia using magnetic resonance imaging of the corpus callosum, Translational Psychiatry 11 (2021). doi:10.1038/s41398-021-01586-2. [8] M. Bang, Y. W. Park, J. Eom, S. S. Ahn, J. Kim, S.-K. Lee, S.-H. Lee, An interpretable radiomics model for the diagnosis of panic disorder with or without agoraphobia using magnetic resonance imaging, Journal of Affective Disorders 305 (2022) 47–54. doi:10.1016/j.jad.2022.02.072. [9] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, JMLR.org, 2017, p. 3319–3328. [10] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, K.-R. Müller, Layer-Wise Relevance Propagation: An Overview, in: W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, K.-R. Müller (Eds.), Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700, Springer International Publishing, Cham, 2019, pp. 193–209. URL: http://link.springer.com/10.1007/978-3-030-28954-6_10. doi:10.1007/978-3-030-28954-6_ 10, series Title: Lecture Notes in Computer Science. [11] A. Shrikumar, P. Greenside, A. Kundaje, Learning Important Features Through Propagating Activation Differ- ences, 2019. URL: http://arxiv.org/abs/1704.02685, arXiv:1704.02685 [cs]. [12] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626. doi:10.1109/ICCV.2017.74. [13] D. Vuong, S. Tanadini-Lang, Z. Wu, R. Marks, J. Unkelbach, S. Hillinger, E. I. Eboulet, S. Thierstein, S. Peters, M. Pless, M. Guckenberger, M. Bogowicz, Radiomics Feature Activation Maps as a New Tool for Signature Interpretability, Frontiers in Oncology 10 (2020) 578895. URL: https://www.frontiersin.org/articles/10.3389/ fonc.2020.578895/full. doi:10.3389/fonc.2020.578895. [14] R. M. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification, IEEE Transactions on Systems, Man, and Cybernetics SMC-3 (1973) 610–621. doi:10.1109/TSMC.1973.4309314. [15] M. M. Galloway, Texture analysis using gray level run lengths, Computer Graphics and Image Processing 4 (1975) 172–179. doi:10.1016/S0146-664X(75)80008-6. [16] G. Thibault, B. FERTIL, C. Navarro, S. Pereira, N. Lévy, J. Sequeira, J.-L. Mari, Texture indexes and gray level size zone matrix application to cell nuclei classification, 2009. [17] C. Sun, W. G. Wee, Neighboring gray level dependence matrix for texture classification, Computer Vision, Graphics, and Image Processing 23 (1983) 341–352. doi:10.1016/0734-189X(83)90032-4. [18] M. Amadasun, R. King, Textural features corresponding to textural properties, IEEE Transactions on Systems, Man, and Cybernetics 19 (1989) 1264–1274. doi:10.1109/21.44046. [19] A. Chattopadhay, A. Sarkar, P. Howlader, V. N. Balasubramanian, Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 839–847. doi:10.1109/WACV.2018.00097. [20] T. Gomez, T. Fréour, H. Mouchère, Metrics for Saliency Map Evaluation of Deep Learning Explanation Methods, Springer International Publishing, 2022, p. 84–95. doi:10.1007/978-3-031-09037-0\_8. [21] S. Jaeger, S. Candemir, S. Antani, Y.-X. J. Wáng, P.-X. Lu, G. Thoma, Two public chest x-ray datasets for computer-aided screening of pulmonary diseases, Quant. Imaging Med. Surg. 4 (2014) 475–477. [22] Y. Wu, H. Gunraj, C.-e. A. Tai, A. Wong, COVIDx CXR-4: An Expanded Multi-Institutional Open-Source Benchmark Dataset for Chest X-ray Image-Based Computer-Aided COVID-19 Diagnostics, 2023. URL: http: //arxiv.org/abs/2311.17677, arXiv:2311.17677 [cs, eess]. [23] O. Davydko, V. Pavlov, P. Biecek, L. Longo, SRFAMap: A Method for Mapping Integrated Gradients of a CNN Trained with Statistical Radiomic Features to Medical Image Saliency Maps, Springer Nature Switzerland, 2024, p. 3–23. URL: http://dx.doi.org/10.1007/978-3-031-63803-9_1. doi:10.1007/978-3-031-63803-9_1.