<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1186/s40644-023-00594-3</article-id>
      <title-group>
        <article-title>Assessing the Interpretability of the Statistical Radiomic Features via Image Saliency Maps in Medical Image Classification Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksandr Davydko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technological University Dublin</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The presented research aims to improve the interpretability of medical image classification models trained with statistical radiomic features. While showing classification results comparable with state-of-the-art convolutional neural network models, the interpretability of statistical radiomic features is still understudied. Neural network models use saliency map approaches to provide a human operator with an intuitive visualisation of the model's attention, but statistical-radiomic-based models still have no such tools. This research aims to eliminate this gap and allow saliency map generation for models trained with statistical radiomic features. Preliminary results show that the proposed approach may generate faithful saliency maps for a ResNet-50 classification model trained with first-order statistical radiomic features.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical image classification</kwd>
        <kwd>Texture analysis</kwd>
        <kwd>Statistical radiomic features</kwd>
        <kwd>Saliency maps</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Context and motivation</title>
      <p>Saliency maps provide more intuitive
visual explanations than standard numerical feature importance.</p>
      <p>This research aims to introduce a method for generating a saliency map when a classification
model is trained with statistical radiomic features, subsequently improving the explainability of
statistical-radiomic-based classification models for medical images.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Statistical radiomic features have been applied to solve many different medical image
classification tasks, showing near state-of-the-art classification performance. The authors of
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] attempted to fuse grey-level co-occurrence matrix (GLCM), grey-level run
length matrix (GLRLM), and segmentation-based fractal texture analysis (SFTA) features to
detect the presence of COVID-19 lesions on chest X-ray images. The fusion of features,
combined with feature selection by principal component analysis, reached an F1-score of 0.94
while distinguishing between healthy and COVID pneumonia lung images. Similar
results were observed in the same task when using first-order statistics (FOS), GLCM, GLRLM,
and grey-level size zone matrix (GLSZM) feature extraction methods in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; there, the
authors report an F1-score of 0.98. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the authors report 0.975 accuracy while
classifying brain tumors on magnetic resonance images (MRI).
      </p>
      <p>
        Current advances in the interpretability of radiomic-based models mostly concern interpreting
the importance of the high-order radiomic features used. The authors of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] use SHAP [5],
which allowed them to identify the most influential features. In another work [6], the
authors interpret the importance of radiomic feature groups by analysing logistic regression
coefficients. A study [7] uses SHAP to reveal the most influential features for diagnosing
schizophrenia from brain magnetic resonance images (MRI). The authors of [8] use the
same technique to find the connection between particular features and panic disorder signs.
      </p>
      <p>At the same time, researchers utilise saliency map methods such as Integrated Gradients [9],
layer-wise relevance propagation [10], DeepLIFT [11], and GradCAM [12] for interpreting convolutional
neural network predictions. A saliency map is much easier to understand from the point of
view of human perception. For radiomic-based models, only a little work discusses analogs of
saliency maps. In [13], the authors discuss the interpretability of tumor tissue signature
identification when local statistical radiomic features are used; the problem of interpretability
was tackled by visualizing feature activation maps for a single high-order feature.</p>
      <p>It can be stated that the problem of statistical radiomic feature interpretability is definitely
in the focus of researchers but requires further investigation. The main problem to investigate is
the generation of understandable image saliency maps for radiomic-based models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Design and methodology</title>
      <p>This research describes methods for saliency map generation for models trained with first- and
high-order statistical radiomic features.</p>
      <sec id="sec-3-1">
        <title>3.1. Mapping first-order radiomic features’ attributions</title>
        <p>First-order statistical radiomic features are formed by computing the frequency of some texture
substructure appearing in the image. Some examples of such substructures are:
1. A pair of pixels with intensities <italic>i</italic>, <italic>j</italic>;
2. A run of pixels with the same intensity <italic>i</italic>;
3. A cluster of connected pixels with the same intensity <italic>i</italic>.
The proposed method of saliency map generation includes the computation of feature attributions
and subsequently adding those attribution values to the pixels involved in forming the particular
feature values (figure 1). Given the size of the statistical radiomic feature matrices (256×256 at
least), it is proposed to use convolutional neural networks to build the classifiers and
gradient-based methods (such as Integrated Gradients) to obtain attributions.</p>
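        <p>The mapping step described above can be sketched for GLCM-based features. This is a minimal
illustrative sketch, not the paper's implementation: the function name
<monospace>glcm_attributions_to_saliency</monospace>, the single-offset GLCM, and the rule of adding
a bin's attribution to both pixels of each co-occurrence pair are assumptions made here for
illustration only.</p>

```python
import numpy as np

def glcm_attributions_to_saliency(image, attributions, offset=(0, 1)):
    """Distribute per-bin GLCM attributions back onto the pixels that
    formed each co-occurrence pair (one possible reading of the mapping
    step; the exact accumulation rule in the paper may differ)."""
    h, w = image.shape
    dy, dx = offset
    saliency = np.zeros((h, w), dtype=float)
    # visit every pixel pair that contributed a count to some GLCM bin (i, j)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            i, j = image[y, x], image[y + dy, x + dx]
            a = attributions[i, j]
            # both pixels of the pair took part in forming bin (i, j)
            saliency[y, x] += a
            saliency[y + dy, x + dx] += a
    return saliency
```

        <p>Each pixel's saliency is thus the sum of the attributions of every feature bin it helped to form.</p>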
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Mapping high-order radiomic features’ attributions</title>
        <p>A mapping of high-order statistical radiomic features can be defined as an extension of
the first-order feature mapping procedure. Notably, all high-order feature
formulas are represented by differentiable functions. That allows gradient-based
methods (such as Integrated Gradients, GradCAM, and DeepLIFT) to be used directly to obtain
first-order feature attributions, which are subsequently mapped to medical image pixel attributions
by applying the procedure from section 3.1. The graphical representation of this process is displayed
in figure 2. This approach is feasible if the classification model is of a fully-connected neural
network type. Known model-agnostic methods such as SHAP cannot be used to attribute the massive
number of first-order features (at least 65536 for GLCM and more for other methods) due to the
slowness of the calculation process, which leaves the usage of other classification models as an
open question outside this research's scope.</p>
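        <p>The differentiability claim can be illustrated with the Contrast feature: its partial
derivative with respect to each GLCM bin <italic>p</italic>(<italic>i</italic>, <italic>j</italic>)
is (<italic>i</italic> - <italic>j</italic>)<sup>2</sup> in closed form. A minimal numpy sketch
(function names are illustrative) verifying the closed-form gradient against a finite difference:</p>

```python
import numpy as np

def contrast(p):
    """GLCM Contrast: sum over bins of (i - j)^2 * p(i, j)."""
    n = p.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return float(np.sum((i - j) ** 2 * p))

def contrast_grad(p):
    """Closed-form gradient: d Contrast / d p(i, j) = (i - j)^2,
    since Contrast is linear in each bin."""
    n = p.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return ((i - j) ** 2).astype(float)

p = np.random.default_rng(0).random((4, 4))
g = contrast_grad(p)

# finite-difference check of a single bin
eps = 1e-6
p2 = p.copy()
p2[1, 3] += eps
fd = (contrast(p2) - contrast(p)) / eps
```

        <p>The agreement between the analytic and numerical gradients is what makes the direct
gradient-based attribution of first-order feature matrices possible.</p>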
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experiment</title>
        <p>The experiment is designed to test the faithfulness of the saliency maps generated by the
methods described in 3.1 and 3.2 and compare them with those generated by existing methods in
different classification tasks on X-ray and MRI images. The experiment’s scheme is presented
in figure 3. The described experiment is run separately with different datasets to test the
generalisability of the proposed approach.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Data Preparation</title>
          <p>Each image in the dataset is converted into greyscale if needed, as statistical radiomic features
are defined only for greyscale textures in the current literature. Images are left intact for the
baseline pipeline (steps G and H in figure 3).</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Feature extraction</title>
          <p>First-order statistical radiomic features are extracted with GLCM, GLRLM, GLSZM, grey-level
dependency matrix (GLDM), and neighboring grey-tone difference matrix (NGTDM) methods
into matrices, as described in works [14, 15, 16, 17, 18]. While calculating statistical radiomic
features, background pixels are not taken into account. The parameters for the mentioned
methods are to be found with a hyperparameter search procedure. High-order statistical
radiomic features are extracted from the matrices containing first-order radiomic features with
the formulas defined in [14, 15, 16, 17, 18] and concatenated into a single feature vector. The list
of features calculated is provided in table 1.</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption>
              <p>High-order statistical radiomic features calculated for each first-order feature matrix.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Method</th><th>Features</th><th>Total</th><th>Reference</th></tr>
              </thead>
              <tbody>
                <tr><td>GLCM</td><td>Angular Second Moment, Contrast, Correlation, Sum of Squares, Inverse Difference Moment, Sum Average, Sum Variance, Sum Entropy, Entropy, Difference Variance, Difference Entropy, Information Measures of Correlation (2x)</td><td>13</td><td>[14]</td></tr>
                <tr><td>GLRLM</td><td>Short Run Emphasis, Long Run Emphasis, Gray Level Non-Uniformity, Gray Level Non-Uniformity Normalized, Run Length Non-Uniformity, Run Length Non-Uniformity Normalized, Run Percentage, Gray Level Variance, Run Variance, Run Entropy, Low Gray Level Run Emphasis, High Gray Level Run Emphasis, Short Run Low Gray Level Emphasis, Short Run High Gray Level Emphasis, Long Run Low Gray Level Emphasis, Long Run High Gray Level Emphasis</td><td>16</td><td>[15]</td></tr>
                <tr><td>GLSZM</td><td>Small Area Emphasis, Large Area Emphasis, Gray Level Non-Uniformity, Gray Level Non-Uniformity Normalized, Size-Zone Non-Uniformity, Size-Zone Non-Uniformity Normalized, Zone Percentage, Gray Level Variance, Zone Variance, Zone Entropy, Low Gray Level Zone Emphasis, High Gray Level Zone Emphasis, Small Area Low Gray Level Emphasis, Small Area High Gray Level Emphasis, Large Area Low Gray Level Emphasis, Large Area High Gray Level Emphasis</td><td>16</td><td>[16]</td></tr>
                <tr><td>GLDM</td><td>Small Dependence Emphasis, Large Dependence Emphasis, Gray Level Non-Uniformity, Dependence Non-Uniformity, Dependence Non-Uniformity Normalized, Gray Level Variance, Dependence Variance, Dependence Entropy, Low Gray Level Emphasis, High Gray Level Emphasis, Small Dependence Low Gray Level Emphasis, Small Dependence High Gray Level Emphasis, Large Dependence Low Gray Level Emphasis, Large Dependence High Gray Level Emphasis</td><td>14</td><td>[17]</td></tr>
                <tr><td>NGTDM</td><td>Coarseness, Contrast, Busyness, Complexity, Strength</td><td>5</td><td>[18]</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Each of the described features has its unique formula for calculation. For example, the formula
for the Contrast feature of the GLCM matrix is defined as:</p>
          <disp-formula>
            <tex-math><![CDATA[\mathrm{Contrast} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i - j)^2 \, p(i, j)]]></tex-math>
          </disp-formula>
          <p>Here <italic>N<sub>g</sub></italic> is the number of grey shades in the image (usually 256), and
<italic>p</italic> is the matrix containing the grey-level co-occurrence first-order radiomic features.</p>
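          <p>The Contrast computation can be checked on a toy image. A minimal sketch assuming an
unnormalised single-offset GLCM; the helper names <monospace>glcm</monospace> and
<monospace>glcm_contrast</monospace> are illustrative, and the sketch normalises the matrix inside
the contrast computation:</p>

```python
import numpy as np

def glcm(image, n_levels, offset=(0, 1)):
    """Grey-level co-occurrence matrix for a single offset (unnormalised)."""
    h, w = image.shape
    dy, dx = offset
    m = np.zeros((n_levels, n_levels), dtype=float)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            m[image[y, x], image[y + dy, x + dx]] += 1
    return m

def glcm_contrast(m):
    """Contrast = sum over bins of (i - j)^2 * p(i, j), with p the
    normalised co-occurrence matrix."""
    n = m.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return float(np.sum((i - j) ** 2 * m / m.sum()))
```

          <p>For a 2×2 image of alternating columns of intensities 0 and 1, every horizontal pair lands
in bin (0, 1), so the normalised matrix puts all mass at a squared intensity difference of 1 and
the Contrast equals 1.</p>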
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Classification models and their training</title>
          <p>Three types of classification models are trained. A first type receives first-order statistical
radiomic feature matrices. Models of the first type are represented with convolutional neural
networks; in this particular experiment, the ResNet-50, VGG-16, and EfficientNet architectures are
used. Models of the second type are represented with a special architecture that implements the
high-order statistical formulas from [14, 15, 16, 17, 18]; a multi-layer perceptron with a sigmoid
or softmax output layer follows the block implementing the formulas. The baseline
models receive plain images as input and are represented by the ResNet-50, VGG-16, and EfficientNet
architectures. Each classification model of every type is trained from scratch. Training continues
until the F1-score on the development set stops improving for ten subsequent epochs. The
Adam algorithm is used as the optimizer with a learning rate of 1e-4.</p>
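          <p>The stopping rule can be sketched as a patience counter on the development-set F1-score.
This is an illustrative sketch, not the actual training code: <monospace>train_epoch</monospace>
and <monospace>eval_f1</monospace> stand in for framework-specific callables.</p>

```python
def train_with_early_stopping(train_epoch, eval_f1, patience=10, max_epochs=1000):
    """Train until the development-set F1-score has not improved for
    `patience` consecutive epochs; returns the best F1 observed."""
    best_f1, epochs_since_best = -1.0, 0
    for epoch in range(max_epochs):
        train_epoch(epoch)          # one pass over the training set
        f1 = eval_f1()              # F1-score on the development set
        if f1 > best_f1:
            best_f1, epochs_since_best = f1, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            break                   # patience exhausted, stop training
    return best_f1
```
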
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Statistical radiomic feature attribution and saliency maps</title>
          <p>As all classification models described in 3.3.3 are represented by neural networks, it is possible
to use gradient-based methods for feature attribution. In this research, it is proposed to use
Integrated Gradients, GradCAM, and DeepLIFT without any modifications. Subsequently, for
radiomic-based models, the additional mapping described in section 3.1 is conducted
to obtain the saliency map. For plain images, the attributions may be used as a ready saliency map
without additional transformations.</p>
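          <p>For reference, the Riemann-sum form of Integrated Gradients can be sketched as follows.
The toy model F(x) = x<sub>0</sub><sup>2</sup> + 2x<sub>1</sub> and the helper names are
illustrative assumptions; in the actual experiment the gradient function would come from the
trained neural network.</p>

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=100):
    """Midpoint Riemann-sum approximation of Integrated Gradients:
    (x - baseline) * average gradient along the straight path."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# toy differentiable "model": F(x) = x0^2 + 2*x1, with analytic gradient
f = lambda x: x[0] ** 2 + 2 * x[1]
grad = lambda x: np.array([2 * x[0], 2.0])

x, base = np.array([1.0, 1.0]), np.zeros(2)
ig = integrated_gradients(grad, x, base)
```

          <p>A useful sanity check is the completeness axiom: the attributions sum to
F(x) - F(baseline).</p>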
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.5. Evaluation</title>
          <p>The classification models’ performance is measured with accuracy and F1-score metrics.
The faithfulness of the saliency maps is measured numerically by the Increase-In-Confidence and
Average Drop metrics [19] and compared to the same metrics for plain statistical radiomic
feature attributions. Additionally, the same evaluation is conducted with the Insertion Correlation
(IC) and Deletion Correlation (DC) metrics [20], as they also take into account the magnitudes of
saliency values. However, for IC and DC some iterations of the computation procedure will be
merged to drastically reduce the number of iterations, as a 256×256 saliency map requires
more than 3000 predictions to compute these metrics.</p>
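          <p>The two primary faithfulness metrics can be sketched directly from their usual definitions:
Average Drop is the mean relative confidence drop when the model is shown only the salient
regions, and Increase-In-Confidence is the share of images whose confidence rises. This is a
minimal sketch; the function names are illustrative, and the exact formulation of [19] should be
followed in the actual evaluation.</p>

```python
import numpy as np

def average_drop(full_scores, masked_scores):
    """Average Drop (%): mean of max(0, Y - O) / Y over images, where Y is
    the class score on the full image and O on the saliency-masked image.
    Lower is better."""
    full = np.asarray(full_scores, dtype=float)
    masked = np.asarray(masked_scores, dtype=float)
    return float(np.mean(np.maximum(full - masked, 0.0) / full) * 100.0)

def increase_in_confidence(full_scores, masked_scores):
    """Increase-In-Confidence (%): share of images whose score rises when
    only the salient regions are kept. Higher is better."""
    full = np.asarray(full_scores, dtype=float)
    masked = np.asarray(masked_scores, dtype=float)
    return float(np.mean(masked > full) * 100.0)
```
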
        </sec>
        <sec id="sec-3-3-6">
          <title>3.3.6. Datasets for experiment</title>
          <p>The Shenzhen tuberculosis open-access dataset [21] contains 662 X-ray scans. The dataset is
balanced: there are 326 images of healthy lungs and 336 images of lungs with signs of tuberculosis.
No additional transformations are applied to this dataset, except those described in 3.3.1. During
the experiment, the task of distinguishing between healthy and tuberculosis lungs was assessed
with this dataset. The COVIDx CXR-4 dataset [22] contains 84,818 chest X-ray scans. There are
65,681 scans containing COVID-19 lesions, and 19,137 are of healthy lungs. The test and validation
sets for this dataset were formed balanced while leaving the training set unbalanced to ensure
faithful classification metrics. No additional transformations are applied to this dataset, except
those described in 3.3.1. During the experiment, the task of distinguishing between healthy and
COVID-19-damaged lungs was solved with this dataset. The Cancer-Net BCa dataset contains 253 breast
MRI scans with evidence of breast cancer. During the experiment, the task of full remission
prediction is considered.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Research question and hypothesis</title>
      <p>This research aims to answer the following question: how can a medical image saliency map
be generated to explain a classification result when the classification model is trained with
statistical radiomic features? Accordingly, the research hypothesis may be defined as
follows:
Research hypothesis IF a neural network is trained with first- or high-order statistical radiomic
features to perform medical image classification AND the aforementioned features are attributed
with Integrated Gradients, GradCAM, or DeepLIFT AND an image saliency map is generated with the
proposed mapping method THEN the Increase-In-Confidence, Average Drop, Insertion Correlation,
and Deletion Correlation metrics for the generated image saliency maps will be at least the same
or statistically significantly higher than for direct feature attributions.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Preliminary results</title>
      <p>The preliminary results, described in the author’s previous work [23], indicate that the saliency
maps generated by the method described in 3.1 can be considered faithful in terms of numerical
evaluation, maintaining the Increase-In-Confidence metric at a 50–80% level and the Average Drop
at 10–38%. Also, the results indicate that a ResNet-50 classification model trained with only
first-order statistical radiomic features yields the same classification quality as a ResNet-50
model with raw image input, indicating that the results are eligible for practical usage.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Expected final contribution to knowledge</title>
      <p>The final contribution of the described research is expected to be a method for visually explaining
a classification result via saliency maps when the classification model is trained with first- or
high-order statistical radiomic features. The newly proposed method should allow for the
explanation and validation of the results of previous works that use statistical radiomic
features to solve medical image classification problems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Koyuncu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barstuğan</surname>
          </string-name>
          ,
          <article-title>Covid-19 discrimination framework for x-ray images by considering radiomics, selective information, feature ranking, and a novel hybrid classifier</article-title>
          ,
          <source>Signal Processing: Image Communication</source>
          <volume>97</volume>
          (
          <year>2021</year>
          )
          <fpage>116359</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S092359652100165X. doi:10.1016/j.image.2021.116359.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ş.</given-names>
            <surname>Öztürk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Özkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barstuğan</surname>
          </string-name>
          ,
          <article-title>Classification of coronavirus (covid-19) from x-ray and ct images using shrunken features</article-title>
          ,
          <source>International Journal of Imaging Systems and Technology</source>
          <volume>31</volume>
          (
          <year>2020</year>
          )
          <fpage>5</fpage>
          -
          <lpage>15</lpage>
          . doi:10.1002/ima.22469.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Zulpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pawar</surname>
          </string-name>
          ,
          <article-title>Glcm textural features for brain tumor classification</article-title>
          ,
          <source>IJCSI 9</source>
          (
          <year>2012</year>
          )
          <fpage>354</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-P.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>A radiomics-based interpretable model to predict the pathological grade of pancreatic neuroendocrine tumors</article-title>
          ,
          <source>European Radiology</source>
          <volume>34</volume>
          (
          <year>2023</year>
          )
          <fpage>1994</fpage>
          -
          <lpage>2005</lpage>
          . doi:10.1007/s00330-023-10186-1.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>