<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>HC@AIxIA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Feature selection through autoencoder filtering and DeepSHAP: an iterative algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edoardo De Rose</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Adornetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Calimeri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluigi Greco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Calabria</institution>
          ,
          <addr-line>Via Pietro Bucci, Rende</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>25</volume>
      <fpage>25</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>In many fields, such as functional genomics or finance, data analysis and predictive modeling are challenging because of the curse of dimensionality and noisy data. In these cases, effective feature selection algorithms, based on Machine and Deep Learning, can improve the identification of important features, leading to more tractable problems in terms of dimensionality. This paper proposes a novel algorithm to perform feature selection on high-dimensional data, which exploits the reconstruction capabilities of autoencoders and an ad-hoc defined Explainable Artificial Intelligence-based score to select the most informative features for prediction. We benchmark the approach on several state-of-the-art datasets and against algorithms previously proposed in the literature, showcasing its effectiveness.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Genomics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the field of functional genomics, starting from the results of the Human Genome Project, the evolution
of sequencing techniques provides large volumes of data for each single patient by taking advantage of
high-throughput and next-generation sequencing, i.e., a set of time- and cost-effective techniques for
sequencing DNA and RNA. By means of these, it is possible to measure the expression of thousands of
genes for each individual and hence to collect quantitative gene expression profiles (GEP) to be used for
research and clinical purposes. Although GEP datasets represent a valuable source of information in
healthcare (they are indeed used for diagnosis, prevention, and precision medicine), their analysis
is challenging for three main reasons. The first is the curse of dimensionality: a genomics
dataset typically consists of a very large number of features (genes) and a small number of samples
(patients). The second concerns imbalanced classes: in the analysis of different groups of
patients, genomics data are often stratified into classes according to different pathologies, and in most cases
there is a significant difference between the number of instances in each class. Finally, sequencing data
are typically collected from multiple sources, different laboratories, and sequencing tools. This results
in noisy datasets which are difficult to analyze [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In recent years, Machine Learning (ML) and Deep Learning (DL) have been widely adopted in this field,
providing breakthrough results and meaningful insights into the relationship between genomics and
cancer [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Although very promising, DL models are in general not immediately interpretable,
meaning that it is difficult to understand the causal relationship between the inputs and the outcomes.
This is an even more severe problem in the bioinformatics domain, where it is crucial to understand, for
example in the case of genomics, how the expression of a gene can affect the progression of oncological
patients.
      </p>
      <p>We propose a new algorithm, based on DL and Explainable Artificial Intelligence (XAI), for genomics,
whose aim is threefold: first, to select the most meaningful genes for a regression/classification problem;
second, to provide a more accurate prediction model; third, to quantify and evaluate the effect of features
on the predictions, through XAI. We used our algorithm for the GEP analysis of acute lymphoblastic
leukemia (ALL) patients, identifying a meaningful subset of genes for the disease prognosis. The
remainder of the paper is organized as follows. First, we review the most relevant related works in
Section 2, and we then give a formal definition of the algorithm in Section 3. The application of the
algorithm to the ALL study and the results obtained are discussed in Section 4. Finally, directions for
further research are proposed in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        A number of recent studies propose and evaluate new approaches for feature selection (FS) on GEP
datasets for cancer diagnosis and prognosis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Such methodologies mainly aim at selecting the most
informative genes, which are able to characterize classes and identify groups of patients. In this context,
the adoption of XAI methods has started to gain momentum for interpretability purposes as well as
to enhance FS [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. A widely used approach to overcome the curse of dimensionality problem is
to perform dimensionality reduction using AEs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. While this has been proven to be effective, the
encoding is typically a non-linear projection of the variables into a lower-dimensional space, which
makes it difficult to provide proper interpretations of the results. In this work we propose a novel
approach, which uses AEs for selecting the most informative genes without any change to the original
feature space, hence enhancing the explainability of the results while still exploiting the representation
abilities of AEs.
      </p>
      <p>
        We moreover use an ad-hoc defined XAI-based score in order to iteratively select the features by taking
advantage of the SHapley Additive exPlanations method (SHAP) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a cooperative game theory-based
approach for computing the Shapley values. Such values measure, locally (at the sample level), the
contribution of each feature to the predictions of an ML model. In particular, for a given sample x and
a set of features F, the contribution φ_i of a feature i ∈ F is defined as:

φ_i = Σ_{S ⊆ F∖{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] · [ f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) ]   (1)

with φ_i ∈ ℝ, and where f_S and x_S denote the prediction model and the sample restricted to the
subset of features S, i.e., without the i-th one. In words, SHAP computes the contribution of a
feature by comparing the model predictions obtained with and without that feature, for all the possible
subsets S. Since the computation of Equation 1 is inefficient when the prediction model is a NN (the NN
should be re-trained for each of the 2^{|F|} combinations of features), the authors demonstrate in
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] that Shapley values can be computed by solving a weighted linear least squares regression with a
proper Shapley kernel. Although we used this alternative method, we omit the details and focus
only on the definition of the Shapley values.
      </p>
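For illustration, the Shapley values of Equation 1 can be computed exactly for a small model by brute-force enumeration of the feature subsets. The sketch below is ours (the function name and the zero baseline are assumptions, not from the paper); it is exponential in the number of features and is meant only to make the definition concrete:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values per Eq. (1): phi_i is the weighted average of the
    marginal contribution of feature i over all subsets S of the other features.
    Features outside a subset are replaced by their baseline value."""
    m = len(x)

    def masked(idx):
        # evaluate the model with only the features in idx taken from x
        z = list(baseline)
        for j in idx:
            z[j] = x[j]
        return f(z)

    phis = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        phi = 0.0
        for size in range(m):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(m - size - 1) / factorial(m)
                phi += weight * (masked(S + (i,)) - masked(S))
        phis.append(phi)
    return phis

# Toy linear model: the Shapley value of feature i recovers w_i * x_i
w = [2.0, -1.0, 0.5]
f = lambda z: sum(wi * zi for wi, zi in zip(w, z))
phi = shapley_values(f, x=[1.0, 1.0, 2.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # approximately [2.0, -1.0, 1.0]
```

For a linear model and a zero baseline, each marginal contribution equals w_i·x_i, so the subset weights (which sum to one) leave the value unchanged; this is a standard sanity check for a Shapley implementation.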
    </sec>
    <sec id="sec-3">
      <title>3. The Algorithm</title>
      <p>The proposed algorithm is based on two main ideas: (1) we use a clustered correlation matrix in order to
group features that enclose similar patterns, and we then filter the redundant information of each group
by using AEs. In contrast with previous works, in which AEs are used for dimensionality reduction,
we still work at the level of the original features. In particular, we take advantage of the encoding and
reconstruction abilities of AEs, assuming that the more accurate the reconstruction of a feature is,
the more that feature is representative of the cluster it belongs to. We hence provide a more tractable
dataset in terms of dimensionality, without loss of representativeness, by filtering redundant features;
(2) we train NNs and we iteratively select the most meaningful features using a new ad-hoc defined
SHAP score. We repeat the analysis by removing, at each iteration, the previously selected features. We
eventually use the set of selected features (from all the iterations) to train and explain a final model.
Figure 1 shows the main algorithm phases.</p>
      <p>[Figure 1: overview of the algorithm phases: Correlation Clustering of the patients' features into clusters k_1 ... k_q; Autoencoder (AE) Filtering with one AE per cluster (AE_1 ... AE_q); NN Training; Explainability-based Selection of the most meaningful genes according to an ad-hoc defined SHAP-based score. Selected genes are collected and removed from k_1 ... k_q, iterating while the number of genes is at least N.]</p>
      <sec id="sec-3-1">
        <title>3.1. Formal Setting</title>
        <p>Let D = {X, Y} be a dataset such that X ∈ ℝ^{n×m} is the matrix of inputs, and Y ∈ ℝ^{n×c} is the matrix
of the corresponding labels. Let us further assume m ≫ n, meaning that the dataset is characterized by
a far larger set of features with respect to the number of samples.</p>
        <p>As a novel contribution, we introduce a new impact score which, by means of the SHAP local
explanations, measures the global impact of each feature on the model predictions. We hence associate to
each feature (column) i of X, used to train a model N, a pair (r_{i,N}, I_{i,N}), where r_{i,N} is the correlation
between the i-th column of X and its Shapley values {φ_{1,i}, ..., φ_{n,i}}, and I_{i,N} is defined as follows:

I_{i,N} = Σ_{j=1}^{n} |φ_{j,i}| · φ_{j,i}² / ( Σ_{h=1}^{m} Σ_{j=1}^{n} |φ_{j,h}| · φ_{j,h}² )   (2)

With r_{i,N} and I_{i,N} we want to emphasize how and how much, respectively, a feature globally affects the
predictions of N.</p>
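A minimal sketch of how the pair (r_{i,N}, I_{i,N}) could be computed from a matrix of SHAP values, assuming NumPy and our reading of Equation 2 (the function name impact_scores is illustrative, not from the paper):

```python
import numpy as np

def impact_scores(X, Phi):
    """Per-feature impact scores from a matrix Phi (n samples x m features)
    of SHAP values: r[i] is the correlation between feature column X[:, i]
    and its SHAP values, and I[i] is the normalized intensity of Eq. (2),
    so that the intensities sum to one across features."""
    m = Phi.shape[1]
    r = np.array([np.corrcoef(X[:, i], Phi[:, i])[0, 1] for i in range(m)])
    w = np.abs(Phi) * Phi ** 2      # |phi_{j,i}| * phi_{j,i}^2, as in Eq. (2)
    I = w.sum(axis=0) / w.sum()
    return r, I
```

By construction I is a distribution over the features (non-negative, summing to one), so a feature's intensity can be compared directly against the mean intensity 1/m.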
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Algorithm</title>
        <p>For the sake of clarity, we introduce our algorithm by first defining a set of sub-procedures. The first one
(Algorithm 1) computes the pairwise correlation matrix C ∈ ℝ^{m×m} between the features (columns) of
a generic real-valued matrix X. It then clusters C in order to return a set K = {k_1, ..., k_q} such that,
for each i = 1, ..., q, k_i is a set of indexes, i.e., a partition (cluster) of the columns of X.
The second sub-procedure, defined in Algorithm 2, trains an AE for each cluster by using the transpose
of the input matrix X, meaning that, for the AE model, each feature represents a sample and vice
versa. The rationale here is that we assume the best-reconstructed feature (over the samples) to be
the most representative of the cluster it belongs to. We denote by X_k ∈ ℝ^{n×|k|} the matrix including
only the columns of X whose indexes are in k. The bestReconstructed function provides the column index
of X associated with the best-reconstructed feature. Finally, the sub-procedure returns a set S of q
indexes, one for each cluster.</p>
        <sec id="sec-3-2-1">
          <title>Algorithm 1 Corr. &amp; Clustering</title>
          <p>function CorrClustering(X)
C ← corr(X)
K ← clustering(C)
return K
end function</p>
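Section 4.1 instantiates the clustering step of Algorithm 1 by thresholding the correlation matrix and taking connected components; a minimal sketch under that reading (the threshold value and function name are assumptions) could look as follows:

```python
import numpy as np

def corr_clustering(X, threshold=0.8):
    """Sketch of Algorithm 1: build the pairwise feature correlation matrix,
    keep edges whose absolute correlation exceeds the threshold, and return
    the connected components as clusters of column indexes."""
    m = X.shape[1]
    C = np.abs(np.corrcoef(X, rowvar=False))   # m x m correlation matrix
    adj = C > threshold                        # adjacency between features

    # connected components via iterative flood fill
    labels = -np.ones(m, dtype=int)
    cluster = 0
    for start in range(m):
        if labels[start] != -1:
            continue
        frontier = {start}
        while frontier:
            i = frontier.pop()
            labels[i] = cluster
            frontier |= {j for j in np.nonzero(adj[i])[0] if labels[j] == -1}
        cluster += 1
    return [np.nonzero(labels == k)[0] for k in range(cluster)]
```

This matches the paper's stated design of grouping genes without fixing the number of clusters in advance: the number of components emerges from the threshold.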
        </sec>
        <sec id="sec-3-2-2">
          <title>Algorithm 2 AE Filtering</title>
          <p>function AEFiltering(X, K)
S ← ∅
for k ∈ K do
AE_k ← trainAE(X_k^T)
S ← S ∪ bestReconstructed(AE_k, X_k^T)
end for
return S
end function
The last sub-procedure, reported in Algorithm 3, takes as input the data, a matrix of Shapley values
Φ, and a threshold t ∈ [0, 1]. It first computes the correlation between each column
of X and the corresponding column of Φ. Subsequently, it computes the intensity of each feature
following the definition of Equation 2. It then selects the column indexes according to t and the mean
intensity, to finally provide a set S̃ of column indexes for X.</p>
          <p>Algorithm 3 Selection
function select(Φ, ,  )
 ← (Φ, )
 ←← |1|∑︀∈(Φ, )
˜ ← {  | | | &gt;  ∧   &gt; , ∀  ∈ , ∀  ∈ }
return ˜
end function</p>
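Under the same assumptions, the select procedure of Algorithm 3 might be sketched as follows (the function name and the use of our reading of Equation 2 as intensity are illustrative):

```python
import numpy as np

def select(Phi, X, t=0.85):
    """Sketch of Algorithm 3: keep the features whose correlation with their
    own SHAP values exceeds t in absolute value AND whose intensity (Eq. 2)
    is above the mean intensity. Returns the selected column indexes."""
    m = Phi.shape[1]
    r = np.array([np.corrcoef(X[:, i], Phi[:, i])[0, 1] for i in range(m)])
    w = np.abs(Phi) * Phi ** 2
    I = w.sum(axis=0) / w.sum()                  # intensities sum to one
    keep = np.logical_and(np.abs(r) > t, I > I.mean())
    return np.nonzero(keep)[0]
```

A feature passes only if it is both strongly aligned with its SHAP values (the "how") and carries an above-average share of the total attribution mass (the "how much").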
          <p>The main procedure is described by Algorithm 4. After clustering the correlation matrix, it selects
a set of meaningful feature indexes to be added to G. It then removes the selected indexes from their
corresponding clusters in K and proceeds by repeating the analysis. Here we denote by X_S ∈ ℝ^{n×|S|}
(and accordingly X_G) the matrix including the columns of X whose indexes are in S (resp. G), and by N_S (and
accordingly N_G) a NN trained on {X_S, Y}. The iterative analysis stops when N ∈ ℕ features have been
selected or on reaching a maximum number of iterations. The algorithm eventually trains and explains a
final NN using the set G.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. A Use Case: Leukemia-ALLAML</title>
      <sec id="sec-4-1">
        <title>4.1. Materials and Methods</title>
        <p>
          To validate the effectiveness of the method, we first tested it on a synthetic toy dataset. This allowed us
to verify that the method correctly selected the centroid features for each cluster, ensuring that the
most representative features were identified.
        </p>
        <p>
Algorithm 4
Require: D, t
K ← CorrClustering(X)
G ← ∅
while |G| &lt; N ∨ not maxIterations do
S ← AEFiltering(X, K)
X_S, Y ← dataBalancing(X_S, Y)
N_S ← findModel(X_S, Y)   ◁ Model Selection &amp; Training
Φ ← Shap(N_S, X_S)   ◁ matrix of Shapley values Φ ∈ ℝ^{n×|S|}
S̃ ← select(Φ, X_S)
G ← G ∪ { s ∈ S | s ∈ S̃ }
K ← K ∖ G   ◁ remove selected indexes from their corresponding clusters
end while
N_G ← findModel(X_G, Y)
Φ ← Shap(N_G, X_G)
G* ← select(Φ, X_G)
        </p>
        <p>
We applied our algorithm for analyzing the GEP of leukemia patients from the Kent Ridge biomedical data repository [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The leukemia dataset consists of
two classes of acute leukemia: acute lymphoblastic leukemia (ALL), arising from lymphoid
precursors, and acute myeloid leukemia (AML), arising from myeloid precursors. There are 72 bone
marrow samples in the dataset, with 47 ALL and 25 AML cases, each containing 7129 gene probes. We
used the proposed algorithm for training a NN to solve such a classification problem as well as to identify
a set of meaningful genes over the whole set of 7129. We additionally provide insight into the prognostic
power of such genes. The genes were initially clustered in groups based on their feature correlations.
First, we computed the correlation matrix of the features, capturing the pairwise correlations between
genes. We then applied a correlation threshold to define significant relationships between features.
Specifically, if the absolute value of the correlation between two genes exceeded a predefined threshold,
we considered them to be correlated. We then identified clusters of correlated genes by detecting the
connected components. Each connected component represents a group of genes that are strongly
correlated with each other. This method allowed us to group the genes into distinct clusters, capturing
the structure of the data without relying on predefined assumptions about the number of clusters. These
clusters were then used for further analysis and filtering. The AE filtering selects genes, and we further
applied a statistical filter in order to select 50 genes. After re-balancing the classes with the Synthetic
Minority Over-sampling Technique (SMOTE), we performed model selection with 10-fold cross-validation
in order to find the best (in terms of binary accuracy on the test set) NN for solving the classification
problem.
        </p>
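The SMOTE re-balancing step can be illustrated with a minimal re-implementation of its core idea (interpolating between minority-class samples and their nearest neighbours); this is a sketch of the idea, not the implementation used in the paper:

```python
import numpy as np

def smote_like(X_min, k=5, n_new=10, rng=None):
    """Minimal sketch of the SMOTE idea: synthesize new minority-class
    samples by linear interpolation between a randomly chosen sample and
    one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]       # skip the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()                        # interpolation coefficient
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Every synthetic sample lies on a segment between two real minority samples, so the oversampled class stays inside the region occupied by the original data.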
        <p>We finally use our SHAP scores (defined in Section 3.1) to select the most meaningful genes, by
setting t = 0.85. After selecting a set of N = 50 genes through the iterations of the algorithm, we use
them to train and explain a final NN.</p>
        <p>
          The algorithm has been implemented using the Python (v3.8.11) programming language. NNs have
been implemented by taking advantage of the Pytorch (v2.4.1) framework. XAI analysis was performed
by means of the SHAP library [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>The overall results are reported in Table 1. In particular, for each iteration of the algorithm, we measured
the accuracy of all the models obtained during cross-validation, for which we report the confidence
interval. As we expected, the classification accuracy decreases over the algorithm iterations: the reason
is that the previously chosen features, expected to be the most representative of each cluster, are no
longer considered in the subsequent analysis. An improvement in accuracy is instead reported for the
final step of the algorithm, in which a model is trained using the set of genes selected during each
iteration. The accuracy of the best final model is 100%.</p>
        <p>
          Figure 3 reports, on the left side, a summarized representation of the SHAP values and, on the right side,
the values of correlation and intensity for the most interesting genes found by our algorithm. In this
context, it is important to compare our findings with the works of Al-Azani et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Bennet et al.
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Al-Azani et al. conducted an empirical study utilizing a feature selection technique that combined
Chi-square (ChiS) and Information Gain (IG) methods. Their evaluation of various ensemble-based
learning models, including bagging, random forests, stacking, voting, and boosting, culminated in a
best classification accuracy of 96.88%. This study emphasizes the effectiveness of ensemble methods in
improving model performance.
        </p>
        <p>Conversely, Bennet et al. introduced a hybrid gene selection technique that integrates Support
Vector Machine-Recursive Feature Elimination (SVM-RFE) with the Based Bayes Error Filter (BBF).
Their approach involved ranking attributes with SVM-RFE and subsequently using BBF to eliminate
redundant attributes, followed by classification with the SVM algorithm. Their efforts yielded an
impressive classification accuracy of 97.2% on the Leukaemia dataset, underscoring the power of hybrid
techniques in attribute selection.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The algorithm proposed in this work can be used as a valuable tool in genomics to identify protective
(or otherwise) sets of genes for a disease, suggesting potential pathways for further medical investigation.
A natural direction for future development is to perform a large-scale assessment of the algorithm's
performance, using state-of-the-art benchmark GEP datasets.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research reported in the paper was partially supported by the PNRR projects “FAIR (PE00000013)
- Spoke 9” and “Tech4You (ECS00000009) - Spoke 6”, under the NRRP MUR program funded by
NextGenerationEU, and by the National Plan for NRRP Complementary Investments (PNC, established with
decree-law 6 May 2021, n. 59, converted by law n. 101 of 2021), in the call for the funding of research
initiatives for technologies and innovative trajectories in the health and care sectors (Directorial Decree
n. 931 of 06-06-2022), project n. PNC0000003 - AdvaNced Technologies for Human-centrEd Medicine
(project acronym: ANTHEM).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Koumakis</surname>
          </string-name>
          ,
          <article-title>Deep learning models in genomics; are we there yet?</article-title>
          ,
          <source>Computational and Structural Biotechnology Journal</source>
          <volume>18</volume>
          (
          <year>2020</year>
          )
          <fpage>1466</fpage>
          -
          <lpage>1473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alhenawi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Sayyed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hudaib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mirjalili</surname>
          </string-name>
          ,
          <article-title>Feature selection methods on gene expression microarray data for cancer classification: A systematic review</article-title>
          ,
          <source>Computers in Biology and Medicine</source>
          <volume>140</volume>
          (
          <year>2022</year>
          )
          <fpage>105051</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bruno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Calimeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Kitanidis</surname>
          </string-name>
          , E. De Momi,
          <article-title>Data reduction and data visualization for automatic diagnosis using gene expression and clinical data</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>107</volume>
          (
          <year>2020</year>
          )
          <fpage>101884</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Csicsery</surname>
          </string-name>
          , E. Stasiowski, G. Thouvenin,
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Mather</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cookson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hasty</surname>
          </string-name>
          ,
          <article-title>Genome-scale transcriptional dynamics and environmental biosensing</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>117</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Meena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hasija</surname>
          </string-name>
          ,
          <article-title>Application of explainable artificial intelligence in the identification of squamous cell carcinoma biomarkers</article-title>
          ,
          <source>Computers in Biology and Medicine</source>
          <volume>146</volume>
          (
          <year>2022</year>
          )
          <fpage>105505</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Karim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Beyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <article-title>Onconetexplainer: explainable predictions of cancer types based on gene expression data</article-title>
          ,
          <source>in: 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>415</fpage>
          -
          <lpage>422</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Danaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ghaeini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hendrix</surname>
          </string-name>
          ,
          <article-title>A deep learning approach for cancer detection and relevant gene identification</article-title>
          ,
          <source>in: Pacific symposium on biocomputing 2017</source>
          , World Scientific,
          <year>2017</year>
          , pp.
          <fpage>219</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Kent ridge biomedical data set repository</article-title>
          . School of Computer Engineering, Nanyang Technological University, Singapore,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Al-Azani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. S.</given-names>
            <surname>Alkhnbashi</surname>
          </string-name>
          , E. Ramadan,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alfarraj</surname>
          </string-name>
          ,
          <article-title>Gene expression-based cancer classification for handling the class imbalance problem and curse of dimensionality</article-title>
          ,
          <source>International Journal of Molecular Sciences</source>
          <volume>25</volume>
          (
          <year>2024</year>
          )
          <fpage>2102</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bennet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ganaprakasam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>A hybrid approach for gene selection and classification using support vector machine</article-title>
          .,
          <source>International Arab Journal of Information Technology (IAJIT) 12</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>