Gene Selection for Cancer Diagnosis via Iterative Graph Clustering-based Approach Mehrdad Rostami1 , Mourad Oussalah1,2 1 Centre of Machine Vision and Signal Processing, Faculty of Information Technology, University of Oulu, Oulu, Finland 2 Research Unit of Medical Imaging, Physics, and Technology, Faculty of Medicine, University of Oulu, Finland Abstract The development of microarray devices has led to the accumulation of DNA microarray datasets. Through this technological advance, physicians are able to examine various aspects of gene expression for cancer diagnosis. As data accumulation rapidly increases, the task of machine learning faces considerable challenges for high-dimensional DNA microarray data classification. Gene selection is a popular and powerful approach to deal with these high-dimensional cancer data. In this paper, a novel graph clustering-based gene selection approach is developed. The developed approach has two main objectives, consisting of relevance maximization and redundancy minimization of the selected genes. In this method, in each iteration, one subgraph is extracted, and then among the existing genes in this cluster, appropriate genes are selected using filter-based measure. The reported results on five cancer datasets indicate that the developed gene selection approach can improve the accuracy of cancer diagnosis. Keywords Cancer diagnosis, Microarray data classification, Gene selection, Feature selection, Graph clustering 1. Introduction Worldwide, cancer remains the leading cause of death for both men and women. Cancer diagnosis is important to increase survival chances since early treatment is available to the patients, provided successful early diagnosis. As a result, much research has been conducted to develop better strategies to prevent, diagnose, and treat this disease to decrease the mortality [1, 2]. Using the emerging DNA microarray data, in an experiment, multiple aspects of gene expression can be investigated to diagnose or detect different kinds of cancer. Existing machine learning and pattern recognition approaches to handle large volume of DNA microarray have been challenged by the high-dimensional structure of a such data [3]. The high data volume makes many genes irrelevant or redundant to cancer diagnosis or classification task. Gene selection is a powerful and efficient technique in microarray data to deal with this challenge [3]. By using this strategy, the training process can be simplified, which, in turn, enhances machine learning performance, and, thereby, the general diagnosis [4]. The main goal of our study is to develop an efficient clustering-based approach to choose a subset of primary genes where one cluster of genes is chosen at each iteration. Then, among the existing genes in this cluster, appropriate ones are selected using a filter-based measure. AIxIA 2021 SMARTERCARE Workshop, November 29, 2021, Milan, IT Envelope-Open Mehrdad.Rostami@oulu.fi (M. Rostami); Mourad.Oussalah@oulu.fi (M. Oussalah) Orcid 0000-0001-5710-217X (M. Rostami); 0000-0002-4422-8723 (M. Oussalah) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 1 Mehrdad Rostami et al. CEUR Workshop Proceedings 1–6 This process of finding a cluster and selecting a candidate gene from each cluster is iterated until all clusters are selected. We expect that our model, in addition to selecting genes with the least amount of redundancy, will also maximize the relevancy of selected genes. The remainder of this paper is organized as below: Section 2 reviews some related works. The developed prediction method is presented in Section 3. The experimental results are reported in section 4 and finally, section 5 present the conclusion and future works. 2. Related Works A significant challenge in handling microarray data for cancer diagnosis is their high-dimensionality where the number of genes is much greater than the number of patterns [5, 6, 7], which leads to a well-known problem known as “curse of dimensionality”. Gene selection is one popu- lar technique to eliminate irrelevant and/or redundant genes [8]. Previously, gene selection approaches were classified as filter, wrapper, hybrid, and embedded approaches. In the filter techniques, relevant genes are evaluated without a learning model. As a result, these techniques are typically fast [9]. There have been many filter-based approaches for gene selection in cancer diagnosis such as, Filtering and ranking techniques [10], Simplified Swarm Optimization (SSO) [11], Tabu Asexual Genetic Algorithm (TAGA) [12], Feature Clustering and Feature Discretiza- tion assisting gene selection (FCFD) [13], Least Loss (LL) [14], etc. The wrapper gene selection strategies employ a learning model to evaluate the efficiency of the chosen gene subset [15]. In this category, an iterative search algorithm-based process is applied to seek the optimum gene subset, and at each step, a subset of original genes are chosen and with a fitness function determining which genes are the best. Despite the fact that wrapper strategies choose a effective subset of original genes, they are computationally complex and may present challenges within analysis of DNA microarray datasets [16]. Hybrid strategies combine the benefits of both filter and wrapper strategies [16]. In addition, the embedded strategies make use of gene selection in the learning process. 3. Proposed Method Our novel gene selection algorithm is introduced by combining the notions of Graph Clustering with Feature Weighting (GCFW). We can group our algorithm under the category of filter gene selection technique and this algorithm measure relevancy and redundancy notions in its selection mechanism. GCFW consists of two main phases including (1) Gene similarity calculation, (2) Iterative subgraph finding and gene selection. In the reminder of this section the details of these two phases are explained. 3.1. Gene similarity calculation Initially, microarray datasets are represented as a weighted graph. In this demonstration, each gene is indicated by a node and the value of each edge shows by gene similarities. In this graph representation, the Pearson correlation coefficient criterion [17] is used to measure the similarity values between two genes. This similarity measure maps the gene space of a microarray dataset 2 Mehrdad Rostami et al. CEUR Workshop Proceedings 1–6 into a fully weighted and connected graph. To make the graph sparser, the edges with values lower than the threshold 𝜃 are deleted. 𝜃 is an adjustable parameter and takes its value in the range [0 1]. In our experiment 𝜃 is set to 0.6. 3.2. Iterative subgraph finding and gene selection In order to avoid choosing similar genes, initial genes are divided to several groups to to reduce the possibility of selecting redundant genes. Moreover to select a subset of relevant genes, a feature weighting strategy is developed. In the proposed method, in each iteration, a subgraph is identified and then among these present genes in this subgraph, the suitable ones are chosen using gene relevance. The purpose of the subgraph discovery is to divide the genes into clusters so that the genes of each cluster are as similar as possible. Previous gene clustering algorithms suffer from inherent shortcomings, like the need to specify the number of clusters, ignoring the distribution of genes and equal consideration of all genes. To deal with these issues, quite different from existing feature clustering algorithms, a fast algorithm for subgraph discovery [18] is employed for gene clustering. This algorithm yields grouping faster than previous gene clustering method. Moreover, in this proposed method, Fisher Score weighting technique is used to rank the genes of each subgraph and select the relevant genes for representing the genes of that cluster. Therefore, it can claim that the genes selected satisfy both qualities: maximum relevancy and minimal redundancy. In other words, the use of subgraph discovery criteria avoids to choose the redundant genes and utilization of the notion of feature weighting results in the selection of appropriate genes.Then, from each subgraph, the relevant gene is selected by performing the feature weighting technique. The purpose of feature weighting is to select a representative gene that is most relevant to the cancer diagnosis task. Fisher Score (FS) is a supervised univariate filter which gives higher values to those genes that have a separation property. The Fisher Score of gene k is calculated as below: 2 ∑𝑣∈𝑉 𝑛𝑣 (𝐺𝑘𝑣̄ − 𝐺𝑘̄ ) 𝐹 𝑆 (𝐺𝑘 ) = (1) 𝑛𝑣 (𝜎𝑣 (𝐺𝑘 ))2 where 𝐺𝑘̄ is the average value of all the patterns related to the gene 𝐺𝑘 , 𝑉 is a set of all classes in a dataset, 𝑛𝑣 is the size of samples on the class v, and 𝜎𝑣 (𝐺𝑘 ) and 𝐺𝑘𝑣̄ indicates the variance and mean of gene 𝐺𝑘 on class v, respectively. According to the algorithm design, a subgraph of genes is extracted in each iteration using a repetitive process, and then the appropriate genes are selected from the existing genes in this subgraph using a feature-weighting strategy. This process is repeated again for all remaining genes after deleting the genes present in the extracted subgraph. 4. Experimental results To investigate the performance of our cancer diagnosis algorithm, various experiments are performed.The efficiency of our algorithm is compared with three new methods of filter-based 3 Mehrdad Rostami et al. CEUR Workshop Proceedings 1–6 Table 1 Characteristics of the used microarray datasets Dataset # Genes # Classes # Patterns Colon 2000 2 62 SRBCT 2328 4 83 Leukemia 7129 2 72 Prostate Tumor 10509 2 102 Lung Cancer 12600 5 203 Figure 1: Average accuracy of different methods on (a) SVM classifiers (b) AB classifiers. cancer diagnosis: Tabu Asexual Genetic Algorithm (TAGA) [12], Feature Clustering and Fea- ture Discretization assisting gene selection (FCFD) [13], Least Loss (LL) [14]. Moreover, the experiments in this study use a variety of datasets with different properties to demonstrate the effectiveness of the developed algorithm. These microarray data include of Colon, Leukemia, SRBCT, Prostate Tumor, and Lung Cancer [6]. The primary characteristics of these datasets are detailed in Table 1. Additionally, to assess the flexibility of the proposed algorithm on different classifiers, we also examined the performance of two frequent used classifiers, including Support Vector Machine (SVM) and AdaBoost (AB). In our experiments, the efficiency of our algorithm is measured using the mentioned classifiers. Figure 1 summarizes the average classification accuracy over ten separate and autonomous runs of the different gene selection algorithms. The reported results indicate that in all cancer datasets, the developed gene selection performs better than those of other alternative approaches. For example, for the Colon dataset on SVM classifier, the proposed algorithm obtained a 88.91% accuracy while for TAGA [12], FCFD [13], LL [14], this value is 81.52%, 87.82% and 87.24%, cor- respondingly. Furthermore, the reported results for AB classifier were similar to SVM classifier, and in all cases the developed algorithm was more precise than the other compared algorithms. 4 Mehrdad Rostami et al. CEUR Workshop Proceedings 1–6 5. Conclusion In this paper, an efficient dimensionality reduction algorithm in cancer diagnosis has been de- veloped utilizing the subgraph discovery and feature weighting. The main aim of our algorithm is to choose a subset of appropriate and non-redundant genes that are most closely associated to the target class of microarray data classification. The proposed algorithm has been compared to the recent gene selection algorithms on the cancer microarray datasets. The experimental results show that our cancer diagnosis algorithm gained the highest performance. In future work, our goals are (1) integrate our proposed gene selection approach along with deep learning techniques for accurate cancer diagnosis, and (2) study explainable artificial intelligence in detail to see how explainable artificial intelligence can improve further interpretability and transparency of diagnosis. Acknowledgments This project is supported by the Academy of Finland Profi5 DigiHealth project, which is gratefully acknowledged. References [1] S. Yang, H. Xiao, L. Cao, Recent advances in heat shock proteins in cancer diagnosis, prognosis, metabolism and treatment, Biomedicine & Pharmacotherapy 142 (2021) 112074. [2] R. Daneshjou, B. He, D. Ouyang, J. Y. Zou, How to evaluate deep learning for cancer diagnostics – factors and recommendations, Biochimica et Biophysica Acta (BBA) - Reviews on Cancer 1875 (2021) 188515. doi:https://doi.org/10.1016/j.bbcan.2021.188515 . [3] Y. Liu, F. Nie, Q. Gao, X. Gao, J. Han, L. Shao, Flexible unsupervised feature extraction for image classification, Neural Networks 115 (2019) 65–71. doi:https://doi.org/10.1016/ j.neunet.2019.03.008 . [4] S. Forouzandeh, K. Berahmand, M. Rostami, Presentation of a recommender system with ensemble learning and graph embedding: a case on movielens, Multimedia Tools and Applications 80 (2021) 7805–7832. URL: https://doi.org/10.1007/s11042-020-09949-5. doi:10.1007/s11042- 020- 09949- 5 . [5] M. Rostami, K. Berahmand, E. Nasiri, S. Forouzandeh, Review of swarm intelligence- based feature selection methods, Engineering Applications of Artificial Intelligence 100 (2021) 104210. URL: https://www.sciencedirect.com/science/article/pii/S0952197621000579. doi:https://doi.org/10.1016/j.engappai.2021.104210 . [6] M. Rostami, S. Forouzandeh, K. Berahmand, M. Soltani, Integration of multi-objective pso based feature selection and node centrality for medical datasets, Genomics 112 (2020) 4370–4384. URL: https://www.sciencedirect.com/science/article/pii/S088875432030224X. doi:https://doi.org/10.1016/j.ygeno.2020.07.027 . [7] M. Rostami, K. Berahmand, S. Forouzandeh, A novel method of constrained feature selec- tion by the measurement of pairwise constraints uncertainty, Journal of Big Data 7 (2020) 83. URL: https://doi.org/10.1186/s40537-020-00352-3. doi:10.1186/s40537- 020- 00352- 3 . 5 Mehrdad Rostami et al. CEUR Workshop Proceedings 1–6 [8] M. Rostami, K. Berahmand, S. Forouzandeh, A novel community detection based genetic algorithm for feature selection, Journal of Big Data 8 (2021) 2. URL: https://doi.org/10. 1186/s40537-020-00398-3. doi:10.1186/s40537- 020- 00398- 3 . [9] M. Labani, P. Moradi, F. Ahmadizar, M. Jalili, A novel multivariate filter method for feature selection in text classification problems, Engineering Applications of Artifi- cial Intelligence 70 (2018) 25–37. URL: https://www.sciencedirect.com/science/article/ pii/S0952197617303172. doi:https://doi.org/10.1016/j.engappai.2017.12.014 . [10] W. De Smet, K. De Loof, P. De Vos, P. Dawyndt, B. De Baets, Filtering and ranking techniques for automated selection of high-quality 16s rrna gene sequences, Systematic and Applied Microbiology 36 (2013) 549–559. URL: https://www.sciencedirect.com/science/ article/pii/S0723202013001495. doi:https://doi.org/10.1016/j.syapm.2013.09.001 . [11] C.-M. Lai, H.-P. Huang, A gene selection algorithm using simplified swarm optimization with multi-filter ensemble technique, Applied Soft Computing 100 (2021) 106994. URL: https://www.sciencedirect.com/science/article/pii/S1568494620309339. doi:https://doi. org/10.1016/j.asoc.2020.106994 . [12] S. Salesi, G. Cosma, M. Mavrovouniotis, Taga: Tabu asexual genetic algorithm em- bedded in a filter/filter feature selection approach for high-dimensional data, Informa- tion Sciences 565 (2021) 105–127. URL: https://www.sciencedirect.com/science/article/pii/ S0020025521000475. doi:https://doi.org/10.1016/j.ins.2021.01.020 . [13] H.-Y. Lin, Feature clustering and feature discretization assisting gene selection for molecular classification using fuzzy c-means and expectation–maximization algorithm, The Journal of Supercomputing 77 (2021) 5381–5397. URL: https://doi.org/10.1007/s11227-020-03480-y. doi:10.1007/s11227- 020- 03480- y . [14] F. Thabtah, F. Kamalov, S. Hammoud, S. R. Shahamiri, Least loss: A simplified fil- ter method for feature selection, Information Sciences 534 (2020) 1–15. URL: https: //www.sciencedirect.com/science/article/pii/S0020025520304242. doi:https://doi.org/ 10.1016/j.ins.2020.05.017 . [15] M. Rostami, P. Moradi, A clustering based genetic algorithm for feature selection, in: 2014 6th Conference on Information and Knowledge Technology (IKT), 2014, pp. 112–116. doi:10.1109/IKT.2014.7030343 . [16] P. García-Díaz, I. Sánchez-Berriel, J. A. Martínez-Rojas, A. M. Diez-Pascual, Unsupervised feature selection algorithm for multiclass cancer classification of gene expression rna-seq data, Genomics 112 (2020) 1916–1925. URL: https://www.sciencedirect.com/science/article/ pii/S0888754319304811. doi:https://doi.org/10.1016/j.ygeno.2019.11.004 . [17] M. M. Kabir, M. Shahjahan, K. Murase, A new local search based hybrid genetic algorithm for feature selection, Neurocomputing 74 (2011) 2914–2928. URL: https: //www.sciencedirect.com/science/article/pii/S0925231211002748. doi:https://doi.org/ 10.1016/j.neucom.2011.03.034 . [18] M. Bressan, Faster algorithms for counting subgraphs in sparse graphs, Algorith- mica 83 (2021) 2578–2605. URL: https://doi.org/10.1007/s00453-021-00811-0. doi:10.1007/ s00453- 021- 00811- 0 . 6