Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Stacked Sparse Autoencoder for Unsupervised Features Learning in PanCancer miRNA Cancer Classification

1st Imene Zenbout
IFA department, NTIC faculty, Constantine 2 University
CRBT, Constantine, Algeria
imene.zenbout@univ-constantine2.dz

2nd Abdelkrim Bouramoul
IFA department, NTIC faculty, Constantine 2 University
CERIST Misc laboratory, Constantine, Algeria
abdelkrim.bouramoul@univ-constantine2.dz

3rd Souham Meshoul
Princess Nourah bint Abderahmen University
Riyadh, Saudi Arabia
sbmeshoul@pnu.edu.sa

Abstract—Recent progress in cancer diagnosis is driven by genomic data analysis. miRNA plays an important role as a cancer biomarker, moving cancer diagnosis and therapy towards personalized medicine, with the ultimate goal of improving survival rates and disease prevention. The recent explosion in genomic data generation has motivated the use of miRNA to enhance diagnosis, prognosis and treatment. In this work we explore the integrated Atlas PanCancer miRNA profiles using deep feature learning based on an unsupervised Stacked Sparse AutoEncoder (SSAE). The proposed SSAE model learns a feature representation from the input data. The consistency of the learned features has been tested by classifying samples according to 31 cancer types. The model's performance has been compared to state-of-the-art unsupervised feature learning models. The obtained results exhibit the competitive and promising performance of our model, with an accuracy rate of about 95%.

Index Terms—Deep learning, Bioinformatics, features learning, Sparse autoencoders, miRNA, PanCancer.

I. INTRODUCTION

The recent and tremendous advances in high-throughput sequencing technologies [1] have fostered the role of genomic data across the whole transcriptome as a key answer to different biology-related questions, particularly in disease genetics. With this new availability and transparency of genomic and genetic data, the role of miRNA has moved from that of noisy particles to highly engaged genomic instances in gene regulation and post-protein function. This has led to the direct involvement of miRNA in the occurrence or the suppression of cancer [2].

microRNAs (miRNAs) are classified as non-coding regulatory genes [3] that can be found in small fragments of non-coding RNA regions (about 21-23 nucleotides) [3], [4]. Since the discovery of miRNA in 1993 by R.C. Lee [5], the generation of miRNA data using high-throughput technologies [6], [7] to explore the direct role of miRNA in cancer diagnosis and gene impact has become intensive. The particularity of miRNA profiles is their ability to serve as a direct tool in cancer analysis, therapy and post treatment [8], which represents the main motivation of this work. miRNA data share the same issue as gene expression data, namely a very small sample size with regard to the high dimensionality of the profiles; i.e., some profiles are irrelevant to cancer diagnosis and related decisions, compared to the low number of patient samples. Obviously, this lends itself to a dimensionality reduction problem, in which it is required to extract a miRNA signature representation that can serve as a relevant predictor in cancer diagnosis.

In this work we propose a deep unsupervised feature learning model, based on stacking three sparse autoencoders, to learn new features from the initial noisy miRNA profile inputs. The features learned through the different abstraction levels have been used to train classifiers that predict the cancer type of a specific sample according to 31 different cancer types. The proposed unsupervised and supervised models have been trained on the Atlas PanCancer [9] data set. The particularity of this data set is that it combines different cancer types. This may help us draw information from the well-explored cancer types, which have a large number of samples and/or a high correlation between the different miRNA profiles, and apply this information to classify, or understand, the cancer types with a poor exploration rate.
The feature learning model has been compared to some of the best known unsupervised feature learning and dimensionality reduction models; here we used principal component analysis (PCA) and kernel principal component analysis (KPCA). The rest of the paper is organized as follows: a literature review is given in section II; section III is devoted to a brief introduction to sparse autoencoders; section IV describes the data set and the preprocessing steps; our proposal is presented in section V, along with the set of experimental results and discussion.

II. MIRNA CANCER CLASSIFICATION

Recently, the exploration of the role of non-coding regions in cancer diagnosis and therapy has been attracting a large community of scientists, and the analysis of miRNA data sets using statistical and machine learning methods has become one of the trending problems in bioinformatics [3]. In cancer diagnosis and classification, we cite the work of J. Lu et al. [10], where the authors analysed mammalian miRNA using k-nearest neighbors and a probabilistic neural network algorithm. Kotlarchyk et al. [11] used an ensemble methodology to classify different cancer types based on miRNA profiles. A statistical support vector machine / k-nearest neighbors approach was proposed by D. Ting-ting et al. [12], where t-statistics were used to select relevant miRNA features and a combination of kNN and SVM served as classifiers to distinguish between positive and negative samples in different cancer type data sets. For multiclass cancer classification, P. Yongjun [13] used a subset-based ensemble method for feature selection, generating multiple miRNA subsets based on the correlation among miRNAs, using classifiers to learn valuable knowledge from each subset, and finally combining the results of the classifiers by averaging probabilities. A fuzzy normalization based approach was proposed by M. Anidha et al. [14], where the authors used relevant information gain and F-score to select the most important features in cancer diagnosis, yet in that work the experiments covered binary classification tasks only. A web advisor consisting of semi-supervised classifiers, with Pearson correlation, Kappa statistics and recursive feature elimination for selecting the best miRNA profiles, was built by N. Cheerla et al. [8] to predict cancer type and treatment recommendations based on the Atlas PanCancer data set. In paper [15], the authors used deep belief nets and active learning to apply multi-level gene/miRNA feature selection, to visualize the impact between genes and miRNAs, and to select the most discriminating miRNA profiles; the paper tested the performance of the proposed approach in classifying 3 cancer types. L. Fu et al. [16] used stacked autoencoders to enhance cancer diagnosis and treatment, building both miRNA-miRNA and human disease-disease similarity networks and then using a stacked autoencoder to extract the best feature set from the similarity results, in order to employ it in predicting cancer type. Convolutional neural networks (CNN) were also used by A. L. Rincon et al. [17] to classify the PanCancer data types, where the authors applied an evolutionary algorithm to optimize the architecture of the CNN model.

III. SPARSE AUTOENCODERS

An autoencoder is a symmetric neural network that copies the input of the network to its output, passing through a bottleneck layer that represents the latent feature space (figure 1). A sparse autoencoder is an autoencoder that applies a sparsity penalty σ(h) to the training of the encoder part, in addition to the reconstruction loss [18], giving the training objective

L(x, g(h)) + σ(h)   (1)

where g(h) is the decoder output and h = f(x) is the encoder output. The sparsity penalty deactivates the low-value nodes, which leads to the extraction of a more relevant feature representation. A detailed description of the autoencoder architecture used in this work is given in section V.

Fig. 1: Sparse Autoencoder Architecture

Sparse autoencoders have been used intensively for feature learning problems in different domains: emotion detection and robotics [21], medical imaging [20], and medical diagnosis [22].
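To make Eq. (1) concrete, the following minimal Keras sketch builds a single sparse autoencoder in which the sparsity term σ(h) is realized as an L2 activity regularizer on the code h = f(x). The layer sizes echo this paper (494 inputs, 50 latent units), but the penalty weight and the commented training settings are illustrative assumptions, not the authors' exact code.

```python
# A minimal sparse autoencoder in Keras (TensorFlow backend) following Eq. (1):
# the reconstruction loss L(x, g(h)) is augmented with a sparsity penalty
# sigma(h), realized here as an L2 activity regularizer on the code h = f(x).
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

n_features = 494   # miRNA regions kept after preprocessing (Table I)
n_latent = 50      # bottleneck size used throughout the paper

x_in = Input(shape=(n_features,))
# Encoder f(x): the activity regularizer adds sigma(h) to the training loss.
h = Dense(n_latent, activation='relu',
          activity_regularizer=regularizers.l2(1e-4))(x_in)
# Decoder g(h): reconstructs the input from the sparse code.
x_out = Dense(n_features, activation='relu')(h)

autoencoder = Model(x_in, x_out)
encoder = Model(x_in, h)               # used to extract the learned features
autoencoder.compile(optimizer='adam', loss='mae')

# X is assumed to be the preprocessed (samples x miRNA regions) matrix:
# autoencoder.fit(X, X, epochs=100, batch_size=128, shuffle=True)
# F = encoder.predict(X)               # latent feature representation
```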
IV. DATA COLLECTION AND PREPROCESSING

We collected the Atlas PanCancer [9] miRNA data set used for predicting cancer type from the TCGA data base repository (10/12/2018 18:14). The miRNA data have been generated using next generation sequencing on around 33 types of cancer in US hospitals. The initial miRNA data set consists of more than 10 thousand patients and around 800 short non-coding RNA profiles. We applied a preprocessing to the data matrix by eliminating the miRNA instances with more than 20% zero values; we also used a log transformation to reduce the skewness of the data, and finally data imputation to replace the missing values. Afterwards, we divided the final data matrix into 70% of samples used to train the supervised model and 30% of samples used to evaluate the performance of the trained classifier. Table I exhibits the data set description before and after preprocessing, and table II illustrates the distribution of samples over the different cancer types.

TABLE I: Data set description before/after preprocessing, and number of training/testing samples

                      Before    After
Number of patients    10824     10783
Number of regions     743       494
Cancer types          31        31
Training samples                7548
Testing samples                 3235

TABLE II: Distribution of samples among cancer types

Cancer type    Number of samples
1- BRCA        1164
2- KIRC        570
3- THCA        569
4- HNSC        565
5- LUAD        555
6- PRAD        544
7- UCEC        542
8- LGG         527
9- LUSC        511
10- OV         486
11- STAD       474
12- SKCM       452
13- COAD       429
14- BLCA       429
15- LIHC       421
16- KIRP       321
17- CESC       311
18- SARC       260
19- ESCA       195
20- LAML       188
21- PCPG       186
22- PAAD       182
23- READ       155
24- TGCT       138
25- THYM       126
26- KICH       89
27- MESO       87
28- UVM        80
29- ACC        79
30- UCS        56
31- DLBC       47
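The preprocessing steps above can be sketched with pandas and scikit-learn as follows. The file name, the per-region mean imputation and the stratified split are our assumptions; the 20% zero-value threshold, the log transformation and the 70/30 split come from the paper.

```python
# A sketch of the preprocessing pipeline, assuming the raw TCGA PanCancer
# miRNA matrix is loaded as a pandas DataFrame with patients as rows and
# miRNA regions as columns.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('pancancer_mirna.csv', index_col=0)   # hypothetical file name

# 1) Drop miRNA regions with more than 20% zero values (743 -> 494 regions).
zero_ratio = (df == 0).mean(axis=0)
df = df.loc[:, zero_ratio <= 0.20]

# 2) Log-transform to reduce the skewness of the expression values.
df = np.log2(df + 1)

# 3) Impute the remaining missing values (per-region mean, an assumption).
df = df.fillna(df.mean())

# 4) Hold-out split: 70% of the samples to train, 30% to test (Table I);
#    y is assumed to hold the 31 cancer-type labels of the same patients.
# X_train, X_test, y_train, y_test = train_test_split(
#     df.values, y, test_size=0.30, stratify=y, random_state=0)
```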
V. SSAE FEATURES LEARNING

We can denote the tackled problem as a matrix X of dimension N × M, where N represents the number of samples and M represents the set of non-coding regions, and where each x_ij corresponds to the value of miRNA profile j for sample i. The proposed architecture (Figure 2) consists of two phases: a dimensionality reduction phase and a predictive phase. In phase one we used unsupervised feature learning to train a stacked sparse autoencoder (SSAE), in which we pile three sparse autoencoders [SAE1, SAE2, SAE3] such that the input of SAE_i is the output of SAE_(i-1); the particularity of the output of an autoencoder is that it is a reconstruction of the input with less noise. The feature vectors generated by the three autoencoders are concatenated and used to train predictive models. These models are trained with supervised learning to predict the cancer type.

Fig. 2: Stacked Sparse Autoencoder architecture for miRNA based cancer classification

A. First Phase

In this step, we used the SSAE to extract a new feature representation that is more accurate for multi-class cancer diagnosis. The first sparse autoencoder SAE1 takes the feature vector S of the matrix X, of range M, and feeds it to the encoder; in the bottleneck layer a new latent space F1 of range K, where K < M, is generated, and based on this latent space the decoder tries to reconstruct the input S as closely as possible at its output, so that S ≈ S'. The output S' of SAE1 becomes the input of SAE2, and the same steps are followed to generate a latent space F2, with the decoder reconstructing S' at its output S'', where S' ≈ S''. Equally, S'' is the input of SAE3, and the bottleneck of this third sparse autoencoder generates the last latent feature space vector F3. The consistency of each autoencoder and its final architecture settings have been evaluated by computing the reconstruction error loss between the input of the encoder and the output of the decoder for each SAE_i. In our proposal we used the mean absolute error loss function (eq. 2):

mae = (1/n) Σ_{i=1}^{n} |x_i - x'_i| = (1/n) Σ_{i=1}^{n} |e_i|   (2)

The three feature representations [F1, F2, F3] generated by the sparse autoencoders are concatenated into one feature vector F, which is used to train the classifiers.

Zooming in on the architecture of each autoencoder in phase one (table III), we describe it as follows:
* SAE1: We used a deep architecture, in which the encoder consists of two fully connected layers (494 and 250 nodes) with an L2 regularization as sparsity penalty, a latent space layer of 50 nodes that generates the new feature space F1, and a symmetric decoder that reconstructs the encoder input with 250 and 494 nodes in its two layers, respectively.
* SAE2: Equally, we used a deep autoencoder with two fully connected layers of 494 and 150 nodes representing the encoder, on which we applied a sparse L2 regularization penalty, a 50-node bottleneck layer generating the new feature representation F2, and a symmetrical decoder.
* SAE3: In the last step we used the simplest form of sparse autoencoder. Since the data have already been purified from most of the noise by the two previous sparse autoencoders, we need to avoid falling into overfitting or underfitting, where the autoencoder would only copy the input to the output without learning a new feature representation. Our SAE3 is therefore composed of a single fully connected sparse layer representing the encoder (494 nodes), a bottleneck layer of 50 nodes that represents the last feature vector F3, and a 494-node fully connected layer as the mirror decoder.

TABLE III: Stacked sparse autoencoders description

                        SAE1                 SAE2                 SAE3
Architecture (nodes)    494-250-50-250-494   494-150-50-150-494   494-50-494
Epochs                  200                  150                  100
Batch size              180                  150                  130
Activation functions    [ReLU-Softplus]      [ReLU-Softplus]      [ReLU-Softplus]
Regularizer             L2(0.001)            L2(0.0001)           L2(0.00001)
Loss function           mae                  mae                  mae
Reconstruction error    0.56                 0.23                 0.19

To tune the weights of each layer of the autoencoders (table III) we used the ReLU nonlinear function, while the bottleneck layers were tuned using a Softplus activation function. We trained the stacked autoencoder using mini-batch gradient descent and the Adam optimizer, as follows:
1- We trained SAE1 for 200 epochs with a batch size of 180 samples on the initial input data set, which holds the values of the non-coding regions of all the available patients, obtaining an experimental reconstruction loss of 0.56.
2- SAE2 was trained for 150 epochs with a batch size of 150 on the reconstructed input from SAE1; the experimental reconstruction loss after training was 0.32.
3- The output of SAE2 was used to train SAE3 for 100 epochs with a batch size of 130; the reconstruction loss after training was 0.21.

Fig. 3: Training performance of the SSAE across each autoencoder

Figure 3 shows the training process of each autoencoder: SAE1 converged toward its best performance at around 150 epochs, SAE2 stabilized around epoch 125, whereas SAE3 converged rapidly to its best performance around epoch 80. After training the three autoencoders, we extracted the latent space of each one and concatenated the three vectors into the new miRNA feature space used in the second phase.
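Putting the pieces together, the sketch below assembles and trains the three sparse autoencoders of Table III in Keras. The build_sae helper is ours, and attaching the L2 penalty to the bottleneck as an activity regularizer is our reading of the paper, not a detail it specifies.

```python
# A sketch of the three-stage SSAE of Table III: ReLU hidden layers, Softplus
# bottleneck, L2 sparsity penalty, MAE loss (Eq. 2) and the Adam optimizer.
# SAE_i is trained on the reconstruction produced by SAE_(i-1), and the three
# 50-dimensional codes F1, F2, F3 are concatenated into the final features.
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

def build_sae(dims, l2_weight):
    """dims = [input, hidden..., bottleneck], e.g. [494, 250, 50]."""
    x_in = Input(shape=(dims[0],))
    h = x_in
    for units in dims[1:-1]:                       # encoder hidden layer(s)
        h = Dense(units, activation='relu')(h)
    code = Dense(dims[-1], activation='softplus',  # sparse bottleneck
                 activity_regularizer=regularizers.l2(l2_weight))(h)
    d = code
    for units in reversed(dims[1:-1]):             # mirrored decoder
        d = Dense(units, activation='relu')(d)
    x_out = Dense(dims[0], activation='relu')(d)
    ae = Model(x_in, x_out)
    ae.compile(optimizer='adam', loss='mae')       # Eq. (2) as training loss
    return ae, Model(x_in, code)

sae1, enc1 = build_sae([494, 250, 50], 1e-3)       # SAE1: 494-250-50-250-494
sae2, enc2 = build_sae([494, 150, 50], 1e-4)       # SAE2: 494-150-50-150-494
sae3, enc3 = build_sae([494, 50], 1e-5)            # SAE3: 494-50-494

# Training schedule from the paper (epochs / batch size), X preprocessed:
# sae1.fit(X, X, epochs=200, batch_size=180);   X1 = sae1.predict(X)
# sae2.fit(X1, X1, epochs=150, batch_size=150); X2 = sae2.predict(X1)
# sae3.fit(X2, X2, epochs=100, batch_size=130)
# F = np.concatenate([enc1.predict(X), enc2.predict(X1), enc3.predict(X2)],
#                    axis=1)                       # concatenated [F1, F2, F3]
```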
B. Second Phase

The second phase is the classification phase, in which we used four classifiers to predict the class of a cancer sample according to the 31 cancer types: support vector machine (SVM), decision tree (DT), random forest (RF) and k-nearest neighbors (KNN) were the models chosen and trained to fulfill the diagnosis task. The performance of the models was assessed through hold-out validation, where we split the data into 70% for training and 30% for testing. Besides, to evaluate the ability of our SSAE to learn new feature representations, we compared the performance of the trained classifiers with that of classifiers trained on features generated by state-of-the-art unsupervised dimensionality reduction methods, namely principal component analysis (PCA) and kernel principal component analysis (KPCA). The two phases of our analytical architecture were implemented using Python 3.5 and Keras [23] with the TensorFlow backend; the experiments were run on an HP-bs0xx machine with an Intel Core i7-7500U CPU @ 2.70GHz x 4 and 8 GB of memory.
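The second phase can be sketched with scikit-learn as follows. The classifier hyperparameters (library defaults) and the choice of 150 components for PCA/KPCA (matching the dimension of the concatenated SSAE features) are our assumptions; the paper does not report them.

```python
# A sketch of the predictive phase: the four classifiers are trained on the
# concatenated SSAE features and, for comparison, on PCA and KPCA projections
# of the same data.
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA, KernelPCA

classifiers = {'SVM': SVC(), 'DT': DecisionTreeClassifier(),
               'RF': RandomForestClassifier(), 'KNN': KNeighborsClassifier()}
reducers = {'SSAE': None,                    # F_train/F_test from the encoders
            'PCA': PCA(n_components=150),
            'KPCA': KernelPCA(n_components=150, kernel='rbf')}

# for red_name, reducer in reducers.items():
#     if reducer is None:
#         Xtr, Xte = F_train, F_test         # SSAE encoder outputs per split
#     else:
#         Xtr = reducer.fit_transform(X_train)
#         Xte = reducer.transform(X_test)
#     for clf_name, clf in classifiers.items():
#         clf.fit(Xtr, y_train)
#         print(red_name, clf_name, clf.score(Xte, y_test))
```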
Fig. 4: Accuracy score of the classifiers on SSAE and the other dimensionality reduction methods

The overall accuracy score of each classifier (figure 4) shows that the predictive models trained on the feature representation extracted from the SSAE are more powerful in predicting the class of each sample. SVM/SSAE scored the highest accuracy in discriminating between the different cancer types, with a performance that reaches approximately 95%. For DT, KPCA was able to overcome our approach with a difference of 0.02. For KNN and RF, the performance of the classifiers on each dimensionality reduction approach was very close, with a superiority of our approach at accuracies of 92% and 89%, respectively.

Since accuracy alone is not enough to evaluate a classifier, and since our problem is a multi-class classification problem, we chose additional metrics to evaluate the performance of our models across all the trained classifiers. We used micro, macro and weighted average values to evaluate the consistency of each classifier in the prediction of each class (tables IV, V, VI, VII). Table IV represents the case with the best performance among the classifiers. We conclude from the results that SVM/SSAE is the best performing model; the micro average scores reflect the ability of the model to predict positive samples at a high rate (95%) for both micro average precision and micro average recall. Equally, the macro average and weighted average results are very promising, despite the fact that our class sizes are imbalanced.

TABLE IV: SVM classifier micro/macro/weighted average scores; P: Precision, R: Recall, f1-s: f1-score

        Metric   micro-Av   macro-Av   weighted-Av   Acc
SSAE    P        0.95       0.95       0.95
        R        0.95       0.92       0.95          0.947
        f1-s     0.95       0.94       0.95
PCA     P        0.89       0.94       0.92
        R        0.89       0.79       0.89          0.894
        f1-s     0.89       0.83       0.89
KPCA    P        0.80       0.63       0.78
        R        0.80       0.59       0.80          0.803
        f1-s     0.80       0.59       0.77

TABLE V: RF classifier micro/macro/weighted average scores; P: Precision, R: Recall, f1-s: f1-score

        Metric   micro-Av   macro-Av   weighted-Av   Acc
SSAE    P        0.90       0.92       0.90
        R        0.90       0.85       0.90          0.899
        f1-s     0.90       0.86       0.89
PCA     P        0.87       0.90       0.88
        R        0.87       0.81       0.87          0.874
        f1-s     0.87       0.82       0.86
KPCA    P        0.88       0.89       0.88
        R        0.88       0.83       0.86          0.881
        f1-s     0.88       0.84       0.87

TABLE VI: DT classifier micro/macro/weighted average scores; P: Precision, R: Recall, f1-s: f1-score

        Metric   micro-Av   macro-Av   weighted-Av   Acc
SSAE    P        0.84       0.86       0.84
        R        0.84       0.79       0.84          0.838
        f1-s     0.84       0.81       0.83
PCA     P        0.82       0.84       0.82
        R        0.82       0.77       0.82          0.818
        f1-s     0.82       0.79       0.82
KPCA    P        0.85       0.87       0.85
        R        0.85       0.80       0.85          0.851
        f1-s     0.85       0.82       0.85

TABLE VII: KNN classifier micro/macro/weighted average scores; P: Precision, R: Recall, f1-s: f1-score

        Metric   micro-Av   macro-Av   weighted-Av   Acc
SSAE    P        0.92       0.91       0.92
        R        0.92       0.91       0.92          0.923
        f1-s     0.92       0.91       0.92
PCA     P        0.92       0.92       0.92
        R        0.92       0.90       0.92          0.919
        f1-s     0.92       0.90       0.92
KPCA    P        0.92       0.90       0.92
        R        0.92       0.90       0.92          0.918
        f1-s     0.92       0.90       0.92

Tables V and VII exhibit the overall performance of the RF and KNN classifiers, where our feature representation learning model was able to slightly overcome those trained on PCA and KPCA. Table VI shows the one case where our DT/SSAE model was not able to perform better than the DT/KPCA classifier. Overall, the collection of result tables exhibits the high consistency of the SSAE features: across most of the classifiers our model scored the highest values, and in all the experiments we tested, the PCA features were unable to perform better than ours, although KNN/PCA came very close to KNN/SSAE, with equal micro average and weighted average values; only a small difference was captured by the macro average values.

Compared to the results published in [8] and [17], we can say that our model is very powerful in discriminating between the 31 cancer types, despite the fact that some cancer types have very few samples. Cheerla et al. [8] addressed this problem by eliminating the types with smaller numbers of patients, so they worked on only 21 cancer types, using semi-supervised learning to raise the accuracy score to 97%. A. L. Rincon et al. [17] likewise dealt with 29 cancer types to reach a training accuracy of 96%. We also assume that integrating more characteristics, such as stage and gender, into our analytical strategy may improve the results for the 31 predicted cancer types.
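For reference, the micro/macro/weighted averages reported in Tables IV-VII can be reproduced with scikit-learn's metric utilities, as in this short sketch (y_test and y_pred denote the held-out labels and one trained classifier's predictions):

```python
# Per-class consistency metrics for one trained classifier.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_pred = clf.predict(Xte)
# print('Acc: %.3f' % accuracy_score(y_test, y_pred))
# for avg in ('micro', 'macro', 'weighted'):
#     p, r, f1, _ = precision_recall_fscore_support(y_test, y_pred, average=avg)
#     print('%s-Av  P=%.2f  R=%.2f  f1-s=%.2f' % (avg, p, r, f1))
```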
VI. CONCLUSION

In this paper we implemented a stacked sparse unsupervised autoencoder to learn a new feature representation that may help promote genetic cancer diagnosis based on the short non-coding RNA regions, which play a significant role in silencing, regulating and managing the transcription process in the human body. The learned features have been evaluated through supervised models, where our proposed unsupervised feature learning model was able to generate a new discriminant data representation, leading to a method that is competitive with the state-of-the-art. We believe that the collection of new samples, a move toward semi-supervised classification, or the integration of some clinical information may enhance the results obtained in this work; also, the use of the PanCancer data set may give our model the flexibility to be easily applied to other cancer types generated from different genomic data banks for further research.

REFERENCES

[1] F. Cristiano, P. Veltri. "Methods and techniques for miRNA data analysis", in Microarray Data Analysis. Humana Press, New York, NY, 2015, pp 11-23.
[2] S. Tam, M. S. Tsao, J. D. McPherson. "Optimization of miRNA-seq data preprocessing". Briefings in Bioinformatics, 2015, pp 950-963.
[3] S. Sing, et al. "Machine learning techniques in exploring microRNA gene discovery, targets, and functions", in Bioinformatics in MicroRNA Research. Humana Press, New York, NY, 2017, pp 211-224.
[4] P. H. Gunaratne, C. Coarfa, B. Soibam, A. Tandon. "miRNA Data Analysis: Next-Gen Sequencing", in Fan JB. (ed.) Next-Generation MicroRNA Expression Profiling Technology. Methods in Molecular Biology (Methods and Protocols). Humana Press, 2012.
[5] R. C. Lee, R. L. Feinbaum, V. Ambros. "The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14". Cell, 1993, pp 843-854.
[6] J. Xuan, Y. Yu, T. Qing, L. Guo, L. Shi. "Next-generation sequencing in the clinic: promises and challenges". Cancer Letters, 2013, pp 284-295.
[7] K. R. Kukurba, S. B. Montgomery. "RNA sequencing and analysis". Cold Spring Harbor Protocols, 2015.
[8] N. Cheerla, O. Gevaert. "MicroRNA based Pan-Cancer diagnosis and treatment recommendation". BMC Bioinformatics, 2017.
[9] J. Liu, et al. "An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics". Cell, 2018, pp 400-416.
[10] J. Lu, et al. "MicroRNA expression profiles classify human cancers". Nature, 2005.
[11] A. Kotlarchyk, T. Khoshgoftaar, M. Pavlovic, H. Zhuang, A. S. Pandya. "Identification of microRNA biomarkers for cancer by combining multiple feature selection techniques". Journal of Computational Methods in Sciences and Engineering, 2011, pp 283-298.
[12] D. Ting-ting, S. Chang-ji, D. Yan-shou, B. Yi-duo. "Analysis of miRNA expression profile based on SVM algorithm", in IOP Conference Series: Earth and Environmental Science. IOP Publishing, 2018.
[13] P. Yongjun, P. Minghao, R. Keun Ho. "Multiclass cancer classification using a feature subset-based ensemble from microRNA expression profiles". Computers in Biology and Medicine, 2017, pp 39-44.
[14] M. Anidha, K. Premalatha. "An application of fuzzy normalization in miRNA data for novel feature selection in cancer classification". Biomed. Res., 2017, 28.9: 4187-4195.
[15] R. Ibrahim, N. A. Yousri, M. A. Ismail, N. M. El-Makky. "Multi-level gene/MiRNA feature selection using deep belief nets and active learning", in 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2014, pp 3957-3960.
[16] L. Fu, Q. Peng. "A deep ensemble model to predict miRNA-disease association". Scientific Reports, 2017.
[17] A. L. Rincon, et al. "Evolutionary optimization of convolutional neural networks for cancer miRNA biomarkers classification". Applied Soft Computing, 2018, pp 91-100.
[18] I. Goodfellow, Y. Bengio, A. Courville. "Deep Learning". MIT Press, 2016.
[19] M. Tschannen, O. Bachem, M. Lucic. "Recent advances in autoencoder-based representation learning". arXiv preprint arXiv:1812.05069, 2018.
[20] Y-D. Zhang, et al. "Seven-layer deep neural network based on sparse autoencoder for voxelwise detection of cerebral microbleed". Multimedia Tools and Applications, 2018, pp 10521-10538.
[21] L. Chen, et al. "Softmax regression based deep sparse autoencoder network for facial emotion recognition in human-robot interaction". Information Sciences, 2018, pp 49-61.
[22] C. Zhang, et al. "Deep Sparse Autoencoder for Feature Extraction and Diagnosis of Locomotive Adhesion Status". Journal of Control Science and Engineering, 2018.
[23] F. Chollet, et al. "Keras". https://keras.io, 2015.