Formation and Analysis of Gene Expression Data Based on the Joint Use of Data Mining and Machine Learning Techniques Lyudmyla Yasinska-Damri1, Sergii Babichev2,3, Aleksander Spivakovsky2 and Oleksandr Lemeshchuk2 1 Ukrainian Academy of Printing, Pid Goloskom street, 19, Lviv, 79000, Ukraine 2 Kherson State University, University street, 27, Kherson, 73000, Ukraine 3 Jan Evangelista Purkyne University in Usti nad Labem, Pasteurova, 15, Usti nad Labem, 400 96, Czech Republic Abstract Creating a system of complex disease diagnosis based on gene expression data using modern data mining and machine learning techniques is one of the topical areas of recent bioinformatics. The main problem in this subject area consists of a large volume of experimental data; each investigated object contains approximately 20000 genes. In this paper, we propose the technique of both formation and analysis of gene expression data based on the joint use of various computational and intelligent methods of complex data processing. As the experimental data, we use the gene expression data of patients who were investigated for various types of cancer diseases. Initially, the data was formed and analyzed using the functions and modules of TCGAbiolinks and Bioconductor packages. Then, the non-informative genes in terms of both the statistical and entropy criteria were removed from the initial database. To identify the level of significance of gene expression profiles, we have applied the general Harrington desirability index, which contains, as the components, transformed statistical and entropies criteria. Finally, we applied the random forest classifier and convolutional neural network with the calculation of various types of classification quality criteria for the binary and multi-classification of the investigated objects in the first and the second cases, respectively. Analyzing the simulation results has shown that the identification accuracy of the examined samples is high in all cases. To our mind, the proposed technique creates the basis for improving complex disease diagnosis systems based on gene expression data. Keywords 1 Gene expression profiles, Shannon entropy, statistical criteria, Harrington desirability function, Random Forest (RF), Convolution Neural Network (CNN), classification quality criteria 1. Introduction Modern artificial information processing systems used in various fields of both intelligent data analysis and machine learning are, in most cases, based on the analogy of the functioning of relevant processes in biological organisms. Such processes should include the functioning of the gene network, immune processes, functioning of neural networks, etc. A particularity of such systems is a high level of complexity, the ability to learn, parallelization of information processing, a high level of protection, and the ability to recognize and make adequate decisions. In this context, developing modern artificial models of big data processing is possible using a systematic approach, which involves combining knowledge and methods from various practical fields, such as molecular biology, mathematics, computer science, physics, and chemistry. This approach creates conditions for increasing the IntelITSIS’2023: 4th International Workshop on Intelligent Information Technologies and Systems of Information Security, March 22–24, 2023, Khmelnytskyi, Ukraine EMAIL: Lm.yasinska@gmail.com (L. Yasinska-Damri); sergii.babichev@ujep.cz (S. Babichev); spivakovsky@ksu.kherson.ua (A. Spivakovsky); olemeshchuk@ksu.ks.ua (O. Lemeshchuk) ORCID: 0000-0002-8629-8658 (L. Yasinska-Damri); 0000-0001-6797-1467 (S. Babichev); 0000-0001-7574-4133 (A. Spivakovsky); 0000- 0002-9876-3502 (O. Lemeshchuk) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) objectivity of processing big data in real-time due to the use of ensembles of methods, hybrid models and effective quality criteria for evaluating results at the appropriate stage of the data processing. One of the actual problems of modern bioinformatics is the processing of gene expression data obtained by performing DNA microarray experiments or by the more modern method of RNA molecule sequencing. The main particularity of the experimental data is the large number of attributes that define the state of the corresponding biological organism. As well know, the number of expressed (active) genes that make up the human genome is approximately equal to 25,000. At the same time, the number of each type of gene varies within the range from 0 to hundreds of thousands ones. Creating systems of complex objects diagnosis or models of gene network reconstruction based on complete data is ineffective since interpreting the obtained results is problematic in this instance. Deep learning models such as deep neural networks, convolutional neural networks, and deep artificial networks exist and they are actively used for processing big data, allowing for obtaining satisfactory diagnostic accuracy on big data. Still, at the same time, there is a problem with training time, network sensitivity, and verification received models on other similar data. Solving the abovementioned problem is possible by developing new and improving existing intelligent models, information technologies and algorithms focused on big data processing. The implementation of this process assumes the formation of high- quality experimental data by applying adequate methods of gene expression values normalization and reduction of lowly-expressed genes at the first stage. In the second stage, the problem of forming subsets of differentially expressed and mutually correlated gene expression profiles arises by applying hybrid models of high-dimensional data clustering. The third stage is developing a medical diagnosis model on the basis of the allocated clusters of expression values using deep learning methods with appropriate verification of the obtained results. However, we would like to note that in spite of specific achievements in this field, the effective solution to the problem of gene expression data processing in order to create a medical diagnostic system that allows us to identify with acceptable accuracy the patient's state at an early stage of the disease is absent nowadays. To our mind, increasing the effectiveness of current intelligent systems focused on gene expression data processing for the following creation of various disease diagnostic systems can be achieved by providing the following: • justified choice of methods for formation and analysis of experimental data to form an array of gene expression values for the investigated objects; • application of effective metrics for assessing the proximity between gene expression profiles based on the assessment of mutual information between the respective profiles; • justified shaping subsets of diversely expressed and jointly correlated gene expression profiles using effective and adequate clustering quality criteria; • the creation of hybrid models for diagnosing the state of the investigated object using inductive models of objective clustering and deep neural networks; • justified choice of methods to validate the models of complex disease diagnosis with the application of appropriate quality criteria for evaluating the model's effectiveness on the test subset of expression values. The goal of the paper is to develop the model of formation, analysis and processing of gene expression data on the basis of the use of the functions of both TCGAbiolinks and Bioconductor packages of R software, Random Forest classification algorithm and Convolutional Neural Network. 2. Related works Recently, much scientific research focused on solving the challenges of formation and analyzing gene expression data for their reduction and further application in models for diagnosing the object state or the gene regulatory network reconstruction. So, the paper [1] proposed a hybrid method for extracting genes by the joint use of the Information Gain method and the Standard Genetic Algorithm. This method was named as IG/SGA method. Initially, the IG information increment method was used to reduce the non-informative objects. Then, the data was processed using a genetic algorithm. Finally, the authors have applied the genetic programming (GP) method to carry out the object classification procedure. The simulation procedure assumed the use of seven sets of gene expression data of patients investigated for cancer disease. The authors have shown that the application of the offered technique can allow them to reach 100% classification correctness in two instances. In other cases, the classification accuracy was less. In [2,3], the authors showed preferences of the Ant Colony Optimization (ACO) algorithm application when the hybrid models of complex data processing were developed. So, the method offered in [2] is based on the joint application of cellular learning automata and the ACO method (CLA-ACO). The practical realization of the proposed pattern assumes three phases. Initially, the Fisher-filtered method is applied to the data. At this phase, the non-informative genes in expression values were removed from the initial data. Then, CLA and ACA techniques with optimal functions are applied to a subset of gene expression profiles. In the third phase, the authors formed and analyzed the final subsets of gene expression data by assessing the errors of both the first and the second types using the ROC analyzing method. The presented in this manuscript testing results showed the high performance of the offered method for the classification of the examined objects. In [3], the authors combined ACO algorithm and Adaptive Stem Cell Optimization (ASCO) technique for the purpose of forming gene expression datasets. The proposed model filtered the data by assessing the mutual information values in the first phase. Then, various types of classifiers were applied to the formed datasets to assess the model performance. Analyzing the received results, in this instance, also indicates the high performance of this pattern. An analysis of various current hybrid patterns focused on forming the subsets of gene expression datasets for the purpose of creating or improving diagnostic systems based on gene expression data is presented in [4]. The authors analyzed in detail, with the allocation of advantages and shortcomings, various intelligent systems and algorithms such as the Mutual Information Maximization (MIM) filtering method, Genetic Algorithm (GA) grouping method and SVM classifier [5]; Correlation- based Feature Selection (CFS) filtering method, Genetic Algorithm (GA) grouping method and KNN classifier [6]; Laplacian and Fisher score filtering method, Genetic Algorithm (GA) grouping method and SVM, KNN and NB classifiers [7]; Independent Component Analysis (ICA) filtering method, Artificial Bee Colony (ABC) grouping method and NB classifier [8] and other hybrid techniques based on the combined application of various data mining and machine learning algorithms [9-14]. The analysis of various hybrid model operation results has allowed the authors to remark that the challenger of objective formation of subsets of genes, which can allow us to recognize the examined samples with high-reliability levels, does not unambiguous solved nowadays. In most proposed patterns, high classification performance is ripened when applying a small number of genes. This problem can be solved by using deep learning methods at the phase of gene expression data classification. So, in [15] the authors describe the technique focused on the early diagnosis of cancer disease using a Deep Neural Network (DNN). The proposed model was tested on 37 different kinds of cancer. The authors used datasets from the TCGA cancer genome atlas. The experimental dataset consisted of 10,663 samples (9,807 tumors and 856 normal samples) for 37 cancer types, and they contained the expression of 10,000 genes. The models of deep neural networks with three different structures were investigated: three-levels (3NN), five-levels (5NN) and nine-levels (9NN), while a comparative analysis was performed with the model on the basis of the use of SVM method by assessing the classification performance. An analysis of the obtained results has shown that the highest diagnostic accuracy was achieved when using the 5NN model, but the performance of disease classification was not satisfactory. Moreover, the sensitivity of the model to the type of cancer is also not high. This fact indicates that the proposed model needs to be refined. In [16], the authors investigated different types of patterns based on the use of an ensemble of DM and ML algorithms, including deep learning methods for the diagnosis of three types of cancer: lung, breast, and stomach cancer. 162 lung cancer samples, 878 breast cancer samples, and 271 stomach cancer samples were used. The accuracy of disease diagnosis for all data sets was 98%. However, it should be noted that despite the impressive results, the proposed model has several significant shortcomings. First of all, the proposed model is very complex and requires both a lot of time on model training and large computer resources. The second significant drawback is that the model based on an ensemble of deep learning methods is challenging to interpret. In [17], the authors investigated several structures of Convolutional Neural Networks (CNN) with various combinations of hyper-parameters for the diagnosis of different types of cancer using unstructured gene expression data. During the simulation process, three models of CNN with different combinations of hyper-parameters were investigated: one-dimensional neural network model 1D-CNN; two-dimensional CNN 2D-Vanilla-CNN; hybrid convolutional neural network 2D-Hybrid-CNN (as input contains a two-dimensional matrix and a one-dimensional vector of gene expressions). The models were trained and tested on gene expression data containing 10,340 samples from 33 types of cancer and 713 samples taken from normal tissues (no tumor detected). The data contained 7100 genes whose expression was used as attributes. Data were also taken from the Cancer Genome Atlas (TCGA). Analyzing the authors' simulation results showed the high performance of the proposed patterns. For the diagnosis of 34 classes (33 with cancer tumors and one normal), the prediction accuracy on the test subset of data varied within the range from 93.9% to 95%. However, it should be noted that training the proposed models also requires much time. In addition, the authors used unstructured data, which may also affect the accuracy and sensitivity of the proposed models operating. The hereinbefore presented brief analysis of the current state of research regarding the formation, analysis and processing of expression data to develop and improve disease diagnostic systems allows us to conclude that nowadays, there is no effective and complete information technology for processing gene expression data to diagnose complex diseases. Existing models of disease diagnosis systems on the basis of gene expression data can allow us to identify the relevant disease with a certain accuracy. However, they have certain shortcomings, which are associated with the imperfection of big data processing models for selecting informative gene expression profiles, low objectivity in the formation and further processing of gene expression profiles, low objectivity of decision-making regarding the presence or absence of the relevant disease, etc. These facts indicate the actuality of the presented research. 3. Gene expression data formation and analysis We used gene expression data of patients examined for various cancer kinds from The Cancer Genome Atlas (TCGA) [18]. The formation of experimental data was carried out using the functions of the TCGAbiolinks package [19], which is a part of the Bioconductor package [20] of the R software [21]. The preference of this data is determined by the fact that cancer is currently one of the primary reasons for death globally, and treatment methods for this disease vary in a wide range. TCGA database, which was launched in 2006 to collect and analyze different data for more than 35 various kinds of cancer by selecting many cases of different tumor types, nowadays is one of the most comprehensive human cancer databases, containing various kinds of data. Tumors formed using TCGA tools varied from solid to hematologic, from mild to highly aggressive considering survival, and from beginning to metastatic. Thus, the experimental gene expression data contained in TCGA is a very powerful platform for both the approbation and evaluation of the effectiveness of the appropriate model focused on processing gene expression data in order to create a hybrid model for the diagnosis of complex diseases. The advantages of the TCGAbiolinks package should also include the fact that its use can allow us to download and generate data from various platforms that are used in current times in various research institutions to generate experimental data based on genome analysis. Moreover, the functions of this package can allow us to implement various types of data formation, analysis and processing using various patterns of visualization of the particularities of data distribution in the relevant features space. Within the framework of the research, we used gene expression data obtained by applying the method of RNA molecule sequencing received on the Illumina platform [22]. Each sample of initial data contained, as attributes, 19947 types of genes, while the expression of the corresponding gene is proportional to its activity level. The experimental database contained nine types of data, eight of which corresponded to eight types of cancer, and the ninth data corresponded to samples for which no cancer was identified as a result of clinical trials. The total number of studied samples was equal to 3269. Thus, at the initial stage of modelling, the gene expression matrix had the following form: 𝐸𝐸 = (3296 × 19947). The classification of experimental data is presented in Table 1 and Figure 1. Table 2 shows the general statistics of the number of genes' maximum values distribution, which contained experimental data for all the studied samples. Analyzing the data in Table 2 indicates that there is a certain number of non-expressed genes for all samples (zero number) and a certain number of low-expressed genes (the number of active genes for all samples is significantly less than the number of other genes). These genes can be removed from the data without significant loss of useful information. Table 1 Experimental datasets of gene expression data No Kind of cancer Number of samples 1 Adrenocortical carcinoma - ACC 79 2 Glioblastoma multiforme - GBM 169 3 Sarcoma - SARC 263 4 Lung squamous cell carcinoma - LUSC 502 5 Lung adenocarcinoma - LUAD 541 6 Stomach adenocarcinoma - STAD 415 7 Kidney renal clear cell carcinoma - KIRC 542 8 Brain Lower Grade Glioma - LGG 534 9 Normal (without tumor) 224 Figure 1: The bar chart of the sample’s distribution in the gene expressions experimental data for the various kinds of cancer Table 2 General statistics of the number of genes' maximum values distribution for all samples Min 1st quartile Median Mean 3rd quartile Max 0 3775 12030 44324 30945 6303246 Moreover, the analysis of the data in Table 2 also shows the need to transform the absolute values of the number of expressed genes into more convenient values with a smaller range of their variation. The Bioconductor package [20] offers a method of transforming the absolute values of the number of genes into a relative value (count-per-million - cpm), which can be calculated as follows: ′ 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑖𝑖𝑖𝑖 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑖𝑖𝑖𝑖 = ∑𝑚𝑚 ∙ 106 (1) 𝑗𝑗=1 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑖𝑖𝑖𝑖 where: 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑖𝑖𝑖𝑖 is the number of the j-th gene for the i-th sample; m is the number of the examined genes. The general statistics of the transformed by formula (1) the number of genes maximum values distribution for all investigated samples are presented in Table 3. Table 3 General statistics of the transformed quantity of genes' maximum values distribution for all samples Min 1st quartile Median Mean 3rd quartile Max 0 76.73 237.87 886.58 598.56 172693.66 As can be seen, the variation range of the transformed values is significantly smaller than the initial data presented in Table 2, which simplifies the procedure for further processing of experimental data. The next step of the data preparation is the removal from the experimental data of unexpressed gene expression profiles by the corresponding thresholding coefficient. In our experiment, the value of this coefficient corresponds to the condition 𝑙𝑙𝑙𝑙𝑙𝑙2 �𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑡𝑡 ′ 𝑖𝑖𝑖𝑖 � = 0 for all investigated samples. According to this condition, genes whose profiles correspond to the condition max(𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑡𝑡 ′ ) ≤ 1 are considered non- expressed, and they are removed from the data. At this stage, the number of genes was reduced by 682, and the gene expression matrix took the following form: 𝐸𝐸 = (3296 × 19265). At the next stage, the values of gene expressions were transformed into a more convenient range by applying the function 𝑙𝑙𝑙𝑙𝑙𝑙2 (∙) to all values of the matrix. In this instance, the simulation results have shown that some values smaller than one were transformed into corresponding negative values. This is unacceptable because the minimum gene expression value is 0 (the gene is inactive). Therefore, at this stage, negative values were replaced by zero, corresponding to non-expressed genes. Figure 2 shows the distribution of normalized expression values of nine genes for one adrenocortical carcinoma (ACC) sample. Figure. 3 shows the nature of the expression values distribution for one type of gene for all investigated samples. In this case, the genes A1BG.1 (Fig. 3a), A2M.2 (Fig. 3b), NAT1.9 (Fig. 3c) and NAT2.10 (Fig. 3d) were studied. The analysis of the obtained results allows us to conclude that the nature of the gene expression values distribution differs significantly for different samples both in absolute value and variance. This fact creates the conditions for identifying samples by expression values when applying the appropriate classifier. Figure 2: Boxplots of the expression values distribution of different types of genes for one sample of adrenocortical carcinoma (ACC) The next phase of the experimental data processing is the extraction of uninformative genes considering various kinds of statistical and entropy criteria using Harrington's desirability function by the technique described in detail in [23]. 4. Filtering gene expression profiles by statistical and entropy criteria As statistical criteria, we used the maximum absolute value of the expression assessed for all samples and the variance of the appropriate gene expression profile. Shannon entropy was calculated using the James-Stein shrinkage estimator. Figure 4 shows the boxplots of the distribution of the various criteria values. Analyzing the charts indicates an increase in the gene expression profile significance with an increase in the statistical criteria values and a decrease in the value of the entropy criterion. In accordance with the above, the algorithm for calculating a complex criterion based on the Harrington method, which determines the level of the significance of the genes, assumes the following phases: 1. Assessing the linear equations coefficient values. At this phase, we used the boundary values of both the appropriate criteria and Y dimensionless parameter (𝑌𝑌𝑚𝑚𝑚𝑚𝑚𝑚 = −2; 𝑌𝑌𝑚𝑚𝑚𝑚𝑚𝑚 = 5): (𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚,𝑣𝑣𝑣𝑣𝑣𝑣) 𝑌𝑌𝑚𝑚𝑚𝑚𝑚𝑚 = 𝑎𝑎(𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎,𝑣𝑣𝑣𝑣𝑣𝑣) + 𝑏𝑏 (𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚,𝑣𝑣𝑣𝑣𝑣𝑣) ∙ min(𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚, 𝑣𝑣𝑣𝑣𝑣𝑣) (𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚,𝑣𝑣𝑣𝑣𝑣𝑣) 𝑌𝑌𝑚𝑚𝑚𝑚𝑚𝑚 = 𝑎𝑎(𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚,𝑣𝑣𝑣𝑣𝑣𝑣) + 𝑏𝑏 (𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚,𝑣𝑣𝑣𝑣𝑣𝑣) ∙ max(𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚, 𝑣𝑣𝑣𝑣𝑣𝑣) (2) (𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) 𝑌𝑌𝑚𝑚𝑚𝑚𝑚𝑚 = 𝑎𝑎(𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) − 𝑏𝑏 (𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) ∙ max(𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) (𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) 𝑌𝑌𝑚𝑚𝑚𝑚𝑚𝑚 = 𝑎𝑎(𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) − 𝑏𝑏 (𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) ∙ min(𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) Figure 3: Boxplots of the expression values distribution of one type of gene for different samples. The studied genes were: a) A1BG.1; b) A2M.2; c) NAT1.9; d) NAT2.10 Figure 4: Boxplots of the distribution of the gene expression profiles statistical criteria values and Shannon's entropy: a) maximum expression value; b) variance; c) Shannon entropy 2. Calculation of the Y parameter values for each value of the criteria: 𝑌𝑌𝑌𝑌(𝑎𝑎𝑎𝑎𝑎𝑎) = 𝑎𝑎(𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚) + 𝑏𝑏 (𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚) ∙ max_𝑎𝑎𝑎𝑎𝑎𝑎 𝑌𝑌 (𝑣𝑣𝑣𝑣𝑣𝑣) = 𝑎𝑎(𝑣𝑣𝑣𝑣𝑣𝑣) + 𝑏𝑏 (𝑣𝑣𝑣𝑣𝑣𝑣) ∙ var (3) 𝑌𝑌 (𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) = 𝑎𝑎(𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) − 𝑏𝑏 (𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) ∙ entr where: max_abs is the maximum absolute expression values of the corresponding gene profile; var and entr are the Shannon entropy end variance of this gene expression profile, respectively; 3. Evaluation of d-values for each value of the Y parameter: 𝑑𝑑 (𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚) = exp (−exp (−𝑌𝑌 (𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚) )) 𝑑𝑑(𝑣𝑣𝑣𝑣𝑣𝑣) = exp (−exp (−𝑌𝑌 (𝑣𝑣𝑣𝑣𝑣𝑣) )) (4) 𝑑𝑑(𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) = exp (−exp (−𝑌𝑌 (𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) )) 4. Evaluation of the generalized index D as a geometric average of all d-values. This index indicates the gene expression profile significance: 3 𝐷𝐷 (𝑔𝑔𝑔𝑔𝑔𝑔𝑔𝑔𝑔𝑔𝑔𝑔𝑔𝑔) = �𝑑𝑑(𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚) ∙ 𝑑𝑑(𝑣𝑣𝑣𝑣𝑣𝑣) ∙ 𝑑𝑑(𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒) (5) A higher value of the index (5) corresponds to a higher significance of the gene expression profile according to the used criteria. Table 4 presents the general statistics of the general index values distribution. Table 4 General statistics of the general index values distribution Min 1st quartile Median Mean 3rd quartile Max 0.007239 0.029438 0.039786 0.074478 0.075635 0.947162 Filtering gene expression profiles at this stage involved removing profiles with a generalized desirability index lower than the value of the index corresponding to the first quartile, i.e., 0.029438. The number of gene expression profiles, in this case, was reduced by 4814, i.e., the filtered matrix of gene expression values took the form: 𝐸𝐸 = (3296 × 14451). 5. Evaluation of the proposed technique adequacy using the machine learning techniques Assessment of the gene expression profiles filtering model adequacy was performed in two stages. In the first stage, a binary Random Forest classifier was applied to the samples, including the complete set of gene expressions (19265) and the set of gene expressions after removing non-informative ones by the used criteria (14451). The efficiency of using the Random Forest classifier for the binary identification of samples used as the attributes of big data is proved in [24]. In the first step, the complete set of samples was divided into eight subsets, each of which contained gene expressions of samples for which no tumor was detected (224 samples) and samples with a cancerous tumor, taking into account the type of disease. In accordance with the methodology presented in [24,25], the experimental data were divided into two subsets, 65% of the samples were used for training the classifier, and 35% for testing the obtained model with the evaluation of data classification performance. In the second phase, we applied the convolutional neural network (CNN) for complete and filtered gene expression data to solve a multi-class task. Taking into account the simulation results presented in [26], we used a one-dimensional two-layered CNN, the structural topology of which is shown in Figure 5. For the correct initialization of the first and second layer filters, the experimental data were supplemented with profiles with zero expression up to 19300 and 14500 in the case of using complete and filtered gene expression data, respectively. The kernel size was set to 8, and the compact layer density was 256. The results were evaluated on a test subset of the data, the number of objects of which was 35% of the total number of investigated objects (1144 out of 3269). Table 5 shows the simulation results regarding the binary classification of the investigated samples using the complete set of gene expression profiles (19265 genes). It should be noted that when applying filtered data (14451 genes), the results were almost the same. Analysis of the data in Table 5 indicates the high accuracy of binary classification using complete gene expression data. In three cases, the model identified the patient’s state with 100% accuracy. In the other five cases, the accuracy of identifying the patient's state according to the appropriate classification criteria is not 100% but is also very high. Figure 5: Topology of one-dimensional 2NN CNN used as multi-classifier to identify both the state of the patient and type of cancer Table 5 Results of the investigated samples binary classification using the complete set of gene expression profiles Type of data Classification quality criteria Accuracy Sensitivity Specificity F-measure МСС acc 1 1 1 1 1 gbm 0.993 0.983 1 0.988 0.985 kirc 1 1 1 1 1 lgg 0.992 0.995 0.987 0.994 0.982 luad 0.989 0.984 1 0.987 0.973 lusc 0.992 0.989 1 0.990 0.982 sarc 1 1 1 1 1 stad 0.987 0.963 1 0.975 0.971 Tables 6 and 7 present the research results regarding applying a CNN to solve a multiclass task based on complete and filtered gene expression data, respectively. Table 6 The results of the application of CNN for the classification of various types of cancer when using complete data (19265 genes) Class Number of Sensitivity Specificity F-measure Accuracy samples ACC 24 0.85 0.96 0.9 GBM 73 0.97 0.93 0.95 KIRC 169 0.99 0.98 0.99 LGG 191 0.97 0.98 0.98 97% LUAD 200 0.96 0.96 0.96 LUSC 169 0.96 0.96 0.96 SARC 73 0.96 0.96 0.96 STAD 96 0.94 0.96 0.95 NORMAL 149 0.99 0.98 0.99 The simulation results indicate the high efficiency of the Convolutional Neural Network for solving multiclass problems. Thus, from 1144 samples used for the model testing, only 36 and 44 were identified incorrectly when using the complete and filtered data of gene expression profiles, respectively. The classification accuracy was 97% at using complete data and 96% when using filtered data. Moreover, the analysis of the values of the classification quality criteria by classes shows the high accuracy of identification of samples for which no cancer was detected. This fact confirms the results of the binary classification obtained using the Random Forest (RF) classification algorithm and indicates that the CNN can be effectively used successfully in systems for identifying the presence or absence of a cancerous tumour based on gene expression data. Table 7 The results of the application of CNN for the classification of various types of cancer when using filtered data (14451 genes) Class Number of Sensitivity Specificity F-measure Accuracy samples ACC 24 0.74 0.96 0.84 GBM 73 0.96 0.92 0.94 KIRC 169 0.99 0.99 0.99 LGG 191 0.97 0.98 0.97 96% LUAD 200 0.94 0.96 0.95 LUSC 169 0.96 0.93 0.95 SARC 73 0.97 0.96 0.97 STAD 96 0.95 0.93 0.94 NORMAL 149 0.99 0.99 0.99 Figure 6 shows charts of accuracy and loss function values obtained on a set of objects that were used for training the model validation. The same charts were obtained for filtered data. Figure 6: Charts of accuracy and loss function values obtained on the subset of objects that were used directly for network training and model validation using the complete dataset during the model training When the simulation procedure was implemented, the set of data allocated for training the model was shared into two subsets in the ratio of 0.7/0.3. 70% of samples were used directly for training and 30% for network validation during the training process. Convergence of both the classification accuracy and the loss function obtained on both data subsets indicates the absence of retraining of the network. In this case, the classification results received on the test data can be considered adequate. The analysis of the obtained results also shows a slight discrepancy in the results of the identification of the samples obtained on the complete and filtered subsets. This fact confirms the expediency of removing non- informative genes at the stage of data preprocessing according to statistical and entropy criteria since reducing the certain number of profiles in this case almost does not affect the results of the investigated samples classification'. However, we would like to note that the time for data processing and the amount of computer resources required for data processing are significantly less in comparison with the processing of complete data. 6. Conclusions In this research, we have presented the stepwise procedure of gene expression data formation, analysis and preprocessing based on the combined use of various techniques of big data processing. The gene expression data of patients examined for various kinds of cancer from The Cancer Genome Atlas were applied as the experimental ones when the simulation procedure was implemented. This dataset contained nine types of data, eight of which corresponded to eight types of cancer, and the ninth data corresponded to samples for which no cancer was identified as a result of clinical trials. The total number of studied samples was equal to 3269. Each sample of initial data contained, as attributes, 19947 genes. The procedure of gene expression data formation at the first step assumed the application to the initial data functions of the TCGAbiolinks package in order to transform the values of gene expression in a more suitable range. Then, the non-expressed and lowly-expressed genes for all samples were removed from the data as non-informative ones. At this phase, we removed 682 non-expressed genes, and their quantity was reduced from 19947 to 19265. In the second step, the non-informative genes in terms of various statistical and entropy criteria were removed from the dataset. To identify the genes' significance based on various kinds of criteria, we have applied the Harrington desirability function. Implementation of this phase has allowed us to reduce the genes from 19265 to 14451. In the final phase, we applied the Random Forest classifier to solve the binary classification task and the Convolutional Neural Network to solve the multi-classification task with calculation classification performance criteria at each phase of this procedure implementation. The analysis of the simulation results indicates the high efficiency of the application of both the RF algorithm and the CNN for solving both binary and multiclass tasks. In all cases, the classification accuracy calculated with the use of the test dataset was very high. From 1144 samples used for the model testing, only 36 and 44 were identified incorrectly when using the complete and filtered data of gene expression profiles, respectively. The further perspectives of the authors' research are the creation of a hybrid model of gene expression data processing based on the joint use of clustering and classification techniques, where the presented in the manuscript method will use at the stage of data pre-processing. 7. References [1] H. Salem, G. Attiya, N. El-Fishawy. Classification of human cancer diseases by gene expression profiles. Applied Soft Computing, 2017, vol. 50, pp. 124-134. doi: 10.1016/j.asoc.2016.11.026 [2] F.V. Sharbaf, S. Mosafer, M.H. Moattar. A hybrid gene selection approach for microarray data classification using cellular learning automata and ant colony optimization. Genomics, 2016, vol. 107(6), pp. 231-238. doi: 10.1016/j.ygeno.2016.05.001 [3] S.A.A. Vijay, P.G. Kumar. Fuzzy expert system based on a novel hybrid stem cell (HSC) algorithm for classification of microarray data. Journal Medical Systems, 2018, vol. 42, art.no. 61. doi: 10.1007/s10916-018-0910-0 [4] N. Almugren, H. Alshamlan. A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access, 2019, vol. 7, art. no. 8736725, pp. 78533- 78548. doi: 10.1109/ACCESS.2019.2922987 [5] H. Lu, J. Chen, K. Yan, et al. A hybrid feature selection algorithm for gene expression data classification. Neurocomputing, 2017, vol. 256, pp. 56-62. doi: 10.1016/j.neucom.2016.07.080 [6] M.A. Khan, M.I.U. Lali, K. Javed, et al. An Optimized Method for Segmentation and Classification of Apple Diseases Based on Strong Correlation and Genetic Algorithm Based Feature Selection. IEEE Access, 2019, vol. 7, art. no. 8675916, pp. 46261- 46277. doi: 10.1109/ACCESS.2019.2908040 [7] M. Dashtban, M. Balafar. Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics, 2017, vol. 109(2), pp. 91-107. doi: 10.1016/j.ygeno.2017.01.004 [8] R. Aziz, S.K. Verma, N. Srivastava. A novel approach for dimension reduction of microarray. Computational Biology and Chemistry, 2017, vol. 71, pp. 161-169. doi: 10.1016/j.compbiolchem.2017.10.009 [9] P. Moradi, M. Gholampour. A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy. Applied Soft Computing, 2016, vol. 43, pp. 117-130. doi: 10.1016/j.asoc.2016.01.044 [10] I. Jain, V.K. Jain, R. Jain. Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification. Applied Soft Computing, 2018, vol. 62, pp. 203-215. doi: 10.1016/j.asoc.2017.09.038 [11] E. Pashaei, M. Ozen, N. Aydin. Gene selection and classification approach for microarray data based on random forest ranking and BBHA. In Proc. IEEE-EMBS Int. Conf. Biomed. Health Inform. (BHI), 2016, pp. 308-311. doi: 10.1109/BHI.2016.7455896 [12] S.S. Shreem, S. Abdullah, M.Z. Nazri. A Hybrid feature selection algorithm using symmetrical uncertainty and a harmony search algorithm. International Journal of Systems Science, 2016, vol. 47(6), pp. 1312-1329. doi: 10.1080/00207721.2014.924600 [13] P. Tumuluru, B. Ravi. GOA-based DBN: Grasshopper optimization algorithm-based deep belief neural networks for cancer classification. International Journal of Applied Engineering Research, 2017, vol. 12(24), pp. 14218-14231. [14] H. Djellali, S. Guessoum, N. Ghoualmi-Zine, S. Layachi. Fast correlation based filter combined with genetic algorithm and particle swarm on feature selection. In Proc. 5th Int. Conf. Elect. Eng.- Boumerdes (ICEE-B), 2017, pp. 1-6. doi: 10.1109/ICEE-B.2017.8192090 [15] M. Divate, A. Tyagi, D.J. Richard, et al. Deep Learning-Based Pan-Cancer Classification Model Reveals Tissue-of-Origin Specific Gene Expression Signatures. Cancers, 2022, vol. 14(5), art no. 1185. doi: 10.3390/cancers14051185 [16] Y. Xiao, J. Wu, Z. Lin, X. Zhao X. A deep learning-based multi-model ensemble method for cancer prediction. Computer methods and programs in biomedicine, 2018, vol. 153, pp. 1-9. doi: 10.1016/j.cmpb.2017.09.005 [17] M. Mostavi, Y.C. Chiu, Y. Huang, et al. Convolutional neural network models for cancer type prediction based on gene expression. BMC Med Genomics, 2020, vol. 13(5), art no. 44. doi: 10.1186/s12920-020-0677-2 [18] El. Resource. Available on https://www.cancer.gov/about- nci/organization/ccg/research/structural-genomics/tcga [19] A. Colaprico, T.C. Silva, C. Olsen, et al. TCGAbiolinks: An R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Research, 2016, vol. 44(8), art. no. e71. doi: 10.1093/nar/gkv1507 [20] El. Resource. Available on https://www.bioconductor.org/ [21] R. Ihaka, R. Gentleman. R: a linguage for data analysis and graphics. Journal of Computational and Graphical Statistics, 1996, vol. 5(3), pp, 299-314. doi: 10.2307/1390807 [22] El. Resource. Available on https://www.illumina.com/ [23] I. Liakh, S. Babichev, B. Durnyak, I. Gado. Formation of Subsets of Co-expressed Gene Expression Profiles Based on Joint Use of Fuzzy Inference System, Statistical Criteria and Shannon Entropy. Lecture Notes on Data Engineering and Communications Technologies, 2023, vol. 149, pp. 25-41. doi: 10.1007/978-3-031-16203-9_2 [24] S. Babichev, J. Škvor. Technique of Gene Expression Profiles Extraction Based on the Complex Use of Clustering and Classification Methods. Diagnostics, 2020, vol. 10 (8), art. no. 584. doi: : 10.3390/diagnostics10080584 [25] S. Babichev, J. Krejci, J. Bicanek, V. Lytvynenko, V. Gene expression sequences clustering based on the internal and external clustering quality criteria. Proceedings of the 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2017, 2017, vol. 1, art. no. 8098744, pp. 91-94. doi: 10.1109/STC-CSIT.2017.8098744 [26] L. Yasinska-Damri, S. Babichev, B. Durnyak, T. Goncharenko. Application of Convolutional Neural Network for Gene Expression Data Classification. Lecture Notes on Data Engineering and Communications Technologies, 2023, vol. 149, pp. 3-24. doi: 10.1007/978-3-031-16203-9_1