Evaluation of the Gene Expression Profiles Complex Proximity Metric Effectiveness Based on a Hybrid Technique of Gene Expression Data Extraction

Lyudmyla Yasinska-Damri a, Igor Liakh b, Sergii Babichev c and Bohdan Durnyak a

a Ukrainian Academy of Printing, Pid Goloskom street, 19, Lviv, 79000, Ukraine
b Uzhhorod National University, University street, 14, Uzhhorod, 88000, Ukraine
c Kherson State University, University street, 27, Kherson, 73000, Ukraine

Abstract
Processing gene expression data in order to develop systems for diagnosing complex diseases and/or reconstructing gene regulatory networks (GRN) is one of the current directions of modern bioinformatics. An important stage in solving this problem is the extraction of mutually correlated gene expression profiles (GEP) with respect to the proximity metric used. In this research, we evaluate a complex GEP proximity metric calculated as a combination of the modified mutual information criterion and Pearson's chi-squared test, using the OPTICS clustering algorithm implemented according to the principles of the objective clustering inductive technique (OCIT). The classification accuracy of the examined objects was used as the main criterion to assess the effectiveness of the applied method. The simulation results have shown that the proposed technique allows us to form an optimal GEP cluster structure in terms of the maximum values of the classification accuracy criterion.

Keywords
Gene expression profiles, proximity metrics, OPTICS clustering algorithm, gene expression profiles classification, inductive methods of objective clustering, clustering quality criteria, classification accuracy

1. Introduction and literature review

The development of models for disease diagnostics and/or gene regulatory network (GRN) reconstruction using gene expression data (GED) is one of the current directions of modern bioinformatics.
As a rule, the initial GED is formed as a high-dimensional array whose components represent the studied patterns and genes. The gene expression value depends on the amount of the given type of gene, which determines the corresponding properties of the examined biological organism. A gene expression profile (GEP) is the vector of expression values of one gene evaluated for the examined patterns. Reconstructing a gene regulatory network (GRN) that adequately reflects the nature of gene interactions under different states of a biological organism, in order to develop effective medicines and methods of disease diagnosis and treatment, is possible only if groups of highly and mutually expressed genes are extracted. For this reason, gene expression data pre-processing is very important at the early stage of GRN formation or during the development of a disease diagnosis model. Figure 1 illustrates a stepwise procedure for implementing this process. The filtration procedure, in this case, involves removing genes with zero expression at the first step and genes with low expression, in terms of an empirically established threshold, at the second step.

IDDM-2021: 4th International Conference on Informatics & Data-Driven Medicine, November 19–21, 2021, Valencia, Spain
EMAIL: Lm.yasinska@gmail.com (L. Yasinska-Damri); ihor.lyah@uzhnu.edu.ua (I. Liakh); sbabichev@ksu.ks.ua (S. Babichev); durnyak@uad.lviv.ua (B. Durnyak)
ORCID: 0000-0002-8629-8658 (L. Yasinska-Damri); 0000-0001-5417-9403 (I. Liakh); 0000-0001-6797-1467 (S. Babichev); 0000-0003-1526-9005 (B. Durnyak)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Moreover, the data can contain gene expression profiles that are statistically significantly different from the GEP of the main group.
It is obvious that such genes do not correlate with the profiles of other genes, so they can also be removed from the data. Careful implementation of this stage significantly reduces the number of genes for further research, which in turn improves the quality of the subsequent GED processing steps for the problem described below.

Figure 1: Block chart of a step-by-step procedure of GED processing to form clusters of highly and mutually expressed GEP

In [1], the authors presented the "limma" module (Linear Models for Microarray and RNA-Seq Data), which contains various functions for generating, filtering and interpreting gene expression data obtained using both DNA microchip experiments and the mRNA molecule sequencing method. This module is to some extent an alternative to the "Bioconductor" package, implemented in the R software for data mining and machine learning [2], and it is based on the use of linear models to identify differentially expressed genes in a multifactor experiment. This module also contains functions for gene ontology analysis, which is very important for adequate GRN reconstruction, because interpreting genes and their interactions through the analysis of conceptual interconnections makes it possible to identify target genes and to establish the nature of the interconnections between target and other genes for the disease in question. The papers [3-5] considered various tools and techniques of GED filtering available in the "Bioconductor" package, using quantitative quality criteria for GED obtained by the DNA microarray method [3,4] and by mRNA molecule sequencing [5]. As a result of the simulation, the authors proposed a stepwise algorithm for extracting highly and mutually expressed gene expression profiles for their further grouping into clusters. In the review [6], the authors conducted a comparative analysis of current software for processing GED in order to extract the most informative genes.
Their analysis supports the feasibility of using the R software for GEP processing in order to form clusters of highly and mutually expressed genes, because this software contains all the modules and functions needed to process gene expression data for the task at hand. The review [7] presents research results focused on various hybrid techniques for extracting clusters of mutually correlated GEP in order to create cancer diagnostic systems. In the reviewed works, various combinations of filtering, clustering and classification techniques were applied, using various types of statistical criteria and gene expression profile proximity metrics. The classification accuracy of the examined objects was used as the principal quality metric to assess the effectiveness of the respective hybrid model. The following filtration techniques and methods for estimating the proximity of gene expression profiles were analyzed in this review: mutual information maximization method [8], Pearson's χ² test [9], correlation-based feature selection technique [10], Laplacian and Fisher score [11], information gain method [12], Fisher criterion [13], independent component analysis [14], maximum relevance minimum redundancy [15], probabilistic random function [16], random forest ranking [17], Fisher-Markov selector [18], symmetrical uncertainty [19] and logarithmic transformation [20] method. However, we would like to note that in the analyzed research, high classification accuracy is in most cases achieved with a low number of extracted GEP. Moreover, the parameters of the techniques used in the respective hybrid models are set empirically during the simulation process. This is one of the main disadvantages of the analyzed models. The works [21,22] present a partial solution to this problem.
A stepwise procedure of GEP extraction based on the joint application of Shannon entropy, statistical criteria, a clustering technique based on the SOTA clustering algorithm and a random forest binary classifier was developed in these papers. The suitable algorithm parameters, with respect to the classification accuracy, were set a priori according to the OCIT principles. However, only the correlation proximity metric was used within the framework of the authors' research. Thus, the brief review presented above allows us to conclude that an effective model of GEP extraction based on the joint application of various proximity metrics, clustering and classification techniques is currently absent. This problem can be solved through the joint application of various techniques that are used successfully in current data science research [23-26]. In this work, we consider a hybrid GEP proximity metric calculated as a combination of the modified mutual information maximization method and Pearson's χ² test. The modified mutual information maximization method, in this instance, takes into account various methods of Shannon entropy evaluation. The objective of the research is the development and evaluation of a hybrid model of GEP extraction based on the joint application of the hybrid proximity metric, the OPTICS clustering algorithm implemented using the principles of OCIT, and a random forest binary classifier.

2. Materials and methods

In the general case, the internal clustering quality criterion should consider both the allocation of gene expression profiles inside clusters and the allocation of the clusters' medians relative to each other. Thus, this criterion should be complex and contain two components.
If we assume that K is the number of clusters, then the first component of this criterion can be calculated in the following way:

$$QC_W = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{i=1}^{N_k} d(e_i, C_k) \qquad (1)$$

where $N_k$ and $C_k$ are the number of GEP in the k-th cluster and the median of the k-th cluster respectively; $d(e_i, C_k)$ is the distance between the i-th profile and the median of its cluster, calculated using the complex proximity metric that combines the modified mutual information maximization method (considering various methods of Shannon entropy calculation) and Pearson's χ² test, the effectiveness of which is shown in [27].

The second component of the internal criterion can be assessed as the average distance between the medians of the allocated clusters:

$$QC_B = \frac{2}{K(K-1)}\sum_{i=1}^{K-1}\sum_{j=i+1}^{K} d(C_i, C_j) \qquad (2)$$

In [21], the authors performed modelling to assess the performance of different types of internal criteria containing (1) and (2) as components. As a result, a hybrid internal criterion formed as a ratio of the Calinski-Harabasz criterion and the WB index has been proposed:

$$QC^{int} = \frac{K(K-1)\,QC_W^{2}}{(N-K)\,QC_B^{2}} \qquad (3)$$

where N is the number of objects to be grouped. This criterion was used as the internal one during the modelling.

The efficiency of both the hybrid GEP proximity metric and the quality criteria for grouping profiles into clusters was assessed using the density-based clustering algorithm OPTICS [28], which is a logical development of the DBSCAN density algorithm and allows us to form a multicluster structure based on the respective proximity metric.
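As an illustration of formulas (1)-(3), the following minimal Python sketch computes the two components and the hybrid internal criterion for a given partition. It rests on several assumptions that are not taken from the paper: a Euclidean distance stands in for the hybrid mutual information / χ² metric, the cluster median is taken component-wise, the exponents in formula (3) are read as squares, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def euclidean(a, b):
    """Stand-in distance (assumption); the paper uses a hybrid MI / chi-squared metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_median(profiles):
    """Component-wise median of the profiles in one cluster (assumed definition of C_k)."""
    return [median(column) for column in zip(*profiles)]


def internal_criterion(clusters, dist=euclidean):
    """Return (QC_W, QC_B, QC_int) for a list of clusters, each a list of profiles."""
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least two clusters and more profiles than clusters")
    medians = [cluster_median(c) for c in clusters]
    # Formula (1): mean over clusters of the mean profile-to-median distance.
    qcw = sum(sum(dist(e, m) for e in c) / len(c)
              for c, m in zip(clusters, medians)) / k
    # Formula (2): average distance over all pairs of cluster medians.
    qcb = 2.0 / (k * (k - 1)) * sum(dist(a, b) for a, b in combinations(medians, 2))
    # Formula (3): hybrid internal criterion (exponents read as squares).
    qc_int = (k * (k - 1) * qcw ** 2) / ((n - k) * qcb ** 2)
    return qcw, qcb, qc_int


# Toy usage: two clusters of two 2-dimensional profiles each.
toy_clusters = [[[0, 0], [0, 2]], [[4, 0], [4, 2]]]
print(internal_criterion(toy_clusters))  # (1.0, 4.0, 0.0625)
```

Passing the distance as a parameter keeps the sketch independent of the metric, so the hybrid proximity metric could be substituted for `euclidean` without changing the criterion code.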
The feasibility of using the OPTICS clustering algorithm is determined by the fact that it allows us not only to form a multicluster structure containing clusters of gene expression profiles that are close in the density of their allocation in feature space, but also to identify profiles as noise when the density of their allocation relative to other GEP is much lower than the density of the main groups of GEP. We would like to note that the criterion calculated by formulas (1)-(3) does not always allow us to objectively form an adequate clustering due to the reproducibility error, which is inherent in most prevailing clustering algorithms. In other words, satisfactory results of data grouping obtained using one dataset are not always repeated when applying another similar dataset. In [29], the authors proposed the idea of reducing the reproducibility error by using "fresh data" (not used when creating the model) to verify the obtained model of object distribution into clusters, and of making the final decision regarding the cluster structure by jointly using the internal, external and balance criteria, the latter considering possible discrepancies between the internal and external criteria. This idea was further developed in [30,31], where the objective clustering inductive technology was described and implemented. The authors proposed an external quality criterion assessed as the normalized difference of the internal criteria calculated on two equivalent subsets (containing the same number of pairwise similar objects) at the appropriate hierarchical level of cluster structure formation:

$$QC^{ext} = \frac{\left|QC_{1}^{int} - QC_{2}^{int}\right|}{QC_{1}^{int} + QC_{2}^{int}} \qquad (4)$$

The main idea was as follows. The minimal reproducibility error corresponds to the maximum degree of similarity between the allocations of objects into clusters obtained on the two equivalent subsets.
Since the internal criteria consider both the distribution of patterns within clusters and the allocation of the clusters' medians relative to each other, objective clustering (minimum reproducibility error) in this case corresponds to the minimal difference between the corresponding values of the internal criteria. The normalizing correction in formula (4) restricts the external criterion to the range from 0 (zero reproducibility error) to 1 (maximum error). The balance criterion was calculated using the Harrington desirability function according to the algorithm described in detail in [30,31].

The random forest classifier was used at the classification step. This choice follows from the authors' previous research presented in [21], where various types of binary classifiers were studied for classifying samples of patients examined for lung cancer. Those samples also contained gene expression data as attributes. The effectiveness of the respective model was assessed using the classification accuracy of the examined samples.

Figure 2 shows a block chart of the stepwise procedure performed during the modelling. The practical implementation of this algorithm involves the following stages:

Stage I. Formation of GEP data and functions to calculate the respective criteria.
1.1. Formation of an array of GED whose components represent the assessed patterns and the genes, where the expression of a gene reflects the relative amount of the given type of gene for the examined patterns.
1.2. Formation of the function to estimate the proximity between GEP on the basis of the joint application of the modified mutual information maximization proximity metric and Pearson's χ² test [27].
1.3. Formation of the functions to calculate the internal, external and hybrid balance quality criteria.
1.4. Formation of the function to calculate the classification accuracy of the examined samples.
1.5.
Formation of two equivalent subsets of GEP by iteratively distributing the two nearest GEP, according to the hybrid proximity metric, between the two subsets.

Stage II. Setup of the density-based OPTICS clustering algorithm.
2.1. Setup of the range for changing the minimum number of points within the ε-neighborhood: MinPts_min, MinPts_max.
2.2. Creating a reachability chart. Setup of both the range and the step of variation of the ε-neighborhood values: Eps_min, Eps_max, dEps.
2.3. Calculation of the distances between all pairs of gene expression profiles in the equal-power subsets and formation of the matrices of distances between the corresponding profiles. The obtained distance matrices are used as input data when the clustering procedure is implemented by applying the OPTICS algorithm.

Figure 2: Structural block chart of the algorithm for forming a multicluster structure based on the OPTICS algorithm implemented using the principles of OCIT

Stage III. Stepwise clustering of GEP within the specified ranges of variation of the algorithm parameters.
3.1. MinPts value initialization: k = MinPts_min.
3.2. Eps value initialization: e = Eps_min.
3.3. Clustering of the gene expression profiles contained in the equivalent subsets, forming partitions with the numbers of clusters K1 and K2.
3.4. If K1 = K2 > 2, calculation of the internal and external quality criteria by formulas (1)-(4). Otherwise, increase the value of the Eps parameter (e = e + dEps) and go to step 3.3 of this procedure.
3.5. Classification of the objects that contain the gene expression profiles in each of the allocated clusters. Calculation of the classification quality criterion (Accuracy).
3.6. If e ≤ Eps_max, go to step 3.3 of this procedure. Otherwise, calculate the hybrid balance criterion and increase the MinPts value by one: k = k + 1.
3.7. If k ≤ MinPts_max, go to step 3.2 of this procedure.
Otherwise, create charts of the clustering and classification quality criteria as functions of the Eps value for each of the MinPts values.

Stage IV. Analysis of the obtained results.
4.1. Analysis of the obtained charts. Forming conclusions regarding the effectiveness of the hybrid GEP proximity metric in the process of forming subsets of informative genes for their further use in creating disease diagnosis systems and/or in GRN reconstruction.

3. Experiment, results and discussion

The practical implementation of the proposed algorithm was carried out using the GSE19188 gene expression dataset of patients studied for the early stage of lung cancer [32]. The data were obtained in a DNA microchip experiment and contained 156 microchips: 65 of them contained the GED of healthy patients and 91 contained the GED of patients with a lung cancer tumor (mild form). The 400 most informative GEP in terms of classification accuracy (approximately 93%) [20,21] were used during the simulation. The MinPts value was varied from 3 to 5. This interval was established empirically. The results of the modelling showed that a larger number of points within the Eps neighborhood degrades the simulation results in terms of the number of clusters in the equal-power subsets, the clustering quality criteria of the gene expression profiles, and the classification accuracy of the samples. The Eps values were varied from the minimum distance between gene expression profiles in the equal-power subsets to 1.5 times this minimum distance. This range was also set empirically. When the Eps value was larger, the GEP were allocated into two clusters, and the clustering results were repeated. The resulting range of Eps values was divided into 20 equal sections. The width of a section was equal to the step of the Eps value.
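The parameter sweep of Stage III and the Eps range described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `cluster_fn` and `internal_fn` are hypothetical callables standing for an OPTICS run on one subset and for the internal criterion of formula (3), and the classification and balance steps are omitted. The value 0.618e-3 used in the usage example is inferred from the Eps values listed in Tables 1 and 2 (it reproduces them with a step of 0.01545e-3); it is not stated in the text.

```python
def eps_grid(min_dist, factor=1.5, sections=20):
    """Eps values from min_dist to factor * min_dist, split into equal sections."""
    step = (factor - 1.0) * min_dist / sections
    return [min_dist + i * step for i in range(sections + 1)]


def external_criterion(qc1, qc2):
    """Formula (4): normalized difference of the internal criteria on two subsets."""
    return abs(qc1 - qc2) / (qc1 + qc2)


def sweep(subset_a, subset_b, cluster_fn, internal_fn, min_dist, minpts_range=(3, 5)):
    """Collect (MinPts, Eps, K, QC_ext) for every setting where K1 = K2 > 2.

    cluster_fn(subset, min_pts, eps) -> list of clusters (hypothetical callable)
    internal_fn(clusters) -> value of the internal criterion (hypothetical callable)
    """
    results = []
    for min_pts in range(minpts_range[0], minpts_range[1] + 1):
        for eps in eps_grid(min_dist):
            clusters_a = cluster_fn(subset_a, min_pts, eps)
            clusters_b = cluster_fn(subset_b, min_pts, eps)
            # Step 3.4: criteria are computed only for equal cluster counts above two.
            if len(clusters_a) != len(clusters_b) or len(clusters_a) <= 2:
                continue
            qc_ext = external_criterion(internal_fn(clusters_a), internal_fn(clusters_b))
            results.append((min_pts, eps, len(clusters_a), qc_ext))
    return results


# Usage: the inferred minimum distance reproduces the tabulated Eps values.
grid = eps_grid(0.618e-3)
print(len(grid), round(grid[3] * 1e3, 5), round(grid[-1] * 1e3, 5))
```

Keeping the clustering and criterion functions as parameters makes the loop reusable with any density-based implementation that returns clusters for a given pair of MinPts and Eps values.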
According to the algorithm presented above, the clustering and classification quality criteria were calculated only for the cases where the numbers of clusters allocated on the equal-power subsets were equal. This condition minimizes the reproducibility error. Tables 1 and 2 and Figures 3 and 4 present the modelling results.

Table 1
The result of the division of GEP into clusters when MinPts = 3

Eps, ×10^-3    Number of GEP in cluster
               1     2     3     4     5     6
0.66435        24    311   6     14    6     7
0.69525        24    322   6     14    6     –
0.71070        24    323   6     14    6     –
0.72615        24    329   6     14    6     –
0.74160        24    332   6     14    6     –
0.91155        24    359   6     –     –     –
0.92700        24    359   6     –     –     –

Table 2
The result of the division of GEP into clusters when MinPts = 4 and 5

MinPts = 4                                 MinPts = 5
Eps, ×10^-3    Clusters                    Eps, ×10^-3    Clusters
               1     2     3     4                        1     2     3     4
0.64890        24    158   115   11        0.64890        23    155   115   11
0.69525        24    322   14    –         0.66435        24    308   14    –
0.71070        24    323   14    –         0.67980        24    314   14    –
0.72615        24    326   14    –         0.69525        24    321   14    –
0.74160        24    331   14    –         0.71070        24    322   14    –
–              –     –     –     –         0.72615        24    326   14    –
–              –     –     –     –         0.74160        24    331   14    –

Figure 3: The simulation results regarding the criterial analysis of the cluster structure using the OPTICS algorithm implemented on the basis of OCIT: distribution of the internal criteria assessed on the first (a) and second (b) equivalent subsets of GEP; external (c) and hybrid balance criterion (d) when the Eps and MinPts values are varied from minimum to maximum values

The analysis of the obtained results supports the feasibility of using the proposed GEP proximity metric for the selection of mutually correlated profiles when a multicluster structure is formed by applying the OPTICS clustering algorithm. The proposed method creates the conditions for choosing suitable algorithm parameters in terms of the optimal grouping of GEP into clusters on the one hand, and the minimum value of the reproducibility error on the other hand.
As can be seen from Tables 1 and 2, when the MinPts parameter value is 3, there are seven cluster structures. The first clustering contains six clusters, the next four clusterings contain five clusters each, and the last two clusterings contain three clusters each. When the MinPts value is 4 or 5, the first clustering contains four clusters, and in the other cases three clusters were obtained in each clustering. It should be noted that the initial data contained approximately 400 gene expression profiles that were carefully selected by stepwise application of the SOTA clustering algorithm [20,21]. The accuracy of sample classification when the full set of gene expression profiles was used as attributes was approximately 93%. The analysis of the results also shows that in all cases some of the gene expression profiles are identified as noise. These genes are not contained in any cluster. The presence of "noise" genes can be explained by the fact that the density of these GEP in terms of the used proximity metric is less than the conditional boundary value assessed by the OPTICS clustering algorithm.

The analysis of the charts presented in Figure 3 has also shown that the internal and external criteria are not optimal for choosing suitable parameters of the OPTICS algorithm, because the minimum values of these criteria do not correspond to the maximum values of the object classification accuracy in the corresponding clusters. The maximum value of the hybrid balance criterion, which contains both the internal and external criteria as components, is achieved for a three-cluster structure with the following parameters of the OPTICS algorithm: MinPts = 3, Eps = 0.00091155 or Eps = 0.00092700 (the same results are achieved in both instances). The results of the classification of the objects contained in the corresponding clusters, presented in Figure 4, confirm the above conclusions.
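The hybrid balance criterion discussed above is built on the Harrington desirability function; the exact procedure is given in [30,31] and is not reproduced in this paper. The sketch below is therefore only an assumed, commonly used form: each criterion is mapped linearly to a dimensionless parameter Y so that its best (smallest) value receives desirability 0.9 and its worst value 0.1, the desirability is d = exp(-exp(-Y)), and the two desirabilities are combined by a geometric mean. The anchor values 0.9 and 0.1, the linear mapping, and the function names are assumptions.

```python
import math


def desirability(value, best, worst):
    """Harrington desirability d = exp(-exp(-Y)) for a criterion to be minimized.

    Assumption: Y depends linearly on the criterion, with `best` mapped to
    d = 0.9 and `worst` mapped to d = 0.1.
    """
    if best == worst:
        raise ValueError("best and worst values must differ")
    y_best = -math.log(-math.log(0.9))
    y_worst = -math.log(-math.log(0.1))
    slope = (y_worst - y_best) / (worst - best)
    y = y_best + slope * (value - best)
    return math.exp(-math.exp(-y))


def balance_criterion(qc_int, qc_ext, int_range, ext_range):
    """Geometric mean of the two desirabilities (assumed combination rule).

    int_range and ext_range are (best, worst) pairs, for example the minimum and
    maximum of each criterion observed over the parameter sweep.
    """
    d_int = desirability(qc_int, *int_range)
    d_ext = desirability(qc_ext, *ext_range)
    return math.sqrt(d_int * d_ext)


# Usage with made-up criterion values: lower criteria give a higher balance value.
print(balance_criterion(0.2, 0.1, (0.0, 1.0), (0.0, 1.0)) >
      balance_criterion(0.8, 0.6, (0.0, 1.0), (0.0, 1.0)))  # True
```

With this form, the balance criterion is maximal when both the internal and external criteria are close to their best values, which matches the way its maximum is used above to select MinPts and Eps.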
As can be seen from the charts, with these parameters of the algorithm the classification accuracy is maximal for the first two clusters, while the second cluster contains the largest number of genes, i.e. it is the main cluster in terms of the number of gene expression profiles. The third cluster contains only six genes. The classification results in the fourth, fifth and sixth clusters are not adequate because they are the same in all cases and slightly worse than the classification results in the first three clusters. It should be noted that the maximum values of the hybrid balance criterion, which determines the quality of gene expression profile clustering, correspond to the maximum values of the classification accuracy of the samples that contain the extracted gene expression profiles as attributes. This fact indicates the high efficiency of the proposed hybrid proximity metric and of the technique used to assess the quality of GEP clustering.

Figure 4: The results of the simulation regarding the classification accuracy of the objects whose attributes are the gene expression profiles allocated to clusters using the OPTICS algorithm: a) the first cluster; b) the second cluster; c) the third cluster

4. Conclusions

This paper has described a hybrid model of GEP cluster formation for extracting groups of mutually similar GEP, in terms of the applied proximity metric, based on the OPTICS clustering algorithm implemented according to the OCIT principles. A hybrid proximity metric was applied to assess the distance between GEP during the simulation. This metric was calculated on the basis of the joint application of the modified mutual information maximization metric (considering various methods of Shannon entropy evaluation) and Pearson's χ² test. The effectiveness of this hybrid proximity metric has been shown in [27].
The structural block chart of the stepwise algorithm for setting suitable parameters of the OPTICS algorithm in terms of a hybrid balance clustering quality criterion, which contains the internal and external clustering quality criteria as components, has been presented. The high efficiency of the proposed model has been confirmed by the agreement between the quality criteria for clustering gene expression profiles and the classification of the objects that contain these GEP as attributes. The analysis of the simulation results has indicated that the internal and external clustering quality criteria alone do not allow us to determine the optimal parameters of the OPTICS algorithm. The minimal values of these criteria do not correspond to the maximum values of the object classification accuracy in the corresponding clusters. The maximal value of the hybrid balance criterion, which is formed from both the internal and external criteria, has been achieved for a three-cluster structure with the following parameters of the OPTICS algorithm: MinPts = 3, Eps = 0.00091155 or Eps = 0.00092700 (the same results are achieved in both instances). The analysis of the object classification results has confirmed the high effectiveness of the proposed technique, since the classification accuracy is maximal for the first two clusters, while the second cluster contains the largest number of genes, i.e. it is the main cluster in terms of the number of gene expression profiles. The third cluster contains only six genes. The fourth, fifth and sixth clusters contained the same number of gene expression profiles in all cases. Additionally, the classification accuracy in these clusters is slightly worse than in the first three clusters. It should be noted that the maximum values of the hybrid balance criterion, which determines the quality of GEP clustering, correspond to the maximum values of the classification accuracy of the samples that contain the extracted gene expression profiles as attributes.
This fact indicates the high efficiency of the proposed hybrid proximity metric and of the model used to assess the quality of GEP clustering. However, we would like to note that the proposed proximity metric is appropriate for high-dimensional gene expression profiles. For other types of data, it is necessary to investigate metrics that are more suitable for those data. This is a limitation of the proposed model. The further perspective of the authors' research is the application of the proposed hybrid proximity metric within hybrid clustering and classification techniques for gene expression profiles implemented on the basis of other clustering and classification algorithms.

5. References

[1] M.E. Ritchie, B. Phipson, D. Wu, et al. limma powers diff. express. analysis for RNA-sequencing and microarray studies. Nucl. Acids Res., 2015, vol. 43(7), art. no. e47. doi: 10.1093/nar/gkv007
[2] R. Ihaka, R. Gentleman. R: a lang. for data analysis and graphics. J. of Comp. and Graph. Statistics, 1996, vol. 5(3), pp. 299-314. doi: 10.2307/1390807
[3] S. Babichev, A. Kornelyuk, et al. Computat. analysis of microarray GEP of lung cancer. Biopolymers and Cell, 2016, vol. 32(1), pp. 70-79. doi: 10.7124/bc.00090F
[4] S. Babichev, B. Durnyak, et al. Techniques of DNA microarray data pre-processing based on the complex use of Bioconductor tools and Shannon entropy. CEUR Workshop Proceedings, 2019, vol. 2353, pp. 365-377.
[5] S. Babichev, B. Durnyak, V. Senkivskyy, et al. Exploratory analysis of neuroblast. data genes expr. based on Bioconductor package tools. CEUR Workshop Proceedings, 2019, vol. 2488, pp. 268-279.
[6] C.S. Tan, W.S. Ting, M.S. Mohamad, et al. A Review of Feature Extraction Soft. for Microarray Gene Expr. Data. BioMed Res. Int., 2014, vol. 2014, art. no. 213656. doi: 10.1155/2014/213656
[7] N. Almugren, H. Alshamlan. A survey on hybrid feature selection meth. in microarray gene express. data for cancer classific. IEEE Access, 2019, vol. 7, art. no. 8736725, pp. 78533-78548. doi: 10.1109/ACCESS.2019.2922987
[8] H. Lu, J. Chen, K. Yan, et al. A hybrid feature selection algorithm for GED classification. Neurocomp., 2017, vol. 256, pp. 56-62. doi: 10.1016/j.neucom.2016.07.080
[9] C.P. Lee, Y. Leu. A novel hybrid feature selection method for microarray data analysis. Appl. Soft Comput., 2011, vol. 11(1), pp. 208-213. doi: 10.1016/j.asoc.2009.11.010
[10] L.Y. Chuang, C.H. Yang, K.C. Wu, C.H. Yang. A hybrid feature selection method for DNA microarray data. Comput. Biol. Med., 2011, vol. 41(4), pp. 228-237. doi: 10.1016/j.compbiomed.2011.02.004
[11] M.A. Valizade Hasanloei, R. Sheikhpour, et al. A combined Fisher and Laplacian score for feature selection in QSAR based drug design using compounds with known and unknown activities. J. Comput. Aided Mol. Des., 2018, vol. 32(2), pp. 375-384. doi: 10.1007/s10822-017-0094-6
[12] J.R. Quinlan. Induction of decision trees. Mach. Learn., 1986, vol. 1, pp. 81-106. doi: 10.1007/BF00116251
[13] L. Xiaowei, J. Chenglin, et al. A Fisher's Criterion-Based Linear Discriminant Analysis for Predicting the Critical Values of Coal and Gas Outbursts Using the Initial Gas Flow in a Borehole. Mathematical Problems in Engineering, 2017, vol. 2017, art. no. 7189803. doi: 10.1155/2017/7189803
[14] A. Hyvärinen. Independent component analysis: recent advances. Phil. Trans. R. Soc., 2013, vol. 371, art. no. 20110534. doi: 10.1098/rsta.2011.0534
[15] H. Alshamlan, G. Badr, et al. mRMR-ABC: A hybrid gene selection algorithm for cancer classification using microarray GEP. Biomed. Res. Intern., 2018, vol. 2015, art. no. 604910. doi: 10.1155/2015/604910
[16] P. Moradi, M. Gholampour. A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy. Applied Soft Computing, 2016, vol. 43, pp. 117-130. doi: 10.1016/j.asoc.2016.01.044
[17] E. Pashaei, M. Ozen, N. Aydin. Gene selection and classification approach for microarray data based on random forest ranking and BBHA. In Proc. IEEE-EMBS Int. Conf. Biomed. Health Inform. (BHI), 2016, pp. 308-311. doi: 10.1109/BHI.2016.7455896
[18] X. Li, M. Yin. Multiobjective binary biogeography based optimization for feature selection using gene expression data. IEEE Trans. Nanobiosci., 2013, vol. 12(4), pp. 343-353. doi: 10.1109/TNB.2013.2294716
[19] S.S. Shreem, S. Abdullah, M.Z.A. Nazri. Hybrid feature selection algorithm using symmetrical uncertainty and a harmony search algorithm. Int. J. Syst. Sci., 2016, vol. 47(6), pp. 1312-1329. doi: 10.1080/00207721.2014.924600
[20] P. Tumuluru, B. Ravi. GOA-based DBN: Grasshopper optimization algorithm-based deep belief neural networks for cancer classification. Int. J. Appl. Eng. Res., 2017, vol. 12(24), pp. 14218-14231.
[21] S. Babichev, J. Škvor. Technique of Gene Expression Profiles Extraction Based on the Complex Use of Clustering and Classification Methods. Diagnostics, 2020, vol. 10(8), art. no. 584. doi: 10.3390/diagnostics10080584
[22] S. Babichev, V. Lytvynenko, et al. Information Technology of Gene Expression Profiles Processing for Purpose of Gene Regulatory Networks Reconstruction. Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining and Processing, DSMP 2018, 2018, art. no. 8478452, pp. 336-341. doi: 10.1109/DSMP.2018.8478452
[23] I. Izonin, R. Tkachenko, V. Verhun, et al. An approach towards missing data management using improved grnn-sgtm ensemble method. International Journal Engineering Science and Technology, 2020, p. in press. doi: 10.1016/j.jestch.2020.10.005
[24] V. Lytvyn, T. Salo, V. Vysotska, et al. Identifying textual content based on thematic analysis of similar texts in big data. In: IEEE 2019 14th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2019 – Proceedings, 2019, vol. 2, pp. 84-91. doi: 10.1109/STC-CSIT.2019.8929808
[25] A. Rzheuskyi, O. Kutyuk, V. Vysotska, et al. The architecture of distant competencies analyzing system for it recruitment. In: IEEE 2019 14th Int. Sc. and Techn. Conf. on Comp. Sc. and Inf. Techn., 2019, vol. 3, pp. 254-261. doi: 10.1109/STC-CSIT.2019.8929762
[26] R. Tkachenko, I. Izonin, N. Kryvinska, et al. An approach towards increasing prediction accuracy for the recovery of missing iot data based on the grnn-sgtm ensemble. Sensors (Switzerland), 2020, vol. 20(9), art. no. 2625. doi: 10.3390/s20092625
[27] S. Babichev, L. Yasinska-Damri, I. Liakh, B. Durnyak. Comparison Analysis of Gene Expression Profiles Proximity Metrics. Symmetry, 2021, vol. 13(10), art. no. 1812. doi: 10.3390/sym13101812
[28] M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander. OPTICS: Ordering Points to Identify the Clustering Structure. SIGMOD Record (ACM Special Interest Group on Management of Data), 1999, vol. 28(2), pp. 49-60. doi: 10.1145/304181.304187
[29] H.R. Madala, A.G. Ivakhnenko. Inductive Learning Algorithms for Complex Systems Modeling. CRC Press, 1994, 365 p.
[30] S. Babichev, B. Durnyak, et al. Application of Optics Density-Based Clustering Algorithm Using Inductive Methods of Complex System Analysis. International Scientific and Technical Conference on Computer Sciences and Information Technologies, 2019, vol. 1, art. no. 8929869, pp. 169-172. doi: 10.1109/STC-CSIT.2019.8929869
[31] S. Babichev, V. Lytvynenko, M.A. Taif. Estimation of the inductive model of objective clustering stability based on k-means algorithm for different level of data noise. Radio Electronics, Comp. Science, Control, 2016, vol. 4, pp. 54-60. doi: 10.15588/1607-3274-2016-4-7
[32] J. Hou, J. Aerts, B. den Hamer, et al. Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS ONE, 2010, vol. 5, art. no. e10312. doi: 10.1371/journal.pone.0010312