Evaluation of the Gene Expression Profiles Complex Proximity Metric Effectiveness Based on a Hybrid Technique of Gene Expression Data Extraction

Lyudmyla Yasinska-Damri (ORCID 0000-0002-8629-8658), Ukrainian Academy of Printing, Pid Goloskom street, 19, 79000 Lviv, Ukraine

Igor Liakh (ORCID 0000-0001-5417-9403), Uzhhorod National University, University street, 14, 88000 Uzhhorod, Ukraine

Sergii Babichev (sbabichev@ksu.ks.ua, ORCID 0000-0001-6797-1467), Kherson State University, University street, 27, 73000 Kherson, Ukraine

Bohdan Durnyak (durnyak@uad.lviv.ua, ORCID 0000-0003-1526-9005), Ukrainian Academy of Printing, Pid Goloskom street, 19, 79000 Lviv, Ukraine

Keywords: gene expression profiles, proximity metrics, OPTICS clustering algorithm, gene expression profiles classification, inductive methods of objective clustering, clustering quality criteria, classification accuracy

Gene expression data processing for the development of complex disease diagnostic systems and/or the reconstruction of gene regulatory networks (GRN) is one of the current directions of modern bioinformatics. An important stage in solving this problem is the extraction of mutually correlated gene expression profiles (GEP) with respect to the proximity metric used. In this research, we evaluate a complex GEP proximity metric, calculated as a combination of the modified mutual information criterion and Pearson's chi-squared test, using the OPTICS clustering algorithm implemented according to the principles of the objective clustering inductive technique (OCIT). The classification accuracy of the examined objects was used as the main criterion to assess the effectiveness of the applied method. The simulation results have shown that the proposed technique allows us to form an optimal GEP cluster structure in terms of the maximum values of the classification accuracy criterion.

Introduction and literature review

The development of models for disease diagnostics and/or gene regulatory network (GRN) reconstruction using gene expression data (GED) is one of the current directions of modern bioinformatics. As a rule, the initial GED is formed as a high-dimensional array whose two dimensions represent the studied patterns and the genes. The gene expression value depends on the amount of the given type of gene, which determines the corresponding properties of the examined biological organism. A gene expression profile (GEP) is the vector of expression values of a gene evaluated over the examined patterns.

Reconstructing a gene regulatory network that adequately reflects the nature of gene interactions under different states of a biological organism, in order to develop effective medicines as well as disease diagnosis and treatment methods, is possible provided that groups of highly and mutually expressed genes are extracted. For this reason, gene expression data preprocessing is very important at the early stage of GRN formation or of the development of a disease diagnosis model. Figure 1 illustrates a stepwise procedure for implementing this process. The filtration procedure, in this case, involves removing genes with zero expression at the first step and genes with low expression, in terms of an empirically established threshold, at the second step.
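The two filtration steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `filter_genes`, the reading of "zero expression" as zero in every sample, and the use of the mean expression against the threshold are all assumptions made for illustration.

```python
def filter_genes(expr, low_threshold):
    """Two-step filtration sketch.

    expr maps a gene name to its list of expression values over the samples.
    Step 1 (assumption): drop genes whose expression is zero in every sample.
    Step 2 (assumption): drop genes whose mean expression is below low_threshold.
    """
    kept = {}
    for gene, values in expr.items():
        if all(v == 0 for v in values):
            continue  # step 1: zero expression
        if sum(values) / len(values) < low_threshold:
            continue  # step 2: low expression
        kept[gene] = values
    return kept


# Toy data: g1 is all zeros, g2 is weakly expressed, g3 passes both steps.
example = {"g1": [0, 0, 0], "g2": [0.1, 0.2, 0.0], "g3": [5.0, 7.0, 6.0]}
filtered = filter_genes(example, low_threshold=1.0)
```

In practice the threshold would be chosen empirically, as the text notes.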

Moreover, the data can contain gene expression profiles that differ statistically significantly from the GEP of the main group. Such genes do not correlate with the profiles of other genes and can also be removed from the data. Careful implementation of this stage significantly reduces the number of genes for further research, which also improves the quality of the subsequent GED processing steps for the problem described below.

Figure 1: Block-chart of a step-by-step procedure of GED processing to form clusters of highly and mutually expressed GEP

In [1], the authors presented the "limma" module (Linear Models for Microarray and RNA-Seq Data), which contains various functions for generating, filtering and interpreting gene expression data obtained using both DNA microchip experiments and mRNA sequencing. This module is, to some extent, an alternative to the "Bioconductor" package implemented in the R data mining and machine learning software [2], and it is based on the use of linear models to identify differentially expressed genes in a multifactor experiment. The module also contains functions for gene ontology analysis, which is very important for adequate GRN reconstruction, because interpreting genes and their interactions through the analysis of conceptual interconnections allows identifying target genes and establishing the nature of the interconnections between target and other genes with respect to the disease in question.

The papers [3][4][5] considered various tools and techniques of GED filtering available in the "Bioconductor" package, using quantitative quality criteria for GED obtained by the DNA microarray method [3,4] and by mRNA sequencing [5]. As a result of the simulation, the authors proposed a stepwise algorithm for extracting highly and mutually expressed gene expression profiles for their further grouping into clusters. In the review [6], the authors conducted a comparative analysis of current software for GED processing aimed at extracting the most informative genes. The analysis of these works allows us to conclude that it is feasible to use the R software for GEP processing in order to form clusters of highly and mutually expressed genes, because this software contains all the modules and functions needed to process gene expression data for the task at hand.

The review [7] presents research focused on various hybrid techniques for extracting clusters of mutually correlated GEP in order to create cancer diagnostic systems. In the reviewed works, various combinations of filtering, clustering and classification techniques were applied, using various types of statistical criteria and GEP proximity metrics. The classification accuracy of the examined objects was applied as the principal quality metric to assess the effectiveness of the respective hybrid model. The following filtration techniques and methods for estimating GEP proximity were analyzed in this review: the mutual information maximization method [8], Pearson's χ² test [9], the correlation-based feature selection technique [10], the Laplacian and Fisher scores [11], the information gain method [12], the Fisher criterion [13], independent component analysis [14], maximum relevance minimum redundancy [15], the probabilistic random function [16], random forest ranking [17], the Fisher-Markov selector [18], symmetrical uncertainty [19] and the logarithmic transformation method [20]. However, we would like to note that in the analyzed research high classification accuracy is in most cases achieved when a small number of extracted GEP is used. Moreover, the parameters of the techniques used in the respective hybrid models are set empirically during the simulation. Undoubtedly, this is one of the main disadvantages of the analyzed models.

The works [21,22] present a partial solution to this task. In these papers, a stepwise procedure of GEP extraction was developed on the basis of the joint application of Shannon entropy, statistical criteria, a clustering technique based on the SOTA clustering algorithm and a random forest binary classifier. The suitable algorithm parameters, with respect to classification accuracy, were set a priori according to the OCIT principles. However, only the correlation proximity metric was used in that research. Thus, the brief review presented above allows us to conclude that an effective model of GEP extraction based on the joint application of various proximity metrics, clustering and classification techniques is currently absent. This problem can be solved through the joint application of various techniques that are used successfully in current data science research [23][24][25][26].

In this work, we consider a hybrid GEP proximity metric calculated as a combination of the modified mutual information maximization method and Pearson's χ² test. The modified mutual information maximization method, in this instance, takes into account various methods of Shannon entropy evaluation.
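To make the mutual information component more concrete, the sketch below computes the standard histogram (plug-in) estimate of mutual information between two profiles from Shannon entropies. It is only an illustration under stated assumptions: the authors' modified method and its combination with Pearson's χ² test are described in [27] and are not reproduced here, and the names `discretize`, `entropy` and `mutual_information` as well as the equal-width binning are hypothetical choices.

```python
import math
from collections import Counter


def discretize(profile, bins):
    # Equal-width binning of a profile (one possible choice, assumed here).
    lo, hi = min(profile), max(profile)
    width = (hi - lo) / bins or 1.0  # guard against a constant profile
    return [min(int((v - lo) / width), bins - 1) for v in profile]


def entropy(labels):
    # Plug-in Shannon entropy (in bits) of a sequence of discrete labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y, bins=4):
    # I(X;Y) = H(X) + H(Y) - H(X,Y) on the discretized profiles.
    dx, dy = discretize(x, bins), discretize(y, bins)
    return entropy(dx) + entropy(dy) - entropy(list(zip(dx, dy)))


profile = [1, 2, 3, 4, 5, 6, 7, 8]
mi_self = mutual_information(profile, profile)       # equals H(X)
mi_const = mutual_information(profile, [3.0] * 8)    # no shared information
```

Different entropy estimators (for example, different binning rules) would give different values, which is presumably why the text stresses that several methods of Shannon entropy evaluation are considered.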

The objective of the research is the development and evaluation of a hybrid model of GEP extraction based on the joint application of the hybrid proximity metric, the OPTICS clustering algorithm implemented using the principles of OCIT, and a random forest binary classifier.

Materials and methods

In the general case, the internal clustering quality criterion should consider both the allocation of gene expression profiles inside clusters and the allocation of the clusters' medians relative to each other. Thus, this criterion should be complex and contain two components. If K is the number of clusters, the first component of this criterion can be calculated as follows:

$$QCW = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{i=1}^{N_k} d(e_i, C_k) \tag{1}$$

where $N_k$ and $C_k$ are the number of GEP in the k-th cluster and the median of the k-th cluster, respectively; $d(e_i, C_k)$ is the distance between the i-th profile and the median of its cluster, calculated using the complex proximity metric, which combines the modified mutual information maximization method (considering various methods of Shannon entropy calculation) and Pearson's χ² test, and whose effectiveness is shown in [27].

The second component of the internal criterion can be assessed as the average distance between the medians of the allocated clusters:

$$QCB = \frac{2}{K(K-1)}\sum_{i=1}^{K-1}\sum_{j=i+1}^{K} d(C_i, C_j) \tag{2}$$

In [21], the authors performed modelling to assess the performance of different types of internal criteria containing (1) and (2) as components. As a result, a hybrid internal criterion formed as a ratio of the Calinski-Harabasz criterion and the WB index has been proposed:

$$QC_{int} = \frac{K \cdot QCW^{2}\,(K-1)}{QCB^{2}\,(N-K)} \tag{3}$$

where N is the number of objects to be grouped. This criterion was used as the internal one during the modelling. The efficiency of both the hybrid GEP proximity metric and the quality criteria for grouping the profiles into clusters was assessed using the density-based clustering algorithm OPTICS [28], which is a logical development of the DBSCAN density algorithm and allows us to form a multicluster structure based on the respective proximity metric. The use of the OPTICS clustering algorithm is justified by the fact that it allows us not only to form a multicluster structure containing clusters of gene expression profiles that are close in terms of the density of their allocation in feature space, but also to identify as noise those profiles whose allocation density relative to other GEP is much lower than the density of the main groups of GEP.
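The internal criterion can be sketched in a few lines. The sketch assumes that formula (1) is the mean over clusters of the mean distance to the cluster median, formula (2) is the average pairwise distance between medians, and formula (3) reads K · QCW² · (K − 1) / (QCB² · (N − K)). The Euclidean distance is only a placeholder for the hybrid proximity metric, and the component-wise median and all function names are assumptions made for illustration.

```python
import math
from statistics import median


def euclidean(a, b):
    # Placeholder distance; the paper uses its hybrid proximity metric instead.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_median(cluster):
    # Component-wise median of the profiles in one cluster (assumption).
    return [median(column) for column in zip(*cluster)]


def qcw(clusters, d=euclidean):
    # Formula (1): mean over clusters of the mean distance to the cluster median.
    total = 0.0
    for cluster in clusters:
        c = cluster_median(cluster)
        total += sum(d(e, c) for e in cluster) / len(cluster)
    return total / len(clusters)


def qcb(clusters, d=euclidean):
    # Formula (2): average pairwise distance between the cluster medians.
    centers = [cluster_median(c) for c in clusters]
    k = len(centers)
    pairs = [(i, j) for i in range(k - 1) for j in range(i + 1, k)]
    return 2.0 * sum(d(centers[i], centers[j]) for i, j in pairs) / (k * (k - 1))


def qc_int(clusters, d=euclidean):
    # Formula (3): K * QCW^2 * (K - 1) / (QCB^2 * (N - K)).
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    return k * qcw(clusters, d) ** 2 * (k - 1) / (qcb(clusters, d) ** 2 * (n - k))


# Three compact, well separated toy clusters of two profiles each.
toy = [[[0, 0], [0, 2]], [[4, 0], [4, 2]], [[8, 0], [8, 2]]]
value = qc_int(toy)
```

For this toy partition QCW = 1 and QCB = 16/3, so the criterion is small; tighter and better separated clusters drive it further towards zero.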

We would like to note that the criterion calculated by formulas (1)-(3) does not always allow us to form an adequate clustering objectively because of the reproducibility error, which is inherent in most existing clustering algorithms. In other words, satisfactory grouping results obtained on one dataset are not always repeated when another, similar dataset is used. In [29], the authors proposed to reduce the reproducibility error by using "fresh data" (not used when creating the model) to verify the obtained model of object distribution into clusters, and to make the final decision regarding the cluster structure by jointly using the internal, external and balance criteria, which take into account possible discrepancies between the internal and external criteria. This idea was further developed in [30,31], where the objective clustering inductive technology was described and implemented. The authors proposed an external quality criterion in the form of the normalized difference of the internal criteria evaluated on two equivalent subsets (containing the same number of pairwise similar objects) at the corresponding hierarchical level of cluster structure formation:

$$QC_{ext} = \frac{\left|QC_{int}^{1} - QC_{int}^{2}\right|}{QC_{int}^{1} + QC_{int}^{2}} \tag{4}$$

The main idea was as follows. The minimal reproducibility error corresponds to the maximum similarity of the object allocation in clusters obtained on two equivalent subsets. Since the internal criteria consider the nature of both the pattern distribution in clusters and the allocation of the clusters' medians relative to each other, objective clustering (minimum reproducibility error) corresponds in this case to the minimal difference between the corresponding values of the internal criteria. The normalization in formula (4) maps the external criterion to the range from 0 (zero reproducibility error) to 1 (maximum error). The balance criterion was calculated using the Harrington desirability function according to the algorithm described in detail in [30,31].
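A short worked example of the external criterion, assuming formula (4) is the absolute difference of the two internal criteria divided by their sum. The function name `qc_ext` and the numeric inputs are illustrative only; the balance criterion is not sketched because its algorithm is given in [30,31] rather than here.

```python
def qc_ext(qc_int_1, qc_int_2):
    # Formula (4): normalized difference of the internal criteria evaluated
    # on the two equivalent subsets. Assumes non-negative inputs that are
    # not both zero.
    return abs(qc_int_1 - qc_int_2) / (qc_int_1 + qc_int_2)


same = qc_ext(0.07, 0.07)      # identical structures on both subsets
close = qc_ext(0.08, 0.12)     # |0.08 - 0.12| / 0.20
extreme = qc_ext(1.0, 0.0)     # maximal discrepancy
```

Identical internal criteria give 0, while the largest possible discrepancy gives 1, matching the range stated in the text.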

The random forest classifier was used to classify the examined samples. This choice is based on the authors' previous research presented in [21], where various types of binary classifiers were studied for classifying samples of patients examined for lung cancer. These samples also contained gene expression data as attributes. The effectiveness of the respective model was assessed using the classification accuracy of the examined samples.

Figure 2 shows a block chart of the stepwise procedure performed during the modelling. The practical implementation of this algorithm involves the following stages:

Stage I. Formation of GEP data and functions to calculate respective criteria.

1.1. Formation of an array of GED whose components represent the assessed patterns and the genes; the expression of a gene determines the relative amount of the given type of gene for the examined patterns.

1.2. Formation of the function to estimate the proximity between GEP on the basis of the joint application of the modified mutual information maximization proximity metric and Pearson's χ² test [28].

1.3. Formation of the functions to calculate the internal, external and hybrid balance quality criteria.

1.4. Formation of the function to calculate the classification accuracy of the examined samples.

1.5. Formation of two equivalent subsets of GEP by iteratively distributing the two nearest GEP, according to the hybrid proximity metric, into the two subsets.
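Step 1.5 can be sketched as a greedy pairing: the two nearest remaining profiles are found, one goes to the first subset and the other to the second, and the process repeats. This is a minimal reading of the step, not the authors' code; `split_equivalent` is a hypothetical name, the Euclidean distance stands in for the hybrid proximity metric, and the brute-force search is chosen for clarity rather than speed.

```python
import math
from itertools import combinations


def euclidean(a, b):
    # Placeholder distance; the paper uses its hybrid proximity metric instead.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_equivalent(profiles, d=euclidean):
    """Return two lists of profile indices forming pairwise similar subsets."""
    remaining = list(range(len(profiles)))
    subset_a, subset_b = [], []
    while len(remaining) >= 2:
        # Find the closest pair among the profiles not yet assigned.
        i, j = min(
            combinations(remaining, 2),
            key=lambda pair: d(profiles[pair[0]], profiles[pair[1]]),
        )
        subset_a.append(i)
        subset_b.append(j)
        remaining.remove(i)
        remaining.remove(j)
    return subset_a, subset_b


# One-dimensional toy profiles forming three obvious nearest pairs.
toy_profiles = [[0.0], [0.1], [5.0], [5.2], [9.0], [9.5]]
a_idx, b_idx = split_equivalent(toy_profiles)
```

With an odd number of profiles the last one is simply left out in this sketch; how the original procedure treats that case is not stated in the text.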

Stage II. Setup of density-based OPTICS clustering algorithm.

2.1. Setting the range of the minimum number of points within the ε-neighborhood: MinPts_min, MinPts_max.

2.2. Creating a reachability chart. Setting both the range and the step of variation of the ε-neighborhood values: Eps_min, Eps_max, dEps.

2.3. Calculation of the distances between all pairs of gene expression profiles in the equal-power subsets and formation of the matrices of distances between the corresponding profiles. The obtained distance matrices are used as input data when the clustering procedure is carried out with the OPTICS algorithm.

3.7. If k ≤ MinPts_max, go to step 3.2 of this procedure. Otherwise, create the charts of the clustering and classification quality criteria as functions of the Eps value for each of the MinPts values.

Stage IV. An analysis of the obtained results.

4.1. An analysis of the obtained charts. Forming conclusions regarding the effectiveness of the hybrid GEP proximity metric in forming subsets of informative genes for their further use in the creation of disease diagnosis systems and/or GRN reconstruction.

Experiment, results and discussion

The practical implementation of the proposed algorithm was carried out using the GSE19188 gene expression dataset of patients examined for early-stage lung cancer [32]. The data were obtained in a DNA microchip experiment and contained 156 microchips: 65 of them contained the GED of healthy patients and 91 contained the GED of patients with a lung cancer tumor (mild form). The 400 most informative GEP in terms of classification accuracy (approximately 93%) [20,21] were used during the simulation. The MinPts value was varied from 3 to 5. This interval was established empirically: the modelling showed that a larger number of points within the Eps neighborhood degrades the results in terms of the number of clusters in the equal-power subsets, the GEP clustering quality criteria and the sample classification accuracy. The Eps values were varied from the minimum distance between gene expression profiles in the equal-power subsets to 1.5 times this minimum distance. This range was also set empirically: for larger Eps values, the GEP were allocated into two clusters and the clustering results were repeated. The resulting range of Eps values was divided into 20 equal sections, the width of a section being equal to the step of the Eps value. According to the algorithm presented above, the clustering and classification quality criteria were calculated only for cases where the numbers of clusters obtained on the two equal-power subsets were equal. This condition minimizes the reproducibility error. Tables 1 and 2 and Figures 3 and 4 present the modelling results.
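The Eps grid described above can be written down directly. In the sketch below the function name `eps_grid` is hypothetical, and the minimum distance of 0.618 × 10⁻³ is an assumed value, chosen only because it yields a step of 0.01545 × 10⁻³ and an upper bound of 0.92700 × 10⁻³, which agree with the Eps values reported for the experiment.

```python
def eps_grid(min_distance, sections=20, factor=1.5):
    # Eps runs from the minimum pairwise distance to `factor` times that
    # distance, split into `sections` equal steps (sections + 1 grid points).
    eps_max = factor * min_distance
    step = (eps_max - min_distance) / sections
    return [min_distance + i * step for i in range(sections + 1)]


# Assumed minimum distance, see the note above.
grid = eps_grid(0.618e-3)
step = grid[1] - grid[0]
```

Under this assumption the third grid point is 0.64890 × 10⁻³ and the last one is 0.92700 × 10⁻³.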

Table 1: The result of the division of GEP into clusters when MinPts = 3

The analysis of the obtained results allows us to conclude that the proposed GEP proximity metric is suitable for selecting mutually correlated profiles when a multicluster structure is formed by the OPTICS clustering algorithm. The proposed method creates the conditions for choosing suitable algorithm parameters in terms of the optimal grouping of the GEP into clusters on the one hand and the minimum value of the reproducibility error on the other hand. As can be seen from Tables 1 and 2, when the MinPts parameter value is 3, there are seven cluster structures. The first clustering contains six clusters, the next four clusterings contain five clusters each, and the last two clusterings contain three clusters each. When MinPts is 4 or 5, the first clustering contains four clusters, and three clusters were obtained in every other clustering. It should be noted that the initial data contained approximately 400 gene expression profiles that were carefully selected by stepwise application of the SOTA clustering algorithm [20,21]. The accuracy of sample classification when the full set of gene expression profiles was used as attributes was approximately 93%.

The analysis of the results also shows that in all cases some of the gene expression profiles are identified as noise. These genes are not contained in any cluster. The presence of "noise" genes can be explained by the fact that the density of these GEP, in terms of the proximity metric used, is less than the conditional boundary value assessed by the OPTICS clustering algorithm. The analysis of the charts presented in Figure 3 has also shown that the internal and external criteria are not optimal for choosing suitable parameters of the OPTICS algorithm, because the minimum values of these criteria do not match the maximum values of the object classification accuracy in the corresponding clusters. The maximum value of the hybrid balance criterion, which contains both the internal and external criteria as components, is achieved for a three-cluster structure with the following parameters of the OPTICS algorithm: MinPts = 3, Eps = 0.00091155 or Eps = 0.00092700 (the same results are achieved in both instances). The results of the classification of objects contained in the corresponding clusters, presented in Figure 4, confirm these conclusions. As can be seen from the charts, with these parameters of the algorithm the classification accuracy is maximal for the first two clusters, while the second cluster contains the largest number of genes, i.e. it is the main one in terms of the number of gene expression profiles. The third cluster contains only six genes. The classification results in the fourth, fifth and sixth clusters are not adequate because they are the same in all cases and slightly worse than the classification results in the first three clusters. It should be noted that the maximum values of the hybrid balance criterion, which determines the quality of gene expression profile clustering, correspond to the maximum classification accuracy of the samples that contain the extracted gene expression profiles as attributes. This fact indicates the high efficiency of the proposed hybrid proximity metric and of the technique used to assess the quality of GEP clustering.

Conclusions

This paper has described a hybrid model of GEP cluster formation for extracting groups of mutually similar GEP in terms of the applied proximity metric, based on the OPTICS clustering algorithm implemented according to the OCIT principles. A hybrid proximity metric was applied during the simulation to assess the distance between GEP. This metric was calculated through the joint application of the modified mutual information maximization metric (considering various methods of Shannon entropy evaluation) and Pearson's χ² test. The effectiveness of this hybrid proximity metric has been shown in [27]. A structural block chart of the stepwise algorithm for setting suitable parameters of the OPTICS algorithm in terms of a hybrid balance clustering quality criterion, which contains the internal and external clustering quality criteria as components, has been presented. The high efficiency of the proposed model has been confirmed by the agreement between the quality criteria for clustering gene expression profiles and the classification accuracy of objects that contain these GEP as attributes.

An analysis of the simulation results has indicated that the internal and external clustering quality criteria do not allow determining the optimal parameters of the OPTICS algorithm. The minimal values of these criteria do not match the maximum values of the object classification accuracy in the corresponding clusters. The maximal value of the hybrid balance criterion, which is formed from both the internal and external criteria, has been achieved for a three-cluster structure with the following parameters of the OPTICS algorithm: MinPts = 3, Eps = 0.00091155 or Eps = 0.00092700 (the same results are achieved in both instances).

The analysis of the object classification results has confirmed the high effectiveness of the proposed technique, since the classification accuracy is maximal for the first two clusters, while the second cluster contains the largest number of genes, i.e. it is the main one in terms of the number of gene expression profiles. The third cluster contains only six genes. The fourth, fifth and sixth clusters contained the same number of gene expression profiles. Additionally, the classification accuracy in these cases is slightly worse than in the first three clusters. It should be noted that the maximum values of the hybrid balance criterion, which determines the quality of GEP clustering, match the maximum classification accuracy of the samples that contain the extracted gene expression profiles as attributes. This fact indicates the high efficiency of the proposed hybrid proximity metric and of the model used to assess the quality of GEP clustering. However, we would like to note that the proposed proximity metric is appropriate for high-dimensional gene expression profiles. For other types of data, it is necessary to investigate other metrics that are more suitable for those data. This is the limitation of the proposed model.

The authors' further research will focus on applying the proposed hybrid proximity metric within hybrid techniques of gene expression profile clustering and classification implemented on the basis of other clustering and classification algorithms.

Figure 2: Structural block-chart of the algorithm for forming a multicluster structure based on the OPTICS algorithm implemented using the principles of OCIT

Stage III. Stepwise clustering of GEP within the specified ranges of variation of the algorithm parameters.

3.1. MinPts value initialization: k = MinPts_min.

3.2. Eps value initialization: e = Eps_min.

3.3. Clustering of the gene expression profiles contained in the equivalent subsets, forming partitions with K1 and K2 clusters.

3.4. If K1 = K2 > 2, calculation of the internal and external quality criteria by formulas (1)-(4). Otherwise, increase the value of the Eps parameter (e = e + dEps) and go to step 3.3 of this procedure.

3.5. Classification of the objects that contain the gene expression profiles in each of the allocated clusters. Calculation of the classification quality criterion (Accuracy).

3.6. If e ≤ Eps_max, go to step 3.3 of this procedure. Otherwise, calculate the hybrid balance criterion and increase the MinPts value by one: k = k + 1.
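The control flow of Stage III can be sketched as a scan over MinPts and Eps that keeps only the settings for which both equivalent subsets produce the same number of clusters, greater than two. Everything named here is hypothetical: `scan_parameters` is an illustrative helper, `toy_cluster_fn` is a simple stand-in for OPTICS (connected components at radius Eps, small components treated as noise), and the label -1 for noise is an assumption. Quality criteria and classification accuracy would be computed at the point marked in the code.

```python
def count_clusters(labels):
    # Label -1 marks noise (assumption); noise is not counted as a cluster.
    return len(set(labels) - {-1})


def scan_parameters(dist_a, dist_b, cluster_fn, min_pts_values, eps_values):
    accepted = []
    for min_pts in min_pts_values:              # outer loop over MinPts
        for eps in eps_values:                  # inner loop over Eps
            k1 = count_clusters(cluster_fn(dist_a, min_pts, eps))
            k2 = count_clusters(cluster_fn(dist_b, min_pts, eps))
            if k1 == k2 and k1 > 2:             # condition of step 3.4
                # Criteria and accuracy would be evaluated here.
                accepted.append((min_pts, eps, k1))
    return accepted


def toy_cluster_fn(dist, min_pts, eps):
    # Hypothetical stand-in for OPTICS: connected components of the graph
    # linking points closer than eps; components smaller than min_pts are noise.
    n = len(dist)
    labels, next_label, seen = [-1] * n, 0, set()
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if v not in seen and dist[u][v] <= eps:
                    seen.add(v)
                    stack.append(v)
        if len(component) >= min_pts:
            for u in component:
                labels[u] = next_label
            next_label += 1
    return labels


def distance_matrix(points):
    return [[abs(p - q) for q in points] for p in points]


dist_a = distance_matrix([0.0, 1.0, 10.0, 11.0, 20.0, 21.0])
dist_b = distance_matrix([0.0, 1.0, 10.0, 11.0, 20.0, 21.5])
accepted = scan_parameters(dist_a, dist_b, toy_cluster_fn, [2], [1.2, 2.0, 12.0])
```

In this toy run only Eps = 2.0 is accepted: at Eps = 1.2 the second subset loses a cluster to noise, and at Eps = 12.0 everything merges into a single cluster. A real run would replace `toy_cluster_fn` with a wrapper around an OPTICS implementation that accepts a precomputed distance matrix.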

Figure 3: The simulation results regarding the criterial analysis of the cluster structure using the OPTICS algorithm implemented on the basis of OCIT: distribution of the internal criteria assessed on the first (a) and second (b) equivalent subsets of GEP; external (c) and hybrid balance criterion (d) when the Eps and MinPts values are varied from minimum to maximum values

Figure 4: The results of the simulation regarding the classification accuracy of objects whose attributes are the gene expression profiles allocated to clusters using the OPTICS algorithm: a) the first cluster; b) the second cluster; c) the third cluster

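Table 1 (MinPts = 3): number of GEP in each cluster

Eps, ×10^-3   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5   Cluster 6
0.66435       24          311         6           14          6           7
0.69525       24          322         6           14          6           -
0.71070       24          323         6           14          6           -
0.72615       24          329         6           14          6           -
0.74160       24          332         6           14          6           -
0.91155       24          359         6           -           -           -
0.92700       24          359         6           -           -           -

Table 2: The result of the division of GEP into clusters when MinPts = 4 and 5

MinPts = 4

Eps, ×10^-3   Cluster 1   Cluster 2   Cluster 3   Cluster 4
0.64890       24          158         115         11
0.69525       24          322         14          -
0.71070       24          323         14          -
0.72615       24          326         14          -
0.74160       24          331         14          -

MinPts = 5

Eps, ×10^-3   Cluster 1   Cluster 2   Cluster 3   Cluster 4
0.64890       23          155         115         11
0.66435       24          308         14          -
0.67980       24          314         14          -
0.69525       24          321         14          -
0.71070       24          322         14          -
0.72615       24          326         14          -
0.74160       24          331         14          -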

References

[1] M. E. Ritchie, B. Phipson, D. Wu, limma powers diff. express. analysis for RNA-sequencing and microarray studies, Nucl. Acids Res. 43(7) (2015) art. doi:10.1093/nar/gkv007.
[2] R. Ihaka, R. Gentleman, R: a lang. for data analysis and graphics, J. of Comp. and Graph. Statistics 5(3) (1996). doi:10.2307/1390807.
[3] S. Babichev, A. Kornelyuk, analysis of microarray GEP of lung cancer, Biopolymers and Cell 32(1) (2016). doi:10.7124/bc.00090F.
[4] S. Babichev, B. Durnyak, Techniques of DNA microarray data pre-processing based on the complex use of Bioconductor tools and Shannon entropy, CEUR Workshop Proceedings 2353 (2019).
[5] S. Babichev, B. Durnyak, V. Senkivskyy, Exploratory analysis of neuroblast. data genes expr. based on Bioconductor package tools, CEUR Workshop Proceedings 2488 (2019).
[6] C. S. Tan, W. S. Ting, M. S. Mohamad, A Review of Feature Extraction Soft. for Microarray Gene Expr. Data, BioMed Res. Int. 2014 (2014) 213656. doi:10.1155/2014/213656.
[7] N. Almugren, H. Alshamlan, A survey on hybrid feature selection meth. in microarray gene express. data for cancer classific., IEEE Access 7 (2019) 8736725. doi:10.1109/ACCESS.2019.2922987.
[8] H. Lu, J. Chen, K. Yan, A hybrid feature selection algorithm for GED classification, Neurocomp. 256 (2017). doi:10.1016/j.neucom.2016.07.080.
[9] C. P. Lee, Y. Leu, A novel hybrid feature selection method for microarray data analysis, Appl. Soft Comput. 11(1) (2011). doi:10.1016/j.asoc.2009.11.010.
[10] L. Y. Chuang, C. H. Yang, K. C. Wu, C. H. Yang, A hybrid feature selection method for DNA microarray data, Comput. Biol. Med. 41(4) (2011). doi:10.1016/j.compbiomed.2011.02.004.
[11] M. A. Valizade Hasanloei, R. Sheikhpour, A combined Fisher and Laplacian score for feature selection in QSAR based drug design using compounds with known and unknown activities, J. Comput. Aided Mol. Des. 32(2) (2018). doi:10.1007/s10822-017-0094-6.
[12] J. R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1986). doi:10.1007/BF00116251.
[13] L. Xiaowei, J. Chenglin, A Fisher's Criterion-Based Linear Discriminant Analysis for Predicting the Critical Values of Coal and Gas Outbursts Using the Initial Gas Flow in a Borehole, Mathematical Problems in Engineering 2017 (2017) 7189803. doi:10.1155/2017/7189803.
[14] A. Hyvärinen, Independent component analysis: recent advances, Phil. Trans. R. Soc. 371 (2013) 20110534. doi:10.1098/rsta.2011.0534.
[15] H. Alshamlan, G. Badr, mRMR-ABC: A hybrid gene selection algorithm for cancer classification using microarray GEP, Biomed. Res. Intern. 2015 (2018) 604910. doi:10.1155/2015/604910.
[16] P. Moradi, M. Gholampour, A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy, Applied Soft Computing 43 (2016). doi:10.1016/j.asoc.2016.01.044.
[17] E. Pashaei, M. Ozen, N. Aydin, Gene selection and classification approach for microarray data based on random forest ranking and BBHA, in: Proc. IEEE-EMBS Int. Conf. Biomed. Health Inform. (BHI), 2016. doi:10.1109/BHI.2016.7455896.
[18] X. Li, M. Yin, Multiobjective binary biogeography based optimization for feature selection using gene expression data, IEEE Trans. Nanobiosci. 12(4) (2013). doi:10.1109/TNB.2013.2294716.
[19] S. S. Shreem, S. Abdullah, M. Z. A. Nazri, Hybrid feature selection algorithm using symmetrical uncertainty and a harmony search algorithm, Int. J. Syst. Sci. 47(6) (2016). doi:10.1080/00207721.2014.924600.
[20] P. Tumuluru, B. Ravi, GOA-based DBN: Grasshopper optimization algorithm-based deep belief neural networks for cancer classification, Int. J. Appl. Eng. Res. 12(24) (2017).
[21] S. Babichev, J. Škvor, Technique of Gene Expression Profiles Extraction Based on the Complex Use of Clustering and Classification Methods, Diagnostics 10(8) (2020). doi:10.3390/diagnostics10080584.
[22] S. Babichev, V. Lytvynenko, Information Technology of Gene Expression Profiles Processing for Purpose of Gene Regulatory Networks Reconstruction, in: Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining and Processing (DSMP), 2018, 8478452. doi:10.1109/DSMP.2018.8478452.
[23] I. Izonin, R. Tkachenko, V. Verhun, An approach towards missing data management using improved grnn-sgtm ensemble method, International Journal Engineering Science and Technology (2020), in press. doi:10.1016/j.jestch.2020.10.005.
[24] V. Lytvyn, T. Salo, V. Vysotska, Identifying textual content based on thematic analysis of similar texts in big data, in: IEEE 2019 14th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2019 - Proceedings, 2 (2019). doi:10.1109/STC-CSIT.2019.8929808.
[25] A. Rzheuskyi, O. Kutyuk, V. Vysotska, The architecture of distant competencies analyzing system for it recruitment, in: IEEE 2019 14th Int. Sc. and Techn. Conf. on Comp. Sc. and Inf. Techn., 3 (2019). doi:10.1109/STC-CSIT.2019.8929762.
[26] R. Tkachenko, I. Izonin, N. Kryvinska, An approach towards increasing prediction accuracy for the recovery of missing iot data based on the grnn-sgtm ensemble, Sensors 20(9) (2020) art. doi:10.3390/s20092625.
[27] S. Babichev, L. Yasinska-Damri, I. Liakh, B. Durnyak, Comparison Analysis of Gene Expression Profiles Proximity Metrics, Symmetry 13(10) (2021). doi:10.3390/sym13101812.
[28] M. Ankerst, M. M. Breunig, H. P. Kriegel, J. Sander, OPTICS: Ordering Points to Identify the Clustering Structure, SIGMOD Record (ACM Special Interest Group on Management of Data) 28(2) (1999). doi:10.1145/304181.304187.
[29] H. R. Madala, A. G. Ivakhnenko, Inductive Learning Algorithms for Complex Systems Modeling, CRC Press, 1994, 365.
[30] S. Babichev, B. Durnyak, Application of Optics Density-Based Clustering Algorithm Using Inductive Methods of Complex System Analysis, in: International Scientific and Technical Conference on Computer Sciences and Information Technologies, 1 (2019). doi:10.1109/STC-CSIT.2019.8929869.
[31] S. Babichev, V. Lytvynenko, M. A. Taif, Estimation of the inductive model of objective clustering stability based on k-means algorithm for different level of data noise, Radio Electronics, Comp. Science, Control 4 (2016). doi:10.15588/1607-3274-2016-4-7.
[32] J. Hou, J. Aerts, B. Hamer, Gene expression-based classification of non-small cell lung carcinomas and survival prediction, PLoS ONE 5 (2010) e10312. doi:10.1371/journal.pone.0010312.