Comparison Analysis of the Pearson's Chi-Square Test and Correlation Metric Effectiveness to Form the Subset of Differently Expressed and Mutually Correlated Genes

Lyudmyla Yasinska-Damri^a, Sergii Babichev^b, and Igor Liakh^c

a Ukrainian Academy of Printing, Pid Goloskom street, 19, Lviv, 79000, Ukraine
b Kherson State University, University street, 27, Kherson, 73000, Ukraine
c Uzhhorod National University, University street, 14, Uzhhorod, 88000, Ukraine

Abstract
The development of patient health monitoring systems based on gene expression data is an important direction of current bioinformatics. A key step in solving this problem is the selection of differently expressed and mutually correlated gene expression profiles (GEP) that allow the patients' health to be monitored in real time with high accuracy. Various similarity metrics exist to identify the level of GEP proximity. In this research, we compare the Pearson chi-square test and the correlation metric for evaluating the proximity of gene expression profiles. The effectiveness of each metric has been evaluated by applying classification quality criteria such as accuracy, F-score and the Matthews correlation coefficient (MCC). The simulation results have shown that the metric based on Pearson's chi-square coefficient is significantly more effective than the correlation metric for selecting mutually similar gene expression profiles, and that this metric can be used when differently expressed and mutually correlated GEP are extracted using various clustering algorithms.

Keywords
Gene expression profiles, correlation metric, Pearson's chi-square test, gene expression profiles classification, classification quality criteria

1.
Introduction and literature review
The extraction of a subset of differently expressed and mutually correlated gene expression profiles (GEP), whether to create a decision support system for the diagnosis of various diseases or to reconstruct a gene regulatory network (GRN), involves assessing both the informativity and the proximity of gene expression profiles using either single methods or an ensemble of appropriate methods. Currently, clustering and biclustering techniques are widely applied to solve this problem. These methods allow the differently expressed and mutually correlated GEP to be identified; however, their application involves a high degree of subjectivity due to the imperfection of the quality measures used. In addition, useful information may be lost when informative gene expression profiles that contain significant information about the condition of the investigated object are removed. In this instance, it can be reasonable to apply hybrid models that combine machine learning and data mining techniques in an ensemble of various methods in order to analyze and track the formation of GEP subsets considering the type of the disease.

Recently, many scientific papers have been devoted to the problem of measuring the degree of GEP informativeness in order to form subsets of both differently

IntelITSIS'2022: 3rd International Workshop on Intelligent Information Technologies and Systems of Information Security, March 23–25, 2022, Khmelnytskyi, Ukraine
EMAIL: Lm.yasinska@gmail.com (L. Yasinska-Damri); sbabichev@ksu.ks.ua (S. Babichev); ihor.lyah@uzhnu.edu.ua (I. Liakh)
ORCID: 0000-0002-8629-8658 (L. Yasinska-Damri); 0000-0001-6797-1467 (S. Babichev); 0000-0001-5417-9403 (I. Liakh)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
expressed and mutually proximate GEP in terms of the accuracy of recognizing the investigated objects. Thus, in [1] the authors considered the detection of gene expression profiles of miRNA molecules. During the experiment, they selected approximately 2,744,989 rows out of 9,888,123 from each library. As a result, 2,565 siRNA molecules were discovered. The comparative analysis of the effectiveness of various types of classifiers for identifying differently expressed GEP, using errors of both the first and second kind, was considered in [2-4]. The principal shortcoming of the proposed methods is that full information about the classes to which the genes belong is not available a priori. For this reason, these techniques have a high degree of subjectivity.

A comparative analysis of different hybrid models for extracting subsets of differently expressed and mutually similar gene expression profiles, in order to create a cancer classifier based on gene expression data, was carried out in [5-7]. The authors considered various steps of gene expression data pre-processing: from filtering, followed by statistical analysis of the experimental data to solve the feature selection task, to the evaluation of various types of hybrid models based on the joint use of clustering algorithms and classifiers. To evaluate the effectiveness of the respective approach, the authors used various criteria based on the classification results of the objects. Additionally, the authors considered various combinations of current machine learning and data mining techniques.
In this review, the following filtration and feature selection methods were considered: maximization of mutual information based on the Shannon entropy criterion; the chi-squared test; a technique based on correlation analysis; Fisher and Laplacian measures; the random forest ranking technique; a method based on a probabilistic random function; logarithmic transformation; a method based on maximum relevance and minimum redundancy; and the information gain technique. In the respective hybrid models, the authors used various combinations of gene grouping methods and classification techniques for the investigated objects.

The brief review presented above allows us to infer that the problem of objectively extracting differently expressed and mutually similar genes with high resolving ability for disease diagnostics currently has no unambiguous solution. In many instances, acceptable classification accuracy was reached when a small number of gene expression profiles was applied. To reconstruct a high-quality gene regulatory network (in order to understand the particularities of gene interconnection), it is necessary to use a larger number of differentially expressed and mutually correlated gene expression profiles. Papers [8,9] present a partial solution to this problem. The hybrid model proposed by the authors assumes the joint application of Shannon entropy, various types of statistical quality criteria, the SOTA clustering algorithm with the correlation distance as the proximity metric, and various types of binary classifiers. The authors proposed a step-by-step procedure for dividing the GEP, with the effectiveness of each step evaluated by applying both clustering and classification quality criteria. The fuzzy inference technique was applied to make the final decision about the selection of differently expressed and mutually similar GEP.
In the authors' opinion, the application of an ensemble of quality criteria contributes to higher objectivity when extracting subsets of mutually similar and differently expressed genes. However, the proposed technique has some shortcomings. First, it focused on datasets containing only two classes of the investigated objects; multi-class datasets were not considered. Thus, it would be better to extend the types of classifiers using datasets that contain a higher number of classes. The second shortcoming concerns the limited number of datasets used during the simulation process. Thus, it is necessary to validate this model using various other gene expression profile datasets [10,11].

This brief review of current research in the subject area indicates that the problem of extracting mutually correlated and differently expressed GEP considering the type of disease is topical and, at present, has no unambiguous solution. It can be solved effectively using current computer science techniques (data mining and machine learning), which are nowadays applied successfully in different fields of both applied and scientific research [12-15]. The choice of a gene expression profile proximity metric that allows the mutually correlated and differently expressed GEP to be grouped objectively is one of the principal stages in solving this problem.

The goal of the paper is the comparative analysis of the correlation-based metric and Pearson's chi-square test for assessing GEP proximity, using various classification quality criteria as the main criteria for assessing the efficiency of the respective metric.

2.
Problem statement
Let the experimental gene expression data be represented as follows:
$$G = \{e_{sp}\}, \quad s = \overline{1, n}; \quad p = \overline{1, m}, \tag{1}$$
where n is the number of genes that determine the state of the investigated samples and m is the number of samples. In this instance, the main measure for forming subsets of differently expressed and mutually similar GEP is the target function:
$$F(e_i, e_j) = \min f(e_i, e_j), \tag{2}$$
where $e_i$ and $e_j$ are the gene expression profiles i and j, respectively, and $f()$ is the proximity function used to assess the proximity level of the i-th and j-th GEP. In our research, we investigate Pearson's chi-square ($\chi^2$) coefficient and the correlation metric as the GEP proximity function. The results of the objects classification were used to assess the effectiveness of the respective similarity function.

3. Materials and methods
3.1. Metrics and criteria to evaluate the GEP proximity
Pearson's statistical chi-square ($\chi^2$) measure tests the hypothesis that the values of GEP are distributed according to the same law [16]. Let the k-th gene expression profile be presented as a numeric vector of expression values $e_k = (e_{kp}), \; p = \overline{1, m}$, where m is the number of study samples or conditions of the experiment carried out to form the gene expression data. If the range of gene expression values $[e_k^{\min}, e_k^{\max}]$ is divided into d non-intersecting intervals $[e_{ks}^r, e_{kp}^r], \; r = \overline{1, d}$, then the number of expression values falling within interval r can be determined as follows:
$$m_r = \sum_{j=1}^{m} \left[ e_{ks}^r \le e_{kj} \le e_{kp}^r \right].$$
In the traditional application of Pearson's chi-square test to categorized data, it is first necessary to count the number of values of the investigated vector that belong to each interval.
At the second stage, it is necessary to assess the expected number of samples in the respective interval, taking into account the probability $p_r$ of a sample falling into that interval: $m'_r = p_r \cdot m$. The chi-square coefficient can then be evaluated as follows:
$$\chi^2 = \sum_{r=1}^{d} \frac{(m_r - m'_r)^2}{m'_r} \tag{3}$$
The hypothesis that the studied data values follow a certain distribution is accepted or rejected by comparing the criterion (3) with a boundary value that takes into consideration both the number of degrees of freedom and the chosen probability (significance) level. The null hypothesis is rejected (the data distribution does not correspond to the assumed distribution) if the chi-square value is greater than the boundary value. Otherwise, the hypothesis is accepted.

When we process gene expression profiles, the expression values in a profile are proportional to the quantity of the corresponding specific gene. If we compare two GEP $e_i$ and $e_j$, we assume that the expression values in the first and second profiles are the expected and the observed values, respectively. Then, equation (3) for calculating the chi-square criterion takes the form:
$$\chi^2 = \sum_{s=1}^{m} \frac{(e_{is} - e_{js})^2}{e_{is}} \tag{4}$$
A higher proximity level of gene expression profiles corresponds to a smaller value of the criterion (4). This fact is the basis for forming subsets of mutually similar GEP using Pearson's chi-squared test.

The second measure used in our research assumes the calculation of the pairwise correlation distance between GEP in order to assess the degree of their consistency. As noted above, the main goal of GEP pre-processing (feature selection) is the selection of differently expressed and mutually correlated GEP, which allow the investigated samples that contain the selected GEP as attributes to be identified with the highest accuracy.
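As a minimal sketch of criterion (4), the chi-square distance between two profiles can be computed as shown below. The sketch is written in plain Python rather than the R used for the simulations; the function name `chi_square_distance` and the decision to reject non-positive expected values are assumptions made for illustration, not part of the described method.

```python
def chi_square_distance(expected, observed):
    """Chi-square distance of criterion (4): sum of (e_is - e_js)^2 / e_is.

    `expected` plays the role of profile e_i and `observed` of profile e_j.
    """
    if len(expected) != len(observed):
        raise ValueError("profiles must cover the same samples")
    total = 0.0
    for e_is, e_js in zip(expected, observed):
        # Assumption: expected values must be positive, because each term
        # of the sum is divided by e_is.
        if e_is <= 0:
            raise ValueError("expected expression values must be positive")
        total += (e_is - e_js) ** 2 / e_is
    return total


if __name__ == "__main__":
    e_i = [2.0, 4.0]
    e_j = [3.0, 2.0]
    print(chi_square_distance(e_i, e_j))  # e_i treated as the expected profile
    print(chi_square_distance(e_j, e_i))  # roles swapped
```

Because each term is divided by the value taken from the first profile, the measure is not symmetric: swapping the two profiles generally changes the result.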
We assume that the selected gene expression profiles which satisfy the requirements listed above should have a high level of mutual correlation and, since the gene expression values are numeric, we can use Pearson's correlation coefficient to form the subset of differently expressed and mutually correlated gene expression profiles:
$$d_{cor}(e_i, e_j) = 1 - \frac{\sum_{s=1}^{m} (e_{is} - \bar{e}_i)(e_{js} - \bar{e}_j)}{\sqrt{\sum_{s=1}^{m} (e_{is} - \bar{e}_i)^2 \sum_{s=1}^{m} (e_{js} - \bar{e}_j)^2}}, \tag{5}$$
where $\bar{e}_i$ and $\bar{e}_j$ are the average values of the gene expression profiles $e_i$ and $e_j$, respectively. As in the previous case (the use of Pearson's chi-squared test), a smaller value of the criterion (5) corresponds to a higher proximity level of the investigated GEP.

The effectiveness of the distance functions used (Pearson's chi-square coefficient and the correlation distance) was evaluated by analyzing the classification of the investigated samples and calculating the respective quality criteria based on errors of both the first and the second kind. Within the simulation process, we used the Random Forest (RF) binary classifier [17,18] to assess the effectiveness of each metric. The effectiveness of this classifier for identifying gene expression data was shown in [8]. As experimental data, we used samples of patients examined for lung cancer. In accordance with the data description, the investigated samples can be divided into two groups: healthy patients and patients with lung cancer tumors. The quality of data classification was evaluated using criteria whose components are the errors of the first and the second kind. Table 1 presents the confusion matrix used to calculate the classification quality criteria.
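The counts of such a confusion matrix, and the accuracy, F-score and Matthews correlation coefficient derived from them, can be tallied as in the following hedged sketch. It is written in plain Python (the simulations themselves used R); the function names, the label encoding with 1 for the tumor class, and the convention of returning 0 when a denominator is zero are assumptions for illustration. MCC is computed in its standard form, with the square root of (TPV + FPV)(TPV + FNV)(TNV + FPV)(TNV + FNV) in the denominator.

```python
import math


def confusion_counts(actual, predicted, positive=1):
    """Tally (TPV, FNV, FPV, TNV); `positive` marks the tumor class (assumed encoding)."""
    tpv = fnv = fpv = tnv = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tpv += 1
            else:
                fnv += 1
        else:
            if p == positive:
                fpv += 1
            else:
                tnv += 1
    return tpv, fnv, fpv, tnv


def quality_criteria(tpv, fnv, fpv, tnv):
    """Return (accuracy, F-score, MCC) computed from confusion-matrix counts."""
    acc = (tpv + tnv) / (tpv + fpv + tnv + fnv)
    # Assumed convention: a ratio with a zero denominator is reported as 0.
    prc = tpv / (tpv + fpv) if tpv + fpv else 0.0
    rcl = tpv / (tpv + fnv) if tpv + fnv else 0.0
    fs = 2 * prc * rcl / (prc + rcl) if prc + rcl else 0.0
    denom = math.sqrt((tpv + fpv) * (tpv + fnv) * (tnv + fpv) * (tnv + fnv))
    mcc = (tpv * tnv - fpv * fnv) / denom if denom else 0.0
    return acc, fs, mcc


if __name__ == "__main__":
    actual = [1, 1, 1, 0, 0, 0]     # hypothetical diagnosis labels
    predicted = [1, 1, 0, 0, 0, 1]  # hypothetical classifier output
    counts = confusion_counts(actual, predicted)
    print(counts)
    print(quality_criteria(*counts))
```

Keeping the tally separate from the criteria makes it straightforward to average the criteria over repeated train/test splits.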
Table 1
Confusion matrix

| The real state of the patient according to the diagnosis results | Result of the object classification: patients with tumor | Result of the object classification: healthy patients |
| Patients with tumor | True positive values (TPV) | False negative values (FNV) |
| Healthy patients | False positive values (FPV) | True negative values (TNV) |

To assess the efficiency of the metrics listed above, we applied the following traditional classification quality criteria:
• Accuracy (ACC), defined as the overall probability of correct prediction by the classifier:
$$ACC = \frac{TPV + TNV}{TPV + FPV + TNV + FNV} \tag{6}$$
• F-score (FS), a measure of the accuracy of a model that can be used to assess the effectiveness of a binary classifier which classifies the samples into negative and positive ones. The F-score combines Precision (PRC) and Recall (RCL) in the following way:
$$FS = \frac{2 \cdot PRC \cdot RCL}{PRC + RCL}, \tag{7}$$
where
$$PRC = \frac{TPV}{TPV + FPV}; \quad RCL = \frac{TPV}{TPV + FNV}$$
• Matthews correlation coefficient (MCC), a measure for assessing the effectiveness of a binary classifier [19]:
$$MCC = \frac{(TPV \cdot TNV) - (FPV \cdot FNV)}{\sqrt{(TPV + FPV)(TPV + FNV)(TNV + FPV)(TNV + FNV)}} \tag{8}$$
Higher values of the measures (6)-(8) correspond to higher efficiency of the data classification procedure.

3.2. The stepwise procedure of the simulation process implementation
The stepwise algorithm implemented within the simulation process to assess the efficiency of the measures used is shown in Figure 1. Its practical implementation assumes the following:
Stage I. Formation of the gene expression data as a matrix and of the vector of methods for evaluating the distance function between the gene expression profiles.
1.1. Analysis of the gene expression data and formation of these data as a matrix whose rows and columns represent the investigated samples and the genes that characterize them, respectively.
1.2.
Formation of a vector of methods for calculating the distance function, for further estimation of the mutual proximity of gene expression profiles.
Stage II. Formation of a triangular distance matrix containing the distance values between the gene expression profiles, by applying the appropriate distance function.
2.1. Selection of the first method from the formed vector of distance function calculation methods (n = 1).
2.2. Calculation of the distance function value for all pairs of GEP that make up the gene expression data matrix. Formation of a triangular matrix of distances between the GEP.
2.3. Selection of the kmax mutually similar and differently expressed GEP considering the current distance function.
Stage III. Classification of the investigated objects and formation of the vector of classification quality criteria.
3.1. Selection of the two nearest gene expression profiles according to the current distance function (k = 2).
3.2. Initialization of the classification stage for objects whose attributes are the expression values of the selected genes (p = 1). To increase the objectivity of the classification, this procedure was repeated 10 times (pmax = 10), with redistribution of the objects between the training and test data subsets.
3.3. Implementation of the data classification procedure at the current stage p.
3.4. Calculation of the quality criteria by formulas (6)-(8).
3.5. Increasing the number of nearest GEP by one and returning to step 3.2 of this procedure. When the number of gene expression profiles used as attributes during the object classification reaches the maximum value, formation of the matrix of classification quality criteria.
Stage IV. Analysis of the received results.
4.1. Creation of charts of the classification quality criteria versus the increasing number of mutually correlated and differentially expressed GEP for the appropriate distance metric.
4.2. Analysis of the simulation results.
Drawing a conclusion about the efficiency of the tested proximity metrics.

Figure 1: Structural block chart of the algorithm to evaluate the effectiveness of Pearson's chi-square test and the correlation proximity metric

4. Experiment, results and discussion
The experimental basis for implementing the algorithm presented above was the dataset GSE19188 [20]. This dataset contains the experimental results of testing 156 patients for lung cancer, in which the expression values of 54675 genes were assessed. The simulation process was carried out using the R programming language. According to the data annotation, the tested samples were divided into two groups: healthy patients (65 samples) and patients with cancer tumors (91 samples). As noted earlier, each sample contained 54675 genes in total, half of which were not expressed in any sample (their expression was zero). These genes were removed from the dataset at the first step. We used the results of the research presented in [4], where the authors applied a hierarchical clustering procedure with the joint use of the SOTA clustering algorithm with the correlation proximity metric and binary classifiers to determine an optimal hierarchical level of the gene expression data partition in terms of the samples classification accuracy. The authors selected 401 differently expressed and mutually correlated gene expression profiles. The use of these data as attributes allowed the authors to reach approximately 94% classification accuracy. These genes were used as the experimental dataset during the simulation process. Figure 2 shows the distribution of the values of both the Pearson's chi-square coefficient and the correlation distance, using both a box-and-whiskers diagram and a kernel-density plot.
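The distance values summarized in such charts come from the triangular matrix of pairwise distances between the GEP. As a hedged sketch in plain Python (the simulation itself was carried out in R), the block below builds that matrix for a pluggable distance function, shown here with the correlation distance, and flags values lying above the upper whisker. The function names and the 1.5 x IQR whisker rule are assumptions; the whisker rule actually used for the charts is not stated.

```python
import math
from itertools import combinations
from statistics import quantiles


def correlation_distance(e_i, e_j):
    """Correlation distance: 1 minus Pearson's correlation coefficient."""
    mean_i = sum(e_i) / len(e_i)
    mean_j = sum(e_j) / len(e_j)
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(e_i, e_j))
    var_i = sum((a - mean_i) ** 2 for a in e_i)
    var_j = sum((b - mean_j) ** 2 for b in e_j)
    if var_i == 0 or var_j == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_i * var_j)


def pairwise_distances(profiles, distance):
    """Upper-triangular distance matrix as a dict {(i, j): d} with i < j."""
    return {
        (i, j): distance(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
    }


def upper_outliers(values, coef=1.5):
    """Values above Q3 + coef * IQR (assumed Tukey box-plot rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    limit = q3 + coef * (q3 - q1)
    return [v for v in values if v > limit]


if __name__ == "__main__":
    # Hypothetical toy profiles, three samples each.
    profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
    distances = pairwise_distances(profiles, correlation_distance)
    print(distances)
    print(upper_outliers([1.0, 2.0, 3.0, 4.0, 100.0]))
```

Passing the distance function as an argument lets the same routine serve both metrics under comparison.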
An analysis of the obtained charts allows us to conclude that in both cases the data contain outliers, which correspond to significantly higher values of the distance between the respective GEP. These profiles should not be used at the further stage of data classification.

Figure 2: Charts of the distribution of the Pearson chi-square coefficient (a,b) and correlation distance (c,d) values

Figures 3-5 show the classification results for the examined samples. The nearest and differently expressed gene expression profiles in terms of the applied distance metrics were used as the attributes of the examined samples. The number of GEP was increased from 2 to 100 during the simulation procedure. The results are presented as charts of the classification quality criteria calculated by formulas (6)-(8) versus the number of GEP. An analysis of the obtained charts allows us to conclude that the use of the chi-square test is more reasonable than the correlation measure in terms of all the criteria used during the simulation procedure. As can be seen in the charts, when the correlation measure was applied to form the subset of the nearest and differently expressed GEP, the classification results were significantly worse than those obtained with Pearson's chi-square coefficient as the distance function. The obtained results create the conditions for increasing the objectivity of extracting the most informative GEP through careful selection of the distance functions that can be used as components of a complex distance metric calculated from an ensemble of the most effective distance functions.

Figure 3: Charts of classification accuracy values versus the increasing number of the nearest and differently expressed GEP

Figure 4: Charts of F-score values versus the increasing number of the nearest and differently expressed GEP

5.
Conclusions
In this research, we have carried out a comparative analysis of two distance functions, the Pearson chi-square coefficient and the correlation distance, for assessing GEP proximity. The classification results of the investigated objects have been used to evaluate the effectiveness of each distance function. Classification accuracy, F-score and the Matthews correlation coefficient have been used as the classification quality criteria. The gene expression profiles of the GSE19188 dataset, obtained from patients studied for early-stage lung cancer, have been used as the experimental data. According to the data annotation, the tested samples were divided into two groups: healthy patients (65 samples) and patients with cancer tumors (91 samples). We have used 401 differently expressed and mutually correlated gene expression profiles as the experimental dataset during the simulation process.

Figure 5: Charts of Matthews correlation coefficient values versus the increasing number of the nearest and differently expressed GEP

The stepwise procedure of increasing the number of nearest gene expression profiles from 2 to 100, with data classification and calculation of the classification quality criteria at each step, has been implemented during the simulation process. Charts of the classification quality criteria versus the number of gene expression profiles have been obtained for each of the distance functions as the simulation results. An analysis of the obtained charts has allowed us to conclude that the correlation distance metric is less efficient than the Pearson's chi-square coefficient, both in absolute value and in sensitivity.
When the correlation distance metric is used to form the subset of the nearest gene expression profiles, the classification results for the objects that make up the test data subset are significantly worse than the results obtained with Pearson's chi-square coefficient as the distance function. The obtained results also create the conditions for increasing the objectivity of extracting the most informative gene expression profiles through careful selection of the distance functions that can be used as components of a complex distance metric calculated from an ensemble of the most effective distance functions. This is a direction of the authors' further research.

6. References
[1] L. Wang, F. Song, H. Yin, et al. Comparative microRNAs expression profiles analysis during embryonic development of common carp, Cyprinus carpio. Comparative Biochemistry and Physiology - Part D: Genomics and Proteomics, 37 100754 (2021). doi: 10.1016/j.cbd.2020.100754
[2] M. A. Marchetti, D.G. Coit, S.W. Dusza, et al. Performance of Gene Expression Profile Tests for Prognosis in Patients with Localized Cutaneous Melanoma: A Systematic Review and Meta-Analysis. JAMA Dermatology, 156 (9) (2020) 953-962. doi: 10.1001/jamadermatol.2020.1731
[3] K.C. Howlader, M.S. Satu, M.A. Awal, et al. Machine learning models for classification and identification of significant attributes to detect type 2 diabetes. Health Information Science and Systems, 10 (1) 2 (2022). doi: 10.1007/s13755-021-00168-2
[4] L. Zhou, Y. Zhu, T. Zong, Y. Xiang. A feature selection-based method for DDoS attack flow classification. Future Generation Computer Systems, 132 (2022) 67-79. doi: 10.1016/j.future.2022.02.006
[5] N. Almugren, H. Alshamlan. A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access, 7 8736725 (2019) 78533-78548. doi: 10.1109/ACCESS.2019.2922987
[6] S. Park, G. Yi.
Development of Gene Expression-Based Random Forest Model for Predicting Neoadjuvant Chemotherapy Response in Triple-Negative Breast Cancer. Cancers, 14 (4) 881 (2022). doi: 10.3390/cancers14040881
[7] S.M. Snow, K.A. Matkowskyj, M. Maresh, et al. Validation of genetic classifiers derived from mouse and human tumors to identify molecular subtypes of colorectal cancer. Human Pathology, 119 (2022) 1-14. doi: 10.1016/j.humpath.2021.10.002
[8] S. Babichev, J. Škvor. Technique of Gene Expression Profiles Extraction Based on the Complex Use of Clustering and Classification Methods. Diagnostics, 10 (8) 584 (2020). doi: 10.3390/diagnostics10080584
[9] S. Babichev, J. Krejci, J. Bicanek, V. Lytvynenko. Gene expression sequences clustering based on the internal and external clustering quality criteria, 2017, Proceedings of the 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT2017, 1 (2017) 91-94. doi: 10.1109/STC-CSIT.2017.8098744
[10] L.-H. Lee, C.-H. Chen, W.-C. Chang, et al. Evaluating the performance of machine learning models for automatic diagnosis of patients with schizophrenia based on a single site dataset of 440 participants. European Psychiatry, 65 (1) e1 (2022). doi: 10.1192/j.eurpsy.2021.2248
[11] K.-N. Heo, J.-Y. Lee, Y.-M. Ah. Development and validation of a risk-score model for opioid overdose using a national claims database. Scientific Reports, 12 (1) 4974 (2022). doi: 10.1038/s41598-022-09095-y
[12] P. Vitynskyi, R. Tkachenko, I. Izonin, H. Kutucu. Hybridization of the SGTM Neural-Like Structure Through Inputs Polynomial Extension, 2018, Proceedings of the 2018 IEEE 2nd International Conference on Data Stream Mining and Processing, DSMP, 8478456 (2018) 386-391. doi: 10.1109/DSMP.2018.8478456
[13] M. Haghighat, L. Browning, K. Sirinukunwattana, et al. Automated quality assessment of large digitised histology cohorts by artificial intelligence. Scientific Reports, 12 (1) 5002 (2022).
doi: 10.1038/s41598-022-08351-5
[14] M.R. Sabour, M. Besharati, G.A. Dezvareh, M. Hajbabaie, M. Akbari. Application of artificial neural network with the back-propagation algorithm for estimating the amount of polycyclic aromatic hydrocarbons in Tehran Oil Refinery, Iran. Environmental Nanotechnology, Monitoring and Management, 18 100677 (2022). doi: 10.1016/j.enmm.2022.100677
[15] N. Shakhovska, V. Vysotska, L. Chyrun. Intelligent systems design of distance learning realization for modern youth promotion and involvement in independent scientific researches, 2017, Advances in Intelligent Systems and Computing, 512 (2017) 175-198. doi: 10.1007/978-3-319-45991-2_12
[16] B. Liu, W. Gou, H. Feng. Pathological investigations and correlation research of microfibrillar-associated protein 4 and tropoelastin in oral submucous fibrosis. BMC Oral Health, 21 (1) 588 (2021). doi: 10.1186/s12903-021-01962-w
[17] L. Breiman. Random forests. Machine Learning, 45 (2001) 5-32.
[18] S. van Gaal, A. Alimohammadi, A.Y.X. Yu, et al. Accurate classification of carotid endarterectomy indication using physician claims and hospital discharge data. BMC Health Services Research, 22 (1) 379 (2022). doi: 10.1186/s12913-022-07614-1
[19] G. Canbek, T. Taskaya Temizel, S. Sagiroglu. BenchMetrics: a systematic benchmarking method for binary classification performance metrics. Neural Computing and Applications, 33 (21) (2021) 14623-14650. doi: 10.1007/s00521-021-06103-6
[20] J. Hou, J. Aerts, B. den Hamer, et al. Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS ONE, 5 e10312 (2010). doi: 10.1371/journal.pone.0010312