Simplifying p-value calculation for the unbiased microRNA enrichment analysis, using ML-techniques Konstantinos Zagganas Maria Lioli University of the Peloponnese & “Athena” RC National Kapodistrian University of Athens kzagganas@uop.gr mlioli@di.uoa.gr Thanasis Vergoulis Theodore Dalamagas “Athena” RC “Athena” RC vergoulis@athenarc.gr dalamag@athenarc.gr ABSTRACT Indeed, it was shown in 2015 [1] that the data related to miRNA The investigation of the role of small bio-molecules (called mi- targets contain a bias in how genes are targeted, and this means croRNAs) in biological functions is a very popular topic in bioin- that the hypergeometric test leads to biased or invalid results. formatics, since microRNAs have been shown to present novel However, the authors of [1] proposed an unbiased, empirical therapeutic methods for diseases like cancer or Hepatitis C. In miRNA functional enrichment approach, which uses a permuta- order to predict the involvement of microRNAs in biological func- tion (randomization) test to produce empirical p-values. tions many statistical approaches have been used that involve Generally, such approaches measure the strength of associa- p-value calculations, with the most popular one being Fisher’s tion based on the real data, without making assumptions about exact test. However, it has been shown that data distribution does the structure of the data like theoretical statistical tests. In order not match with any of the theoretical distributions used by the to test whether a group of miRNAs is associated to a specific aforementioned approaches. Thus, an empirical randomization biological process, the method calculates a statistical measure approach is preferred. Nevertheless, such analyses are computa- relevant to the miRNA group and compares it to the statistical tionally intensive. In this paper, we present a novel approach for measure calculated for a large number of randomly assembled microRNA enrichment analysis using Machine Learning tech- miRNA groups. Thus, if the number of random groups presenting niques, in order to predict p-values instead of calculating them a better behaviour (in terms of the statistical measure) is small, using randomization experiments. This simplifies the work for then the significance of the association between the query and bioinformatics data analysts, helping them to efficiently perform the biological function under examination is large. multiple enrichment analysis tasks. However, this approach is computationally intensive, because it involves a very large number of operations between sets (union and intersection) leading to large execution times (e.g., in the 1 INTRODUCTION order of hours, on 8 cores of a Intel Xeon CPU). For this reason, Transcriptomics, a popular sub-field of bioinformatics, deals with attempts have been made to alleviate this issue, either by making RNA biomolecules that are produced by DNA transcription and the set operations faster [5] or by using indexing techniques either contain instructions on how to produce proteins or per- to reduce the number of redundant operations performed [7]. form other functions vital to life. One such class of RNAs are These attempts led to a reduction in execution times by an order microRNAs (or miRNAs) which are short biomolecules that “tar- of magnitude, but still a significant amount of time is required get” specific genes and block their expression. This means that to complete such a task (e.g., several minutes are required on a they affect biological functions by preventing protein production. single Xeon CPU core). Given that researchers usually need to Dysregulation of certain miRNAs has been linked to diseases run multiple analyses, the total computational cost can still be like cancer, neuronal diseases, inflammatory diseases, etc [2]. very high. Consequently, predicting how groups of miRNAs affect certain In this paper, we present a novel approach for miRNA en- biological functions is of great importance to researchers. It has richment analysis using machine learning techniques, to predict led to the development of prediction workflows that combine p-values using features of miRNA groups, relevant to the problem, data about miRNA gene-targets, as well as associations between instead of calculating them using randomization experiments. genes and biological functions, and in order to measure the as- This simplifies the work for bioinformatics data analysts, help- sociation between a group of miRNAs and a specific biological ing them to efficiently perform multiple enrichment analysis function using statistical tests. tasks. Our contributions are: (a) framing the problem, (b) data set The most popular statistical test used in the relevant litera- creation and feature engineering, (c) determining a shortlist of ture is Fisher’s exact test [3]. The test uses the hypergeometric promising machine learning models, using cross-validation, and distribution to produce p-values that indicate the strength of (d) fine-tuning to determine the best models for our case. Our the association between the group of miRNAs and a biological approach shows that the best model demonstrates an 𝑅 2 score function.The hypergeometric distribution assumes that every above 90%, and 𝑀𝐴𝐸 = 0.048. gene has an equal probability of being targeted by a miRNA. However, complex biological mechanisms and effects (modeled by prediction algorithms) make such an assumption unrealistic. 2 BACKGROUND © 2021 Copyright for this paper by its author(s). Published in the Workshop Proceed- 2.1 Permutation test ings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 Given a miRNA group as a query, the biological function permu- International (CC BY 4.0) tation test introduced in [1] consists of the following steps: (1) Calculate a statistical measure 𝑆 (e.g., left-sided overlap about 3 minutes (on a single core of the Xeon processor). The - see below), that captures a type of ‘relevance’ of the downside to the latter approach, is that p-values are produced biological function with the query, according to the genes only for the functions expected to be significant and not all bio- that are related to both of them. logical functions in the dataset. Consequently, we were motivated (2) Create a large number (e.g., 1 million) of randomly assem- to use ML to train a model that will predict p-values very quickly bled miRNA groups, with each containing the same num- and for every biological function in the data set. ber of miRNAs, and calculate 𝑆 for each of these groups, as well. 3 MACHINE LEARNING FOR P-VALUE (3) Measure the proportion of randomly assembled groups PREDICTION that present more favourable values for 𝑆 than the query. 3.1 Features More formally, each miRNA is represented as the set of genes In order to train a machine learning model to predict p-values targeted by it. Consequently, a group of miRNAs is represented as (p_value) as accurately as possible, we selected features from a set, containing the union of all genes targeted by each miRNA our dataset based on their biological meaning and their relevance in the group. Moreover, a biological function is also represented to the analysis. The features are summarized below: as the set of genes participating in that function. miRNA group size (mirna_group_size): The number of miR- Based on the previous, we can now describe one popular per- NAs in a miRNA group. mutation test: Let 𝑀 be the set of genes containing the union Biological function ID (biological_process): a unique string of targets from all miRNAs in a query group, and also let 𝐵 the that identifies each biological function. This string consists of 2 set of genes participating in a biological function. The statistical letters (same for all biological functions) and a numerical part. measure used to compare the query to the randomly assembled We turned this ID into a numerical value, by stripping the letters miRNA groups is the left-sided overlap and it is defined as follows: from the string. 𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀 ∩ 𝐵) Number of common genes (number_of_common_genes): The 𝑙𝑒 𝑓 𝑡-𝑠𝑖𝑑𝑒𝑑 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀, 𝐵) = number of genes targeted by a miRNA group that also belong to 𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀) the biological function. Essentially, the left-sided overlap is defined as the proportion Left-sided overlap (left_sided_overlap): the left-sided over- of targeted genes that also participate in the biological function. lap as defined in Section 2.1. Then, we create 1 million random miRNA groups 𝑀 𝑗 with the Right-sided overlap (right_sided_overlap): Given 𝑀 and 𝐵 same size as the query and calculate the left-sided overlap for from Section 2.1, the right-sided overlap is defined as: each of them. The empirical p-value is defined as (where overlap is the left-sided overlap): 𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀 ∩ 𝐵) 𝑟𝑖𝑔ℎ𝑡-𝑠𝑖𝑑𝑒𝑑 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀, 𝐵) = 𝑠𝑖𝑧𝑒𝑜 𝑓 (𝐵) 𝑠𝑖𝑧𝑒𝑜 𝑓 ({𝑀 𝑗 : 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀 𝑗 , 𝐵) ≥ 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀, 𝐵)}) Two-sided overlap (overlap): Given 𝑀 and 𝐵 from Section 2.1, 𝑝 − 𝑣𝑎𝑙𝑢𝑒 = the two-sided overlap (or Jaccard coefficient) is defined as: 𝑛 which is the proportion of randomly assembled groups presenting 𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀 ∩ 𝐵) 𝑡𝑤𝑜-𝑠𝑖𝑑𝑒𝑑 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀, 𝐵) = a larger left-sided overlap than the query. 𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀 ∪ 𝐵) Common genes as a percentage of the universe of genes 2.2 Performance issues (common_genes_proportion_to_total): The number of com- The above analysis relies on a very large number of union and mon genes between M and G as the proportion of the total num- intersection set operations. Given that this analysis is performed ber of genes in the universe of genes. for more than one biological functions, and that more than 20K Common gene list (common_genes): The list of common genes, biological functions exist, a few million union and about 20 billion sorted by alphabetical order. We used label encoding to turn the intersection operations are performed in the span of the analysis. values into categorical values. The software implemented by the authors in [1] is written in Chromosomes of common genes (common_chr): The list of Python, uses hash join set operations and a typical analysis on chromosomes on which the common genes are located, sorted a single CPU core of an Intel Xeon CPU requires many hours. by alphabetical order. We used label encoding to turn the values However, based on the fact that this analysis is very repetitive, into categorical values. even a small increase in operation speed is going to lead to a large Number of chromosomes where common genes are located total speedup. With this motivation, in [5] we re-implemented (number_of_common_chr): The number of chromosomes on which the algorithm in C++. We also improved the analysis performance the common genes are located. by exploiting bit vectors, as well as a hybrid version of hash join Left chromosome overlap (chr_left_sided_overlap) : The between sets of items and bit vectors. This made the analysis one number of chromosomes on which the common genes are located order of magnitude faster requiring about 40 minutes to complete divided by the number of chromosomes of the genes targeted by on a single core of the same Xeon computer. the miRNA group. Then in [7], we introduced novel indexing techniques, that Right chromosome overlap (chr_right_sided_overlap): allowed us to remove a large number of redundant operations The number of chromosomes on which the common genes are performed by our previous version and thus managed to reduce located, divided by the number of chromosomes of the genes the time to approximately half of what was required previously belonging in the biological function. on a single core. Furthermore, we used a technique that allowed Two sided chromosome overlap (chr_overlap): The number us to run the analysis on only a subset of the biological functions of chromosomes on which the common genes are located, di- (which are predicted to be statistically significant) and managed vided by the number of chromosomes for the union of the genes to further reduce the time required for p-value calculation to contained in 𝑀 and 𝐵. Using these features, we produced a dataset for groups of miR- 4.2 Preliminary results NAs of size 10, 25, 50, 100 (containing 100 groups of each size) In this section we are going to perform a preliminary evaluation for all biological functions as described in the Gene Ontology [6]. of the algorithms we outlined in Section 3.2. In order to evaluate Finally, to handle categorical features (miRNA group size, bio- and compare the algorithms, we are going to use the following logical function, common gene list, chromosomes of common statistical measures [4]: genes) we used label encoding. (1) Mean Absolute Error (𝑀𝐴𝐸): the 𝑀𝐴𝐸 is the mean abso- 3.2 ML Algorithms lute difference between observed and predicted values. Essentially, it is used to compare predictions to actual In order to find the best method to use for our case, we are going observed values. to use the following algorithms: (2) Coefficient of determination (𝑅 2 ): the 𝑅 2 score is a sta- • Linear/Ridge/Lasso Regressors: traditional regression tistical measure of how close are the predictions to the methods, fitting a linear equation that uses the least squares real data points. It is essentially a goodness-of-fit measure- method (with several variants of regularization). ment. • Decision Tree Regressor: uses decision trees to make predictions. Model selection and implementation was performed with the • Random Forest Regressor: estimates target value by method of k-Fold cross-validation. We first split the dataset into combining average estimation values of several individual the training and testing dataset. The training dataset is split into prediction models based on classifying decision trees for k folds (groups). Then k-1 of those groups are used to train the a number of subsets of the data set. model and the group left is used for validation. The score of the • Adaboost Regressor: combines multiple weak decision validation is recorded. Then, the process is repeated, but this trees into one. time another group is used for validation and the rest for re- • Gradient Boost (XGBoost, LightGBM, CatBoost): in training, until all folds have been exhausted. The average of all XGBoost, the estimation of the target is done by com- validation scores is the final validation score. Finally, the test score bining estimates of many individual prediction models is retrieved by testing the model against the testing data set. Final based on decision trees. LightGBM is similar to XGBoost, validation scores for each algorithm are shown in Table 1 and the but prediction is much faster than XGBoost. Regarding testing scores in Table 2. Model parameters were selected via Grid CatBoost, the key difference is that it builds decision sym- Search for the Linear Regressors and the rest via Random Search, metric trees. Both LightGBM and CatBoost can inherently due to the large number of parameters and memory constraints. handle categorical features. It should be noted that the algorithms based on linear regres- • MLPerceptron Regressor: typical Neural Net configura- sion models do not perform as well as the other algorithms. Our tion. approach shows that the best model is LightGBM, demonstrating a 𝑀𝐴𝐸 = 0.048. 4 EVALUATION In this section, we present preliminary results on a dataset that 5 CONCLUSION & FUTURE WORK has been created by only using 1000 random miRNA groups to In this paper, we presented an approach for miRNA enrichment calculate p-values. This was done due to time constraints and to analysis using machine learning techniques, in order to predict reach some preliminary conclusions about the dataset, in order p-values instead of calculating them using randomization experi- to compare the algorithms presented in Section 3.2. Given the ments. The goal is to simplify the work for bioinformatics data definition of an empirical p-value in Section 2.1, we expect that analysts, facilitating multiple enrichment analysis. Preliminary the p-values in the dataset will present a small accuracy, since the results showed that the best model demonstrates an 𝑅 2 score number of random miRNA groups is three orders of magnitude above 90%, and 𝑀𝐴𝐸 = 0.048. As next steps, we plan to expand smaller than the required number. This is expected to create our dataset to include p-values estimated using 1 million random larger errors than the ones that the permutation test produces. groups in order increase their accuracy as well as the accuracy However, these preliminary results are a first indicator on which of the prediction. algorithms present the worst performance in terms of accuracy and thus, which algorithms will be featured in the final work. ACKNOWLEDGMENTS We acknowledge support of this work by the project “Moving 4.1 Linear correlation from Big Data Management to Data Science" (MIS 5002437/3) In Figure 1 we can see a heat map of the linear correlation between which is implemented under the Action“Reinforcement of the Re- the features. Each cell represents the correlation measured by search and Innovation Infrastructure”, funded by the Operational Pearson correlation coefficient 𝑟 between features in row x and Programme “Competitiveness, Entrepreneurship and Innovation” column y. The values of 𝑟 range between -1 and 1. The larger (NSRF 2014-2020) and co-financed by Greece and the European the deviation from 0 the stronger the positive (for values closer Union (European Regional Development Fund). to 1) or negative (for values closer to -1) correlation is. It is interesting to note here that the p-value, which is the feature we want to predict, does not seem to have a strong linear correlation REFERENCES [1] Thomas Bleazard, Janine A Lamb, and Sam Griffiths-Jones. 2015. Bias in mi- (either positive or negative) with any of the features individually. croRNA functional enrichment analysis. Bioinformatics 31, 10 (01 2015). Nevertheless, this does not mean that the p-value is not related [2] Shailendra Dwivedi, Purvi Purohit, and Praveen Sharma. 2019. MicroRNAs with the features in a non-linear way. More specifically, it is and Diseases: Promising Biomarkers for Diagnosis and Therapeutics. Indian Journal of Clinical Biochemistry 34, 3 (01 Jul 2019), 243–245. expected to be related with the left-sided overlap (see p-value [3] R. A. Fisher. 1992. Statistical Methods for Research Workers. Springer New York, definition in Section 2.1). New York, NY, 66–70. Figure 1: Linear correlation between the selected features 𝑀𝐴𝐸 𝑅2 Model parameters Linear Regressor 0.242 0.246 copy_x = True, fit_intercept = False, normalize = True Ridge Regressor 0.242 0.2462 alpha=1, copy_x = True, fit_intercept = False, normalize = True, solver = auto, tol = 0.001 Lasso Regressor 0.4364 4 × 𝑒 −6 default Decision Tree Regressor 0.1027 N/A min_samples_leaf = 2, min_samples_split = 5,max_depth = None,max_features = None Random Forest Regressor 0.0992 N/A n_estimators = 50, n_jobs = -1, min_samples_leaf = 2, min_samples_split = 2, verbose = 10 Adaboost Regressor 0.13 N/A n_estimators = 20, loss = square, base_estimator = RandomForestRegressor, (n_jobs =-1, ver- bose=2, max_depth=25, n_estimators=4) LightGBM Regressor 0.038 N/A objective = poisson, boosting_type = gbdt, learning_rate = 0.5, n_estimators = 900, n_jobs = -1, num_leaves = 40, max_depth=20, reg_alpha = 1, reg_lambda = 1, force_col_wise = True XGBoost Regressor 0.0745 N/A verbosity = 2, max_depth = 20, n_estimators=200 CatBoost Regressor 0.13 N/A n_estimators=1000, learning_rate=0.5, bootstrap_type=Bernoulli, ML Perceptron Regressor 0.1817 N/A hidden_layer_sizes = (24,24,24), activation = "relu", max_iter = 40, learning_rate=’adaptive’ Table 1: Algorithm evaluation: validation set 𝑀𝐴𝐸 𝑅2 Model parameters Random Forest Regressor 0.1443 N/A n_estimators = 50, n_jobs = -1, min_samples_leaf = 2, min_samples_split = 2, verbose = 10 LightGBM Regressor 0.0477 N/A objective = poisson, boosting_type = gbdt, learning_rate = 0.5, n_estimators = 900, n_jobs = -1, num_leaves = 40, max_depth=20, reg_alpha = 1, reg_lambda = 1, force_col_wise = True XGBoost Regressor 0.136 N/A objective = poisson, boosting_type = gbdt, learning_rate = 0.5, n_estimators = 900, n_jobs = -1, num_leaves = 40, max_depth=20, reg_alpha = 1, reg_lambda = 1, force_col_wise = True Table 2: Algorithm evaluation: test set [4] A. Géron. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and Management (SSDBM 2020). Association for Computing Machinery (ACM). TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, Incorporated. [5] K.Zagganas, T. Vergoulis, M. D. Paraskevopoulou, I. S. Vlachos, S. Skiadopoulos, and T. Dalamagas. 2017. BUFET: boosting the unbiased miRNA functional enrichment analysis using bitsets. BMC Bioinformatics (2017). [6] The Gene Ontology Consortium. 2018. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research 47, D1 (11 2018), D330–D338. [7] K. Zagganas, T. Vergoulis, S. Skiadopoulos, and T. Dalamagas. 2020. Efficient Calculation of Empirical P-Values for Association Testing of Binary Classifi- cations. In 32nd International Conference on Scientific and Statistical Database