Simplifying p-value calculation for the unbiased microRNA
            enrichment analysis, using ML-techniques
                        Konstantinos Zagganas                                                                  Maria Lioli
           University of the Peloponnese & “Athena” RC                                      National Kapodistrian University of Athens
                         kzagganas@uop.gr                                                               mlioli@di.uoa.gr

                            Thanasis Vergoulis                                                         Theodore Dalamagas
                                “Athena” RC                                                                  “Athena” RC
                            vergoulis@athenarc.gr                                                        dalamag@athenarc.gr

ABSTRACT                                                                                   Indeed, it was shown in 2015 [1] that the data related to miRNA
The investigation of the role of small bio-molecules (called mi-                       targets contain a bias in how genes are targeted, and this means
croRNAs) in biological functions is a very popular topic in bioin-                     that the hypergeometric test leads to biased or invalid results.
formatics, since microRNAs have been shown to present novel                            However, the authors of [1] proposed an unbiased, empirical
therapeutic methods for diseases like cancer or Hepatitis C. In                        miRNA functional enrichment approach, which uses a permuta-
order to predict the involvement of microRNAs in biological func-                      tion (randomization) test to produce empirical p-values.
tions many statistical approaches have been used that involve                              Generally, such approaches measure the strength of associa-
p-value calculations, with the most popular one being Fisher’s                         tion based on the real data, without making assumptions about
exact test. However, it has been shown that data distribution does                     the structure of the data like theoretical statistical tests. In order
not match with any of the theoretical distributions used by the                        to test whether a group of miRNAs is associated to a specific
aforementioned approaches. Thus, an empirical randomization                            biological process, the method calculates a statistical measure
approach is preferred. Nevertheless, such analyses are computa-                        relevant to the miRNA group and compares it to the statistical
tionally intensive. In this paper, we present a novel approach for                     measure calculated for a large number of randomly assembled
microRNA enrichment analysis using Machine Learning tech-                              miRNA groups. Thus, if the number of random groups presenting
niques, in order to predict p-values instead of calculating them                       a better behaviour (in terms of the statistical measure) is small,
using randomization experiments. This simplifies the work for                          then the significance of the association between the query and
bioinformatics data analysts, helping them to efficiently perform                      the biological function under examination is large.
multiple enrichment analysis tasks.                                                        However, this approach is computationally intensive, because
                                                                                       it involves a very large number of operations between sets (union
                                                                                       and intersection) leading to large execution times (e.g., in the
1    INTRODUCTION                                                                      order of hours, on 8 cores of a Intel Xeon CPU). For this reason,
Transcriptomics, a popular sub-field of bioinformatics, deals with                     attempts have been made to alleviate this issue, either by making
RNA biomolecules that are produced by DNA transcription and                            the set operations faster [5] or by using indexing techniques
either contain instructions on how to produce proteins or per-                         to reduce the number of redundant operations performed [7].
form other functions vital to life. One such class of RNAs are                         These attempts led to a reduction in execution times by an order
microRNAs (or miRNAs) which are short biomolecules that “tar-                          of magnitude, but still a significant amount of time is required
get” specific genes and block their expression. This means that                        to complete such a task (e.g., several minutes are required on a
they affect biological functions by preventing protein production.                     single Xeon CPU core). Given that researchers usually need to
Dysregulation of certain miRNAs has been linked to diseases                            run multiple analyses, the total computational cost can still be
like cancer, neuronal diseases, inflammatory diseases, etc [2].                        very high.
Consequently, predicting how groups of miRNAs affect certain                               In this paper, we present a novel approach for miRNA en-
biological functions is of great importance to researchers. It has                     richment analysis using machine learning techniques, to predict
led to the development of prediction workflows that combine                            p-values using features of miRNA groups, relevant to the problem,
data about miRNA gene-targets, as well as associations between                         instead of calculating them using randomization experiments.
genes and biological functions, and in order to measure the as-                        This simplifies the work for bioinformatics data analysts, help-
sociation between a group of miRNAs and a specific biological                          ing them to efficiently perform multiple enrichment analysis
function using statistical tests.                                                      tasks. Our contributions are: (a) framing the problem, (b) data set
   The most popular statistical test used in the relevant litera-                      creation and feature engineering, (c) determining a shortlist of
ture is Fisher’s exact test [3]. The test uses the hypergeometric                      promising machine learning models, using cross-validation, and
distribution to produce p-values that indicate the strength of                         (d) fine-tuning to determine the best models for our case. Our
the association between the group of miRNAs and a biological                           approach shows that the best model demonstrates an 𝑅 2 score
function.The hypergeometric distribution assumes that every                            above 90%, and 𝑀𝐴𝐸 = 0.048.
gene has an equal probability of being targeted by a miRNA.
However, complex biological mechanisms and effects (modeled
by prediction algorithms) make such an assumption unrealistic.                         2 BACKGROUND
© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceed-   2.1 Permutation test
ings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus)
on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0           Given a miRNA group as a query, the biological function permu-
International (CC BY 4.0)                                                              tation test introduced in [1] consists of the following steps:
   (1) Calculate a statistical measure 𝑆 (e.g., left-sided overlap     about 3 minutes (on a single core of the Xeon processor). The
       - see below), that captures a type of ‘relevance’ of the        downside to the latter approach, is that p-values are produced
       biological function with the query, according to the genes      only for the functions expected to be significant and not all bio-
       that are related to both of them.                               logical functions in the dataset. Consequently, we were motivated
   (2) Create a large number (e.g., 1 million) of randomly assem-      to use ML to train a model that will predict p-values very quickly
       bled miRNA groups, with each containing the same num-           and for every biological function in the data set.
       ber of miRNAs, and calculate 𝑆 for each of these groups,
       as well.                                                        3  MACHINE LEARNING FOR P-VALUE
   (3) Measure the proportion of randomly assembled groups                PREDICTION
       that present more favourable values for 𝑆 than the query.
                                                                       3.1 Features
More formally, each miRNA is represented as the set of genes
                                                                       In order to train a machine learning model to predict p-values
targeted by it. Consequently, a group of miRNAs is represented as
                                                                       (p_value) as accurately as possible, we selected features from
a set, containing the union of all genes targeted by each miRNA
                                                                       our dataset based on their biological meaning and their relevance
in the group. Moreover, a biological function is also represented
                                                                       to the analysis. The features are summarized below:
as the set of genes participating in that function.
                                                                       miRNA group size (mirna_group_size): The number of miR-
   Based on the previous, we can now describe one popular per-
                                                                       NAs in a miRNA group.
mutation test: Let 𝑀 be the set of genes containing the union
                                                                       Biological function ID (biological_process): a unique string
of targets from all miRNAs in a query group, and also let 𝐵 the
                                                                       that identifies each biological function. This string consists of 2
set of genes participating in a biological function. The statistical
                                                                       letters (same for all biological functions) and a numerical part.
measure used to compare the query to the randomly assembled
                                                                       We turned this ID into a numerical value, by stripping the letters
miRNA groups is the left-sided overlap and it is defined as follows:
                                                                       from the string.
                                           𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀 ∩ 𝐵)             Number of common genes (number_of_common_genes): The
          𝑙𝑒 𝑓 𝑡-𝑠𝑖𝑑𝑒𝑑 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀, 𝐵) =                                number of genes targeted by a miRNA group that also belong to
                                             𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀)
                                                                       the biological function.
    Essentially, the left-sided overlap is defined as the proportion
                                                                       Left-sided overlap (left_sided_overlap): the left-sided over-
of targeted genes that also participate in the biological function.
                                                                       lap as defined in Section 2.1.
Then, we create 1 million random miRNA groups 𝑀 𝑗 with the
                                                                       Right-sided overlap (right_sided_overlap): Given 𝑀 and 𝐵
same size as the query and calculate the left-sided overlap for
                                                                       from Section 2.1, the right-sided overlap is defined as:
each of them. The empirical p-value is defined as (where overlap
is the left-sided overlap):                                                                                     𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀 ∩ 𝐵)
                                                                                 𝑟𝑖𝑔ℎ𝑡-𝑠𝑖𝑑𝑒𝑑 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀, 𝐵) =
                                                                                                                   𝑠𝑖𝑧𝑒𝑜 𝑓 (𝐵)
                𝑠𝑖𝑧𝑒𝑜 𝑓 ({𝑀 𝑗 : 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀 𝑗 , 𝐵) ≥ 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀, 𝐵)})   Two-sided overlap (overlap): Given 𝑀 and 𝐵 from Section 2.1,
 𝑝 − 𝑣𝑎𝑙𝑢𝑒 =                                                           the two-sided overlap (or Jaccard coefficient) is defined as:
                                         𝑛
which is the proportion of randomly assembled groups presenting                                                 𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀 ∩ 𝐵)
                                                                                 𝑡𝑤𝑜-𝑠𝑖𝑑𝑒𝑑 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 (𝑀, 𝐵) =
a larger left-sided overlap than the query.                                                                     𝑠𝑖𝑧𝑒𝑜 𝑓 (𝑀 ∪ 𝐵)
                                                                       Common genes as a percentage of the universe of genes
2.2    Performance issues                                              (common_genes_proportion_to_total): The number of com-
The above analysis relies on a very large number of union and          mon genes between M and G as the proportion of the total num-
intersection set operations. Given that this analysis is performed     ber of genes in the universe of genes.
for more than one biological functions, and that more than 20K         Common gene list (common_genes): The list of common genes,
biological functions exist, a few million union and about 20 billion   sorted by alphabetical order. We used label encoding to turn the
intersection operations are performed in the span of the analysis.     values into categorical values.
   The software implemented by the authors in [1] is written in        Chromosomes of common genes (common_chr): The list of
Python, uses hash join set operations and a typical analysis on        chromosomes on which the common genes are located, sorted
a single CPU core of an Intel Xeon CPU requires many hours.            by alphabetical order. We used label encoding to turn the values
However, based on the fact that this analysis is very repetitive,      into categorical values.
even a small increase in operation speed is going to lead to a large   Number of chromosomes where common genes are located
total speedup. With this motivation, in [5] we re-implemented          (number_of_common_chr): The number of chromosomes on which
the algorithm in C++. We also improved the analysis performance        the common genes are located.
by exploiting bit vectors, as well as a hybrid version of hash join    Left chromosome overlap (chr_left_sided_overlap) : The
between sets of items and bit vectors. This made the analysis one      number of chromosomes on which the common genes are located
order of magnitude faster requiring about 40 minutes to complete       divided by the number of chromosomes of the genes targeted by
on a single core of the same Xeon computer.                            the miRNA group.
   Then in [7], we introduced novel indexing techniques, that          Right chromosome overlap (chr_right_sided_overlap):
allowed us to remove a large number of redundant operations            The number of chromosomes on which the common genes are
performed by our previous version and thus managed to reduce           located, divided by the number of chromosomes of the genes
the time to approximately half of what was required previously         belonging in the biological function.
on a single core. Furthermore, we used a technique that allowed        Two sided chromosome overlap (chr_overlap): The number
us to run the analysis on only a subset of the biological functions    of chromosomes on which the common genes are located, di-
(which are predicted to be statistically significant) and managed      vided by the number of chromosomes for the union of the genes
to further reduce the time required for p-value calculation to         contained in 𝑀 and 𝐵.
   Using these features, we produced a dataset for groups of miR-        4.2     Preliminary results
NAs of size 10, 25, 50, 100 (containing 100 groups of each size)         In this section we are going to perform a preliminary evaluation
for all biological functions as described in the Gene Ontology [6].      of the algorithms we outlined in Section 3.2. In order to evaluate
Finally, to handle categorical features (miRNA group size, bio-          and compare the algorithms, we are going to use the following
logical function, common gene list, chromosomes of common                statistical measures [4]:
genes) we used label encoding.
                                                                             (1) Mean Absolute Error (𝑀𝐴𝐸): the 𝑀𝐴𝐸 is the mean abso-
3.2     ML Algorithms                                                            lute difference between observed and predicted values.
                                                                                 Essentially, it is used to compare predictions to actual
In order to find the best method to use for our case, we are going
                                                                                 observed values.
to use the following algorithms:
                                                                             (2) Coefficient of determination (𝑅 2 ): the 𝑅 2 score is a sta-
      • Linear/Ridge/Lasso Regressors: traditional regression                    tistical measure of how close are the predictions to the
        methods, fitting a linear equation that uses the least squares           real data points. It is essentially a goodness-of-fit measure-
        method (with several variants of regularization).                        ment.
      • Decision Tree Regressor: uses decision trees to make
        predictions.                                                         Model selection and implementation was performed with the
      • Random Forest Regressor: estimates target value by               method of k-Fold cross-validation. We first split the dataset into
        combining average estimation values of several individual        the training and testing dataset. The training dataset is split into
        prediction models based on classifying decision trees for        k folds (groups). Then k-1 of those groups are used to train the
        a number of subsets of the data set.                             model and the group left is used for validation. The score of the
      • Adaboost Regressor: combines multiple weak decision              validation is recorded. Then, the process is repeated, but this
        trees into one.                                                  time another group is used for validation and the rest for re-
      • Gradient Boost (XGBoost, LightGBM, CatBoost): in                 training, until all folds have been exhausted. The average of all
        XGBoost, the estimation of the target is done by com-            validation scores is the final validation score. Finally, the test score
        bining estimates of many individual prediction models            is retrieved by testing the model against the testing data set. Final
        based on decision trees. LightGBM is similar to XGBoost,         validation scores for each algorithm are shown in Table 1 and the
        but prediction is much faster than XGBoost. Regarding            testing scores in Table 2. Model parameters were selected via Grid
        CatBoost, the key difference is that it builds decision sym-     Search for the Linear Regressors and the rest via Random Search,
        metric trees. Both LightGBM and CatBoost can inherently          due to the large number of parameters and memory constraints.
        handle categorical features.                                         It should be noted that the algorithms based on linear regres-
      • MLPerceptron Regressor: typical Neural Net configura-            sion models do not perform as well as the other algorithms. Our
        tion.                                                            approach shows that the best model is LightGBM, demonstrating
                                                                         a 𝑀𝐴𝐸 = 0.048.
4     EVALUATION
In this section, we present preliminary results on a dataset that        5     CONCLUSION & FUTURE WORK
has been created by only using 1000 random miRNA groups to               In this paper, we presented an approach for miRNA enrichment
calculate p-values. This was done due to time constraints and to         analysis using machine learning techniques, in order to predict
reach some preliminary conclusions about the dataset, in order           p-values instead of calculating them using randomization experi-
to compare the algorithms presented in Section 3.2. Given the            ments. The goal is to simplify the work for bioinformatics data
definition of an empirical p-value in Section 2.1, we expect that        analysts, facilitating multiple enrichment analysis. Preliminary
the p-values in the dataset will present a small accuracy, since the     results showed that the best model demonstrates an 𝑅 2 score
number of random miRNA groups is three orders of magnitude               above 90%, and 𝑀𝐴𝐸 = 0.048. As next steps, we plan to expand
smaller than the required number. This is expected to create             our dataset to include p-values estimated using 1 million random
larger errors than the ones that the permutation test produces.          groups in order increase their accuracy as well as the accuracy
However, these preliminary results are a first indicator on which        of the prediction.
algorithms present the worst performance in terms of accuracy
and thus, which algorithms will be featured in the final work.           ACKNOWLEDGMENTS
                                                                         We acknowledge support of this work by the project “Moving
4.1     Linear correlation
                                                                         from Big Data Management to Data Science" (MIS 5002437/3)
In Figure 1 we can see a heat map of the linear correlation between      which is implemented under the Action“Reinforcement of the Re-
the features. Each cell represents the correlation measured by           search and Innovation Infrastructure”, funded by the Operational
Pearson correlation coefficient 𝑟 between features in row x and          Programme “Competitiveness, Entrepreneurship and Innovation”
column y. The values of 𝑟 range between -1 and 1. The larger             (NSRF 2014-2020) and co-financed by Greece and the European
the deviation from 0 the stronger the positive (for values closer        Union (European Regional Development Fund).
to 1) or negative (for values closer to -1) correlation is. It is
interesting to note here that the p-value, which is the feature we
want to predict, does not seem to have a strong linear correlation
                                                                         REFERENCES
                                                                         [1] Thomas Bleazard, Janine A Lamb, and Sam Griffiths-Jones. 2015. Bias in mi-
(either positive or negative) with any of the features individually.         croRNA functional enrichment analysis. Bioinformatics 31, 10 (01 2015).
Nevertheless, this does not mean that the p-value is not related         [2] Shailendra Dwivedi, Purvi Purohit, and Praveen Sharma. 2019. MicroRNAs
with the features in a non-linear way. More specifically, it is              and Diseases: Promising Biomarkers for Diagnosis and Therapeutics. Indian
                                                                             Journal of Clinical Biochemistry 34, 3 (01 Jul 2019), 243–245.
expected to be related with the left-sided overlap (see p-value          [3] R. A. Fisher. 1992. Statistical Methods for Research Workers. Springer New York,
definition in Section 2.1).                                                  New York, NY, 66–70.
                                               Figure 1: Linear correlation between the selected features

                                                               𝑀𝐴𝐸           𝑅2            Model parameters
       Linear Regressor                                         0.242       0.246          copy_x = True, fit_intercept = False, normalize = True
       Ridge Regressor                                          0.242      0.2462          alpha=1, copy_x = True, fit_intercept = False, normalize = True, solver = auto, tol = 0.001
       Lasso Regressor                                         0.4364      4 × 𝑒 −6        default
       Decision Tree Regressor                                 0.1027        N/A           min_samples_leaf = 2, min_samples_split = 5,max_depth = None,max_features = None
       Random Forest Regressor                                 0.0992        N/A           n_estimators = 50, n_jobs = -1, min_samples_leaf = 2, min_samples_split = 2, verbose = 10
       Adaboost Regressor                                       0.13         N/A           n_estimators = 20, loss = square, base_estimator = RandomForestRegressor, (n_jobs =-1, ver-
                                                                                           bose=2, max_depth=25, n_estimators=4)
       LightGBM Regressor                                       0.038        N/A           objective = poisson, boosting_type = gbdt, learning_rate = 0.5, n_estimators = 900, n_jobs =
                                                                                           -1, num_leaves = 40, max_depth=20, reg_alpha = 1, reg_lambda = 1, force_col_wise = True
       XGBoost Regressor                                    0.0745    N/A    verbosity = 2, max_depth = 20, n_estimators=200
       CatBoost Regressor                                    0.13     N/A    n_estimators=1000, learning_rate=0.5, bootstrap_type=Bernoulli,
       ML Perceptron Regressor                              0.1817    N/A    hidden_layer_sizes = (24,24,24), activation = "relu", max_iter = 40, learning_rate=’adaptive’
                                                        Table 1: Algorithm evaluation: validation set


                                                                 𝑀𝐴𝐸         𝑅2          Model parameters
         Random Forest Regressor                                 0.1443      N/A         n_estimators = 50, n_jobs = -1, min_samples_leaf = 2, min_samples_split = 2, verbose = 10
         LightGBM Regressor                                      0.0477      N/A         objective = poisson, boosting_type = gbdt, learning_rate = 0.5, n_estimators = 900, n_jobs =
                                                                                         -1, num_leaves = 40, max_depth=20, reg_alpha = 1, reg_lambda = 1, force_col_wise = True
         XGBoost Regressor                                        0.136      N/A         objective = poisson, boosting_type = gbdt, learning_rate = 0.5, n_estimators = 900, n_jobs =
                                                                                         -1, num_leaves = 40, max_depth=20, reg_alpha = 1, reg_lambda = 1, force_col_wise = True
                                                             Table 2: Algorithm evaluation: test set


[4] A. Géron. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and                           Management (SSDBM 2020). Association for Computing Machinery (ACM).
    TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly
    Media, Incorporated.
[5] K.Zagganas, T. Vergoulis, M. D. Paraskevopoulou, I. S. Vlachos, S. Skiadopoulos,
    and T. Dalamagas. 2017. BUFET: boosting the unbiased miRNA functional
    enrichment analysis using bitsets. BMC Bioinformatics (2017).
[6] The Gene Ontology Consortium. 2018. The Gene Ontology Resource: 20 years
    and still GOing strong. Nucleic Acids Research 47, D1 (11 2018), D330–D338.
[7] K. Zagganas, T. Vergoulis, S. Skiadopoulos, and T. Dalamagas. 2020. Efficient
    Calculation of Empirical P-Values for Association Testing of Binary Classifi-
    cations. In 32nd International Conference on Scientific and Statistical Database