<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Simplifying p-value calculation for the unbiased microRNA enrichment analysis, using ML-techniques</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Konstantinos Zagganas University of the Peloponnese &amp; “Athena” RC</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Maria Lioli National Kapodistrian University of Athens</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Theodore Dalamagas “Athena” RC</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The investigation of the role of small bio-molecules (called microRNAs) in biological functions is a very popular topic in bioinformatics, since microRNAs have been shown to present novel therapeutic methods for diseases like cancer or Hepatitis C. In order to predict the involvement of microRNAs in biological functions many statistical approaches have been used that involve p-value calculations, with the most popular one being Fisher's exact test. However, it has been shown that data distribution does not match with any of the theoretical distributions used by the aforementioned approaches. Thus, an empirical randomization approach is preferred. Nevertheless, such analyses are computationally intensive. In this paper, we present a novel approach for microRNA enrichment analysis using Machine Learning techniques, in order to predict p-values instead of calculating them using randomization experiments. This simplifies the work for bioinformatics data analysts, helping them to eficiently perform multiple enrichment analysis tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Transcriptomics, a popular sub-field of bioinformatics, deals with
RNA biomolecules that are produced by DNA transcription and
either contain instructions on how to produce proteins or
perform other functions vital to life. One such class of RNAs are
microRNAs (or miRNAs) which are short biomolecules that
“target” specific genes and block their expression. This means that
they afect biological functions by preventing protein production.
Dysregulation of certain miRNAs has been linked to diseases
like cancer, neuronal diseases, inflammatory diseases, etc [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Consequently, predicting how groups of miRNAs afect certain
biological functions is of great importance to researchers. It has
led to the development of prediction workflows that combine
data about miRNA gene-targets, as well as associations between
genes and biological functions, and in order to measure the
association between a group of miRNAs and a specific biological
function using statistical tests.
      </p>
      <p>
        The most popular statistical test used in the relevant
literature is Fisher’s exact test [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The test uses the hypergeometric
distribution to produce p-values that indicate the strength of
the association between the group of miRNAs and a biological
function.The hypergeometric distribution assumes that every
gene has an equal probability of being targeted by a miRNA.
However, complex biological mechanisms and efects (modeled
by prediction algorithms) make such an assumption unrealistic.
      </p>
      <p>
        Indeed, it was shown in 2015 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that the data related to miRNA
targets contain a bias in how genes are targeted, and this means
that the hypergeometric test leads to biased or invalid results.
However, the authors of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed an unbiased, empirical
miRNA functional enrichment approach, which uses a
permutation (randomization) test to produce empirical p-values.
      </p>
      <p>Generally, such approaches measure the strength of
association based on the real data, without making assumptions about
the structure of the data like theoretical statistical tests. In order
to test whether a group of miRNAs is associated to a specific
biological process, the method calculates a statistical measure
relevant to the miRNA group and compares it to the statistical
measure calculated for a large number of randomly assembled
miRNA groups. Thus, if the number of random groups presenting
a better behaviour (in terms of the statistical measure) is small,
then the significance of the association between the query and
the biological function under examination is large.</p>
      <p>
        However, this approach is computationally intensive, because
it involves a very large number of operations between sets (union
and intersection) leading to large execution times (e.g., in the
order of hours, on 8 cores of a Intel Xeon CPU). For this reason,
attempts have been made to alleviate this issue, either by making
the set operations faster [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ] or by using indexing techniques
to reduce the number of redundant operations performed [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ].
These attempts led to a reduction in execution times by an order
of magnitude, but still a significant amount of time is required
to complete such a task (e.g., several minutes are required on a
single Xeon CPU core). Given that researchers usually need to
run multiple analyses, the total computational cost can still be
very high.
      </p>
      <p>In this paper, we present a novel approach for miRNA
enrichment analysis using machine learning techniques, to predict
p-values using features of miRNA groups, relevant to the problem,
instead of calculating them using randomization experiments.
This simplifies the work for bioinformatics data analysts,
helping them to eficiently perform multiple enrichment analysis
tasks. Our contributions are: (a) framing the problem, (b) data set
creation and feature engineering, (c) determining a shortlist of
promising machine learning models, using cross-validation, and
(d) fine-tuning to determine the best models for our case. Our
approach shows that the best model demonstrates an 2 score
above 90%, and  = 0.048.
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND</title>
    </sec>
    <sec id="sec-3">
      <title>Permutation test</title>
      <p>
        Given a miRNA group as a query, the biological function
permutation test introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] consists of the following steps:
(1) Calculate a statistical measure  (e.g., left-sided overlap
- see below), that captures a type of ‘relevance’ of the
biological function with the query, according to the genes
that are related to both of them.
(2) Create a large number (e.g., 1 million) of randomly
assembled miRNA groups, with each containing the same
number of miRNAs, and calculate  for each of these groups,
as well.
(3) Measure the proportion of randomly assembled groups
that present more favourable values for  than the query.
More formally, each miRNA is represented as the set of genes
targeted by it. Consequently, a group of miRNAs is represented as
a set, containing the union of all genes targeted by each miRNA
in the group. Moreover, a biological function is also represented
as the set of genes participating in that function.
      </p>
      <p>Based on the previous, we can now describe one popular
permutation test: Let  be the set of genes containing the union
of targets from all miRNAs in a query group, and also let  the
set of genes participating in a biological function. The statistical
measure used to compare the query to the randomly assembled
miRNA groups is the left-sided overlap and it is defined as follows:
  ( ∩ )
   -  (, ) =
  ()</p>
      <p>Essentially, the left-sided overlap is defined as the proportion
of targeted genes that also participate in the biological function.
Then, we create 1 million random miRNA groups  with the
same size as the query and calculate the left-sided overlap for
each of them. The empirical p-value is defined as (where overlap
is the left-sided overlap):</p>
      <p>({ :  ( , ) ≥  (, )})
 −  =

which is the proportion of randomly assembled groups presenting
a larger left-sided overlap than the query.
2.2</p>
    </sec>
    <sec id="sec-4">
      <title>Performance issues</title>
      <p>The above analysis relies on a very large number of union and
intersection set operations. Given that this analysis is performed
for more than one biological functions, and that more than 20K
biological functions exist, a few million union and about 20 billion
intersection operations are performed in the span of the analysis.</p>
      <p>
        The software implemented by the authors in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is written in
Python, uses hash join set operations and a typical analysis on
a single CPU core of an Intel Xeon CPU requires many hours.
However, based on the fact that this analysis is very repetitive,
even a small increase in operation speed is going to lead to a large
total speedup. With this motivation, in [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ] we re-implemented
the algorithm in C++. We also improved the analysis performance
by exploiting bit vectors, as well as a hybrid version of hash join
between sets of items and bit vectors. This made the analysis one
order of magnitude faster requiring about 40 minutes to complete
on a single core of the same Xeon computer.
      </p>
      <p>
        Then in [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ], we introduced novel indexing techniques, that
allowed us to remove a large number of redundant operations
performed by our previous version and thus managed to reduce
the time to approximately half of what was required previously
on a single core. Furthermore, we used a technique that allowed
us to run the analysis on only a subset of the biological functions
(which are predicted to be statistically significant) and managed
to further reduce the time required for p-value calculation to
about 3 minutes (on a single core of the Xeon processor). The
downside to the latter approach, is that p-values are produced
only for the functions expected to be significant and not all
biological functions in the dataset. Consequently, we were motivated
to use ML to train a model that will predict p-values very quickly
and for every biological function in the data set.
3
3.1
      </p>
    </sec>
    <sec id="sec-5">
      <title>MACHINE LEARNING FOR P-VALUE</title>
    </sec>
    <sec id="sec-6">
      <title>PREDICTION</title>
    </sec>
    <sec id="sec-7">
      <title>Features</title>
      <p>In order to train a machine learning model to predict p-values
(p_value) as accurately as possible, we selected features from
our dataset based on their biological meaning and their relevance
to the analysis. The features are summarized below:
miRNA group size (mirna_group_size): The number of
miRNAs in a miRNA group.</p>
      <sec id="sec-7-1">
        <title>Biological function ID (biological_process): a unique string</title>
        <p>that identifies each biological function. This string consists of 2
letters (same for all biological functions) and a numerical part.
We turned this ID into a numerical value, by stripping the letters
from the string.</p>
        <p>Number of common genes (number_of_common_genes): The
number of genes targeted by a miRNA group that also belong to
the biological function.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Left-sided overlap (left_sided_overlap): the left-sided over</title>
        <p>lap as defined in Section 2.1.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Right-sided overlap (right_sided_overlap): Given  and</title>
        <p>from Section 2.1, the right-sided overlap is defined as:
  ( ∩ )
ℎ -  (, ) =
  ()
Two-sided overlap (overlap): Given  and  from Section 2.1,
the two-sided overlap (or Jaccard coeficient) is defined as:
  ( ∩ )
 -  (, ) =
  ( ∪ )</p>
      </sec>
      <sec id="sec-7-4">
        <title>Common genes as a percentage of the universe of genes</title>
        <p>(common_genes_proportion_to_total): The number of
common genes between M and G as the proportion of the total
number of genes in the universe of genes.</p>
        <p>Common gene list (common_genes): The list of common genes,
sorted by alphabetical order. We used label encoding to turn the
values into categorical values.</p>
      </sec>
      <sec id="sec-7-5">
        <title>Chromosomes of common genes (common_chr): The list of</title>
        <p>chromosomes on which the common genes are located, sorted
by alphabetical order. We used label encoding to turn the values
into categorical values.</p>
      </sec>
      <sec id="sec-7-6">
        <title>Number of chromosomes where common genes are located</title>
        <p>(number_of_common_chr): The number of chromosomes on which
the common genes are located.</p>
      </sec>
      <sec id="sec-7-7">
        <title>Left chromosome overlap (chr_left_sided_overlap) : The</title>
        <p>number of chromosomes on which the common genes are located
divided by the number of chromosomes of the genes targeted by
the miRNA group.</p>
      </sec>
      <sec id="sec-7-8">
        <title>Right chromosome overlap (chr_right_sided_overlap):</title>
        <p>The number of chromosomes on which the common genes are
located, divided by the number of chromosomes of the genes
belonging in the biological function.</p>
      </sec>
      <sec id="sec-7-9">
        <title>Two sided chromosome overlap (chr_overlap): The number</title>
        <p>of chromosomes on which the common genes are located,
divided by the number of chromosomes for the union of the genes
contained in  and .</p>
        <p>
          Using these features, we produced a dataset for groups of
miRNAs of size 10, 25, 50, 100 (containing 100 groups of each size)
for all biological functions as described in the Gene Ontology [
          <xref ref-type="bibr" rid="ref7">6</xref>
          ].
Finally, to handle categorical features (miRNA group size,
biological function, common gene list, chromosomes of common
genes) we used label encoding.
3.2
        </p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>ML Algorithms</title>
      <p>In order to find the best method to use for our case, we are going
to use the following algorithms:
• Linear/Ridge/Lasso Regressors: traditional regression
methods, fitting a linear equation that uses the least squares
method (with several variants of regularization).
• Decision Tree Regressor: uses decision trees to make
predictions.
• Random Forest Regressor: estimates target value by
combining average estimation values of several individual
prediction models based on classifying decision trees for
a number of subsets of the data set.
• Adaboost Regressor: combines multiple weak decision
trees into one.
• Gradient Boost (XGBoost, LightGBM, CatBoost): in
XGBoost, the estimation of the target is done by
combining estimates of many individual prediction models
based on decision trees. LightGBM is similar to XGBoost,
but prediction is much faster than XGBoost. Regarding
CatBoost, the key diference is that it builds decision
symmetric trees. Both LightGBM and CatBoost can inherently
handle categorical features.
• MLPerceptron Regressor: typical Neural Net
configuration.
4</p>
    </sec>
    <sec id="sec-9">
      <title>EVALUATION</title>
      <p>In this section, we present preliminary results on a dataset that
has been created by only using 1000 random miRNA groups to
calculate p-values. This was done due to time constraints and to
reach some preliminary conclusions about the dataset, in order
to compare the algorithms presented in Section 3.2. Given the
definition of an empirical p-value in Section 2.1, we expect that
the p-values in the dataset will present a small accuracy, since the
number of random miRNA groups is three orders of magnitude
smaller than the required number. This is expected to create
larger errors than the ones that the permutation test produces.
However, these preliminary results are a first indicator on which
algorithms present the worst performance in terms of accuracy
and thus, which algorithms will be featured in the final work.
4.1</p>
    </sec>
    <sec id="sec-10">
      <title>Linear correlation</title>
      <p>In Figure 1 we can see a heat map of the linear correlation between
the features. Each cell represents the correlation measured by
Pearson correlation coeficient  between features in row x and
column y. The values of  range between -1 and 1. The larger
the deviation from 0 the stronger the positive (for values closer
to 1) or negative (for values closer to -1) correlation is. It is
interesting to note here that the p-value, which is the feature we
want to predict, does not seem to have a strong linear correlation
(either positive or negative) with any of the features individually.
Nevertheless, this does not mean that the p-value is not related
with the features in a non-linear way. More specifically, it is
expected to be related with the left-sided overlap (see p-value
definition in Section 2.1).
4.2</p>
    </sec>
    <sec id="sec-11">
      <title>Preliminary results</title>
      <p>
        In this section we are going to perform a preliminary evaluation
of the algorithms we outlined in Section 3.2. In order to evaluate
and compare the algorithms, we are going to use the following
statistical measures [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ]:
(1) Mean Absolute Error (): the  is the mean
absolute diference between observed and predicted values.
Essentially, it is used to compare predictions to actual
observed values.
(2) Coeficient of determination ( 2): the 2 score is a
statistical measure of how close are the predictions to the
real data points. It is essentially a goodness-of-fit
measurement.
      </p>
      <p>Model selection and implementation was performed with the
method of k-Fold cross-validation. We first split the dataset into
the training and testing dataset. The training dataset is split into
k folds (groups). Then k-1 of those groups are used to train the
model and the group left is used for validation. The score of the
validation is recorded. Then, the process is repeated, but this
time another group is used for validation and the rest for
retraining, until all folds have been exhausted. The average of all
validation scores is the final validation score. Finally, the test score
is retrieved by testing the model against the testing data set. Final
validation scores for each algorithm are shown in Table 1 and the
testing scores in Table 2. Model parameters were selected via Grid
Search for the Linear Regressors and the rest via Random Search,
due to the large number of parameters and memory constraints.</p>
      <p>It should be noted that the algorithms based on linear
regression models do not perform as well as the other algorithms. Our
approach shows that the best model is LightGBM, demonstrating
a  = 0.048.
5</p>
    </sec>
    <sec id="sec-12">
      <title>CONCLUSION &amp; FUTURE WORK</title>
      <p>In this paper, we presented an approach for miRNA enrichment
analysis using machine learning techniques, in order to predict
p-values instead of calculating them using randomization
experiments. The goal is to simplify the work for bioinformatics data
analysts, facilitating multiple enrichment analysis. Preliminary
results showed that the best model demonstrates an 2 score
above 90%, and  = 0.048. As next steps, we plan to expand
our dataset to include p-values estimated using 1 million random
groups in order increase their accuracy as well as the accuracy
of the prediction.</p>
    </sec>
    <sec id="sec-13">
      <title>ACKNOWLEDGMENTS</title>
      <p>We acknowledge support of this work by the project “Moving
from Big Data Management to Data Science" (MIS 5002437/3)
which is implemented under the Action“Reinforcement of the
Research and Innovation Infrastructure”, funded by the Operational
Programme “Competitiveness, Entrepreneurship and Innovation”
(NSRF 2014-2020) and co-financed by Greece and the European
Union (European Regional Development Fund).
2
2
N/A
N/A</p>
      <p>N/A
Management (SSDBM 2020). Association for Computing Machinery (ACM).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Bleazard</surname>
          </string-name>
          ,
          <article-title>Janine A Lamb,</article-title>
          and
          <string-name>
            <surname>Sam</surname>
          </string-name>
          Grifiths-Jones.
          <year>2015</year>
          .
          <article-title>Bias in microRNA functional enrichment analysis</article-title>
          .
          <source>Bioinformatics</source>
          <volume>31</volume>
          ,
          <issue>10</issue>
          (01
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Shailendra</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          , Purvi Purohit, and
          <string-name>
            <given-names>Praveen</given-names>
            <surname>Sharma</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MicroRNAs and Diseases: Promising Biomarkers for Diagnosis and Therapeutics</article-title>
          .
          <source>Indian Journal of Clinical Biochemistry</source>
          <volume>34</volume>
          ,
          <issue>3</issue>
          (
          <issue>01</issue>
          <year>Jul 2019</year>
          ),
          <fpage>243</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Fisher</surname>
          </string-name>
          .
          <year>1992</year>
          . Statistical Methods for Research Workers. Springer New York, New York, NY,
          <fpage>66</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>copy_x = True, fit_intercept = False, normalize = True alpha=1, copy_x = True, fit_intercept = False, normalize = True, solver = auto, tol = 0.001 default min_samples_leaf = 2, min_samples_split = 5,max_depth = None,max_features = None n_estimators = 50, n_jobs = -1, min_samples_leaf = 2, min_samples_split = 2, verbose = 10 n_estimators = 20, loss = square, base_estimator = RandomForestRegressor, (n_jobs =-1, verbose=2, max_depth=25, n_estimators=4) objective = poisson, boosting_type = gbdt, learning_rate = 0.5, n_estimators = 900, n_jobs = -1, num_leaves = 40, max_depth=20, reg_alpha = 1, reg_lambda = 1, force_col_wise = True verbosity = 2, max_depth = 20, n_estimators=200 n_estimators=1000, learning</article-title>
          _
          <source>rate=0</source>
          .5, bootstrap_type=Bernoulli, hidden_layer_sizes = (
          <volume>24</volume>
          ,
          <fpage>24</fpage>
          ,24),
          <article-title>activation = "relu", max_iter = 40, learning_rate='adaptive' n_estimators = 50, n_jobs = -1, min_samples_leaf = 2, min_samples_split = 2, verbose = 10 objective = poisson, boosting_type = gbdt, learning_rate = 0.5, n_estimators = 900, n_jobs = -1, num_leaves = 40, max_depth=20, reg_alpha = 1, reg_lambda = 1, force_col_wise = True objective = poisson, boosting_type = gbdt, learning_rate = 0.5</article-title>
          , n_estimators = 900, n_jobs =
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Géron</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.</article-title>
          <string-name>
            <surname>O'Reilly Media</surname>
          </string-name>
          , Incorporated.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zagganas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vergoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Paraskevopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Dalamagas</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>BUFET: boosting the unbiased miRNA functional enrichment analysis using bitsets</article-title>
          .
          <source>BMC Bioinformatics</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>The</given-names>
            <surname>Gene Ontology Consortium</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The Gene Ontology Resource: 20 years and still GOing strong</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>47</volume>
          ,
          <issue>D1</issue>
          (11
          <year>2018</year>
          ),
          <fpage>D330</fpage>
          -
          <lpage>D338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zagganas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vergoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Dalamagas</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Eficient Calculation of Empirical P-Values for Association Testing of Binary Classifications</article-title>
          .
          <source>In 32nd International Conference on Scientific and Statistical Database</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>