<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LightGWAS: A Novel Machine Learning Procedure for Genome-Wide Association Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bruno Am</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>rozio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Longo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>s Rizzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, Technological University Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper proposes a novel machine learning procedure for genome-wide association study (GWAS), named LightGWAS. It is based on the LightGBM framework and is intended as a single, resilient, autonomous and scalable solution to common limitations of GWAS implementations found in the literature. These include reliance on extensive manual quality control steps and on specific GWAS methods for each type of dataset morphology and size. In this research, LightGWAS has been contrasted against PLINK2, one of the current state-of-the-art GWAS implementations, based on a general linear model with support for Firth regularisation. The mean differences measured on standard classification metrics, obtained through quantitative empirical tests with k-fold cross-validation, indicated that LightGWAS outperforms PLINK2 for balanced, imbalanced, and high-imbalanced genomic datasets. Paired difference tests denoted statistical significance in the results from the experiments with imbalanced datasets. This article contributes to the body of knowledge by presenting a potentially more efficient GWAS procedure based on nonparametric approaches. LightGWAS ensures adaptability with higher precision in the discovery of causal single-nucleotide polymorphisms, thanks to the leaf-wise tree growth algorithm offered by the state-of-the-art for gradient boosting decision trees. Control for false positives and statistical power are addressed automatically by the model's training process, which significantly reduces human dependency during the study design.</p>
      </abstract>
      <kwd-group>
        <kwd>LightGWAS</kwd>
        <kwd>LightGBM</kwd>
        <kwd>genome-wide association study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>There are many methods to discover causal single-nucleotide polymorphisms
(causal-SNPs), including genome-wide
association study (GWAS). GWAS implementations calculate the association
between each SNP and the underlying phenotype through a statistical model.
Therefore, GWAS is roughly analogous to, or a type of, feature selection: each SNP
is a feature (independent variable), and the phenotype is the class (target, or
dependent variable). The features identified as better predictors of the class are
the potential causal-SNPs.</p>
      <p>
        Statistical regression models represent the state-of-the-art for GWAS.
Despite their efficiency, some prominent problems have become unavoidable over the
past years. These include the reduction of costs for DNA sequencing [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], which in
turn allowed an exponential growth of data; the expansion of SNP datasets, which
has contributed to overwhelming sparsity, with millions of SNPs and few
patients [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; and highly dispersed (or high-dimensional) datasets, which compromise the
approaches available for GWAS because they are derived from linear (parametric)
models [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another point of concern emerges with imbalanced ratios of rare cases
to numerous controls. Such a scenario tends to inflate false positives when data
is exploited by regression over qualitative features [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Nowadays, these
obstacles are addressed via several manual quality control steps to increase
statistical power and avoid type 1 errors [
        <xref ref-type="bibr" rid="ref10 ref19 ref21 ref22 ref24">10, 19, 21, 22, 24</xref>
        ]. However, as data
grows, so does the dependency on manual intervention. This leaves room
for human mistakes and compromises the scalability of the study. To address the
aforementioned gaps, this paper proposes a novel procedure for GWAS built
on decision trees (DT) enhanced by gradient boosting machine (GBM), whose
implementation comes from the LightGBM framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It ensures
adaptability to the most diverse genomic data structures by controlling bias and variance
during the training process. Consequently, it improves precision, independently of
human intervention. Such a procedure has been named LightGWAS. Therefore,
this work attempts to answer the following research question:
      </p>
      <p>Can LightGWAS be an alternative method to the state-of-the-art for
genome-wide association studies based upon general linear models, by increasing
statistical power on causal-SNP detection, and reducing the number of manual
quality control steps?</p>
      <p>The research goals of this paper are: (a) to evaluate whether LightGWAS
is a suitable GWAS method for qualitative phenotypes, according to a set of
common metrics for classification problems; and (b) to assess whether LightGWAS
outperforms the available state-of-the-art for GWAS in terms of statistical power
and precision. The remainder of this paper is organised as follows: Section
2 reviews research on the state-of-the-art for genome-wide association studies,
along with an overview of the LightGBM framework. Section 3 introduces the
design and a set of hypotheses for answering the research question. Section 4,
in turn, presents the results with a discussion. Lastly, Section 5 concludes the
study, highlighting its contributions and possible future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Literature review and related work</title>
      <p>
        A genome-wide association study (GWAS) is a discovery-driven research
technique to catalogue single-nucleotide polymorphisms (SNPs) across populations
and to identify genetic markers associated with traits [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. Since the
completion of the human genome sequence in 2003, about 3,700 GWASs have contributed
to discovering thousands of genetic risk causal-SNPs and their biological
functions [
        <xref ref-type="bibr" rid="ref15 ref17">17, 15</xref>
        ]. State-of-the-art GWAS methods are based on three
exclusively statistical association models: general linear model (GLM), linear mixed
model (LMM), and scalable and accurate implementation of generalized mixed
model (SAIGE) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Their applicability depends on the phenotype type, sample
size, and cohort distribution across the genomic dataset under analysis.
According to [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the following criteria should be considered to select the appropriate
model: (a) GLM implementations for quantitative traits, up to five thousand
samples. If the phenotype is qualitative, the logistic regression implementation should
include Firth regularisation to minimise the fitting errors caused by the
categorical class whenever its frequency is lower than 400 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; (b) LMM
implementations for datasets larger than five thousand samples and quantitative traits.
If the phenotype is qualitative, the dataset should follow a normal distribution;
or (c) SAIGE [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] for high-imbalanced case-control ratios of qualitative traits.
Independently of the chosen method, principal component analysis should also
be employed. It helps to filter out SNPs whose association might be caused by the structure of
the population (generating confounding due to ancestry), rather than the
investigated phenotype [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Usually, the first ten eigenvalues are arbitrarily considered
as covariates for an association model [
        <xref ref-type="bibr" rid="ref19 ref2 ref27">19, 2</xref>
        ]. The GWAS outcome is a list of
potential causal-SNPs.
      </p>
      <p>
        This paper proposes a method for GWAS based on LightGBM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]: a
gradient boosted decision trees (GBDT) framework built upon histogram algorithms.
LightGBM grows the trees leaf-wise and uses gradient-based one-side sampling
(GOSS) to downsample data, and exclusive feature bundling (EFB) to
reduce feature dimension. In order to address GBDT problems related to
high computational complexity due to abundance of data, GOSS retains the samples with large
gradients, randomly selects samples with small gradients, and assigns constant weights
to them. The algorithm concentrates on undertrained samples without altering
the distribution of raw data. EFB, in turn, is a feature extraction technique,
based on the graph colouring problem, which also contributes to reducing the
histogram building complexity. It deals with the sparsity of the data by grouping
many independent variables into dense features, avoiding unnecessary
computation with pieces that do not account for the outcome variable. LightGBM,
within the proposed GWAS solution, discovers causal-SNPs by calculating the
model's feature importance. Each SNP is, in fact, an independent variable of the
model. Hence, the list of features that better explains the dependent variable
(phenotype) contains the sought SNPs. Considering the LightGBM framework
design and the strong evidence of its ability to address problems involving
highly sparse data over big datasets [
        <xref ref-type="bibr" rid="ref16 ref23 ref25">16, 25, 23</xref>
        ], this article works upon the idea
that such a framework is also a potential core engine for GWAS.
      </p>
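      <p>The GOSS step described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not LightGBM's implementation: the function name goss_sample, the parameters top_rate and other_rate, and the toy gradients are assumptions made for illustration. The constant weight (1 - top_rate) / other_rate applied to the sampled small-gradient instances follows the description given in the LightGBM paper [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
      <preformat>
```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by a GOSS-style sampling step.

    Illustrative only: the names and default rates are assumptions,
    not LightGBM's API.
    """
    n = len(gradients)
    n_top = round(n * top_rate)
    n_other = round(n * other_rate)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top = order[:n_top]
    rest = order[n_top:]
    # Randomly draw a subset of the small-gradient instances.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Up-weight the sampled small-gradient instances so that their total
    # contribution approximates that of all small-gradient instances.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return top + sampled, weights


if __name__ == "__main__":
    toy_gradients = [0.9, -0.05, 0.02, -0.8, 0.01, 0.03, -0.04, 0.7, 0.06, -0.02]
    kept, kept_weights = goss_sample(toy_gradients, top_rate=0.3, other_rate=0.2)
    print(kept)
    print(kept_weights)
```
</preformat>
      <p>In this toy run the three instances with the largest absolute gradients are always kept with weight 1, while two of the remaining seven are drawn at random and up-weighted.</p>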
    </sec>
    <sec id="sec-3">
      <title>Experiment design and methodology</title>
      <p>
        LightGWAS is designed to be a GWAS procedure based on a nonparametric machine
learning method. The solution is composed of a GBDT algorithm
implemented by the LightGBM framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It is fitted with the SNPs as
independent variables and the phenotype as the class. Thus, the causal-SNPs
are retrieved by calculating the model's feature importance. In turn, to answer
the research question of this paper, an experiment involving three datasets, two
feature selector models and a predicting model is conducted. Fig. 1 depicts the
experiment design in four steps, followed by the evaluation strategy applied.
      </p>
      <p>
        The datasets (Fig. 1A) contain the same number of SNPs each, but differ in
the case:control ratio of the phenotype: 1:1, 1:10, and 1:100. The first
two models (Fig. 1B) are GWAS procedures. One of the GWAS methods is the
novelty behind this paper, LightGWAS. The other one is PLINK2 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], one
of the state-of-the-art implementations for GWAS in contexts where GLM is
required. Therefore, six causal-SNP result sets are generated from them. The
third model (Fig. 1C), referred to as the common classifier from now on, is a
logistic regression. It employs the k-fold cross-validation model selection technique,
with k set arbitrarily to 50. A value higher than 30 was necessary to
perform statistically significant comparisons across the resulting sets. The common
classifier is fitted once with the causal-SNPs discovered by LightGWAS as
independent variables and another time with the causal-SNPs retrieved with PLINK2. The
dependent variable, in both circumstances, is the underlying phenotype of the
datasets. Therefore, its output is a paired set of classification metrics from each
of the GWAS methods. Lastly (Fig. 1D), the groups of metrics extracted with
the cross-validation are evaluated in terms of the statistical significance of possible
differences among them. The evaluated alternative hypotheses are:
      </p>
      <p>H1: LightGWAS outperforms GLM based on logistic regression with Firth
regularisation for GWAS, across genomic datasets of balanced qualitative
phenotypes (case : control = 1 : 1), in terms of accuracy, precision, F1
score, and ROC/AUC.</p>
      <p>H2: LightGWAS outperforms GLM based on logistic regression with Firth
regularisation for GWAS, across genomic datasets of imbalanced qualitative
phenotypes (case : control = 1 : 10), in terms of precision, F1 score, and
ROC/AUC.</p>
      <p>H3: LightGWAS outperforms GLM based on logistic regression with Firth
regularisation for GWAS, across genomic datasets of high-imbalanced
qualitative phenotypes (case : control = 1 : 100), in terms of precision, F1 score,
and ROC/AUC.</p>
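      <p>The paired evaluation described above rests on a k-fold partition of the samples. The sketch below, written with the Python standard library only, shows one way such a partition could be produced. It is an illustration under assumptions: the function name k_fold_indices and the seed are hypothetical, the split is not stratified by phenotype, and the classifier itself is omitted.</p>
      <preformat>
```python
import random


def k_fold_indices(n_samples, k=50, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if extra > fold else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx


if __name__ == "__main__":
    folds = list(k_fold_indices(5000, k=50))
    print(len(folds), len(folds[0][1]))
```
</preformat>
      <p>With a sample size of 5000 and k = 50, each test fold holds 100 samples, so every GWAS method contributes 50 values per metric to the paired comparison.</p>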
      <sec id="sec-3-1">
        <title>Datasets</title>
        <p>
          A GWAS relies on two different data groups: the genomic data that contains the
DNA variants, and the traits to be associated with the SNPs between the case
and control cohorts. Usually, the traits to be investigated are human
phenotypes, such as disease status, that can be retrieved from the patients' electronic
health records (EHR) [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. In this article, the selected datasets are fully synthetic in
both genomic and phenotype data. Simulations have been used so that
the causal-SNPs expected to be exposed by each of the evaluated
GWAS models are known precisely, which is paramount to compare them correctly. Dataset
simulation for the validation of GWAS methods is a prevalent practice and can be observed
in many types of research, such as [
          <xref ref-type="bibr" rid="ref14 ref26 ref5 ref7 ref8">5, 7, 8, 14, 26</xref>
          ]. Accordingly, six datasets
have been created, combined into three data groups of class (phenotype status)
distribution: balanced, imbalanced, and high-imbalanced data. They have been
named ds1_1, ds1_10 and ds1_100, respectively. The number of samples
(fictitious patients) in each of the datasets follows this pattern: ds1_1 =
case:control = 1:1 = 2500:2500, N = 5000; ds1_10 = case:control = 1:10 = 400:4000,
N = 4400; ds1_100 = case:control = 1:100 = 50:5000, N = 5050. The datasets were
produced using the PLINK SNP simulation tool<sup>1</sup>. Each sample had a phenotype
status class (case or control) and 10100 numeric features (each feature is a SNP).
Further details about the variables of interest, along with the parameters set to
simulate the datasets, can be found in Appendix A (table 2, page 12).
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Procedure</title>
        <p>
          The complete procedure to accomplish the objectives and test the alternative
hypotheses includes seven steps:
        </p>
        <p>1. Simulation of datasets, as outlined above in Section 3.1.</p>
        <p>2. LightGWAS implementation. It is composed of a GBDT implementation
called LightGBM. The hyperparameters are tuned through 200 iterations
of randomised 5-fold cross-validation search. Table 3 in Appendix B
(page 12) contains the cross-validated optimal hyperparameters selected for
each dataset group.</p>
        <p>
          3. Discovery of the causal-SNPs across the aforementioned datasets by employing
LightGWAS and PLINK2. Therefore, two sets of causal-SNPs per GWAS
method are generated. PLINK's outcome is a set of SNPs accompanied by
their p-values. The causal-SNPs are filtered by assuming a cut-off (α)
for such a p-value. For the datasets ds1_1 and ds1_10, the cut-off α =
5 × 10<sup>-8</sup> is assumed, as per genome-wide association study convention [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In
turn, for the dataset ds1_100, the cut-off is α = 5 × 10<sup>-4</sup> because no
SNP was selected with the first one. This decision has been grounded on [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
In contrast, LightGWAS selects each SNP with the gain or split score of the
decision trees. Therefore, the list of feature importances from the LightGBM
framework is the set of causal-SNPs retrieved with LightGWAS.
        </p>
        <p>4. Evaluation of the GWAS models. In order to compare how effective LightGWAS
is in comparison to PLINK, the common classifier is employed. It is a
logistic regression executed through 50-fold cross-validation for model selection,
which is fitted under two conditions: one with the
causal-SNPs collected via LightGWAS as features, and another with the causal-SNPs selected via
PLINK. The class (or target), for both scenarios, is the phenotype variable.
Therefore, the common classifier output is a separate dataset with 50
result samples per GWAS model. The following metrics have been evaluated:
weighted average of the precision and recall (F1), recall, average precision
score (APS), receiver operating characteristic (ROC)/area under the curve
(AUC), accuracy, and precision.</p>
        <p>5. The confidence interval (CI) of each metric's result set is calculated through
5000 bootstraps with a cut-off of α = 0.05. Each subsample (resampling with
replacement) is sized at 50% (N × 0.5). Therefore, there is a 95%
likelihood that the reported lower limit (LL) and upper limit (UL) represent the
confidence intervals of the true metrics' performances.</p>
        <p>6. Paired difference tests are employed to measure how significant the
observed differences in each metric pair are. The dependent (paired) sample Student's
t-test is applied to the metric pairs that follow a normal distribution, and the
Wilcoxon signed-rank test otherwise. Tests to assess whether a metric
(variable of the results dataset) follows a Gaussian distribution are conducted with
D'Agostino's K<sup>2</sup> normality test. Whenever a sample does not reach the
levels of a normal distribution, a Box-Cox power transformation is first
attempted before assuming nonparametric approaches.</p>
        <p>7. The effect of the observed mean differences is calculated through Cohen's d
when the parametric test has been used, and the Wilcoxon r score otherwise.</p>
        <p><sup>1</sup> http://zzz.bwh.harvard.edu/plink/simulate.shtml</p>
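        <p>Steps 5 and 7 of the procedure can be sketched with the Python standard library alone. The block below is an illustrative approximation rather than the code used in the experiments: the percentile method for the interval, the function names, the seed, and the toy metric values are all assumptions. Cohen's d is computed here on the paired differences (mean difference divided by the standard deviation of the differences), which is one common convention for dependent samples.</p>
        <preformat>
```python
import random
import statistics


def bootstrap_ci(values, n_boot=5000, fraction=0.5, alpha=0.05, seed=0):
    """Percentile bootstrap interval (LL, UL) for the mean of `values`."""
    rng = random.Random(seed)
    size = max(1, int(len(values) * fraction))
    # Resample with replacement and record the mean of each subsample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=size)) for _ in range(n_boot)
    )
    lower = means[round((alpha / 2) * n_boot)]
    upper = means[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def cohens_d_paired(a, b):
    """Cohen's d for dependent samples: mean difference over SD of differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


if __name__ == "__main__":
    # Toy per-fold metric values for two methods (hypothetical numbers).
    rng = random.Random(1)
    method_a = [0.85 + rng.uniform(-0.02, 0.02) for _ in range(50)]
    method_b = [0.84 + rng.uniform(-0.02, 0.02) for _ in range(50)]
    print(bootstrap_ci(method_a))
    print(cohens_d_paired(method_a, method_b))
```
</preformat>
        <p>The defaults mirror the settings stated in step 5 (5000 resamples, subsamples of half the result set, α = 0.05); the normality checks and the Wilcoxon alternative from steps 6 and 7 are left out of this sketch.</p>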
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and evaluation</title>
      <p>The consolidated results are presented in table 1. The CI ranges, along with
the standard deviation (SD) of each
metric, are reported in Appendix C (table 4, page 12).
The statistical report that follows is organised by dataset group and is drawn from
the consolidated result sets in table 1.</p>
      <p>Dataset ds1_1: LightGWAS slightly outperformed PLINK on F1,
recall, and ROC/AUC, while PLINK outperformed LightGWAS on APS and
precision. Both models reached the same mean value for accuracy, giving a
mean absolute difference (MD) of zero. The t-tests indicated no statistical
significance at α = 0.05 for any of the measured metrics. The standardised difference
between the means resulted in a small effect for all of the metrics (d &lt; 0.5). In
terms of causal-SNP selection, LightGWAS selected 86 SNPs, while PLINK
selected 90. PLINK picked all the SNPs selected by LightGWAS, plus four
other causal-SNPs.</p>
      <p>Dataset ds1_10: LightGWAS slightly outperformed PLINK for every
measured metric. The t-tests indicated statistical significance at α = 0.05 for both
F1 and accuracy, with a small effect (d &lt; 0.5). The Wilcoxon test indicated
statistical significance at α = 0.01 and a large effect (r ≥ 0.8) for APS, ROC/AUC and
precision. No statistical significance at α = 0.05 was observed for recall,
although the observed difference had a large effect (r ≥ 0.8). In terms of causal-SNP
selection, LightGWAS selected 80 SNPs, while PLINK selected 76. LightGWAS
picked all the SNPs selected by PLINK, plus four other causal-SNPs.</p>
      <p>Dataset ds1_100: LightGWAS slightly outperformed PLINK for every
measured metric. The Wilcoxon test indicated no statistical significance at α = 0.05
for any of them. However, a medium effect (0.5 ≤ r &lt; 0.8) was observed
for recall, and a large effect (r ≥ 0.8) for all the other metrics. In terms of
causal-SNP selection, LightGWAS selected 28 SNPs, while PLINK selected 19.
LightGWAS picked 14 SNPs missed by PLINK, and PLINK, in turn,
selected 5 SNPs missed by LightGWAS.</p>
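      <p>The selection counts reported for the three datasets imply how many SNPs both methods agreed on, although the overlap is not stated explicitly. As a worked check, the following sketch derives that overlap from the reported totals and method-exclusive counts and verifies that the two derivations agree. The function name implied_overlap and the dictionary layout are assumptions introduced for illustration.</p>
      <preformat>
```python
def implied_overlap(total_a, only_a, total_b, only_b):
    """Number of SNPs selected by both methods, derived from reported counts."""
    shared_from_a = total_a - only_a
    shared_from_b = total_b - only_b
    if shared_from_a != shared_from_b:
        raise ValueError("reported counts are inconsistent")
    return shared_from_a


# (LightGWAS total, LightGWAS-only, PLINK total, PLINK-only) as reported above.
reported = {
    "ds1_1": (86, 0, 90, 4),
    "ds1_10": (80, 4, 76, 0),
    "ds1_100": (28, 14, 19, 5),
}

overlaps = {name: implied_overlap(*counts) for name, counts in reported.items()}
print(overlaps)  # {'ds1_1': 86, 'ds1_10': 76, 'ds1_100': 14}
```
</preformat>
      <p>Under this reading, the two methods share 14 SNPs on the 1:100 dataset, which is the only scenario where each method also finds SNPs that the other misses.</p>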
      <sec id="sec-4-1">
        <title>Discussion</title>
        <p>The models implemented through LightGWAS performed as well as PLINK
for GWAS over the balanced dataset. The paired difference tests disclosed that
none of the measured differences is statistically significant at the cut-off α = 0.05.
Also, Cohen's d indicated a small standardised
effect between all the means of the paired metrics. Consequently, the alternative
hypothesis H1 had to be rejected, as LightGWAS neither outperformed nor
underperformed PLINK with statistical significance on this dataset.</p>
        <p>The experiments involving an imbalanced dataset brought evidence that
supports accepting the alternative hypothesis H2. LightGWAS outperformed
PLINK in this scenario. Although recall did not reach statistical significance
at α = 0.05 (and is therefore on par with PLINK), all the other metrics were
significant at α = 0.01 (F1 and accuracy at α = 0.05). Furthermore, the metrics
measured through nonparametric tests (recall, APS, ROC/AUC and precision)
resulted in a large effect (r ≥ 0.8).</p>
        <p>The alternative hypothesis H3 was rejected. Although LightGWAS
outperformed PLINK with a medium effect for recall (0.5 ≤ r &lt; 0.8) and a large
effect for the other metrics (r ≥ 0.8) when applied to the high-imbalanced
dataset, none of the results reached statistical significance at α = 0.05.</p>
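        <p>The effect-size bands used in this discussion (small below 0.5, medium from 0.5 up to but not including 0.8, large from 0.8) can be written as a small helper. This is only an illustrative sketch: the function name effect_label and the use of the absolute value for negative effects are assumptions.</p>
        <preformat>
```python
def effect_label(value):
    """Map an effect size (Cohen's d or Wilcoxon r) to the bands used here."""
    magnitude = abs(value)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    return "small"


if __name__ == "__main__":
    for effect in (0.3, 0.5, 0.79, 0.8):
        print(effect, effect_label(effect))
```
</preformat>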
        <p>
          Considering exclusively the k-fold cross-validation model selection results
(observed differences in the means), models implemented via the proposed
LightGWAS procedure outperformed those implemented with PLINK in the three
evaluated scenarios. However, taking into consideration the statistical analysis
of the differences between metric pairs, this result holds with statistical significance
only in the experiments involving the imbalanced dataset. Nonetheless,
according to [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], it is important to note that statistical significance should not be the
exclusive criterion for judging how relevant a model is. The scientific perspective
(or significance) of the underlying problem should also be taken into
consideration. Genome-wide association study plays an essential role in identifying causal
anomalies across DNA, and any improvement over a method, whether
statistically significant or not, should be accounted for. Hence, although some of the
measured metrics did not reach statistical significance (leading to the rejection of
the alternative hypotheses H1 and H3), they prove to be scientifically
meaningful through their effect sizes and the number of discovered causal-SNPs. As
a result, the research question (page 1) can be answered positively. The evidence
collected from the tested hypotheses supports the theory that LightGWAS is a
potential genome-wide association study method.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        This paper has proposed a novel genome-wide association study (GWAS)
procedure, named LightGWAS. It is a nonparametric machine learning (ML) method
based on the LightGBM framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. LightGWAS has been conceived as a
potential single, resilient, autonomous and scalable solution to address some
of the limitations found in the available state-of-the-art implementations for
GWAS. A literature review identified that the current GWAS implementations
rely on cumbersome manual quality control steps to address statistical problems,
such as controlling for false-positive inflation and power reduction. These
challenges increase as the data grows or becomes imbalanced. It also showed that they
demand a particular GWAS method for each type of genomic data structure,
which increases human dependency. In this research, the effectiveness of the
models implemented via the proposed LightGWAS procedure was assessed on
GWAS scenarios where the investigated phenotype is qualitative and the datasets contain
about five thousand samples of balanced (case : control = 1 : 1), imbalanced
(case : control = 1 : 10), and high-imbalanced (case : control = 1 : 100)
genomic data. Next, LightGWAS models were contrasted with those implemented
via the state-of-the-art for GWAS (PLINK2 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). This assessment was performed
through an empirical comparative experiment. A model selection based on
50-fold cross-validation singled out LightGWAS as the best choice in terms of mean
differences. The results from empirical statistical tests denoted that the
differences are statistically significant for imbalanced datasets.
      </p>
      <p>The main contribution of LightGWAS to genome-wide association study is
that it is based on a nonparametric machine learning approach, in contrast to the
state-of-the-art, which strongly relies on parametric statistical models. Therefore,
LightGWAS allows scalability and adaptability to the most diverse genomic data
morphology, which, in turn, reduces human dependency. It scales thanks to the
LightGBM framework, which is the state-of-the-art for gradient boosted decision
trees, capable of handling large and highly sparse datasets. LightGBM was created
to address classification or regression problems. In the LightGWAS
procedure, however, it is used to discover phenotype-causal single-nucleotide
polymorphisms (SNPs) by calculating the feature importance of a fitted model. Hence, this
research shows originality by taking a specific technique and adapting it to a new
domain of application. For all these reasons, LightGWAS is a new contribution
from data science towards the advancement of molecular biology.</p>
      <p>For future work, it is recommended to compare LightGWAS with the GWAS
procedures based on the linear mixed model, and on the scalable and accurate
implementation of generalized mixed model. Thus, the effectiveness of LightGWAS can
also be assessed against scenarios that go beyond the ones addressable through
general linear models. It would also be beneficial to use quantitative
phenotypes to verify that LightGWAS handles linear association models. Lastly, it
is recommended to develop a mechanism to identify causal-SNPs from the
gain or split scores of decision trees, as no p-values exist in such a context. It is
crucial to develop a system analogous to the cut-offs employed by the current
state-of-the-art regression models to filter causal-SNPs (a p-value threshold applied to each SNP).</p>
    </sec>
    <sec id="sec-6">
      <title>Appendices</title>
      <sec id="sec-6-1">
        <title>Appendix A: Datasets' phenotype ratios and variables of interest</title>
      </sec>
      <sec id="sec-6-2">
        <title>Appendix B: LightGBM hyperparameter values</title>
        <p>[Table 3: cross-validated optimal LightGBM hyperparameters per dataset group (table not reproduced here).]</p>
      </sec>
      <sec id="sec-6-3">
        <title>Appendix C: Confidence interval ranges and standard deviations</title>
        <p>[Table 4: confidence interval ranges and standard deviations for F1, recall, APS, ROC/AUC, accuracy and precision (table not reproduced here).]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bush</surname>
            ,
            <given-names>W.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Chapter 11: Genome-wide association studies</article-title>
          .
          <source>PLoS Computational Biology</source>
          <volume>8</volume>
          (
          <issue>12</issue>
          ),
          <elocation-id>e1002822</elocation-id>
          (Dec
          <year>2012</year>
          ). https://doi.org/10.1371/journal.pcbi.1002822
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishwaran</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Random forests for genomic data analysis</article-title>
          .
          <source>Genomics</source>
          <volume>99</volume>
          (
          <issue>6</issue>
          ),
          <fpage>323</fpage>
          –
          <lpage>329</lpage>
          (Jun
          <year>2012</year>
          ). https://doi.org/10.1016/j.ygeno.2012.04.003
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fadista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>The (in)famous GWAS p-value threshold revisited and updated for low-frequency variants</article-title>
          .
          <source>European Journal of Human Genetics</source>
          <volume>24</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1202</fpage>
          –
          <lpage>1205</lpage>
          (Jan
          <year>2016</year>
          ). https://doi.org/10.1038/ejhg.2015.269
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Farrell</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          :
          <article-title>Functional genomics and transcript profiling</article-title>
          .
          In:
          <source>RNA Methodologies</source>
          , pp.
          <fpage>685</fpage>
          –
          <lpage>695</lpage>
          .
          <publisher-name>Elsevier</publisher-name>
          (
          <year>2017</year>
          ). https://doi.org/10.1016/b978-0-12-804678-4.00024-5
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Golan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosset</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.Y.</given-names>
          </string-name>
          :
          <article-title>Mixed models for case-control genomewide association studies: Major challenges and partial solutions</article-title>
          .
          <source>In: Handbook of Statistical Methods for Case-Control Studies</source>
          , pp.
          <fpage>495</fpage>
          –
          <lpage>514</lpage>
          . Chapman and Hall/CRC (Jun
          <year>2018</year>
          ). https://doi.org/10.1201/9781315154084-27
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Greenland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations</article-title>
          .
          <source>European Journal of Epidemiology</source>
          <volume>31</volume>
          (
          <issue>4</issue>
          ),
          <fpage>337</fpage>
          –
          <lpage>350</lpage>
          (Apr
          <year>2016</year>
          ). https://doi.org/10.1007/s10654-016-0149-3
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loh</surname>
            ,
            <given-names>P.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bharadwaj</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pons</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guinan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lakhani</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kilty</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jelinsky</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Stepwise distributed open innovation contests for software development: Acceleration of genome-wide association analysis</article-title>
          .
          <source>GigaScience</source>
          <volume>6</volume>
          (
          <issue>5</issue>
          ) (Feb
          <year>2017</year>
          ). https://doi.org/10.1093/gigascience/gix009
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>A resource-efficient tool for mixed model association analysis of large-scale data</article-title>
          .
          <source>Nature Genetics</source>
          <volume>51</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1749</fpage>
          –
          <lpage>1755</lpage>
          (Nov
          <year>2019</year>
          ). https://doi.org/10.1038/s41588-019-0530-8
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ke</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>LightGBM: A highly efficient gradient boosting decision tree</article-title>
          . In:
          <string-name>
            <surname>Guyon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luxburg</surname>
            ,
            <given-names>U.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vishwanathan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garnett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          , pp.
          <fpage>3146</fpage>
          –
          <lpage>3154</lpage>
          . Curran Associates, Inc. (
          <year>2017</year>
          ), http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Control of population stratification by correlation-selected principal components</article-title>
          .
          <source>Biometrics</source>
          <volume>67</volume>
          (
          <issue>3</issue>
          ),
          <fpage>967</fpage>
          –
          <lpage>974</lpage>
          (Dec
          <year>2010</year>
          ). https://doi.org/10.1111/j.1541-0420.2010.01520.x
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Feature selection</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>50</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>45</lpage>
          (Jan
          <year>2018</year>
          ). https://doi.org/10.1145/3136625
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Loh</surname>
            ,
            <given-names>P.R.</given-names>
          </string-name>
          , et al.:
          <article-title>Mixed-model association for biobank-scale datasets</article-title>
          .
          <source>Nature Genetics</source>
          <volume>50</volume>
          (
          <issue>7</issue>
          ),
          <fpage>906</fpage>
          –
          <lpage>908</lpage>
          (Jun
          <year>2018</year>
          ). https://doi.org/10.1038/s41588-018-0144-6
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lubke</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Gradient boosting as a SNP filter: an evaluation using simulated and hair morphology data</article-title>
          .
          <source>Journal of Data Mining in Genomics &amp; Proteomics</source>
          <volume>04</volume>
          (
          <issue>04</issue>
          ) (
          <year>2013</year>
          ). https://doi.org/10.4172/2153-0602.1000143
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <article-title>Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants</article-title>
          .
          <source>Genetic Epidemiology</source>
          <volume>37</volume>
          (
          <issue>6</issue>
          ),
          <fpage>539</fpage>
          –
          <lpage>550</lpage>
          (Jun
          <year>2013</year>
          ). https://doi.org/10.1002/gepi.21742
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mills</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A scientometric review of genome-wide association studies</article-title>
          .
          <source>Communications Biology</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ), 9 (Jan
          <year>2019</year>
          ). https://doi.org/10.1038/s42003-018-0261-x
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A deep auto-encoder based LightGBM approach for network intrusion detection system</article-title>
          .
          <source>In: Proceedings of the International Conference on Advances in Computer Technology, Information Science and Communications</source>
          , pp.
          <fpage>142</fpage>
          –
          <lpage>147</lpage>
          . SCITEPRESS - Science and Technology Publications (
          <year>2019</year>
          ). https://doi.org/10.5220/0008098401420147
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pearson</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          :
          <article-title>How to interpret a genome-wide association study</article-title>
          .
          <source>JAMA</source>
          <volume>299</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1335</fpage>
          (Mar
          <year>2008</year>
          ). https://doi.org/10.1001/jama.299.11.1335
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Perez-Enciso</surname>
          </string-name>
          ,
          <string-name>
            <surname>Zingaretti</surname>
          </string-name>
          :
          <article-title>A guide for using deep learning for complex trait genomic prediction</article-title>
          .
          <source>Genes</source>
          <volume>10</volume>
          (
          <issue>7</issue>
          ),
          553 (Jul
          <year>2019</year>
          ). https://doi.org/10.3390/genes10070553
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Price</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          , et al.:
          <article-title>Principal components analysis corrects for stratification in genome-wide association studies</article-title>
          .
          <source>Nature Genetics</source>
          <volume>38</volume>
          (
          <issue>8</issue>
          ),
          <fpage>904</fpage>
          –
          <lpage>909</lpage>
          (Jul
          <year>2006</year>
          ). https://doi.org/10.1038/ng1847
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Purcell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>PLINK: A tool set for whole-genome association and population-based linkage analyses</article-title>
          .
          <source>The American Journal of Human Genetics</source>
          <volume>81</volume>
          (
          <issue>3</issue>
          ),
          <fpage>559</fpage>
          –
          <lpage>575</lpage>
          (Sep
          <year>2007</year>
          ). https://doi.org/10.1086/519795
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al.:
          <article-title>A guide to genome-wide association analysis and post-analytic interrogation</article-title>
          .
          <source>Statistics in Medicine</source>
          <volume>34</volume>
          (
          <issue>28</issue>
          ),
          <fpage>3769</fpage>
          –
          <lpage>3792</lpage>
          (Sep
          <year>2015</year>
          ). https://doi.org/10.1002/sim.6605
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Genome-wide association studies and the genetic dissection of complex traits</article-title>
          .
          <source>American Journal of Hematology</source>
          <volume>84</volume>
          (
          <issue>8</issue>
          ),
          <fpage>504</fpage>
          –
          <lpage>515</lpage>
          (Aug
          <year>2009</year>
          ). https://doi.org/10.1002/ajh.21440
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , et al.:
          <article-title>Prediction of double-high biochemical indicators based on LightGBM and XGBoost</article-title>
          .
          <source>In: Proceedings of the 2019 International Conference on Artificial Intelligence and Computer Science - AICS 2019</source>
          , pp.
          <fpage>189</fpage>
          –
          <lpage>193</lpage>
          . ACM Press (
          <year>2019</year>
          ). https://doi.org/10.1145/3349341.3349400
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Spencer</surname>
            ,
            <given-names>C.C.A.</given-names>
          </string-name>
          , et al.:
          <article-title>Designing genome-wide association studies: Sample size, power, imputation, and the choice of genotyping chip</article-title>
          .
          <source>PLoS Genetics</source>
          <volume>5</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>13</lpage>
          (May
          <year>2009</year>
          ). https://doi.org/10.1371/journal.pgen.1000477
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Power system transient stability assessment based on Bayesian optimized LightGBM</article-title>
          .
          <source>In: 2019 IEEE 3rd Conference on Energy Internet and Energy System Integration (EI2)</source>
          , pp.
          <fpage>263</fpage>
          –
          <lpage>268</lpage>
          . IEEE (Nov
          <year>2019</year>
          ). https://doi.org/10.1109/ei247390.2019.9062027
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , et al.:
          <article-title>Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies</article-title>
          .
          <source>Nature Genetics</source>
          <volume>50</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1335</fpage>
          –
          <lpage>1341</lpage>
          (Aug
          <year>2018</year>
          ). https://doi.org/10.1038/s41588-018-0184-y
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <source>2 2 ds1 1 ds1 10 ds1 100</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>colsample_bytree: 0.47328041, 0.47328041, 0.866621446; learning_rate: 0.03, 0.03, 0.01; max_depth: 1, 1, 6; min_child_samples: 147, 147, 454; min_child_weight: 1.0, 1.0, 1.0; min_split_gain: 0, 0, 0; n_estimators: 2000, 2000, 2000; num_leaves: 35, 35, 41; reg_alpha: 0.1, 0.1, 5; reg_lambda: 0.1, 0.1, 50; subsample: 0.995930118, 0.995930118, 0.820421212; subsample_for_bin: 200000, 200000, 200000</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>