
LightGWAS: A Novel Machine Learning Procedure for Genome-Wide Association Study

Bruno Am

rozio[

Longo[

s Rizzo[

School of Computer Science, Technological University Dublin, Ireland

This paper proposes a novel machine learning procedure for genome-wide association study (GWAS), named LightGWAS. It is based on the LightGBM framework and is designed as a single, resilient, autonomous and scalable solution to common limitations of GWAS implementations found in the literature. These include reliance on extensive manual quality control steps and on specific GWAS methods for each type of dataset morphology and size. In this research, LightGWAS has been contrasted against PLINK2, one of the current state-of-the-art GWAS implementations, based on a general linear model with support for Firth regularisation. The mean differences measured on standard classification metrics, extracted via quantitative empirical tests with k-fold cross-validation, indicated that LightGWAS outperforms PLINK2 for balanced, imbalanced, and high-imbalanced genomic datasets. Paired difference tests denoted statistical significance in the results extracted from the experiments with imbalanced datasets. This article contributes to the body of knowledge by presenting a potentially more efficient GWAS procedure based on nonparametric approaches. LightGWAS ensures adaptability with higher precision in the discovery of causal single-nucleotide polymorphisms, thanks to the leaf-wise tree growth algorithm offered by the state-of-the-art for gradient boosting decision trees. Control of false positives and statistical power are automatically addressed by the model's training process, which significantly reduces human dependency during the study design.

Keywords: LightGWAS · LightGBM · genome-wide association study

There are many methods to discover causal-SNPs, including genome-wide association study (GWAS). GWAS implementations calculate the association between each SNP and the underlying phenotype through a statistical model. Therefore, GWAS is roughly analogous to, or a type of, feature selection: each SNP is a feature (independent variable), and the phenotype is the class (target, or dependent variable). The features identified as the best predictors of the class are the potential causal-SNPs.
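As a minimal illustration of the "one association test per SNP" idea, the sketch below scores each SNP with a classical allelic chi-square test and ranks the SNPs by p-value. It is not the model used by PLINK2 or by LightGWAS; the function name, the SNP identifiers and the toy genotypes are hypothetical and serve only to show the feature-selection analogy.

```python
from math import erfc, sqrt


def allelic_chi_square(case_genotypes, control_genotypes):
    """Chi-square test (1 d.f.) on the 2x2 table of allele counts.

    Genotypes are coded 0, 1 or 2 (copies of the minor allele).
    Returns (statistic, p_value).
    """
    a = sum(case_genotypes)                 # minor alleles in cases
    b = 2 * len(case_genotypes) - a         # major alleles in cases
    c = sum(control_genotypes)              # minor alleles in controls
    d = 2 * len(control_genotypes) - c      # major alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                          # monomorphic SNP or empty cohort
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


# Toy data (hypothetical): (case genotypes, control genotypes) for two SNPs.
snps = {
    "snp_a": ([2, 2, 1, 2, 1, 2, 1, 2], [0, 0, 1, 0, 0, 1, 0, 0]),
    "snp_b": ([0, 1, 2, 1, 0, 1, 2, 1], [0, 1, 2, 1, 0, 1, 2, 1]),
}
results = {name: allelic_chi_square(cases, controls)
           for name, (cases, controls) in snps.items()}

# SNPs ordered from strongest to weakest association with the phenotype.
ranked = sorted(results, key=lambda name: results[name][1])
```

Here `snp_a`, whose minor allele is concentrated in the cases, is ranked ahead of `snp_b`, whose genotype distribution is identical in both cohorts.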

Statistical regression models represent the state-of-the-art for GWAS. Despite their efficiency, some prominent problems have become unavoidable in recent years. These include the reduction of DNA sequencing costs [18], which in turn allowed an exponential growth of data; the expansion of SNP datasets, which has contributed to overwhelming sparsity, with millions of SNPs and few patients [13]; and highly dispersed (or high-dimensional) datasets, which compromise the approaches available for GWAS as they are derived from linear (parametric) models [11]. Another point of concern emerges with imbalanced ratios of rare cases to many controls. Such a scenario tends to inflate false positives when data is analysed by regression over qualitative features [26]. Nowadays, these obstacles are addressed via several manual quality control steps to increase statistical power and avoid type 1 errors [10, 19, 21, 22, 24]. However, as data grows, so does the dependency on manual intervention, which leaves room for human mistakes and compromises the scalability of the study.

To address the aforementioned gaps, this paper proposes a novel procedure for GWAS built on decision trees (DT) enhanced by gradient boosting machine (GBM), whose implementation comes from the LightGBM framework [9]. It ensures adaptability to the most diverse genomic data structures by controlling bias and variance during the training process. Consequently, it improves precision independently of human intervention. This procedure has been named LightGWAS. This work therefore attempts to answer the following research question:

– Can LightGWAS be an alternative method to the state-of-the-art for genome-wide association studies based upon general linear models, by increasing statistical power on causal-SNP detection and reducing the number of manual quality control steps?

The research goals of this paper are: (a) to evaluate whether LightGWAS is a suitable GWAS method for qualitative phenotypes, according to a set of common metrics for classification problems; and (b) to assess whether LightGWAS outperforms the available state-of-the-art for GWAS in terms of statistical power and precision. The remainder of this paper is organised as follows: Section 2 reviews research on the state-of-the-art for genome-wide association studies, along with an overview of the LightGBM framework. Section 3 introduces the design and a set of hypotheses for answering the research question. Section 4 presents the results with a discussion. Lastly, Section 5 concludes the study, highlighting its contributions and possible future work.

2 Literature review and related work

A genome-wide association study (GWAS) is a discovery-driven research technique to catalogue single-nucleotide polymorphisms (SNPs) across populations and to identify genetic markers associated with traits [1, 4]. Since the completion of the human genome sequence in 2003, about 3,700 GWASs have contributed to discovering thousands of genetic risk causal-SNPs and their biological functions [17, 15]. The state-of-the-art GWAS methods are based on three exclusively statistical association models: general linear model (GLM), linear mixed model (LMM), and scalable and accurate implementation of generalized mixed model (SAIGE) [12]. Their applicability depends on the phenotype type, sample size, and cohort distribution of the genomic dataset at hand. According to [12], the following criteria should be considered to select the appropriate model: (a) GLM implementations for quantitative traits with up to five thousand samples. If the phenotype is qualitative, the logistic regression implementation should include Firth regularisation to minimise the fitting errors caused by the categorical class whenever its frequency is lower than 400 [14]; (b) LMM implementations for datasets larger than five thousand samples and quantitative traits. If the phenotype is qualitative, the dataset should follow a normal distribution; or (c) SAIGE [26] for qualitative traits with a high-imbalanced case-control ratio. Regardless of the chosen method, principal component analysis should also be employed. It helps to filter SNPs whose association might be caused by the structure of the population (confounding due to ancestry) rather than by the investigated phenotype [19]. Usually, the first ten principal components are arbitrarily included as covariates in the association model [19, 2]. The GWAS outcome is a list of potential causal-SNPs.

This paper proposes a method for GWAS based on LightGBM [9], a gradient boosted decision trees (GBDT) framework built upon histogram algorithms. LightGBM grows trees leaf-wise and uses gradient-based one-side sampling (GOSS) to downsample data and exclusive feature bundling (EFB) to reduce feature dimension. To address the GBDT problems of high computational complexity caused by abundant data, GOSS retains the samples with large gradients, randomly selects among those with small gradients, and assigns a constant weight to the latter. The algorithm concentrates on undertrained samples without altering the distribution of the raw data. EFB, in turn, is a feature extraction technique based on the graph coloring problem, which also contributes to reducing the histogram building complexity. It deals with the sparsity of the data by grouping many sparse independent variables into dense features, avoiding unnecessary computation on entries that do not account for the outcome variable. Within the proposed GWAS solution, LightGBM discovers causal-SNPs by calculating the model's feature importance. Each SNP is, in fact, an independent variable of the model. Hence, the list of features that best explain the dependent variable (phenotype) contains the sought SNPs. Considering the LightGBM framework design and the strong evidence of its ability to address problems involving high-sparse data over big datasets [16, 25, 23], this article works upon the idea that such a framework is also a potential core engine for GWAS.
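The GOSS step described above can be sketched in a few lines. This is a simplified, standard-library illustration of the sampling rule from the LightGBM paper [9], not the framework's actual implementation; the function name `goss_sample` and the default rates are assumptions chosen for the example.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample_index: weight} following the GOSS sampling rule.

    Keeps the `top_rate` fraction of samples with the largest absolute
    gradients (weight 1.0) and randomly draws an `other_rate` fraction of
    all samples from the remainder, up-weighted by (1 - top_rate) / other_rate
    so that their total contribution is approximately preserved.
    """
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = rng.sample(rest, min(n_other, len(rest)))
    small_weight = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: small_weight for i in sampled})
    return weights


# Toy gradients (hypothetical): sample i has gradient 0.01 * i.
gradients = [0.01 * i for i in range(100)]
weights = goss_sample(gradients)
```

With 100 samples and these rates, the 20 samples with the largest gradients are always kept, and 10 of the remaining 80 are drawn at random and weighted by 8.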

3 Experiment design and methodology

LightGWAS is designed to be a GWAS procedure based on a nonparametric machine learning method. The solution consists of a GBDT algorithm implemented by the LightGBM framework [9]. It is fitted with the SNPs as independent variables and the phenotype as the class. The causal-SNPs are then retrieved by calculating the model's feature importance. To answer the research question of this paper, an experiment involving three datasets, two feature selector models and a predicting model is conducted. Fig. 1 depicts the experiment design in four steps, followed by the evaluation strategy applied.
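To make the idea of retrieving SNPs through feature importance more concrete, the following sketch ranks SNPs by the impurity reduction of a single depth-1 split (a decision stump). It is only an analogy: LightGBM computes gain from gradient statistics over many boosted trees, whereas Gini impurity is used here as a stand-in, and all names and toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = control, 1 = case)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(genotypes, phenotype):
    """Largest impurity reduction over the two possible genotype thresholds."""
    parent = gini(phenotype)
    n = len(phenotype)
    best = 0.0
    for threshold in (0, 1):  # genotypes are coded 0, 1 or 2
        left = [y for g, y in zip(genotypes, phenotype) if g <= threshold]
        right = [y for g, y in zip(genotypes, phenotype) if g > threshold]
        if not left or not right:
            continue
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - children)
    return best


def rank_snps(snp_columns, phenotype):
    """Return (snp, gain) pairs with non-zero gain, highest gain first."""
    gains = {name: best_split_gain(col, phenotype)
             for name, col in snp_columns.items()}
    selected = [(name, gain) for name, gain in gains.items() if gain > 0]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): four cases followed by four controls.
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
snp_columns = {
    "rs_causal": [2, 1, 2, 1, 0, 0, 0, 0],
    "rs_noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
ranking = rank_snps(snp_columns, phenotype)
```

Only `rs_causal` separates cases from controls, so it is the only SNP retained; `rs_noise` yields zero gain and is dropped, mirroring how features never used for a split receive no importance.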

The datasets (Fig. 1A) contain the same number of SNPs each, but vary in the phenotype balance ratio of cases:controls: 1:1, 1:10, and 1:100. The first two models (Fig. 1B) are GWAS procedures. One of them is the novelty behind this paper, LightGWAS. The other is PLINK2 [7], one of the state-of-the-art implementations for GWAS in contexts where GLM is required. Therefore, six causal-SNP result sets are generated from them. The third model (Fig. 1C), referred to as the common classifier from now on, is a logistic regression. It employs the k-fold cross-validation model selection technique, with k arbitrarily set to 50. A value higher than 30 was necessary to perform statistically significant comparisons across the resulting sets. The common classifier is fitted once with the causal-SNPs discovered by LightGWAS as independent variables and another time with the causal-SNPs retrieved with PLINK2. The dependent variable, in both circumstances, is the underlying phenotype of the datasets. Therefore, its output is a paired set of classification metrics, one for each of the GWAS methods. Lastly (Fig. 1D), the metrics extracted with the cross-validation are evaluated in terms of the statistical significance of the differences among them. The evaluated alternative hypotheses are:

– H1: LightGWAS outperforms GLM based on logistic regression with Firth regularisation for GWAS, across genomic datasets of balanced qualitative phenotypes (case : control = 1 : 1), in terms of accuracy, precision, F1 score, and ROC/AUC.

– H2: LightGWAS outperforms GLM based on logistic regression with Firth regularisation for GWAS, across genomic datasets of imbalanced qualitative phenotypes (case : control = 1 : 10), in terms of precision, F1 score, and ROC/AUC.

– H3: LightGWAS outperforms GLM based on logistic regression with Firth regularisation for GWAS, across genomic datasets of high-imbalanced qualitative phenotypes (case : control = 1 : 100), in terms of precision, F1 score, and ROC/AUC.
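H2 and H3 do not list accuracy among their metrics. A plausible reason, which is an assumption here since the text does not state it, is that accuracy is uninformative under strong class imbalance. The sketch below computes the standard metrics from confusion-matrix counts and applies them to a hypothetical classifier that labels every sample of a 1:100 dataset (50 cases, 5000 controls) as a control.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical "always control" classifier on 50 cases and 5000 controls.
trivial = classification_metrics(tp=0, fp=0, fn=50, tn=5000)
```

The trivial classifier reaches an accuracy of 5000/5050 (about 0.99) while its precision, recall and F1 are all zero, which is why the class-sensitive metrics carry the comparison on imbalanced data.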

3.1 Datasets

A GWAS relies on two different data groups: the genomic data that contains the DNA variants, and the traits to be associated with the SNPs between the case and control cohorts. Usually, the traits to be investigated are human phenotypes, such as disease status, that can be retrieved from the patients' electronic health records (EHR) [26]. In this article, the selected datasets are fully synthetic in both genomic and phenotype data. Simulations have been used so that the causal-SNPs expected to be exposed by each of the evaluated GWAS models are known accurately, which is paramount to compare them correctly. Dataset simulation for the validation of GWAS methods is a prevalent practice and can be observed in many studies, such as [5, 7, 8, 14, 26]. Accordingly, six datasets have been created, combined into three data groups of class (phenotype status) distribution: balanced, imbalanced, and high-imbalanced data. They have been named ds1_1, ds1_10 and ds1_100, respectively. The number of samples (fictitious patients) in each of the datasets follows this pattern: ds1_1: case:control = 1:1 = 2500:2500, N = 5000; ds1_10: case:control = 1:10 = 400:4000, N = 4400; ds1_100: case:control = 1:100 = 50:5000, N = 5050. The datasets were produced using the PLINK SNP simulation tool¹. Each sample has a phenotype status class (case or control) and 10100 numeric features (each feature is a SNP). Further details about the variables of interest, along with the parameters set to simulate the datasets, can be found in Appendix A (table 2, page 12).
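The three dataset groups can be summarised as a small specification, which also makes the sample counts and ratios easy to check. The class and attribute names below are hypothetical; only the numbers come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Shape of one simulated case-control dataset."""
    name: str
    cases: int
    controls: int
    n_snps: int = 10100  # numeric features per sample, one per SNP

    @property
    def n_samples(self):
        return self.cases + self.controls

    @property
    def controls_per_case(self):
        return self.controls / self.cases


SPECS = [
    DatasetSpec("ds1_1", cases=2500, controls=2500),    # balanced
    DatasetSpec("ds1_10", cases=400, controls=4000),    # imbalanced
    DatasetSpec("ds1_100", cases=50, controls=5000),    # high-imbalanced
]

sample_counts = [spec.n_samples for spec in SPECS]
ratios = [spec.controls_per_case for spec in SPECS]
```

Note that with 50-fold cross-validation on ds1_100, the 50 cases leave on average a single case per fold, so how the folds are drawn matters for the class-sensitive metrics.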

3.2 Procedure

The complete procedure to accomplish the objectives and test the alternative hypotheses includes seven steps:

1. Simulation of datasets, as outlined above in section 3.1.

2. LightGWAS implementation. It is composed of a GBDT implementation called LightGBM. The hyperparameters are tuned through 200 iterations of randomised 5-fold cross-validation search. Table 3 in Appendix B (page 12) contains the cross-validated optimal hyperparameters selected for each dataset group.

3. Discovery of the causal-SNPs across the aforementioned datasets by employing LightGWAS and PLINK2. Therefore, two sets of causal-SNPs (one per GWAS method) are generated for each dataset. PLINK's outcome is a set of SNPs accompanied by their p-values. The causal-SNPs are filtered by assuming a cut-off ($\alpha$) for such a p-value. For the datasets ds1_1 and ds1_10, the cut-off $\alpha = 5 \times 10^{-8}$ is assumed, as per genome-wide association study convention [3]. In turn, for the dataset ds1_100, the cut-off is $\alpha = 5 \times 10^{-4}$ because no SNP was selected with the first one. This decision has been grounded on [14]. In contrast, LightGWAS selects each SNP with the gain or split score of the decision trees. Therefore, the list of feature importances from the LightGBM framework is the set of causal-SNPs retrieved with LightGWAS.

4. Evaluation of the GWAS models. In order to compare how effective LightGWAS is in comparison to PLINK, the common classifier is employed. It is a logistic regression executed through 50-fold cross-validation for model selection, which is fitted under two conditions: one with the causal-SNPs collected via LightGWAS as features, and another with the causal-SNPs selected via PLINK. The class (or target), for both scenarios, is the phenotype variable. Therefore, the common classifier output is a separate dataset with 50 result samples per GWAS model. The following metrics have been evaluated: harmonic mean of the precision and recall (F1), recall, average precision score (APS), receiver operating characteristic (ROC)/area under the curve (AUC), accuracy, and precision.

5. The confidence intervals (CI) of the metrics' result sets are calculated through 5000 bootstraps with $\alpha = 0.05$. The subsamples (resampling with replacement) are sized at 50% ($N \times 0.5$). Therefore, there is a 95% likelihood that the reported lower limit (LL) and upper limit (UL) represent the confidence intervals of the true metrics' performances.

6. Paired difference tests are employed to measure how significant the observed differences are in each metric pair. The dependent (paired) sample Student's t-test is applied to the metric pairs that follow a normal distribution, and the Wilcoxon signed-rank test otherwise. Whether a metric (variable of the results dataset) follows a Gaussian distribution is assessed with D'Agostino's $K^2$ normality test. Whenever a sample does not reach the levels of a normal distribution, a power transformation through Box-Cox is attempted first, before assuming nonparametric approaches.

7. The effect of the observed mean differences is calculated through Cohen's d when the parametric test has been used, and through the Wilcoxon r score otherwise.

¹ http://zzz.bwh.harvard.edu/plink/simulate.shtml
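Steps 5 and 7 can be sketched with the standard library alone. The text does not say which bootstrap interval or which variant of Cohen's d was used, so the percentile interval of the mean and the paired-samples form (mean difference divided by the standard deviation of the differences) are assumptions here, as are the function names and the toy fold scores.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=5000, frac=0.5, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Each bootstrap sample is drawn with replacement and has size
    len(values) * frac (50% of N by default).
    """
    rng = random.Random(seed)
    size = max(1, int(len(values) * frac))
    means = sorted(
        statistics.fmean(rng.choices(values, k=size))
        for _ in range(n_boot)
    )
    lower = means[round(alpha / 2 * n_boot)]
    upper = means[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def paired_cohens_d(a, b):
    """Cohen's d for paired samples: mean difference over SD of differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Toy per-fold F1 scores for 50 folds (hypothetical values).
lightgwas_f1 = [0.80 + 0.001 * i for i in range(50)]
plink_f1 = [0.79 + 0.001 * i + (0.004 if i % 2 else 0.0) for i in range(50)]

ll, ul = bootstrap_ci(lightgwas_f1)           # lower and upper limits
effect = paired_cohens_d(lightgwas_f1, plink_f1)
```

The fixed `seed` only makes the sketch reproducible. In practice the choice between this effect size and the Wilcoxon r score would follow the normality check described in step 6.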

4 Results and evaluation

The consolidated results are presented in table 1. The CI ranges along with the standard deviation (SD) of each metric are reported in Appendix C (table 4, page 12). The statistical report below is separated by dataset group and is drawn from the consolidated result sets disclosed in table 1.

Dataset ds1_1: LightGWAS slightly outperformed PLINK on the metrics F1, recall, and ROC/AUC, while PLINK outperformed LightGWAS on APS and precision. Both models reached the same mean accuracy, hence a mean absolute difference (MD) of zero. The t-tests indicated no statistical significance at $\alpha = 0.05$ for any of the measured metrics. The standardised difference between the means resulted in a small effect for all of the metrics ($d < 0.5$). In terms of causal-SNP selection, LightGWAS selected 86 SNPs, while PLINK selected 90. PLINK picked all the SNPs selected by LightGWAS, plus four other causal-SNPs.

Dataset ds1_10: LightGWAS slightly outperformed PLINK on every measured metric. The t-tests indicated statistical significance at $\alpha = 0.05$ for both F1 and accuracy, with a small effect ($d < 0.5$). The Wilcoxon test indicated statistical significance at $\alpha = 0.01$ and a large effect ($r \geq 0.8$) for APS, ROC/AUC and precision. No statistical significance at $\alpha = 0.05$ has been observed for recall, although the observed difference had a large effect ($r \geq 0.8$). In terms of causal-SNP selection, LightGWAS selected 80 SNPs, while PLINK selected 76. LightGWAS picked all the SNPs selected by PLINK, plus four other causal-SNPs.

Dataset ds1_100: LightGWAS slightly outperformed PLINK on every measured metric. The Wilcoxon test indicated no statistical significance at $\alpha = 0.05$ for any of them. However, a medium effect ($0.5 \leq r < 0.8$) has been observed for recall, and a large effect ($r \geq 0.8$) for all the other metrics. In terms of causal-SNP selection, LightGWAS selected 28 SNPs, while PLINK selected 19. LightGWAS picked 14 SNPs missed by PLINK, and PLINK, in turn, selected 5 SNPs missed by LightGWAS.
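The selection counts reported for the high-imbalanced dataset imply that 14 SNPs were chosen by both methods (28 − 14 = 19 − 5 = 14). The sketch below reproduces that bookkeeping with set operations; the SNP identifiers are hypothetical placeholders, since the actual SNP lists are not given in the text.

```python
def overlap_summary(lightgwas_snps, plink_snps):
    """Count SNPs shared by both methods and SNPs unique to each one."""
    a, b = set(lightgwas_snps), set(plink_snps)
    return {
        "shared": len(a & b),
        "only_lightgwas": len(a - b),
        "only_plink": len(b - a),
    }


# Placeholder identifiers reproducing the reported set sizes.
shared = {f"snp_{i}" for i in range(14)}
lightgwas_selected = shared | {f"snp_l{i}" for i in range(14)}
plink_selected = shared | {f"snp_p{i}" for i in range(5)}

summary = overlap_summary(lightgwas_selected, plink_selected)
```

The same function applied to the other two datasets would return zero for one of the "only" counts, since there one method's selection is a superset of the other's.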

4.2 Discussion

The models implemented through LightGWAS performed as well as PLINK for GWAS over the balanced dataset. The paired difference tests disclosed that none of the measured differences is statistically significant at the cut-off $\alpha = 0.05$. Also, Cohen's d indicated a small standardised effect between the means of all the paired metrics. Consequently, the alternative hypothesis H1 had to be rejected, as LightGWAS neither outperformed nor underperformed PLINK with statistical significance on such a dataset.

The experiments involving the imbalanced dataset brought evidence that supports accepting the alternative hypothesis H2. LightGWAS outperformed PLINK in such a scenario. Although recall did not reach statistical significance at $\alpha = 0.05$ (and is therefore considered as good as PLINK's), all the other metrics had relevant results at $\alpha = 0.01$ (F1 and accuracy at $\alpha = 0.05$). Furthermore, the metrics measured through nonparametric tests (recall, APS, ROC/AUC and precision) resulted in a large effect ($r \geq 0.8$).

The alternative hypothesis H3 was rejected. Although LightGWAS outperformed PLINK with a medium effect for recall ($0.5 \leq r < 0.8$) and a large effect for the other metrics ($r \geq 0.8$) on the high-imbalanced dataset, none of the results reached statistical significance at $\alpha = 0.05$.

Considering exclusively the k-fold cross-validation model selection results (observed differences in the means), the models implemented via the proposed LightGWAS procedure outperformed those implemented with PLINK in the three evaluated scenarios. However, when the statistical analysis of the differences between metric pairs is taken into consideration, this result holds with statistical significance only in the experiments involving the imbalanced dataset. Nonetheless, according to [6], it is important to note that statistical significance should not be the exclusive criterion for rejecting the relevance of a model. The scientific perspective (or significance) of the underlying problem should also be taken into consideration. Genome-wide association study plays an essential role in identifying causal anomalies across DNA, and any improvement over a method, be it statistically significant or not, should be accounted for. Hence, although some of the measured metrics did not reach statistical significance (leading to the rejection of the alternative hypotheses H1 and H3), they prove to be scientifically meaningful through their effect differences and the number of discovered causal-SNPs. As a result, the research question (page 1) can be answered positively. The evidence collected from the tested hypotheses supports the theory that LightGWAS is a potential genome-wide association study method.

5 Conclusion

This paper has proposed a novel genome-wide association study (GWAS) procedure, named LightGWAS. It is a nonparametric machine learning (ML) method based on the LightGBM framework [9]. LightGWAS has been conceived as a potential single, resilient, autonomous and scalable solution to address some of the limitations found in the available state-of-the-art implementations for GWAS. A literature review identified that the current GWAS implementations rely on cumbersome manual quality control steps to address statistical problems, such as controlling for false-positive inflation and power reduction. These challenges increase as the data grows or becomes imbalanced. The review also showed that they demand a particular GWAS method for each type of genomic data structure, which increases human dependency. In this research, the effectiveness of the models implemented via the proposed LightGWAS procedure was assessed on GWAS scenarios where the investigated phenotype is qualitative and the datasets contain about five thousand samples of balanced (case : control = 1 : 1), imbalanced (case : control = 1 : 10), and high-imbalanced (case : control = 1 : 100) genomic data. Next, LightGWAS models were contrasted with those implemented via the state-of-the-art for GWAS (PLINK2 [7]). This assessment was performed through an empirical comparative experiment. A model selection based on 50-fold cross-validation singled out LightGWAS as the best choice in terms of mean differences. The results from empirical statistical tests denoted that the differences are statistically significant in the context of imbalanced datasets.

The main contribution of LightGWAS to genome-wide association study is the fact that it is based on a nonparametric machine learning approach, in contrast to the state-of-the-art, which strongly relies on parametric statistical models. Therefore, LightGWAS allows scalability and adaptability to the most diverse genomic data morphologies, which, in turn, reduces human dependency. It scales thanks to the LightGBM framework, which is the state-of-the-art for gradient boosted decision trees, capable of handling large and high-sparse datasets. LightGBM was created to address classification or regression problems. Still, in the LightGWAS procedure, it is used to discover the causal single-nucleotide polymorphisms (SNPs) of a phenotype by calculating the feature importance of a fitted model. Hence, this research shows originality by taking a specific technique and adapting it to a new domain of application. For all these reasons, LightGWAS is a new contribution from data science towards the advancement of molecular biology.

For future work, it is recommended to compare LightGWAS with the GWAS procedures based on linear mixed model, and scalable and accurate implementation of generalized mixed model. Thus, the effectiveness of LightGWAS can also be assessed against scenarios that go beyond the ones addressable through general linear models. It would also be beneficial to use quantitative phenotypes to verify that LightGWAS handles linear association models. Lastly, the development of a mechanism to identify causal-SNPs from decision trees' gain or split scores is recommended, as no p-values exist in such a context. It is crucial to develop a system analogous to the cut-offs employed by the current state-of-the-art regression models to filter causal-SNPs ($p \leq \alpha$ for each SNP).

Appendices

Appendix A: Datasets' phenotype ratios and variables of interest

Appendix B: LightGBM hyperparameter values


Appendix C: Confidence interval ranges and standard deviations


1. Bush, W.S., Moore, J.H.: Chapter 11: Genome-wide association studies. PLoS Computational Biology 8(12), e1002822 (Dec 2012). https://doi.org/10.1371/journal.pcbi.1002822

2. Chen, X., Ishwaran, H.: Random forests for genomic data analysis. Genomics 99(6), 323–329 (Jun 2012). https://doi.org/10.1016/j.ygeno.2012.04.003

3. Fadista, J., et al.: The (in)famous GWAS p-value threshold revisited and updated for low-frequency variants. European Journal of Human Genetics 24(8), 1202–1205 (Jan 2016). https://doi.org/10.1038/ejhg.2015.269

4. Farrell, R.E.: Functional genomics and transcript profiling. In: RNA Methodologies, pp. 685–695. Elsevier (2017). https://doi.org/10.1016/b978-0-12-804678-4.00024-5

5. Golan, D., Rosset, S., Lin, D.Y.: Mixed models for case-control genome-wide association studies: Major challenges and partial solutions. In: Handbook of Statistical Methods for Case-Control Studies, pp. 495–514. Chapman and Hall/CRC (Jun 2018). https://doi.org/10.1201/9781315154084-27

6. Greenland, S., et al.: Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31(4), 337–350 (Apr 2016). https://doi.org/10.1007/s10654-016-0149-3

7. Hill, A., Loh, P.R., Bharadwaj, R.B., Pons, P., Shang, J., Guinan, E., Lakhani, K., Kilty, I., Jelinsky, S.A.: Stepwise distributed open innovation contests for software development: Acceleration of genome-wide association analysis. GigaScience 6(5) (Feb 2017). https://doi.org/10.1093/gigascience/gix009

8. Jiang, L., et al.: A resource-efficient tool for mixed model association analysis of large-scale data. Nature Genetics 51(12), 1749–1755 (Nov 2019). https://doi.org/10.1038/s41588-019-0530-8

9. Ke, G., et al.: LightGBM: A highly efficient gradient boosting decision tree. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 3146–3154. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf

10. Lee, S., Wright, F.A., Zou, F.: Control of population stratification by correlation-selected principal components. Biometrics 67(3), 967–974 (Dec 2010). https://doi.org/10.1111/j.1541-0420.2010.01520.x

11. Li, J., et al.: Feature selection. ACM Computing Surveys 50(6), 1–45 (Jan 2018). https://doi.org/10.1145/3136625

12. Loh, P.R., et al.: Mixed-model association for biobank-scale datasets. Nature Genetics 50(7), 906–908 (Jun 2018). https://doi.org/10.1038/s41588-018-0144-6

13. Lubke, G., et al.: Gradient boosting as a SNP filter: an evaluation using simulated and hair morphology data. Journal of Data Mining in Genomics & Proteomics 04(04) (2013). https://doi.org/10.4172/2153-0602.1000143

14. Ma, C., et al.: Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genetic Epidemiology 37(6), 539–550 (Jun 2013). https://doi.org/10.1002/gepi.21742

15. Mills, M.C., Rahal, C.: A scientometric review of genome-wide association studies. Communications Biology 2(1), 9 (Jan 2019). https://doi.org/10.1038/s42003-018-0261-x

16. Mo, K., Li, J.: A deep auto-encoder based LightGBM approach for network intrusion detection system. In: Proceedings of the International Conference on Advances in Computer Technology, Information Science and Communications, pp. 142–147. SCITEPRESS - Science and Technology Publications (2019). https://doi.org/10.5220/0008098401420147

17. Pearson, T.A.: How to interpret a genome-wide association study. JAMA 299(11), 1335 (Mar 2008). https://doi.org/10.1001/jama.299.11.1335

18. Perez-Enciso, Zingaretti: A guide for using deep learning for complex trait genomic prediction. Genes 10(7), 553 (Jul 2019). https://doi.org/10.3390/genes10070553

19. Price, A.L., et al.: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909 (Jul 2006). https://doi.org/10.1038/ng1847

20. Purcell, S., et al.: PLINK: A tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3), 559–575 (Sep 2007). https://doi.org/10.1086/519795

21. Reed, E., et al.: A guide to genome-wide association analysis and post-analytic interrogation. Statistics in Medicine 34(28), 3769–3792 (Sep 2015). https://doi.org/10.1002/sim.6605

22. Sebastiani, P., et al.: Genome-wide association studies and the genetic dissection of complex traits. American Journal of Hematology 84(8), 504–515 (Aug 2009). https://doi.org/10.1002/ajh.21440

23. Song, Y., et al.: Prediction of double-high biochemical indicators based on LightGBM and XGBoost. In: Proceedings of the 2019 International Conference on Artificial Intelligence and Computer Science - AICS 2019, pp. 189–193. ACM Press (2019). https://doi.org/10.1145/3349341.3349400

24. Spencer, C.C.A., et al.: Designing genome-wide association studies: Sample size, power, imputation, and the choice of genotyping chip. PLoS Genetics 5(5), 1–13 (May 2009). https://doi.org/10.1371/journal.pgen.1000477

25. Wang, R., et al.: Power system transient stability assessment based on bayesian optimized LightGBM. In: 2019 IEEE 3rd Conference on Energy Internet and Energy System Integration (EI2), pp. 263–268. IEEE (Nov 2019). https://doi.org/10.1109/ei247390.2019.9062027

26. Zhou, W., et al.: Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature Genetics 50(9), 1335–1341 (Aug 2018). https://doi.org/10.1038/s41588-018-0184-y

Table 3 (Appendix B): LightGBM hyperparameter values per dataset

Hyperparameter       ds1_1          ds1_10         ds1_100
colsample_bytree     0.47328041     0.47328041     0.866621446
learning_rate        0.03           0.03           0.01
max_depth            1              1              6
min_child_samples    147            147            454
min_child_weight     1.0            1.0            1.0
min_split_gain       0              0              0
n_estimators         2000           2000           2000
num_leaves           35             35             41
reg_alpha            0.1            0.1            5
reg_lambda           0.1            0.1            50
subsample            0.995930118    0.995930118    0.820421212
subsample_for_bin    200000         200000         200000
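For reuse, the Appendix B values can be written as a plain Python mapping. The underscores in the parameter names are restored on the assumption that they correspond to the scikit-learn-style parameters of LightGBM's `LGBMClassifier`; the constant name and the commented usage line are illustrative and not taken from the paper.

```python
# Hyperparameters selected for the balanced dataset (ds1_1).
_BALANCED = {
    "colsample_bytree": 0.47328041,
    "learning_rate": 0.03,
    "max_depth": 1,
    "min_child_samples": 147,
    "min_child_weight": 1.0,
    "min_split_gain": 0,
    "n_estimators": 2000,
    "num_leaves": 35,
    "reg_alpha": 0.1,
    "reg_lambda": 0.1,
    "subsample": 0.995930118,
    "subsample_for_bin": 200000,
}

LIGHTGWAS_PARAMS = {
    "ds1_1": dict(_BALANCED),
    "ds1_10": dict(_BALANCED),  # the table reports identical values for ds1_10
    "ds1_100": {
        "colsample_bytree": 0.866621446,
        "learning_rate": 0.01,
        "max_depth": 6,
        "min_child_samples": 454,
        "min_child_weight": 1.0,
        "min_split_gain": 0,
        "n_estimators": 2000,
        "num_leaves": 41,
        "reg_alpha": 5,
        "reg_lambda": 50,
        "subsample": 0.820421212,
        "subsample_for_bin": 200000,
    },
}

# Assumed usage, if the lightgbm package is installed:
# model = lightgbm.LGBMClassifier(**LIGHTGWAS_PARAMS["ds1_100"])
```

The high-imbalanced dataset stands out with much stronger regularisation (`reg_alpha`, `reg_lambda`) and deeper trees than the other two.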