LightGWAS: A Novel Machine Learning Procedure for Genome-Wide Association Study Bruno Ambrozio[0000−0002−6180−6986] , Luca Longo[0000−0002−2718−5426] , and Lucas Rizzo[0000−0001−9805−5306] School of Computer Science, Technological University Dublin, Ireland d16128063@mytudublin.ie,{luca.longo,lucas.rizzo}@tudublin.ie Abstract. This paper proposes a novel machine learning procedure for genome-wide association study (GWAS), named LightGWAS. It is based on the LightGBM framework, in addition to being a single, resilient, au- tonomous and scalable solution to address common limitations of GWAS implementations found in the literature. These include reliance on mas- sive manual quality control steps and specific GWAS methods for each type of dataset morphology and size. Through this research, LightG- WAS has been contrasted against PLINK2, one of the current state-of- the-art for GWAS implementations based on general linear model with support to firth regularisation. The mean differences measured upon standard classification metrics, extracted via quantitative empirical tests through k-fold cross-validation technique, indicated that LightGWAS outperforms PLINK2 for balanced, imbalanced, and high-imbalanced ge- nomic datasets. Paired difference tests denoted statistical significance in the results extracted from the experiments with imbalanced datasets. This article contributes to the body of knowledge by presenting a po- tentially more efficient GWAS procedure based on nonparametric ap- proaches. LightGWAS ensures adaptability with higher precision in the discovery of causal single-nucleotide polymorphisms, thanks to the leaf- wise tree growth algorithm offered by the state-of-the-art for gradient boosting decision trees. Control for false-positives and statistical power are automatically addressed by the model’s training process, which sig- nificative reduces human dependency during the study design. Keywords: LightGWAS, LightGBM, genome-wide association study. 1 Introduction The most common type of genetic variant among humans’ DNA is the single- nucleotide polymorphism (SNP) [22]. SNPs are responsible for phenotypes: ob- servable characteristics or traits in a cohort [7]. Phenotypes can be modelled quantitatively, such as people’s height, weight, body mass index, or blood pres- sure. Alternatively, they can be qualitative such as eye colour, curly hair, or a disease status like affected or not by Type-2 diabetes. Whenever a SNP is responsible for a phenotype, it is denominated as a causal-SNP. Therefore, iden- tifying causal-SNPs is an effective way to understand, prevent, or treat complex illnesses. Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 B. Ambrozio et al. There are many methods to discovery causal-SNPs, including genome-wide association study (GWAS). GWAS implementations calculate the association between each SNP and the underlying phenotype throughout a statistical model. Therefore, GWAS is roughly analogue to, or a type of feature selection: each SNP is a feature (independent variable), and the phenotype is the class (target, or dependent variable). The features identified as better predictors of the class are the potential causal-SNPs. Statistical regression models portray the state-of-the-art for GWAS. De- spite their efficiency, some eminent problems have become inevitable over the past years. These include reduction of costs for DNA sequencing [18], which in turn allowed an exponential growth of data; expansion of SNPs datasets that have contributed to overwhelming sparsity, with millions of SNPs, and few pa- tients [13]; high-disperse (or high-dimensional) datasets, compromising the ap- proaches available for GWAS as they are derived from linear (parametric) mod- els [11]. Another point of concern emerges with imbalanced ratios of rare cases and several controls. Such a scenario tends to inflate false-positives when data is exploited by regression over qualitative features. [26]. Nowadays, these obsta- cles are addressed via several manual quality control steps to increase statisti- cal power and avoid type 1 errors [10, 19, 21, 22, 24]. However, as much as data grows, so does the dependency on manual intervention. Hence, it opens margins for human mistakes and compromises the scalability of the study. To address the aforementioned gaps, this paper proposes a novel procedure for GWAS assembled over decision trees (DT) enhanced by gradient boosting machine (GBM), whose implementation comes from the LightGBM framework [9]. It ensures adaptabil- ity to the most diverse genomic data structures by controlling bias and variance over the training process. Consequently, it improves precision, independently of human intervention. Such a procedure has been named LightGWAS. Therefore, this work attempts to answer the following research question: – Can LightGWAS be an alternative method to the state-of-the-art for genome- wide association studies based upon general linear models, by increasing sta- tistical power on causal-SNP detection, and reducing the number of manual quality control steps? The research goals of this paper are: (a) to evaluate whether LightGWAS is a suitable GWAS method for qualitative phenotypes, according to a set of common metrics for classification problems; and (b) to assess if LightGWAS outperforms the available state-of-the-art for GWAS in terms of statistical power and precision. Finally, the remainder of this paper is organised as follows: Section 2 reviews researches on the state-of-the-art for genome-wide association studys along with an overview of the LightGBM framework. Section 3 introduces the design and a set of hypotheses for answering the research question. Section 4, in turn, presents the results with a discussion. Lastly, section 5 concludes the study, highlighting its contributions and possible future work. LightGWAS: A Novel ML Procedure for GWAS 3 2 Literature review and related work A Genome-wide association study (GWAS) is a discovery-driven research tech- nique to catalogue single-nucleotide polymorphisms (SNPs) across populations and to identify genetic markers associated with traits [1, 4]. Since the comple- tion of the human genome sequence in 2003, about 3, 700 GWASs contributed to discovering thousands of genetic risk causal-SNPs and their biological func- tions [17, 15]. The state-of-the-art for GWAS methods are based on three exclu- sivelly statistical association models: general linear model (GLM), linear mixed model (LMM), and scalable and accurate implementation of generalized mixed model (SAIGE) [12]. Their applicability depends on the phenotype type, sample size, and cohort distribution across the manipulated genomic dataset. Accord- ing to [12], the following criteria should be considered to select the appropriate model: (a) GLM implementations for quantitative traits, up to five thousand samples. If qualitative phenotype, the logistic regression implementation should include firth regularisation to minimize the fitting errors caused by the categor- ical class whenever its frequency is lower than 400 [14]; (b) LMM implementa- tion for datasets bigger than five thousand samples and quantitative traits type. Whether qualitative phenotype, the dataset should be in a normal distribution; or (c) SAIGE [26] for high-imbalanced case-control ratio of qualitative traits. Independently of the chosen method, principal component analysis should also be employed. It helps to filter SNPs that might be caused by the structure of the population (generating confounding due to ancestry), rather than the investi- gated phenotype [19]. Usually, the first ten eigenvalues are arbitrarily considered as covariants for an association model [19, 2]. The GWAS outcome is a list of potential causal-SNPs. This paper proposes a method for GWAS based on LightGBM [9]: a gradi- ent boosted decision trees (GBDT) framework built upon histogram algorithms. LightGBM grows the trees leaf-wise and uses gradient-based one-side sampling (GOSS) to downsampling data, and exclusive feature bundling (EFB) to re- duce feature dimension. In order to address GBDT problems related to high- computational complexity due to abundance of data, GOSS retains the large gradients samples, randomly selects small gradients, and assign constant weights to them. The algorithm concentrates on undertrained samples without altering the distribution of raw data. EFB, in turn, is a feature extraction technique, based on the graph coloring problem, which also contributes to reducing the his- togram building complexity. It deals with the sparsity of the data by grouping many independent variables to the dense features, avoiding unnecessary com- putation with pieces that do not account for the outcome variable. LightGBM, within the proposal GWAS solution, discoveries causal-SNPs by calculating the model’s feature importance. Each SNP is, in fact, an independent variable of the model. Hence, the list of features that better explains the dependent variable (phenotype) contains the saught SNPs. Considering the LightGBM framework design and the strong evidence of its inference to address problems involving high-sparse data over big datasets [16, 25, 23], this article works upon the idea that such a framework is also a potential core engine for GWAS. 4 B. Ambrozio et al. 3 Experiment design and methodology LightGWAS is designed to be a GWAS procedure based on a machine learn- ing nonparametric method. The solution is composed of a GBDT algorithm implemented by the LightGBM framework [9]. It is fitted with the SNPs as independent variables and the phenotype as the class. Thus, the causal-SNPs are retrieved by calculating the models’ feature importance. In turn, to answer the research question of this paper, an experiment involving three datasets, two feature selector models and a predicting model is conducted. Fig. 1 depicts the experiment design in four steps, followed by the evaluation strategy applied. Fig. 1: Diagrammatic visualisation of experiment design, components and eval- uation. The datasets (Fig. 1A) contain the same number of SNPs each, but varying the phenotype balance ratio on cases:controls of 1:1, 1:10, and 1:100. The first two models (Fig. 1B) are GWAS procedures. One of the GWAS methods is the novelty behind this paper, the LightGWAS. The other one is PLINK2 [7], one of the state-of-the-art implementations for GWAS in contexts where GLM is required. Therefore, six causal-SNPs result sets are generated from them. The third model (Fig. 1C) referred to as common classifier from now on, is a lo- gistic regression. It employes k-fold cross-validation model selector technique, with k been set arbitrarily to 50. A value higher than 30 was necessary to per- form statistically significant comparisons across the resulting sets. The common classifier is fitted once with causal-SNPs discovered by LightGWAS as indepen- dent variables and another time with causal-SNPs retrieved with PLINK2. The dependent variable, in both circumstances, is the underlying phenotype of the datasets. Therefore, its output is a paired set of classification metrics from each of the GWAS methods. Lastly (Fig. 1D), the group of metrics extracted with the cross-validation are evaluated in terms of statistical significance for possible differences among them. The evaluated alternative hypotheses are: – H 1 : LightGWAS outperforms GLM based on logistic regression with firth regularisation for GWAS, across genomic datasets of balanced qualitative LightGWAS: A Novel ML Procedure for GWAS 5 phenotypes (case : control = 1 : 1), in terms of accuracy, precision, F1 score, and ROC/AUC. – H 2 : LightGWAS outperforms GLM based on logistic regression with firth regularisation for GWAS, across genomic datasets of imbalanced qualitative phenotypes (case : control = 1 : 10), in terms of precision, F1 score, and ROC/AUC. – H 3 : LightGWAS outperforms GLM based on logistic regression with firth regularisation for GWAS, across genomic datasets of high-imbalanced quali- tative phenotypes (case : control = 1 : 100), in terms of precision, F1 score, and ROC/AUC. 3.1 Datasets A GWAS relies on two different data groups: the genomic data that contains the DNA variances, and the traits to be associated with the SNPs between the cases and controls cohorts. Usually, the traits to be investigated are human pheno- types, such as diseases status, that can be retrieved from the patients electronic health records (EHR) [26]. In this article, selected datasets are fully synthetic in either genomic and phenotype data. Simulations have been introduced to distin- guish accurately the causal-SNPs expected to be exposed by each of the evaluated GWAS models, which is paramount to compare them correctly. Dataset simula- tion for GWAS methods validation is a prevalent practice and can be observed in many types of researches, such as [5, 7, 8, 14, 26]. Accordingly, six datasets have been created, combined into three data groups of class (phenotype status) distribution: balanced, imbalanced, and high-imbalanced data. They have been named as ds1 1, ds1 10 and ds1 100, respectivelly. The number of samples (fic- titious patients) in each of the datasets respected the following pattern: ds1 1 = case:control=1:1=2500:2500, N=5000 ; ds1 10 = case:control=1:10=400:4000, N=4400 ; ds1 100 = case:control=1:100=50:5000, N=5050. The datasets were produced using the PLINK SNP simulation tool1 . Each sample had a phenotype status class (case or control ) and 10100 numeric features (each feature is a SNP). Further details about the variables of interest along with the parameters set to simulate the datasets can be found in Appendix A (table 2, page 12). 3.2 Procedure The complete procedure to accomplish the objectives, and test the alternative hypotheses includes seven steps: 1. Simulation of datasets, as outlined above, in section 3.1. 2. LightGWAS implementation. It is composed of a GBDT implementation called LightGBM. The hyperparameters are tunned through 200 iterations of randomised 5-folds cross-validation search. Table 3 in the Appendix B (page 12) contains the cross-validated optimal hyperparameters selected for each dataset group. 1 http://zzz.bwh.harvard.edu/plink/simulate.shtml 6 B. Ambrozio et al. 3. Discover the causal-SNPs across the early mentioned datasets by employing LightGWAS and PLINK2. Therefore, two sets of causal-SNPs per GWAS method is generated. PLINK’s outcome is a set of SNPs accompanied by their p-value. The causal-SNPs filtering is reached by assuming a cut-off (α) for such a p-value. For the datasets ds1 1 and ds1 10, the cut-off p ≤ α|α = 5 × 10−8 is assumed, as per genome-wide association study convension [3]. In turn, for the dataset ds1 100, the cut-off is p ≤ α|α = 5 × 10−4 because no SNP was selected with the first one. This decision has been grounded on [14]. In contrast, LightGWAS selects each SNP with the gain or split score of the decision trees. Therefore, the list of features importance from the LightGBM framework is the set of causal-SNPs retrieved with LightGWAS. 4. GWAS model’s evaluation. In order to compare how effective LightGWAS is in comparison to PLINK, the common classifier is employed. It is a logis- tic regression executed through 50-folds cross-validation for model selection, which is fitted upon two conditions: one with the features as the causal- SNPs collected via LightGWAS, and another with causal-SNPs selected via PLINK. The class (or target) for both scenarios, is the phenotype variable. Therefore, the common classifier output is a separated dataset with 50 re- sult samples per GWAS model. The following metrics have been evaluated: weighted average of the precision and recall (F1), recall, average precision score (APS), receiver operating characteristic (ROC)/area under the curve (AUC), accuracy, and precision. 5. The confidence interval (CI) of the metric’s result sets are calculated through 5000 bootstraps in a cut-off of α = 0.05. The subsamples (resampling with replacement) is sized at 50% (N × 0.5). Therefore, there is 95% of a likeli- hood that the reported lower limit (LL) and upper limit (UL) represent the confidence intervals of the true metrics’ performances. 6. Paired difference tests are employed to measure how significant is the ob- served differences in each metric pair. Dependent (paired) sample Student’s t-test is applied to the metric pairs that held a normal distribution, and Wilcoxon signed-rank test otherwise. Tests to assess whether a metric (vari- able of the results dataset) is in a Gaussian distribution are conducted with D’Agostino’s K 2 Normality Test. Whenever a sample does not reach the lev- els of a normal distribution, power transformation through Box-cox is firstly attempted before assuming nonparametric approaches. 7. The effect of the observed mean differences are calculated through Cohen’s d test when the parametric test has been used, and Wilcoxon r score otherwise. 4 Results and evaluation The consolidated results can be observed below in table 1, followed by the sta- tistical report. The CI ranges along with the standard deviation (SD) of each metric have been logged to the Appendix C (table 4, page 12). LightGWAS: A Novel ML Procedure for GWAS 7 Table 1: Results of statistical tests. (§ ) Metric’s “p-value” and “stat” calculated from the Box-Cox power transform result. (*) Metric statistically significant on α = 0.05. (**) Metric statistically significant on α = 0.01. (MD) mean absolute difference. Best values in bold. LightGWAS PLINK MD Stat p-value Effect (Mean) (Mean) −2 f1 0.967436 0.967 416 0.000 020 2.879 656 × 10 0.977 144 0.001 191 recall 0.966800 0.966 400 0.000 400 3.747 014 × 10−1 0.709 499 0.019 789 ds1 1 APS§ 0.995 725 0.995748 0.000 022 0.559 679 0.578 248 0.006 192 ROC/AUC§ 0.995664 0.995 648 0.000 016 0.744 993 0.459 835 0.004 531 accuracy 0.967 400 0.967 400 0 3.172 727 × 10−15 1 0 precision 0.968 505 0.968896 0.000 390 −4.497 929 × 10−1 0.654 843 0.015 893 f1* 0.993251 0.991 394 0.001 857 2.364 684 0.022051 0.292 229 recall 0.993750 0.993 000 0.000 750 113.5 0.662 096 16.051324 ds1 10 APS** 0.999830 0.999 671 0.000 159 54.0 0.002024 7.636 753 ROC/AUC** 0.998281 0.996 719 0.001 562 48.5 0.006190 6.858 936 accuracy* 0.987727 0.984 318 0.003 409 2.393 172 0.020579 0.294 840 precision** 0.992842 0.989 887 0.002 955 37.5 0.006574 5.303 301 f1 0.997205 0.996 713 0.000 492 183.0 0.430 596 25.880108 recall 0.998600 0.999 400 0.000 800 5.0 0.234 194 0.707107 ds1 100 APS 0.999857 0.999 823 0.000 034 163.5 0.095 638 23.122392 ROC/AUC 0.987000 0.982 600 0.004 400 166.5 0.107 381 23.546656 accuracy 0.994455 0.993 465 0.000 990 180.0 0.387 660 25.455844 precision 0.995830 0.994 053 0.001 776 176.0 0.342 925 24.890159 4.1 Statistical report Below follows a statistical report, separated by dataset group, extracted from the interpretation of the consolidate result sets disclosed in table 1. Dataset ds1 1: LightGWAS slightly outperformed PLINK on metrics F1, recall, and ROC/AUC, while PLINK outperformed LightGWAS on APS, and precision. Both models reached out the same mean value for accuracy so that zero mean absolute difference (MD). The t-tests indicated no statistical signifi- cance on α = 0.05 for any of the measured metrics. The standardized difference between the means resulted in a small effect for all of the metrics (d < 0.5). In terms of causal-SNP selection, LightGWAS selected 86 SNPs, while PLINK se- lected 90. PLINK managed to pick all SNPs selected by LightGWAS, plus other four causal-SNPs. Dataset ds1 10: LightGWAS slightly outperformed PLINK for every mea- sured metrics. The t-tests indicated statistical significance on α = 0.05 for both F1 and accuracy with small effect (d < 0.5). The Wilcoxon test indicated statis- tical significance on α = 0.01 and large effect (r ≥ 0.8) for APS, ROC/AUC and precision. No statistical significance on α = 0.05 has been observed for recall, al- though the observed difference had large effect (r ≥ 0.8). In terms of causal-SNP selection, LightGWAS selected 80 SNPs, while PLINK selected 76. LightGWAS managed to pick all SNPs selected by PLINK, plus other four causal-SNPs. 8 B. Ambrozio et al. Dataset ds1 100: LightGWAS slightly outperformed PLINK for every mea- sured metrics. The Wilcoxon test indicated no statistical significance on α = 0.05 for any of them. However, a medium effect (r ≥ 0.5 ∧ r < 0.8) has been observed for recall, and a large effect (r ≥ 0.8) for all the other metrics. In terms of causal-SNP selection, LightGWAS selected 28 SNPs, while PLINK selected 19. LightGWAS managed to pick 14 SNPs missed by PLINK, and PLINK, in turn, managed to select 5 SNPs missed by LightGWAS. 4.2 Discussion The models implemented through LightGWAS performed as good as PLINK for GWAS over the balanced dataset. The paired difference tests disclosed that none of the measured differences is statistically significant on cut-off α = 0.05. Also, the observed effects through Cohen’s d presented a small standardised effect between all the means of the paired metrics. Consequently, the alternative hypothesis H 1 had to be rejected as LightGWAS did not outperform (neither underperformed) statistically significant for such a dataset. The experiments involving an imbalanced dataset brought evidence that sup- ports accepting the alternative hypothesis H 2 . LightGWAS has outperformed PLINK for such a scenario. Although recall did not reach statistical significance on α = 0.05 (therefore as good as PLINK), all the other metrics had relevant results on α = 0.01 (F1 and accuracy on α = 0.05). Furthermore, the metrics measured through nonparametric tests (recall, APS, ROC/AUC and precision) resulted in a large effect (r ≥ 0.8). The alternative hypothesis H 3 was rejected. Although LightGWAS outper- formed PLINK with medium effect for recall (r ≥ 0.5 ∧ r < 0.8) and a large effect for the other metrics (r ≥ 0.8) when instantiated with a high-imbalanced dataset, none of the results reached statistical significance on α = 0.05. Considering exclusively the k-fold cross-validation model selection results (observed differences in the means), models implemented via the proposed Light- GWAS procedure outperformed those implemented with PLINK in the three evaluated scenarios. However, if taking into consideration the statistical analysis of the metrics pairs differences, this result is held with statistical significance only in the experiments involving the imbalanced dataset. Nonetheless, accord- ing to [6], it is important to note that statistical significance should not be the exclusive approach to reject how relevant a model is. The scientific perspective (or significance) of the underlying problem should also be taken into considera- tion. Genome-wide association study plays an essential rule on identifying causal anomalies across DNA, and any improvement over a method, being it statisti- cally significant or not, should be accounted for. Hence, although some of the measured metrics did not reach statistical significance (leading to the rejection of the alternative hypotheses H 1 and H 3 ), they prove to be scientifically meaning- ful through their effect differences and the number of discovered causal-SNPs. As a result, the research question (page 1) can be answered positively. The evidence collected from the tested hypotheses supports the theory that LightGWAS is a potential genome-wide association study method. LightGWAS: A Novel ML Procedure for GWAS 9 5 Conclusion This paper has proposed a novel genome-wide association study (GWAS) proce- dure, named LightGWAS. It is a nonparametric machine learning (ML) method based on the LightGBM framework [9]. LightGWAS has been idealised as a potential single, resilient, autonomous and scalable solution to address some of the found limitations of the available state-of-the-art implementations for GWAS. A literature review identified that the current GWAS implementations rely on cumbersome manual quality control steps to address statistical problems, such as controlling for false-positive inflation and power reduction. These chal- lenges increase as the data grows or becomes imbalanced. It also showed they demand a particular GWAS method for each type of genomic data structure, which increases human dependency. In this research, the effectiveness of the models implemented via the proposed LightGWAS procedure was assessed upon GWAS scenarios where the investigated phenotype is qualitative and datasets are about to five thousand samples of balanced (case : control = 1 : 1), imbalanced (case : control = 1 : 10), and high-imbalanced (case : control = 1 : 100) ge- nomic data. Next, LightGWAS models were contrasted with those implemented via the state-of-the-art for GWAS (PLINK2 [7]). This assessment was performed through an empirical comparative experiment. A model selection based on 50- fold cross-validation signed out LightGWAS as the best choice in terms of mean differences. The results from empirical statistical tests denoted that the differ- ences are statistically significant for imbalanced datasets contexts. The main contribution of LightGWAS for genome-wide association study is the fact it is based on a nonparametric machine learning approach against the state-of-the-art that strongly relies on parametric statistical models. Therefore, LightGWAS allows scalability and adaptability to the most diverse genomic data morphology, which, in turn, reduces human dependency. It scales thanks to the LightGBM framework, which is the state-of-the-art for gradient boosted decision trees, capable of handling large and high-sparse datasets. LightGBM was created to address classification or regression problems. Still, in the LightGWAS proce- dure, it is used as a phenotype causal single-nucleotide polymorphism (SNP) discover by calculating the feature importance of a fitted model. Hence, this re- search shows originality by taking a specific technique and adapting it to a new domain of application. For all these reasons, LightGWAS is a new contribution from data science towards the evolvement of molecular biology science. For future work, it is recommended to compare LightGWAS with the GWAS procedures based on linear mixed model, and scalable and accurate implemen- tation of generalized mixed model. Thus, the effectiveness of LightGWAS can also be assessed against scenarios that go beyond the ones addressable through general linear models. It would also benefit whether using quantitative pheno- types to make sure LightGWAS attends to linear association models. Lastly, it is recommended the development of a mechanism to identify causal-SNPs from decision trees gain or split scores, as no p-values exist in such a context. It is crucial to develop a system analogue to the cut-offs employed by the current state-of-the-art regression models to filter causal-SNPs (p ≤ α for each SNP). 10 B. Ambrozio et al. References 1. Bush, W.S., Moore, J.H.: Chapter 11: Genome-wide association studies. PLoS Computational Biology 8(12), e1002822 (Dec 2012). https://doi.org/10.1371/journal.pcbi.1002822, https://doi.org/10.1371/ journal.pcbi.1002822 2. Chen, X., Ishwaran, H.: Random forests for genomic data analysis. Genomics 99(6), 323–329 (Jun 2012). https://doi.org/10.1016/j.ygeno.2012.04.003, https://doi. org/10.1016/j.ygeno.2012.04.003 3. Fadista, J., et al.: The (in)famous GWAS p-value threshold revisited and up- dated for low-frequency variants. European Journal of Human Genetics 24(8), 1202–1205 (Jan 2016). https://doi.org/10.1038/ejhg.2015.269, https://doi.org/ 10.1038/ejhg.2015.269 4. Farrell, R.E.: Functional genomics and transcript profiling. In: RNA Method- ologies, pp. 685–695. Elsevier (2017). https://doi.org/10.1016/b978-0-12-804678- 4.00024-5, https://doi.org/10.1016/b978-0-12-804678-4.00024-5 5. Golan, D., Rosset, S., Lin, D.Y.: Mixed models for case-control genome- wide association studies: Major challenges and partial solutions. In: Handbook of Statistical Methods for Case-Control Studies, pp. 495–514. Chapman and Hall/CRC (Jun 2018). https://doi.org/10.1201/9781315154084-27, https://doi. org/10.1201/9781315154084-27 6. Greenland, S., et al.: Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31(4), 337– 350 (Apr 2016). https://doi.org/10.1007/s10654-016-0149-3, https://doi.org/ 10.1007/s10654-016-0149-3 7. Hill, A., Loh, P.R., Bharadwaj, R.B., Pons, P., Shang, J., Guinan, E., Lakhani, K., Kilty, I., Jelinsky, S.A.: Stepwise distributed open innovation contests for soft- ware development: Acceleration of genome-wide association analysis. GigaScience 6(5) (Feb 2017). https://doi.org/10.1093/gigascience/gix009, https://doi.org/ 10.1093/gigascience/gix009 8. Jiang, L., et al.: A resource-efficient tool for mixed model associa- tion analysis of large-scale data. Nature Genetics 51(12), 1749–1755 (Nov 2019). https://doi.org/10.1038/s41588-019-0530-8, https://doi.org/10.1038/ s41588-019-0530-8 9. Ke, G., et al.: Lightgbm: A highly efficient gradient boosting decision tree. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 3146–3154. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/ 6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf 10. Lee, S., Wright, F.A., Zou, F.: Control of population stratification by correlation-selected principal components. Biometrics 67(3), 967–974 (Dec 2010). https://doi.org/10.1111/j.1541-0420.2010.01520.x, https://doi.org/10.1111/j. 1541-0420.2010.01520.x 11. Li, J., et al.: Feature selection. ACM Computing Surveys 50(6), 1–45 (Jan 2018). https://doi.org/10.1145/3136625, https://doi.org/10.1145/3136625 12. Loh, P.R., et al.: Mixed-model association for biobank-scale datasets. Nature Genetics 50(7), 906–908 (Jun 2018). https://doi.org/10.1038/s41588-018-0144-6, https://doi.org/10.1038/s41588-018-0144-6 13. Lubke, G., et al.: Gradient boosting as a SNP filter: an evaluation using simulated and hair morphology data. Journal of Data Mining in Genomics & Proteomics LightGWAS: A Novel ML Procedure for GWAS 11 04(04) (2013). https://doi.org/10.4172/2153-0602.1000143, https://doi.org/10. 4172/2153-0602.1000143 14. Ma, C., et al.: Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genetic Epidemiology 37(6), 539– 550 (Jun 2013). https://doi.org/10.1002/gepi.21742, https://doi.org/10.1002/ gepi.21742 15. Mills, M.C., Rahal, C.: A scientometric review of genome-wide association studies. Communications Biology 2(1), 9 (Jan 2019). https://doi.org/10.1038/s42003-018- 0261-x, https://doi.org/10.1038/s42003-018-0261-x 16. Mo, K., Li, J.: A deep auto-encoder based LightGBM approach for net- work intrusion detection system. In: Proceedings of the International Confer- ence on Advances in Computer Technology, Information Science and Commu- nications. pp. 142–147. SCITEPRESS - Science and Technology Publications (2019). https://doi.org/10.5220/0008098401420147, https://doi.org/10.5220/ 0008098401420147 17. Pearson, T.A.: How to interpret a genome-wide association study. JAMA 299(11), 1335 (Mar 2008). https://doi.org/10.1001/jama.299.11.1335, https://doi.org/ 10.1001/jama.299.11.1335 18. Pérez-Enciso, Zingaretti: A guide for using deep learning for complex trait genomic prediction. Genes 10(7), 553 (Jul 2019). https://doi.org/10.3390/genes10070553, https://doi.org/10.3390/genes10070553 19. Price, A.L., et al.: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909 (Jul 2006). https://doi.org/10.1038/ng1847, https://doi.org/10.1038/ng1847 20. Purcell, S., et al.: PLINK: A tool set for whole-genome association and population- based linkage analyses. The American Journal of Human Genetics 81(3), 559–575 (Sep 2007). https://doi.org/10.1086/519795, https://doi.org/10.1086/519795 21. Reed, E., et al.: A guide to genome-wide association analysis and post- analytic interrogation. Statistics in Medicine 34(28), 3769–3792 (Sep 2015). https://doi.org/10.1002/sim.6605, https://doi.org/10.1002/sim.6605 22. Sebastiani, P., et al.: Genome-wide association studies and the genetic dissection of complex traits. American Journal of Hematology 84(8), 504–515 (Aug 2009). https://doi.org/10.1002/ajh.21440, https://doi.org/10.1002/ajh.21440 23. Song, Y., et al.: Prediction of double-high biochemical indicators based on LightGBM and XGBoost. In: Proceedings of the 2019 International Confer- ence on Artificial Intelligence and Computer Science - AICS 2019. p. 189–193. ACM Press (2019). https://doi.org/10.1145/3349341.3349400, https://doi.org/ 10.1145/3349341.3349400 24. Spencer, C.C.A., et al.: Designing genome-wide association studies: Sample size, power, imputation, and the choice of genotyping chip. PLoS Genetics 5(5), 1–13 (May 2009). https://doi.org/10.1371/journal.pgen.1000477, https://doi.org/10. 1371/journal.pgen.1000477 25. Wang, R., et al.: Power system transient stability assessment based on bayesian optimized LightGBM. In: 2019 IEEE 3rd Conference on En- ergy Internet and Energy System Integration (EI2). pp. 263–268. IEEE (Nov 2019). https://doi.org/10.1109/ei247390.2019.9062027, https://doi.org/ 10.1109/ei247390.2019.9062027 26. Zhou, W., other: Efficiently controlling for case-control imbalance and sample re- latedness in large-scale genetic association studies. Nature Genetics 50(9), 1335– 1341 (Aug 2018). https://doi.org/10.1038/s41588-018-0184-y, https://doi.org/ 10.1038/s41588-018-0184-y 12 B. Ambrozio et al. Appendices Appendix A: Datasets’ phenotype ratios and variables of interest Table 2: Phenotype ratios for genetic datasets build-up (top), and variables of interest extracted from the executed simulations (bottom). Values have been based on the PLINK SNP simulation tool documentation [20]. Due to space limitations, the concepts of minor allele frequency (MAF), heterozygotes, and homozygotes are not expanded. However, they can be consulted at [1, 15, 20]. SNP Lower allele Upper allele Odds ratio for Odds ratio for no. SNPs Prefix frequency frequency range heterozygotes homozyygotes 10000 n 0.00 1.00 1.00 1.00 100 d 0.00 1.00 2.00 4.00 Variable Type Range Sample Individual ID Nominal Alphanumeric per13 Phenotype Numeric 1=control, 2=case 2 ... n 1351 T(/A) Numeric [0, 1 or 2] 2 d 13 G(/T) ... Appendix B: LightGBM hyperparameter values Table 3: LightGBM parameters selected via 200 iterations of randomised 5-folds cross-validation. ds1 1 ds1 10 ds1 100 colsample bytree 0.47328041 0.47328041 0.866621446 learning rate 0.03 0.03 0.01 max depth 1 1 6 min child samples 147 147 454 min child weight 1.0 1.0 1.0 min split gain 0 0 0 n estimators 2000 2000 2000 num leaves 35 35 41 reg alpha 0.1 0.1 5 reg lambda 0.1 0.1 50 subsample 0.995930118 0.995930118 0.820421212 subsample for bin 200000 200000 200000 Appendix C: Confidence interval ranges and standard deviations Table 4: Bootstrap 95% confidence interval (CI) metric ranges and standard deviations (SDs). LightGWAS PLINK SD LL UL SD LL UL f1 0.017 298 0.961 616 0.981 966 0.016 862 0.961 767 0.983 936 recall 0.020 045 0.952 000 0.984 000 0.020 380 0.952 000 0.984 000 ds1 1 APS 0.003 669 0.994 011 0.998 711 0.003 506 0.994 256 0.998 870 ROC/AUC 0.003 572 0.994 760 0.998 672 0.003 490 0.993 080 0.998 848 accuracy 0.017 474 0.962 000 0.982 000 0.017 001 0.962 000 0.984 000 precision 0.024 702 0.963 563 0.987 904 0.024 434 0.963 710 0.991 701 f1 0.005 909 0.987 562 0.996 255 0.006 772 0.985 000 0.993 789 recall 0.009 193 0.990 000 1.000 000 0.009 161 0.985 000 0.997 500 ds1 10 APS 0.000 272 0.999 462 0.999 925 0.000 486 0.999 183 0.999 863 ROC/AUC 0.002 738 0.994 750 0.999 250 0.004 748 0.991 938 0.998 625 accuracy 0.010 729 0.977 273 0.993 182 0.012 340 0.972 727 0.988 636 precision 0.008 637 0.980 344 0.995 000 0.010 093 0.980 247 0.992 537 f1 0.003 806 0.994 024 0.997 509 0.002 956 0.994 000 0.997 009 recall 0.004 522 0.996 000 1.000 000 0.002 399 0.994 000 1.000 000 ds1 100 APS 0.000 565 0.999 624 0.999 984 0.000 304 0.999 330 0.999 851 ROC/AUC 0.048 498 0.964 000 0.998 400 0.029 264 0.937 600 0.985 200 accuracy 0.007 527 0.988 119 0.994 759 0.005 869 0.988 119 0.994 059 precision 0.004 951 0.990 079 0.996 008 0.004 905 0.990 079 0.995 036