Cloud-based Approach on Genetic Data Imputation Parameters' Optimization

Pavlo Horun1, ∗,† and Christine Strauss2,†

1 Lviv Polytechnic National University, Kniazia Romana St. 5, 79000 Lviv, Ukraine
2 University of Vienna, Oskar Morgenstern Platz 1, 1090 Vienna, Austria

∗ Corresponding author.
† These authors contributed equally.
pavlo.p.horun@lpnu.ua (P. Horun); christine.strauss@univie.ac.at (C. Strauss)
0009-0008-4296-5560 (P. Horun); 0000-0003-0276-3610 (C. Strauss)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
The imputation process for genetic data is cost- and time-intensive, primarily due to the high complexity of the methods involved and the substantial volume of data processed. A thorough performance evaluation of imputation algorithms such as Beagle, AlphaPlantImpute, LinkImputeR, MACH and others shows that while some algorithms are highly accurate, they are often computationally expensive. These widely used tools have multiple input parameters, each of which impacts the quality and accuracy of the imputation. Traditional machine learning techniques for parameter optimization, such as grid search and randomized search, become inefficient in high-dimensional parameter spaces, leading to prohibitive computational costs, especially in large-scale applications. Our study proposes a cloud-based approach to input parameter optimization that uses Bayesian optimization with a consecutive Domain Reduction Transformer (DRT). The described algorithm and the developed library allow users to find optimal input parameters for data imputation in a more flexible way.

Keywords
Bayesian optimization, parameters optimization, data imputation, Beagle, cloud technologies, distributed calculations, bioinformatics

1. Introduction

Modern sequencing technologies have significantly advanced the analysis of genetic variations. However, the vast amounts of data generated during these processes often contain missing values, particularly in the field of plant genetics. To address these gaps, a variety of machine learning and statistical tools have been employed to impute missing genetic data (GD) efficiently [1-5]. These methods vary in their performance, and each comes with its strengths and limitations.

The imputation process for genetic data is cost- and time-intensive, primarily due to the high complexity of the methods involved and the substantial volume of data processed. Imputation tools such as Beagle [6-7], HBImpute [3], Impute, MACH [5], AlphaPlantImpute [8-9], MissForest, and LinkImputeR [10] offer flexibility in their input parameters. Being powerful and widely used, they have multiple input parameters, and each one impacts the quality and accuracy of the imputation [9, 11]. Furthermore, these parameters often have a wide range of acceptable values, and the relationships between these parameters and the resulting accuracy can be highly nonlinear and difficult to model. Therefore, performing a comprehensive search across all possible parameter combinations to determine the optimal settings would be computationally prohibitive, requiring an almost infinite amount of time and resources.

Traditional machine learning techniques for parameter optimization, such as grid search or even randomized search, quickly become inefficient due to the high dimensionality of the parameter space and the large number of evaluations needed. Such approaches scale poorly with the number of parameters and the width of their value ranges, leading to significantly increased computational costs, which makes them unfeasible in large-scale applications.
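To illustrate the scale of the problem, the short Python sketch below is an illustration only: it assumes a hypothetical coarse grid of just five candidate values per parameter and counts the imputation runs that an exhaustive grid search would need for the eleven Beagle parameters listed later in Table 1.

```python
# Illustration only: cost of an exhaustive grid search over the 11 Beagle
# parameters listed in Table 1, assuming (hypothetically) a coarse grid of
# just five candidate values per parameter.
n_parameters = 11
candidates_per_parameter = 5

grid_runs = candidates_per_parameter ** n_parameters
print(f"Exhaustive grid search: {grid_runs:,} imputation runs")  # 48,828,125

# With each run taking minutes to hours, such a search is clearly unfeasible,
# whereas a Bayesian optimization budget is typically a few dozen runs.
```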
In [12], the authors addressed the challenges of GD imputation in bioinformatics, focusing specifically on the limitations of existing software tools such as Beagle, AlphaPlantImpute and others. The core issues are:
1. The complexity and unclear nature of input parameters, which often makes their optimization time-consuming. The existing literature also lacks a proper investigation of the corresponding time and computational resource requirements.
2. High time and resource costs of GD imputation. Traditional imputation tools are computationally expensive, especially on large datasets.
3. The accuracy of missing GD imputation is crucial for subsequent bioinformatics analysis. However, optimizing accuracy often results in increased computation time, which the authors attempt to balance in their "Algorithm A – Time Optimization". This algorithm focuses on reducing the imputation time using cloud-based distributed computing and showed significant speed improvements, ranging from 47% to 87% depending on the size of the dataset.

The use of parallelization in the cloud environment is well detailed, showing how splitting the data by chromosomes and distributing processing across multiple virtual nodes leads to significant time savings. Based on the above, the next goal is to optimize the input parameters in a more efficient way. One of the biggest issues here is the time required for each single imputation run. In this case, Bayesian optimization offers a more efficient alternative by incorporating prior knowledge and sequential learning to iteratively select promising parameter combinations.

Bayesian optimization builds a probabilistic model of the objective function, typically a Gaussian process, which approximates the relationship between the input parameters and the output (e.g., imputation accuracy). By balancing exploration (testing unknown regions of the parameter space) and exploitation (focusing on regions known to yield good results), Bayesian optimization can converge to an optimal or near-optimal solution with significantly fewer evaluations than exhaustive or random search methods.

Using Bayesian optimization in this context allows for a more targeted exploration of the parameter space, thereby reducing the overall computational cost and time required for parameter tuning. It focuses on the most promising regions of the parameter space, which leads to faster convergence toward effective parameter settings. This is particularly advantageous given the constraints on computational resources and the need for timely imputation results in large-scale genetic studies.

2. Input parameters optimization

In [12], the authors suggested an approach to reduce imputation time in a distributed cloud environment. To continue this research, we developed a library that uses Bayesian optimization (BO) [13] with a consecutive Domain Reduction Transformer (DRT) [14] to automate input parameter optimization while taking advantage of a distributed environment. In addition, the parameter optimization process can be restored from the last successful or even failed step. This can lead to near-optimal settings for tools like Beagle without the prohibitive costs associated with exhaustive parameter searches. Consequently, researchers can achieve higher accuracy in their imputations while maintaining reasonable resource usage, which facilitates more efficient genetic analysis.
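As a minimal sketch of how such an optimization loop can be assembled from the Bayesian optimization library [13] and its SequentialDomainReductionTransformer (which implements the domain reduction scheme of [14]): the run_beagle_and_score function below is a hypothetical placeholder for an actual Beagle run and accuracy evaluation, the parameter names and bounds are illustrative only, and the logger/resume imports follow the library's documented 1.x API (module paths may differ in newer releases).

```python
from bayes_opt import BayesianOptimization, SequentialDomainReductionTransformer
from bayes_opt.logger import JSONLogger
from bayes_opt.event import Events
from bayes_opt.util import load_logs  # used when resuming from an earlier run


def run_beagle_and_score(ne, window, phase_states, imp_states):
    """Hypothetical placeholder: in a real setup this would launch Beagle with the
    given parameters on a masked input file and return the measured imputation
    accuracy. A synthetic smooth score is returned so the sketch runs end to end."""
    return 1.0 / (1.0 + abs(ne - 50_000) / 50_000 + abs(window - 40.0) / 40.0)


def objective(ne, window, phase_states, imp_states):
    # The optimizer proposes continuous values, while several Beagle parameters
    # are integers, so round them before running the tool.
    return run_beagle_and_score(int(round(ne)), window,
                                int(round(phase_states)), int(round(imp_states)))


# Illustrative bounds; in practice they should reflect the dataset at hand.
pbounds = {
    "ne": (1_000, 200_000),
    "window": (10.0, 80.0),
    "phase_states": (50, 500),     # maps to Beagle's "phase-states"
    "imp_states": (200, 3_000),    # maps to Beagle's "imp-states"
}

optimizer = BayesianOptimization(
    f=objective,
    pbounds=pbounds,
    random_state=42,
    bounds_transformer=SequentialDomainReductionTransformer(),  # consecutive DRT
)

# Persist every optimization step so the run can be resumed later,
# e.g. after a failed step or a terminated spot instance.
optimizer.subscribe(Events.OPTIMIZATION_STEP, JSONLogger(path="./bo_log.json"))

# To resume, replay previously observed points before calling maximize():
# load_logs(optimizer, logs=["./bo_log.json"])

optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best score found and the corresponding parameters
```

In the distributed setup described below, one such optimizer instance would run per virtual host and per prepared input file, with the JSON log uploaded to storage alongside the Beagle output.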
The main flow consists of the following steps:
• take the original file;
• prepare a set of files by modifying the original one (split it by chromosomes, randomly remove some data to obtain different missing ratios (MR), etc.);
• upload the artifacts to AWS S3 storage (the storage type and cloud provider could be different);
• in a distributed environment, run Bayesian optimization with consecutive DRT using Beagle;
• iteratively collect and process the logs (optimization logs, imputation metrics, Beagle logs);
• analyze the obtained results.

This high-level algorithm is depicted in Figure 1.

Figure 1. High-level parameters optimization flow

2.1. File preparation phase

During the file preparation phase it is important to split the original file by chromosomes (based on the performance comparison conducted in [12], this decreases imputation time significantly) and to collect a set of files with different MRs (such as 0.1%, 0.2%, 0.5%, 1%, 3%, 5%, 10%, etc.). This later enables a more comprehensive analysis and comparison and yields more valuable insights. The key steps of this phase are (a sketch of the whole preparation step is given after Figure 2):
• Split the original file by chromosomes (this will decrease further imputation time).
• For each MR, randomly remove the data several times (this reduces bias and yields statistically independent results that can be averaged with a higher level of confidence). Doing this 3 times for each case should be enough for further averaging.
• Upload the prepared artifacts to S3.

The described algorithm is presented in Figure 2.

Figure 2. Input files preparation algorithm
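A minimal sketch of this preparation phase is shown below, under the following assumptions: the input is an uncompressed VCF file, genotypes are masked by setting the GT field of randomly chosen entries to "./.", and boto3 is used for the S3 upload. The file names, the bucket name and the helper functions are illustrative, not part of the library's fixed interface.

```python
import random
from collections import defaultdict

import boto3  # assumes AWS credentials are configured in the environment


def split_by_chromosome(vcf_path):
    """Split an uncompressed VCF file into one file per chromosome
    (records are kept in memory for brevity)."""
    header, by_chrom = [], defaultdict(list)
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                header.append(line)
            else:
                by_chrom[line.split("\t", 1)[0]].append(line)
    paths = []
    for chrom, records in by_chrom.items():
        out_path = f"{vcf_path}.{chrom}.vcf"
        with open(out_path, "w") as out:
            out.writelines(header + records)
        paths.append(out_path)
    return paths


def mask_genotypes(vcf_path, missing_ratio, replicate, seed=0):
    """Randomly replace the GT field of a fraction of genotype calls with './.'."""
    rng = random.Random(seed + replicate)
    out_path = f"{vcf_path}.mr{missing_ratio}.rep{replicate}.vcf"
    with open(vcf_path) as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#"):
                out.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            for i in range(9, len(fields)):  # sample columns start at index 9
                if rng.random() < missing_ratio:
                    parts = fields[i].split(":")
                    parts[0] = "./."       # GT is the first FORMAT subfield
                    fields[i] = ":".join(parts)
            out.write("\t".join(fields) + "\n")
    return out_path


# Illustrative driver: 3 replicates per missing ratio, uploaded to an S3 bucket.
if __name__ == "__main__":
    s3 = boto3.client("s3")
    for chrom_file in split_by_chromosome("original.vcf"):
        for mr in (0.001, 0.002, 0.005, 0.01, 0.03, 0.05, 0.10):
            for rep in range(3):
                masked = mask_genotypes(chrom_file, mr, rep)
                s3.upload_file(masked, "my-imputation-bucket", f"inputs/{masked}")
```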
2.2. Parameters' optimization phase

After uploading the prepared artifacts to storage, the user needs to configure a dedicated cloud environment (see subsection 2.2 in [12]). During this phase, each file is downloaded to a virtual host, which then runs a parallel Bayesian optimization task with DRT involving multiple Beagle runs. When done, the results and the corresponding logs are uploaded back to storage. The described algorithm is shown in Figure 3.

Figure 3. Parameters' optimization algorithm

The developed library performs these actions by itself, without user intervention. One of its key advantages is that the user can continue parameter optimization after both successful and failed steps. This plays a crucial role, since the accuracy achieved after X epochs may still be insufficient.

2.3. Beagle parameters selection

Being widely used, the Beagle tool accepts numerous input parameters, and each one may significantly influence the quality and accuracy of the imputation [9, 11]. The latest version of Beagle 5.4 (release 06Aug24.a91) accepts:
• Data parameters – most of them are required and contain file paths (nothing to optimize here).
• Phasing parameters – used to control phasing accuracy (can be specified by the user).
• Imputation parameters – it is worth focusing on a limited subset and specifying the rest based on each specific case.
• General parameters – the most impactful are "ne" and "window" (see the description below); reducing the "window" parameter reduces the amount of memory required for the analysis.

Most of the mentioned parameters were investigated in [9], where Beagle and AlphaPlantImpute2 were compared. Unfortunately, that work lacks a description of how the exact optimal parameters were investigated. Nevertheless, its authors admitted that some parameters (like "ne" and "iterations") may lead to very long imputation times, and based on their results such an approach is not very suitable for large-scale datasets. Potentially, using the developed library it is possible to perform a more versatile parameter optimization and find other optimal (or sub-optimal) points that provide similar accuracy within a reasonable amount of time.

The set of Beagle parameters that are worth optimizing is listed in Table 1. The rest of the parameters should be specified depending on the current use case.

Table 1
Beagle input parameters for optimization
• burnin – the maximum number of burnin iterations used to estimate an initial haplotype frequency model for inferring genotype phase (> 0, default = 3)
• iterations – the number of iterations used to estimate genotype phase (> 0, default = 12)
• phase-states – the number of model states used to estimate genotype phase (> 0, default = 280)
• imp-states – the number of model states used to impute ungenotyped markers (> 0, default = 1600)
• imp-segment – the minimum cM length of haplotype segments that will be incorporated in the Hidden Markov Model state space for a target haplotype (> 0, default = 6.0)
• imp-step – the cM length of the step used for detecting short IBS segments (> 0, default = 0.1)
• imp-nsteps – the number of consecutive steps (see the imp-step argument) that will be considered when detecting long IBS segments (> 0, default = 7)
• cluster – the maximum cM distance between individual markers that are combined into an aggregate marker when imputing ungenotyped markers (>= 0, default = 0.005)
• ne – the effective population size (for unphased input genotypes) (> 0, default = 100000)
• window – the cM length of each sliding window (> 0, default = 40.0)
• overlap – the cM length of overlap between adjacent sliding windows (> 0, default = 2.0)

Based on the official Beagle 5.4 documentation, a shorter "window" requires less memory, while "iterations" controls the trade-off between compute time and phasing accuracy.
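To make the link between the optimizer and the tool concrete, the sketch below shows how a candidate parameter set could be translated into a Beagle 5.4 command line using its documented key=value argument syntax. The wrapper function, file paths and JVM memory setting are illustrative assumptions rather than part of Beagle or of the developed library.

```python
import subprocess

# Integer-valued Beagle arguments; the optimizer proposes floats, so round them.
INTEGER_PARAMS = {"burnin", "iterations", "phase-states", "imp-states", "imp-nsteps", "ne"}


def build_beagle_command(gt_path, ref_path, out_prefix, params,
                         beagle_jar="beagle.jar", nthreads=4):
    """Translate a parameter dict (keys as in Table 1, written with '-' or '_')
    into Beagle 5.4 'key=value' command-line arguments."""
    args = [f"gt={gt_path}", f"ref={ref_path}", f"out={out_prefix}", f"nthreads={nthreads}"]
    for name, value in params.items():
        key = name.replace("_", "-")  # Python identifiers cannot contain '-'
        if key in INTEGER_PARAMS:
            value = int(round(value))
        args.append(f"{key}={value}")
    return ["java", "-Xmx16g", "-jar", beagle_jar] + args


if __name__ == "__main__":
    candidate = {"ne": 85000.3, "window": 35.7, "phase_states": 312.9, "imp_states": 1800.2}
    cmd = build_beagle_command("chr1.mr0.01.rep0.vcf", "chr1.reference.vcf.gz",
                               "chr1_imputed", candidate)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch Beagle
```

Such a wrapper is the kind of code the hypothetical run_beagle_and_score placeholder from the earlier sketch would call, with accuracy measured by comparing the imputed genotypes against the values withheld during file preparation.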
3. Discussion

As mentioned before, one more useful (even crucial) feature introduced here is the ability to continue optimization from the last run. For long-running imputation jobs, this allows users to decide whether the achieved imputation accuracy is already sufficient or whether they want to continue parameter optimization. The core library that performs Bayesian optimization [13] allows existing logs to be used for further optimization. In the latter case, the algorithm avoids re-running imputation with already evaluated parameter values, which saves a tremendous amount of time. In some cases, it can also be beneficial to continue parameter optimization with modified general parameters.

An additional issue is cost saving. To address it, so-called spot EC2 instances can be used, as suggested in [12] (this saves up to 90% of costs compared to so-called "on-demand" instances). For example, users may stick with the r5a.2xlarge EC2 type (4 vCPUs and 32 GB of RAM). It is worth mentioning that the chosen EC2 type has no impact on the overall process itself – its main effect is on optimization (and imputation) speed. Having regular backups (dumps of the logs) allows users to continue optimization from the latest observed state. The resume feature described above is extremely valuable, since spot instances can be terminated at any time, and without it the whole progress would be lost.

Continuous failures caused by sporadic spot instance terminations can be managed by appropriate cloud orchestration. This was briefly described in [12] to address network instability and synchronization issues. A more advanced orchestration layer could enhance the robustness and fault tolerance of the whole optimization flow. On the one hand, it increases costs due to the added system complexity and the additional resources used. On the other hand, including orchestration, such as Kubernetes, could improve reliability and further optimize resource allocation and synchronization.

One more interesting topic is a broader species scope. Extensive testing across other recent imputation models and different crop species could provide more insights into the general applicability and scalability of the proposed methods. This is a part of the future research.

4. Conclusions and future work

The optimization of genetic data imputation using distributed cloud technologies represents a significant step forward in the analysis of large-scale biological datasets. The proper selection of optimal input parameters is key to improving imputation accuracy. The described method and library automate their tuning using a cloud-based environment and Bayesian optimization with a consecutive Domain Reduction Transformer. This allows users to find the optimal input parameters for different datasets in a more flexible way.

A more generic scenario consists of conducting preliminary experiments on predefined input files, using the proposed approach and the developed library, to find sets of optimal imputation parameters ("profiles") for Beagle. This may give insights into how accuracy and time depend on input file characteristics (for instance, species type, variant and line counts, missing genotype ratio, heterozygosity frequency (HF), minor allele frequency (MAF), etc.). Such "profiles" could then be used on a regular basis for similar input files, which saves time and resources while achieving optimal (or at least satisfactory for the user) imputation accuracy.

As a next step, it is reasonable to focus efforts on addressing the above questions, refining the suggested algorithms and exploring additional methods for adaptive parameter tuning in distributed environments.

Declaration on Generative AI

During the preparation of this work, the authors used ChatGPT in order to: check grammar and spelling, paraphrase and reword. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.

References

[1] N. Bhandari, R. Walambe, K. Kotecha, and S. P. Khare, "A comprehensive survey on computational learning methods for analysis of gene expression data," Front. Mol. Biosci., vol. 9, p. 907150, Nov. 2022, doi: 10.3389/fmolb.2022.907150.
[2] T. Emmanuel, T. Maupong, D. Mpoeleng, T. Semong, B. Mphago, and O. Tabona, "A survey on missing data in machine learning," J Big Data, vol. 8, no. 1, p. 140, Oct. 2021, doi: 10.1186/s40537-021-00516-9.
[3] T. Pook, A. Nemri, E. G. Gonzalez Segovia, D. Valle Torres, H. Simianer, and C.-C. Schoen, "Increasing calling accuracy, coverage, and read-depth in sequence data by the use of haplotype blocks," PLoS Genet, vol. 17, no. 12, p. e1009944, Dec. 2021, doi: 10.1371/journal.pgen.1009944.
[4] T. Pook et al., "Improving Imputation Quality in BEAGLE for Crop and Livestock Data," G3 Genes|Genomes|Genetics, vol. 10, no. 1, pp. 177–188, Jan. 2020, doi: 10.1534/g3.119.400798.
[5] H. Alipour, G. Bai, G. Zhang, M. R. Bihamta, V. Mohammadi, and S. A.
Peyghambari, "Imputation accuracy of wheat genotyping-by-sequencing (GBS) data using barley and wheat genome references," PLoS ONE, vol. 14, no. 1, p. e0208614, Jan. 2019, doi: 10.1371/journal.pone.0208614.
[6] B. L. Browning and S. R. Browning, "Genotype Imputation with Millions of Reference Samples," The American Journal of Human Genetics, vol. 98, no. 1, pp. 116–126, Jan. 2016, doi: 10.1016/j.ajhg.2015.11.020.
[7] B. L. Browning, Y. Zhou, and S. R. Browning, "A One-Penny Imputed Genome from Next-Generation Reference Panels," The American Journal of Human Genetics, vol. 103, no. 3, pp. 338–348, Sep. 2018, doi: 10.1016/j.ajhg.2018.07.015.
[8] L. Chen et al., "Genotype imputation for soybean nested association mapping population to improve precision of QTL detection," Theor Appl Genet, vol. 135, no. 5, pp. 1797–1810, May 2022, doi: 10.1007/s00122-022-04070-7.
[9] T. Niehoff, T. Pook, M. Gholami, and T. Beissinger, "Imputation of low-density marker chip data in plant breeding: Evaluation of methods based on sugar beet," The Plant Genome, vol. 15, no. 4, Dec. 2022, doi: 10.1002/tpg2.20257.
[10] N. Munyengwa et al., "Optimizing imputation of marker data from genotyping-by-sequencing (GBS) for genomic selection in non-model species: Rubber tree (Hevea brasiliensis) as a case study," Genomics, vol. 113, no. 2, pp. 655–668, Mar. 2021, doi: 10.1016/j.ygeno.2021.01.012.
[11] A. Palanivinayagam and R. Damaševičius, "Effective Handling of Missing Values in Datasets for Classification Using Machine Learning Methods," Information, vol. 14, no. 2, p. 92, Feb. 2023, doi: 10.3390/info14020092.
[12] L. Mochurad and P. Horun, "Improvement Technologies for Data Imputation in Bioinformatics," Technologies, vol. 11, no. 6, p. 154, Nov. 2023, doi: 10.3390/technologies11060154.
[13] F. Nogueira, Bayesian Optimization: Open source constrained global optimization tool for Python, 2014. Accessed: Jan. 10, 2024. [Online]. Available: https://github.com/bayesian-optimization/BayesianOptimization
[14] N. Stander and K. J. Craig, "On the robustness of a simple domain reduction scheme for simulation-based optimization," Engineering Computations, vol. 19, no. 4, pp. 431–450, Jun. 2002, doi: 10.1108/02644400210430190.