Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures

Somayeh Mohammadi1, Latif PourKarimi2, Manuel Zschäbitz1, Tristan Aretz1, Ninon De Mecquenem3, Ulf Leser3 and Knut Reinert1

1 Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
2 Department of Mathematics and Computer Science, Razi University, Kermanshah, Iran
3 Department of Mathematics and Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.

Abstract
Data analysis workflows are popular for sequencing activities in large-scale and complex scientific processes. Scheduling approaches attempt to find an assignment of workflow tasks to computing nodes that minimizes the makespan in heterogeneous cluster infrastructures. A common feature of these approaches is that they already know the structure of the workflow. However, for many workflows, a high degree of parallelization can be achieved by splitting the large input data of a single task into chunks and processing them independently. We call this problem task granularity: finding an assignment of tasks to computing nodes while simultaneously optimizing the structure of a bag of tasks. Accordingly, this paper addresses the problem of task granularity for metagenomic workflows. To this end, we first formulate the problem as a mathematical model, and then solve the proposed model using a genetic algorithm. To overcome the challenge of not knowing the number of tasks in advance, we adjust the number of tasks as a factor of the number of computing nodes; the number of tasks is increased iteratively and evolutionarily. Experimental results show that a desirable makespan can be achieved after a few steps of this increase.

Keywords
Data Analysis Workflow, Mathematical Programming, Makespan Minimization, Run Time Prediction, Genetic Algorithm

1. Introduction

Scientists in many domains, such as bioinformatics, remote sensing, and physics, use Data Analysis Workflows (DAWs) to sequence the activities involved in large-scale and complex scientific processes [1]; [2]. These DAWs are typically represented as a Directed Acyclic Graph (DAG), which consists of a set of tasks and directed edges between them; the edges show data dependencies between tasks and the priority order of task execution.

Scientists often use heterogeneous cluster infrastructures to run their DAWs because of privacy and financial concerns. Heterogeneous clusters provide high-performance computing environments that enable efficient data analysis and the execution of large-scale DAWs in a reasonable amount of time [3]. DAWs are often executed on large amounts of data, resulting in long runtimes that can exceed days or weeks [4]; [5]; [6]. Thus, in such environments, the key objective is to schedule DAW tasks across computing resources in such a way that the total execution time, also known as the makespan, is minimized.

It is well known that a high degree of parallelization can be achieved in many DAWs by splitting the input data of individual tasks into chunks and processing them independently [7]. For example, in metagenomic DAWs, the size of a reference genome, a frequent input, can vary from several KB to hundreds of GB, and a reference genome typically contains thousands of genome files. The reference genome can be divided into different bins of genome files, which are then processed by several independent tasks in parallel. The main challenge here is how to partition the input data: what should be the appropriate size of each chunk of input data, and how should each task be assigned to a computing node so that the makespan is minimized? We call this problem task granularity. In heterogeneous environments, this challenge is aggravated because the computing power of the available nodes differs, so choosing the input size of the task executed on each node is a very effective means of optimizing the makespan. Since each task of the DAW is equivalent to a job for the cluster when a workflow is submitted for execution, the terms task and job are used interchangeably in this study.

In this paper, we propose a novel approach to task granularity for metagenomic DAWs in cluster infrastructures with makespan minimization. We first formulate the problem as a mathematical model. We then solve the proposed model using the Genetic Algorithm method. Since the calculation of the makespan requires a proper estimation of task runtimes, we apply three different methods for this estimation and compare their accuracy.

The paper is organized as follows: Section 2 presents a review of the related work, and Section 3 illustrates the problem statement. The proposed mathematical model is introduced in Section 4. Section 5 discusses solving the proposed model using a genetic algorithm. Job runtime estimation is presented in Section 6, and experimental results are presented in Section 7. Finally, Section 8 provides concluding remarks and plans for further studies.

Contact: somayeh.mohammadi@fu-berlin.de (S. Mohammadi); l.pourkarimi@razi.ac.ir (L. PourKarimi); manuez42@zedat.fu-berlin.de (M. Zschäbitz); aret01@zedat.fu-berlin.de (T. Aretz); mecquenn@informatik.hu-berlin.de (N. De Mecquenem); leser@informatik.hu-berlin.de (U. Leser); Reinert@fu-berlin.de (K. Reinert)

© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Related works

In this section, we first discuss data access patterns used in scientific workflows. We then cover the scheduling of scientific workflows on heterogeneous clusters. Finally, we focus on methods for predicting the runtime of tasks, as these estimates are often used as input for scheduling approaches.
2.1. Data access patterns used in scientific workflows

The data access patterns of workflow applications have been addressed by several studies [8]; [9]; [10]; [11]. Accordingly, the most commonly used patterns in scientific workflows are as follows (Fig. 1):

• Pipeline: This is the most basic and familiar pattern. A set of computational tasks is chained in a sequence such that the output of a parent task is the input of its child task in the chain. Because of the linear dependencies in a pipeline pattern, the execution of a task cannot begin until the execution of its parent has completed and it has received the data generated by its parent.

• Scatter: An input data set is divided into several chunks. These chunks are distributed to multiple tasks (a bag of tasks), which can be executed simultaneously because there are no dependencies between them.

• Gather: Multiple chunks of data are produced by multiple tasks. All of them are used as input data by a subsequent task, which may need to receive all the chunks and integrate them before starting execution.

Figure 1: Common data access patterns in scientific workflows (pipeline, scatter, and gather, drawn as graphs of data items and tasks).

Obviously, the scatter pattern can be very effective in reducing the makespan in a distributed environment such as a cluster or cloud, because it enables parallel execution of tasks on computing nodes. However, implementing scatter optimally is an NP-hard problem (see Section 3.1), so a user cannot do it manually and still achieve an acceptable makespan. In this study, we propose an approach to address this problem.

2.2. Scheduling of scientific workflows on heterogeneous clusters

Generally, workflow scheduling on heterogeneous infrastructures can be done in two ways, statically or dynamically [12]; [2]. Static scheduling assigns tasks to compute resources in advance, assuming that accurate information about the workflow and the infrastructure resources is available. Dynamic scheduling does not require such assumptions. On the one hand, many heuristic and meta-heuristic approaches have been proposed for this problem; HEFT [13] is considered the most famous of these. On the other hand, mathematical optimization approaches such as MILP [14] have produced optimal solutions to this problem and have also analyzed it more extensively. However, the presented state-of-the-art scheduling approaches have in common that they already know the structure of the DAG and find an optimal or sub-optimal assignment of tasks to the computing nodes.

By addressing the scattering problem of a bag, our proposed approach not only provides a suitable structure for the bag, and consequently for the DAG, but also finds the best scheduling of tasks to computing nodes with the objective of minimizing the makespan.

2.3. Task runtime prediction

Most state-of-the-art techniques for workflow scheduling rely on accurate predictions of task runtimes [15]. Therefore, the problem of predicting the runtime of scientific workflow tasks based on historical data has been studied extensively. Recent research has used machine learning techniques to address this issue [16]; [17]; [18]; [19]. These approaches build their initial models on historical data before the actual workflow execution. [20] and [18] use neural network methods, which are known to require large training data sets to perform well, while [3] employs a Bayesian linear regression model, which can work with few training points and provides uncertainty estimates for its predictions. Most existing approaches use the size of the file on the hard drive as input to their prediction models.

In this research, the objective is to minimize the makespan. To compute the makespan value, an acceptable estimate of the runtime of jobs is required. Assuming that we have some historical data of execution traces of jobs on computing nodes, we use three different methods to predict the runtime of jobs (see Section 6).

3. Problem statement

This study addresses the problem of job/task granularity of scientific workflows in heterogeneous cluster environments with the aim of minimizing the makespan. The case study is a metagenomic workflow in which a reference genome, containing a set of genome files, is the main input data of the workflow. Building an FM index (Full-text index in Minute space) over a reference genome (reference genome indexing) is a common and time-consuming task in metagenomic DAWs [21], and such a task is often used by the bioinformatics workflow community, so it is a good case study for optimizing job/task granularity.

In general, the system model includes the following steps. The first step is to collect a historical dataset of job execution traces on the different computing nodes of the cluster; if such a dataset is not already available, it can be collected by data sampling [3]. In the second step, a proper estimation of job runtime and memory consumption is computed using a prediction method. Then, by solving the proposed mathematical model, the optimal size of each chunk of input data for each job and the assignment of jobs to computing nodes are obtained. Finally, the optimal job granularity and assignment are used to execute the DAW on the cluster.
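These four steps can be sketched end to end. Everything below, including the function names, the two-point linear stub models, and the greedy stand-in for the optimization step, is an illustrative assumption rather than the paper's actual implementation:

```python
# Illustrative sketch of the four-step system model described above.
# The greedy solver only stands in for the mathematical model / GA
# of Sections 4 and 5; all numbers are invented.

def collect_traces(cluster_nodes):
    # Step 1: gather historical (input size, runtime, memory) samples per
    # node, e.g. from earlier runs or by data sampling.
    return {node: [(10.0, 3.2, 0.5), (20.0, 5.9, 0.9)] for node in cluster_nodes}

def fit_predictors(traces):
    # Step 2: one runtime model per node (trivial two-point linear stub).
    def make_model(samples):
        (s0, t0, _), (s1, t1, _) = samples[0], samples[-1]
        slope = (t1 - t0) / (s1 - s0)
        return lambda size: t0 + slope * (size - s0)
    return {node: make_model(samples) for node, samples in traces.items()}

def solve_granularity(predictors, genome_sizes):
    # Step 3: decide chunking and assignment; here each file greedily joins
    # the node with the lowest predicted finish time (one job per node).
    load = {node: 0.0 for node in predictors}
    plan = []
    for size in sorted(genome_sizes, reverse=True):
        node = min(load, key=lambda nd: load[nd] + predictors[nd](size))
        load[node] += predictors[node](size)
        plan.append((size, node))
    return plan, max(load.values())

# Step 4 would hand the resulting plan to the workflow engine for execution.
traces = collect_traces(["A", "B"])
predictors = fit_predictors(traces)
plan, makespan = solve_granularity(predictors, [10, 15, 20, 25, 35])
```

The greedy heuristic is only a placeholder; the point of the paper is precisely that this step is replaced by the model of Section 4.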
3.1. A motivating example

Suppose there is a reference genome that contains five genome files of the following sizes: g_0 = 10, g_1 = 15, g_2 = 20, g_3 = 25, g_4 = 35. Moreover, assume that the cluster infrastructure has three computing nodes A, B and C. The runtime of a job with an input size of S on these nodes is calculated by the following functions:

• f_A(S) = ln S^2 + 1
• f_B(S) = ln S^2 + 2S
• f_C(S) = ln S^2 - 4S - 10

In Fig. 2, two possible job groupings and assignments of the genome files are depicted. However, there are 3^5 = 243 different states, among which the solution with the minimum makespan is the best.

Figure 2: Two possible job granularities and assignments with different makespan values. (a) A possible assignment: the files are grouped into four jobs, giving execution times of 7.80 on Node-A, 6.08 on Node-B and 10.15 on Node-C, i.e. a makespan of 10.15. (b) A better assignment: Job1 = {g_0, g_2} on Node-A, Job2 = {g_1} on Node-B and Job3 = {g_3, g_4} on Node-C, giving times of 6.80, 5.42 and 8.11 and a makespan of 8.11.

In a real example, Archaea (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/) has 488 genome files. Assuming that the cluster has 10 computing nodes, any approach based on complete enumeration and trial and error for assigning these genomes to computing nodes requires comparing up to 10^488 different assignments, which is obviously impractical; note that Archaea is a very small reference genome among the available reference genomes. Therefore, complete enumeration is not an efficient, or even applicable, way to solve the related assignment problem. The best approach is to create a mathematical model of the problem and then apply available efficient algorithms for solving mathematical optimization models.

4. The proposed mathematical model

In a mathematical model, the objective function, decision variables and problem constraints are expressed as mathematical expressions. Such models provide a deep insight into the structure of the problem [12]; [22]. They are therefore suitable not only for solving the problem using classical methods, but also for solving it using heuristic or meta-heuristic methods.

The input data for the mathematical model are described in Table 1, which also lists the decision variables of the proposed model. In this section, the formulation of the problem constraints and the objective function of the proposed model are presented in detail.

4.1. Required memory for running jobs

Constraint (1) states that the memory limitation of the kth node must be met; it must hold for all jobs and computing nodes:

mem_k >= f_mem( sum_{i=1..n} S_i * y_ij * x_jk )    for all j in {1, 2, ..., J}, k in {1, 2, ..., v}    (1)

For each J_j and CN_k, sum_{i=1..n} S_i * y_ij * x_jk denotes the input size of J_j on CN_k, and f_mem estimates the memory required for J_j.

4.2. Assigning jobs to cluster nodes

Constraints (2) and (3) imply that non-empty jobs must be assigned to exactly one node. Constraint (4) enforces that empty jobs cannot be assigned to any node.

(1/n) * sum_{i=1..n} y_ij <= sum_{k=1..v} x_jk    for all j in {1, 2, ..., J}    (2)

sum_{k=1..v} x_jk <= 1    for all j in {1, 2, ..., J}    (3)

sum_{k=1..v} x_jk <= sum_{i=1..n} y_ij    for all j in {1, 2, ..., J}    (4)

4.3. Objective function

For constructing the objective function, the following points should be highlighted:

• The objective function deals with the runtime of bags of jobs on the computing nodes.
• If t_k (Eq. (5)) denotes this time for CN_k, then the total runtime (makespan) equals max(t_k), 1 <= k <= v. The objective function aims to minimize this time (Eq. (6)).
• When more than one job is assigned to a node, the node wastes a certain amount of time between executing two jobs; (sum_{j=1..J} x_jk - 1) * st_k denotes that time.
• For each CN_k, sum_{j=1..J} f_k( sum_{i=1..n} y_ij * S_i * x_jk ) is the sum of the runtimes of the jobs assigned to CN_k, where sum_{i=1..n} y_ij * S_i * x_jk is the input size of J_j and f_k() is an implicit runtime function of that size.

The runtime on each CN_k is calculated by Eq. (5):

t_k = ( sum_{j=1..J} x_jk - 1 ) * st_k + sum_{j=1..J} f_k( sum_{i=1..n} y_ij * S_i * x_jk )    (5)

Therefore, the objective function of the model is as follows:
min( max(t_k) ),  1 <= k <= v    (6)

This can be expressed as follows:

min alpha
subject to:
alpha >= t_k    for all k in {1, 2, ..., v}
alpha >= 0

Table 1: Model parameters/variables and their description.

Parameters:
  CN = {node_1, node_2, ..., node_v} : cluster node set
  v : the number of nodes of the cluster
  CN_k : kth node of the cluster
  mem_k : accessible memory size of CN_k for running tasks
  st_k : switch time between two jobs for CN_k
  Gen = {g_1, g_2, ..., g_n} : the set of genomes of the reference genome
  n : the number of genomes of the reference genome
  g_i : ith genome of the reference genome
  S_i : size of g_i
  Job = {J_1, J_2, ..., J_J} : the set of jobs
  J_j : jth job
  J : the number of jobs

Decision variables:
  x_jk : 1 iff J_j is assigned to CN_k, otherwise 0
  y_ij : 1 iff g_i is binned to J_j, otherwise 0
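Using the notation of Table 1, the node times of Eq. (5) and the makespan of Eq. (6) can be evaluated directly for a candidate solution (x, y). The per-node runtime functions f_k, the zero-time guard for idle nodes, and the toy instance below are illustrative assumptions:

```python
# A minimal evaluation of Eqs. (5)-(6) for a candidate solution, using the
# notation of Table 1. The runtime functions f_k and all numbers are invented.

def node_time(k, x, y, sizes, switch, f):
    # Eq. (5): switch-time term plus the summed runtimes of jobs on node k.
    jobs_on_k = [j for j in range(len(x)) if x[j][k] == 1]
    if not jobs_on_k:
        return 0.0  # idle node contributes nothing
    switch_term = (len(jobs_on_k) - 1) * switch[k]
    run_term = sum(
        # x_jk = 1 for every j in jobs_on_k, so it drops out of the inner sum
        f[k](sum(sizes[i] * y[i][j] for i in range(len(sizes))))
        for j in jobs_on_k
    )
    return switch_term + run_term

def makespan(x, y, sizes, switch, f):
    # Eq. (6): the slowest node determines the makespan.
    return max(node_time(k, x, y, sizes, switch, f) for k in range(len(x[0])))

# Toy instance: job 0 = files {g_1, g_2} on node 1, job 1 = file {g_3} on node 2.
sizes = [10, 15, 20]                        # S_i
y = [[1, 0], [1, 0], [0, 1]]                # y[i][j]: file i binned to job j
x = [[1, 0], [0, 1]]                        # x[j][k]: job j assigned to node k
switch = [1.0, 1.0]                         # st_k
f = [lambda s: 0.5 * s, lambda s: 0.3 * s]  # assumed runtime functions f_k
```

In the GA of the next section this evaluation plays the role of the fitness function.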
5. Genetic optimization for solving the proposed model

Suppose there is a reference genome of a certain size with genome files g_1, g_2, ..., g_n. We want to group the genomes into a number of jobs and then assign the jobs to the computing nodes CN_1, ..., CN_v of a cluster infrastructure. It can be seen that the proposed model is a non-linear binary mathematical model. Due to the special structure of its constraints and objective function, linearizing it leads to a binary linear model with a significantly large number of constraints. Solving such a model is very time-consuming in terms of computation (it may even be impossible). In contrast to classical approaches, genetic algorithms are a powerful tool for treating discrete models even if the model is nonlinear [23]. Based on this fact, a genetic approach can solve the proposed model efficiently without any linearization. In the following, we explain how to implement the presented model using a Genetic Algorithm (GA).

5.1. Chromosome structure

GAs mimic natural evolution during optimization by modelling genetic recombination and selection via a fitness function. Hence, when using a GA to solve a particular problem, the first concern is the determination of a suitable chromosome coding. Each chromosome represents the parameters that characterize a solution to the problem [24]. We consider a population of individuals, each being a potential solution to the problem described by its chromosome. The initial population is generated at random.

In this study, job granularity involves determining the assignment of genome files to jobs and the assignment of jobs to cluster nodes. Thus, a solution (chromosome) is a two-dimensional array whose indices indicate the genome file number: the elements of the first row contain the job numbers, and the elements of the second row contain the computing node numbers. The representation of a chromosome in the GA implementation is shown in Fig. 3. As Fig. 3 shows, g_1 and g_3 are assigned to J_1 while g_2 is assigned to J_4; moreover, J_1 and J_4 are scheduled to CN_3 and CN_2, respectively.

Figure 3: A chromosome encoding. For genome files 1, 2, 3, ..., n, the first row holds the job number (e.g. 1, 4, 1, ...) and the second row the node number (e.g. 3, 2, 3, ...).

5.2. GA operators

The crossover operator helps to pass chromosome fragments of excellent individuals on to subsequent generations. In this study, the single-point crossover technique [23] is used to produce new individuals, which are then assessed for their potential to contribute to the next generation of the population. An example is shown in Fig. 4. After the crossover in Fig. 4, the first new individual is not feasible and cannot be added to the next generation, because J_5 has been assigned to two different computing nodes.

The mutation operator replaces some gene values with others to increase population diversity. We use swap mutation [25] to explore new regions of the solution space: two positions on a chromosome are randomly selected and swapped. After mutation, the potential of the new individuals to contribute to the next generation of the population is evaluated.

Figure 4: An example of crossover.
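The two operators can be sketched on the two-row encoding of Fig. 3. The concrete chromosomes, the crossover point, and the swap positions below are fixed for reproducibility and are purely illustrative; a real GA run would draw them at random:

```python
# Sketch of single-point crossover, the feasibility rule, and swap mutation
# on the two-row chromosome (row 0: job per genome file, row 1: node per file).

def single_point_crossover(parent_a, parent_b, point):
    # Exchange the tails of both rows after the cut position.
    return [row_a[:point] + row_b[point:]
            for row_a, row_b in zip(parent_a, parent_b)]

def is_feasible(chromosome):
    # Infeasible if one job number ends up mapped to two different nodes.
    jobs, nodes = chromosome
    job_to_node = {}
    for job, node in zip(jobs, nodes):
        if job_to_node.setdefault(job, node) != node:
            return False
    return True

def swap_mutation(chromosome, i, j):
    # Swap two positions in both rows.
    jobs, nodes = [list(row) for row in chromosome]
    jobs[i], jobs[j] = jobs[j], jobs[i]
    nodes[i], nodes[j] = nodes[j], nodes[i]
    return [jobs, nodes]

parent_a = [[1, 4, 1, 2], [3, 2, 3, 1]]  # job row / node row, one entry per file
parent_b = [[2, 2, 3, 1], [1, 1, 2, 3]]
child = single_point_crossover(parent_a, parent_b, 2)
```

Infeasible children, such as a job assigned to two nodes after crossover, are discarded rather than repaired in this sketch.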
6. Job runtime estimation

In the GA, each individual must be assigned a fitness score, which is given by the objective function defined via Eq. (5). Suppose the runtime of job j on node CN_k is t_j. The challenge here is that there is no explicit function for calculating t_j; in other words, how long does it take to execute a job j of size S_j on a computing node CN_k? Therefore, assuming that we have historical data from task execution traces, we use the following methods to predict the runtime of jobs on computing nodes:

• Linear regression: Linear regression is a popular and simple machine learning algorithm that models the relationship between the dependent and independent variables by learning a linear expression of the independent variables from the training data [26].

• Polynomial approximation: According to classical approximation theorems, any continuous function can be approximated arbitrarily well by a suitable polynomial [27]. Based on this theory, a polynomial of appropriate degree can be used to approximate the unknown runtime function from the given historical data. In practice, lower-degree polynomials are preferred to avoid the oscillation of higher-degree polynomials; here we only consider degrees one, two and three.

• Logarithm of a polynomial approximation (log-pol): As shown in [28], the curves of runtime as a function of input data size are logarithmic-like; on the other hand, as mentioned above, polynomials are a suitable method for estimation, so we use a combination of the logarithm and a polynomial as a new approximation method.

These methods assume that there is a relationship between a task's input size and its runtime, using the input size as the independent variable; thus they can predict the task runtime for any input size. Similar to related work, these methods use the size of the file on the hard disk as input to their prediction models. In a heterogeneous cluster, the same task may have different runtimes on different computing nodes; therefore, the methods create a separate prediction model for each computing node.

7. Experimental results

We developed a read-mapping workflow (https://github.com/CRC-FONDA/A2-job-granularity/tree/main/MG-HIBF) for metagenomic data in the popular workflow engine Snakemake [29]. The workflow was run on Allegro (https://www.mi.fu-berlin.de/w/Cluster/WebHome), a cluster infrastructure, and we created a historical dataset from traces of the workflow execution on the cluster. We selected three reference genomes of different sizes (small, medium and large) as input data of the workflow; their specifications are described in Table 2. We also examined different cluster sizes of 4, 8, 16, and 24 computing nodes in the experiments.

As previously stated, our approach considers the job granularity and scheduling problems simultaneously for a bag of tasks in a workflow, while related works have addressed only the scheduling problem; indeed, they assume that there are a number of jobs, each with a certain size, and schedule these jobs to the cluster nodes so that the total runtime is minimized.

Most existing approaches use the file size on disk as the input of their prediction or approximation models. [30] contains a detailed discussion of why the uncompressed input data size should be used for compressed files. Accordingly, we use the uncompressed file size to predict the runtime and memory consumption of jobs.

Table 2: Reference genome specifications.

Reference Genome | Data Size | # Genome files
Archaea | 1.4 GiB | 488
Bacteria_1 | 30 GiB | 7167
Bacteria_2 | 88 GiB | 22185

7.1. Accuracy comparison of job runtime estimation methods

We compare the accuracy of the methods used to estimate job runtime (see Section 6). Linear regression was implemented using the sklearn.linear_model library (https://scikit-learn.org/stable/modules/linear_model.html); the polynomial and log-pol methods were implemented using polyfit() from NumPy (https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) and curve_fit() from scipy.optimize (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html), respectively. For comparison we use the Mean Absolute Percentage Error (MAPE), calculated by Eq. (7), where A_i is the actual value, F_i the predicted value and n the number of fitted points:

MAPE = (1/n) * sum_{i=1..n} | (A_i - F_i) / A_i |    (7)
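The fit-then-score workflow can be sketched as follows. To stay self-contained, the snippet uses a closed-form linear least-squares fit in place of the sklearn/NumPy/SciPy calls named above, and the history points are invented:

```python
# Sketch of the accuracy comparison: fit a per-node runtime model on
# historical (input size, runtime) points and score it with MAPE (Eq. 7).
# The closed-form linear fit stands in for the paper's three methods.

def fit_linear(points):
    # Ordinary least squares for t ~ intercept + slope * s.
    n = len(points)
    sx = sum(s for s, _ in points)
    st = sum(t for _, t in points)
    sxx = sum(s * s for s, _ in points)
    sxt = sum(s * t for s, t in points)
    slope = (n * sxt - sx * st) / (n * sxx - sx * sx)
    intercept = (st - slope * sx) / n
    return lambda s: intercept + slope * s

def mape(actual, predicted):
    # Eq. (7): mean of |(A_i - F_i) / A_i| over the fitted points.
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

history = [(10, 21.0), (20, 39.0), (40, 82.0), (80, 159.0)]  # (size, runtime)
model = fit_linear(history)
error = mape([t for _, t in history], [model(s) for s, _ in history])
```

In the actual experiments one such model is fitted per method and per computing node, and the resulting MAPE values are compared across methods.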
Figs. 5-7 show the results for each input data set separately, for different cluster sizes. As shown in these figures, the polynomial method gives a more accurate estimate of job runtimes than log-pol, and log-pol in turn outperforms linear regression.

Figure 5: Accuracy (MAPE) of the different prediction methods for Archaea, for clusters of 4, 8 and 16 nodes.
Figure 6: Accuracy (MAPE) of the different prediction methods for Bacteria30G, for clusters of 4, 8 and 16 nodes.
Figure 7: Accuracy (MAPE) of the different prediction methods for Bacteria88G, for clusters of 4, 8 and 16 nodes.

7.2. Changing the number of jobs to improve the makespan

One of the main challenges in this problem is that the number of jobs is not known in advance. The most obvious idea is to set this number equal to the number of genome files, i.e. n. This seems logical at first sight, since allowing empty jobs then makes every possible clustering of genome files into jobs reachable. From a practical point of view, however, this is inappropriate and usually infeasible: due to the large number of genome files, the number of decision variables and constraints increases so much that the problem cannot be solved in a reasonable time, and even if it is solved, the propagation of computational errors leads to incorrect and unreasonable results.
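The growth is easy to quantify. Taking the decision variables of Table 1 (y is an n-by-J binary matrix, x is a J-by-v binary matrix), a back-of-the-envelope count for the largest input (Bacteria_2, n = 22185) on the largest cluster (v = 24) compares J = n with J = 4v:

```python
# Back-of-the-envelope size of the binary model, assuming the decision
# variables of Table 1: y has n*J entries and x has J*v entries.

def num_binary_variables(n, J, v):
    return n * J + J * v  # |y| + |x|

n, v = 22185, 24
naive = num_binary_variables(n, J=n, v=v)         # J = n
factored = num_binary_variables(n, J=4 * v, v=v)  # J = 4v
```

With J = n the model already has close to half a billion binary variables, over two hundred times more than the J = 4v formulation.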
Considering that jobs are ultimately assigned to nodes, another idea is to express the number of jobs as a multiple of the number of nodes v; more precisely:

J = k * v    for some k in N    (8)

The coefficient k in Eq. (8) is increased iteratively and evolutionarily starting from one; that is, after solving the problem for k = 1 and calculating the makespan, we use the resulting solution as an initial solution (an individual of the initial population) for k = 2. This procedure continues and evolves until the distance between two consecutive makespan values is negligible, or insignificant from the decision maker's point of view. On the one hand, this process is compatible with the evolutionary nature of the GA used to solve the resulting sequence of problems: in each step, seeded with the solution of the previous step, the fitness value starts improving from the previous step's value, so the makespan in each step is better than or equal to that of the previous step (i.e., it evolves). On the other hand, given the stopping condition of the procedure, there is no need to solve problems with a large number of jobs.

From the experimental results presented in Tables 3 to 5, the following points can be highlighted:

• As the number of jobs increases, the makespan and the number of unused nodes decrease.
• As the number of nodes increases, the makespan decreases.
• The procedure of increasing the number of jobs may be stopped for two reasons: firstly, increasing the number of jobs does not improve the makespan (significantly); secondly, the decision maker may stop it (especially if no significant improvement in makespan is expected).
• As can be seen in the fourth row of Table 3, the makespan does not improve as the number of jobs increases from v to 2v. Obviously, a higher number of jobs will not improve the makespan either (such cases are marked in bold in the tables), so we have not calculated them.
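The iterative procedure can be sketched as a warm-started loop. Below, `run_ga` is only a placeholder whose returned makespans mimic the 8-node row of Table 3 (339, 334, 312, 312), and the 1% stopping tolerance is an assumption:

```python
# Schematic of the evolutionary job-count procedure: solve for J = k*v,
# warm-start the next run with the previous solution, and stop once the
# makespan gain drops below a tolerance.

def run_ga(num_jobs, seed_solution=None):
    # Stand-in for the GA of Section 5, returning (solution, makespan);
    # the numbers mimic Table 3 for an 8-node cluster.
    table3_8_nodes = {1: 339.0, 2: 334.0, 3: 312.0, 4: 312.0}
    k = num_jobs // 8
    return seed_solution, table3_8_nodes.get(k, 312.0)

def tune_job_count(v, tolerance=0.01, max_k=10):
    best_solution, best_makespan = None, float("inf")
    history = []
    for k in range(1, max_k + 1):
        best_solution, current = run_ga(k * v, best_solution)
        history.append(current)
        if best_makespan - current < tolerance * best_makespan:
            return k * v, current, history  # improvement negligible: stop
        best_makespan = current
    return max_k * v, best_makespan, history

jobs, final_makespan, history = tune_job_count(v=8)
```

With these placeholder numbers the loop stops at k = 4 (J = 32), since the step from 3v to 4v brings no further improvement.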
Moreover, in the last row of Table 5, increasing the number of jobs from 3v to 4v leads to only a 1% decrease in the makespan, which is not significant (such cases are marked in the tables in bold with an asterisk), so we stopped the procedure there. It should be noted that the decision maker may already stop the procedure at 3v jobs, due to the insignificant improvement in the makespan and the abandonment of an unused node.

Table 3: Obtained results for Archaea (each cell: makespan / # unused nodes).

#Nodes | J=v | J=2v | J=3v | J=4v
4 | 539 / 2 | 404 / 0 | 404 / 0 | -
8 | 339 / 1 | 334 / 0 | 312 / 0 | 312 / 0
16 | 285 / 5 | 252 / 0 | 251 / 0 | 251 / 0
24 | 206 / 5 | 206 / 3 | - | -

Table 4: Obtained results for Bacteria-30GiB (each cell: makespan / # unused nodes).

#Nodes | J=v | J=2v | J=3v | J=4v
4 | 5804 / 1 | 5124 / 0 | 5077 / 0 | 5077 / 0
8 | 3050 / 1 | 2957 / 0 | 2952* / 0 | -
16 | 1937 / 3 | 1937 / 3 | - | -
24 | 2016 / 7 | 1859 / 1 | 1859 / 0 | -

Table 5: Obtained results for Bacteria-88GiB (each cell: makespan / # unused nodes).

#Nodes | J=v | J=2v | J=3v | J=4v
4 | 18965 / 0 | 18965 / 0 | - | -
8 | 14673 / 1 | 11432 / 0 | 11119 / 0 | 10341* / 0
16 | 7821 / 4 | 7385 / 1 | 6127 / 0 | 6127 / 0
24 | 6078 / 6 | 5487 / 2 | 4911 / 1 | 4849* / 0

8. Conclusion and future works

In this paper, an approach to the task/job granularity problem for metagenomic DAWs in cluster infrastructures with makespan minimization was proposed. The problem was first formulated as a mathematical model, and the model was then solved using a GA. One of the main challenges in this problem is that the number of jobs is not known in advance; we overcame this challenge by adjusting the number of jobs as a factor of the number of computing nodes. For each increase in the number of jobs, the makespan is calculated, and this procedure continues and evolves until the distance between two successive makespan values is negligible, or insignificant from the decision maker's point of view. Experimental results showed that a desirable makespan can be obtained after a few steps of increasing the number of jobs. Furthermore, since the calculation of the makespan requires a proper estimation of task runtimes, we applied three different methods for this estimation; experimental results showed that the polynomial approximation performs best.

In the future, we aim to generalize the proposed model so that it can be applied to other scientific domains. Since the proposed approach does not schedule the whole workflow, but optimizes a single step of it, we also intend to integrate it into a scheduling approach in future work.

Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as FONDA (Project 414984028, SFB 1404).

References

S. Mohammadi, L. PourKarimi, H. Pedram, Integer linear programming-based multi-objective scheduling for scientific workflows in multi-cloud environments, The Journal of Supercomputing 75 (2019) 6683-6709.

S. Abdi, L. PourKarimi, M. Ahmadi, F. Zargari, Cost minimization for deadline-constrained bag-of-tasks applications in federated hybrid clouds, Future Generation Computer Systems 71 (2017) 113-128.

J. Bader, F. Lehmann, L. Thamsen, U. Leser, O. Kao, Lotaru: Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures, Future Generation Computer Systems 150 (2024) 171-185. doi:10.1016/j.future.2023.08.022.

R. da Silva, A.-C. Orgerie, H. Casanova, R. Tanaka, E. Deelman, F. Suter, Accurately simulating energy consumption of I/O-intensive scientific workflows, in: Computational Science - ICCS 2019: 19th International Conference, Faro, Portugal, June 12-14, 2019, Proceedings, Part I, Springer, 2019, pp. 138-152.

R. F. da Silva, H. Casanova, A.-C. Orgerie, R. Tanaka, E. Deelman, F. Suter, Characterizing, modeling, and accurately simulating power and energy consumption of I/O-intensive scientific workflows, Journal of Computational Science 44 (2020) 101157.

P. Maechling, E. Deelman, L. Zhao, R. Graves, G. Mehta, N. Gupta, J. Mehringer, C. Kesselman, S. Callaghan, D. Okaya, SCEC CyberShake workflows - automating probabilistic seismic hazard analysis calculations, in: Workflows for e-Science, Springer, 2007, pp. 143-163.

A. M. Kintsakis, F. E. Psomopoulos, P. A. Mitkas, Data-aware optimization of bioinformatics workflows in hybrid clouds, Journal of Big Data 3 (2016) 1-26.

S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M.-H. Su, K. Vahi, Characterization of scientific workflows, in: Workflows in Support of Large-Scale Science (WORKS 2008), Third Workshop on, IEEE, 2008, pp. 1-10.

L. B. Costa, H. Yang, E. Vairavanathan, A. Barros, K. Maheshwari, G. Fedak, D. Katz, M. Wilde, M. Ripeanu, S. Al-Kiswany, The case for workflow-aware storage: An opportunity study, Journal of Grid Computing 13 (2015) 95-113.

E. Vairavanathan, S. Al-Kiswany, L. B. Costa, Z. Zhang, D. S. Katz, M. Wilde, M. Ripeanu, A workflow-aware storage system: An opportunity study, in: 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), IEEE, 2012, pp. 326-334.

J. M. Wozniak, M. Wilde, Case studies in storage access by loosely coupled petascale applications, in: Proceedings of the 4th Annual Workshop on Petascale Data Storage,

S. M. Sadjadi, S. Shimizu, J. Figueroa, R. Rangaswami, J. Delgado, H. Duran, X. J. Collazo-Mojica, A modeling approach for estimating execution time of long-running scientific applications, in: 2008 IEEE International Symposium on Parallel and Distributed Processing, IEEE, 2008, pp. 1-8.

M. H. Hilman, M. A. Rodriguez, R. Buyya, Task runtime prediction in scientific workflows using an online incremental learning approach, in: 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC), IEEE, 2018, pp. 93-102.

K. Kianfar, C. Pockrandt, B.
Torkamandi, H. Luo, K. Rein- 2009, pp. 16–20. ert, Optimum Search Schemes for approximate string matching using bidirectional FM-index, arXiv preprint S. Mohammadi, H. Pedram, L. PourKarimi, Integer linear arXiv:1711.02035 (2017). programming-based cost optimization for scheduling scientific workflows in multi-cloud environments, Jour- S. Abdi, M. Ashjaei, S. Mubeen, Cognitive and Time Pre- nal of Supercomputing 74 (2018) 4717–4745. doi:10. dictable Task Scheduling in Edge-cloud Federation, in: 1007/s11227-018-2465-8. 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), IEEE, H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective 2022, pp. 1–4. and low-complexity task scheduling for heterogeneous computing, IEEE transactions on parallel and dis- S. Katoch, S. S. Chauhan, V. Kumar, A review on genetic tributed systems 13 (2002) 260–274. algorithm: past, present, and future, Multimedia tools and applications 80 (2021) 8091–8126. S. Mohammadi, L. PourKarimi, F. Droop, N. De Mec- quenem, U. Leser, K. Reinert, A mathematical program- G. Tremblay, R. Sabourin, P. Maupin, Optimizing nearest ming approach for resource allocation of data analysis neighbour in random subspaces using a multi-objective workflows on heterogeneous clusters, The Journal of genetic algorithm, in: Proceedings of the 17th Interna- Supercomputing (2023) 1–30. tional Conference on Pattern Recognition, 2004. ICPR 2004., volume 1, IEEE, 2004, pp. 208–211. T. Dziok, K. Figiela, M. Malawski, Adaptive multi-level workflow scheduling with uncertain task estimates, in: I. De Falco, A. Della Cioppa, E. Tarantino, Mutation-based Parallel Processing and Applied Mathematics: 11th genetic algorithm: performance evaluation, Applied International Conference, PPAM 2015, Krakow, Poland, Soft Computing 1 (2002) 285–299. September 6-9, 2015. Revised Selected Papers, Part II, Springer, 2016, pp. 90–100. T. M. H. 
Hope, Linear regression, in: Machine Learning, Elsevier, 2020, pp. 67–81. R. F. Da Silva, G. Juve, E. Deelman, T. Glatard, F. Desprez, D. Thain, B. Tovar, M. Livny, Toward fine-grained W. Rudin, Principles of Real Analysis. Mathematics Series, online task characteristics estimation in scientific work- 1976. flows, in: Proceedings of the 8th workshop on work- C. Valdes, V. Stebliankin, G. Narasimhan, Large scale micro- flows in support of large-scale science, 2013, pp. 58–67. biome profiling in the cloud, Bioinformatics 35 (2019) R. F. Da Silva, G. Juve, M. Rynge, E. Deelman, M. Livny, i13—-i22. Online task resource consumption prediction for scien- J. Köster, S. Rahmann, Snakemake—a scalable bioinformat- tific workflows, Parallel Processing Letters 25 (2015) ics workflow engine, Bioinformatics 28 (2012) 2520– 1541003. 2522. F. Nadeem, D. Alghazzawi, A. Mashat, K. Fakeeh, A. Al- J. Bader, F. Lehmann, L. Thamsen, J. Will, U. Leser, O. Kao, malaise, H. Hagras, Modeling and predicting execution Lotaru: Locally Estimating Runtimes of Scientific time of scientific workflows in the grid using radial Workflow Tasks in Heterogeneous Clusters, arXiv basis function neural network, Cluster Computing 20 preprint arXiv:2205.11181 (2022). (2017) 2805–2819.
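The iterative procedure behind the tables above (increase the number of jobs J in steps of v, the number of computing nodes, and stop once the makespan improvement becomes insignificant to the decision maker) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the 5% significance threshold and the function names are hypothetical, and `makespan` is stubbed with the measured values from the last row of Table 5 (24 nodes, Bacteria-88GiB) instead of running the GA.

```python
# Sketch of the decision-maker (DM) stopping rule for job granularity.
# Assumptions: the 5% threshold and function names are illustrative;
# makespan() is stubbed with measured values rather than solving the model.

# Measured makespans for J = k * v jobs, k = 1..4 (last row of Table 5).
MEASURED_MAKESPAN = {1: 6078, 2: 5487, 3: 4911, 4: 4849}

def makespan(k: int) -> float:
    """Stub: in the paper this value comes from solving the GA model."""
    return MEASURED_MAKESPAN[k]

def choose_job_factor(threshold: float = 0.05, max_factor: int = 4) -> int:
    """Increase J in steps of v until the relative makespan improvement
    between two successive job counts drops below `threshold`."""
    best = makespan(1)
    factor = 1
    for k in range(2, max_factor + 1):
        current = makespan(k)
        improvement = (best - current) / best
        if improvement < threshold:
            break  # further splitting is insignificant for the DM
        best, factor = current, k
    return factor

print(choose_job_factor())  # with a 5% threshold the DM stops at J = 3v
```

With the illustrative 5% threshold the loop returns 3, matching the note above that the DM may stop at 3v jobs: the step from 3v to 4v improves the makespan from 4911 to 4849, roughly 1%.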