=Paper=
{{Paper
|id=Vol-3651/DARLI-AP_paper15
|storemode=property
|title=Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures
|pdfUrl=https://ceur-ws.org/Vol-3651/DARLI-AP-15.pdf
|volume=Vol-3651
|authors=Somayeh Mohammadi,Latif Pourkarimi,Manuel Zschäbitz,Tristan Aretz,Ninon De Mecquenem,Ulf Leser,Knut Reinert
|dblpUrl=https://dblp.org/rec/conf/edbt/MohammadiPZAMLR24
}}
==Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures==
Somayeh Mohammadi¹, Latif PourKarimi², Manuel Zschäbitz¹, Tristan Aretz¹, Ninon De Mecquenem³, Ulf Leser³ and Knut Reinert¹

¹ Department of Mathematics and Computer Science, Freie Universität Berlin, Germany
² Department of Mathematics and Computer Science, Razi University, Kermanshah, Iran
³ Department of Mathematics and Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
Abstract
Data analysis workflows are popular for sequencing the activities in large-scale and complex scientific processes. Scheduling approaches attempt to find an assignment of workflow tasks to computing nodes that minimizes the makespan on heterogeneous cluster infrastructures. A common feature of these approaches is that the structure of the workflow is assumed to be known in advance. However, for many workflows, a high degree of parallelization can be achieved by splitting the large input data of a single task into chunks and processing them independently. We call this problem task granularity: it involves finding an assignment of tasks to computing nodes while simultaneously optimizing the structure of a bag of tasks. Accordingly, this paper addresses the problem of task granularity for metagenomic workflows. To this end, we first formulate the problem as a mathematical model and then solve the proposed model using a genetic algorithm. To overcome the challenge that the number of tasks is not known in advance, we set the number of tasks to a multiple of the number of computing nodes. The number of tasks is then increased iteratively and evolutionarily. Experimental results show that a desirable makespan value can be achieved after a few steps of this increase.
Keywords
Data Analysis Workflow, Mathematical Programming, Makespan Minimization, Run Time Prediction, Genetic Algorithm
1. Introduction

Scientists in many domains, such as bioinformatics, remote sensing, and physics, use Data Analysis Workflows (DAWs) to sequence the activities involved in large-scale and complex scientific processes [1]; [2]. These DAWs are typically represented as a Directed Acyclic Graph (DAG), which consists of a set of tasks and directed edges between the tasks; the edges express data dependencies between tasks and the order of task execution.

Scientists often use heterogeneous cluster infrastructures to run their DAWs because of privacy and financial concerns. Heterogeneous clusters provide high-performance computing environments that enable efficient data analysis and the execution of large-scale DAWs in a reasonable amount of time [3]. DAWs are often executed on large amounts of data, resulting in runtimes that can exceed days or weeks [4]; [5]; [6]. Thus, in such environments, the key objective is to schedule DAW tasks across computing resources in such a way that the total execution time, also known as the makespan, is minimized.

It is well known that a high degree of parallelization can be achieved in many DAWs by splitting the input data of individual tasks into chunks and processing them independently [7]. For example, in metagenomic DAWs, a reference genome is a frequent input whose size can vary from several KB to hundreds of GB, and it typically contains thousands of genome files. The reference genome can be divided into different bins of genome files, which are then processed by several independent tasks in parallel. The main challenge here is how to partition the input data: what is the appropriate size of each chunk of input data, and how should each task be assigned to a computing node so that the makespan is minimized? We call this problem task granularity. In heterogeneous environments, this challenge is aggravated because the computing power of the available computing nodes differs, so choosing the input size of each task to be executed on each of these computing nodes is a very effective means of optimizing the makespan. Since each task of the DAW is equivalent to a job for the cluster when a workflow is submitted for execution, the terms task and job are used interchangeably in this study.

In this paper, we propose a novel approach to task granularity for metagenomic DAWs in cluster infrastructures with makespan minimization. We first formulate the problem as a mathematical model. We then solve the proposed model using the Genetic Algorithm method. Since the calculation of the makespan requires a proper estimation of task runtimes, we apply three different methods for this estimation and also compare their accuracy.

The paper is organized as follows: Section 2 presents a review of the related work, and Section 3 states the problem. The proposed mathematical model is introduced in Section 4. Section 5 discusses solving the proposed model using a genetic algorithm. Job runtime estimation is presented in Section 6, and experimental results are presented in Section 7. Finally, Section 8 provides concluding remarks and plans for further studies.

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.
somayeh.mohammadi@fu-berlin.de (S. Mohammadi); l.pourkarimi@razi.ac.ir (L. PourKarimi); manuez42@zedat.fu-berlin.de (M. Zschäbitz); aret01@zedat.fu-berlin.de (T. Aretz); mecquenn@informatik.hu-berlin.de (N. D. Mecquenem); leser@informatik.hu-berlin.de (U. Leser); Reinert@fu-berlin.de (K. Reinert)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Related works

In this section, we first discuss data access patterns used in scientific workflows. We then cover the scheduling of scientific workflows on heterogeneous clusters. Finally, we focus on methods for predicting the runtime of tasks, as these estimates are often used as input for scheduling approaches.
2.1. Data access patterns used in scientific workflows

The data access patterns of workflow applications have been addressed by several studies [8]; [9]; [10]; [11]. Accordingly, the most commonly used patterns in scientific workflows are as follows (Fig. 1):

• Pipeline: This is the most basic and familiar pattern. A set of computational tasks is chained in a sequence such that the output of a parent task is the input of its child task in the chain. Because of the linear dependencies in a pipeline pattern, the execution of a task cannot begin until the execution of its parent has completed and it has received the data generated by its parent.

• Scatter: An input dataset is divided into several chunks. These chunks are distributed over multiple tasks (a bag of tasks), which can be executed simultaneously because there are no dependencies between them.

• Gather: Multiple chunks of data are produced by multiple tasks. All of them are used as input data by a subsequent task. The latter task may need to receive all the chunks and integrate them before it can start execution.

Figure 1: Common data access patterns (Pipeline, Scatter, Gather) in scientific workflows.

Obviously, the scatter pattern can be very effective in reducing the makespan in a distributed environment such as a cluster or cloud because it enables parallel execution of tasks on computing nodes. The implementation of scatter is an NP-hard problem (see Section 3.1), so the user is not able to do it manually and still achieve an acceptable makespan. In this study, we propose an approach to address this problem.

2.2. Scheduling of scientific workflows on heterogeneous clusters

Generally, workflow scheduling on heterogeneous infrastructures can be done in two ways, statically or dynamically [12]; [2]. Static scheduling assigns tasks to compute resources in advance, assuming that accurate information about the workflow and the infrastructure resources is available. Dynamic scheduling does not require such assumptions. On the one hand, many heuristic and meta-heuristic approaches have been proposed for this problem, of which HEFT [13] is considered the most famous. On the other hand, mathematical optimization approaches such as MILP [14] have provided optimal solutions to this problem and have also analyzed it more extensively. However, the presented state-of-the-art scheduling approaches have in common that they already know the structure of the DAG and find an optimal or sub-optimal assignment of tasks to the computing nodes.

By addressing the scattering problem of a bag, our proposed approach not only provides a suitable structure for the bag, and consequently for the DAG, but also finds the best scheduling of tasks to computing nodes with the objective of minimizing the makespan.

2.3. Task runtime prediction

Most state-of-the-art techniques for workflow scheduling rely on accurate predictions of task runtimes [15]. Therefore, the problem of predicting the runtime of scientific workflow tasks based on historical data has been studied extensively. Recent research has used machine learning techniques to address this issue [16]; [17]; [18]; [19]. These approaches build their initial models on historical data before the actual workflow execution. [20]; [18] use neural network methods, which are known to require large training data sets to perform well, while [3] employs a Bayesian linear regression model, which can work with few training points and provides uncertainty estimates for its predictions. Most existing approaches use the size of the file on the hard drive as input to their prediction models.

In this research, the objective is to minimize the makespan. To compute the makespan value, an acceptable estimate of the runtime of jobs is required. Assuming that we have some historical data of execution traces of jobs on computing nodes, we use three different methods to predict the runtime of jobs (see Section 6).

3. Problem statement

This study addresses the problem of job/task granularity of scientific workflows in heterogeneous cluster environments with the aim of minimizing the makespan. The case study is a metagenomic workflow where a reference genome containing a set of genome files is the main input data of the workflow. Building an FM index (Full-text index in Minute space) over a reference genome (reference genome indexing) is a common and time-consuming task in metagenomic DAWs [21], and such a task is often used by the bioinformatics workflow community, so it is a good case study for optimizing job/task granularity.

In general, the system model includes the following steps: the first step is to collect a historical dataset of job execution traces on different computing nodes of the cluster. If this dataset is not already available, it can be collected by data sampling [3]. In the second step, a proper estimation of the job runtime and its memory consumption is performed using a prediction method. Then, by solving the proposed mathematical model, the optimal size of each chunk of input data for each job and also the assignment of jobs to computing nodes is obtained. Finally, the optimal job granularity and assignment is used to execute the DAW in the cluster.

3.1. A motivating example

Suppose there is a reference genome that contains five genome files of the following sizes: g0 = 10, g1 = 15, g2 = 20, g3 = 25, g4 = 35. Moreover, assume that the cluster infrastructure has three computing nodes A, B and C. The runtime of a job with an input size of S on the computing nodes is calculated by the following functions:
• f_A(S) = ln S² + 1
• f_B(S) = ln S² + 2S
• f_C(S) = ln S² − 4S − 10

Figure 2: Two possible job granularities and assignments with different makespan values. (a) A possible assignment: four jobs {g1, g4}, {g2}, {g0}, {g3} are formed and scheduled on nodes A, B and C; execution times are 7.80 on Node-A, 6.08 on Node-B and 10.15 on Node-C, so the makespan is 10.15. (b) A better assignment: three jobs {g0, g2}, {g1}, {g3, g4} are formed; execution times are 6.80 on Node-A, 5.42 on Node-B and 8.11 on Node-C, so the makespan is 8.11.

In Fig. 2, two different possible job constructions and assignments of the genome files are depicted. However, there are 3^5 = 243 different states as solutions, among which the solution with the minimum makespan is the best.

In a real example, Archaea¹ has 488 genome files. Assume that the number of cluster computing nodes is 10. Any approach based on complete enumeration and trial and error for assigning these genomes to computing nodes requires comparing up to 10^488 different assignments. This approach is obviously impractical. It is noticeable that Archaea is a very small reference genome among the available reference genomes. Therefore, applying the above-mentioned approach for solving the related assignment problem is not efficient or even applicable. The best approach to deal with this problem is to create a mathematical model for the problem and then apply available efficient algorithms for solving mathematical optimization models.

¹ https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/

4. The proposed mathematical model

In a mathematical model, the objective function, decision variables and problem constraints are expressed as mathematical expressions. Such models provide deep insight into the structure of the problem [12]; [22]. They are therefore suitable not only for solving the problem using classical methods, but also for solving it using heuristic or meta-heuristic methods.

The input data for the mathematical model are described in Table 1; this table also explains the decision variables of the proposed model. In this section, the formulation of the problem constraints and the objective function of the proposed model are presented in detail.

4.1. Required memory for running jobs

This constraint states that the memory limitation of the kth node must be met. It must hold for all jobs and computing nodes:

mem_k ≥ f_mem( Σ_{i=1}^{n} S_i · y_ij · x_jk )   ∀j ∈ {1, 2, ..., J}, ∀k ∈ {1, 2, ..., v}   (1)

For each J_j and CN_k, Σ_{i=1}^{n} S_i · y_ij · x_jk denotes the input size of J_j on CN_k, and f_mem estimates the memory required for J_j.

4.2. Assigning jobs to cluster nodes

Constraints (2) and (3) imply that non-empty jobs must be assigned to exactly one node. Constraint (4) enforces that empty jobs cannot be assigned to any node.

(1/n) Σ_{i=1}^{n} y_ij ≤ Σ_{k=1}^{v} x_jk   ∀j ∈ {1, 2, ..., J}   (2)

Σ_{k=1}^{v} x_jk ≤ 1   ∀j ∈ {1, 2, ..., J}   (3)

Σ_{k=1}^{v} x_jk ≤ Σ_{i=1}^{n} y_ij   ∀j ∈ {1, 2, ..., J}   (4)

4.3. Objective function

For constructing the objective function, the following points should be highlighted:

• The objective function deals with the runtime of some bags of jobs on some computing nodes.
• If t_k (Eq. (5)) denotes this time for CN_k, then the total runtime (makespan) equals max(t_k), 1 ≤ k ≤ v. The objective function aims to minimize this time (Eq. (6)).
• When more than one job is assigned to a node, the node wastes a certain amount of time between executing two jobs; (Σ_{j=1}^{J} x_jk − 1) · st_k denotes that time.
• For each CN_k, Σ_{j=1}^{J} f_k( Σ_{i=1}^{n} y_ij · S_i · x_jk ) calculates the summation of the runtimes of the jobs assigned to CN_k, where Σ_{i=1}^{n} y_ij · S_i · x_jk is the input size of J_j and f_k() is the (implicit) runtime function of CN_k.

The runtime of the jobs on each CN_k is calculated by Eq. (5):

t_k = ( Σ_{j=1}^{J} x_jk − 1 ) · st_k + Σ_{j=1}^{J} f_k( Σ_{i=1}^{n} y_ij · S_i · x_jk )   (5)

Therefore, the objective function of the model is as follows:

min max(t_k),   1 ≤ k ≤ v   (6)

This can be expressed as follows:

min α
subject to:
α ≥ t_k   ∀k ∈ {1, 2, ..., v}
α ≥ 0
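To make the combinatorial blow-up of Section 3.1 concrete, the 3^5 = 243 candidate assignments of five genome files to three nodes can be enumerated directly. This is a minimal sketch, not the paper's implementation: the per-node runtime functions below are simplified linear stand-ins for f_A, f_B, f_C, and switch times st_k are ignored.

```python
from itertools import product

# Genome file sizes from the motivating example in Section 3.1.
sizes = [10, 15, 20, 25, 35]

# Hypothetical per-node runtime functions (illustrative stand-ins for
# f_A, f_B, f_C): runtime grows with the total input size S on a node.
runtime = {
    "A": lambda s: 1.0 * s,  # fastest node
    "B": lambda s: 1.5 * s,
    "C": lambda s: 2.0 * s,  # slowest node
}

def makespan(assignment):
    """Each genome file goes to one node; a node's time is its runtime
    function applied to the total input size it receives; the makespan
    is the maximum over all nodes (switch times ignored here)."""
    load = dict.fromkeys(runtime, 0)
    for size, node in zip(sizes, assignment):
        load[node] += size
    return max(runtime[n](load[n]) for n in runtime)

# Enumerate all 3^5 = 243 ways of placing 5 genome files on 3 nodes.
candidates = list(product(runtime, repeat=len(sizes)))
best = min(candidates, key=makespan)
print(len(candidates), makespan(best))  # 243 candidates
```

Even this toy instance shows why exhaustive search does not scale: the count grows as v^n, which for Archaea's 488 files on 10 nodes already yields the 10^488 cases mentioned above.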
Table 1
Model parameters/variables and their description.

Parameter                              Description
CN = {node_1, node_2, ..., node_v}     Cluster node set
v                                      The number of nodes of the cluster
CN_k                                   kth node of the cluster
mem_k                                  Accessible memory size of CN_k for running tasks
st_k                                   Switch time between two jobs for CN_k
Gen = {g_1, g_2, ..., g_n}             The set of genomes of the reference genome
n                                      The number of genomes of the reference genome
g_i                                    ith genome of the reference genome
S_i                                    Size of g_i
Job = {J_1, J_2, ..., J_J}             The set of jobs
J_j                                    jth job
J                                      The number of jobs

Decision variable                      Description
x_jk                                   1 iff J_j is assigned to CN_k, otherwise 0
y_ij                                   1 iff g_i is binned to J_j, otherwise 0
5. Genetic optimization for solving the proposed model

Suppose there is a reference genome of a certain size with genome files g_1, g_2, ..., g_n. We want to group the genomes into a number of jobs and then assign the jobs to computing nodes CN_1, ..., CN_v of a cluster infrastructure.

It can be seen that the proposed model is a non-linear binary mathematical model. Due to the special structure of the constraints and the objective function of this model, linearizing it leads to a binary linear model with a significantly large number of constraints. Solving this model is very time-consuming in terms of computation (it may even be impossible). On the other hand, in contrast to classic approaches, genetic algorithms are a very powerful approach for treating discrete models even if the model is nonlinear [23]. Based on this fact, a genetic approach can efficiently solve the proposed model without any linearization. In the following, we explain how to implement the presented model using a Genetic Algorithm (GA).

5.1. Chromosome structure

GAs mimic natural evolution by modelling genetic recombination and a fitness function. Hence, when using a GA to solve a particular problem, the first concern to be addressed is the determination of a suitable chromosome coding. Each chromosome represents the different parameters that characterize a solution to the problem [24]. In the solution, we consider a population of individuals, each individual being a potential solution to the problem described by the individual's chromosome. The initial population is generated at random.

In this study, job granularity involves determining the assignment of genome files to jobs and the assignment of jobs to cluster nodes. Thus, a solution (chromosome) is a two-dimensional array in which the indices indicate the genome file number. The elements of the first row contain the job numbers and the elements of the second row contain the computing node numbers. The representation of a chromosome in the GA implementation is shown in Fig. 3. As Fig. 3 shows, g_1 and g_3 are assigned to J_1 while g_2 is assigned to J_4. Moreover, J_1 and J_4 are scheduled to CN_3 and CN_2, respectively.

Figure 3: A chromosome encoding. (Genome file number: 1, 2, 3, ..., n; Job number: 1, 4, 1, ...; Node number: 3, 2, 3, ...)

5.2. GA operators

The crossover operator helps to pass chromosome fragments of excellent individuals on to subsequent generations. In this study, the single-point crossover technique [23] is adopted to produce new individuals. These new individuals are then assessed for their potential to contribute to the next generation of the population. An example is shown in Fig. 4. After crossover, the first new individual is not feasible to add to the next generation because J_5 has been assigned to two different computing nodes.

The mutation operator replaces some gene values with others to increase population diversity. We use swap mutation [25] to explore new regions of the solution space, where two positions on a chromosome are randomly selected and swapped. After mutation, the potential of new individuals to contribute to the next generation of the population is evaluated.

6. Job runtime estimation

In the GA, each individual should be assigned a fitness score, which is given by the objective function defined in Eq. (5). Suppose the runtime of job j on node CN_k is equal to t_j. The challenge here is that there is no explicit function for calculating t_j; in other words, how
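The chromosome encoding and the two operators of Section 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a chromosome is a pair of parallel lists (job number and node number per genome file, as in Fig. 3), crossover cuts both rows at one point, mutation swaps both rows of two positions together (one possible reading of swap mutation), and the feasibility check rejects children in which one job ends up on two different nodes (the situation of Fig. 4).

```python
import random

random.seed(0)  # reproducible illustration

def is_feasible(chrom):
    """Reject chromosomes in which one job is mapped to two nodes."""
    jobs, nodes = chrom
    job_to_node = {}
    for j, k in zip(jobs, nodes):
        if job_to_node.setdefault(j, k) != k:
            return False
    return True

def single_point_crossover(p1, p2):
    """Cut both rows of both parents at one random point, swap the tails."""
    n = len(p1[0])
    cut = random.randrange(1, n)
    child1 = (p1[0][:cut] + p2[0][cut:], p1[1][:cut] + p2[1][cut:])
    child2 = (p2[0][:cut] + p1[0][cut:], p2[1][:cut] + p1[1][cut:])
    return child1, child2

def swap_mutation(chrom):
    """Swap the genes (job and node entries) at two random positions."""
    jobs, nodes = list(chrom[0]), list(chrom[1])
    a, b = random.sample(range(len(jobs)), 2)
    jobs[a], jobs[b] = jobs[b], jobs[a]
    nodes[a], nodes[b] = nodes[b], nodes[a]
    return jobs, nodes

# Fig. 3 example: g1 and g3 in job 1 on node 3, g2 in job 4 on node 2.
parent = ([1, 4, 1], [3, 2, 3])
assert is_feasible(parent)

# Infeasible child as in Fig. 4: job 5 on nodes 1 and 2 simultaneously.
assert not is_feasible(([5, 5], [1, 2]))

mutant = swap_mutation(parent)
assert is_feasible(mutant)  # swapping whole columns preserves feasibility
```

Note that swapping whole columns keeps the multiset of (job, node) pairs unchanged, so feasibility is preserved; crossover, by contrast, can produce the infeasible children described above and therefore needs the check.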
long does it take to execute a job j of size S_j on a computing node CN_k? Therefore, we use the following methods to predict the runtime of jobs on computing nodes, assuming that we have historical data from task execution traces:

• Linear Regression: Linear regression is a popular and simple machine learning algorithm that models the relationship between dependent and independent variables by learning a linear expression of the independent variables from the training data [26].

• Polynomial approximation: According to mathematical theorems, any continuous function can be approximated arbitrarily well by a suitable polynomial [27]. Based on this theory, a polynomial of appropriate degree can be used to approximate the unknown function using the given historical data. In practice, lower-degree polynomials are preferred to avoid the oscillation of higher-degree polynomials. Here we only consider degrees one, two and three.

• Logarithm of a polynomial approximation (log-pol): As shown in [28], the curves of runtime as a function of input data size are logarithm-like; on the other hand, as mentioned above, polynomials are a suitable method for estimation, so we use a combination of logarithm and polynomial as a new approximation method.

These methods assume that there is a relationship between a task's input size and its runtime, using the input size as the independent variable. Thus, they can be used to predict the task runtime for any task input size. Similar to related work, these methods use the size of the file on the hard disk as an input to their prediction models. In a heterogeneous cluster, the same task may have different runtimes on different computing nodes. Therefore, the methods create their prediction models separately for each computing node.

Figure 4: An example of crossover.

7. Experimental results

We developed a read-mapping workflow² for metagenomic data in the popular workflow engine Snakemake [29]. The workflow was run on Allegro³, a cluster infrastructure. We created a historical dataset from traces of the workflow execution on the cluster. We selected and used three reference genomes with different sizes (small, medium and large) as input data of the workflow to perform the experiments. The specifications of the input data are described in Table 2. We also looked into different cluster sizes of 4, 8, 16, and 24 computing nodes in the experiments.

As previously stated, our approach considers the job granularity and scheduling problems simultaneously for a bag of tasks in a workflow, while related works have addressed only the scheduling problem; indeed, they have assumed that there are a number of jobs, each with a certain size, and they schedule these jobs to the cluster nodes so that the total runtime is minimized.

Most existing approaches use the file size on disk as the input for their prediction or approximation models. [30] gives a detailed discussion of why the uncompressed input data size should be used for compressed files. Accordingly, we use the uncompressed file size to predict the runtime and the memory usage of jobs.

Table 2
Reference genome specifications.

Reference Genome   Data Size   # Genome files
Archaea            1.4 GiB     488
Bacteria_1         30 GiB      7167
Bacteria_2         88 GiB      22185

7.1. Accuracy comparison of job runtime estimation methods

We compare the accuracy of the methods used to estimate job runtime (see Section 6). The linear regression was implemented using the sklearn.linear_model⁶ library. The polynomial and log-pol methods have been implemented using polyfit() from the NumPy⁷ library and curve_fit() from the scipy.optimize⁸ library, respectively. For comparison we use the Mean Absolute Percentage Error (MAPE). This metric is calculated using Eq. (7), where A_i is the actual value, F_i is the predicted value and n is the number of fitted points.

MAPE = (1/n) Σ_{i=1}^{n} | (A_i − F_i) / A_i |   (7)

Figs. 5-7 show the results for each input data set separately, with different cluster sizes. As shown in these figures, the polynomial method gives a more accurate estimate of the job runtimes than log-pol, and log-pol in turn outperforms linear regression.

7.2. Changing the number of jobs to improve makespan

One of the main challenges in this problem is that the number of jobs is not known in advance. The most obvious idea is to consider this number as equal to the number of genome files, i.e. n. This seems logical at first sight, since the possibility of considering empty jobs allows one to potentially find any possible clustering of genome files into jobs. However, from a practical point of view, this is inappropriate and impossible in most cases: due to the large number of genome files, the number of decision variables and constraints increases significantly, making it impossible to solve

² https://github.com/CRC-FONDA/A2-job-granularity/tree/main/MG-HIBF
³ https://www.mi.fu-berlin.de/w/Cluster/WebHome
⁶ https://scikit-learn.org/stable/modules/linear_model.html
⁷ https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
⁸ https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
Figure 5: Accuracy (MAPE) of the three prediction methods (Polynomial, Linear Regression, log-pol) for Archaea, for cluster sizes of 4, 8 and 16 nodes.
Figure 6: Accuracy (MAPE) of the three prediction methods (Polynomial, Linear Regression, log-pol) for Bacteria30G, for cluster sizes of 4, 8 and 16 nodes.
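The estimators compared in Figs. 5-7 and the MAPE metric of Eq. (7) can be reproduced in a few lines with the same libraries the paper names. This sketch uses synthetic trace data rather than the paper's cluster traces, and it reads "log-pol" as fitting the logarithm of a polynomial via curve_fit, which is one plausible interpretation of the method's name.

```python
import numpy as np
from scipy.optimize import curve_fit

def mape(actual, predicted):
    """Mean Absolute Percentage Error, Eq. (7)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))

# Synthetic historical traces (runtime vs. uncompressed input size for one
# node); in the paper these values come from real execution traces.
sizes = np.linspace(1.0, 100.0, 50)
runtimes = np.log(2 * sizes**2 + 3 * sizes + 10)

# Polynomial approximation (degree 2) with numpy.polyfit.
coeffs = np.polyfit(sizes, runtimes, deg=2)
poly_pred = np.polyval(coeffs, sizes)

# log-pol read as "logarithm of a polynomial", fitted with curve_fit.
def log_pol(s, a, b, c):
    return np.log(a * s**2 + b * s + c)

params, _ = curve_fit(log_pol, sizes, runtimes, p0=[1.0, 1.0, 1.0])
logpol_pred = log_pol(sizes, *params)

print(f"polynomial MAPE: {mape(runtimes, poly_pred):.4f}")
print(f"log-pol MAPE:    {mape(runtimes, logpol_pred):.4f}")
```

Since each node of a heterogeneous cluster has its own runtime behavior, one such model would be fitted per computing node, as described in Section 6.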
the problem in a reasonable time, and even if the problem is solved, the propagation of computational errors leads to incorrect and unreasonable results.

Considering that jobs are supposed to be assigned to nodes, another idea is to consider the number of jobs as a multiple of the number of nodes v; more precisely:

J = kv   for some k ∈ ℕ   (8)

The coefficient k in Eq. (8) is increased iteratively and evolutionarily starting from one; that is, after solving the problem for k = 1 and calculating the makespan, we use the resulting solution as an initial solution (an individual of the initial population) for k = 2. This procedure continues and evolves until the distance between two consecutive makespan values is negligible or insignificant from the decision maker's point of view. On the one hand, this process is compatible with the evolutionary nature of the GA used to solve the above sequence of problems, because in each step, with the solution of the previous step as the initial solution, the value of the fitness function in the current step starts to improve from the value of the previous step. Therefore, the makespan value in each step is better than or equal to that of the previous step (i.e., it evolves). On the other hand, given the stopping condition of the procedure, there is no need to solve problems with a large number of jobs.

From the experimental results presented in Tables 3 to 5, the following points can be highlighted:

• As the number of jobs increases, the makespan and the number of unused nodes decrease.
• As the number of nodes increases, the makespan decreases.
• The procedure of increasing the number of jobs may be stopped for two reasons: firstly, increasing the number of jobs does not improve the makespan (significantly); secondly, the decision maker may stop it (especially if no significant improvement in makespan is expected).
• As can be seen in the fourth row of Table 3, the makespan does not improve as the number of jobs increases from v to 2v. Obviously, a higher number of jobs does not improve the makespan (such cases are marked in bold in the tables). Thus, we have
Figure 7: Accuracy (MAPE) of the three prediction methods (Polynomial, Linear Regression, log-pol) for Bacteria88G, for cluster sizes of 4, 8 and 16 nodes.
Table 3
Obtained results for Archaea.
J=v J = 2v J = 3v J = 4v
#Nodes Makespan #Unused nodes Makespan #Unused nodes Makespan #Unused nodes Makespan #Unused nodes
4 539 2 404 0 404 0 - -
8 339 1 334 0 312 0 312 0
16 285 5 252 0 251 0 251 0
24 206 5 206 3 - - - -
Table 4
Obtained results for Bacteria-30GiB.
J=v J = 2v J = 3v J = 4v
#Nodes Makespan #Unused nodes Makespan #Unused nodes Makespan #Unused nodes Makespan #Unused nodes
4 5804 1 5124 0 5077 0 5077 0
8 3050 1 2957 0 2952* 0 - -
16 1937 3 1937 3 - - - -
24 2016 7 1859 1 1859 0 - -
not calculated them. Moreover, in the last row of References
Table 5, increasing the number of jobs from 3v to 4v
only leads to 1% decrease in the makespan, which is S. Mohammadi, L. PourKarimi, H. Pedram, Integer linear
not significant (such cases are marked in the tables programming-based multi-objective scheduling for sci-
in bold and with an asterisk). So we stopped the entific workflows in multi-cloud environments, The
procedure. It should be noted that the DM may stop Journal of Supercomputing 75 (2019) 6683–6709.
the procedure when the number of jobs is 3v due to
S. Abdi, L. PourKarimi, M. Ahmadi, F. Zargari, Cost min-
the insignificant improvement in makespan and the
imization for deadline-constrained bag-of-tasks appli-
abandonment of an unused node.
cations in federated hybrid clouds, Future Generation
Computer Systems 71 (2017) 113–128.
8. Conclusion and Future Works

In this paper, we proposed an approach to the task/job granularity problem for metagenomic DAWs in cluster infrastructures, with the goal of minimizing the makespan. The problem was first formulated as a mathematical model, and the model was then solved using a genetic algorithm (GA). One of the main challenges of this problem is that the number of jobs is not known in advance. We overcame this challenge by expressing the number of jobs as a multiple of the number of computing nodes. For each increase in the number of jobs, the makespan is recalculated; this procedure continues until the difference between two successive makespan values is negligible from the decision maker's point of view. Experimental results showed that a desirable makespan can be obtained after only a few increases in the number of jobs. Furthermore, calculating the makespan requires a proper estimate of the task runtimes, so we applied three different estimation methods; our experiments showed that polynomial approximation outperforms the other two.
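As an illustration only (the fitting procedure, sample data, and polynomial degree below are assumptions, not the paper's actual implementation), a least-squares polynomial fit of task runtime against input size can be written in pure Python via the normal equations:

```python
def fit_polynomial(xs, ys, degree=2):
    """Least-squares polynomial fit: build the normal equations
    (A^T A) c = A^T y for the Vandermonde matrix A, then solve
    them with Gaussian elimination and partial pivoting."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, n):
            f = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(ata[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aty[i] - s) / ata[i][i]
    return coeffs  # coeffs[k] is the coefficient of x**k


def predict(coeffs, x):
    """Evaluate the fitted polynomial at input size x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))
```

Here `xs` would be observed input chunk sizes and `ys` the measured runtimes of a task on a given node type; `predict` then estimates the runtime for an unseen chunk size.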
In the future, we aim to generalize the proposed model so that it can be applied to other scientific domains. Moreover, since the proposed approach does not schedule the entire workflow but rather optimizes a single step of it, we intend to integrate it into a full scheduling approach in future work.
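The iterative procedure described above can be sketched as follows. The function name, interface, and the 5% negligibility threshold are illustrative assumptions; in the paper, evaluating the makespan for a fixed number of jobs corresponds to one optimization run (e.g., of the GA):

```python
def choose_job_count(v, measure_makespan, max_factor=4, tol=0.05):
    """Increase the number of jobs J in multiples of the node count v,
    stopping once the relative makespan improvement drops below `tol`
    (the decision maker's notion of 'negligible').

    measure_makespan(j) is a hypothetical callback that returns the
    makespan obtained with j jobs (e.g., one GA run)."""
    best_j, best_m = v, measure_makespan(v)   # start with J = v
    for k in range(2, max_factor + 1):
        j = k * v
        m = measure_makespan(j)
        if (best_m - m) / best_m < tol:       # negligible gain: stop
            break
        best_j, best_m = j, m
    return best_j, best_m
```

For instance, feeding in the 8-node makespans for Bacteria-88GiB from Table 5 (14673, 11432, 11119, 10341 for J = v through J = 4v) with a 5% threshold, the procedure accepts J = 2v and stops when the step to J = 3v improves the makespan by under 3%.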
Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as FONDA (Project 414984028, SFB 1404).
Table 5
Obtained results for Bacteria-88GiB.

          J = v                 J = 2v                J = 3v                J = 4v
#Nodes    Makespan  #Unused    Makespan  #Unused    Makespan  #Unused    Makespan  #Unused
4         18965     0          18965     0          -         -          -         -
8         14673     1          11432     0          11119     0          10341*    0
16        7821      4          7385      1          6127      0          6127      0
24        6078      6          5487      2          4911      1          4849*     0