=Paper= {{Paper |id=Vol-3651/DARLI-AP_paper15 |storemode=property |title=Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures |pdfUrl=https://ceur-ws.org/Vol-3651/DARLI-AP-15.pdf |volume=Vol-3651 |authors=Somayeh Mohammadi,Latif Pourkarimi,Manuel Zschäbitz,Tristan Aretz,Ninon De Mecquenem,Ulf Leser,Knut Reinert |dblpUrl=https://dblp.org/rec/conf/edbt/MohammadiPZAMLR24 }}
                         Optimizing Job/Task Granularity for Metagenomic Workflows in
                         Heterogeneous Cluster Infrastructures
                         Somayeh Mohammadi1 , Latif PourKarimi2 , Manuel Zschäbitz1 , Tristan Aretz1 , Ninon De Mecquenem3 ,
                         Ulf Leser3 and Knut Reinert1
                         1
                           Department of Mathematics and Computer Science, Freie Universität, Berlin, Germany
                         2
                           Department of Mathematics and Computer Science, Razi University, Kermanshah, Iran
                         3
                           Department of Mathematics and Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany


                                           Abstract
                                           Data analysis workflows are popular for sequencing activities in large-scale and complex scientific processes. Scheduling approaches
                                           attempt to find an appropriate assignment of workflow tasks to the computing nodes for minimizing the makespan in heterogeneous
                                           cluster infrastructures. A common feature of these approaches is that they already know the structure of the workflow. However, for
                                           many workflows, a high degree of parallelization can be achieved by splitting the large input data of a single task into chunks and
                                           processing them independently. We call this problem task granularity, which involves finding an assignment of tasks to computing nodes
                                           and simultaneously optimizing the structure of a bag of tasks. Accordingly, this paper addresses the problem of task granularity for
                                           metagenomic workflows. To this end, we first formulated the problem as a mathematical model. We then solved the proposed model using
                                           the genetic algorithm. To overcome the challenge of not knowing the number of tasks, we adjusted the number of tasks as a factor of the
                                           number of computing nodes. The procedure of increasing the number of tasks is performed interactively and evolutionarily. Experimental
                                           results showed that a desirable makespan value can be achieved after a few steps of the increase.
                                           Keywords
                                           Data Analysis Workflow, Mathematical Programming, Makespan Minimization, Run Time Prediction, Genetic Algorithm



1. Introduction

Scientists in many domains, such as bioinformatics, remote sensing, and physics, use Data Analysis Workflows (DAWs) to sequence the activities involved in large-scale and complex scientific processes [1]; [2]. These DAWs are typically represented as a Directed Acyclic Graph (DAG), which consists of a set of tasks and directed edges between them; the edges express data dependencies between tasks and the order of task execution.

Scientists often use heterogeneous cluster infrastructures to run their DAWs because of privacy and financial concerns. Heterogeneous clusters provide high-performance computing environments that enable efficient data analysis and the execution of large-scale DAWs in a reasonable amount of time [3]. DAWs are often executed on large amounts of data, resulting in runtimes that can exceed days or weeks [4]; [5]; [6]. In such environments, the key objective is therefore to schedule DAW tasks across computing resources in such a way that the total execution time, also known as the makespan, is minimized.

It is well known that a high degree of parallelization can be achieved in many DAWs by splitting the input data of individual tasks into chunks and processing them independently [7]. For example, in metagenomic DAWs, the size of a reference genome, a frequent input, can vary from several KB to hundreds of GB, and a reference genome typically contains thousands of genome files. The reference genome can be divided into different bins of genome files that are processed by several independent tasks in parallel. The main challenge here is how to partition the input data: what is the appropriate size of each chunk of input data, and how should each task be assigned to a computing node so that the makespan is minimized? We call this problem task granularity. In heterogeneous environments, this challenge is aggravated because the computing power of the available nodes differs, so choosing the input size of each task to be executed on each of these nodes is a very effective lever for optimizing the makespan. Since each task of the DAW is equivalent to a job for the cluster once a workflow is submitted for execution, the terms task and job are used interchangeably in this study.

In this paper, we propose a novel approach to task granularity for metagenomic DAWs in cluster infrastructures with makespan minimization. We first formulate the problem as a mathematical model. We then solve the proposed model using a Genetic Algorithm. Since the calculation of the makespan requires a proper estimation of task runtimes, we apply three different methods for this estimation and compare their accuracy.

The paper is organized as follows: Section 2 presents a review of the related work, and Section 3 illustrates the problem statement. The proposed mathematical model is introduced in Section 4. Section 5 discusses solving the proposed model using a genetic algorithm. Job runtime estimation is presented in Section 6, and experimental results are presented in Section 7. Finally, Section 8 provides concluding remarks and plans for further studies.

2. Related works

In this section, we first discuss data access patterns used in scientific workflows. We then cover the scheduling of scientific workflows on heterogeneous clusters. Finally, we focus on methods for predicting the runtime of tasks, as these estimates are often used as input for scheduling approaches.

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.
$ somayeh.mohammadi@fu-berlin.de (S. Mohammadi); l.pourkarimi@razi.ac.ir (L. PourKarimi); manuez42@zedat.fu-berlin.de (M. Zschäbitz); aret01@zedat.fu-berlin.de (T. Aretz); mecquenn@informatik.hu-berlin.de (N. D. Mecquenem); leser@informatik.hu-berlin.de (U. Leser); Reinert@fu-berlin.de (K. Reinert)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
2.1. Data access patterns used in scientific workflows

The data access patterns of workflow applications have been addressed by several studies [8]; [9]; [10]; [11]. Accordingly, the most commonly used patterns in scientific workflows are as follows (Fig. 1):

     • Pipeline: This is the most basic and familiar pattern. A set of computational tasks is chained in a sequence such that the output of a parent task is the input of its child task. Because of the linear dependencies in a pipeline pattern, the execution of a task cannot begin until the execution of its parent has completed and it has received the data generated by its parent.
     • Scatter: An input data set is divided into several chunks. These chunks are distributed over multiple tasks (a bag of tasks), which can be executed simultaneously because there are no dependencies between them.
     • Gather: Multiple chunks of data are produced by multiple tasks. All of them are used as input data by a subsequent task. The latter task may need to receive all the chunks and integrate them before starting execution.

Figure 1: Common data access patterns in scientific workflows (panels: Pipeline, Scatter, Gather).

Obviously, the scatter pattern can be very effective in reducing the makespan in a distributed environment such as a cluster or cloud because it provides parallel execution of tasks on computing nodes. The implementation of scatter is an NP-hard problem (see Section 3.1), so the user is not able to do it manually and still achieve an acceptable makespan. In this study, we propose an approach to address this problem.

2.2. Scheduling of scientific workflows on heterogeneous clusters

Generally, workflow scheduling on heterogeneous infrastructures can be done in two ways, statically or dynamically [12]; [2]. Static scheduling assigns tasks to compute resources in advance, assuming that accurate information about the workflow and infrastructure resources is available. Dynamic scheduling does not require such assumptions. On the one hand, many heuristic and meta-heuristic approaches have been proposed for this problem, of which HEFT [13] is considered the most famous. On the other hand, mathematical optimization approaches such as MILP [14] have provided optimal solutions to this problem and have also analyzed it more extensively. However, the presented state-of-the-art scheduling approaches have in common that they already know the structure of the DAG and find an optimal or sub-optimal assignment of tasks to the computing nodes.

By addressing the scattering problem of a bag, our proposed approach not only provides a suitable structure for the bag, and consequently for the DAG, but also finds the best scheduling of tasks to computing nodes with the objective of minimizing the makespan.

2.3. Task runtime prediction

Most state-of-the-art techniques for workflow scheduling rely on accurate predictions of task runtimes [15]. Therefore, the problem of predicting the runtime of scientific workflow tasks based on historical data has been studied extensively. Recent research has used machine learning techniques to address this issue [16]; [17]; [18]; [19]. These approaches build their initial models on historical data before the actual workflow execution. [20]; [18] use neural network methods, which are known to require large training data sets to perform well, while [3] employs a Bayesian linear regression model, which can work with few training points and provides uncertainty estimates for its predictions. Most existing approaches use the size of the file on the hard drive as input to their prediction models.

In this research, the objective is to minimize the makespan. To compute the makespan value, an acceptable estimate of the runtime of jobs is required. Assuming that we have some historical data of execution traces of jobs on computing nodes, we use three different methods to predict the runtime of jobs (see Section 6).

3. Problem statement

This study addresses the problem of job/task granularity of scientific workflows in heterogeneous cluster environments with the aim of minimizing the makespan. The case study is a metagenomic workflow where a reference genome containing a set of genome files is the main input data of the workflow. Building an FM index (Full-text index in Minute space) over a reference genome (reference genome indexing) is a common and time-consuming task in metagenomic DAWs [21]; since such a task is often used by the bioinformatics workflow community, it is a good case study for optimizing job/task granularity.

In general, the system model includes the following steps: the first step is to collect a historical dataset of job execution traces on different computing nodes of the cluster. If this dataset is not already available, it can be collected by data sampling [3]. In the second step, a proper estimation of the job runtime and its memory consumption is performed using a prediction method. Then, by solving the proposed mathematical model, the optimal size of each chunk of input data for each job and the assignment of jobs to computing nodes are obtained. Finally, the optimal job granularity and assignment are used to execute the DAW in the cluster.

3.1. A motivating example

Suppose there is a reference genome that contains five genome files of the following sizes: g0 = 10, g1 = 15, g2 = 20, g3 = 25, g4 = 35. Moreover, assume that the cluster infrastructure has three computing nodes A, B and C. The runtime of a job with an input size of S on the computing nodes is calculated by the following functions:

     • fA(S) = ln S² + 1
     • fB(S) = ln S² + 2S
     • fC(S) = ln S² − 4S − 10
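Instances of this size can still be solved by exhaustive search, which makes the search space concrete. The sketch below is illustrative only: it assigns each genome file directly to one node (the files on a node forming a single job) and takes the per-node cost functions as parameters. The `fns` used in the demo are hypothetical linear placeholders, not the functions fA, fB, fC above.

```python
from itertools import product

def min_makespan(sizes, cost_fns):
    """Brute-force all node assignments of the genome files (here 3^5 = 243
    states for five files and three nodes); the files placed on a node form
    one job whose runtime is cost_fn(total input size)."""
    best = (float("inf"), None)
    n_nodes = len(cost_fns)
    for assign in product(range(n_nodes), repeat=len(sizes)):
        # accumulate the total input size per node
        load = [0] * n_nodes
        for size, node in zip(sizes, assign):
            load[node] += size
        # a node that received no files contributes zero runtime
        makespan = max(f(s) if s > 0 else 0.0 for f, s in zip(cost_fns, load))
        if makespan < best[0]:
            best = (makespan, assign)
    return best

# hypothetical placeholder cost functions, one per node
fns = [lambda s: 1.0 * s, lambda s: 1.5 * s, lambda s: 2.0 * s]
makespan, assignment = min_makespan([10, 15, 20, 25, 35], fns)
print(makespan, assignment)
```

As Section 3.1 argues, this enumeration is only viable for toy instances; for a real reference genome the state space (e.g. 10⁴⁸⁸ for Archaea on ten nodes) rules it out.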
                                                                                      4.1. Required memory for running jobs
Figure 2 (a), a possible assignment: g1, g4 → Job1; g2 → Job2; g0 → Job3; g3 → Job4, distributed over nodes A, B and C. Execution time on Node-A = 7.80, on Node-B = 6.08, on Node-C = 10.15; Makespan = 10.15.
Figure 2 (b), a better assignment: g0, g2 → Job1; g1 → Job2; g3, g4 → Job3. Execution time on Node-A = 6.80, on Node-B = 5.42, on Node-C = 8.11; Makespan = 8.11.
Figure 2: Two possible job granularities and assignments with different makespan values.

In Fig. 2, two possible ways of building jobs from the genome files and assigning them to nodes are depicted. In total, there are 3⁵ = 243 different states, among which the solution with the minimum makespan is the best.

In a real example, Archaea¹ has 488 genome files. Assuming that the cluster has 10 computing nodes, any approach based on complete enumeration and trial and error for assigning these genomes to computing nodes requires comparing up to 10⁴⁸⁸ different assignments. This approach is obviously impractical. Note that Archaea is a very small reference genome among the available reference genomes. Therefore, applying the above-mentioned approach to the related assignment problem is not efficient or even applicable. The best way to deal with this problem is to create a mathematical model of the problem and then apply available efficient algorithms for solving mathematical optimization models.

4. The proposed mathematical model

In a mathematical model, the objective function, decision variables and problem constraints are expressed as mathematical expressions. Such models provide deep insight into the structure of the problem [12]; [22]. They are therefore suitable not only for solving the problem using classical methods, but also for solving it using heuristic or meta-heuristic methods.

The input data for the mathematical model are described in Table 1, which also explains the decision variables of the proposed model. In this section, the formulation of the problem constraints and the objective function of the proposed model are presented in detail.

4.1. Required memory for running jobs

This constraint states that the memory limitation of the kth node must be met. It must hold for all jobs and computing nodes:

    mem_k ≥ f_mem(∑_{i=1}^{n} S_i · y_ij · x_jk)    ∀j ∈ {1, 2, ..., J}, ∀k ∈ {1, 2, ..., v}    (1)

For each J_j and CN_k, ∑_{i=1}^{n} S_i · y_ij · x_jk denotes the input size of J_j on CN_k, and f_mem estimates the memory required for J_j.

4.2. Assigning jobs to cluster nodes

Constraints (2) and (3) imply that non-empty jobs must be assigned to exactly one node. Constraint (4) enforces that empty jobs cannot be assigned to any node.

    (1/n) ∑_{i=1}^{n} y_ij ≤ ∑_{k=1}^{v} x_jk    ∀j ∈ {1, 2, ..., J}    (2)

    ∑_{k=1}^{v} x_jk ≤ 1    ∀j ∈ {1, 2, ..., J}    (3)

    ∑_{k=1}^{v} x_jk ≤ ∑_{i=1}^{n} y_ij    ∀j ∈ {1, 2, ..., J}    (4)

4.3. Objective function

For constructing the objective function, the following points should be highlighted:

     • The objective function deals with the runtime of some bags of jobs on some computing nodes.
     • If t_k (Eq. (5)) denotes this time for CN_k, then the total runtime (makespan) equals max(t_k), 1 ≤ k ≤ v. The objective function aims to minimize this time (Eq. (6)).
     • When more than one job is assigned to a node, the node wastes a certain amount of time between executing two jobs; (∑_{j=1}^{J} x_jk − 1) · st_k denotes that time.
     • For each CN_k, ∑_{j=1}^{J} f_k(∑_{i=1}^{n} y_ij · S_i · x_jk) is the sum of the runtimes of the jobs assigned to CN_k, where ∑_{i=1}^{n} y_ij · S_i · x_jk is the input size of J_j and f_k() is an implicit function of it.

The runtime of the jobs on each CN_k is calculated by Eq. (5):

    t_k = (∑_{j=1}^{J} x_jk − 1) · st_k + ∑_{j=1}^{J} f_k(∑_{i=1}^{n} y_ij · S_i · x_jk)    (5)

Therefore, the objective function of the model is as follows:

    min(max(t_k)),  1 ≤ k ≤ v    (6)

This can be expressed as follows:

    min α
    subject to:
    α ≥ t_k    ∀k ∈ {1, 2, ..., v}
    α ≥ 0

¹ https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/
     Table 1
Model parameters/variables and their descriptions.

                  Notation of parameters                    Description
                 CN = {node1, node2, ..., nodev }          Cluster node set
                 v                                         The number of nodes of the cluster
                 CNk                                       kth node of the cluster
                 memk                                      Accessible memory size of CNk for running tasks
                 stk                                       Switch time between two jobs for CNk
                 Gen = {g1, g2, ..., gn }                  The set of genomes of the reference genome
                 n                                         The number of genomes of the reference genome
                 gi                                        ith genome of the reference genome
                 Si                                        Size of gi
                 Job = {J1, J2, ..., JJ }                  The set of jobs
                 Jj                                         jth job
                 J                                         The number of jobs
                  Notation of decision variables           Description
                  x jk                                     1 iff J j is assigned to CNk , otherwise 0
                  yi j                                     1 iff gi is binned to J j , otherwise 0
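With the notation of Table 1, the makespan of a candidate solution (x, y) can be evaluated directly from Eqs. (5) and (6). The sketch below is illustrative, not the authors' implementation: the per-node runtime functions and switch times are hypothetical placeholders, and an idle node is clamped to zero time rather than the literal −st_k that Eq. (5) would yield.

```python
def node_time(k, x, y, sizes, runtime_fns, switch):
    """t_k from Eq. (5): switch overhead between consecutive jobs on node k
    plus the runtimes of the jobs assigned to node k. Summing only over jobs
    with x[j][k] == 1 makes the x_jk factor inside f_k implicit."""
    J = len(x)
    jobs_on_k = [j for j in range(J) if x[j][k] == 1]
    switch_time = max(len(jobs_on_k) - 1, 0) * switch[k]  # idle node: 0, not -st_k
    run_time = sum(
        runtime_fns[k](sum(sizes[i] * y[i][j] for i in range(len(sizes))))
        for j in jobs_on_k
    )
    return switch_time + run_time

def makespan(x, y, sizes, runtime_fns, switch):
    """Objective of Eq. (6): the maximum node completion time over all nodes."""
    v = len(runtime_fns)
    return max(node_time(k, x, y, sizes, runtime_fns, switch) for k in range(v))

# toy instance: 3 genome files, 2 jobs, 2 nodes (all values hypothetical)
sizes = [10, 20, 30]
y = [[1, 0],   # g1 -> J1
     [1, 0],   # g2 -> J1
     [0, 1]]   # g3 -> J2
x = [[1, 0],   # J1 -> CN1
     [0, 1]]   # J2 -> CN2
fns = [lambda s: 0.1 * s, lambda s: 0.2 * s]
switch = [1.0, 1.0]
print(makespan(x, y, sizes, fns, switch))  # CN1: 0.1*30 = 3.0, CN2: 0.2*30 = 6.0 -> 6.0
```

This evaluator is exactly what a GA needs as a fitness function for the chromosomes described in Section 5.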


5. Genetic optimization for solving the                              to J4 . Moreover, J1 and J4 are scheduled to CN 3 and CN 2 ,
                                                                     respectively.
   proposed model
Suppose there is a reference genome of a certain size with              Genome file number   1     2     3                 n
genome files g1 , g2 , . . . , gn . We want to group the genomes
into a number of jobs and then assign the jobs to computing             Job number           1     4     1        ...
nodes CN1 , . . . ,CNv of a cluster infrastructure.
   It can be seen that the proposed model is a non-linear bi-           Node number          3     2     3        ...
nary mathematical model. Due to the special structure of the
constraints and the objective function of this model, lineariz-
ing it leads to a binary linear model with a significantly large
                                                                     Figure 3: A chromosome encoding.
number of constraints. Solving this model is very time con-
suming in terms of computation (it may even be impossible).
On the other hand, in contrast to classic approaches, using
genetic algorithms is a very powerful approach for treating          5.2. GA operators
discret models even if the model is nonlinear [23]. Based on
                                                                     The crossover operator can help to inherit some chromosome
this fact, using genetic approach can be an efficient approach
                                                                     fragments of excellent individuals to subsequent generations.
for solving the proposed model without any linearization. In
                                                                     In this study, the single-point crossover technique [23] is
the following, we explain how to implement the presented model using a Genetic Algorithm (GA).

5.1. Chromosome structure

GAs mimic natural evolution by modelling genetic recombination and selection through a fitness function. Hence, when using a GA to solve a particular problem, the first concern to be addressed is the determination of a suitable chromosome coding. Each chromosome represents the different parameters that characterize a solution to the problem [24]. In this approach, we consider a population of individuals, each individual being a potential solution to the problem, described by its chromosome. The initial population is generated at random.
   In this study, job granularity involves determining the assignment of genome files to jobs and the assignment of jobs to cluster nodes. Thus, a solution (chromosome) is a two-dimensional array in which the indices indicate the genome file number. The elements of the first row contain the job numbers and the elements of the second row contain the computing node numbers. The representation of a chromosome in the GA implementation is shown in Fig. 3. As Fig. 3 shows, g1 and g3 are assigned to J1 while g2 is assigned …

… adapted during the application of the crossover operator to produce new individuals. These new individuals are then assessed for their potential to contribute to the next generation of the population. An example is shown in Fig. 4. After crossover, the first new individual is not feasible to add to the next generation because J5 has been assigned to two different computing nodes.
   The mutation operator replaces some gene values with others to increase population diversity. We use swap mutation [25] to explore new regions of the solution space: two positions on a chromosome are randomly selected and swapped. After mutation, the potential of the new individuals to contribute to the next generation of the population is evaluated.

6. Job runtime estimation

In the GA, each individual should be assigned a fitness score, given by the objective function defined in Eq. 5. Suppose the runtime of job j on node CNk is equal to tj. The challenge here is that there is no explicit function for calculating tj.
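The two-row chromosome encoding and the makespan-style fitness evaluation described above can be sketched as follows. This is a minimal sketch: the file sizes, node speeds, and the predict_runtime stand-in are hypothetical illustrations, not values or functions from the paper.

```python
import numpy as np

n_files, n_nodes = 6, 2

# Two-row chromosome: column i describes genome file g_{i+1}.
# Row 0: job the file belongs to; row 1: computing node of that job.
chrom = np.array([
    [0, 1, 0, 2, 1, 2],   # jobs J1..J3 (0-indexed)
    [0, 1, 0, 0, 1, 0],   # nodes CN1..CN2 (0-indexed)
])

# Hypothetical per-file uncompressed sizes (GiB) and per-node speeds.
file_size = np.array([4.0, 2.0, 3.0, 5.0, 1.0, 2.0])
speed = np.array([1.0, 2.0])

def feasible(chrom):
    """Infeasible if one job is mapped to two different nodes
    (the situation shown for J5 in Fig. 4 after crossover)."""
    return all(len(np.unique(chrom[1][chrom[0] == j])) == 1
               for j in np.unique(chrom[0]))

def predict_runtime(size, node):
    """Stand-in for the per-node runtime estimators of Section 6."""
    return size / speed[node]

def makespan(chrom):
    """Fitness: largest total runtime over all nodes, where a job's
    runtime is estimated from the total size of its input files."""
    loads = np.zeros(n_nodes)
    for j in np.unique(chrom[0]):
        files = chrom[0] == j
        node = chrom[1][files][0]      # node of job j (unique if feasible)
        loads[node] += predict_runtime(file_size[files].sum(), node)
    return loads.max()

def swap_mutation(chrom, rng):
    """Swap mutation [25]: exchange two randomly chosen genes (columns)."""
    child = chrom.copy()
    a, b = rng.choice(n_files, size=2, replace=False)
    child[:, [a, b]] = child[:, [b, a]]
    return child

print(feasible(chrom), makespan(chrom))   # True 14.0
```

The feasibility check mirrors the crossover example: if a job's files carry two different node numbers, the individual is rejected before it can enter the next generation.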
How long does it take to execute a job j of size Sj on a computing node CNk? Therefore, we use the following methods to predict the runtime of jobs on computing nodes, assuming that we have historical data from task execution traces:

     • Linear Regression: Linear regression is a popular and simple machine learning algorithm that models the relationship between dependent and independent variables by learning a linear expression of the independent variables from the training data [26].
     • Polynomial approximation: According to mathematical theorems, any continuous function can be approximated arbitrarily well by a suitable polynomial [27]. Based on this theory, a polynomial of appropriate degree can be used to approximate the unknown function from the given historical data. In practice, polynomials of lower degree are preferred to avoid the oscillation of higher-degree polynomials. Here we only consider degrees one, two and three.
     • Logarithm of a polynomial approximation (log-pol): As shown in [28], the curves of runtime as a function of input data size are logarithm-like; on the other hand, as mentioned above, polynomials are a suitable means of estimation. We therefore use a combination of logarithm and polynomial as a new approximation method.

   These methods assume that there is a relationship between a task's input size and its runtime, using the input size as the independent variable. Thus, they can be used to predict task runtime for any task input size. Similar to related work, these methods use the size of the file on the hard disk as an input to their prediction models. In a heterogeneous cluster, the same task may have different runtimes on different computing nodes. Therefore, the methods build a separate prediction model for each computing node.

7. Experimental results

We developed a read-mapping workflow 2 for metagenomic data in the popular workflow engine Snakemake [29]. The workflow was run on Allegro 3, a cluster infrastructure. We created a historical dataset from traces of the workflow execution on the cluster. We selected and used three reference genomes with different sizes (small, medium and large) as input data of the workflow to perform the experiments. The specifications of the input data are described in Table 2. We also looked into different cluster sizes of 4, 8, 16, and 24 computing nodes in the experiments.
   As previously stated, our approach considers the job granularity and scheduling problems simultaneously for a bag of tasks in a workflow, while related works have addressed only the scheduling problem; indeed, they assume that there are a number of jobs, each with a certain size, and they schedule these jobs to the cluster nodes so that the total runtime is minimized.
   Most existing approaches use the file size on disk as the input for their prediction or approximation models. A detailed discussion of why the uncompressed input data size should be used for compressed files is given in [30]. Accordingly, we use the uncompressed file size to predict the runtime and memory usage of jobs.

Figure 4: An example of crossover.

Table 2
Reference genomes specifications.

      Reference Genome     Data Size     # Genome files
      Archaea              1.4 GiB       488
      Bacteria_1 4         30 GiB        7167
      Bacteria_2 5         88 GiB        22185

7.1. Accuracy comparison of job runtime estimation methods

We compare the accuracy of the methods used to estimate job runtime (see Section 6). The linear regression was implemented using the sklearn.linear_model 6 library. The polynomial and log-pol methods have been implemented using polyfit() from the NumPy 7 library and curve_fit() from the scipy.optimize 8 library, respectively. For comparison we use the Mean Absolute Percentage Error (MAPE). This metric is calculated using Eq. 7, where Ai is the actual value, Fi is the predicted value and n is the number of fitted points.

                 MAPE = (1/n) ∑_{i=1}^{n} |(A_i − F_i) / A_i|                (7)

   Figs. 5-7 show the results for each input data set separately with different cluster sizes. As shown in these figures, the polynomial method gives a more accurate estimate of job runtimes than log-pol, and log-pol in turn outperforms linear regression.

7.2. Changing the number of jobs to improve makespan

One of the main challenges in this problem is that the number of jobs is not known in advance. The most obvious idea is to set this number equal to the number of genome files, i.e. n. This seems logical at first sight, since allowing empty jobs makes it possible to find any clustering of genome files into jobs. However, from a practical point of view, this is inappropriate and infeasible in most cases: due to the large number of genome files, the number of decision variables and constraints increases significantly, making it impossible to solve

2 https://github.com/CRC-FONDA/A2-job-granularity/tree/main/MG-HIBF
3 https://www.mi.fu-berlin.de/w/Cluster/WebHome
6 https://scikit-learn.org/stable/modules/linear_model.html
7 https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
8 https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
Figure 5: Accuracy of different prediction methods for Archaea. (Bar charts of MAPE for Polynomial, Linear Regression, and log-pol at #Nodes = 4, 8, and 16.)

Figure 6: Accuracy of different prediction methods for Bacteria30G. (Bar charts of MAPE for Polynomial, Linear Regression, and log-pol at #Nodes = 4, 8, and 16.)



the problem in a reasonable time; and even if the problem is solved, the propagation of computational errors leads to incorrect and unreasonable results.
   Considering that jobs are supposed to be assigned to nodes, another idea is to set the number of jobs to a multiple of the number of nodes, i.e. v; more precisely:

                         J = kv    for some k ∈ ℕ                    (8)

The coefficient k in Eq. 8 is increased interactively and evolutionarily from one upwards; that is, after solving the problem for k = 1 and calculating the makespan, we use the resulting solution as an initial solution (an individual of the initial population) for k = 2. This procedure continues and evolves until the distance between two consecutive makespan values is negligible or insignificant from the decision maker's point of view. On the one hand, this process is compatible with the evolutionary nature of the GA used to solve the above sequence of problems, because in each step the value of the fitness function starts improving from the value of the previous step, using the previous step's solution as the initial solution. Therefore, the makespan value in each step is better than or equal to that of the previous step (i.e., it evolves). On the other hand, given the stopping condition of the procedure, there is no need to solve problems with a large number of jobs.
   From the experimental results presented in Tables 3 to 5, the following points can be highlighted:

      • As the number of jobs increases, the makespan and the number of unused nodes decrease.
      • As the number of nodes increases, the makespan decreases.
      • The procedure of increasing the number of jobs may be stopped for two reasons: first, increasing the number of jobs does not (significantly) improve the makespan; second, it may be stopped by the decision maker (especially if no significant improvement in makespan is expected).
      • As can be seen in the fourth row of Table 3, the makespan does not improve as the number of jobs increases from v to 2v. Obviously, a higher number of jobs does not improve the makespan either (such cases are marked in bold in the tables). Thus, we have

Figure 7: Accuracy of different prediction methods for Bacteria88G. (Bar charts of MAPE for Polynomial, Linear Regression, and log-pol at #Nodes = 4, 8, and 16.)
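The escalation of the job count in Eq. 8 can be sketched as an outer loop around the GA that warm-starts each run with the previous solution. This is a minimal sketch: escalate_job_count, its eps threshold, and the fake_ga stand-in are illustrative; the stand-in merely replays the makespans reported in Table 3 for the 4-node cluster.

```python
def escalate_job_count(v, run_ga, eps=0.01, k_max=10):
    """Solve the model for J = k*v jobs (Eq. 8) for k = 1, 2, ...,
    seeding each GA run with the previous solution, and stop once the
    makespan gain between two consecutive steps becomes negligible."""
    best_solution, best_makespan = None, float("inf")
    for k in range(1, k_max + 1):
        solution, makespan = run_ga(n_jobs=k * v, seed=best_solution)
        if best_makespan - makespan < eps * best_makespan:
            return best_solution, best_makespan, k - 1  # no significant gain
        best_solution, best_makespan = solution, makespan
    return best_solution, best_makespan, k_max

# Illustrative stand-in for the GA, replaying the makespans of Table 3
# for the 4-node cluster (539, 404, 404 for J = v, 2v, 3v).
table3_row1 = {1: 539, 2: 404, 3: 404}
fake_ga = lambda n_jobs, seed: (n_jobs, table3_row1[n_jobs // 4])

solution, makespan, k = escalate_job_count(4, fake_ga)
print(makespan, k)   # 404 2  -> stopping at J = 2v, as in Table 3
```

The warm start makes the sequence of makespans non-increasing, so a simple relative-improvement threshold is a reasonable proxy for the decision maker's stopping rule.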
    Table 3
    Obtained results for Archaea.
                      J=v                         J = 2v                          J = 3v                        J = 4v
  #Nodes   Makespan    #Unused nodes       Makespan   #Unused nodes       Makespan   #Unused nodes      Makespan   #Unused nodes
    4        539             2               404            0               404            0               -             -
    8        339             1               334            0               312            0              312            0
    16       285             5               252            0               251            0              251            0
    24       206             5               206            3                -             -               -             -


    Table 4
    Obtained results for Bacteria-30GiB.
                      J=v                         J = 2v                          J = 3v                        J = 4v
  #Nodes   Makespan    #Unused nodes       Makespan   #Unused nodes       Makespan   #Unused nodes      Makespan   #Unused nodes
    4        5804            1               5124           0               5077           0              5077           0
    8        3050            1               2957           0              2952*           0                -            -
    16       1937            3               1937           3                 -            -                -            -
    24       2016            7               1859           1               1859           0                -            -



not calculated them. Moreover, in the last row of Table 5, increasing the number of jobs from 3v to 4v only leads to a 1% decrease in the makespan, which is not significant (such cases are marked in the tables in bold and with an asterisk). So we stopped the procedure. It should be noted that the decision maker may already stop the procedure when the number of jobs is 3v, due to the insignificant improvement in makespan and the abandonment of an unused node.

8. Conclusion and Future works

In this paper, an approach to the task/job granularity problem for metagenomic DAWs in cluster infrastructures with makespan minimization was proposed. The problem was first formulated as a mathematical model, and the proposed model was then solved using a GA. One of the main challenges in this problem is that the number of jobs is not known in advance. We overcame this challenge by setting the number of jobs to a multiple of the number of computing nodes. For each increase in the number of jobs, the makespan is calculated. This procedure continues and evolves until the distance between two successive makespan values is negligible or insignificant from the decision maker's point of view. Experimental results showed that a desirable makespan value can be obtained after a few steps of increasing the number of jobs. Furthermore, the calculation of the makespan requires a proper estimation of the task runtime, so we applied three different methods for this estimation. Experimental results showed that the polynomial approximation outperforms the other methods.
   In the future, we aim to generalize our proposed model so that it can be applied to other scientific domains. Since the proposed approach does not schedule the whole workflow, but optimizes a single step of it, we intend to integrate it into a scheduling approach in future work.

Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as FONDA (Project 414984028, SFB 1404).

Table 5
Obtained results for Bacteria-88GiB.

                  J=v                     J = 2v                    J = 3v                    J = 4v
  #Nodes   Makespan  #Unused nodes   Makespan  #Unused nodes   Makespan  #Unused nodes   Makespan  #Unused nodes
    4       18965          0           18965         0             -           -             -           -
    8       14673          1           11432         0           11119          0          10341*          0
    16       7821          4            7385         1            6127          0           6127           0
    24       6078          6            5487         2            4911          1           4849*          0

References

[1] S. Mohammadi, L. PourKarimi, H. Pedram, Integer linear programming-based multi-objective scheduling for scientific workflows in multi-cloud environments, The Journal of Supercomputing 75 (2019) 6683–6709.
[2] S. Abdi, L. PourKarimi, M. Ahmadi, F. Zargari, Cost minimization for deadline-constrained bag-of-tasks applications in federated hybrid clouds, Future Generation Computer Systems 71 (2017) 113–128.
[3] J. Bader, F. Lehmann, L. Thamsen, U. Leser, O. Kao, Lotaru: Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures, Future Generation Computer Systems 150 (2024) 171–185. doi:10.1016/j.future.2023.08.022.
[4] R. da Silva, A.-C. Orgerie, H. Casanova, R. Tanaka, E. Deelman, F. Suter, Accurately simulating energy consumption of I/O-intensive scientific workflows, in: Computational Science–ICCS 2019: 19th International Conference, Faro, Portugal, June 12–14, 2019, Proceedings, Part I 19, Springer, 2019, pp. 138–152.
[5] R. F. da Silva, H. Casanova, A.-C. Orgerie, R. Tanaka, E. Deelman, F. Suter, Characterizing, modeling, and accurately simulating power and energy consumption of I/O-intensive scientific workflows, Journal of Computational Science 44 (2020) 101157.
[6] P. Maechling, E. Deelman, L. Zhao, R. Graves, G. Mehta, N. Gupta, J. Mehringer, C. Kesselman, S. Callaghan, D. Okaya, SCEC CyberShake workflows—automating probabilistic seismic hazard analysis calculations, in: Workflows for e-Science, Springer, 2007, pp. 143–163.
[7] A. M. Kintsakis, F. E. Psomopoulos, P. A. Mitkas, Data-aware optimization of bioinformatics workflows in hybrid clouds, Journal of Big Data 3 (2016) 1–26.
[8] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M.-H. Su, K. Vahi, Characterization of scientific workflows, in: Workflows in Support of Large-Scale Science, 2008. WORKS 2008. Third Workshop on, IEEE, 2008, pp. 1–10.
[9] L. B. Costa, H. Yang, E. Vairavanathan, A. Barros, K. Maheshwari, G. Fedak, D. Katz, M. Wilde, M. Ripeanu, S. Al-Kiswany, The case for workflow-aware storage: An opportunity study, Journal of Grid Computing 13 (2015) 95–113.
[10] E. Vairavanathan, S. Al-Kiswany, L. B. Costa, Z. Zhang, D. S. Katz, M. Wilde, M. Ripeanu, A workflow-aware storage system: An opportunity study, in: 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), IEEE, 2012, pp. 326–334.
[11] J. M. Wozniak, M. Wilde, Case studies in storage access by loosely coupled petascale applications, in: Proceedings of the 4th Annual Workshop on Petascale Data Storage, 2009, pp. 16–20.
[12] S. Mohammadi, H. Pedram, L. PourKarimi, Integer linear programming-based cost optimization for scheduling scientific workflows in multi-cloud environments, The Journal of Supercomputing 74 (2018) 4717–4745. doi:10.1007/s11227-018-2465-8.
[13] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems 13 (2002) 260–274.
[14] S. Mohammadi, L. PourKarimi, F. Droop, N. De Mecquenem, U. Leser, K. Reinert, A mathematical programming approach for resource allocation of data analysis workflows on heterogeneous clusters, The Journal of Supercomputing (2023) 1–30.
[15] T. Dziok, K. Figiela, M. Malawski, Adaptive multi-level workflow scheduling with uncertain task estimates, in: Parallel Processing and Applied Mathematics: 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part II, Springer, 2016, pp. 90–100.
[16] R. F. Da Silva, G. Juve, E. Deelman, T. Glatard, F. Desprez, D. Thain, B. Tovar, M. Livny, Toward fine-grained online task characteristics estimation in scientific workflows, in: Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science, 2013, pp. 58–67.
[17] R. F. Da Silva, G. Juve, M. Rynge, E. Deelman, M. Livny, Online task resource consumption prediction for scientific workflows, Parallel Processing Letters 25 (2015) 1541003.
[18] F. Nadeem, D. Alghazzawi, A. Mashat, K. Fakeeh, A. Almalaise, H. Hagras, Modeling and predicting execution time of scientific workflows in the grid using radial basis function neural network, Cluster Computing 20 (2017) 2805–2819.
[19] S. M. Sadjadi, S. Shimizu, J. Figueroa, R. Rangaswami, J. Delgado, H. Duran, X. J. Collazo-Mojica, A modeling approach for estimating execution time of long-running scientific applications, in: 2008 IEEE International Symposium on Parallel and Distributed Processing, IEEE, 2008, pp. 1–8.
[20] M. H. Hilman, M. A. Rodriguez, R. Buyya, Task runtime prediction in scientific workflows using an online incremental learning approach, in: 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC), IEEE, 2018, pp. 93–102.
[21] K. Kianfar, C. Pockrandt, B. Torkamandi, H. Luo, K. Reinert, Optimum Search Schemes for approximate string matching using bidirectional FM-index, arXiv preprint arXiv:1711.02035 (2017).
[22] S. Abdi, M. Ashjaei, S. Mubeen, Cognitive and time predictable task scheduling in edge-cloud federation, in: 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), IEEE, 2022, pp. 1–4.
[23] S. Katoch, S. S. Chauhan, V. Kumar, A review on genetic algorithm: past, present, and future, Multimedia Tools and Applications 80 (2021) 8091–8126.
[24] G. Tremblay, R. Sabourin, P. Maupin, Optimizing nearest neighbour in random subspaces using a multi-objective genetic algorithm, in: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 1, IEEE, 2004, pp. 208–211.
[25] I. De Falco, A. Della Cioppa, E. Tarantino, Mutation-based genetic algorithm: performance evaluation, Applied Soft Computing 1 (2002) 285–299.
[26] T. M. H. Hope, Linear regression, in: Machine Learning, Elsevier, 2020, pp. 67–81.
[27] W. Rudin, Principles of Real Analysis, Mathematics Series, 1976.
[28] C. Valdes, V. Stebliankin, G. Narasimhan, Large scale microbiome profiling in the cloud, Bioinformatics 35 (2019) i13–i22.
[29] J. Köster, S. Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics 28 (2012) 2520–2522.
[30] J. Bader, F. Lehmann, L. Thamsen, J. Will, U. Leser, O. Kao, Lotaru: Locally estimating runtimes of scientific workflow tasks in heterogeneous clusters, arXiv preprint arXiv:2205.11181 (2022).