<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Somayeh Mohammadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Latif PourKarimi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Zschäbitz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tristan Aretz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ninon De Mecquenem</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulf Leser</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Knut Reinert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, Freie Universität</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics and Computer Science, Humboldt-Universität zu Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Mathematics and Computer Science, Razi University</institution>
          ,
          <addr-line>Kermanshah</addr-line>
          ,
          <country country="IR">Iran</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data analysis workflows are widely used to sequence activities in large-scale and complex scientific processes. Scheduling approaches attempt to find an appropriate assignment of workflow tasks to computing nodes so as to minimize the makespan in heterogeneous cluster infrastructures. A common feature of these approaches is that they assume the structure of the workflow is known in advance. However, for many workflows, a high degree of parallelization can be achieved by splitting the large input data of a single task into chunks and processing them independently. We call this problem task granularity: it involves finding an assignment of tasks to computing nodes while simultaneously optimizing the structure of a bag of tasks. Accordingly, this paper addresses the problem of task granularity for metagenomic workflows. To this end, we first formulate the problem as a mathematical model. We then solve the proposed model using a genetic algorithm. To overcome the challenge of not knowing the number of tasks in advance, we set the number of tasks to a multiple of the number of computing nodes. The number of tasks is increased interactively and evolutionarily. Experimental results show that a desirable makespan value can be achieved after a few such increases.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Analysis Workflow</kwd>
        <kwd>Mathematical Programming</kwd>
        <kwd>Makespan Minimization</kwd>
        <kwd>Run Time Prediction</kwd>
        <kwd>Genetic Algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Scientists in many domains, such as bioinformatics, remote
sensing, and physics, use Data Analysis Workflows (DAWs)
to sequence activities involved in large-scale and complex
scientific processes [1]; [2]. These DAWs are typically
represented as a Directed Acyclic Graph (DAG), which consists
of a set of tasks and some directed edges between the tasks;
edges show data dependencies between tasks and the priority
order of task execution.</p>
      <p>Scientists often use heterogeneous cluster infrastructures
to run their DAWs because of privacy and financial concerns.
Heterogeneous clusters provide high-performance
computing environments that enable efficient data analysis and the
execution of large-scale DAWs in a reasonable amount of
time [3]. DAWs are often executed on large amounts of data,
resulting in long runtimes that can exceed days or weeks
[4]; [5]; [6]. Thus, in such environments, the key objective
is to schedule DAW tasks across computing resources in
such a way that the total execution time, also known as the
makespan, is minimized.</p>
      <p>It is well known that a high degree of parallelization can
be achieved in many DAWs by splitting the input data of
individual tasks into chunks and processing them independently
[7]. For example, in metagenomic DAWs, the size of a
reference genome as frequent input data can vary from several
KB to hundreds of GB, and the reference genome typically
contains thousands of genome files. The reference genome
can be divided into different bins of genome files and they
are processed by several independent tasks in parallel. The
main challenge here is how to partition the input data; what
should be the appropriate size of each chunk of input data,
and how should each task be assigned to a computing node
so that the makespan is minimized. We call this problem task
granularity. In heterogeneous environments, this challenge
is aggravated because the computing nodes differ in
computing power, so choosing the input size of
each task according to the node that executes it
is an effective means of optimizing the makespan. Since
each task of the DAW is equivalent to a job for the cluster
when a workflow is submitted for execution, the terms task
and job are used interchangeably in this study.</p>
      <p>In this paper, we propose a novel approach to task
granularity for metagenomic DAWs in cluster infrastructures, with
makespan minimization as the objective. We first formulate the problem as
a mathematical model. We then solve the proposed model
using a Genetic Algorithm (GA). Since the calculation
of the makespan requires a proper estimate of task runtimes,
we apply three different estimation methods and
compare their accuracy.</p>
      <p>The paper is organized as follows: Section 2 presents
a review of the related work, and Section 3 illustrates the
problem statement. The proposed mathematical model is
introduced in Section 4. Section 5 discusses solving the
proposed model using a genetic algorithm. Job runtime estimation
is presented in Section 6, and experimental results are
presented in Section 7. Finally, Section 8 provides concluding
remarks and plans for further studies.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>In this section, we first discuss data access patterns used
in scientific workflows. We then cover the scheduling of
scientific workflows on heterogeneous clusters. Finally, we
focus on methods for predicting the runtime of tasks, as these
estimates are often used as input for scheduling approaches.</p>
      <sec id="sec-2-1">
        <title>2.1. Data access patterns used in scientific workflows</title>
        <p>The data access patterns of workflow applications have been
addressed by several studies [8]; [9]; [10]; [11]. Accordingly,
the most commonly used patterns in scientific workflows are
as follows (Fig. 1):
• Pipeline: This is the most basic and familiar pattern.
A set of computational tasks is chained in a sequence
such that the output of a parent task is the input of
its child task in the chain. Because of the linear
dependencies in a pipeline pattern, the execution of a
task cannot begin until the execution of its parent has
completed and it has received the data generated by
its parent.
• Scatter: The input data is divided into several chunks.
These chunks are distributed to multiple tasks (a
bag of tasks), which can be executed simultaneously
because there are no dependencies between them.
• Gather: Multiple chunks of data are produced by
multiple tasks. All of them are used as input data by
a subsequent task. The latter task may need to receive
all the chunks and integrate them before it can start execution.</p>
        <p>[Figure 1: the pipeline, scatter, and gather patterns, drawn as data items and tasks.]</p>
        <p>The scatter pattern can be very effective in
reducing the makespan in a distributed environment such as a cluster
or a cloud because it enables parallel execution of tasks on
computing nodes. Implementing scatter optimally is an
NP-hard problem (see Section 3.1), so a user cannot be expected to do
it manually and still achieve an acceptable makespan. In this study,
we propose an approach to address this problem.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Scheduling of scientific workflows on heterogeneous clusters</title>
        <p>Generally, workflow scheduling on heterogeneous
infrastructures can be done in two ways, statically or dynamically [12];
[2]. Static scheduling assigns tasks to compute resources in
advance, assuming that accurate information about the workflow
and the infrastructure resources is available. Dynamic
scheduling does not require such assumptions. Many
heuristic and meta-heuristic approaches have been proposed
for this problem, of which HEFT [13] is the best known.
Mathematical optimization approaches such as MILP [14], in turn,
provide optimal solutions to this problem and allow a more
extensive analysis of it. However, these state-of-the-art
scheduling approaches have in common that they assume
the structure of the DAG is known and find an optimal or
suboptimal assignment of tasks to the computing nodes.</p>
        <p>By addressing the scattering problem of a bag, our
proposed approach not only provides a suitable structure for the
bag and consequently for the DAG, but also finds the best
scheduling of tasks to computing nodes with the objective of
minimizing the makespan.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Task runtime prediction</title>
        <p>Most state-of-the-art techniques for workflow scheduling
rely on accurate predictions of task runtimes [15]. Therefore,
the problem of predicting the runtime of scientific workflow
tasks based on historical data has been studied extensively.
Recent research has used machine learning techniques to
address this issue [16]; [17]; [18]; [19]. They build their
initial models on historical data before the actual workflow
execution. [20]; [18] use neural network methods, which
are known to require large training data sets to perform well,
while [3] employs a Bayesian linear regression model, which
can work with few training points and provides uncertainty
estimates for its predictions. Most existing approaches use
the size of the file on the hard drive as input to their prediction
models.</p>
        <p>In this research, the objective is to minimize the makespan.
To compute the makespan value, an acceptable estimate of
the runtime of jobs is required. Assuming that we have
some historical data of execution traces of jobs on computing
nodes, we use three different methods to predict the runtime
of jobs (See Section 6).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem statement</title>
      <p>This study addresses the problem of job/task granularity of
scientific workflows in heterogeneous cluster environments
with the aim of minimizing makespan. The case study is
a metagenomic workflow where a reference genome
containing a set of genome files is the main input data of the
workflow. Building an FM index (Full-text index in Minute
space) over a reference genome (reference genome
indexing) is a common and time-consuming task in metagenomic
DAWs [21], and it is widely used in the
bioinformatics workflow community, which makes it a good case study for
optimizing job/task granularity.</p>
      <p>In general, the system model includes the following steps:
the first step is to collect a historical dataset of job execution
traces on different computing nodes of the cluster. If this
dataset is not already available, it can be collected by data
sampling [3]. In the second step, a proper estimation of the
job runtime and its memory consumption is performed using
a prediction method. Then, by solving the proposed
mathematical model, the optimal size of each chunk of input data
for each job and also the assignment of jobs to computing
nodes is obtained. Finally, the optimal job granularity and
assignment is used to execute the DAW in the cluster.</p>
      <sec id="sec-3-1">
        <title>3.1. A motivating example</title>
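        <p>Suppose there is a reference genome that contains five
genome files of the following sizes: g<sub>0</sub> = 10, g<sub>1</sub> = 15,
g<sub>2</sub> = 20, g<sub>3</sub> = 25, g<sub>4</sub> = 35. Moreover, assume that the cluster
infrastructure has three computing nodes A, B and C. The
runtime of a job with an input size of S on each computing
node is calculated by a node-specific function: f<sub>A</sub>(S), f<sub>B</sub>(S), and f<sub>C</sub>(S)
for nodes A, B, and C, respectively.</p>
        <p>[Figure 2: two possible ways of building jobs from the genome files and assigning them to the nodes A, B, and C.]</p>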
        <p>Fig. 2 depicts two possible ways of building jobs from the
genome files and assigning them to nodes. However, there are
3<sup>5</sup> = 243 different states in total, among which the
solution with the minimum makespan is the best.</p>
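        <p>The complete enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every file is assigned directly to a node, the files on one node are treated as a single job, and each runtime function is read as the logarithm of a polynomial (the literal reading of f<sub>C</sub> would yield negative runtimes for the given sizes). The names SIZES, RUNTIME, and makespan are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from itertools import product

# Sizes of the five genome files g0..g4 from the example.
SIZES = [10, 15, 20, 25, 35]

# ASSUMPTION: the runtime functions are read as ln(polynomial in S).
RUNTIME = {
    "A": lambda s: math.log(s**2 + 1),
    "B": lambda s: math.log(s**2 + 2 * s),
    "C": lambda s: math.log(s**2 - 4 * s - 10),
}
NODES = list(RUNTIME)


def makespan(assignment):
    """Makespan when file i runs on node assignment[i].

    ASSUMPTION: all files placed on the same node form one job.
    """
    load = {node: 0 for node in NODES}
    for size, node in zip(SIZES, assignment):
        load[node] += size
    # A node without files contributes no runtime.
    return max(RUNTIME[n](s) for n, s in load.items() if s > 0)


# Enumerate every way of placing five files on three nodes.
states = [(makespan(a), a) for a in product(NODES, repeat=len(SIZES))]
best_makespan, best_assignment = min(states)
print(len(states), best_assignment, round(best_makespan, 3))
```
]]></preformat>
        <p>The list states grows as v<sup>n</sup> for n files and v nodes, which is why this brute-force strategy is only usable for toy instances such as this one.</p>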
        <p>In a real example, Archaea<sup>1</sup> has 488 genome files.
Assume that the cluster has 10 computing nodes.
Any approach based on complete enumeration or trial and
error for assigning these genomes to computing nodes
would have to compare up to 10<sup>488</sup> different
assignments, which is obviously impractical. Note that
Archaea is a very small reference genome
among the available reference genomes. Therefore, such an
approach is neither efficient nor even applicable
to the assignment problem. A more practical
approach is to create a
mathematical model for the problem and then apply efficient
algorithms for solving mathematical optimization models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The proposed mathematical model</title>
      <p>In a mathematical model, the objective function, decision
variables and problem constraints are expressed in
mathematical expressions. These models provide a deep insight
into the structure of the problem [12]; [22]. They are
therefore suitable not only for solving the problem using classical
methods, but also for solving the problem using heuristic or
meta-heuristic methods.</p>
      <p>The input data for the mathematical model and the decision variables
of the proposed model are described in Table 1. In this section, the formulation
of the model is given.</p>
      <p><sup>1</sup> https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/</p>
      <p>Runtime functions of the motivating example in Section 3.1:
• f<sub>A</sub>(S) = ln S<sup>2</sup> + 1
• f<sub>B</sub>(S) = ln S<sup>2</sup> + 2S
• f<sub>C</sub>(S) = ln S<sup>2</sup> − 4S − 10</p>
      <sec id="sec-4-1">
        <title>4.1. Required memory for running jobs</title>
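        <p>This constraint states that the memory limit of the kth node
must be respected. It must hold for all jobs and
computing nodes.</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\mathrm{mem}_k \geq f_{\mathrm{mem}}\Big(\sum_{i=1}^{n} s_i \cdot y_{ij} \cdot x_{jk}\Big), \quad \forall j \in \{1, 2, \dots, J\},\ \forall k \in \{1, 2, \dots, v\}</tex-math>
        </disp-formula>
        <p>For each J<sub>j</sub> and CN<sub>k</sub>, <inline-formula><tex-math>\sum_{i=1}^{n} s_i \cdot y_{ij} \cdot x_{jk}</tex-math></inline-formula> denotes the input size
of J<sub>j</sub> on CN<sub>k</sub>, and f<sub>mem</sub> estimates the memory required for J<sub>j</sub>.</p>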
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Assigning jobs to cluster nodes</title>
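        <p>Constraints (2) and (3) imply that non-empty jobs must be
assigned to exactly one node. Constraint (4) enforces that
empty jobs cannot be assigned to any node.</p>
        <disp-formula id="eq-2">
          <label>(2)</label>
          <tex-math>\frac{1}{n} \sum_{i=1}^{n} y_{ij} \leq \sum_{k=1}^{v} x_{jk}, \quad \forall j \in \{1, 2, \dots, J\}</tex-math>
        </disp-formula>
        <disp-formula id="eq-3">
          <label>(3)</label>
          <tex-math>\sum_{k=1}^{v} x_{jk} \leq 1, \quad \forall j \in \{1, 2, \dots, J\}</tex-math>
        </disp-formula>
        <disp-formula id="eq-4">
          <label>(4)</label>
          <tex-math>\sum_{k=1}^{v} x_{jk} \leq \sum_{i=1}^{n} y_{ij}, \quad \forall j \in \{1, 2, \dots, J\}</tex-math>
        </disp-formula>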
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Objective function</title>
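        <p>For constructing the objective function, the following points
should be highlighted:
• The objective function deals with the runtime of
bags of jobs on the computing nodes.
• When more than one job is assigned to a node, the
node loses a certain amount of time between
executing two consecutive jobs. For CN<sub>k</sub>, <inline-formula><tex-math>\big(\sum_{j=1}^{J} x_{jk} - 1\big) \cdot st_k</tex-math></inline-formula> denotes that time.
• For each CN<sub>k</sub>, <inline-formula><tex-math>\sum_{j=1}^{J} f_k\big(\sum_{i=1}^{n} y_{ij} \cdot s_i \cdot x_{jk}\big)</tex-math></inline-formula> calculates
the sum of the runtimes of the jobs assigned to
CN<sub>k</sub>. Here, <inline-formula><tex-math>\sum_{i=1}^{n} y_{ij} \cdot s_i \cdot x_{jk}</tex-math></inline-formula> is the input size of J<sub>j</sub>, and
f<sub>k</sub>() is an implicit function of that size.
• If t<sub>k</sub> (Eq. (5)) denotes the resulting total time
for CN<sub>k</sub>, then the total runtime (makespan) equals
max(t<sub>k</sub>) over 1 ≤ k ≤ v. The objective function aims
to minimize this time (Eq. (6)).</p>
        <p>The runtime of the jobs on each CN<sub>k</sub> is calculated by Eq. (5):</p>
        <disp-formula id="eq-5">
          <label>(5)</label>
          <tex-math>t_k = \Big(\sum_{j=1}^{J} x_{jk} - 1\Big) \cdot st_k + \sum_{j=1}^{J} f_k\Big(\sum_{i=1}^{n} y_{ij} \cdot s_i \cdot x_{jk}\Big)</tex-math>
        </disp-formula>
        <p>Therefore, the objective function of the model is as
follows:</p>
        <disp-formula id="eq-6">
          <label>(6)</label>
          <tex-math>\min \max_{1 \leq k \leq v} t_k</tex-math>
        </disp-formula>
        <p>This can be expressed as follows:</p>
        <disp-formula>
          <tex-math>\min \alpha \quad \text{subject to} \quad \alpha \geq t_k \ \ \forall k \in \{1, 2, \dots, v\}, \quad \alpha \geq 0</tex-math>
        </disp-formula>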
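        <p>To make the model concrete, the following Python sketch evaluates constraints (2) to (4) and the makespan of Eqs. (5) and (6) for given binary matrices. It is a minimal illustration, not the authors' implementation: the function names and the toy data are hypothetical, and the sketch evaluates f<sub>k</sub> only for jobs actually placed on CN<sub>k</sub> and lets an unused node contribute zero time, which is an interpretation rather than a literal transcription of Eq. (5).</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, List, Sequence

Matrix = List[List[int]]  # binary entries


def is_feasible(y: Matrix, x: Matrix) -> bool:
    """Constraints (2)-(4): a job is on exactly one node iff it holds a file.

    y[i][j] = 1 if genome file i belongs to job j,
    x[j][k] = 1 if job j is assigned to node k.
    """
    for j, row in enumerate(x):
        files_in_job = sum(y_i[j] for y_i in y)
        nodes_of_job = sum(row)
        if nodes_of_job > 1:  # violates (3)
            return False
        if (files_in_job > 0) != (nodes_of_job == 1):  # violates (2) or (4)
            return False
    return True


def node_time(k: int, y: Matrix, x: Matrix, sizes: Sequence[float],
              switch_time: Sequence[float],
              runtime: Sequence[Callable[[float], float]]) -> float:
    """Eq. (5) for node k, restricted to the jobs placed on that node."""
    jobs_on_k = [j for j, row in enumerate(x) if row[k] == 1]
    if not jobs_on_k:
        return 0.0  # ASSUMPTION: an unused node takes no time
    work = 0.0
    for j in jobs_on_k:
        input_size = sum(s * y_i[j] for s, y_i in zip(sizes, y))
        work += runtime[k](input_size)
    return (len(jobs_on_k) - 1) * switch_time[k] + work


def makespan(y, x, sizes, switch_time, runtime) -> float:
    """Eq. (6): the largest node time."""
    return max(node_time(k, y, x, sizes, switch_time, runtime)
               for k in range(len(runtime)))


# Hypothetical toy instance: 3 files, 2 jobs, 2 nodes.
sizes = [10, 15, 20]
y = [[1, 0], [1, 0], [0, 1]]   # files 0 and 1 in job 0, file 2 in job 1
x = [[1, 0], [1, 0]]           # both jobs on node 0
switch_time = [1.0, 1.0]
runtime = [lambda s: s, lambda s: 2 * s]  # made-up runtime functions
print(is_feasible(y, x), makespan(y, x, sizes, switch_time, runtime))
```
]]></preformat>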
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Genetic optimization for solving the proposed model</title>
      <p>Suppose there is a reference genome of a certain size with
genome files g1, g2, . . . , gn. We want to group the genomes
into a number of jobs and then assign the jobs to computing
nodes CN1, . . . ,CNv of a cluster infrastructure.</p>
      <p>The proposed model is a non-linear
binary mathematical model. Due to the special structure of the
constraints and the objective function of this model,
linearizing it leads to a binary linear model with a very large
number of constraints. Solving such a model is computationally very
expensive and may even be impossible.
In contrast to classical approaches,
genetic algorithms are well suited to
discrete models even if the model is nonlinear [23]. A
genetic approach can therefore be an efficient way
of solving the proposed model without any linearization. In
the following, we explain how to solve the presented
model using a Genetic Algorithm (GA).</p>
      <sec id="sec-5-1">
        <title>5.1. Chromosome structure</title>
        <p>GAs mimic natural evolution during optimization by modelling
genetic recombination and a fitness function. Hence, when
using a GA to solve a particular problem, the first concern
is the choice of a suitable
chromosome coding. Each chromosome encodes the
parameters that characterize a solution to the problem
[24]. We consider a population of individuals,
each individual being a potential solution to the problem
described by its chromosome. The initial population
is generated at random.</p>
        <p>In this study, job granularity involves determining the
assignment of genome files to jobs and the assignment of jobs
to cluster nodes. Thus, a solution (chromosome) is a
two-dimensional array in which the column indices indicate the genome
file number. The elements of the first row contain the job
numbers and the elements of the second row contain the
computing node numbers. The representation of a
chromosome in the GA implementation is shown in Fig. 3. As Fig.
3 shows, g<sub>1</sub> and g<sub>3</sub> are assigned to J<sub>1</sub>, while g<sub>2</sub> is assigned
to J<sub>4</sub>. Moreover, J<sub>1</sub> and J<sub>4</sub> are scheduled on CN<sub>3</sub> and CN<sub>2</sub>,
respectively.</p>
        <table-wrap>
          <label>Fig. 3</label>
          <caption>
            <p>Representation of a chromosome in the GA implementation.</p>
          </caption>
          <table>
            <tbody>
              <tr><td>Genome file number</td><td>1</td><td>2</td><td>3</td><td>...</td><td>n</td></tr>
              <tr><td>Job number</td><td>1</td><td>4</td><td>1</td><td>...</td><td></td></tr>
              <tr><td>Node number</td><td>3</td><td>2</td><td>3</td><td>...</td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The crossover operator passes chromosome
fragments of fit individuals on to subsequent generations.
In this study, the single-point crossover technique [23] is
used by the crossover operator to
produce new individuals. These new individuals are then
assessed for their potential to contribute to the next generation
of the population. An example is shown in Fig. 4. After
crossover, the first new individual is infeasible and cannot be added to the
next generation because J<sub>5</sub> has been assigned to two different
computing nodes.</p>
        <p>The mutation operator is a technique that replaces some
gene values with others to increase population diversity. We
use swap mutation [25] to explore new regions of the solution
space, where two positions on a chromosome are randomly
selected and swapped. After mutation, the potential of new
individuals to contribute to the next generation of the
population is evaluated.</p>
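        <p>The chromosome encoding and the two operators can be sketched as follows. This is a hedged illustration rather than the implementation used in the paper: the type alias Chromosome, the function names, and the example parents are hypothetical, and swap mutation is assumed to exchange whole columns (job and node together), which the text does not specify.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import List, Tuple

# (job row, node row); column i describes genome file i.
Chromosome = Tuple[List[int], List[int]]


def is_consistent(ch: Chromosome) -> bool:
    """A chromosome is infeasible if one job appears with two different nodes."""
    node_of_job = {}
    for job, node in zip(ch[0], ch[1]):
        if node_of_job.setdefault(job, node) != node:
            return False
    return True


def single_point_crossover(a: Chromosome, b: Chromosome,
                           point: int) -> Tuple[Chromosome, Chromosome]:
    """Swap the column tails of two parents after the cut point."""
    child1 = (a[0][:point] + b[0][point:], a[1][:point] + b[1][point:])
    child2 = (b[0][:point] + a[0][point:], b[1][:point] + a[1][point:])
    return child1, child2


def swap_mutation(ch: Chromosome, rng: random.Random) -> Chromosome:
    """Exchange two randomly chosen columns (ASSUMPTION: job and node move together)."""
    jobs, nodes = list(ch[0]), list(ch[1])
    i, j = rng.sample(range(len(jobs)), 2)
    jobs[i], jobs[j] = jobs[j], jobs[i]
    nodes[i], nodes[j] = nodes[j], nodes[i]
    return jobs, nodes


# Hypothetical parents over four genome files.
parent_a: Chromosome = ([1, 4, 1, 5], [3, 2, 3, 1])
parent_b: Chromosome = ([2, 2, 5, 5], [1, 1, 2, 2])
child1, child2 = single_point_crossover(parent_a, parent_b, point=3)
# child2 places job 5 on node 2 and on node 1, so it must be rejected.
print(is_consistent(child1), is_consistent(child2))
mutant = swap_mutation(parent_a, random.Random(0))
```
]]></preformat>
        <p>Swapping whole columns keeps every job on a single node, so under this assumption a mutation of a feasible chromosome stays feasible, whereas crossover offspring still need the consistency check.</p>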
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Job runtime estimation</title>
      <p>In the GA, each individual is assigned a
fitness score, which is given by the objective
function based on Eq. (5). Suppose the runtime of job j on
node CN<sub>k</sub> is equal to t<sub>j</sub>. The challenge here is that there is
no explicit function for calculating t<sub>j</sub>; in other words, how
long does it take to execute a job j of size S<sub>j</sub> on a
computing node CN<sub>k</sub>? Therefore, we use the following
methods to predict the runtime of jobs on computing nodes,
assuming that we have historical data from task execution
traces:
• Linear regression: Linear regression is a popular
and simple machine learning algorithm that
models the relationship between dependent and
independent variables by learning from the available
training data a linear expression of the
independent variables [26].
• Polynomial approximation: Any continuous function
on a closed interval can be approximated
arbitrarily closely by a suitable polynomial
[27]. Based on this result, a polynomial of
appropriate degree can be used to approximate the unknown
function using the given historical data. In practice,
lower-degree polynomials are preferred to avoid the
oscillation of higher-degree polynomials. Here we
only consider degrees one, two and three.
• Logarithm of a polynomial approximation
(log-pol): As shown in [28], the curves of runtime
as a function of input data size are logarithm-like. Since,
as mentioned above, polynomials
are a suitable means of estimation, we use a
combination of logarithm and polynomial as a new
approximation method.</p>
      <p>These methods assume that there is a relationship between
a task’s input size and its runtime, and they use the input size as
the independent variable. Thus, they can be used to predict
the runtime for any task input size. Similar to related work,
these methods use the size of the file on the hard disk as the
input to their prediction models. In a heterogeneous cluster,
the same task may have different runtimes on different
computing nodes. Therefore, a separate prediction
model is created for each computing node.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Experimental results</title>
      <p>We developed a read-mapping workflow<sup>2</sup> for metagenomic
data in the popular workflow engine Snakemake [29]. The
workflow was run on Allegro<sup>3</sup>, a cluster infrastructure. We
created the historical dataset from traces of the workflow
executions on this cluster.</p>
      <p>We selected three reference genomes of
different sizes (small, medium and large) as input data of the
workflow to perform the experiments. The specifications of
the input data are described in Table 2. We also considered
cluster sizes of 4, 8, 16, and 24 computing nodes in
the experiments.</p>
      <p><sup>2</sup> https://github.com/CRC-FONDA/A2-job-granularity/tree/main/MGHIBF</p>
      <p><sup>3</sup> https://www.mi.fu-berlin.de/w/Cluster/WebHome</p>
      <p>As previously stated, our approach considers the job
granularity and scheduling problems simultaneously for a bag
of tasks in a workflow, whereas related work addresses
only the scheduling problem. Related approaches assume
that there are a number of jobs, each with a certain size, and
they schedule these jobs on the cluster nodes so that the total
runtime is minimized.</p>
      <p>Most existing approaches use the file size on disk as the
input for their prediction or approximation models. A detailed
discussion of why the uncompressed input data size
should be used for compressed files is given in [30]. Accordingly, we use the
uncompressed file size to predict the runtime and the
memory usage of jobs.</p>
      <p>One of the main challenges in this problem is that the number
of jobs is not known in advance. The most obvious idea is
to set this number equal to the number of genome
files, i.e. n. This seems logical at first sight, since
allowing empty jobs makes it possible to represent any
possible grouping of genome files into jobs.
However, from a practical point of view, this is inappropriate and
infeasible in most cases: because of the large number
of genome files, the number of decision variables and
constraints increases significantly, making it impossible to solve
the problem in a reasonable time, and even if the problem
is solved, the propagation of computational errors leads to
incorrect and unreasonable results.</p>
      <p>[Figure: three panels comparing the prediction methods linear regression, polynomial, and log-pol; x-axis: prediction method.]</p>
      <p><sup>6</sup> https://scikit-learn.org/stable/modules/linear_model.html</p>
      <p><sup>7</sup> https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html</p>
      <p><sup>8</sup> https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html</p>
      <p>Considering that jobs are to be assigned to nodes,
another idea is to set the number of jobs to a multiple of
the number of nodes v; more precisely:</p>
      <disp-formula id="eq-8">
        <label>(8)</label>
        <tex-math>J = k v \quad \text{for some } k \in \mathbb{N}</tex-math>
      </disp-formula>
      <p>The coefficient k in Eq. (8) is increased interactively and
evolutionarily, starting from one: after solving the
problem for k=1 and calculating the makespan, we use
the resulting solution as an initial solution (an individual of
the initial population) for k=2. This procedure continues
until the difference between two consecutive makespan
values is negligible or insignificant from the decision maker’s
point of view. On the one hand, this process is compatible
with the evolutionary nature of the GA used to solve this
sequence of problems, because in each step, with the
solution of the previous step as the initial solution, the value
of the fitness function starts
from the value reached in the previous step. Therefore, the makespan
value in each step is better than or equal to that of the previous
step (i.e., it evolves). On the other hand, given the stopping
condition of the procedure, there is no need to solve problems
with a large number of jobs.</p>
      <p>From the experimental results presented in Tables 3 to 5,
the following points can be highlighted:
• As the number of jobs increases, the makespan and
the number of unused nodes decrease.
• As the number of nodes increases, the makespan
decreases.
• The procedure of increasing the number of jobs may
be stopped for two reasons: firstly, because increasing the
number of jobs no longer improves the makespan
(significantly); secondly, because the decision maker
decides to stop (especially if no
significant improvement in makespan is expected).
• As can be seen in the fourth row of Table 3, the
makespan does not improve as the number of jobs
increases from v to 2v. A still higher number
of jobs would evidently not improve the makespan either (such cases
are marked in bold in the tables), so we did
not compute them. Moreover, in the last row of
Table 5, increasing the number of jobs from 3v to 4v
only leads to a 1% decrease in the makespan, which is
not significant (such cases are marked in the tables
in bold and with an asterisk), so we stopped the
procedure. It should be noted that the decision maker (DM) may stop
the procedure when the number of jobs is 3v due to
the insignificant improvement in makespan and the
abandonment of an unused node.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and future work</title>
      <p>In this paper, we proposed an approach to the task/job granularity
problem for metagenomic DAWs in cluster infrastructures with
makespan minimization as the objective. The problem was first
formulated as a mathematical model, and the proposed
model was then solved using a GA. One of the main
challenges in this problem is that the number of jobs is not
known in advance. We overcame this challenge by setting
the number of jobs to a multiple of the number of computing
nodes. For each increase in the number of jobs, the makespan
is calculated. This procedure continues until the
difference between two successive makespan values is
negligible or insignificant from the decision maker’s point of view.
Experimental results showed that a desirable makespan value
can be obtained after a few steps of increasing the number
of jobs. Furthermore, the calculation of the makespan requires
a proper estimate of the task runtime, so we applied three
different estimation methods. Experimental results
showed that the polynomial approximation outperforms the other two.</p>
      <p>In the future, we aim to generalize the proposed model
so that it can be applied to other scientific domains. Since
the proposed approach does not schedule the whole workflow but
optimizes a single step of it, we also intend to integrate
it into a scheduling approach.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>Funded by the Deutsche Forschungsgemeinschaft (DFG,
German Research Foundation) as FONDA (Project 414984028,
SFB 1404).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohammadi</surname>
          </string-name>
          , L. PourKarimi, H. Pedram,
          <article-title>Integer linear programming-based multi-objective scheduling for scientific workflows in multi-cloud environments</article-title>
          ,
          <source>The Journal of Supercomputing</source>
          <volume>75</volume>
          (
          <year>2019</year>
          )
          <fpage>6683</fpage>
          -
          <lpage>6709</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Abdi</surname>
          </string-name>
          , L. PourKarimi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zargari</surname>
          </string-name>
          ,
          <article-title>Cost minimization for deadline-constrained bag-of-tasks applications in federated hybrid clouds</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>71</volume>
          (
          <year>2017</year>
          )
          <fpage>113</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Bader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Thamsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Leser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kao</surname>
          </string-name>
          ,
          <article-title>Lotaru: Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>150</volume>
          (
          <year>2024</year>
          )
          <fpage>171</fpage>
          -
          <lpage>185</lpage>
          . URL: https://doi.org/10.1016/j.future.2023.08.022. doi:10.1016/j.future.2023.08.022.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Orgerie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Casanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suter</surname>
          </string-name>
          ,
          <article-title>Accurately simulating energy consumption of I/O-intensive scientific workflows</article-title>
          ,
          <source>in: Computational Science - ICCS 2019: 19th International Conference, Faro, Portugal, June 12-14, 2019, Proceedings, Part I</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>138</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>R. F.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Casanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Orgerie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suter</surname>
          </string-name>
          ,
          <article-title>Characterizing, modeling, and accurately simulating power and energy consumption of I/O-intensive scientific workflows</article-title>
          ,
          <source>Journal of Computational Science</source>
          <volume>44</volume>
          (
          <year>2020</year>
          )
          <fpage>101157</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Maechling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mehringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kesselman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Callaghan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Okaya</surname>
          </string-name>
          ,
          <article-title>SCEC CyberShake workflows-automating probabilistic seismic hazard analysis calculations</article-title>
          ,
          <source>in: Workflows for e-Science</source>
          , Springer,
          <year>2007</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kintsakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Psomopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Mitkas</surname>
          </string-name>
          ,
          <article-title>Data-aware optimization of bioinformatics workflows in hybrid clouds</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>3</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chervenak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vahi</surname>
          </string-name>
          ,
          <article-title>Characterization of scientific workflows</article-title>
          ,
          <source>in: Workflows in Support of Large-Scale Science, 2008. WORKS 2008. Third Workshop on</source>
          , IEEE,
          <year>2008</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vairavanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Maheshwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fedak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wilde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ripeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Al-Kiswany</surname>
          </string-name>
          ,
          <article-title>The case for workflow-aware storage: An opportunity study</article-title>
          ,
          <source>Journal of Grid Computing</source>
          <volume>13</volume>
          (
          <year>2015</year>
          )
          <fpage>95</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <!-- Table 5: Obtained results for Bacteria-88GiB. -->
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Vairavanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Al-Kiswany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wilde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ripeanu</surname>
          </string-name>
          ,
          <article-title>A workflow-aware storage system: An opportunity study</article-title>
          ,
          <source>in: 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012)</source>
          , IEEE,
          <year>2012</year>
          , pp.
          <fpage>326</fpage>
          -
          <lpage>334</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Wozniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wilde</surname>
          </string-name>
          ,
          <article-title>Case studies in storage access by loosely coupled petascale applications</article-title>
          ,
          <source>in: Proceedings of the 4th Annual Workshop on Petascale Data Storage</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohammadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pedram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>PourKarimi</surname>
          </string-name>
          ,
          <article-title>Integer linear programming-based cost optimization for scheduling scientific workflows in multi-cloud environments</article-title>
          ,
          <source>The Journal of Supercomputing</source>
          <volume>74</volume>
          (
          <year>2018</year>
          )
          <fpage>4717</fpage>
          -
          <lpage>4745</lpage>
          . doi:10.1007/s11227-018-2465-8.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Topcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hariri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Performance-effective and low-complexity task scheduling for heterogeneous computing</article-title>
          ,
          <source>IEEE Transactions on Parallel and Distributed Systems</source>
          <volume>13</volume>
          (
          <year>2002</year>
          )
          <fpage>260</fpage>
          -
          <lpage>274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohammadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>PourKarimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Droop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>De Mecquenem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Leser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Reinert</surname>
          </string-name>
          ,
          <article-title>A mathematical programming approach for resource allocation of data analysis workflows on heterogeneous clusters</article-title>
          ,
          <source>The Journal of Supercomputing</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Dziok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Figiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Malawski</surname>
          </string-name>
          ,
          <article-title>Adaptive multi-level workflow scheduling with uncertain task estimates</article-title>
          ,
          <source>in: Parallel Processing and Applied Mathematics: 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part II</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Juve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Glatard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Desprez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tovar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Livny</surname>
          </string-name>
          ,
          <article-title>Toward fine-grained online task characteristics estimation in scientific workflows</article-title>
          ,
          <source>in: Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Juve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rynge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Livny</surname>
          </string-name>
          ,
          <article-title>Online task resource consumption prediction for scientific workflows</article-title>
          ,
          <source>Parallel Processing Letters</source>
          <volume>25</volume>
          (
          <year>2015</year>
          )
          <fpage>1541003</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Nadeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alghazzawi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mashat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fakeeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almalaise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hagras</surname>
          </string-name>
          ,
          <article-title>Modeling and predicting execution time of scientific workflows in the grid using radial basis function neural network</article-title>
          ,
          <source>Cluster Computing</source>
          <volume>20</volume>
          (
          <year>2017</year>
          )
          <fpage>2805</fpage>
          -
          <lpage>2819</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Sadjadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Figueroa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rangaswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. J.</given-names>
            <surname>Collazo-Mojica</surname>
          </string-name>
          ,
          <article-title>A modeling approach for estimating execution time of long-running scientific applications</article-title>
          ,
          <source>in: 2008 IEEE International Symposium on Parallel and Distributed Processing</source>
          , IEEE,
          <year>2008</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Hilman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Buyya</surname>
          </string-name>
          ,
          <article-title>Task runtime prediction in scientific workflows using an online incremental learning approach</article-title>
          ,
          <source>in: 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Kianfar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pockrandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Torkamandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Reinert</surname>
          </string-name>
          ,
          <article-title>Optimum Search Schemes for approximate string matching using bidirectional FM-index</article-title>
          ,
          <source>arXiv preprint arXiv:1711.02035</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Abdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ashjaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mubeen</surname>
          </string-name>
          ,
          <article-title>Cognitive and Time Predictable Task Scheduling in Edge-cloud Federation</article-title>
          ,
          <source>in: 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Katoch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Chauhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>A review on genetic algorithm: past, present, and future</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>80</volume>
          (
          <year>2021</year>
          )
          <fpage>8091</fpage>
          -
          <lpage>8126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Tremblay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sabourin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maupin</surname>
          </string-name>
          ,
          <article-title>Optimizing nearest neighbour in random subspaces using a multi-objective genetic algorithm</article-title>
          ,
          <source>in: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2004</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
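          <string-name>
            <given-names>I.</given-names>
            <surname>De Falco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Della Cioppa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tarantino</surname>
          </string-name>
          ,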
          <article-title>Mutation-based genetic algorithm: performance evaluation</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>1</volume>
          (
          <year>2002</year>
          )
          <fpage>285</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>T. M. H.</given-names>
            <surname>Hope</surname>
          </string-name>
          ,
          <article-title>Linear regression</article-title>
          ,
          <source>in: Machine Learning</source>
          , Elsevier,
          <year>2020</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>W.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <source>Principles of Real Analysis. Mathematics Series</source>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Valdes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stebliankin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <article-title>Large scale microbiome profiling in the cloud</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>35</volume>
          (
          <year>2019</year>
          )
          <fpage>i13</fpage>
          -
          <lpage>i22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Köster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rahmann</surname>
          </string-name>
          ,
          <article-title>Snakemake-a scalable bioinformatics workflow engine</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>28</volume>
          (
          <year>2012</year>
          )
          <fpage>2520</fpage>
          -
          <lpage>2522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Bader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Thamsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Will</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Leser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kao</surname>
          </string-name>
          ,
          <article-title>Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters</article-title>
          ,
          <source>arXiv preprint arXiv:2205.11181</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>