<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Somayeh</forename><surname>Mohammadi</surname></persName>
							<email>somayeh.mohammadi@fu-berlin.de</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="institution">Freie Universität</orgName>
								<address>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Latif</forename><surname>Pourkarimi</surname></persName>
							<email>l.pourkarimi@razi.ac.ir</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="institution">Razi University</orgName>
								<address>
									<settlement>Kermanshah</settlement>
									<country key="IR">Iran</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Manuel</forename><surname>Zschäbitz</surname></persName>
							<email>manuez42@zedat.fu-berlin.de</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="institution">Freie Universität</orgName>
								<address>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tristan</forename><surname>Aretz</surname></persName>
							<email>aret01@zedat.fu-berlin.de</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="institution">Freie Universität</orgName>
								<address>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ninon</forename><surname>De Mecquenem</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="institution">Humboldt-Universität zu Berlin</orgName>
								<address>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ulf</forename><surname>Leser</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="institution">Humboldt-Universität zu Berlin</orgName>
								<address>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Knut</forename><surname>Reinert</surname></persName>
							<email>reinert@fu-berlin.de</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="institution">Freie Universität</orgName>
								<address>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<address>
									<settlement>Paestum</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">56A5475066C9826575BAD036EC36FC38</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:18+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data Analysis Workflow</term>
					<term>Mathematical Programming</term>
					<term>Makespan Minimization</term>
					<term>Run Time Prediction</term>
					<term>Genetic Algorithm</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Data analysis workflows are popular for sequencing activities in large-scale and complex scientific processes. Scheduling approaches attempt to find an appropriate assignment of workflow tasks to the computing nodes for minimizing the makespan in heterogeneous cluster infrastructures. A common feature of these approaches is that they already know the structure of the workflow. However, for many workflows, a high degree of parallelization can be achieved by splitting the large input data of a single task into chunks and processing them independently. We call this problem task granularity, which involves finding an assignment of tasks to computing nodes and simultaneously optimizing the structure of a bag of tasks. Accordingly, this paper addresses the problem of task granularity for metagenomic workflows. To this end, we first formulated the problem as a mathematical model. We then solved the proposed model using the genetic algorithm. To overcome the challenge of not knowing the number of tasks, we adjusted the number of tasks as a factor of the number of computing nodes. The procedure of increasing the number of tasks is performed interactively and evolutionarily. Experimental results showed that a desirable makespan value can be achieved after a few steps of the increase.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Scientists in many domains, such as bioinformatics, remote sensing, and physics, use Data Analysis Workflows (DAWs) to sequence activities involved in large-scale and complex scientific processes [1]; <ref type="bibr">[2]</ref>. These DAWs are typically represented as a Directed Acyclic Graph (DAG), which consists of a set of tasks and some directed edges between the tasks; edges show data dependencies between tasks and the priority order of task execution.</p><p>Scientists often use heterogeneous cluster infrastructures to run their DAWs because of privacy and financial concerns. Heterogeneous clusters provide high-performance computing environments that enable efficient data analysis and the execution of large-scale DAWs in a reasonable amount of time <ref type="bibr">[3]</ref>. DAWs are often executed on large amounts of data, resulting in long runtimes that can exceed days or weeks <ref type="bibr">[4]</ref>; <ref type="bibr">[5]</ref>; <ref type="bibr">[6]</ref>. Thus, in such environments, the key objective is to schedule DAW tasks across computing resources in such a way that the total execution time, also known as the makespan, is minimized.</p><p>It is well known that a high degree of parallelization can be achieved in many DAWs by splitting the input data of individual tasks into chunks and processing them independently <ref type="bibr">[7]</ref>. For example, in metagenomic DAWs, the size of a reference genome as frequent input data can vary from several KB to hundreds of GB, and the reference genome typically contains thousands of genome files. The reference genome can be divided into different bins of genome files and they are processed by several independent tasks in parallel. The main challenge here is how to partition the input data; what should be the appropriate size of each chunk of input data, and how should each task be assigned to a computing node so that the makespan is minimized. We call this problem task granularity. In heterogeneous environments, this challenge is aggravated because the computing power of the existing computing nodes is different, so choosing the input size of each task to be executed on each of these computing nodes is a very effective means in optimizing the makespan. Since each task of the DAW is equivalent to a job for the cluster when a workflow is submitted for execution, the terms task and job are used interchangeably in this study.</p><p>In this paper, we propose a novel approach to task granularity for metagenomic DAWs in cluster infrastructures with makespan minimization. We first formulate the problem as a mathematical model. We solve then the proposed model using the Genetic Algorithm method. Since the calculation of makespan requires a proper estimation of tasks runtime, we apply three different methods for this estimation and also compare their accuracy.</p><p>The paper is organized as follows: Section 2 presents a review of the related work, and Section 3 illustrates the problem statement. The proposed mathematical model is introduced in Section 4. Section 5 discusses solving the proposed model using genetic algorithm. Job runtime estimation is presented in Section 6, and experimental results are presented in Section 7. Finally, Section 8 provides concluding remarks and plans for further studies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>In this section, we first discuss data access patterns used in scientific workflows. We then cover the scheduling of scientific workflows on heterogeneous clusters. Finally, we focus on methods for predicting the runtime of tasks, as these estimates are often used as input for scheduling approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Data access patterns used in scientific workflows</head><p>The data access patterns of workflow applications have been addressed by several studies [8]; [9]; [10]; <ref type="bibr">[11]</ref>. Accordingly, the most commonly used patterns in scientific workflows are as follows (Fig. <ref type="figure" target="#fig_0">1</ref>):</p><p>• Pipeline:This is the most basic and familiar pattern.</p><p>A set of computational tasks is chained in a sequence such that the output of a parent task is the input of its child task in the chain. Because of the line dependencies in a pipeline pattern, the execution of a task cannot begin until the execution of its parent has completed and it has received the data generated by its parent. • Scatter: An input data is divided into several chunks.</p><p>These chunks are distributed into multiple tasks (a bag of tasks), which can be executed simultaneously because there are no dependencies between them. • Gather: Multiple chunks of data are produced by multiple tasks. All of them are used as input data by a subsequent task. The later task may need to receive all the chunks and integrate them to start execution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Pipeline Scatter Gather</head><p>Data Task Obviously, the scatter pattern can be so effective in reducing makespan in a distributed environment such as a cluster and cloud because it provides parallel execution of tasks on computing nodes. The implementation of scatter is an NPhard problem (See Section 3.1), so the user is not able to do it manually to achieve an acceptable makespan. In this study, we propose an approach to address this problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Scheduling of scientific Workflows on heterogeneous clusters</head><p>Generally, workflow scheduling on heterogeneous infrastructures can be done in two ways, statically or dynamically <ref type="bibr">[12]</ref>; <ref type="bibr">[2]</ref>. Static scheduling assigns tasks to compute resources in advance, assuming that accurate information about workflow and infrastructure resources is available. Dynamic scheduling doesn't require such assumptions. On the one hand, many heuristic and meta-heuristic approaches have been provided for this problem. <ref type="bibr">HEFT [13]</ref> is considered to be the most famous of these. On the other hand, mathematical optimization approaches such as MILP <ref type="bibr">[14]</ref> have proposed optimal solutions to this problem and have also analyzed the problem more extensively. However, the presented state-of-the-art scheduling approaches have in common that they already know the structure of the DAG and find an optimal or suboptimal assignment of tasks to the computing nodes. By addressing the scattering problem of a bag, our proposed approach not only provides a suitable structure for the bag and consequently for the DAG, but also finds the best scheduling of tasks to computing nodes with the objective of minimizing the makespan.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Task runtime prediction</head><p>Most of state-of-the-art techniques for workflow scheduling rely on accurate predictions of tasks runtime <ref type="bibr">[15]</ref>. Therefore, the problem of predicting the runtime of scientific workflow tasks based on historical data has been studied extensively. In this research, the objective is to minimize the makespan. To compute the makespan value, an acceptable estimate of the runtime of jobs is required. Assuming that we have some historical data of execution traces of jobs on computing nodes, we use three different methods to predict the runtime of jobs (See Section 6).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Problem statement</head><p>This study addresses the problem of job/task granularity of scientific workflows in heterogeneous cluster environments with the aim of minimizing makespan. The case study is a metagenomic workflow where a reference genome containing a set of genome files is the main input data of the workflow. Building FM index (Full-text index in Minute space) over a reference genome (reference genome indexing) is a common and time-consuming task in metagenomic DAWs <ref type="bibr">[21]</ref> and such a task is often used by the bioinformatics workflow community, so it is a good case study for optimizing job/task granularity.</p><p>In general, the system model includes the following steps: the first step is to collect a historical dataset of job execution traces on different computing nodes of the cluster. If this dataset is not already available, it can be collected by data sampling <ref type="bibr">[3]</ref>. In the second step, a proper estimation of the job runtime and its memory consumption is performed using a prediction method. Then, by solving the proposed mathematical model, the optimal size of each chunk of input data for each job and also the assignment of jobs to computing nodes is obtained. Finally, the optimal job granularity and assignment is used to execute the DAW in the cluster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">A motivating example</head><p>Suppose there is a reference genome that contains five genome files of the following sizes: g 0 = 10, g 1 = 15, g 2 = 20, g 3 = 25, g 4 = 35. Moreover, assume that the cluster infrastructure has three computing nodes A, B and C. The runtime of a job with an input size of S on the computing nodes is calculated by the following functions:  In Fig. <ref type="figure" target="#fig_3">2</ref>, two different possible job building and assignments of the genome files is depicted. However, there are 3 5 = 243 different states as a solution among which the solution with the minimum makespan is the best.</p><formula xml:id="formula_0">• f A (S) = ln S 2 + 1 • f B (S) = ln S 2 + 2S • f C (S) = ln S 2 − 4S − 10</formula><p>In a real example, Archaea<ref type="foot" target="#foot_0">1</ref> has 488 genome files. Assuming that the number of cluster computing nodes is 10. Any approach based on complete enumeration and trial and error for assigning these genomes to computing nodes requires comparing a maximum number of 10 488 to different assignments. This approach is obviously impractical. It is noticeable that Archaea is a very small reference genome among the available reference genomes. Therefore, applying the above mentioned approach for solving the related assignment problem is not efficient or even applicable. The best approach to deal with this problem is to create a mathematical model for the problem and then apply available efficient algorithms for solving mathematical optimization models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">The proposed mathematical model</head><p>In a mathematical model, the objective function, decision variables and problem constraints are expressed in mathematical expressions. These models provide a deep insight into the structure of the problem [12]; <ref type="bibr">[22]</ref>. They are therefore suitable not only for solving the problem using classical methods, but also for solving the problem using heuristic or meta-heuristic methods.</p><p>The input data for the mathematical model are described in Table <ref type="table">1</ref>. Also, this table explains the decision variables existing in the proposed model. In this section, the formulation of the problem constraints and the objective function of the proposed model are presented in detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Required memory for running jobs</head><p>This constraint states that the memory limitation of kth node must be met. This constraint must be done for all jobs and computing nodes.</p><formula xml:id="formula_1">mem k ≥ f mem ( n ∑ i=1 S i • y i j • x jk ) ∀ j ∈ {1, 2, ..., J}, ∀k ∈ {1, 2, ..., v}<label>(1)</label></formula><p>For each J j and CN k , ∑ n i=1 S i • y i j • x jk denotes the input size of J j on CN k and f mem estimates the memory required for J j .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Assigning jobs to cluster nodes</head><p>Constraints (2) and (3) imply that non-empty jobs must be assigned to exactly one node. Constrain (4) enforces that empty jobs cannot be assigned to any node.</p><formula xml:id="formula_2">1 n n ∑ i=1 y i j ≤ v ∑ k=1 x jk ∀ j ∈ {1, 2, ..., J}<label>(2)</label></formula><formula xml:id="formula_3">v ∑ k=1 x jk ≤ 1 ∀ j ∈ {1, 2, ..., J}<label>(3)</label></formula><formula xml:id="formula_4">v ∑ k=1 x jk ≤ n ∑ i=1 y i j ∀ j ∈ {1, 2, ..., J}<label>(4)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Objective function</head><p>For constructing the objective function the following points should be highlighted:</p><p>• The objective function deals with the runtime of some bags of jobs on some computing nodes.</p><p>• If t k (Eq. ( <ref type="formula">5</ref>))denotes for the above mentioned time for CN k then the total runtime (makespan) equals to max(t k ) 1 ≤ k ≤ v . The objective function aims to minimize this time (Eq. ( <ref type="formula">6</ref>)).</p><p>• When more than one task is assigned to a node, the node wastes a certain amount of time between executing two jobs. ∑ J j=1 (x jk − 1) denotes that time.</p><p>• For each CN k , ∑ J j=1 f k (∑ n i=1 y i j • s i • x jk ) calculates the summation of the runtime of jobs assigned to CN k . ∑ n i=1 y i j • s i • x jk is the input size of J j , in which, f k () denotes an implicit function of that.</p><p>Runtime of jobs on each CN k is calculated by Eq.5.</p><formula xml:id="formula_5">t k = ( J ∑ j=1 x jk − 1) • st k + ( J ∑ j=1 f k ( n ∑ i=1 y i j • s i • x jk )) (5)</formula><p>Therefore, the objective function of the model is as follows: min(max</p><formula xml:id="formula_6">(t k )) 1 ≤ k ≤ v (6)</formula><p>This can be expressed as follows:</p><p>min α subject to:</p><formula xml:id="formula_7">α ≥ t k ∀k ∈ {1, 2, ..., v} α ≥ 0 Table 1</formula><p>Models parameters/variables and their description.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Notation of parameters Description</head><p>CN = {node1, node2, ..., node v } Cluster node set v</p><p>The number of nodes of the cluster CN k kth node of the cluster mem k</p><p>Accessible memory size of CN k for running tasks st k Switch time between two jobs for CN k Gen = {g1, g2, ..., g n }</p><p>The set of genomes of the reference genome n</p><p>The number of genomes of the reference genome g i i th genome of the reference genome S i Size of g i Job = {J1, J2, ..., J J }</p><p>The set of jobs J j j th job J</p><p>The number of jobs</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Notation of decision variables Description</head><p>x jk 1 iff J j is assigned to CN k , otherwise 0 y i j 1 iff g i is binned to J j , otherwise 0</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Genetic optimization for solving the proposed model</head><p>Suppose there is a reference genome of a certain size with genome files g 1 , g 2 , . . . , g n . We want to group the genomes into a number of jobs and then assign the jobs to computing nodes CN 1 , . . . ,CN v of a cluster infrastructure.</p><p>It can be seen that the proposed model is a non-linear binary mathematical model. Due to the special structure of the constraints and the objective function of this model, linearizing it leads to a binary linear model with a significantly large number of constraints. Solving this model is very time consuming in terms of computation (it may even be impossible). On the other hand, in contrast to classic approaches, using genetic algorithms is a very powerful approach for treating discret models even if the model is nonlinear <ref type="bibr">[23]</ref>. Based on this fact, using genetic approach can be an efficient approach for solving the proposed model without any linearization. In the following, we explain how to implement the presented model using a Genetic Algorithm (GA).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Chromosome structure</head><p>GAs mimic optimization during optimization by modelling genetic recombination and a fitness function. Hence, when using GA to solve a particular problem, the first concern to be addressed revolves around the determination of a suitable chromosome coding. Each chromosome represents the different parameters that characterize a solution to the problem.</p><p>[24]. In the solution, we consider a population of individuals, each individual being a potential solution to the problem described by the individual chromosome. The initial population is generated at random.</p><p>In this study, job granularity involves determining the assignment of genome files to jobs and the assignment of jobs to cluster nodes. Thus, a solution (chromosome) is a twodimensional array in which the indices indicate the genome file number. The elements of the first row contain the job numbers and the elements of the second row contain the computing node numbers. The representation of a chromosome in the GA implementation is shown in Fig. <ref type="figure" target="#fig_6">3</ref>. As Fig. <ref type="figure" target="#fig_6">3</ref> shows g 1 and g 3 are assigned to J 1 while g 2 is assigned to J 4 . Moreover, J 1 and J 4 are scheduled to CN 3 and CN 2 , respectively.   </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">GA operators</head><p>The crossover operator can help to inherit some chromosome fragments of excellent individuals to subsequent generations. In this study, the single-point crossover technique [23] is adapted during the performing of the crossover operator to produce new individuals. These new individuals are then assessed for their potential to contribute to the next generation of the population. An example is shown in Fig. <ref type="figure" target="#fig_7">4</ref>. After crossover, the first new individual is not feasible to add to the next generation because J 5 has been assigned to two different computing nodes. The mutation operator is a technique that replaces some gene values with others to increase population diversity. We use swap mutation <ref type="bibr">[25]</ref> to explore new regions of the solution space, where two positions on a chromosome are randomly selected and swapped. After mutation, the potential of new individuals to contribute to the next generation of the population is evaluated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">JOB RUNTIME ESTIMATION</head><p>In the GA algorithm, each individual should be assigned a value of the fitness score, which is shown in the objective function defined in Eq.5. Suppose the runtime of job j on node CN k is equal to t j . The challenge here is that there is no explicit function for calculating t j ; in other words, how long does it take to execute a job j of the size S j on a computing node CN k ? Therefore, we use the following different methods to predict the runtime of jobs on computing nodes, assuming that we have historical data from task execution traces:</p><p>• Linear Regression: Linear regression is a popular and simple machine learning algorithm that models the relationship between dependent and independent variables by analyzing and learning from current training results using a linear expression of independent variables <ref type="bibr">[26]</ref>. • Polynomial approximation: According to mathematical theorems, any continuous function can be optimally approximated by a suitable polynomial <ref type="bibr">[27]</ref>. Based on this theory, a polynomial of appropriate degree can be used to approximate the unknown function using the given historical data. In practice, lower degree polynomials are considered to avoid the socialisation of higher degree polynomials. Here we only consider degrees one, two and three. • Logarithmic of a polynomial approximation (logpol): As shown in <ref type="bibr">[28]</ref>, the curves of runtime according to data input size are logarithmic like and on the other hand, as mentioned above, the polynomials are a suitable method for estimation, so we use a combination of logarithmic and polynomial as a new approximation method.</p><p>These methods assume that there is a relationship between a task's input size and its runtime, using the input size as the independent variable. Thus, they can be used to predict task runtime for any task input sizes. Similar to related work, these methods use the size of the file on the hard disk as an input to their prediction models. In a heterogeneous cluster, a same task may have different run times on different computing nodes. Therefore, the methods create their prediction models for each computing node.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experimental results</head><p>We developed a read-mapping workflow<ref type="foot" target="#foot_1">2</ref> for metagenomic data in the popular workflow engine Snakemake <ref type="bibr">[29]</ref>. The workflow was run on Allegro<ref type="foot" target="#foot_2">3</ref> , a cluster infrastructure. We created a historical dataset from traces of the workflow execution on the cluster as a historical execution trace.</p><p>We selected and used three reference genomes with different sizes (small, medium and large) as input data of the workflow to perform the experiments. The specifications of the input data are described in Table <ref type="table" target="#tab_0">2</ref>. We also looked into different cluster sizes of 4, 8, 16, and 24 computing nodes in the experiments.</p><p>As previously stated, our approach consider the job granularity and scheduling problems simultaneously for a bag of task in a workflow while related works have addressed only the scheduling problem; Indeed, they have assumed that there are a number of jobs, each with a certain size, and they schedule these jobs to the cluster nodes so that the total runtime is minimized.</p><p>Most existing approaches use the file size on disk as the input for their predictions or approximate models. In [30] is a detailed discussion of why uncompressed input data size for compressed files should be used. Accordingly, we use the uncompressed file size to predict the runtime and the used memory of jobs. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Accuracy comparison of job runtime estimation methods</head><p>We compare the accuracy of the methods used to estimate job runtime (See Section 6). The linear regression was implemented using the sklearn.linear_model<ref type="foot" target="#foot_3">6</ref> library. The polynomial and log-pol have been implemented using polyfit() from the Numpy<ref type="foot" target="#foot_4">7</ref> library and Curve_fit() from the Scipy.optimize<ref type="foot" target="#foot_5">8</ref> library, respectively. For comparison we use the Mean Absolute Percentage Error (MAPE). This metric is calculated using the Eq. 7 where A i is the actual value, F i is the predicted value and n is the number of fitted points.</p><formula xml:id="formula_8">MAPE = 1 n n ∑ i=1 | A i − F i A i |<label>(7)</label></formula><p>Figs. 5-7 show results for each input data separately with different cluster sizes. As shown in these figures, the polynomial method gives a more accurate estimate of the values of job run times than log-pol. Log-pol also outperforms linear regression.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Changing the number of jobs to improve makespan</head><p>One of the main challenges in this problem is that the number of jobs is not known in advance. The most obvious idea is to consider this number as equal to the number of genome files, i.e. n. This seems logical at first sight, since the possibility to consider empty jobs allows to potentially find any possible clustering for genome files in the form of jobs. However, from a practical point of view, this is inappropriate and impossible in most cases, because due to the large number of genome files, the number of decision variables and constraints increases significantly, making it impossible to solve</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusion and Future works</head><p>In this paper, an approach to the task/job granularity problem for metagenomic DAWs in cluster infrastructures with makespan minimization was proposed. The problem was first formulated as a mathematical model and then the proposed model was solved using the GA method. One of the main challenges in this problem is that the number of jobs is not known in advance. We overcame this challenge by adjusting the number of jobs as a factor of the number of computing nodes. For each increase in the number of jobs, the makespan is calculated. This procedure continues and evolves until the distance between two successive makespan values is negligible or insignificant from the decision maker's point of view.</p><p>Experimental results showed that a desirable makespan value can be obtained after a few steps of increasing the number of jobs. Furthermore, the calculation of makespan requires a proper estimation of the task runtime, so we applied three different methods for this estimation. Experimental results showed that the polynomial approximation outperforms. the future, we aim to generalize our proposed model so that it can be applied to other scientific domains. Since the proposed approach does not schedule the workflow, but optimizes a single step of the workflow, we intend to integrate it into a scheduling approach in the future work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Common data access patters in scientific workflows.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Recent research has used machine learning techniques to address this issue [16]; [17]; [18]; [19]. They build their initial models on historical data before the actual workflow execution. [20]; [18] use neural network methods, which are known to require large training data sets to perform well, while [3] employs a Bayesian linear regression model, which can work with few training points and provides uncertainty estimates for its predictions. Most existing approaches use the size of the file on the hard drive as input to their prediction models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>Execution time on Node-A= 7.80 Execution time on Node-B= 6.08 Execution time on Node-C= 10.Execution time on Node-A= 6.80 Execution time on Node-B= 5.42 Execution time on Node-C= 8.11 Makespan=8</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Two possible job granularities and assignments with different makespan values.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: A chromosome encoding.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: An example of crossover.</figDesc><graphic coords="5,79.63,65.61,198.43,113.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Reference genomes specifications.</figDesc><table><row><cell>Reference Genome</cell><cell>Data Size</cell><cell># Genome files</cell></row><row><cell>Archaea</cell><cell>1.4 GiB</cell><cell>488</cell></row><row><cell>Bacteria_1 4</cell><cell>30 GiB</cell><cell>7167</cell></row><row><cell>Bacteria_2 5</cell><cell>88 GiB</cell><cell>22185</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/CRC-FONDA/A2-job-granularity/tree/main/MG-HIBF</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">3 https://www.mi.fu-berlin.de/w/Cluster/WebHome</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">https://scikit-learn.org/stable/modules/linear_model.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize .curve_fit.html</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as FONDA (Project 414984028, SFB 1404).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Polynomial</head><p>the problem in a reasonable time, and even if the problem is solved, the propagation of computational errors leads to incorrect and unreasonable results.</p><p>Considering that jobs are supposed to be assigned to nodes, another idea is to consider the number of jobs as a factor of the number of nodes, i.e. v; more precisely:</p><p>The coefficient k in the Eq. 8 changes interactively and evolutionarily from one to higher; that is, after solving the problem for k=1 and calculating the makespan, we consider the resulting solution as an initial solution (an individual of the initial population) for k=2. This procedure continues and evolves until the distance between two consecutive makespan values is negligible or insignificant from the decision maker's point of view. On the one hand, this process is compatible with the evolutionary nature of the GA used to solve the above sequential problems, because in each step, with the solution of the previous step as the initial solution, the value of the fitness function in the current step starts to improve from the value of the previous step. Therefore, the makespan value in each step is better than or equal to the previous step (i.e., it evolves). On the other hand, given the stopping condition of the procedure, there is no need to solve problems with large number of jobs.</p><p>From the experimental results presented in Tables <ref type="table">3 to 5</ref>, the following points can be highlighted:</p><p>• As the number of jobs increases, the makespan and the number of unused nodes decrease. • As the number of nodes increases, the makespan decreases. • The procedure of increasing the number of jobs may be stopped for two reasons: Firstly, increasing the number of jobs does not improve the makespan (significantly). Secondly, increasing the number of jobs may be stopped by decision maker (especially if no significant improvement in makespan is expected). • As can be seen in the fourth row of Table <ref type="table">3</ref>, the makespan does not improve as the number of jobs increases from v to 2v. Obviously, a higher number of jobs does not improve the makespan (such cases are marked in bold in the tables). Thus, we have </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Obtained results for Archaea. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Obtained results for Bacteria-30GiB.  <ref type="table">5</ref>, increasing the number of jobs from 3v to 4v only leads to 1% decrease in the makespan, which is not significant (such cases are marked in the tables in bold and with an asterisk). So we stopped the procedure. It should be noted that the DM may stop the procedure when the number of jobs is 3v due to the insignificant improvement in makespan and the abandonment of an unused node. </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Integer linear programming-based multi-objective scheduling for scientific workflows in multi-cloud environments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mohammadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pourkarimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pedram</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Supercomputing</title>
		<imprint>
			<biblScope unit="volume">75</biblScope>
			<biblScope unit="page" from="6683" to="6709" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Cost minimization for deadline-constrained bag-of-tasks applications in federated hybrid clouds</title>
		<author>
			<persName><forename type="first">S</forename><surname>Abdi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pourkarimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ahmadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zargari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">71</biblScope>
			<biblScope unit="page" from="113" to="128" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Lotaru : Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bader</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Thamsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Leser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kao</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.future.2023.08.022</idno>
		<ptr target="https://doi.org/10.1016/j.future.2023.08.022.doi:10.1016/j.future.2023.08.022" />
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">150</biblScope>
			<biblScope unit="page" from="171" to="185" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Accurately simulating energy consumption of I/O-intensive scientific workflows</title>
		<author>
			<persName><forename type="first">R</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-C</forename><surname>Orgerie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Casanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tanaka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Deelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Suter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computational Science-ICCS 2019: 19th International Conference</title>
				<meeting><address><addrLine>Faro, Portugal</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">June 12-14, 2019. 2019</date>
			<biblScope unit="page" from="138" to="152" />
		</imprint>
	</monogr>
	<note>Proceedings, Part I 19</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Characterizing, modeling, and accurately simulating power and energy consumption of i/o-intensive scientific workflows</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">F</forename><surname>Da Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Casanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-C</forename><surname>Orgerie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tanaka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Deelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Suter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of computational science</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page">101157</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">SCEC CyberShake workflows-automating probabilistic seismic hazard analysis calculations</title>
		<author>
			<persName><forename type="first">P</forename><surname>Maechling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Deelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mehringer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kesselman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Callaghan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Okaya</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workflows for e-Science</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="143" to="163" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Dataaware optimization of bioinformatics workflows in hybrid clouds</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Kintsakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">E</forename><surname>Psomopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Mitkas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Big Data</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1" to="26" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Characterization of scientific workflows</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bharathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chervenak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Deelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Vahi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workflows in Support of Large-Scale Science</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2008">2008. 2008</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
	<note>Third Workshop on</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The case for workflow-aware storage: An opportunity study</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B</forename><surname>Costa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Vairavanathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Maheshwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fedak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Katz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wilde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ripeanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Al-Kiswany</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Grid Computing</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="95" to="113" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A workflow-aware storage system: An opportunity study</title>
		<author>
			<persName><forename type="first">E</forename><surname>Vairavanathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Al-Kiswany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B</forename><surname>Costa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Katz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wilde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ripeanu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2012">2012. 2012</date>
			<biblScope unit="page" from="326" to="334" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Case studies in storage access by loosely coupled petascale applications</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Wozniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wilde</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th Annual Workshop on Petascale Data Storage</title>
				<meeting>the 4th Annual Workshop on Petascale Data Storage</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="16" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Integer linear programming-based cost optimization for scheduling scientific workflows in multi-cloud environments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mohammadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pedram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pourkarimi</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11227-018-2465-8</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Supercomputing</title>
		<imprint>
			<biblScope unit="volume">74</biblScope>
			<biblScope unit="page" from="4717" to="4745" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Performance-effective and low-complexity task scheduling for heterogeneous computing</title>
		<author>
			<persName><forename type="first">H</forename><surname>Topcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hariri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-Y</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE transactions on parallel and distributed systems</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="260" to="274" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A mathematical programming approach for resource allocation of data analysis workflows on heterogeneous clusters</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mohammadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pourkarimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Droop</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>De Mecquenem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Leser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Reinert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Supercomputing</title>
		<imprint>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Adaptive multi-level workflow scheduling with uncertain task estimates</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dziok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Figiela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Malawski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Parallel Processing and Applied Mathematics: 11th International Conference, PPAM 2015</title>
				<meeting><address><addrLine>Krakow, Poland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">September 6-9, 2015. 2016</date>
			<biblScope unit="page" from="90" to="100" />
		</imprint>
	</monogr>
	<note>Revised Selected Papers, Part II</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Toward fine-grained online task characteristics estimation in scientific workflows</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">F</forename><surname>Da Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Juve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Deelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Glatard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Desprez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Thain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tovar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Livny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th workshop on workflows in support of large-scale science</title>
				<meeting>the 8th workshop on workflows in support of large-scale science</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="58" to="67" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Online task resource consumption prediction for scientific workflows</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">F</forename><surname>Da Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Juve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rynge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Deelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Livny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Parallel Processing Letters</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page">1541003</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Modeling and predicting execution time of scientific workflows in the grid using radial basis function neural network</title>
		<author>
			<persName><forename type="first">F</forename><surname>Nadeem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Alghazzawi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mashat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Fakeeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almalaise</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hagras</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cluster Computing</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="2805" to="2819" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A modeling approach for estimating execution time of longrunning scientific applications</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Sadjadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shimizu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Figueroa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rangaswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Delgado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Duran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">J</forename><surname>Collazo-Mojica</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE international symposium on parallel and distributed processing</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2008">2008. 2008</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Task runtime prediction in scientific workflows using an online incremental learning approach</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Hilman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Buyya</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC), IEEE</title>
				<imprint>
			<date type="published" when="2018">2018. 2018</date>
			<biblScope unit="page" from="93" to="102" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Kianfar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Pockrandt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Torkamandi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Reinert</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.02035</idno>
		<title level="m">Optimum Search Schemes for approximate string matching using bidirectional FM-index</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Cognitive and Time Predictable Task Scheduling in Edge-cloud Federation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Abdi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ashjaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mubeen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), IEEE</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A review on genetic algorithm: past, present, and future</title>
		<author>
			<persName><forename type="first">S</forename><surname>Katoch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Chauhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia tools and applications</title>
		<imprint>
			<biblScope unit="volume">80</biblScope>
			<biblScope unit="page" from="8091" to="8126" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Optimizing nearest neighbour in random subspaces using a multi-objective genetic algorithm</title>
		<author>
			<persName><forename type="first">G</forename><surname>Tremblay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sabourin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Maupin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th International Conference on Pattern Recognition</title>
				<meeting>the 17th International Conference on Pattern Recognition</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2004">2004. 2004</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="208" to="211" />
		</imprint>
	</monogr>
	<note>ICPR 2004</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Mutation-based genetic algorithm: performance evaluation</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">De</forename><surname>Falco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Della Cioppa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tarantino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Soft Computing</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="285" to="299" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Linear regression</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M H</forename><surname>Hope</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning</title>
				<imprint>
			<publisher>Elsevier</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="67" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Principles of Real Analysis</title>
		<author>
			<persName><forename type="first">W</forename><surname>Rudin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematics Series</title>
		<imprint>
			<date type="published" when="1976">1976</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Large scale microbiome profiling in the cloud</title>
		<author>
			<persName><forename type="first">C</forename><surname>Valdes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stebliankin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Narasimhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="13" to="22" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Snakemake-a scalable bioinformatics workflow engine</title>
		<author>
			<persName><forename type="first">J</forename><surname>Köster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rahmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="2520" to="2522" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Bader</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Thamsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Will</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Leser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.11181</idno>
		<title level="m">Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
